# Profiling and optimizing code execution

__Task author: Blokhin N.V. (NVBlokhin@fa.ru)__

Materials:
* Makrushin S.V. "Code execution optimization, vectorization, Numba"
* IPython Cookbook, Second Edition (2018), глава 4
* https://ipython-books.github.io/43-profiling-your-code-line-by-line-with-line_profiler/

## Problems to work through together

1. Generate an array `A` of `N = 1 million` random integers in the range from 0 to 1000 inclusive. Let `B[i] = A[i] + 100`. Compute the mean of array `B`.

2. Create a table with 2 million rows and 4 columns filled with random numbers. Add a column `key` whose values are drawn from the set of English letters. Select from the table the subset of rows whose `key` value is one of the first 5 English letters.
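For the first problem, a minimal sketch of the vectorized approach might look like the following. The generator seed and the small pure-Python baseline are assumptions added for illustration; they are not part of the task statement.

```python
import numpy as np

N = 1_000_000
rng = np.random.default_rng(seed=0)  # the seed is an arbitrary choice for reproducibility

# integers() excludes the upper bound, so 1001 makes the range 0..1000 inclusive
A = rng.integers(0, 1001, size=N)

# Vectorized: one array operation instead of a Python-level loop
B = A + 100
mean_vectorized = B.mean()

# Pure-Python baseline for comparison (slow, so computed only on a small prefix)
prefix = A[:1000]
mean_loop = sum(int(a) + 100 for a in prefix) / len(prefix)

print(mean_vectorized)
```

Timing both variants on the full array (for example with `%timeit` in a notebook) is a natural next step; the vectorized version is expected to be much faster.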

In [None]:
import numpy as np
import pandas as pd
import string

N = 2_000_000
df = pd.DataFrame(np.random.randn(N, 4), columns=[f"col{i}" for i in range(4)])
df["key"] = np.random.choice(list(string.ascii_lowercase), N, replace=True)
df.head(2)
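The cell above builds the table but stops before the selection step. One possible way to pick the rows whose `key` is among the first five letters is a vectorized membership test with `Series.isin`. The sketch below rebuilds a smaller table (100,000 rows, an assumption made to keep it quick) so that it runs on its own.

```python
import string

import numpy as np
import pandas as pd

N = 100_000  # smaller than the 2 million rows in the task, to keep the sketch quick
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((N, 4)), columns=[f"col{i}" for i in range(4)])
df["key"] = rng.choice(list(string.ascii_lowercase), size=N)

first_five = list(string.ascii_lowercase[:5])  # ['a', 'b', 'c', 'd', 'e']

# Boolean mask built by a single vectorized membership test
subset = df[df["key"].isin(first_five)]
print(subset["key"].value_counts())
```

A chain of `==` comparisons joined with `|` would give the same rows; `isin` is usually easier to read and to time.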

## Lab 1

__These problems are not meant to be solved with Python loops or generators when working with the `numpy` and `pandas` packages, unless the problem statement says otherwise. Solutions that use explicit loops to process `numpy` arrays or `pandas` structures (without prior agreement with the instructor) may be deemed incorrect and not accepted.__

The files `recipes_sample.csv` and `reviews_sample.csv` contain information about recipes and reviews of those recipes, respectively. Load the data from the files as `pd.DataFrame` objects named `recipes` and `reviews`. Make sure the index column(s) are read correctly. Convert the columns to appropriate types.
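As a hedged illustration of the loading step, the sketch below reads a tiny in-memory CSV instead of the real files. The three rows are a hypothetical stand-in, and treating the first unnamed column as a saved row index is an assumption about the file layout.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for recipes_sample.csv; the first, unnamed
# column mimics a row index that was saved together with the data.
csv_text = """,name,id,minutes,submitted,n_steps
0,black bean soup,44123,90,2002-10-25,
1,yogurt popsicles,67664,10,2003-07-26,
2,beef fondue sauces,84797,25,2004-02-23,4.0
"""

recipes_demo = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                # consume the unnamed column instead of keeping "Unnamed: 0"
    parse_dates=["submitted"],  # parse dates while reading
).set_index("id")

# Step counts are whole numbers with gaps; the nullable Int64 dtype keeps them
# integer while still allowing missing values.
recipes_demo["n_steps"] = recipes_demo["n_steps"].astype("Int64")

print(recipes_demo.dtypes)
```

Parsing dates at read time avoids a separate `pd.to_datetime` call later, although both routes lead to the same `datetime64[ns]` column.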

In [7]:
import pandas as pd
import numpy as np

In [69]:
recipes = pd.read_csv('./data/recipes_sample.csv', delimiter=',')
reviews = pd.read_csv('./data/reviews_sample.csv', delimiter=',')

In [70]:
recipes.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients
0,george s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0
1,healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,
2,i can t believe it s spinach,38798,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0
3,italian gut busters,35173,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,
4,love is in the air beef fondue sauces,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,


In [79]:
recipes.set_index('id',inplace=True)
recipes.head()


id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients
44123,george s at the cove black bean soup,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0
67664,healthy for them yogurt popsicles,10,91970,2003-07-26,,my children and their friends ask for my homem...,
38798,i can t believe it s spinach,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0
35173,italian gut busters,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,
84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,


In [80]:
recipes.dtypes

name                      object
minutes                    int64
contributor_id             int64
submitted         datetime64[ns]
n_steps                  float64
description               object
n_ingredients            float64
dtype: object

In [81]:
recipes['submitted'].isna().unique()

array([False])

In [82]:
recipes['submitted'] = pd.to_datetime(recipes['submitted'])

In [84]:
recipes.head()

id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients
44123,george s at the cove black bean soup,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0
67664,healthy for them yogurt popsicles,10,91970,2003-07-26,,my children and their friends ask for my homem...,
38798,i can t believe it s spinach,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0
35173,italian gut busters,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,
84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,


In [87]:
recipes.dtypes

name                      object
minutes                    int64
contributor_id             int64
submitted         datetime64[ns]
n_steps                  float64
description               object
n_ingredients            float64
dtype: object

In [89]:
recipes.describe()

Unnamed: 0,minutes,contributor_id,submitted,n_steps,n_ingredients
count,30000.0,30000.0,30000,18810.0,21120.0
mean,123.358133,5635901.0,2006-11-13 01:10:30.720000,9.805582,9.008286
min,0.0,1530.0,1999-08-06 00:00:00,1.0,1.0
25%,20.0,55964.5,2004-09-13 00:00:00,6.0,6.0
50%,40.0,169969.0,2007-01-26 00:00:00,9.0,9.0
75%,65.0,396078.0,2008-10-28 00:00:00,12.0,11.0
max,129615.0,2002248000.0,2018-08-15 00:00:00,88.0,34.0
std,1660.876602,100737300.0,,5.944155,3.715213


In [97]:
# equivalent to pd.DataFrame.info(recipes)
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30000 entries, 44123 to 298512
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   name            30000 non-null  object        
 1   minutes         30000 non-null  int64         
 2   contributor_id  30000 non-null  int64         
 3   submitted       30000 non-null  datetime64[ns]
 4   n_steps         18810 non-null  float64       
 5   description     29377 non-null  object        
 6   n_ingredients   21120 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(2)
memory usage: 1.8+ MB


In [99]:
reviews.head()

Unnamed: 0.1,Unnamed: 0,user_id,recipe_id,date,rating,review
0,370476,21752,57993,2003-05-01,5,Last week whole sides of frozen salmon fillet ...
1,624300,431813,142201,2007-09-16,5,So simple and so tasty! I used a yellow capsi...
2,187037,400708,252013,2008-01-10,4,"Very nice breakfast HH, easy to make and yummy..."
3,706134,2001852463,404716,2017-12-11,5,These are a favorite for the holidays and so e...
4,312179,95810,129396,2008-03-14,5,Excellent soup! The tomato flavor is just gre...


In [103]:
d = {'a':1,'b':2}
d['a']

1

In [105]:
reviews.rename(columns={'Unnamed: 0':'id'},inplace=True)
reviews

Unnamed: 0,id,user_id,recipe_id,date,rating,review
0,370476,21752,57993,2003-05-01,5,Last week whole sides of frozen salmon fillet ...
1,624300,431813,142201,2007-09-16,5,So simple and so tasty! I used a yellow capsi...
2,187037,400708,252013,2008-01-10,4,"Very nice breakfast HH, easy to make and yummy..."
3,706134,2001852463,404716,2017-12-11,5,These are a favorite for the holidays and so e...
4,312179,95810,129396,2008-03-14,5,Excellent soup! The tomato flavor is just gre...
...,...,...,...,...,...,...
126691,1013457,1270706,335534,2009-05-17,4,This recipe was great! I made it last night. I...
126692,158736,2282344,8701,2012-06-03,0,This recipe is outstanding. I followed the rec...
126693,1059834,689540,222001,2008-04-08,5,"Well, we were not a crowd but it was a fabulou..."
126694,453285,2000242659,354979,2015-06-02,5,I have been a steak eater and dedicated BBQ gr...


In [113]:
reviews = pd.read_csv('./data/reviews_sample.csv',delimiter=',')
reviews.rename(columns={'Unnamed: 0':'id'},inplace=True)
reviews

Unnamed: 0,id,user_id,recipe_id,date,rating,review
0,370476,21752,57993,2003-05-01,5,Last week whole sides of frozen salmon fillet ...
1,624300,431813,142201,2007-09-16,5,So simple and so tasty! I used a yellow capsi...
2,187037,400708,252013,2008-01-10,4,"Very nice breakfast HH, easy to make and yummy..."
3,706134,2001852463,404716,2017-12-11,5,These are a favorite for the holidays and so e...
4,312179,95810,129396,2008-03-14,5,Excellent soup! The tomato flavor is just gre...
...,...,...,...,...,...,...
126691,1013457,1270706,335534,2009-05-17,4,This recipe was great! I made it last night. I...
126692,158736,2282344,8701,2012-06-03,0,This recipe is outstanding. I followed the rec...
126693,1059834,689540,222001,2008-04-08,5,"Well, we were not a crowd but it was a fabulou..."
126694,453285,2000242659,354979,2015-06-02,5,I have been a steak eater and dedicated BBQ gr...


In [114]:
reviews.set_index('id',inplace=True)
reviews

id,user_id,recipe_id,date,rating,review
370476,21752,57993,2003-05-01,5,Last week whole sides of frozen salmon fillet ...
624300,431813,142201,2007-09-16,5,So simple and so tasty! I used a yellow capsi...
187037,400708,252013,2008-01-10,4,"Very nice breakfast HH, easy to make and yummy..."
706134,2001852463,404716,2017-12-11,5,These are a favorite for the holidays and so e...
312179,95810,129396,2008-03-14,5,Excellent soup! The tomato flavor is just gre...
...,...,...,...,...,...
1013457,1270706,335534,2009-05-17,4,This recipe was great! I made it last night. I...
158736,2282344,8701,2012-06-03,0,This recipe is outstanding. I followed the rec...
1059834,689540,222001,2008-04-08,5,"Well, we were not a crowd but it was a fabulou..."
453285,2000242659,354979,2015-06-02,5,I have been a steak eater and dedicated BBQ gr...


## Measuring code execution time

Create a version of the table that contains only the rows for recipes added in 2010.

Implement several variants of a function that computes the mean length of the full recipe description for recipes added in 2010. The full description of a recipe is the string obtained by concatenating the recipe's name and description, separated by a space.

In [115]:
mask = recipes['submitted'].dt.year == 2010
recipes_2010 = recipes[mask]

In [118]:
recipes_2010.head()

id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients
437637,just peachy cobbler,70,1085867,2010-09-17,10.0,all i can say is yummmmmm . . . a simple to ma...,10.0
437219,the heat spicy party mix,95,1682162,2010-09-13,,a spicy chex mix that will really warm your gu...,11.0
435816,iowa state fair sweet dough caramel cinnamon ...,80,17803,2010-08-24,29.0,this was the winning entry at the 2010 iowa st...,
428566,1 minute blueberries cream,2,1375473,2010-06-04,4.0,i was craving blueberry tonight but wanted non...,
416599,2 2 2 diet mocha,5,789314,2010-03-15,5.0,"while trying to come up with a satisfying ""sna...",7.0


№1.1. Using the table's `DataFrame.iterrows` method:

- the function takes as input the table of recipes from 2010;
    
- the full description of a recipe is computed inside a loop over `iterrows`, separately for each row.

In [121]:
a = 0
a += 1
a

1

In [123]:
recipes_2010.shape[0]

1538

In [126]:
def get_mean_len_A(df: pd.DataFrame) -> float:
    total_len = 0
    for _, row in df.iterrows():
        # Full description = name + ' ' + description, hence the extra 1 for the space.
        total_len += len(row['name']) + len(row['description']) + 1
    return total_len / df.shape[0]

In [131]:
%%time
get_mean_len_A(recipes_2010)

CPU times: total: 0 ns
Wall time: 28.7 ms


265.501300390117

№1.2. Using the table's `DataFrame.apply` method:

- the function takes as input the table of recipes from 2010;
    
- call the table's `apply` method, passing it a function that returns the length of the full description for each row;
    
- compute the mean description length by calling the corresponding method of the series.

In [130]:
def get_mean_len_B(df: pd.DataFrame) -> float:
    mean_len = df['name'].apply(len) + df['description'].apply(len) + 1
    return mean_len.mean()

In [132]:
%%time
get_mean_len_B(recipes_2010)

CPU times: total: 0 ns
Wall time: 1.79 ms


265.501300390117
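The task statement asks for `DataFrame.apply` on the table, whereas the cell above calls `Series.apply` on two separate columns. A row-wise variant that follows the statement more literally could be sketched as below. The two-row table and the names `full_description_len` and `get_mean_len_apply` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Hypothetical miniature of recipes_2010 with only the two columns that matter
demo = pd.DataFrame(
    {
        "name": ["just peachy cobbler", "2 2 2 diet mocha"],
        "description": ["a simple cobbler", "a satisfying snack"],
    }
)


def full_description_len(row: pd.Series) -> int:
    # Name and description joined by a single space
    return len(row["name"] + " " + row["description"])


def get_mean_len_apply(df: pd.DataFrame) -> float:
    # axis=1 passes each row to the function as a Series
    return df.apply(full_description_len, axis=1).mean()


print(get_mean_len_apply(demo))
```

Row-wise `apply` builds a `Series` for every row, so it is likely to be noticeably slower than the column-wise version above, which is worth keeping in mind when comparing timings.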

№1.3. Using vectorized methods of `pd.Series`:

- the function takes as input the table of recipes from 2010;
    
- use the vectorized addition operation to obtain a column with the full description;
    
- compute the length of each element of the full-description column using the corresponding string method of the `.str` accessor;
    
- compute the mean description length by calling the corresponding method of the series.

In [136]:
def get_mean_len_C(df: pd.DataFrame) -> float:
    # Keep the full description in a local Series instead of assigning a new column:
    # recipes_2010 is a slice of recipes, so writing to it triggers SettingWithCopyWarning.
    full_description = df['name'] + ' ' + df['description']
    return full_description.str.len().mean()

In [137]:
get_mean_len_C(recipes_2010)


265.501300390117

№1.4. Check that the results of all the functions you wrote are correct and identical. Measure the execution time of all the functions using the `time` and `timeit` magic commands.
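Outside a notebook, the same check can be sketched with the standard `timeit` module in place of the `%time` and `%timeit` magics. The table below is a hypothetical stand-in for `recipes_2010`, and the three functions are simplified re-statements of the variants from tasks 1.1 to 1.3, not the exact cells above.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for recipes_2010 (the same two rows repeated)
demo = pd.DataFrame(
    {
        "name": ["just peachy cobbler", "2 2 2 diet mocha"] * 500,
        "description": ["a simple cobbler", "a satisfying snack"] * 500,
    }
)


def mean_len_iterrows(df):
    total = 0
    for _, row in df.iterrows():
        total += len(row["name"]) + len(row["description"]) + 1
    return total / len(df)


def mean_len_apply(df):
    return df.apply(lambda row: len(row["name"] + " " + row["description"]), axis=1).mean()


def mean_len_vectorized(df):
    return (df["name"] + " " + df["description"]).str.len().mean()


variants = [mean_len_iterrows, mean_len_apply, mean_len_vectorized]
results = [f(demo) for f in variants]

# All variants must agree before their timings are worth comparing
assert all(abs(r - results[0]) < 1e-9 for r in results)

timings = {}
for f in variants:
    # Best of 3 repeats, 5 calls each, reported per call
    timings[f.__name__] = min(timeit.repeat(lambda: f(demo), number=5, repeat=3)) / 5
    print(f"{f.__name__}: {timings[f.__name__] * 1e3:.2f} ms per call")
```

Taking the minimum over several repeats reduces the influence of background noise, which is also roughly what `%timeit` reports as its best run.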

## Line-by-line analysis of code execution

You are given a function that collects statistics on how many reviews contain each word.

In [141]:
reviews

id,user_id,recipe_id,date,rating,review
370476,21752,57993,2003-05-01,5,Last week whole sides of frozen salmon fillet ...
624300,431813,142201,2007-09-16,5,So simple and so tasty! I used a yellow capsi...
187037,400708,252013,2008-01-10,4,"Very nice breakfast HH, easy to make and yummy..."
706134,2001852463,404716,2017-12-11,5,These are a favorite for the holidays and so e...
312179,95810,129396,2008-03-14,5,Excellent soup! The tomato flavor is just gre...
...,...,...,...,...,...
1013457,1270706,335534,2009-05-17,4,This recipe was great! I made it last night. I...
158736,2282344,8701,2012-06-03,0,This recipe is outstanding. I followed the rec...
1059834,689540,222001,2008-04-08,5,"Well, we were not a crowd but it was a fabulou..."
453285,2000242659,354979,2015-06-02,5,I have been a steak eater and dedicated BBQ gr...


In [164]:
import re


def get_word_reviews_count(df):
    word_reviews = {}
    for review_id, row in df.dropna(subset=["review"]).iterrows():
        review = row["review"]
        words = re.sub(r"[^A-Za-z\s]", "", review).split(" ")
        for word in words:
            if word.lower() not in word_reviews:
                word_reviews[word.lower()] = set()
            word_reviews[word.lower()].add(review_id)
    word_reviews_count = {}
    for _, row in df.dropna(subset=["review"]).iterrows():
        review = row["review"]
        words = re.sub(r"[^A-Za-z\s]", "", review).split(" ")
        for word in words:
            word_reviews_count[word.lower()] = len(word_reviews[word.lower()])
    return word_reviews_count

In [161]:
set([1,1,2,2,2,3])

{1, 2, 3}

In [170]:
def get_word_reviews_count_fast(df):
    # regex=True is required: recent pandas versions treat the pattern as a literal string by default.
    # \s stays in the character class so that whitespace survives and the split on ' ' still works.
    words = df["review"].dropna().str.replace(r"[^A-Za-z\s]", "", regex=True).str.lower().str.split(" ").explode()
    # For each word, count the distinct reviews (index labels) in which it occurs.
    return words.groupby(words).agg(lambda x: len(set(x.index)))
get_word_reviews_count_fast(reviews)

In [140]:
get_word_reviews_count(reviews)

{'last': 4517,
 'week': 1489,
 'whole': 5540,
 'sides': 435,
 'of': 61867,
 'frozen': 2722,
 'salmon': 819,
 'fillet': 86,
 'was': 56972,
 'on': 28791,
 'sale': 255,
 'in': 43940,
 'my': 44544,
 'local': 565,
 'supermarket': 93,
 'so': 39441,
 'i': 101329,
 'bought': 1490,
 'tons': 184,
 'okay': 717,
 'only': 13679,
 '': 89125,
 'but': 36936,
 'total': 557,
 'weight': 290,
 'over': 8762,
 'pounds': 275,
 'this': 83593,
 'recipe': 54531,
 'is': 41236,
 'perfect': 8643,
 'for': 75829,
 'even': 8881,
 'though': 4791,
 'it': 73971,
 'calls': 513,
 'steaks': 434,
 'cut': 6416,
 'up': 14352,
 'the': 95894,
 'into': 6364,
 'individual': 304,
 'portions': 209,
 'and': 97007,
 'followed': 5450,
 'instructions': 971,
 'exactly': 4678,
 'im': 7768,
 'one': 15973,
 'those': 2408,
 'food': 3473,
 'combining': 82,
 'diets': 45,
 'left': 4958,
 'out': 22223,
 'white': 3493,
 'wine': 1580,
 'added': 19387,
 'just': 23483,
 'a': 84192,
 'dash': 617,
 'vinegar': 1641,
 'instead': 11221,
 'little': 14634

№2.1. Find the bottlenecks in the code by analyzing the function line by line with a profiler. Save the profiler output in a separate text cell. Describe in words what is implemented suboptimally in the existing code.
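In a notebook, line-by-line profiling is typically done with the `line_profiler` extension (`%load_ext line_profiler`, then `%lprun -f get_word_reviews_count get_word_reviews_count(reviews)`). As a self-contained alternative that needs only the standard library, the sketch below uses `cProfile`, which reports per-function rather than per-line statistics. The four-row table is a hypothetical stand-in for `reviews`.

```python
import cProfile
import io
import pstats
import re

import pandas as pd

# Hypothetical miniature of the reviews table
demo = pd.DataFrame(
    {"review": ["So simple and so tasty!", "Excellent soup!", None, "Simple soup."]},
    index=[10, 20, 30, 40],
)


def get_word_reviews_count(df):
    # Copy of the function under study, unchanged
    word_reviews = {}
    for review_id, row in df.dropna(subset=["review"]).iterrows():
        review = row["review"]
        words = re.sub(r"[^A-Za-z\s]", "", review).split(" ")
        for word in words:
            if word.lower() not in word_reviews:
                word_reviews[word.lower()] = set()
            word_reviews[word.lower()].add(review_id)
    word_reviews_count = {}
    for _, row in df.dropna(subset=["review"]).iterrows():
        review = row["review"]
        words = re.sub(r"[^A-Za-z\s]", "", review).split(" ")
        for word in words:
            word_reviews_count[word.lower()] = len(word_reviews[word.lower()])
    return word_reviews_count


profiler = cProfile.Profile()
profiler.enable()
counts = get_word_reviews_count(demo)
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The text held in `report` can be pasted into a separate cell, as the task requires. On the real table, the rows related to `iterrows` and `re.sub` are the ones to look at first.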

№2.2. Optimize the function and achieve a significant (at least 5x) speedup. To demonstrate the result, measure the execution time of both the original function and the one you wrote.
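One possible direction, sketched under assumptions, is to drop `iterrows` and the second pass entirely: iterate over the column values once, and let a `Counter` accumulate one increment per word per review. The table and the name `get_word_reviews_count_single_pass` are hypothetical; the sketch also assumes review ids are unique, so counting reviews is the same as counting distinct ids. The speedup actually achieved on the real data has to be measured separately.

```python
import re
import timeit
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the reviews table, with unique review ids as the index
demo = pd.DataFrame(
    {"review": ["So simple and so tasty!", "Excellent soup!", None, "Simple soup."] * 250}
)
demo.index = range(1000, 1000 + len(demo))

NON_LETTERS = re.compile(r"[^A-Za-z\s]")  # compiled once instead of on every call


def get_word_reviews_count_original(df):
    # Copy of the function under study, unchanged
    word_reviews = {}
    for review_id, row in df.dropna(subset=["review"]).iterrows():
        review = row["review"]
        words = re.sub(r"[^A-Za-z\s]", "", review).split(" ")
        for word in words:
            if word.lower() not in word_reviews:
                word_reviews[word.lower()] = set()
            word_reviews[word.lower()].add(review_id)
    word_reviews_count = {}
    for _, row in df.dropna(subset=["review"]).iterrows():
        review = row["review"]
        words = re.sub(r"[^A-Za-z\s]", "", review).split(" ")
        for word in words:
            word_reviews_count[word.lower()] = len(word_reviews[word.lower()])
    return word_reviews_count


def get_word_reviews_count_single_pass(df):
    counts = Counter()
    # Iterate over the column values directly instead of building a Series per row
    for review in df["review"].dropna():
        # set() ensures each review contributes at most once per word
        counts.update(set(NON_LETTERS.sub("", review).lower().split(" ")))
    return dict(counts)


original = get_word_reviews_count_original(demo)
fast = get_word_reviews_count_single_pass(demo)
assert original == fast  # same words, same counts

t_original = timeit.timeit(lambda: get_word_reviews_count_original(demo), number=3) / 3
t_fast = timeit.timeit(lambda: get_word_reviews_count_single_pass(demo), number=3) / 3
print(f"original: {t_original * 1e3:.2f} ms, single pass: {t_fast * 1e3:.2f} ms")
```

This variant still contains an explicit Python loop over the reviews; if the course rules require a loop-free solution, a fully `pandas`-based pipeline (`str.replace`, `str.split`, `explode`, `groupby`) is the alternative to compare against.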