Version 1.1.0

# Mean encodings

In this programming assignment you will work with the `1C` dataset from the final competition. You are asked to encode `item_id` in four different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between each resulting encoding and the target variable, to 4 decimal places.

### General tips

* Fill NaNs in the encoding with `0.3343`.
* Some encoding schemes depend on the sort order, so to avoid confusion, please use the following code snippet to construct the data frame. The snippet also implements mean encoding without regularization.

In [1]:
import pandas as pd
import numpy as np
from itertools import product
from grader import Grader

# Read data

In [2]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

In [3]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.
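The aggregation idea can be sketched on a toy table before running it on the full data. This is only an illustration: `toy_sales`, `toy_grid`, `monthly` and `toy_all` are hypothetical names, and the values are made up. It builds the per-month shop/item grid, sums daily counts into a monthly `target`, and fills pairs with no sales with 0.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical toy sales log: one row per (day, shop, item) transaction.
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [1, 1, 2, 1],
    "item_id":        [10, 10, 11, 10],
    "item_cnt_day":   [1.0, 2.0, 1.0, 4.0],
})
index_cols = ["shop_id", "item_id", "date_block_num"]

# Grid: every shop seen in a month crossed with every item seen in that month.
blocks = []
for block_num, month in toy_sales.groupby("date_block_num"):
    shops = month["shop_id"].unique()
    items = month["item_id"].unique()
    blocks.append(np.array(list(product(shops, items, [block_num])), dtype="int32"))
toy_grid = pd.DataFrame(np.vstack(blocks), columns=index_cols)

# Monthly totals, left-joined onto the grid; unseen pairs get a target of 0.
monthly = (toy_sales.groupby(index_cols, as_index=False)["item_cnt_day"].sum()
           .rename(columns={"item_cnt_day": "target"}))
toy_all = toy_grid.merge(monthly, how="left", on=index_cols).fillna(0)
print(toy_all)
```

Month 0 has two shops and two items, so it contributes four grid rows even though only two of those pairs actually sold anything.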

In [4]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)


In [5]:
grid.head()

Unnamed: 0,shop_id,item_id,date_block_num
0,59,22154,0
1,59,2552,0
2,59,2554,0
3,59,2555,0
4,59,2564,0


In [5]:
#get aggregated values for (shop_id, item_id, month)
gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day': 'sum'}) \
                                             .rename(columns={'item_cnt_day': 'target'})

In [7]:
gb.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
0,0,30,1,31.0
1,0,31,1,11.0
2,0,32,0,6.0
3,0,32,1,10.0
4,0,33,0,3.0


In [6]:
gb.columns.values

array(['shop_id', 'item_id', 'date_block_num', 'target'], dtype=object)

In [6]:
# #fix column names
# gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)

In [9]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


In [23]:
all_data.target.value_counts()

0.0      9307309
1.0      1058825
2.0       265502
3.0       103227
4.0        53327
          ...   
369.0          1
370.0          1
371.0          1
373.0          1
228.0          1
Name: target, Length: 446, dtype: int64

In [24]:
all_data.target.mean()

0.3342730567123426

# Mean encodings without regularization

With the technical work done, we are ready to actually *mean encode* the `item_id` variable.

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 
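As a quick orientation, here is a minimal sketch of both approaches on a made-up five-row frame (`toy`, `enc_map` and `enc_transform` are illustrative names, not part of the assignment). It only shows that mapping a `{category: mean}` Series and calling `transform('mean')` produce the same per-row values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two categories with different target means.
toy = pd.DataFrame({"item_id": [1, 1, 2, 2, 2],
                    "target":  [0.0, 2.0, 1.0, 1.0, 4.0]})

# Method 1 style: build a {category: mean} mapping, then map it onto the column.
means = toy.groupby("item_id")["target"].mean()
enc_map = toy["item_id"].map(means)

# Method 2 style: transform broadcasts each group mean back to every row.
enc_transform = toy.groupby("item_id")["target"].transform("mean")

print(np.allclose(enc_map, enc_transform))
```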

#### Method 1

In [10]:
item_id_target_mean = all_data.groupby('item_id').target.mean()

In [11]:
item_id_target_mean

item_id
0        0.020000
1        0.023810
2        0.019802
3        0.019802
4        0.020000
           ...   
22165    0.021277
22166    0.295918
22167    1.081081
22168    0.032967
22169    0.020833
Name: target, Length: 21807, dtype: float64

In [7]:
# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621699


In [8]:
all_data['item_id'].map(item_id_target_mean)

139255      0.022222
141495      0.056834
144968      0.141176
142661      0.037383
138947      1.319042
              ...   
10768834    1.556793
10769024    0.581395
10769690    1.235589
10771216    0.295918
10770511    1.081081
Name: item_id, Length: 10913850, dtype: float64

In [13]:
item_id_target_mean_dict = dict(item_id_target_mean)
item_id_target_mean_dict

{0: 0.02,
 1: 0.023809523809523808,
 2: 0.019801980198019802,
 3: 0.019801980198019802,
 4: 0.02,
 5: 0.02,
 6: 0.02,
 7: 0.02,
 8: 0.019801980198019802,
 9: 0.0196078431372549,
 10: 0.02,
 11: 0.02,
 12: 0.021739130434782608,
 13: 0.02,
 14: 0.02,
 15: 0.02,
 16: 0.02,
 17: 0.02,
 18: 0.0196078431372549,
 19: 0.022222222222222223,
 20: 0.0196078431372549,
 21: 0.02,
 22: 0.02127659574468085,
 23: 0.02127659574468085,
 24: 0.02127659574468085,
 25: 0.0196078431372549,
 26: 0.019230769230769232,
 27: 0.056833558863328824,
 28: 0.1411764705882353,
 29: 0.037383177570093455,
 30: 1.355613238157041,
 31: 0.9312134977287476,
 32: 1.3190416141235812,
 33: 0.5271122320302648,
 34: 0.14610778443113773,
 35: 0.9446808510638298,
 36: 0.06913996627318718,
 37: 0.15311510031678988,
 38: 0.09984871406959153,
 39: 0.03867403314917127,
 40: 0.0709433962264151,
 41: 0.08582834331337326,
 42: 0.0705958549222798,
 43: 0.032846715328467155,
 44: 0.08677337826453244,
 45: 0.11097099621689786,
 46: 0.05844

In [14]:
all_data['item_id'].map(item_id_target_mean_dict)

139255      0.022222
141495      0.056834
144968      0.141176
142661      0.037383
138947      1.319042
              ...   
10768834    1.556793
10769024    0.581395
10769690    1.235589
10771216    0.295918
10770511    1.081081
Name: item_id, Length: 10913850, dtype: float64

#### Method 2

In [10]:
all_data.groupby('item_id').target.mean()

item_id
0        0.020000
1        0.023810
2        0.019802
3        0.019802
4        0.020000
           ...   
22165    0.021277
22166    0.295918
22167    1.081081
22168    0.032967
22169    0.020833
Name: target, Length: 21807, dtype: float64

In [26]:
all_data.groupby('item_id')['target'].transform('mean')

139255      0.022222
141495      0.056834
144968      0.141176
142661      0.037383
138947      1.319042
              ...   
10768834    1.556793
10769024    0.581395
10769690    1.235589
10771216    0.295918
10770511    1.081081
Name: target, Length: 10913850, dtype: float64

In [11]:
# Unlike `.target.mean()`, `transform` returns a Series indexed like `all_data`,
# so this single line is equivalent to the first two lines of Method 1.
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621699


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. You need to **compute the correlation coefficient** between the target and each encoding that you implement, and **submit those to Coursera**.
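The correlation step itself is small enough to sketch in isolation. The arrays below are made up purely for illustration; `np.corrcoef` returns a 2x2 matrix, and the off-diagonal entry is the Pearson correlation between the two inputs.

```python
import numpy as np

# Hypothetical target and encoded-feature values.
target = np.array([0.0, 1.0, 2.0, 3.0])
encoded = np.array([0.1, 0.9, 2.2, 2.8])

# Off-diagonal entry of the correlation matrix is corr(target, encoded).
corr = np.corrcoef(target, encoded)[0, 1]

# Round to 4 decimal places, the precision asked for at the top of the assignment.
print(round(corr, 4))
```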

In [12]:
grader = Grader()

# 1. KFold scheme

Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

**Now it's your turn to write the code!** 

You may use the 'Regularization' video as a reference for all further tasks.

First, implement the KFold scheme with five folds, using `KFold(5)` from `sklearn.model_selection`.

1. Split your data into 5 folds with `sklearn.model_selection.KFold` and the `shuffle=False` argument.
2. Iterate through the folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold with those means.

    *  See **Method 1** from the example implementation. In particular, learn what the built-in `map` and `pd.Series.map` functions do. They are pretty handy in many situations.
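The two steps above can be sketched on a toy frame. This is a hedged illustration rather than the reference solution: the data, the three-fold split and the names `toy`, `enc` and `GLOBAL_MEAN` are assumptions made for the example. Note that `KFold.split` yields positional indices, so the sketch selects rows with `.iloc`; the toy index is deliberately shuffled to make the difference from label-based `.loc` visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data with a non-positional index, like a sorted frame.
toy = pd.DataFrame(
    {"item_id": [1, 1, 2, 2, 1, 2],
     "target":  [1.0, 0.0, 2.0, 4.0, 3.0, 0.0]},
    index=[5, 3, 0, 4, 1, 2],
)
GLOBAL_MEAN = 0.3343  # fill value for categories unseen in the training folds

enc = pd.Series(np.nan, index=toy.index)
for train_pos, test_pos in KFold(n_splits=3, shuffle=False).split(toy):
    train, test = toy.iloc[train_pos], toy.iloc[test_pos]
    # Mean target per category, estimated only on the other folds.
    fold_means = train.groupby("item_id")["target"].mean()
    # Fill the current fold; categories missing from `train` stay NaN.
    enc.iloc[test_pos] = test["item_id"].map(fold_means).to_numpy()

toy["item_target_enc"] = enc.fillna(GLOBAL_MEAN)
print(toy)
```

No row ever sees its own target: each fold is encoded with statistics computed on the remaining folds only.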

In [35]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


In [59]:
all_data = all_data.drop('item_target_enc', axis=1)

In [13]:
from sklearn.model_selection import KFold

In [89]:
all_data['item_target_enc'] = 0.

In [91]:
all_data

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.0
141495,0,27,0,0.0,0.0
144968,0,28,0,0.0,0.0
142661,0,29,0,0.0,0.0
138947,0,32,0,6.0,0.0
...,...,...,...,...,...
10768834,59,22162,33,0.0,0.0
10769024,59,22163,33,0.0,0.0
10769690,59,22164,33,0.0,0.0
10771216,59,22166,33,0.0,0.0


In [63]:
all_data['item_target_enc'].loc[[0, 1]]

0    0.3343
1    0.3343
Name: item_target_enc, dtype: float64

In [64]:
all_data['item_target_enc'].loc[test_index] 

8731080     0.3343
8731081     0.3343
8731082     0.3343
8731083     0.3343
8731084     0.3343
             ...  
10913845    0.3343
10913846    0.3343
10913847    0.3343
10913848    0.3343
10913849    0.3343
Name: item_target_enc, Length: 2182770, dtype: float64

In [67]:
all_data['item_id'].loc[test_index].map(all_data.loc[train_index].groupby('item_id').target.mean())

8731080     1.352113
8731081     0.402192
8731082     0.133460
8731083     0.220000
8731084     0.343170
              ...   
10913845         NaN
10913846    0.020408
10913847         NaN
10913848    0.020202
10913849         NaN
Name: item_id, Length: 2182770, dtype: float64

In [75]:
all_data['item_target_enc'].loc[test_index] = all_data['item_id'].loc[test_index].map(all_data.loc[train_index].groupby('item_id').target.mean())

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [81]:
all_data['item_id'].loc[test_index].map(all_data.loc[train_index].groupby('item_id').target.mean()).fillna(0.)

8731080     1.352113
8731081     0.402192
8731082     0.133460
8731083     0.220000
8731084     0.343170
              ...   
10913845    0.000000
10913846    0.020408
10913847    0.000000
10913848    0.020202
10913849    0.000000
Name: item_id, Length: 2182770, dtype: float64

In [97]:
all_data['item_target_enc'] += all_data['item_id'].loc[test_index].map(all_data.loc[train_index].groupby('item_id').target.mean()).fillna(0.)

In [98]:
all_data['item_target_enc'].loc[test_index]

8731080     1.352113
8731081     0.402192
8731082     0.133460
8731083     0.220000
8731084     0.343170
              ...   
10913845    0.000000
10913846    0.020408
10913847    0.000000
10913848    0.020202
10913849    0.000000
Name: item_target_enc, Length: 2182770, dtype: float64

In [99]:
testset = all_data['item_target_enc'].loc[test_index]

In [100]:
part1 = testset[:10]
part2 = testset[10:20]

In [101]:
part1

8731080    1.352113
8731081    0.402192
8731082    0.133460
8731083    0.220000
8731084    0.343170
8731085    0.333333
8731086    0.225000
8731087    1.393939
8731088    0.535354
8731089    1.969697
Name: item_target_enc, dtype: float64

In [102]:
part2

8731090    0.515152
8731091    1.181818
8731092    2.101010
8731093    1.353535
8731094    0.666948
8731095    0.290894
8731096    1.484848
8731097    1.686869
8731098    3.121212
8731099    0.029412
Name: item_target_enc, dtype: float64

In [103]:
pd.concat([part1, part2])

8731080    1.352113
8731081    0.402192
8731082    0.133460
8731083    0.220000
8731084    0.343170
8731085    0.333333
8731086    0.225000
8731087    1.393939
8731088    0.535354
8731089    1.969697
8731090    0.515152
8731091    1.181818
8731092    2.101010
8731093    1.353535
8731094    0.666948
8731095    0.290894
8731096    1.484848
8731097    1.686869
8731098    3.121212
8731099    0.029412
Name: item_target_enc, dtype: float64

In [65]:
all_data.loc[train_index].groupby('item_id').target.mean()

item_id
0        0.020000
1        0.023810
2        0.019802
3        0.019802
4        0.020000
           ...   
22165    0.021277
22166    0.349227
22167    1.229927
22168    0.032967
22169    0.020833
Name: target, Length: 19244, dtype: float64

In [105]:
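kf = KFold(n_splits=5)  # shuffle=False is the default

# NOTE: kf.split yields positional indices, but .loc selects by index label;
# after the sort above the two differ (use .iloc for positional folds).
enc_feat_list = []
for train_index, test_index in kf.split(all_data):
    # Mean target per item_id from the other four folds, mapped onto the current fold
    fold_means = all_data.loc[train_index].groupby('item_id').target.mean()
    enc_feat_list.append(all_data['item_id'].loc[test_index].map(fold_means))

all_data['item_target_enc'] = pd.concat(enc_feat_list)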

In [108]:
pd.concat(enc_feat_list)

0           0.084746
1           0.021583
2                NaN
3           0.056995
4           0.040307
              ...   
10913845         NaN
10913846    0.020408
10913847         NaN
10913848    0.020202
10913849         NaN
Name: item_id, Length: 10913850, dtype: float64

In [106]:
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

In [107]:
encoded_feature = all_data['item_target_enc'].values

# You will need to compute correlation like that
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)

0.4166330182247348
Current answer for task KFold_scheme is: 0.4166330182247348


# 2. Leave-one-out scheme

Now, implement the leave-one-out scheme. Note that if you simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time.

To implement a faster version, note that to calculate the mean target value using all the objects but one *given object*, you can:

1. Calculate sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. 

Note that you do not need to perform `1.` for every object. And `2.` can be implemented without any `for` loop.

It is most convenient to use the `.transform` function, as in **Method 2**.
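A minimal sketch of this loop-free computation, assuming a made-up frame (`toy`, `group_sum` and `group_cnt` are illustrative names). Categories with a single row have nothing left after removing the row itself, so the division yields NaN, which is then filled with `0.3343`.

```python
import pandas as pd

# Hypothetical toy data; item 3 appears only once.
toy = pd.DataFrame({"item_id": [1, 1, 1, 2, 2, 3],
                    "target":  [1.0, 2.0, 6.0, 0.0, 4.0, 5.0]})

grp = toy.groupby("item_id")["target"]
group_sum = grp.transform("sum")    # step 1: per-category sum, broadcast to rows
group_cnt = grp.transform("count")  # per-category size, broadcast to rows

# Step 2: remove the row's own target and divide by the remaining count.
# Singleton categories give 0/0, which pandas evaluates to NaN.
loo = (group_sum - toy["target"]) / (group_cnt - 1)
toy["item_target_enc"] = loo.fillna(0.3343)
print(toy)
```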

In [33]:
all_data = all_data.drop('item_target_enc', axis=1)
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


In [42]:
sums_by_item_id = all_data.groupby('item_id').target.sum()
lens_by_item_id = all_data.groupby('item_id').target.count()

In [44]:
# Leave-one-out mean: (category sum - own target) / (category size - 1)
all_data['item_target_enc'] = (all_data['item_id'].map(sums_by_item_id) - all_data['target']) / all_data['item_id'].map(lens_by_item_id - 1)

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

encoded_feature = all_data['item_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

0.4803848311293002
Current answer for task Leave-one-out_scheme is: 0.4803848311293002


In [32]:
all_data.item_target_enc.value_counts()

0.334273    9307309
0.334273    1058825
0.334273     265502
0.334273     103227
0.334273      53327
             ...   
0.334203          1
0.334228          1
0.334186          1
0.334240          1
0.334233          1
Name: item_target_enc, Length: 446, dtype: int64

# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement the smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).
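Written out, the smoothed encoding for a category is $\frac{\text{mean(target)} \cdot \text{nrows} + \text{globalmean} \cdot \alpha}{\text{nrows} + \alpha}$. The sketch below applies it to a made-up frame; `ALPHA`, `GLOBAL_MEAN`, `cat_mean` and `nrows` are illustrative names chosen for this example.

```python
import pandas as pd

ALPHA = 100           # smoothing strength
GLOBAL_MEAN = 0.3343  # prior that small categories are pulled towards

# Hypothetical toy data: a large category and a smaller one.
toy = pd.DataFrame({"item_id": [1] * 300 + [2] * 100,
                    "target":  [1.0] * 300 + [2.0] * 100})

grp = toy.groupby("item_id")["target"]
cat_mean = grp.transform("mean")  # per-category mean target
nrows = grp.transform("count")    # per-category number of rows

# Weighted average of the category mean and the global mean.
toy["item_target_enc"] = (cat_mean * nrows + GLOBAL_MEAN * ALPHA) / (nrows + ALPHA)
print(toy.groupby("item_id")["item_target_enc"].first())
```

The smaller category is shrunk more strongly towards the global mean, since $\alpha$ carries more weight relative to its `nrows`.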

In [None]:
all_data = all_data.drop('item_target_enc', axis=1)

In [47]:
means_by_item_id = all_data.groupby('item_id').target.mean()
lens_by_item_id = all_data.groupby('item_id').target.count()

In [48]:
# Smoothed mean: (category mean * nrows + globalmean * alpha) / (nrows + alpha), with alpha = 100
all_data['item_target_enc'] = (all_data['item_id'].map(means_by_item_id) * all_data['item_id'].map(lens_by_item_id) \
                               + 0.3343 * 100) / (all_data['item_id'].map(lens_by_item_id) + 100)

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

encoded_feature = all_data['item_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

0.4818198797097264
Current answer for task Smoothing_scheme is: 0.4818198797097264


# 4. Expanding mean scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas.
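A small sketch of the idea, assuming a made-up frame whose row order stands in for time (`toy`, `prev_sum` and `prev_cnt` are illustrative names). Each row is encoded with the mean target of the *earlier* rows of the same category; the first occurrence of a category has no history, so it becomes NaN and is filled with `0.3343`.

```python
import pandas as pd

# Hypothetical toy data; row order is the order in which rows are "seen".
toy = pd.DataFrame({"item_id": [1, 2, 1, 1, 2],
                    "target":  [2.0, 1.0, 4.0, 0.0, 3.0]})

grp = toy.groupby("item_id")["target"]
prev_sum = grp.cumsum() - toy["target"]  # sum of earlier targets in the category
prev_cnt = grp.cumcount()                # number of earlier rows in the category

# First occurrences give 0/0 = NaN and fall back to the fill value.
toy["item_target_enc"] = (prev_sum / prev_cnt).fillna(0.3343)
print(toy)
```

Because the result depends on row order, the frame has to be sorted consistently before this is applied.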

In [49]:
cumsum = all_data.groupby('item_id').target.cumsum() - all_data['target']
cumcnt = all_data.groupby('item_id').target.cumcount()

In [55]:
len(cumcnt)

10913850

In [52]:
cumsum.reset_index()

Unnamed: 0,index,target
0,139255,0.0
1,141495,0.0
2,144968,0.0
3,142661,0.0
4,138947,0.0
...,...,...
10913845,10768834,699.0
10913846,10769024,75.0
10913847,10769690,493.0
10913848,10771216,348.0


In [53]:
all_data['item_target_enc'] = cumsum / cumcnt

all_data

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,
141495,0,27,0,0.0,
144968,0,28,0,0.0,
142661,0,29,0,0.0,
138947,0,32,0,6.0,
...,...,...,...,...,...
10768834,59,22162,33,0.0,1.560268
10769024,59,22163,33,0.0,0.585938
10769690,59,22164,33,0.0,1.238693
10771216,59,22166,33,0.0,0.296170


In [56]:
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

encoded_feature = all_data['item_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

0.5025245211081697
Current answer for task Expanding_mean_scheme is: 0.5025245211081697


## Authorization & Submission
To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment's page. Note: the token expires 30 minutes after generation.

In [109]:
STUDENT_EMAIL = ''
STUDENT_TOKEN = ''
grader.status()

You want to submit these numbers:
Task KFold_scheme: 0.4166330182247348
Task Leave-one-out_scheme: 0.4803848311293002
Task Smoothing_scheme: 0.4818198797097264
Task Expanding_mean_scheme: 0.5025245211081697


In [110]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
