Version 1.1.0

# Mean encodings

In this programming assignment you will be working with the `1C` dataset from the final competition. You are asked to encode `item_id` in 4 different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between each resulting encoding and the target variable up to 4 decimal places.

### General tips

* Fill NANs in the encoding with `0.3343`.
* Some encoding schemes depend on sorting order, so in order to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

## The NaN Value

In [1]:
NAN_VALUE = 0.3343

## Silence NumPy Warnings

In [2]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

# Library Imports

In [3]:
# Python Standard Library
from itertools import product
import os

# From PyPi
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from grader import Grader

# Read data

In [4]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

# The Motivation

The idea behind this is that we want to convert a categorical value (the item ID) into a numeric one so that we can use non-tree-based methods. But we already have One Hot Encoding, so why do we need this? Well, let's look at how many items we need to encode.

In [5]:
print("{:,}".format(len(sales.item_id.unique())))

21,807


This means we're going to have to add nearly 22K columns to the table, which brings up the [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) -- adding this many columns means we're going to need a lot more data for the model to work and will significantly increase computation time. Mean encoding adds only a single column, which simplifies the computation and reduces the amount of data needed to fit the model.
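
To make the column-count argument concrete, here is a minimal sketch on a made-up five-row frame (the `toy` data and its values are assumptions, not taken from the `1C` dataset). It compares the width of a one-hot encoding from `pd.get_dummies` with the single column that a mean encoding adds.

```python
import pandas as pd

# Hypothetical toy data: three distinct items, five rows.
toy = pd.DataFrame({
    "item_id": [10, 10, 20, 30, 30],
    "target": [1.0, 0.0, 2.0, 0.0, 4.0],
})

# One-hot encoding adds one column per distinct item_id.
one_hot = pd.get_dummies(toy["item_id"], prefix="item")

# Mean encoding adds a single numeric column, whatever the number of items.
mean_encoded = toy.groupby("item_id")["target"].transform("mean")

print(one_hot.shape[1])       # number of one-hot columns
print(mean_encoded.tolist())  # one value per row
```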

# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cells serve just that purpose.

In [6]:
group_by_columns = ['shop_id', 'item_id', 'date_block_num']

For every month, we create a grid from all shop/item combinations for the month. This uses [itertools.product](https://docs.python.org/3/library/itertools.html#itertools.product), which creates the cartesian product of the given collections.

In [7]:
grid = [] 
for block_num in sales['date_block_num'].unique():
    block = sales[sales['date_block_num']==block_num]
    cur_shops = block['shop_id'].unique()
    cur_items = block['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])), dtype='int32'))
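
For a feel of what each appended block looks like, here's a small sketch with made-up shop and item IDs (all values are hypothetical).

```python
from itertools import product

import numpy as np

# Hypothetical shops and items seen in one month (date block 0).
toy_shops = [1, 2]
toy_items = [100, 200, 300]
toy_block_num = 0

# Every (shop, item, month) combination: 2 shops x 3 items x 1 month = 6 rows.
toy_block = np.array(
    list(product(toy_shops, toy_items, [toy_block_num])), dtype="int32"
)

print(toy_block.shape)
print(toy_block[:3])
```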

Now turn the grid into a pandas dataframe.

In [8]:
grid = pd.DataFrame(np.vstack(grid), columns = group_by_columns, dtype = np.int32)

In [9]:
print(grid.head())

   shop_id  item_id  date_block_num
0       59    22154               0
1       59     2552               0
2       59     2554               0
3       59     2555               0
4       59     2564               0


In [10]:
print(grid.shape)

(10913850, 3)


The grid has all the items sold by each shop for each date-block. The number of rows isn't just shops x items x date-blocks because not every shop is in every date-block and not every shop sold every item (or even the same items every block).

Now we will use [groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) to group the data by shop_id, item_id and month (date_block_num) and then get the aggregated summed values for the item count per day (we're going to sum up the items sold per day to get a value for the month) and [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) the summed item count column to target.


In [11]:
grouped = sales.groupby(group_by_columns, as_index = False)
grouped = grouped["item_cnt_day"].sum()
grouped = grouped.rename(columns = {"item_cnt_day": "target"})

In [12]:
print(grouped.head())

   shop_id  item_id  date_block_num  target
0        0       30               1    31.0
1        0       31               1    11.0
2        0       32               0     6.0
3        0       32               1    10.0
4        0       33               0     3.0


___
Now join the aggregated data to the grid (with [merge](https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging)).

In [13]:
all_data = pd.merge(grid, grouped, how = 'left', on = group_by_columns).fillna(0)

In [14]:
print(all_data.head())

   shop_id  item_id  date_block_num  target
0       59    22154               0     1.0
1       59     2552               0     0.0
2       59     2554               0     0.0
3       59     2555               0     0.0
4       59     2564               0     0.0


___
Sort the data by the month, shop, and item.

In [15]:
all_data.sort_values(['date_block_num','shop_id','item_id'], inplace = True)

In [16]:
print(all_data.head())

        shop_id  item_id  date_block_num  target
139255        0       19               0     0.0
141495        0       27               0     0.0
144968        0       28               0     0.0
142661        0       29               0     0.0
138947        0       32               0     6.0


___
When we compare all_data to grouped, the difference might not be so obvious: they have the same columns and look pretty similar, but if you look at the all_data.target column you can see that there are a lot of 0s. That's because grouped only has the cases where there were sales, while all_data also has the (shop_id, item_id, date_block_num) combinations without any sales, and those were filled in with 0s.

In [17]:
print("{:,}".format(all_data.shape[0] - grouped.shape[0]))

9,304,726


You can see that all_data had over 9M more rows than grouped did.

In [18]:
print(grouped[(grouped.shop_id == 0) & (grouped.item_id == 19) & (grouped.date_block_num == 0)])

Empty DataFrame
Columns: [shop_id, item_id, date_block_num, target]
Index: []


And grouped didn't have any entry for the first item in the previous `all_data` head-table, which is why the target value is 0.
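
The zero-filling described above can be reproduced on a small scale. This is a sketch with made-up frames (`toy_grid` and `toy_sales` are hypothetical names): a left merge keeps every grid row, and `fillna(0)` turns the combinations with no sales into zeros.

```python
import pandas as pd

keys = ["shop_id", "item_id", "date_block_num"]

# Hypothetical grid: every shop/item combination for month 0.
toy_grid = pd.DataFrame(
    [[0, 19, 0], [0, 32, 0], [1, 19, 0], [1, 32, 0]], columns=keys
)

# Hypothetical aggregated sales: only combinations that actually sold.
toy_sales = pd.DataFrame([[0, 32, 0, 6.0]], columns=keys + ["target"])

# A left merge keeps all grid rows; unmatched rows get NaN, then 0.
toy_all = pd.merge(toy_grid, toy_sales, how="left", on=keys).fillna(0)

print(toy_all)
print(len(toy_all) - len(toy_sales))  # rows that exist only because of the grid
```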

# Mean Encodings Without Regularization

Now that we have done the technical work, we are ready to actually *mean encode* the desired `item_id` variable. 

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 

### Method 1: Calculate a Mapping: {item_id:target_mean}

First: we're going to calculate the mean count for each item.

In [19]:
item_id_target_mean = all_data.groupby('item_id').target.mean()

In our non-regularized case, we just [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) the computed means to the `item_id`s.

In [20]:
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

In our case, we are mapping a series (`item_id_target_mean`) to a column `item_id` in a data frame `all_data`. Wherever an item in the `item_id` column matches the index of our `item_id_target_mean` series it will replace the item with the value in the `item_id_target_mean` that matches the index.

Here's an example: let's look at the head of the `item_id_target_mean` series.

In [21]:
print(dict(target_mean = item_id_target_mean.head()))

{'target_mean': item_id
0    0.020000
1    0.023810
2    0.019802
3    0.019802
4    0.020000
Name: target, dtype: float64}


___
So, let's look at index 1 -- its value is 0.02381, which means we would expect all the rows with item ID 1 to have this value in the `item_target_enc` column. Let's double-check.

In [22]:
print(all_data[all_data.item_id==1].head())

         shop_id  item_id  date_block_num  target  item_target_enc
5664585        2        1              15     0.0          0.02381
5651037        3        1              15     0.0          0.02381
5657811        4        1              15     0.0          0.02381
5671359        5        1              15     0.0          0.02381
5623941        6        1              15     0.0          0.02381


___
It looks right. Let's make sure.

In [23]:
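# Check the encoded column only; all() on a DataFrame iterates over column names and always passes.
assert np.allclose(all_data.loc[all_data.item_id == 1, 'item_target_enc'], item_id_target_mean.loc[1])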

Well, this wasn't exhaustive, but at least that one item checks out.

#### Fill NaNs

We're given the value to fill in for the missing entries (0.3343) without explanation. I don't really know where it comes from. It's near, but not exactly, the 84th percentile, but, anyway, let's use it (actually, if you check, there aren't any NaN values, curious).

In [24]:
print(all_data.item_target_enc.hasnans)

False


So, this next line doesn't seem to do anything, but it is part of the given code.

In [25]:
all_data['item_target_enc'].fillna(NAN_VALUE, inplace=True) 
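
A guess at why the fill is in the given code: `map` returns NaN for any key that is missing from the mapping. That can't happen when the means are computed on the same rows they're mapped onto, but it does happen once the means come from a different subset of rows, as in the regularized schemes later on. A minimal sketch with made-up values:

```python
import pandas as pd

NAN_VALUE = 0.3343

# Hypothetical means computed on a subset that never saw item 30.
toy_means = pd.Series({10: 0.5, 20: 2.0})
toy_items = pd.Series([10, 20, 30])

# Item 30 has no entry in the mapping, so map() yields NaN for it.
encoded = toy_items.map(toy_means)
print(encoded.isna().tolist())

# Filling replaces only the missing entry.
filled = encoded.fillna(NAN_VALUE)
print(filled.tolist())
```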

#### Print Correlation

Now we need to calculate the [Pearson Correlation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html) between our calculated mean and the target values. This [value](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) ranges from -1 to 1 and represents how much of a linear correlation there is between two variables: -1 means they're completely negatively correlated and +1 means they're completely positively correlated.

In [26]:
encoded_feature = all_data['item_target_enc'].values
first_correlation = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(first_correlation)

0.483038698862


___
Since our value is between 0 and 1 it does describe the target to some degree, albeit not perfectly.
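
For reference, `np.corrcoef` returns a 2x2 matrix and the off-diagonal entry is the coefficient we want. Here's a sketch that computes the same number directly from the definition (covariance divided by the product of the standard deviations) on made-up arrays:

```python
import numpy as np

# Hypothetical values, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 3.0, 2.0, 5.0])

# Pearson r: mean of the products of the centered values,
# divided by the product of the (population) standard deviations.
x_centered = x - x.mean()
y_centered = y - y.mean()
manual_r = (x_centered * y_centered).mean() / (x.std() * y.std())

library_r = np.corrcoef(x, y)[0][1]
print(manual_r, library_r)
```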

### Method 2

Unlike `.target.mean()`, `transform` returns a Series with the same index as `all_data`. This single line of code is equivalent to the two lines from Method 1.

In [27]:
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')
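
A quick sketch of the equivalence on made-up data: `groupby(...).mean()` gives one value per item, while `transform('mean')` broadcasts that value back to every row, which is what `map` did in Method 1.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "item_id": [10, 20, 10, 20, 10],
    "target": [1.0, 4.0, 2.0, 6.0, 3.0],
})

# Method 1: one mean per item, then mapped back onto the rows.
per_item_mean = toy.groupby("item_id")["target"].mean()
via_map = toy["item_id"].map(per_item_mean)

# Method 2: transform returns a result aligned with toy's index.
via_transform = toy.groupby("item_id")["target"].transform("mean")

print(len(per_item_mean), len(via_transform))
print(via_transform.tolist())
```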

#### Fill NaNs

In [28]:
all_data['item_target_enc'].fillna(NAN_VALUE, inplace=True)

#### Print Correlation

In [29]:
encoded_feature = all_data['item_target_enc'].values
second_correlation = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(second_correlation)
print(abs(first_correlation - second_correlation))

0.483038698862
0.0


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. You need to **compute the correlation coefficient between the target and each of the encodings** that you will implement and **submit those to Coursera**.

In [30]:
grader = Grader()

# 1. KFold scheme

Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

First, implement KFold scheme with five folds. Use KFold(5) from sklearn.model_selection. 

1. Split your data in 5 folds with [`sklearn.model_selection.KFold`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with `shuffle=False` argument.
2. Iterate through folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold.

See the **Method 1** from the example implementation. In particular learn what `map` and [pd.Series.map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) functions do. They are pretty handy in many situations.

In [32]:
from sklearn import model_selection

kf = model_selection.KFold(5, shuffle=False)
all_data['item_target_enc'] = np.nan

for tr_ind, val_ind in kf.split(all_data):
    X_tr, X_val = all_data.iloc[tr_ind], all_data.iloc[val_ind]
    all_data.loc[all_data.index[val_ind], 'item_target_enc'] = X_val['item_id'].map(X_tr.groupby('item_id').target.mean())
    
all_data['item_target_enc'].fillna(0.3343, inplace=True)
encoded_feature = all_data['item_target_enc'].values

____
You will need to compute correlation like that.

In [33]:
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)

0.41645907128
Current answer for task KFold_scheme is: 0.41645907128
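
To see what the loop does on a small scale, here's a sketch on made-up data, wrapped in a hypothetical helper `kfold_mean_encode` (not part of the assignment). Each row's encoding comes only from rows outside its own fold, so an item that appears in just one fold ends up with the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def kfold_mean_encode(frame, column, target, n_splits, fill_value):
    """Out-of-fold mean encoding of `column` (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=frame.index)
    for train_idx, val_idx in KFold(n_splits, shuffle=False).split(frame):
        train, val = frame.iloc[train_idx], frame.iloc[val_idx]
        # Means come from the other folds only.
        means = train.groupby(column)[target].mean()
        encoded.iloc[val_idx] = val[column].map(means).values
    return encoded.fillna(fill_value)


# Hypothetical toy data: item 30 only appears in the last fold.
toy = pd.DataFrame({
    "item_id": [10, 20, 10, 20, 30, 30],
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
toy_encoded = kfold_mean_encode(
    toy, "item_id", "target", n_splits=3, fill_value=0.3343
)
print(toy_encoded.tolist())
```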


In [31]:
#folder = KFold(n_splits=5, shuffle=False)
#column = "item_id"
#encoded_column = column + "_mean_target"
#train_new = pd.DataFrame(index=all_data.index, columns=all_data.columns)
#train_new[encoded_column] = np.nan
#for training_index, validation_index in folder.split(all_data):
 #   x_train = all_data.iloc[training_index].copy()
  #  x_validation = all_data.iloc[validation_index].copy()
   # means = x_validation[column].map(x_train.groupby(column).target.mean())
    #x_validation[encoded_column] = means
    # train_new is a dataframe copy we made of the training data
    #train_new.iloc[validation_index] = x_validation
#train_new.fillna(NAN_VALUE, inplace=True)

In [32]:
#encoded_feature = train_new.item_id_mean_target.values

# 2. Leave-one-out scheme

Now, implement leave-one-out scheme. 
Note: if you just simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time. 

To implement a faster version, note, that to calculate mean target value using all the objects but one **given object**, you can:

1. Calculate sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. 

Note that you do not need to perform `1.` for every object. And `2.` can be implemented without any `for` loop.

It is the most convenient to use `.transform` function as in **Method 2**.

First, we calculate the sum and count.

In [39]:
leave_one_out_sum = all_data['item_id'].map(all_data.groupby('item_id').target.sum())
leave_one_out_count = all_data['item_id'].map(all_data.groupby('item_id').target.count())

In [36]:
print(all_data.head())

        shop_id  item_id  date_block_num  target  item_target_enc
139255        0       19               0     0.0         0.022727
141495        0       27               0     0.0         0.056911
144968        0       28               0     0.0         0.141414
142661        0       29               0     0.0         0.037500
138947        0       32               0     6.0         1.316088


___
Now, we'll calculate the mean target for each row's item with that row left out.

In [37]:
all_data['item_target_enc'] = ((leave_one_out_sum - all_data['target']))/(leave_one_out_count-1)
all_data['item_target_enc'].fillna(0.3343, inplace=True)                                                
encoded_feature = all_data['item_target_enc'].values

In [38]:
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

0.480384831129
Current answer for task Leave-one-out_scheme is: 0.480384831129
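
As a sanity check on the sum-and-count shortcut, here's a sketch on made-up data that compares it with a brute-force loop that drops each row in turn. An item with a single row has nothing left after the drop, so the division is 0/0 and yields NaN, which is why the fill is still needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; item 30 has a single row.
toy = pd.DataFrame({
    "item_id": [10, 10, 10, 20, 20, 30],
    "target": [1.0, 2.0, 6.0, 4.0, 0.0, 5.0],
})

# Vectorized: (group sum - own target) / (group size - 1).
group_sum = toy.groupby("item_id")["target"].transform("sum")
group_count = toy.groupby("item_id")["target"].transform("count")
fast = (group_sum - toy["target"]) / (group_count - 1)

# Brute force: drop the row, then average what is left for the same item.
slow = []
for row in toy.index:
    rest = toy.drop(index=row)
    same_item = rest.loc[rest["item_id"] == toy.loc[row, "item_id"], "target"]
    slow.append(same_item.mean())  # mean of an empty selection is NaN

print(fast.tolist())
print(np.allclose(fast.values, np.array(slow), equal_nan=True))
```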


# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).

In [40]:
alpha = 100
globalmean = 0.3343
train_new = all_data.copy()
nrows = train_new.groupby('item_id').size()
means = train_new.groupby('item_id').target.agg('mean')

score = (np.multiply(means,nrows)  + globalmean*alpha) / (nrows+alpha)
train_new['smooth'] = train_new['item_id']
train_new['smooth'] = train_new['smooth'].map(score)

In [41]:
encoded_feature = train_new['smooth'].values
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

0.48181987971
Current answer for task Smoothing_scheme is: 0.48181987971
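
The `score` computed above is a weighted average of each item's own mean (weight `nrows`) and `globalmean` (weight `alpha`). Here's a sketch on made-up per-item statistics (the `stats` frame and its labels are hypothetical) that shows the effect on a rare item and a frequent one with the same raw mean.

```python
import pandas as pd

alpha = 100
globalmean = 0.3343

# Hypothetical per-item statistics: a rare item and a frequent one,
# both with the same raw mean target.
stats = pd.DataFrame(
    {"mean": [2.0, 2.0], "nrows": [5, 5000]},
    index=pd.Index(["rare_item", "frequent_item"], name="item_id"),
)

# Weighted average of the item mean and the global mean.
stats["smooth"] = (stats["mean"] * stats["nrows"] + globalmean * alpha) / (
    stats["nrows"] + alpha
)

print(stats)
```

The rare item is pulled most of the way toward `globalmean`, while the frequent item stays close to its own mean.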


# 4. Expanding Mean Scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need the [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas. Here `cumsum` is the running sum within each group and `cumcount` numbers the rows within each group, starting at 0.

In [44]:
cumsum = all_data.groupby('item_id').target.cumsum() - all_data['target']
cumcnt = all_data.groupby('item_id').cumcount()


In [45]:
train_new["mean_target"] = cumsum/cumcnt
train_new['mean_target'].fillna(0.3343, inplace=True)
encoded_feature = train_new['mean_target'].values

In [46]:
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

0.502524521108
Current answer for task Expanding_mean_scheme is: 0.502524521108
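
Here's a sketch of the same computation on a made-up frame, to show what the two cumulative quantities look like. Each row is encoded using only the rows of the same item that come before it, so the first occurrence of every item divides 0 by 0 and gets the fill value.

```python
import pandas as pd

# Hypothetical toy data, already in the order the encoding should respect.
toy = pd.DataFrame({
    "item_id": [10, 20, 10, 10, 20],
    "target": [1.0, 4.0, 3.0, 5.0, 0.0],
})

# Sum of the targets of earlier rows of the same item (own target excluded).
prior_sum = toy.groupby("item_id")["target"].cumsum() - toy["target"]
# Number of earlier rows of the same item: 0 for the first occurrence.
prior_count = toy.groupby("item_id").cumcount()

expanding_mean = (prior_sum / prior_count).fillna(0.3343)

print(prior_count.tolist())
print(expanding_mean.tolist())
```

Because the result depends on row order, this is one of the schemes for which the sort described earlier matters.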


## Authorization & Submission
To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment page. Note: the token expires 30 minutes after generation.

In [47]:
STUDENT_EMAIL = 'valexanderma@gmail.com'
STUDENT_TOKEN = 'sVEpzsGDo17fE7I3'
grader.status()

You want to submit these numbers:
Task KFold_scheme: 0.41645907128
Task Leave-one-out_scheme: 0.480384831129
Task Smoothing_scheme: 0.48181987971
Task Expanding_mean_scheme: 0.502524521108


In [48]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
