# Generate auxiliary features

# TOC

* [1 Loading the data](#1-Loading-the-data)
* [2 Adding date features](#2-Adding-date-features)
* [3 Adding text features](#3-Adding-text-features)
* [4 Adding leakage features](#4-Adding-leakage-features)
* [5 The big merge](#5-The-big-merge)
* [6 Type casting](#6-Type-casting)
* [7 DataFrame trimming](#7-DataFrame-trimming)
* [8 Normalization](#8-Normalization)

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
import requests
import re
import nltk
import sklearn
import gc
import pandas as pd
import numpy as np
from sklearn import decomposition
from pathlib import Path
from dateutil.relativedelta import relativedelta

# 1 Loading the data

In [None]:
data_dir = Path('.').absolute().joinpath('data')

sales_train = pd.read_csv(data_dir.joinpath('sales_train.csv.gz'))
sales_test = pd.read_csv(data_dir.joinpath('test.csv.gz'))
items = pd.read_csv(data_dir.joinpath('items.csv'))
item_categories = pd.read_csv(data_dir.joinpath('item_categories.csv'))
shops = pd.read_csv(data_dir.joinpath('shops.csv'))

In [None]:
generated_data = Path('.').absolute().joinpath('generated_data')
data_aggregate = pd.read_hdf(generated_data.joinpath('data_aggregate.hdf'),
                             key='data_aggregate')

In [None]:
n_train_samples = data_aggregate.shape[0]

Cast the dates to actual dates for easier manipulation

In [None]:
# NOTE: Plain column assignment replaces the column, so the datetime64 dtype is kept
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
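
As a small illustration of why the explicit `format` matters, the sketch below parses two made-up day-first strings (the values are hypothetical, only the layout matches the `date` column).

```python
import pandas as pd

# Hypothetical day-first date strings in the same layout as sales_train['date']
raw = pd.Series(['02.01.2013', '31.10.2015'])

# With an explicit format, '02.01.2013' is parsed as 2 January, not 1 February
parsed = pd.to_datetime(raw, format='%d.%m.%Y')

print(parsed.dt.month.tolist())  # [1, 10]
```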

# 2 Adding date features

In [None]:
fig, ax = plt.subplots()
shop_id_both.loc[:, 'shop_id_count'].hist(ax=ax, bins=200)
ax.set_xlabel('shop_id_count')
ax.set_ylabel('count')
plt.tight_layout()

In [None]:
shop_id_both.loc[:, 'shop_id_count'].value_counts().describe()

It appears that the number of rows for each `shop_id` is well spread, and not clustering around a specific number

In [None]:
fig, ax = plt.subplots()
item_id_both.loc[:, 'item_id_count'].hist(ax=ax, bins=200)
ax.set_xlabel('item_id_count')
ax.set_ylabel('count')
plt.tight_layout()

In [None]:
item_id_both.loc[:, 'item_id_count'].value_counts().describe()

In [None]:
item_id_both.head()

### The ID

As we saw in the EDA, the `ID` is highly correlated with the `shop_id`, so we include it here. Items and shops without an `ID` will be given `-1` (although we could probably construct a more appropriate `ID` feature if we investigated the feature further).

**NOTE**: We do an outer join here as some combinations of `shop_id` and `item_id` are only present in the test set

In [None]:
on = ['shop_id', 'item_id']
id_df = pd.merge(sales_train.loc[:, on], sales_test, how='outer', on=['shop_id', 'item_id'])
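# NOTE: fillna(inplace=True) on a .loc selection may act on a copy, and assigning
#       through .loc may keep the old float dtype, so we replace the column instead
id_df['ID'] = id_df['ID'].fillna(-1).astype('int32')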
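
The effect of the outer join can be sketched on two miniature frames. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins for `sales_train` and `sales_test`.

```python
import pandas as pd

# Hypothetical miniature versions of sales_train and sales_test
toy_train = pd.DataFrame({'shop_id': [1, 1, 2], 'item_id': [10, 11, 10]})
toy_test = pd.DataFrame({'ID': [0, 1], 'shop_id': [1, 3], 'item_id': [10, 12]})

on = ['shop_id', 'item_id']
toy_id = pd.merge(toy_train, toy_test, how='outer', on=on)

# Pairs seen only in the training data have no ID and receive -1
toy_id['ID'] = toy_id['ID'].fillna(-1).astype('int32')
print(toy_id.sort_values(on).reset_index(drop=True))
```

The pair `(3, 12)` exists only in the toy test set and would be lost with a left join on the training pairs.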

### Additional leakage parameters

As the test set contains data from after the training period, its rows will have higher row numbers. Therefore, we could have added the row number as another leakage feature. However, since we expanded the training set when we added the aggregated features, we choose not to add this feature. In addition, we could be unlucky and have a test set which is shuffled with respect to the training set.

### Holidays

We will here generate the number of holidays in the previous month, the current month and the next month

In [None]:
def get_russian_holidays(year):
    """
    Returns a Series of Russian holidays in a given year
    
    Parameters
    ----------
    year : int
        The year to investigate
    
    Returns
    -------
    holidays : Series
        Series of the holidays on datetime64 format
    """
    
    url = f'https://www.timeanddate.com/holidays/russia/{year}'
    html = requests.get(url).content
    # A list is returned
    table_df = pd.read_html(html)[0]
    # Rename
    table_df = table_df.rename(columns={'Date': 'date'})
    holidays = pd.to_datetime(table_df['date'], format='%b %d')
    
    # Replace the year and cast to datetime
    holidays = holidays.apply(lambda x: x.replace(year=year))

    return holidays

In [None]:
def get_year_months_len(df):
    """
    Returns the number of entries grouped by year and month of the input data frame
    
    Parameters
    ----------
    df : DataFrame
        DataFrame with a column named 'date'
    
    Returns
    -------
    df : DataFrame
        The input DataFrame where the number of entries grouped by year and month
        is appended to the column named 'year_month_count' 
    """
    
    new_df = df.copy()
    
    new_df.loc[:, 'year'] = new_df.loc[:, 'date'].dt.year
    new_df.loc[:, 'month'] = new_df.loc[:, 'date'].dt.month
    
    df.loc[:, 'year_month_count'] = new_df.groupby(['year', 'month'])['date'].transform(len)
    
    return df

In [None]:
# NOTE: We include 2012 to get the first prev_holiday_count later
holiday_2012 = get_russian_holidays(2012).to_frame()
holiday_2013 = get_russian_holidays(2013).to_frame()
holiday_2014 = get_russian_holidays(2014).to_frame()
holiday_2015 = get_russian_holidays(2015).to_frame()
holidays = pd.concat([holiday_2012, holiday_2013, holiday_2014, holiday_2015])

In [None]:
holiday_count = get_year_months_len(holidays).rename(columns={'year_month_count': 'holiday_count'})

Let's now generate the previous-month holiday count.
We can get that by shifting the dates one month forward: if the holiday count of February was 1 and the holiday count of March was 2, the previous-month holiday count of March will be 1.

In [None]:
prev_holiday_count = holiday_count.copy()
prev_holiday_count.loc[:, 'date'] = prev_holiday_count.loc[:, 'date'] + pd.DateOffset(months=1)
prev_holiday_count = prev_holiday_count.rename(columns={'holiday_count': 'prev_holiday_count'})

Likewise, we can find the next-month holiday count by shifting the dates one month backward

In [None]:
next_holiday_count = holiday_count.copy()
next_holiday_count.loc[:, 'date'] = next_holiday_count.loc[:, 'date'] + pd.DateOffset(months=-1)
next_holiday_count = next_holiday_count.rename(columns={'holiday_count': 'next_holiday_count'})
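
To make the shifting trick concrete, here is a minimal sketch on made-up dates. The frame `toy` and its values are hypothetical and not taken from the scraped holiday table.

```python
import pandas as pd

# Hypothetical holidays: one in February and two in March 2013
toy = pd.DataFrame({'date': pd.to_datetime(['2013-02-23', '2013-03-08', '2013-03-09'])})
toy['year'] = toy['date'].dt.year
toy['month'] = toy['date'].dt.month
toy['holiday_count'] = toy.groupby(['year', 'month'])['date'].transform('count')

# Moving every date one month forward relabels "count in month m"
# as "previous-month count of month m + 1"
prev = toy[['date', 'holiday_count']].copy()
prev['date'] = prev['date'] + pd.DateOffset(months=1)
prev['year'] = prev['date'].dt.year
prev['month'] = prev['date'].dt.month
prev = prev.rename(columns={'holiday_count': 'prev_holiday_count'})
prev = prev[['year', 'month', 'prev_holiday_count']].drop_duplicates()

print(prev)
```

In this toy frame March gets a previous-month count of 1 (the February holiday) and April gets 2 (the two March holidays).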

We drop the `date` and create `year` and `month` features we can merge on. 

**NOTE**: In order to merge the date data smoothly afterwards, we should drop the resulting duplicates

In [None]:
def get_text_features(df, col, return_all=False):
    """
    Returns a new DataFrame with added text features
    
    Parameters
    -----------
    df : DataFrame
        The data frame to add the text features to
    col : str
        The column to obtain the text features from
    return_all : bool
        If True, intermediate columns will be returned
        
    Returns
    -------
    df_nlp : DataFrame
        The data frame with the added text features
        * {col}_clean - col column cleaned so that only alphabetical and numerical characters are present 
                        (only returned if return_all is True)
        * cyrillic_latin - column where cyrillic and latin words have been separated
                           (only returned if return_all is True)
        * cyrillic - column with only stemmed cyrillic words present (only returned if return_all is True)
        * latin - column with only stemmed latin words present (only returned if return_all is True)
        * {col}_nlp - combination of the cyrillic and latin column described above
        * {col}_cyrillic_words - cyrillic word count
        * {col}_latin_words - latin word count
        * {col}_total_words - total word count
    """
    
    df_nlp = df.copy()
    
    # First we clean the text by replacing everything except letters, digits and spaces with a space
    df_nlp.loc[:, f'{col}_clean'] = \
        df_nlp.loc[:, f'{col}'].apply(lambda s: re.sub('[^а-яА-Яa-zA-Z0-9 ]', ' ', s))

    # Remove duplicated whitespaces
    df_nlp.loc[:, f'{col}_clean'] = \
        df_nlp.loc[:, f'{col}_clean'].apply(lambda s: re.sub(' +',' ', s))
    
    df_nlp.loc[:, 'cyrillic_latin'] = df_nlp.loc[:, f'{col}_clean'].apply(separate_cyrillic_latin)
    df_nlp.loc[:, 'cyrillic'] = df_nlp.loc[:, 'cyrillic_latin'].apply(lambda s: s.split('_SEP_')[0])
    df_nlp.loc[:, 'latin'] = df_nlp.loc[:, 'cyrillic_latin'].apply(lambda s: s.split('_SEP_')[1])
    
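    # NOTE: SnowballStemmer.stem expects a single word, so we stem word by word
    df_nlp.loc[:, 'cyrillic'] = df_nlp.loc[:, 'cyrillic'].apply(
        lambda s: ' '.join(russian_stemmer.stem(w) for w in s.split()))
    df_nlp.loc[:, 'latin'] = df_nlp.loc[:, 'latin'].apply(
        lambda s: ' '.join(english_stemmer.stem(w) for w in s.split()))
    
    # Recombine words
    df_nlp.loc[:, f'{col}_nlp'] = df_nlp.loc[:, 'cyrillic'] + ' ' + df_nlp.loc[:, 'latin']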
    
    # We add the word count of each type together with the total.
    # The rationale for doing so is:
    # 1. It's possible that products with complex names do not sell as well
    # 2. If there are many English words in the product name, the product could be less sellable in Russia
    # 3. Possible other reasons not mentioned here
    
    df_nlp.loc[:, f'{col}_cyrillic_words'] = \
        df_nlp.loc[:, 'cyrillic'].apply(lambda s: len(s.split(' ')) if s != '' else 0)
    df_nlp.loc[:, f'{col}_latin_words'] = \
        df_nlp.loc[:, 'latin'].apply(lambda s: len(s.split(' ')) if s != '' else 0)
    
    # NOTE: This is in fact an interaction feature
    df_nlp.loc[:, f'{col}_total_words'] = \
        df_nlp.loc[:, f'{col}_cyrillic_words'] + df_nlp.loc[:, f'{col}_latin_words']
    
    if not return_all:
        remove = [f'{col}_clean', 'cyrillic_latin', 'cyrillic', 'latin']
        df_nlp.drop(remove, axis=1, inplace=True)
    
    return df_nlp

In [None]:
item_nlp = get_text_features(items, 'item_name')
item_category_nlp = get_text_features(item_categories, 'item_category_name')
shop_nlp = get_text_features(shops, 'shop_name')

Check how many tokens we are dealing with

In [None]:
item_corpus = ' '.join(item_nlp.loc[:, 'item_name_nlp'].values)
item_corpus_tokens = nltk.word_tokenize(item_corpus)
print(f'Unique item_name_tokens {len(set(item_corpus_tokens))}')

In [None]:
item_category_corpus = ' '.join(item_category_nlp.loc[:, 'item_category_name_nlp'].values)
item_category_corpus_tokens = nltk.word_tokenize(item_category_corpus)
print(f'Unique item_category_name_tokens {len(set(item_category_corpus_tokens))}')

In [None]:
shop_corpus = ' '.join(shop_nlp.loc[:, 'shop_name_nlp'].values)
shop_corpus_tokens = nltk.word_tokenize(shop_corpus)
print(f'Unique shop_name_tokens {len(set(shop_corpus_tokens))}')

We should take care not to use all tokens, as this may expose us to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). 
Let's see how the words are distributed

In [None]:
samples = 30

In [None]:
plt.figure()
fd_item = nltk.FreqDist(item_corpus_tokens)
fd_item.plot(samples, cumulative=False)
plt.tight_layout()

In [None]:
plt.figure()
fd_item_category = nltk.FreqDist(item_category_corpus_tokens)
fd_item_category.plot(samples, cumulative=False)
plt.tight_layout()

In [None]:
plt.figure()
fd_shop = nltk.FreqDist(shop_corpus_tokens)
fd_shop.plot(samples, cumulative=False)
plt.tight_layout()

We can see that a handful of words make up most of each corpus. In other words, we can expect a high information gain from the first few features and diminishing returns as we add more words. Based on graphical inspection, we will try a maximum of $35$ TF-IDF features for the item corpus, $25$ for the item category corpus and $10$ for the shop corpus. 
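
The diminishing returns can also be read off numerically as the cumulative share of the corpus covered by the most frequent tokens. The sketch below uses a made-up token list (the tokens and counts are hypothetical, not the real `item_corpus_tokens`).

```python
from collections import Counter

# Hypothetical stemmed tokens standing in for item_corpus_tokens
tokens = ['pc'] * 50 + ['dvd'] * 30 + ['верс'] * 10 + ['ps3'] * 5 + ['jewel'] * 3 + ['bd'] * 2

counts = Counter(tokens)
total = sum(counts.values())

# Cumulative share of the corpus covered by the k most frequent tokens
covered = 0
coverage = []
for token, count in counts.most_common():
    covered += count
    coverage.append((token, covered / total))

for token, share in coverage:
    print(f'{token:>6}: {share:.2f}')
```

In this toy corpus the two most frequent tokens already cover 80 % of all occurrences, which is the kind of shape that motivates a small `max_features`.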

In [None]:
dates = sales_train.loc[:, ['date_block_num', 'date']]

# Add one date in place of the dates for prediction
# NOTE: The relativedelta module takes care of problems with dates ending with 28, 30, 31
next_month = dates.loc[:, 'date'].max() + relativedelta(months=1)
next_date_block_num = dates.loc[:, 'date_block_num'].max() + 1
test_month = pd.DataFrame({'date_block_num':[next_date_block_num], 'date':[next_month]})
dates = pd.concat([dates, test_month], axis=0)
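
The end-of-month handling mentioned in the comment above can be checked in isolation. This is a minimal sketch using plain `datetime.date` objects rather than the notebook's timestamps.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

last_train_day = date(2015, 10, 31)

# November has only 30 days, so the day is clipped instead of raising an error
next_month = last_train_day + relativedelta(months=1)
print(next_month)  # 2015-11-30

# The same clipping happens for February
print(date(2015, 1, 31) + relativedelta(months=1))  # 2015-02-28
```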

In [None]:
dates['date'].max()

Recall from the EDA that the last date in the dataset was `2015-10-31`. This means we are going to predict for `2015-11`. 

Further, we note that only the year and month data is present in the test dataset, meaning that using information on the day level does not make sense.

### Standard date features

We here add date features as seasonal trends are present in the dataset

In [None]:
dates.loc[:, 'year'] = dates.loc[:, 'date'].dt.year
dates.loc[:, 'month'] = dates.loc[:, 'date'].dt.month
dates.loc[:, 'days_in_month'] = dates.loc[:, 'date'].dt.days_in_month
dates.loc[:, 'quarter'] = dates.loc[:, 'date'].dt.quarter

In [None]:
corpus_aggregate = pd.merge(corpus_aggregate, item_nlp.drop_duplicates(), how='left', on='item_id')
corpus_aggregate = pd.merge(corpus_aggregate, item_category_nlp.drop_duplicates(), how='left', on='item_category_id')
corpus_aggregate = pd.merge(corpus_aggregate, shop_nlp.drop_duplicates(), how='left', on='shop_id')

We reduce to $20\%$ of the original dimension

In [None]:
n_new_dimensions = int((n_item_features + n_item_category_features + n_shop_features)*.20)

**NOTE:** We would normally fit on the training set and only transform the test set. However, since the whole corpus is available at training time, such a split does not make sense here.

In [None]:
tf_idf_cols = [col for col in corpus_aggregate if '_tf_idf_' in col]

nmf = decomposition.NMF(n_components=n_new_dimensions)
reduced_corpus = nmf.fit_transform(corpus_aggregate[tf_idf_cols])

In [None]:
nmf_cols = [f'nlp_nmf_{i}' for i in range(reduced_corpus.shape[1])]
reduced_corpus = pd.DataFrame(reduced_corpus, columns=nmf_cols)
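
As a rough sketch of what the reduction does to the shape of the data, the example below applies NMF to a random non-negative matrix. The matrix, `init='nndsvda'`, `random_state=0` and `max_iter=500` are assumptions made for a reproducible toy example and are not settings used in this notebook.

```python
import numpy as np
from sklearn import decomposition

# Hypothetical non-negative matrix standing in for the TF-IDF columns
rng = np.random.default_rng(0)
toy_tf_idf = rng.random((20, 10))

# 20 % of 10 columns gives 2 components
n_components = int(toy_tf_idf.shape[1] * .20)
toy_nmf = decomposition.NMF(n_components=n_components, init='nndsvda',
                            random_state=0, max_iter=500)
toy_reduced = toy_nmf.fit_transform(toy_tf_idf)

print(toy_reduced.shape)
```

The number of rows is unchanged, and the reduced features stay non-negative, like the TF-IDF input.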

In [None]:
corpus_aggregate = pd.concat([corpus_aggregate, reduced_corpus], axis=1)

In [None]:
if corpus_aggregate.isnull().any().any():
    raise AssertionError('NaNs were introduced in the corpus')
    
if corpus_aggregate.shape[0] > n_corpus_aggregate:
    raise AssertionError(f'The set was expanded: '
                         f'n_corpus_aggregate={n_corpus_aggregate} and '
                         f'corpus_aggregate.shape[0]={corpus_aggregate.shape[0]}')

In [None]:
del item_nlp
del item_category_nlp
del shop_nlp
del reduced_corpus
gc.collect()

# 4 Adding leakage features

The leakage features are features where we use information about the test set.

As both shop id and item id are features of the test set, and since these are not related to time, these are leakages.

### Number of ids in train and test

In [None]:
del shop_id_train
del shop_id_test
del item_id_train
del item_id_test

gc.collect()

# 5 The big merge

We are now ready to merge the different features into one big data frame

# 3 Adding text features

Taking into account the possibility that the names are correlated with the target, we add some text features as well. We split `item_name`, `shop_name` and `item_category_name` into cyrillic and latin words. We will stem these, and then combine them again before fitting a TF-IDF model to them.

**NOTE**: The TF-IDF model does not care about the relative position of the words, so it is ok if the order is scrambled when recombining the words to sentences again.
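
The order-independence claimed in the note can be verified on a toy corpus. The three names below are made up, and the default `TfidfVectorizer` settings are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical names containing the same tokens in a different order,
# plus a third name so that the IDF weights are not all identical
docs = ['call of duty pc', 'pc duty of call', 'fifa pc']

vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Identical bags of words give identical TF-IDF rows
print(np.allclose(vectors[0], vectors[1]))
```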

We would now like to stem the words (ideally we would lemmatize them, but lemmatization for non-English languages does not seem to be as readily available at the moment).

**NOTE**: The stemmer casts to lowercase

In [None]:
russian_stemmer = nltk.stem.SnowballStemmer('russian')
english_stemmer = nltk.stem.SnowballStemmer('english')

In [None]:
def separate_cyrillic_latin(words):
    """
    Separates the cyrillic and latin words
    
    Notes
    -----
    This function does not conserve word order
    
    Parameters
    -----------
    words : str
        The string of words to be split
        
    Returns
    -------
    separated_words : str
        The words separated by _SEP_
        Cyrillic words are to the left of the separator, the latin to the right
    """
    
    words_split = words.split(' ')
    cyrillic_words = list()
    latin_words = list()
    
    for word in words_split:
        # https://stackoverflow.com/questions/48255244/python-check-if-a-string-contains-cyrillic-characters
        if re.search('[а-яА-Я]', word) is not None:
            cyrillic_words.append(word)
        else:
            latin_words.append(word)
    
    cyrillic_words = ' '.join(cyrillic_words)
    latin_words = ' '.join(latin_words)
    
    separated_words = f'{cyrillic_words}_SEP_{latin_words}'
    
    return separated_words
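
The classification rule can be tried on a single name. The helper `split_by_script` below is a hypothetical, condensed variant of the same idea that returns a tuple instead of a `_SEP_`-joined string, and the product name is made up.

```python
import re


def split_by_script(name):
    """Hypothetical helper mirroring the idea of separate_cyrillic_latin."""
    cyrillic, other = [], []
    for word in name.split(' '):
        # A single cyrillic letter is enough to classify the whole word
        (cyrillic if re.search('[а-яА-Я]', word) else other).append(word)
    return ' '.join(cyrillic), ' '.join(other)


print(split_by_script('Ведьмак 3 Wild Hunt PS4'))
# ('Ведьмак', '3 Wild Hunt PS4')
```

Note that words without any cyrillic letter, including pure numbers, end up in the "latin" part.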

In [None]:
n_item_features = 35
n_item_category_features = 25
n_shop_features = 10

In [None]:
tf_idf_item_vec = sklearn.feature_extraction.text.TfidfVectorizer(max_features=n_item_features)
tf_idf_item = tf_idf_item_vec.fit_transform(item_nlp['item_name_nlp']).toarray()

In [None]:
tf_idf_item_category_vec = sklearn.feature_extraction.text.TfidfVectorizer(max_features=n_item_category_features)
tf_idf_item_category = tf_idf_item_category_vec.fit_transform(item_category_nlp['item_category_name_nlp']).toarray()

In [None]:
tf_idf_shop_vec = sklearn.feature_extraction.text.TfidfVectorizer(max_features=n_shop_features)
tf_idf_shop = tf_idf_shop_vec.fit_transform(shop_nlp['shop_name_nlp']).toarray()

Combine the TF-IDF results with the corresponding data frames

In [None]:
col_names = [f'item_tf_idf_{i}' for i in range(tf_idf_item.shape[1])]
tf_idf_item_df = pd.DataFrame(tf_idf_item, columns=col_names)
item_nlp = pd.concat([item_nlp, tf_idf_item_df], axis=1)
item_nlp.drop(['item_name', 'item_category_id', 'item_name_nlp'], axis=1, inplace=True)

In [None]:
col_names = [f'item_category_tf_idf_{i}' for i in range(tf_idf_item_category.shape[1])]
tf_idf_item_category_df = pd.DataFrame(tf_idf_item_category, columns=col_names)
item_category_nlp = pd.concat([item_category_nlp, tf_idf_item_category_df], axis=1)
item_category_nlp.drop(['item_category_name', 'item_category_name_nlp'], axis=1, inplace=True)

In [None]:
col_names = [f'shop_tf_idf_{i}' for i in range(tf_idf_shop.shape[1])]
tf_idf_shop_df = pd.DataFrame(tf_idf_shop, columns=col_names)
shop_nlp = pd.concat([shop_nlp, tf_idf_shop_df], axis=1)
shop_nlp.drop(['shop_name', 'shop_name_nlp'], axis=1, inplace=True)

## Dimensionality reduction

In order to save computational time, we would like to decrease the dimensionality of the corpus (as the TF-IDF output is a relatively sparse matrix). As we will do this across item name, item category and shop name, the reduced features can also capture interactions between the words of the three corpora.

Since TF-IDF outputs non-negative numbers, we can reduce the dimensionality using Non-negative Matrix Factorization (NMF).

In [None]:
shop_id_train = sales_train.loc[:, 'shop_id']
shop_id_test = sales_test.loc[:, 'shop_id']
shop_id_both = pd.concat([shop_id_train, shop_id_test], axis=0).to_frame()
shop_id_both.loc[:, 'shop_id_count'] = shop_id_both.groupby('shop_id')['shop_id'].transform(len)

# NOTE: Drop duplicated as we want to merge
shop_id_both.drop_duplicates(inplace=True)

In [None]:
item_id_train = sales_train.loc[:, 'item_id']
item_id_test = sales_test.loc[:, 'item_id']
item_id_both = pd.concat([item_id_train, item_id_test], axis=0).to_frame()
item_id_both.loc[:, 'item_id_count'] = item_id_both.groupby('item_id')['item_id'].transform(len)

# NOTE: Drop duplicated as we want to merge
item_id_both.drop_duplicates(inplace=True)
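
The count-then-deduplicate pattern used in the two cells above can be sketched on a handful of made-up ids (the series below are hypothetical stand-ins for the train and test `shop_id` columns).

```python
import pandas as pd

toy_train_ids = pd.Series([5, 5, 7], name='shop_id')
toy_test_ids = pd.Series([5, 9], name='shop_id')

toy_both = pd.concat([toy_train_ids, toy_test_ids], axis=0).to_frame()

# Every row gets the number of rows sharing its shop_id across train and test
toy_both['shop_id_count'] = toy_both.groupby('shop_id')['shop_id'].transform('count')

# One row per shop_id is enough for the later merge
toy_both = toy_both.drop_duplicates().reset_index(drop=True)
print(toy_both)
```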

Out of curiosity we check how these are distributed

The `ID` is a combination of `item_id` and `shop_id`. As before, if an `ID` doesn't exist for the combination, it will be assigned `-1`

In [None]:
all_data = pd.merge(all_data,
                    id_df.loc[:, ['item_id', 'shop_id', 'ID']].drop_duplicates(),
                    how='left',
                    on=['item_id', 'shop_id'])
# NOTE: Replace the column instead of filling in place on a .loc selection
all_data['ID'] = all_data['ID'].fillna(-1).astype('int32')

In [None]:
max_lag = max([int(col.split('_lag_')[-1]) for col in data_aggregate.columns if '_lag_' in col])
cols_wo_nan = [col for col in data_aggregate.columns if not ('month' in col and '_lag_' not in col)]

if all_data.loc[all_data.loc[:, 'date_block_num'] > max_lag, cols_wo_nan].isnull().any().any():
    raise AssertionError('NaNs were introduced in the data containing valid lags')
    
if all_data.shape[0] > n_train_samples:
    raise AssertionError(f'The set was expanded: '
                         f'n_train_samples={n_train_samples} and '
                         f'all_data.shape[0]={all_data.shape[0]}')
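
Why the row-count check matters can be seen on a toy example: a left merge against a lookup table with duplicated keys silently multiplies rows. The frames below are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'shop_id': [1, 2], 'sales': [10, 20]})
# Hypothetical lookup table where shop 1 accidentally appears twice
lookup = pd.DataFrame({'shop_id': [1, 1, 2], 'shop_id_count': [3, 3, 4]})

expanded = pd.merge(left, lookup, how='left', on='shop_id')
safe = pd.merge(left, lookup.drop_duplicates(), how='left', on='shop_id')

print(len(left), len(expanded), len(safe))  # 2 3 2
```

As an alternative to checking the shape afterwards, `pd.merge` accepts `validate='many_to_one'`, which raises an error if the right-hand keys are not unique.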

# 6 Type casting

In order to save resources, we downcast the types (by default, numeric columns are loaded as 64-bit floats or integers)

In [None]:
def downcast_dtypes(df):
    """
    Downcasts float64 to float32 and int64 to int32
    
    Parameters
    ----------
    df : DataFrame
        The data frame to downcast
    
    Returns
    -------
    df : DataFrame
        The downcasted data frame
    """
    
    # Select columns to downcast
    float_cols = [c for c in df.columns if df.loc[:, c].dtype == 'float64']
    int_cols = [c for c in df.columns if df.loc[:, c].dtype == 'int64']
    
    # Downcast
    # NOTE: Assigning through .loc may keep the original 64-bit dtypes,
    #       so we let astype build the downcasted frame instead
    df = df.astype({**{c: np.float32 for c in float_cols},
                    **{c: np.int32 for c in int_cols}})
    
    return df

In [None]:
all_data = downcast_dtypes(all_data)
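
The saving from downcasting can be measured directly. The frame below is a made-up two-column example, not the real `all_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with explicit 64-bit columns
toy = pd.DataFrame({'price': np.array([1.5, 2.5], dtype='float64'),
                    'count': np.array([1, 2], dtype='int64')})

before = toy.memory_usage(index=False).sum()
toy = toy.astype({'price': np.float32, 'count': np.int32})
after = toy.memory_usage(index=False).sum()

print(toy.dtypes.to_dict(), before, after)
```

Going from 64-bit to 32-bit types halves the memory of the affected columns, at the cost of reduced precision and a smaller integer range.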

We now store the full dataset in case we would like to modify it later on

In [None]:
sorted(all_data.columns)

In [None]:
all_data.to_hdf(generated_data.joinpath('all_data.hdf'), key='all_data')

# 7 DataFrame trimming

We can now start to trim the dataset for features and rows which are not needed.
Firstly, since we are lagging the features up to `max_lag`, we will throw away the first `max_lag` values in our data set

In [None]:
holidays

This looks correct. Let's merge these to a common data frame.

In [None]:
dates = pd.merge(dates.drop('date', axis=1).drop_duplicates(),
                 holidays.drop(['year', 'month'], axis=1),
                 how='left', 
                 on='date_block_num')

In [None]:
del holiday_count
del next_holiday_count
del prev_holiday_count
del holiday_2012
del holiday_2013
del holiday_2014
del holiday_2015
del holidays
gc.collect()

In [None]:
prev_holiday_count.loc[:, 'year'] = prev_holiday_count.loc[:, 'date'].dt.year
prev_holiday_count.loc[:, 'month'] = prev_holiday_count.loc[:, 'date'].dt.month
prev_holiday_count.drop(['date'], axis=1, inplace=True)
prev_holiday_count.drop_duplicates(inplace=True)

In [None]:
next_holiday_count.loc[:, 'year'] = next_holiday_count.loc[:, 'date'].dt.year
next_holiday_count.loc[:, 'month'] = next_holiday_count.loc[:, 'date'].dt.month
next_holiday_count.drop(['date'], axis=1, inplace=True)
next_holiday_count.drop_duplicates(inplace=True)

We merge the previous, current and next holiday count into one frame.
The resulting `NaN`s correspond to months without holidays, so we fill them with `0`.
We start by merging with `dates`, as this contains all relevant `year`-`month` combinations

In [None]:
holidays = pd.merge(dates.loc[:, ['year', 'month']].drop_duplicates(),
                    holiday_count, how='left', on=['year', 'month']).fillna(0)
holidays = pd.merge(holidays, prev_holiday_count, how='left', on=['year', 'month']).fillna(0)
holidays = pd.merge(holidays, next_holiday_count, how='left', on=['year', 'month']).fillna(0)

# Re-shuffle the columns for better overview
holidays = holidays.loc[:, ['year', 'month', 'prev_holiday_count', 'holiday_count', 'next_holiday_count']]

# All columns can be integers
holidays = holidays.astype(np.int32)

# Sort by year and month for better overview
holidays.sort_values(['year', 'month'], inplace=True)
holidays.reset_index(inplace=True, drop=True)

# Add the date block number
holidays.loc[:, 'date_block_num'] = range(holidays.shape[0])

Check that the result looks correct

In [None]:
all_data = pd.merge(data_aggregate, dates, how='left', on='date_block_num')

In [None]:
all_data = pd.merge(all_data, corpus_aggregate, how='left', on=['shop_id', 'item_id', 'item_category_id'])

In [None]:
all_data = pd.merge(all_data, shop_id_both, how='left', on='shop_id')

In [None]:
all_data = pd.merge(all_data, item_id_both, how='left', on='item_id')

In [None]:
trimmed_data = all_data.loc[all_data.loc[:, 'date_block_num'] > max_lag].copy()
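
The trimming step can be sketched on a toy frame. The column names and values below are hypothetical. The sketch mirrors the strict `>` comparison used in the cell above, which in this toy frame also drops block 2 even though its lags are complete.

```python
import pandas as pd

nan = float('nan')
toy = pd.DataFrame({'date_block_num': [0, 1, 2, 3],
                    'sales_lag_1': [nan, 5.0, 6.0, 7.0],
                    'sales_lag_2': [nan, nan, 5.0, 6.0]})

# The largest lag is encoded in the column names
toy_max_lag = max(int(col.split('_lag_')[-1]) for col in toy.columns if '_lag_' in col)

# Keep only the blocks after the largest lag
toy_trimmed = toy.loc[toy['date_block_num'] > toy_max_lag].copy()

print(toy_max_lag, toy_trimmed['date_block_num'].tolist())  # 2 [3]
```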

We rename the target, so that the name matches the competition target name

In [None]:
old_target_name = 'month_shop_item_id_item_cnt_sum'
new_target_name = 'item_cnt_month'
trimmed_data.rename({old_target_name: new_target_name}, axis=1, inplace=True)

In [None]:
if trimmed_data.loc[trimmed_data.loc[:, 'date_block_num'] < 
                    trimmed_data.loc[:, 'date_block_num'].max(), new_target_name].isnull().any():
    raise AssertionError('NaNs were introduced in the target')

Next, some columns have only served as placeholders and are now mean encoded, so we drop them

In [None]:
drop_cols = ['item_category_id', 
             'item_id', 
             'shop_id']

trimmed_data.drop(drop_cols, axis=1, inplace=True)

The columns which were used to create the lags are no longer needed

In [None]:
drop_cols = [col for col in trimmed_data.columns if 'month_' in col and '_lag_' not in col]
trimmed_data.drop(drop_cols, axis=1, inplace=True)

We reduced the dimensionality of the TF-IDF features with NMF, so the original TF-IDF columns can be dropped

In [None]:
drop_cols = [col for col in trimmed_data.columns if '_tf_idf_' in col]
trimmed_data.drop(drop_cols, axis=1, inplace=True)

In [None]:
sorted(trimmed_data.columns)

In [None]:
if trimmed_data.drop(new_target_name, axis=1).isnull().any().any():
    raise AssertionError('NaNs were introduced')

We recall that tree-based models do not depend on normalization. This means that we can use this dataset as-is.

In [None]:
corpus_aggregate = data_aggregate.loc[:, ['shop_id', 'item_id', 'item_category_id']].copy()
corpus_aggregate.drop_duplicates(inplace=True)
n_corpus_aggregate = corpus_aggregate.shape[0]

In [None]:
holiday_count.loc[:, 'year'] = holiday_count.loc[:, 'date'].dt.year
holiday_count.loc[:, 'month'] = holiday_count.loc[:, 'date'].dt.month
holiday_count.drop(['date'], axis=1, inplace=True)
holiday_count.drop_duplicates(inplace=True)

In [None]:
trimmed_data.to_hdf(generated_data.joinpath('dt_data.hdf'), key='dt_data')

# 8 Normalization

If we are not using tree-based methods, we should normalize the data. Here we will use the `MinMaxScaler` on the ordinal features, and the `StandardScaler` (which preserves the shape of the distribution) on the rest of the numerical features.

In a real-world scenario we would fit the scaler on the training fold only, and apply the fitted scaling to the validation or test fold as explained [here](https://stats.stackexchange.com/a/174865/132830). In this way we would not bias our scaling towards the validation or test set, and we could expect the model to generalize better.

Here however, as we are in a competition setting, our primary goal is to optimize the loss metric of the competition. Thus, fitting the scaler on all available data, although methodologically unsound, could be advantageous.

In [None]:
ordinal_features = ['ID',
                    'date_block_num',
                    'month',
                    'quarter',
                    'year']

In [None]:
max_min_scaler = sklearn.preprocessing.MinMaxScaler()
# NOTE: Replace the columns so that the scaled floats are not forced into the integer columns
trimmed_data[ordinal_features] = max_min_scaler.fit_transform(trimmed_data[ordinal_features])
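
For comparison, the leakage-free alternative described earlier in this section (fit on the training fold, transform the validation fold) can be sketched as follows. The two folds are made-up arrays, not data from this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical folds of a single ordinal feature
train_fold = np.array([[0.0], [5.0], [10.0]])
valid_fold = np.array([[2.0], [12.0]])

# The minimum and maximum are learned from the training fold only
scaler = MinMaxScaler().fit(train_fold)
valid_scaled = scaler.transform(valid_fold)

print(valid_scaled.ravel())
```

Values outside the training range, such as `12.0` here, are mapped outside `[0, 1]`. This is the price of not looking at the validation fold when fitting.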

**NOTE**: We do not need to scale the target