<a href="https://colab.research.google.com/github/andresvir14/HWDCC_LFTK-/blob/master/Untitled.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook shows a pipeline that predicts future sales from 1C data and reaches a score of around 0.94 on the leaderboard (in both the public and private parts).

A lot more work lies behind what is shown in this notebook. Nevertheless, because so many hypotheses were tested about the behaviour of the predictions and about ways to get a better score, I am going to show only the key points that worked best.

The notebook is organized as follows:

0. Initial steps (getting the libraries, functions and data) 
1. Basic pre-processing of data
2. Mean encodings
3. Lagged columns
4. Split of data
5. Model fitting through XGBoost
6. Getting the file for submission




# 0. Initial steps

Loading of libraries

In [1]:
# Import libraries
# Basic libraries
import pandas as pd
import numpy as np

# Preprocessing and feature extraction
from sklearn.model_selection import KFold
from itertools import product
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Modeling
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from xgboost import plot_importance

# Plotting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

# To deal with downloaded files
from google.colab import files

# Others
import time



Definition of functions

In [0]:
# Functions used

# Downcast variables 
def downcast(df):
  '''
    Change columns types from 64 to 32 bits
  '''
  float64columns = df.select_dtypes(['float64']).columns.tolist()
  int64columns = df.select_dtypes(['int64']).columns.tolist()
  df[float64columns] = df[float64columns].astype('float32')
  df[int64columns] = df[int64columns].astype('int32')
  return df


# Function that creates the lags of given columns (taken from the notebook at https://www.kaggle.com/dlarionov/feature-engineering-xgboost)
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+ str(i)]
        shifted['date_block_num'] += i
        shifted[col+'_lag_'+ str(i)] = shifted[col+'_lag_'+ str(i)].astype('float16')
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df    

# Applying target (mean) encodings as in https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    """ 
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean 
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return pd.concat([add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)], axis = 0)
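
As a quick illustration of what the `downcast` helper buys in memory, the sketch below applies the same 64-bit to 32-bit conversion to a small made-up frame. The name `downcast_sketch` and the toy columns are assumptions for illustration, not part of the pipeline.

```python
import numpy as np
import pandas as pd

def downcast_sketch(df):
    """Return a copy with float64 -> float32 and int64 -> int32 columns."""
    float_cols = df.select_dtypes(["float64"]).columns
    int_cols = df.select_dtypes(["int64"]).columns
    mapping = {c: "float32" for c in float_cols}
    mapping.update({c: "int32" for c in int_cols})
    return df.astype(mapping)

# Toy frame with one 64-bit integer column and one 64-bit float column
toy = pd.DataFrame({
    "shop_id": np.arange(1000, dtype="int64"),
    "target": np.linspace(0, 20, 1000, dtype="float64"),
})
small = downcast_sketch(toy)

# Column memory in bytes before and after the conversion
bytes_before = toy.memory_usage(index=False).sum()
bytes_after = small.memory_usage(index=False).sum()
```

Both columns shrink from 8 to 4 bytes per value, so the column memory is halved.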

Loading of data

In [0]:
# Load the data
transactions    = pd.read_csv('/content/drive/My Drive/DataScience/data/competitive-data-science-predict-future-sales/sales_train.csv')
items           = pd.read_csv('/content/drive/My Drive/DataScience/data/competitive-data-science-predict-future-sales/items.csv')
item_categories = pd.read_csv('/content/drive/My Drive/DataScience/data/competitive-data-science-predict-future-sales/item_categories.csv')
shops           = pd.read_csv('/content/drive/My Drive/DataScience/data/competitive-data-science-predict-future-sales/shops.csv')
transactions_test    = pd.read_csv('/content/drive/My Drive/DataScience/data/competitive-data-science-predict-future-sales/test.csv')


# 1. Basic pre-processing of data

First, I aggregated the data, reusing part of the code from week four of the course.

In [0]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in transactions['date_block_num'].unique():
    cur_shops = transactions[transactions['date_block_num']==block_num]['shop_id'].unique()
    cur_items = transactions[transactions['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])), dtype = 'int16'))

# Turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols, dtype = np.int32)

# Get aggregated values of sales by shop_id, item_id and date_block_num
gb = transactions.groupby(index_cols).agg({'item_cnt_day':'sum'})
gb.columns = ['target']
gb.reset_index(inplace = True)

# Join aggregated data to the grid and sort the data
train = pd.merge(grid, gb, how='left',on=index_cols).fillna(0)
train['target'] = train['target'].clip(0,20)
train.sort_values(['date_block_num','shop_id','item_id'],inplace=True)

del grid, gb
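
The grid construction may be easier to follow on a tiny example. The sketch below uses made-up transactions (the frame `toy_sales` and its values are assumptions) and builds, for each month, every shop/item combination seen in that month, then left-joins the monthly sums so that pairs without sales get a target of 0.

```python
from itertools import product

import numpy as np
import pandas as pd

# Made-up daily sales: two shops and two items in month 0, one pair in month 1
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [1, 1, 2, 1],
    "item_id":        [10, 11, 10, 10],
    "item_cnt_day":   [2.0, 1.0, 30.0, 1.0],
})
index_cols = ["shop_id", "item_id", "date_block_num"]

# One block of shop x item combinations per month
blocks = []
for block_num, month in toy_sales.groupby("date_block_num"):
    shops = month["shop_id"].unique()
    items = month["item_id"].unique()
    blocks.append(np.array(list(product(shops, items, [block_num])), dtype="int32"))
toy_grid = pd.DataFrame(np.vstack(blocks), columns=index_cols)

# Monthly totals, joined onto the grid; missing pairs become 0, then clip to [0, 20]
monthly = (toy_sales.groupby(index_cols, as_index=False)["item_cnt_day"].sum()
           .rename(columns={"item_cnt_day": "target"}))
toy_train = toy_grid.merge(monthly, how="left", on=index_cols).fillna(0)
toy_train["target"] = toy_train["target"].clip(0, 20)
```

Shop 2 never sold item 11 in month 0, yet the pair still appears in `toy_train` with a target of 0, which is the point of building the grid.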

In [0]:
# Add items and shops to training data
train = pd.merge(train, items, on = ['item_id']) # Join item_name and item category
train = pd.merge(train, shops, on = ['shop_id']) # Join shop_name

In [0]:
# Drop ID, set the test month (34), and join shops and items to test data
test = transactions_test.copy()
test.drop('ID', axis = 1, inplace = True)
test['date_block_num'] = 34

test = pd.merge(test, items, on = ['item_id']) # Join item_name and item category
test = pd.merge(test, shops, on = ['shop_id']) # Join shop_name

del shops, items, item_categories

In [0]:
# Concatenating train and test sets
train['from'] = 0
test['from'] = 1
data = pd.concat([train, test], axis = 0)
data.reset_index(inplace=True)

del train, test

# 2. Mean encodings

Mean encodings are a powerful tool for handling high-cardinality categorical features. Although the course presented several methods for computing regularized mean encodings (and I tried K-fold, smoothing and LOO, based on the code given in the course), I decided to apply, with slight changes, the code found in this notebook: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features.

The mean (target) encodings were computed over item_id and over pairs of columns such as date_block_num-item_id and shop_id-item_id.
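
To make the smoothing concrete, here is a minimal sketch of the sigmoid-weighted blend between a category mean and the global prior, following the formula used in the `target_encode` function defined earlier. The helper name `smoothed_mean` and the toy numbers are assumptions for illustration, and no noise is added.

```python
import numpy as np
import pandas as pd

def smoothed_mean(categories, target, min_samples_leaf=1, smoothing=1):
    """Return one smoothed target mean per category."""
    stats = target.groupby(categories).agg(["mean", "count"])
    prior = target.mean()
    # Weight grows towards 1 as the category count exceeds min_samples_leaf
    weight = 1 / (1 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + stats["mean"] * weight

# Category "a" is seen four times, category "b" only once
cats = pd.Series(["a", "a", "a", "a", "b"], name="item")
y = pd.Series([1.0, 1.0, 1.0, 1.0, 6.0], name="target")
encoded = smoothed_mean(cats, y, min_samples_leaf=1, smoothing=1)
```

A category seen exactly `min_samples_leaf` times gets a weight of 0.5, so the encoding of "b" sits halfway between its own mean (6.0) and the prior (2.0), while the frequent category "a" stays close to its own mean.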

In [10]:
#data['date_block_num_target_ME'] = target_encode(data[data['from'] == 0]["date_block_num"], 
#                         data[data['from'] == 1]["date_block_num"], 
#                         target=data[data['from'] == 0].target, 
#                         min_samples_leaf=100,
#                         smoothing=10,
#                         noise_level=0.01)

#data['shop_id_target_ME'] = target_encode(data[data['from'] == 0]["shop_id"], 
#                         data[data['from'] == 1]["shop_id"], 
#                         target=data[data['from'] == 0].target, 
#                         min_samples_leaf=100,
#                         smoothing=10,
#                         noise_level=0.01)

#data['item_category_id_target_ME'] = target_encode(data[data['from'] == 0]["item_category_id"], 
#                         data[data['from'] == 1]["item_category_id"], 
#                         target=data[data['from'] == 0].target, 
#                         min_samples_leaf=100,
#                         smoothing=10,
#                         noise_level=0.01)

data['item_id_target_ME'] = target_encode(data[data['from'] == 0]["item_id"], 
                            data[data['from'] == 1]["item_id"], 
                            target=data[data['from'] == 0].target, 
                            min_samples_leaf=100,
                            smoothing=10,
                            noise_level=0.01)


# A separator keeps pair keys unambiguous (e.g. '1' + '23' and '12' + '3' would otherwise both give '123')
data['date_item'] = data['date_block_num'].astype(str) + '_' + data['item_id'].astype(str)
#data['date_shop'] = data['date_block_num'].astype(str) + '_' + data['shop_id'].astype(str)
#data['date_cat'] = data['date_block_num'].astype(str) + '_' + data['item_category_id'].astype(str)
#data['shop_cat'] = data['shop_id'].astype(str) + '_' + data['item_category_id'].astype(str)
data['shop_item'] = data['shop_id'].astype(str) + '_' + data['item_id'].astype(str)

data['date_item_target_ME'] = target_encode(data[data['from'] == 0]["date_item"], 
                            data[data['from'] == 1]["date_item"], 
                            target=data[data['from'] == 0].target, 
                            min_samples_leaf=100,
                            smoothing=10,
                            noise_level=0.01)

#data['date_shop_target_ME'] = target_encode(data[data['from'] == 0]["date_shop"], 
#                         data[data['from'] == 1]["date_shop"], 
#                         target=data[data['from'] == 0].target, 
#                         min_samples_leaf=100,
#                         smoothing=10,
#                         noise_level=0.01)

#data['date_cat_target_ME'] = target_encode(data[data['from'] == 0]["date_cat"], 
#                         data[data['from'] == 1]["date_cat"], 
#                         target=data[data['from'] == 0].target, 
#                         min_samples_leaf=100,
#                         smoothing=10,
#                         noise_level=0.01)

#data['shop_cat_target_ME'] = target_encode(data[data['from'] == 0]["shop_cat"], 
#                         data[data['from'] == 1]["shop_cat"], 
#                         target=data[data['from'] == 0].target, 
#                         min_samples_leaf=100,
#                         smoothing=10,
#                         noise_level=0.01)


data['shop_item_target_ME'] = target_encode(data[data['from'] == 0]["shop_item"], 
                         data[data['from'] == 1]["shop_item"], 
                         target=data[data['from'] == 0].target, 
                         min_samples_leaf=100,
                         smoothing=10,
                         noise_level=0.01)

#data.drop(['date_item', 'date_shop', 'date_cat', 'shop_cat', 'shop_item'], axis = 1, inplace = True)
data.drop(['date_item', 'shop_item'], axis = 1, inplace = True)
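
One detail worth checking when pair features are built by concatenating IDs as strings: without a separator, two different pairs can collapse into the same key. The toy values below are assumptions chosen to show the collision.

```python
import pandas as pd

# Two different (date_block_num, item_id) pairs
pairs = pd.DataFrame({"date_block_num": [1, 12], "item_id": [23, 3]})

# Plain concatenation versus concatenation with a separator
plain = pairs["date_block_num"].astype(str) + pairs["item_id"].astype(str)
separated = pairs["date_block_num"].astype(str) + "_" + pairs["item_id"].astype(str)
```

Here `plain` holds the same string twice, while `separated` keeps the two pairs apart.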


In [0]:
data.isnull().sum()

In [0]:
data.columns

# 3. Lagged columns

In [0]:
# Lagged values
ts = time.time()

# Only columns that were actually created are lagged; the calls for the
# encodings that are commented out in the mean-encoding cell would raise a KeyError
data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'target')
#data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'date_block_num_target_ME')
#data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'shop_id_target_ME')
#data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'item_category_id_target_ME')
data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'item_id_target_ME')
data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'date_item_target_ME')
#data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'date_shop_target_ME')
#data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'date_cat_target_ME')
#data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'shop_cat_target_ME')
data = lag_feature(data, [1, 2, 3, 4, 5, 12], 'shop_item_target_ME')

data = downcast(data)

time.time() - ts
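
The shift-and-merge idea behind `lag_feature` can be checked on a tiny frame. The sketch below is a simplified stand-in (a single lag and a made-up frame `toy`), not the function used in the pipeline: it moves each row one month forward and joins it back, so every row receives the previous month's value for the same shop and item.

```python
import pandas as pd

keys = ["date_block_num", "shop_id", "item_id"]

# One shop/item pair observed over three consecutive months
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id":        [5, 5, 5],
    "item_id":        [7, 7, 7],
    "target":         [3.0, 4.0, 9.0],
})

# Relabel month m as m + 1, so joining on the keys attaches last month's target
shifted = toy[keys + ["target"]].rename(columns={"target": "target_lag_1"})
shifted["date_block_num"] += 1
lagged = toy.merge(shifted, on=keys, how="left")
```

The first month has no predecessor, so its lag is missing; the pipeline later fills such gaps with 0.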

In [0]:
# Drop level 0 lagged features

#data.drop(['date_block_num_target_ME', 'shop_id_target_ME', 'item_category_id_target_ME', 'item_id_target_ME', 'date_item_target_ME', 'date_shop_target_ME', 'date_cat_target_ME', 'shop_cat_target_ME', 'shop_item_target_ME'], axis = 1, inplace = True) # Drop object columns
data = downcast(data)

# 4. Split of data

In [0]:
data.fillna(0, inplace=True)
data.drop(['item_name', 'shop_name'], axis = 1, inplace = True) # Drop object columns

X_train = data[(data.date_block_num > 11) & (data.date_block_num < 33)].drop(['target'], axis=1)
Y_train = data[(data.date_block_num > 11) & (data.date_block_num < 33)]['target']
X_valid = data[data.date_block_num == 33].drop(['target'], axis=1)
Y_valid = data[data.date_block_num == 33]['target']
X_test = data[data.date_block_num == 34].drop(['target'], axis=1)

# del data
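
The split is purely time-based: months 12 to 32 are used for training, month 33 is held out for validation, and month 34 is the test month. The first 12 months are presumably skipped because the 12-month lag is not available for them. A minimal sketch on a made-up frame (the helper `time_split` and its parameter names are assumptions):

```python
import pandas as pd

# One row per month, months 0..34
toy = pd.DataFrame({"date_block_num": range(35), "target": 1.0})

def time_split(df, first_train=12, valid_block=33, test_block=34):
    """Split by month so that validation and test lie strictly after training."""
    blocks = df["date_block_num"]
    train = df[(blocks >= first_train) & (blocks < valid_block)]
    valid = df[blocks == valid_block]
    test = df[blocks == test_block]
    return train, valid, test

train_part, valid_part, test_part = time_split(toy)
```

Keeping the validation month after every training month mimics the situation at prediction time, where only past months are known.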

# 5. Model fitting through XGBoost

## First level models

Linear regression and XGBoost model.

In [0]:
ts = time.time()

model = XGBRegressor(
    tree_method = "gpu_hist",
    max_depth = 6, # Tree parameter: maximum depth each tree may reach in a boosting round
    min_child_weight = 600, # Tree parameter: minimum sum of instance weights (hessian) required in a child; higher values are more conservative
    colsample_bytree = 0.8, # Fraction of features sampled for each tree; a high value can lead to overfitting
    n_estimators = 500, # Maximum number of trees (boosting rounds) to build
    subsample = 0.80, # Boosting parameter: fraction of rows sampled for each tree; a low value can lead to underfitting
    eta=0.3, # Boosting parameter: learning rate (step size shrinkage)
    seed=123)

model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds = 10)

time.time() - ts
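
The `rmse` metric monitored during fitting is the root mean squared error. A minimal NumPy sketch, with made-up numbers and a hypothetical helper `rmse`, shows the computation and the effect of clipping predictions to the same [0, 20] range as the target:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets (already in [0, 20]) and raw predictions that leave that range
y_true = np.array([0.0, 1.0, 20.0, 0.0])
y_raw = np.array([-0.5, 1.0, 26.0, 0.5])
y_clipped = y_raw.clip(0, 20)

score_raw = rmse(y_true, y_raw)
score_clipped = rmse(y_true, y_clipped)
```

Because the target itself is clipped, predictions outside [0, 20] can only add error, which is why the predictions are clipped before submission.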


In [0]:
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(15).plot(kind='barh')

# 6. Getting the file for submission

In [0]:
Y_test = model.predict(X_test).clip(0, 20)
testdata = X_test.copy()
testdata['Y_test'] = Y_test

# Add target variable to transactions_test and leave only ID and target
transactions_test.reset_index(inplace = True)
submition = pd.merge(transactions_test, testdata[['shop_id', 'item_id', 'Y_test']], on = ['shop_id', 'item_id'])
submition = submition[['ID', 'Y_test']]
submition.columns = ['ID', 'item_cnt_month']

In [0]:
# Fit on training + validation data

X_train_c = data[(data.date_block_num > 11) & (data.date_block_num < 34)].drop(['target'], axis=1)
Y_train_c = data[(data.date_block_num > 11) & (data.date_block_num < 34)]['target']
X_test = data[data.date_block_num == 34].drop(['target'], axis=1)

ts = time.time()
model.fit(
    X_train_c, 
    Y_train_c, 
    eval_metric="rmse")
time.time() - ts

In [0]:
Y_test = model.predict(X_test).clip(0, 20)
testdata = X_test.copy()
testdata['Y_test'] = Y_test

# Add target variable to transactions_test and leave only ID and target
#transactions_test.reset_index(inplace = True)
submition = pd.merge(transactions_test, testdata[['shop_id', 'item_id', 'Y_test']], on = ['shop_id', 'item_id'])
submition = submition[['ID', 'Y_test']]
submition.columns = ['ID', 'item_cnt_month']
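
When predictions are joined back to the test IDs, it can be worth asserting that the join is one-to-one, so that no ID is duplicated or lost. The sketch below uses made-up frames (`toy_test`, `toy_pred`) and the `validate` argument of `DataFrame.merge`; it is an optional safety check, not part of the original pipeline.

```python
import pandas as pd

# Made-up test rows and predictions, deliberately in a different row order
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 6], "item_id": [7, 8, 7]})
toy_pred = pd.DataFrame({"shop_id": [6, 5, 5], "item_id": [7, 7, 8], "Y_test": [0.3, 1.2, 0.0]})

# validate="one_to_one" raises MergeError if a (shop_id, item_id) key repeats on either side
toy_submission = (
    toy_test.merge(toy_pred, on=["shop_id", "item_id"], how="left", validate="one_to_one")
    .rename(columns={"Y_test": "item_cnt_month"})[["ID", "item_cnt_month"]]
)
```

With `how="left"` every test ID is kept, so a missing prediction would show up as a NaN rather than as a silently dropped row.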

In [0]:
# Export results
submition.to_csv('V21_SmoothKag_All11_ShopItemME34L0.csv', index=False)
files.download('V21_SmoothKag_All11_ShopItemME34L0.csv')