# Baseline Modeling 

In this notebook we will create a validation schema and build a simple model that runs on it.

## Loading Custom Modules

In this notebook, we will use the pipelines and transformers from the previous notebooks, so we need to install them first.

In [1]:
%pip install ..\scripts -q
print("Installation completed!")

Note: you may need to restart the kernel to use updated packages.
Installation completed!


## Importing Modules

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from etl.transformers import * # dependencies for etl pipeline

from pickle import dump, load

## Importing Data

In [3]:
item_categories = pd.read_csv("../data/item_categories.csv")
shops = pd.read_csv("../data/shops.csv")
items = pd.read_csv("../data/items.csv")

sales_train = pd.read_csv("../data/sales_train.csv")
test = pd.read_csv("../data/test.csv", index_col=0)

## Loading Pipelines

In [4]:
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline.pkl", "rb") as f:
    eda_pipeline = load(f)

## Data Preprocessing

We can use our pipelines for data preprocessing, but first let's merge them into a single new pipeline.

In [5]:
from sklearn.pipeline import Pipeline

etl_eda_pipeline = Pipeline([
	("etl", etl_pipeline),
	("eda", eda_pipeline)
])

etl_eda_pipeline

In [6]:
preprocessed_train = etl_eda_pipeline.fit_transform(sales_train)
preprocessed_train.head()

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
0,2013-01-02,0,999.0,1.0,"Ярославль ТЦ ""Альтаир""",59,ЯВЛЕНИЕ 2012 (BD),22154,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2013-01-03,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2013-01-05,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,2013-01-06,0,1709.05,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE Who Do You Think We Are LP,2554,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,2013-01-15,0,1099.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),2555,Музыка - CD фирменного производства,56,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [7]:
preprocessed_train.describe()

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,weekday,month,year,is_NewYear,is_OctoberSales,price_category,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
count,2935772,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,...,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0
mean,2014-04-03 05:42:40.058750976,14.56987,890.7548,1.205446,3.365683,6.247721,2013.777,0.04770466,0.02323784,1.175418,...,0.001495348,0.00245455,0.02365477,0.001888089,0.09444739,0.01980092,0.08584693,0.07982602,0.1319605,0.5625754
min,2013-01-01 00:00:00,0.0,0.07,-22.0,0.0,1.0,2013.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2013-08-01 00:00:00,7.0,249.0,1.0,2.0,3.0,2013.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2014-03-04 00:00:00,14.0,399.0,1.0,4.0,6.0,2014.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2014-12-05 00:00:00,23.0,999.0,1.0,5.0,9.0,2014.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2015-10-31 00:00:00,33.0,59200.0,343.0,6.0,12.0,2015.0,1.0,1.0,3.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
std,,9.42296,1720.51,1.691073,1.996799,3.536204,0.7684795,0.2131407,0.1506581,1.32364,...,0.03864081,0.04948259,0.1519711,0.04341112,0.2924502,0.1393157,0.2801379,0.2710237,0.3384479,0.496069


## Part 1. Validation Schema and General approach to Validation

In this part, I will describe the principle by which we will conduct validation, and lay the foundation for building a machine learning model later.

__Model Validation__

Since we are working with a time series, validation has to respect the order of time: a model must never be trained on data that comes after the data it is validated on. Therefore, we will validate the model as follows:

1. We sort the dataset by date and split it in an ~80:20 ratio into a `training/validation dataset` and a `test dataset`, placing the split at the beginning of a month.
2. On the training/validation dataset, we use the `Expanding Window` technique, specifically `sklearn.model_selection.TimeSeriesSplit`, to generate training and validation folds, train the models, and calculate the Root Mean Squared Error (RMSE) on each validation fold.
3. Based on this validation, we select the best model and the best hyperparameters.
4. After selecting the best model and hyperparameters, we measure the final result on the `test dataset`.
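
The expanding-window idea from step 2 can be sketched without any library. The helper `expanding_window_splits` below is hypothetical (it is not part of scikit-learn) and only imitates the general behavior of `TimeSeriesSplit`: every fold trains on all rows before the validation block, and the validation block moves forward in time. Rows are assumed to be sorted by date already.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, valid_indices) for an expanding-window scheme.

    Hypothetical helper for illustration; rows are assumed to be sorted by date.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    for fold in range(1, n_splits + 1):
        train_end = fold * fold_size
        # The last fold absorbs any leftover rows
        valid_end = n_samples if fold == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, valid_end))


splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, valid_idx in splits:
    print(f"train: 0..{train_idx[-1]}  valid: {valid_idx[0]}..{valid_idx[-1]}")
```

The training window only ever grows, and every validation index is later than every training index in the same fold, which is the property that protects against looking into the future.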

__Data Validation__

The data that I am about to provide to the model was created using the EDA and DQC pipelines, which means that:

1. All data types are correct
2. There are no duplicates
3. There are no missing values
4. There are no outliers
5. There is no target leakage, because the new features for each record were created from its own attributes, without lags and with only limited use of aggregation

### Train/Test

In [8]:
preprocessed_train = preprocessed_train.sort_values(by="date")
preprocessed_train["date"].quantile(0.80)

Timestamp('2015-01-06 00:00:00')

As we can see, the 80th percentile corresponds to the beginning of 2015, so we can split our dataset into two parts:

	train - before 2015.01.01
	test - on or after 2015.01.01

In [9]:
Xy_train = preprocessed_train[preprocessed_train["date"] < pd.Timestamp("2015.01.01")]
Xy_test = preprocessed_train[preprocessed_train["date"] >= pd.Timestamp("2015.01.01")]
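
The choice of the cutoff can also be made programmatically instead of by reading the quantile and rounding by eye. The sketch below uses only the standard library and toy dates; `month_start_cutoff` is a hypothetical helper, and its quantile is a simple index-based approximation rather than the interpolation pandas applies.

```python
from datetime import date


def month_start_cutoff(dates, train_share=0.8):
    """Return the first day of the month that contains the `train_share` quantile date."""
    ordered = sorted(dates)
    position = int(train_share * (len(ordered) - 1))
    return ordered[position].replace(day=1)


# Toy data: one record in the middle of each month from 2013-01 to 2013-10
toy_dates = [date(2013, month, 15) for month in range(1, 11)]
cutoff = month_start_cutoff(toy_dates)

train = [d for d in toy_dates if d < cutoff]
test = [d for d in toy_dates if d >= cutoff]
print(cutoff, len(train), len(test))
```

Snapping the cutoff to a month start keeps every month entirely in one of the two sets, which matters later when sales are aggregated by month.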

### Feature Extraction Step

In this notebook we focus on building the validation schema, so let's assume that the pipelines we use for data preprocessing already produce useful features, and that we only need to drop features with unsuitable types (dates, text, etc.).

For this task, we will write a simple transformer.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Keeps only int64 and float64 columns; everything else is dropped."""

    def __init__(self):
        self.columns_to_save = list()

    def fit(self, X, y=None):
        # Rebuild the list on every fit so that refitting does not accumulate duplicates
        self.columns_to_save = [
            feature for feature in X.columns
            if X[feature].dtype in (np.dtype("int64"), np.dtype("float64"))
        ]
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns_to_save]

In [11]:
Xy_train

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,2013-01-01,0,149.0,1.0,"Казань ТЦ ""ПаркХаус"" II",14,ТАКИЕ РАЗНЫЕ БЛИЗНЕЦЫ (регион),19548,Кино - DVD,40,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,2013-01-01,0,3889.5,1.0,"Калуга ТРЦ ""XXI век""",15,Win Home Basic 7 Russian Russia Only DVD,7814,Программы - Для дома и офиса,75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,2013-01-01,0,349.0,1.0,"Химки ТЦ ""Мега""",54,ШАГ ВПЕРЕД 4,21808,Кино - DVD,40,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,2013-01-01,0,2290.0,1.0,"Калуга ТРЦ ""XXI век""",15,Win Pro 8 32-bit/64-bit Russian VUP Russia Onl...,7820,Программы - Для дома и офиса,75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,2013-01-01,0,149.0,1.0,"Химки ТЦ ""Мега""",54,ШЕРЛОК. СЕЗОН 1 (BD),21856,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,2014-12-31,23,849.0,1.0,"Коломна ТЦ ""Рио""",16,"Disney. Infinity 2.0 (Marvel). Персонаж ""Желез...",2867,Игры - Аксессуары для игр,25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,2014-12-31,23,1799.0,1.0,"Москва МТРЦ ""Афи Молл""",21,"Disney. Infinity 2.0 (Marvel). Набор ""2+1"": ""С...",2860,Игры - Аксессуары для игр,25,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,2014-12-31,23,1799.0,1.0,"Уфа ТК ""Центральный""",52,"Sims 4 [PC, русская версия]",6503,Игры PC - Стандартные издания,30,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,2014-12-31,23,699.0,1.0,"Москва ТЦ ""МЕГА Белая Дача II""",27,Кулон на цепочке Minecraft Creeper,13746,Подарки - Сувениры (в навеску),70,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [12]:
feature_extraction = ColumnDropper()

Xy_train_extracted_v1 = feature_extraction.fit_transform(Xy_train)
Xy_train_extracted_v1

Unnamed: 0,date_block_num,item_price,item_cnt_day,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,0,149.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,0,3889.5,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,0,349.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,0,2290.0,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,0,149.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,1.0,2,12,2014,1,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,23,1799.0,1.0,2,12,2014,1,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,23,1799.0,1.0,2,12,2014,1,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,23,699.0,1.0,2,12,2014,1,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### Validation class

In [40]:
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

class ModelValidation():
    
    def __init__(self, X, y, model, verbose=1):
        self.X = X
        self.y = y
        self.model = model
        self.verbose = verbose
        
        
    def validate(self, n_splits):
        self.scores = []
        
        tscv = TimeSeriesSplit(n_splits=n_splits)
        for i, (train_index, valid_index) in enumerate(tscv.split(self.X)):
            if self.verbose:
                print(f"Model: {i}")
            X_train = self.X.iloc[train_index]
            y_train = self.y.iloc[train_index]
            
            X_valid = self.X.iloc[valid_index]
            y_valid = self.y.iloc[valid_index]
            
            self.model.fit(X_train, y_train)
            predictions = self.model.predict(X_valid)
            self.scores.append(root_mean_squared_error(y_valid, predictions))
        if self.verbose:
            print("Validation Completed!")
        
        return self

In [14]:
X_train = Xy_train_extracted_v1.drop(["item_cnt_day"], axis="columns")
y_train = Xy_train_extracted_v1.loc[:, "item_cnt_day"]
X_train

Unnamed: 0,date_block_num,item_price,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,0,3889.5,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,0,349.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,0,2290.0,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,23,699.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [15]:
y_train

57384      1.0
48401      1.0
74546      1.0
48405      1.0
74531      1.0
          ... 
2270005    1.0
2315780    1.0
2279605    1.0
2293961    1.0
2202599    1.0
Name: item_cnt_day, Length: 2323364, dtype: float64

In [16]:
from sklearn.tree import DecisionTreeRegressor

validation = ModelValidation(X_train, y_train, DecisionTreeRegressor())
validation.validate(5)

Model: 0
Model: 1
Model: 2
Model: 3
Model: 4
Validation Completed!


<__main__.ModelValidation at 0x1faba3a87d0>

In [17]:
validation.scores

[np.float64(1.2878054796448561),
 np.float64(1.9261475723173263),
 np.float64(1.700165756089223),
 np.float64(1.7073583405845238),
 np.float64(1.9844201401453085)]
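
To compare models later, it is convenient to reduce the per-fold scores to a single number plus a measure of spread. A minimal sketch, using the fold scores above rounded to four decimals and only the standard library:

```python
from statistics import mean, stdev

# RMSE values copied (rounded) from the validation output above
fold_scores = [1.2878, 1.9261, 1.7002, 1.7074, 1.9844]

mean_rmse = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds
print(f"RMSE: {mean_rmse:.3f} +/- {spread:.3f}")
```

A large spread relative to the mean suggests that the score depends strongly on which period is used for validation.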

## Part 2. Model Building 

In this section, we will build an improved feature selection method and, using the more useful features, train the first models.

## Feature Selection

### Important info!

As I mentioned in previous notebooks, our dataset does not contain the target explicitly. Our task is to predict sales aggregated by month, so there are two approaches to training the model, and the choice determines which features we select:

- We predict daily sales for each item, as they appear in our dataset, and then aggregate the predictions by month. In this approach we need to:
	1. Find the best features
	2. Train a model on these features
	3. Write an aggregation class to aggregate the results

- We aggregate the data first and predict on the aggregated data, so the target is explicit. In this approach we need to:
	1. Aggregate the data by month
	2. Find the best features
	3. Train a model on these features
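
Step 3 of the first approach (aggregating daily predictions into monthly totals) could look roughly like this. The function `aggregate_monthly` and the toy predictions are hypothetical and only illustrate the idea of summing per month, shop and item.

```python
from collections import defaultdict


def aggregate_monthly(daily_predictions):
    """Sum daily predictions into totals per (date_block_num, shop_id, item_id).

    `daily_predictions` is an iterable of
    (date_block_num, shop_id, item_id, predicted_count) tuples.
    """
    totals = defaultdict(float)
    for date_block_num, shop_id, item_id, predicted_count in daily_predictions:
        totals[(date_block_num, shop_id, item_id)] += predicted_count
    return dict(totals)


# Made-up daily predictions, for illustration only
daily = [
    (33, 25, 7409, 1.0),
    (33, 25, 7409, 0.5),
    (33, 25, 7460, 2.0),
]
monthly = aggregate_monthly(daily)
print(monthly)
```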


For both approaches, we will write a voting selector that runs several feature selection algorithms and keeps the most promising features. The selected features are then passed to Boruta, which makes the final choice of the best features.
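
The voting rule itself is simple and can be shown on its own. In the sketch below, each selector is represented only by the set of features it chose; the selector outputs are made up, and `vote` is a hypothetical stand-alone function rather than part of the class defined in this notebook.

```python
def vote(selections, all_features, voting_threshold=0.5):
    """Keep features chosen by at least `voting_threshold` of the selectors."""
    kept = []
    for feature in all_features:
        share = sum(feature in chosen for chosen in selections.values()) / len(selections)
        if share >= voting_threshold:
            kept.append(feature)
    return kept


# Made-up selector outputs, for illustration only
selections = {
    "pearson": {"item_price", "month"},
    "vif": {"month", "weekday"},
    "mi": {"item_price", "month", "year"},
    "anova": {"item_price"},
}
all_features = ["item_price", "month", "weekday", "year"]

kept = vote(selections, all_features)
print(kept)
```

With the default threshold of 0.5, a feature needs the votes of at least two of the four selectors to survive.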

In [18]:
from sklearn.feature_selection import SelectKBest, r_regression, mutual_info_regression, f_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from itertools import compress

class VoitingSelector():
    """Keeps the features that enough individual selectors vote for."""

    def __init__(self):
        self.votes = None
        self.selectors = {
            "pearson": self._select_pearson,
            "vif": self._select_vif,
            "mi": self._select_mi,
            "anova": self._select_anova,
        }

    @staticmethod
    def _select_pearson(X, y=None, **kwargs):
        # Note: r_regression scores are signed, so the k most positive correlations are kept
        selector = SelectKBest(r_regression, k=kwargs.get("n_features_to_select", 20)).fit(X, y)
        return selector.get_feature_names_out()

    @staticmethod
    def _select_mi(X, y=None, **kwargs):
        selector = SelectKBest(mutual_info_regression, k=kwargs.get("n_features_to_select", 20)).fit(X, y)
        return selector.get_feature_names_out()

    @staticmethod
    def _select_vif(X, y=None, **kwargs):
        # Keep features whose variance inflation factor does not exceed the threshold
        return [
            X.columns[feature_index]
            for feature_index in range(len(X.columns))
            if variance_inflation_factor(X.values, feature_index) <= kwargs.get("vif_threshold", 5)
        ]

    @staticmethod
    def _select_anova(X, y=None, **kwargs):
        selector = SelectKBest(f_regression, k=kwargs.get("n_features_to_select", 20)).fit(X, y)
        return selector.get_feature_names_out()

    def select(self, X, y, voting_threshold=0.5, **kwargs):
        votes = []
        for selector_name, selector_method in self.selectors.items():
            features_to_keep = selector_method(X, y, **kwargs)
            # One row of 0/1 votes per selector, one column per feature
            votes.append(
                pd.DataFrame([int(feature in features_to_keep) for feature in X.columns]).T
            )
            print(f"{selector_name} calculation completed!")
        self.votes = pd.concat(votes)
        self.votes.columns = X.columns
        self.votes.index = self.selectors.keys()
        # Keep features whose share of votes reaches the threshold
        features_to_keep = list(compress(X.columns, self.votes.mean(axis=0) >= voting_threshold))
        return X[features_to_keep]


### First approach: Raw data

In [19]:
sales_train

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.00,1.0
1,03.01.2013,0,25,2552,899.00,1.0
2,05.01.2013,0,25,2552,899.00,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.00,1.0
...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.00,1.0
2935845,09.10.2015,33,25,7460,299.00,1.0
2935846,14.10.2015,33,25,7459,349.00,1.0
2935847,22.10.2015,33,25,7440,299.00,1.0


In [20]:
X_train

Unnamed: 0,date_block_num,item_price,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,0,3889.5,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,0,349.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,0,2290.0,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,23,699.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [21]:
selector = VoitingSelector()
features_to_keep = selector.select(X_train, y_train)

pearson calculation completed!


  vif = 1. / (1. - r_squared_i)


vif calculation completed!
mi calculation completed!
anova calculation completed!


In [22]:
features_to_keep

Unnamed: 0,date_block_num,item_price,weekday,month,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,...,city_name_Москва,group_Билеты (Цифра),group_Доставка товара,group_Игры PC,group_Кино,group_Музыка,group_Подарки,group_Чистые носители (штучные),shop_type_Digital,shop_type_Event
57384,0,149.0,1,1,0,0,0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
48401,0,3889.5,1,1,0,0,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74546,0,349.0,1,1,0,0,0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
48405,0,2290.0,1,1,0,0,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74531,0,149.0,1,1,0,0,0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,2,12,1,0,3,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2315780,23,1799.0,2,12,1,0,1,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2279605,23,1799.0,2,12,1,0,1,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2293961,23,699.0,2,12,1,0,3,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [25]:
features_to_keep.columns

Index(['date_block_num', 'item_price', 'weekday', 'month', 'is_NewYear',
       'is_OctoberSales', 'price_category', 'price_category_0',
       'price_category_1', 'price_category_2', 'city_name_Выездная',
       'city_name_Интернет-магазин', 'city_name_Москва',
       'group_Билеты (Цифра)', 'group_Доставка товара', 'group_Игры PC',
       'group_Кино', 'group_Музыка', 'group_Подарки',
       'group_Чистые носители (штучные)', 'shop_type_Digital',
       'shop_type_Event'],
      dtype='object')

Once we have found the most promising features, we pass the reduced dataset to the Boruta algorithm to finalize the set of best features.
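
The core idea behind Boruta is to compare each real feature with shuffled copies of the features ("shadow" features) and to keep only the features that beat the best shadow consistently. The sketch below is a heavily simplified illustration of that comparison, not the BorutaPy implementation: it uses absolute Pearson correlation as a stand-in for random forest importance, and it only counts hits instead of running a statistical test. All names and the toy data are hypothetical.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_feature_hits(features, target, n_rounds=20, seed=0):
    """Count, per feature, how often it beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling breaks any real link to the target
            shadow_scores.append(abs_corr(shadow, target))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > best_shadow:
                hits[name] += 1
    return hits


data_rng = random.Random(1)
signal = [float(i) for i in range(50)]
noise = [data_rng.random() for _ in range(50)]
target = [2.0 * x + 1.0 for x in signal]

hits = shadow_feature_hits({"signal": signal, "noise": noise}, target)
print(hits)
```

A feature that is truly related to the target wins in every round, while an unrelated feature wins only occasionally by chance.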

In [23]:
%pip install Boruta




In [24]:
from boruta.boruta_py import BorutaPy
from sklearn.ensemble import RandomForestRegressor

boruta = BorutaPy(RandomForestRegressor(max_depth=5, n_jobs=-1), n_estimators="auto", verbose=2, random_state=52)
X_train_extracted = X_train.loc[:, features_to_keep.columns]

boruta.fit_transform(X_train_extracted , y=y_train, return_df=True)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	3
Tentative: 	6
Rejected: 	13
Iteration: 	9 / 100
Confirmed: 	3
Tentative: 	6
Rejected: 	13
Iteration: 	10 / 100
Confirmed: 	3
Tentative: 	6
Rejected: 	13
Iteration: 	11 / 100
Confirmed: 	3
Tentative: 	6
Rejected: 	13
Iteration: 	12 / 100
Confirmed: 	5
Tentative: 	4
Rejected: 	13
Iteration: 	13 / 100
Confirmed: 	5
Tentative: 	4
Rejected: 	13
Iteration: 	14 / 100
Confirmed: 	5
Tentative: 	4
Rejected: 	13
Iteration: 	15 / 100
Confirmed: 	5
Tentative: 	4
Rejected: 	13
Iteration: 	16 / 100
Confirmed: 	5
Tentative: 	4
Rejected: 	13
I

KeyboardInterrupt: 

In [89]:
from pickle import dump

with open("../utils/boruta_raw_data.pkl", "wb") as f:
    dump(boruta, f)

### Second approach: Aggregated Data 

Before searching for the best features, we need to aggregate our data by month. We already have our data pipelines, but after aggregation we also need to make sure that the data is still valid input for them. To make it possible to transform the aggregated data with the pipeline, we fill the `date` column with the first day of each month (this imputation makes the `weekday` column useless, but we will drop it during feature selection anyway).
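
The relationship between `date_block_num` and the imputed date is plain arithmetic, which a small standard-library sketch makes explicit. `block_to_month_start` is a hypothetical helper; it assumes, as in this dataset, that block 0 is January 2013.

```python
from datetime import date


def block_to_month_start(date_block_num, first_year=2013):
    """Map a month index (0 = January of `first_year`) to the first day of that month."""
    return date(first_year + date_block_num // 12, date_block_num % 12 + 1, 1)


dates_map = {block: block_to_month_start(block) for block in range(34)}
print(dates_map[0], dates_map[23], dates_map[33])
```

Blocks 0, 23 and 33 map to January 2013, December 2014 and October 2015, which matches the date range of the training data.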

In [24]:
# Map each month index (date_block_num) to the first day of that month
date_range = pd.date_range(start="2013-01-01", periods=34, freq="MS")
date_blocks = list(range(34))

dates_map = dict(zip(date_blocks, date_range))

# Aggregate daily records to monthly level: mean price and total sales
aggregated_train = sales_train.drop(["date"], axis="columns")
aggregated_train = (
    aggregated_train
    .groupby(["date_block_num", "shop_id", "item_id"])
    .agg({"item_price": "mean", "item_cnt_day": "sum"})
    .reset_index()
)
aggregated_train["date"] = aggregated_train["date_block_num"].map(dates_map)

aggregated_train = etl_eda_pipeline.transform(aggregated_train)
aggregated_train

Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
0,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1,32,Кино - DVD,40,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0,347.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1 (BD),33,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0,247.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,10 ЛЕТ СПУСТЯ,35,Кино - DVD,40,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 МИЛЛИОНОВ ЕВРО,43,Кино - DVD,40,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0,128.5,2.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 лучших произведений классики (mp3-CD) (Dig...,51,Музыка - MP3,57,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609119,33,119.0,6.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR03-BC2,22087,Элементы питания,83,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1609120,33,119.0,2.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR06-BC2,22088,Элементы питания,83,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1609121,33,179.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL TURBO LR 03 2*BL,22091,Элементы питания,83,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1609122,33,629.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Энциклопедия Adventure Time,22100,"Книги - Артбуки, энциклопедии",42,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


We also need to modify the `is_NewYear` and `is_OctoberSales` features to make them meaningful at the monthly level: `is_NewYear` is set to 1 for December sales and `is_OctoberSales` to 1 for October sales.

In [25]:
aggregated_train["is_NewYear"] = aggregated_train["date"].apply(lambda x : 1 if x.month == 12 else 0)
aggregated_train["is_OctoberSales"] = aggregated_train["date"].apply(lambda x : 1 if x.month == 10 else 0)

As mentioned before, this aggregation makes the `weekday` feature meaningless, so we can drop it.

In [26]:
aggregated_train = aggregated_train.drop("weekday", axis="columns")
aggregated_train

Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
0,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1,32,Кино - DVD,40,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0,347.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1 (BD),33,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0,247.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,10 ЛЕТ СПУСТЯ,35,Кино - DVD,40,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 МИЛЛИОНОВ ЕВРО,43,Кино - DVD,40,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0,128.5,2.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 лучших произведений классики (mp3-CD) (Dig...,51,Музыка - MP3,57,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609119,33,119.0,6.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR03-BC2,22087,Элементы питания,83,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1609120,33,119.0,2.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR06-BC2,22088,Элементы питания,83,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1609121,33,179.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL TURBO LR 03 2*BL,22091,Элементы питания,83,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1609122,33,629.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Энциклопедия Adventure Time,22100,"Книги - Артбуки, энциклопедии",42,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Now we will create new train/validation and test sets. We will try to predict the last month we have, based on the information about the previous months.

In [27]:
Xy_train = aggregated_train[aggregated_train["date_block_num"] < 33]
Xy_test = aggregated_train[aggregated_train["date_block_num"] == 33]

Next, we preprocess the features and drop those that are not numerical.

In [28]:
column_dropper = ColumnDropper()
Xy_train = column_dropper.fit_transform(Xy_train)
Xy_train.columns

Index(['date_block_num', 'item_price', 'item_cnt_day', 'month', 'year',
       'is_NewYear', 'is_OctoberSales', 'price_category', 'price_category_0',
       'price_category_1', 'price_category_2', 'price_category_3',
       'city_name_Адыгея', 'city_name_Балашиха', 'city_name_Волжский',
       'city_name_Вологда', 'city_name_Воронеж', 'city_name_Выездная',
       'city_name_Жуковский', 'city_name_Интернет-магазин', 'city_name_Казань',
       'city_name_Калуга', 'city_name_Коломна', 'city_name_Красноярск',
       'city_name_Курск', 'city_name_Москва', 'city_name_Мытищи',
       'city_name_Н.Новгород', 'city_name_Новосибирск', 'city_name_Омск',
       'city_name_РостовНаДону', 'city_name_СПб', 'city_name_Самара',
       'city_name_Сергиев', 'city_name_Сургут', 'city_name_Томск',
       'city_name_Тюмень', 'city_name_Уфа', 'city_name_Химки',
       'city_name_Цифровой', 'city_name_Чехов', 'city_name_Якутск',
       'city_name_Ярославль', 'group_PC', 'group_Аксессуары',
       'group_Билет

In [29]:
X_train = Xy_train.drop("item_cnt_day", axis="columns")
y_train = Xy_train.loc[:, "item_cnt_day"]


voiting_selector = VoitingSelector()
features_to_keep_agg = voiting_selector.select(X_train, y_train)
features_to_keep_agg

pearson calculation completed!


  vif = 1. / (1. - r_squared_i)


vif calculation completed!
mi calculation completed!
anova calculation completed!


Unnamed: 0,item_price,month,is_NewYear,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name_Выездная,city_name_Москва,...,group_Доставка товара,group_Игры,group_Игры PC,group_Карты оплаты,group_Кино,group_Музыка,group_Служебные,group_Чистые носители (штучные),shop_type_Digital,shop_type_Event
0,221.0,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,347.0,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,247.0,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,221.0,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,128.5,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1577588,119.0,9,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1577589,119.0,9,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1577590,179.0,9,0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1577591,629.0,9,0,3,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
features_to_keep_agg.columns

Index(['item_price', 'month', 'is_NewYear', 'price_category',
       'price_category_0', 'price_category_1', 'price_category_2',
       'price_category_3', 'city_name_Выездная', 'city_name_Москва',
       'city_name_Цифровой', 'group_Билеты (Цифра)', 'group_Доставка товара',
       'group_Игры', 'group_Игры PC', 'group_Карты оплаты', 'group_Кино',
       'group_Музыка', 'group_Служебные', 'group_Чистые носители (штучные)',
       'shop_type_Digital', 'shop_type_Event'],
      dtype='object')

In [31]:
from boruta.boruta_py import BorutaPy
from sklearn.ensemble import RandomForestRegressor

boruta = BorutaPy(RandomForestRegressor(max_depth=5, n_jobs=-1), n_estimators="auto", verbose=2, random_state=52)

boruta.fit_transform(X_train.loc[:, features_to_keep_agg.columns] , y=y_train, return_df=True)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	9 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	10 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	11 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	12 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	13 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	14 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	15 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
Iteration: 	16 / 100
Confirmed: 	5
Tentative: 	1
Rejected: 	16
I

Unnamed: 0,item_price,month,city_name_Москва,group_Доставка товара,group_Игры PC
0,221.0,1,0.0,0.0,0.0
1,347.0,1,0.0,0.0,0.0
2,247.0,1,0.0,0.0,0.0
3,221.0,1,0.0,0.0,0.0
4,128.5,1,0.0,0.0,0.0
...,...,...,...,...,...
1577588,119.0,9,0.0,0.0,0.0
1577589,119.0,9,0.0,0.0,0.0
1577590,179.0,9,0.0,0.0,0.0
1577591,629.0,9,0.0,0.0,0.0


In [32]:
feature_ranking = pd.DataFrame({"features" : X_train.loc[:, features_to_keep_agg.columns].columns, "ranking" : boruta.ranking_} ).sort_values(by="ranking")

final_features = feature_ranking[feature_ranking["ranking"] <= 6]["features"].values
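
As far as I know, BorutaPy's `ranking_` assigns rank 1 to confirmed features and rank 2 to tentative ones, with higher ranks going to rejected features, so a threshold such as `<= 6` also keeps some features that Boruta rejected. The sketch below shows how the threshold changes the selection; the ranks are made up and `select_by_rank` is a hypothetical helper.

```python
def select_by_rank(feature_names, ranking, max_rank=1):
    """Return features whose Boruta-style rank is at most `max_rank`, best first."""
    ranked = sorted(zip(ranking, feature_names))
    return [name for rank, name in ranked if rank <= max_rank]


# Made-up ranks for illustration; they are not the notebook's actual results
names = ["item_price", "month", "year", "is_NewYear"]
ranks = [1, 1, 3, 2]

confirmed = select_by_rank(names, ranks, max_rank=1)
with_tentative = select_by_rank(names, ranks, max_rank=2)
print(confirmed, with_tentative)
```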

In [33]:
with open("../utils/boruta_aggregated_data.pkl", "wb") as f:
    dump(boruta, f)

### Model Building 

After choosing the most useful features, we can try a few models, tune them, and test the results.

In [44]:
import numpy as np
from hyperopt import hp, tpe, fmin, Trials
from sklearn.ensemble import RandomForestRegressor

space = {
    'n_estimators': hp.choice('n_estimators', range(50, 500)),
    'max_depth': hp.choice('max_depth', range(5, 50)),
    'min_samples_split': hp.uniform('min_samples_split', 0.1, 1.0),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(1, 10)),
}

def objective(params):
    regressor = RandomForestRegressor(**params, n_jobs=-1)
    score = ModelValidation(X_train.loc[:, final_features], y_train, regressor, verbose=0).validate(5)
    return sum(score.scores) / len(score.scores)

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

# Note: for hp.choice parameters, `best` holds the index of the chosen option, not its value
print("Best hyperparams:", best)

100%|██████████| 50/50 [3:10:14<00:00, 228.30s/trial, best loss: 5.44078582417489]   
Best hyperparams: {'max_depth': np.int64(31), 'min_samples_leaf': np.int64(3), 'min_samples_split': np.float64(0.11135298742795507), 'n_estimators': np.int64(2)}
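
One caveat when reading this output: for parameters defined with `hp.choice`, `fmin` reports the index of the selected option rather than the option itself, so `'n_estimators': 2` most likely means the third value of `range(50, 500)`. hyperopt provides `space_eval(space, best)` for this conversion; the sketch below does the same lookup by hand, with the ranges copied from the `space` definition above, so that it runs without hyperopt. `decode_choices` is a hypothetical helper.

```python
# Search ranges copied from the `space` definition above
choice_options = {
    "n_estimators": range(50, 500),
    "max_depth": range(5, 50),
    "min_samples_leaf": range(1, 10),
}

# Values as reported by `fmin` in the output above (converted to plain Python types)
best = {
    "max_depth": 31,
    "min_samples_leaf": 3,
    "min_samples_split": 0.11135298742795507,
    "n_estimators": 2,
}


def decode_choices(best, choice_options):
    """Replace hp.choice indices with the values they point to."""
    decoded = dict(best)
    for name, options in choice_options.items():
        decoded[name] = options[decoded[name]]
    return decoded


best_params = decode_choices(best, choice_options)
print(best_params)
```

Under this reading, the best configuration uses 52 trees with a maximum depth of 36 and at least 4 samples per leaf; `min_samples_split` comes from `hp.uniform` and is already an actual value.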
