# Baseline Modeling 

In this notebook, we will create a validation schema and build a simple model that runs on it.

## Loading Custom Modules

In this notebook, we will use pipelines and transformers from the previous notebooks, so we need to install them first.

In [1]:
%pip install ..\scripts -q
print("Installation completed!")

Note: you may need to restart the kernel to use updated packages.
Installation completed!


## Importing Modules

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from etl.transformers import * # dependencies for etl pipeline

from pickle import dump, load

## Importing Data

In [3]:
item_categories = pd.read_csv("../data/item_categories.csv")
shops = pd.read_csv("../data/shops.csv")
items = pd.read_csv("../data/items.csv")

sales_train = pd.read_csv("../data/sales_train.csv")
test = pd.read_csv("../data/test.csv", index_col=0)

## Loading Pipelines

In [4]:
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline.pkl", "rb") as f:
    eda_pipeline = load(f)

## Data Preprocessing

We can reuse our pipelines for data preprocessing, but first let's merge them into a single pipeline.

In [88]:
from sklearn.pipeline import Pipeline

etl_eda_pipeline = Pipeline([
    ("etl", etl_pipeline),
    ("eda", eda_pipeline)
])

etl_eda_pipeline

In [90]:
preprocessed_train = etl_eda_pipeline.fit_transform(sales_train)
preprocessed_train.head()

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,2013-01-02,0,999.0,1.0,"Ярославль ТЦ ""Альтаир""",59,ЯВЛЕНИЕ 2012 (BD),22154,Кино - Blu-Ray,37,...,0,0,3,0.0,0.0,0.0,1.0,1.163508,1.100231,1.222551
1,2013-01-03,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0,0,3,0.0,0.0,0.0,1.0,1.27888,1.023471,1.26651
2,2013-01-05,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0,0,3,0.0,0.0,0.0,1.0,1.27888,1.023471,1.26651
3,2013-01-06,0,1709.05,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE Who Do You Think We Are LP,2554,Музыка - Винил,58,...,0,0,1,0.0,1.0,0.0,0.0,1.27888,1.023471,1.26651
4,2013-01-15,0,1099.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),2555,Музыка - CD фирменного производства,56,...,0,0,3,0.0,0.0,0.0,1.0,1.27888,1.023471,1.26651


In [7]:
preprocessed_train.describe()

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,weekday,month,year,is_NewYear,is_OctoberSales,price_category,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
count,2935772,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,...,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0,2935772.0
mean,2014-04-03 05:42:40.058750976,14.56987,890.7548,1.205446,3.365683,6.247721,2013.777,0.04770466,0.02323784,1.175418,...,0.001495348,0.00245455,0.02365477,0.001888089,0.09444739,0.01980092,0.08584693,0.07982602,0.1319605,0.5625754
min,2013-01-01 00:00:00,0.0,0.07,-22.0,0.0,1.0,2013.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2013-08-01 00:00:00,7.0,249.0,1.0,2.0,3.0,2013.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2014-03-04 00:00:00,14.0,399.0,1.0,4.0,6.0,2014.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2014-12-05 00:00:00,23.0,999.0,1.0,5.0,9.0,2014.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2015-10-31 00:00:00,33.0,59200.0,343.0,6.0,12.0,2015.0,1.0,1.0,3.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
std,,9.42296,1720.51,1.691073,1.996799,3.536204,0.7684795,0.2131407,0.1506581,1.32364,...,0.03864081,0.04948259,0.1519711,0.04341112,0.2924502,0.1393157,0.2801379,0.2710237,0.3384479,0.496069


## Part 1. Validation Schema and General approach to Validation

In this part, I will describe the principle by which we will conduct validation and lay the foundation for building machine learning models later on.

__Model Validation__

Since we are working with a time series, validation must respect the order of time: a model should always be evaluated on data that comes later than the data it was trained on. Therefore, we will validate the model as follows:

1. We split the dataset in an ~80:20 ratio into a `training/validation dataset` and a `test dataset`, sorting by date and cutting at the beginning of a month.
2. On the training set, we use the `Expanding Window` technique, specifically `sklearn.model_selection.TimeSeriesSplit`, to generate training and validation folds, train the models, and calculate the Root Mean Squared Error (RMSE) on each validation fold.
3. Based on this validation, we select the best model and the best hyperparameters.
4. After selecting the best model and hyperparameters, we measure the final result on the `test set`.
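
To make step 2 concrete, here is a minimal, library-free sketch of how an expanding window produces its folds. The helper `expanding_window_splits` is a hypothetical name, not part of scikit-learn. It mirrors what I understand to be the default behaviour of `TimeSeriesSplit` (no gap, no cap on the training size), so treat it as an illustration rather than the library's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, valid_indices) pairs for an expanding window.

    Hypothetical helper for illustration only: every validation fold has the
    same size, and each training window contains all rows before its fold.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    first_valid_start = n_samples - n_splits * fold_size
    for valid_start in range(first_valid_start, n_samples, fold_size):
        train_indices = list(range(valid_start))
        valid_indices = list(range(valid_start, valid_start + fold_size))
        yield train_indices, valid_indices


# 12 time-ordered rows, 3 folds: the training window grows, validation moves forward
folds = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_indices, valid_indices in folds:
    print(f"train rows 0..{train_indices[-1]}, validation rows {valid_indices}")
```

Because the folds are defined by row position, the rows must be sorted by date before splitting, which is why the notebook sorts `preprocessed_train` by `date` first.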

__Data Validation__

The data that I am about to provide to the model is produced by the EDA and DQC pipelines, which means that:

1. All data types are correct
2. There are no duplicates
3. There are no missing values
4. There are no outliers
5. There is no target leakage, because new features for each record were created from its own attributes, without lags and with only a little aggregation

### Train/Test

In [56]:
preprocessed_train = preprocessed_train.sort_values(by="date")
preprocessed_train["date"].quantile(0.80)

Timestamp('2015-01-06 00:00:00')

As we can see, the 80th percentile corresponds to the beginning of 2015, so we can split our dataset into 2 parts:

	train - before 2015.01.01
	test - on or after 2015.01.01

In [9]:
Xy_train = preprocessed_train[preprocessed_train["date"] < pd.Timestamp("2015.01.01")]
Xy_test = preprocessed_train[preprocessed_train["date"] >= pd.Timestamp("2015.01.01")]
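
The choice of the cut-off date can be sketched without pandas: take the date at the 80% position of the sorted dates and snap it to the first day of its month. The toy dates and the helper `month_start_cutoff` are made up for illustration, and rounding down to the start of the month is only one possible convention. The notebook itself simply picked 2015-01-01 after looking at the quantile.

```python
from datetime import date, timedelta


def month_start_cutoff(dates, train_share=0.8):
    """Return the first day of the month containing the train_share quantile date."""
    ordered = sorted(dates)
    # Position of the quantile in the sorted list (nearest lower rank)
    position = min(int(len(ordered) * train_share), len(ordered) - 1)
    quantile_date = ordered[position]
    return quantile_date.replace(day=1)


# Toy data: one record per day over 1000 days starting on 2013-01-01
toy_dates = [date(2013, 1, 1) + timedelta(days=offset) for offset in range(1000)]
cutoff = month_start_cutoff(toy_dates)

# Same rule as in the notebook: strictly before the cut-off is train, the rest is test
train_dates = [d for d in toy_dates if d < cutoff]
test_dates = [d for d in toy_dates if d >= cutoff]
print(cutoff, len(train_dates), len(test_dates))
```

Snapping to a month boundary moves the ratio slightly away from 80:20, which is why the split is described as approximate.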

### Feature Extraction Step

In this notebook we focus on creating the validation schema, so let's assume that the preprocessing pipelines already produce useful features and that we only need to drop features with unsuitable types (dates, text, etc.).

For this task, we will write a simple transformer.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Keep only the int64 and float64 columns of a DataFrame."""

    def __init__(self):
        self.columns_to_save = list()

    def fit(self, X, y=None):
        # Start from an empty list so that repeated fits do not accumulate duplicates
        self.columns_to_save = []
        for feature in X.columns:
            if X[feature].dtype == np.dtype("int64") or X[feature].dtype == np.dtype("float64"):
                self.columns_to_save.append(feature)
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns_to_save]

In [11]:
Xy_train

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,2013-01-01,0,149.0,1.0,"Казань ТЦ ""ПаркХаус"" II",14,ТАКИЕ РАЗНЫЕ БЛИЗНЕЦЫ (регион),19548,Кино - DVD,40,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,2013-01-01,0,3889.5,1.0,"Калуга ТРЦ ""XXI век""",15,Win Home Basic 7 Russian Russia Only DVD,7814,Программы - Для дома и офиса,75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,2013-01-01,0,349.0,1.0,"Химки ТЦ ""Мега""",54,ШАГ ВПЕРЕД 4,21808,Кино - DVD,40,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,2013-01-01,0,2290.0,1.0,"Калуга ТРЦ ""XXI век""",15,Win Pro 8 32-bit/64-bit Russian VUP Russia Onl...,7820,Программы - Для дома и офиса,75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,2013-01-01,0,149.0,1.0,"Химки ТЦ ""Мега""",54,ШЕРЛОК. СЕЗОН 1 (BD),21856,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,2014-12-31,23,849.0,1.0,"Коломна ТЦ ""Рио""",16,"Disney. Infinity 2.0 (Marvel). Персонаж ""Желез...",2867,Игры - Аксессуары для игр,25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,2014-12-31,23,1799.0,1.0,"Москва МТРЦ ""Афи Молл""",21,"Disney. Infinity 2.0 (Marvel). Набор ""2+1"": ""С...",2860,Игры - Аксессуары для игр,25,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,2014-12-31,23,1799.0,1.0,"Уфа ТК ""Центральный""",52,"Sims 4 [PC, русская версия]",6503,Игры PC - Стандартные издания,30,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,2014-12-31,23,699.0,1.0,"Москва ТЦ ""МЕГА Белая Дача II""",27,Кулон на цепочке Minecraft Creeper,13746,Подарки - Сувениры (в навеску),70,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [11]:
feature_extraction = ColumnDropper()

Xy_train_extracted_v1 = feature_extraction.fit_transform(Xy_train)
Xy_train_extracted_v1

Unnamed: 0,date_block_num,item_price,item_cnt_day,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,0,149.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,0,3889.5,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,0,349.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,0,2290.0,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,0,149.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,1.0,2,12,2014,1,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,23,1799.0,1.0,2,12,2014,1,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,23,1799.0,1.0,2,12,2014,1,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,23,699.0,1.0,2,12,2014,1,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### Validation class

In [6]:
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

class ModelValidation():
    """Expanding-window validation; rows of X and y must already be sorted by time."""

    def __init__(self, X, y, model, verbose=1):
        self.X = X
        self.y = y
        self.model = model
        self.verbose = verbose

    def validate(self, n_splits):
        self.scores = []

        tscv = TimeSeriesSplit(n_splits=n_splits)
        for i, (train_index, valid_index) in enumerate(tscv.split(self.X)):
            if self.verbose:
                print(f"Model: {i}")
            X_train = self.X.iloc[train_index]
            y_train = self.y.iloc[train_index]

            X_valid = self.X.iloc[valid_index]
            y_valid = self.y.iloc[valid_index]

            # Refit on the expanding training window and store the fold RMSE
            self.model.fit(X_train, y_train)
            predictions = self.model.predict(X_valid)
            self.scores.append(root_mean_squared_error(y_valid, predictions))
        if self.verbose:
            print("Validation Completed!")

        return self

In [13]:
X_train = Xy_train_extracted_v1.drop(["item_cnt_day"], axis="columns")
y_train = Xy_train_extracted_v1.loc[:, "item_cnt_day"]
X_train

Unnamed: 0,date_block_num,item_price,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,...,group_Чистые носители (штучные),group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ
57384,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48401,0,3889.5,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74546,0,349.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
48405,0,2290.0,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
74531,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2315780,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2279605,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2293961,23,699.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [14]:
y_train

57384      1.0
48401      1.0
74546      1.0
48405      1.0
74531      1.0
          ... 
2270005    1.0
2315780    1.0
2279605    1.0
2293961    1.0
2202599    1.0
Name: item_cnt_day, Length: 2323364, dtype: float64

In [15]:
from sklearn.tree import DecisionTreeRegressor

validation = ModelValidation(X_train, y_train, DecisionTreeRegressor())
validation.validate(5)

Model: 0
Model: 1
Model: 2
Model: 3
Model: 4
Validation Completed!


<__main__.ModelValidation at 0x13ab1181f10>

In [16]:
validation.scores

[np.float64(1.2868056579133087),
 np.float64(1.919671249345645),
 np.float64(1.731964342615631),
 np.float64(1.8685851312721657),
 np.float64(2.040391856139683)]
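
The five numbers above are per-fold RMSE values. A single summary figure is their mean, and the spread shows how much the error varies between folds. The sketch below only uses the standard library. The `rmse` helper is included to recall what each fold score measures and is not the scikit-learn function used in the notebook.

```python
from math import sqrt
from statistics import mean, pstdev


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Fold scores copied from the output above
fold_scores = [
    1.2868056579133087,
    1.919671249345645,
    1.731964342615631,
    1.8685851312721657,
    2.040391856139683,
]

mean_score = mean(fold_scores)
score_spread = pstdev(fold_scores)
print(f"mean RMSE: {mean_score:.4f}, spread across folds: {score_spread:.4f}")
```

The hyperparameter search later in this notebook summarises the folds in the same way, by averaging `validation.scores`.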

## Part 2. Model Building 

In this section, we will build an improved feature selection method and use the more useful features it selects to train our first models.

## Feature Selection

### Important info!

As I've mentioned in previous notebooks, our dataset does not contain the target explicitly. Our task is to predict sales aggregated by month. So we have two approaches to training the model, and the choice determines which features we select:

- We predict daily sales for each item, as they appear in our dataset, and then aggregate the predictions by month. In this approach we need to:
	1. Find the best features
	2. Train a model on these features
	3. Write an aggregation class to aggregate the results

- We aggregate the data first and predict on the aggregated data, so the target is explicit. In this approach we need to:
	1. Aggregate data by month
	2. Find the best features
	3. Train a model on these features

For this task and this dataset, I think the second approach is the better choice, for several reasons:
1. Our raw data is very sparse and has many gaps in the dates. After aggregation by month, the dataset shrinks only by a factor of __~1.8__!
2. Because of this sparsity, it would be hard to create useful daily lags, which are very promising features
3. The raw data is also imbalanced. In most records the target is simply 1, which makes it harder for models to predict the target
4. Finally, all errors that the model produces for individual days would be summed up during aggregation, which would increase the error even more

So in this notebook, we will mostly focus on the second approach.
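
The factor in the first reason can be checked with simple arithmetic. The row counts below are read off the outputs in this notebook: 2,935,772 daily rows after preprocessing, and 1,609,112 monthly rows, assuming the aggregated frame has exactly one row per index value from 0 to 1609111.

```python
daily_rows = 2_935_772    # count reported by preprocessed_train.describe()
monthly_rows = 1_609_112  # assumed from the last index (1609111) of the aggregated frame

reduction_factor = daily_rows / monthly_rows
print(f"{reduction_factor:.2f} daily records per (month, shop, item) combination on average")
```

In other words, in the months where an item sells at all in a given shop, it is sold on fewer than two days on average, so a daily series for one shop-item pair consists mostly of gaps.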

For both approaches we will write a voting selector, which runs several feature selection algorithms and keeps the most promising features. The selected features will then be passed to Boruta to make the final choice.

In [7]:
from sklearn.feature_selection import SelectKBest, r_regression, mutual_info_regression, f_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from itertools import compress

class VoitingSelector():
    """Run several feature selectors and keep the features that enough of them vote for."""

    def __init__(self):
        self.votes = None
        self.selectors = {
            "pearson": self._select_pearson,
            "vif": self._select_vif,
            "mi": self._select_mi,
            "anova": self._select_anova,
        }

    @staticmethod
    def _select_pearson(X, y=None, **kwargs):
        # Note: r_regression scores are signed, so the top k favours positive correlations
        selector = SelectKBest(r_regression, k=kwargs.get("n_features_to_select", 20)).fit(X, y)
        return selector.get_feature_names_out()

    @staticmethod
    def _select_mi(X, y=None, **kwargs):
        selector = SelectKBest(mutual_info_regression, k=kwargs.get("n_features_to_select", 20)).fit(X, y)
        return selector.get_feature_names_out()

    @staticmethod
    def _select_vif(X, y=None, **kwargs):
        # Keep features whose variance inflation factor does not exceed the threshold
        return [
            X.columns[feature_index]
            for feature_index in range(len(X.columns))
            if variance_inflation_factor(X.values, feature_index) <= kwargs.get("vif_threshold", 5)
        ]

    @staticmethod
    def _select_anova(X, y=None, **kwargs):
        selector = SelectKBest(f_regression, k=kwargs.get("n_features_to_select", 20)).fit(X, y)
        return selector.get_feature_names_out()

    def select(self, X, y, voting_threshold=0.5, **kwargs):
        votes = []
        for selector_name, selector_method in self.selectors.items():
            features_to_keep = selector_method(X, y, **kwargs)
            # One row per selector: 1 if the selector kept the feature, 0 otherwise
            votes.append(
                pd.DataFrame([int(feature in features_to_keep) for feature in X.columns]).T
            )
            print(f"{selector_name} calculation completed!")
        self.votes = pd.concat(votes)
        self.votes.columns = X.columns
        self.votes.index = self.selectors.keys()
        # Keep features whose share of votes reaches the threshold
        features_to_keep = list(compress(X.columns, self.votes.mean(axis=0) >= voting_threshold))
        return X[features_to_keep]
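
The voting rule in `select` can be illustrated on a toy example without running the selectors. The vote table below is invented: each selector either keeps a feature (1) or not (0), and a feature survives when its share of votes reaches `voting_threshold`. The helper names are hypothetical.

```python
# Invented votes: keys are selectors, list positions follow feature_names
feature_names = ["item_price", "month", "year", "shop_type"]
votes = {
    "pearson": [1, 1, 0, 0],
    "vif":     [1, 0, 0, 1],
    "mi":      [1, 1, 0, 0],
    "anova":   [1, 0, 1, 0],
}


def vote_share(votes, feature_names):
    """Return the fraction of selectors that kept each feature."""
    n_selectors = len(votes)
    return {
        name: sum(row[position] for row in votes.values()) / n_selectors
        for position, name in enumerate(feature_names)
    }


def select_by_votes(votes, feature_names, voting_threshold=0.5):
    """Keep features whose vote share is at least voting_threshold."""
    shares = vote_share(votes, feature_names)
    return [name for name in feature_names if shares[name] >= voting_threshold]


print(vote_share(votes, feature_names))
print(select_by_votes(votes, feature_names))
```

With four selectors and the default threshold of 0.5, a feature needs at least two votes to be kept.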


### Second Approach: Aggregated data

First, before finding the best features, we need to aggregate our data by month. We already have our data pipelines, but after aggregation we also need to make sure that the data is still valid input for them. To make it possible to transform aggregated data with the pipeline, we fill the `date` column with the first day of each month (this imputation makes the `weekday` column useless, but we will delete it during feature selection anyway).

In [8]:
# Map every date_block_num (0..33) to the first day of its month
date_range = pd.date_range(start="2013-01-01", periods=34, freq="MS")
date_blocks = list(range(34))

dates_map = dict(zip(date_blocks, date_range))

sales_train = pd.read_csv("../data/sales_train.csv")
train = etl_pipeline.fit_transform(sales_train)

# Monthly aggregation: most frequent price and total sales per (month, shop, item)
aggregated_train = train.drop(["date"], axis="columns")
aggregated_train = (
    aggregated_train
    .groupby(["date_block_num", "shop_id", "item_id"])
    .agg({"item_price": lambda x: x.mode()[0], "item_cnt_day": "sum"})
    .reset_index()
)
aggregated_train["date"] = aggregated_train["date_block_num"].map(dates_map)
aggregated_train

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day,date
0,0,0,32,221.0,6.0,2013-01-01
1,0,0,33,347.0,3.0,2013-01-01
2,0,0,35,247.0,1.0,2013-01-01
3,0,0,43,221.0,1.0,2013-01-01
4,0,0,51,127.0,2.0,2013-01-01
...,...,...,...,...,...,...
1609108,33,59,22087,119.0,6.0,2015-10-01
1609109,33,59,22088,119.0,2.0,2015-10-01
1609110,33,59,22091,179.0,1.0,2015-10-01
1609111,33,59,22100,629.0,1.0,2015-10-01
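
As a sanity check on the mapping used above, the first day of the month for a given `date_block_num` can also be computed directly, assuming block 0 is January 2013 as in this dataset. The function name `block_to_month_start` is hypothetical.

```python
from datetime import date


def block_to_month_start(date_block_num, first_year=2013):
    """Convert a consecutive month number (0 = January of first_year) to a date."""
    year = first_year + date_block_num // 12
    month = date_block_num % 12 + 1
    return date(year, month, 1)


dates_map_check = {block: block_to_month_start(block) for block in range(34)}
print(dates_map_check[0], dates_map_check[33])
```

The last block maps to October 2015, which agrees with the `date` values shown for `date_block_num` 33 in the table above.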


In [9]:
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline_agg.pkl", "rb") as f:
    eda_pipeline = load(f)

etl_eda_pipeline = Pipeline([
    ("etl", etl_pipeline),
    ("eda", eda_pipeline)
    ])

aggregated_train = etl_eda_pipeline.fit_transform(aggregated_train)
aggregated_train

(1608974, 5)
(1608974, 13)
(1608974, 13)


Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1,32,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
1,0,347.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1 (BD),33,Кино - Blu-Ray,37,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
2,0,247.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,10 ЛЕТ СПУСТЯ,35,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
3,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 МИЛЛИОНОВ ЕВРО,43,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
4,0,127.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 лучших произведений классики (mp3-CD) (Dig...,51,Музыка - MP3,57,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.379644,2.216136
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609108,33,119.0,6.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR03-BC2,22087,Элементы питания,83,...,0,1,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1609109,33,119.0,2.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR06-BC2,22088,Элементы питания,83,...,0,1,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1609110,33,179.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL TURBO LR 03 2*BL,22091,Элементы питания,83,...,0,1,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1609111,33,629.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Энциклопедия Adventure Time,22100,"Книги - Артбуки, энциклопедии",42,...,0,1,0,1.0,0.0,0.0,0.0,1.952846,1.730931,2.208729


In [10]:
aggregated_train["is_NewYear"] = aggregated_train["date"].apply(lambda x : 1 if x.month == 12 else 0)
aggregated_train["is_OctoberSales"] = aggregated_train["date"].apply(lambda x : 1 if x.month == 10 else 0)

As I've mentioned before, this aggregation makes the `weekday` feature meaningless, so we can drop it.

In [11]:
aggregated_train = aggregated_train.drop("weekday", axis="columns")
aggregated_train

Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1,32,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
1,0,347.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1 (BD),33,Кино - Blu-Ray,37,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
2,0,247.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,10 ЛЕТ СПУСТЯ,35,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
3,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 МИЛЛИОНОВ ЕВРО,43,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
4,0,127.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 лучших произведений классики (mp3-CD) (Dig...,51,Музыка - MP3,57,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.379644,2.216136
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609108,33,119.0,6.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR03-BC2,22087,Элементы питания,83,...,0,1,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1609109,33,119.0,2.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR06-BC2,22088,Элементы питания,83,...,0,1,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1609110,33,179.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL TURBO LR 03 2*BL,22091,Элементы питания,83,...,0,1,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1609111,33,629.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Энциклопедия Adventure Time,22100,"Книги - Артбуки, энциклопедии",42,...,0,1,0,1.0,0.0,0.0,0.0,1.952846,1.730931,2.208729


Now we will create new train/validation and test sets. We will try to predict the last month we have from the information about the previous months.

In [12]:
Xy_train = aggregated_train[aggregated_train["date_block_num"] < 33]
Xy_test = aggregated_train[aggregated_train["date_block_num"] == 33]

In [13]:
Xy_train

Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1,32,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
1,0,347.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1 (BD),33,Кино - Blu-Ray,37,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
2,0,247.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,10 ЛЕТ СПУСТЯ,35,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
3,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 МИЛЛИОНОВ ЕВРО,43,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
4,0,127.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 лучших произведений классики (mp3-CD) (Dig...,51,Музыка - MP3,57,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.379644,2.216136
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1577578,32,119.0,3.0,2015-09-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR03-BC2,22087,Элементы питания,83,...,0,0,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1577579,32,119.0,1.0,2015-09-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR06-BC2,22088,Элементы питания,83,...,0,0,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1577580,32,179.0,3.0,2015-09-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL TURBO LR 03 2*BL,22091,Элементы питания,83,...,0,0,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1577581,32,629.0,1.0,2015-09-01,"Ярославль ТЦ ""Альтаир""",59,Энциклопедия Adventure Time,22100,"Книги - Артбуки, энциклопедии",42,...,0,0,0,1.0,0.0,0.0,0.0,1.952846,1.730931,2.208729


Next, we will preprocess the features and drop those that are not numerical.

In [14]:
column_dropper = ColumnDropper()
Xy_train = column_dropper.fit_transform(Xy_train)
Xy_train.columns


Index(['date_block_num', 'item_price', 'item_cnt_day', 'item_price_lag_1',
       'item_cnt_day_lag_1', 'item_price_lag_2', 'item_cnt_day_lag_2',
       'item_price_lag_3', 'item_cnt_day_lag_3', 'item_price_lag_4',
       'item_cnt_day_lag_4', 'month', 'year', 'is_NewYear', 'is_OctoberSales',
       'price_category', 'price_category_0', 'price_category_1',
       'price_category_2', 'price_category_3', 'city_name', 'group',
       'shop_type'],
      dtype='object')

In [15]:
X_train = Xy_train.drop("item_cnt_day", axis="columns")
y_train = Xy_train.loc[:, "item_cnt_day"]

In [None]:
voiting_selector = VoitingSelector()
features_to_keep_agg = voiting_selector.select(X_train, y_train)
features_to_keep_agg

pearson calculation completed!


  vif = 1. / (1. - r_squared_i)


vif calculation completed!
mi calculation completed!
anova calculation completed!


Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,0,221.0,221.0,6.0,221.0,6.0,221.0,6.0,221.0,6.0,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
1,0,347.0,347.0,3.0,347.0,3.0,347.0,3.0,347.0,3.0,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
2,0,247.0,247.0,1.0,247.0,1.0,247.0,1.0,247.0,1.0,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
3,0,221.0,221.0,1.0,221.0,1.0,221.0,1.0,221.0,1.0,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
4,0,127.0,127.0,2.0,127.0,2.0,127.0,2.0,127.0,2.0,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.379644,2.216136
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1577578,32,119.0,119.0,2.0,119.0,5.0,119.0,1.0,119.0,2.0,...,0,0,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1577579,32,119.0,119.0,7.0,119.0,7.0,119.0,4.0,119.0,3.0,...,0,0,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1577580,32,179.0,159.0,1.0,139.0,1.0,139.0,10.0,109.0,1.0,...,0,0,1,0.0,1.0,0.0,0.0,1.952846,4.901018,2.208729
1577581,32,629.0,629.0,1.0,629.0,1.0,629.0,1.0,629.0,1.0,...,0,0,0,1.0,0.0,0.0,0.0,1.952846,1.730931,2.208729
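
The fragment `vif = 1. / (1. - r_squared_i)` in the output above appears to be the tail of a warning raised inside the VIF computation. The variance inflation factor of feature $i$ is defined from the coefficient of determination $R_i^2$ obtained by regressing feature $i$ on all the other features:

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

When a feature is an exact linear combination of the others, $R_i^2 = 1$ and the denominator becomes zero, so the factor is infinite and the feature fails the `vif_threshold` check. In this dataset that is plausible for columns such as `price_category` and its one-hot encodings `price_category_0` to `price_category_3`, although the output does not say which column triggered the warning.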


In [None]:
features_to_keep_agg.columns

Index(['date_block_num', 'item_price', 'item_price_lag_1',
       'item_cnt_day_lag_1', 'item_price_lag_2', 'item_cnt_day_lag_2',
       'item_price_lag_3', 'item_cnt_day_lag_3', 'item_price_lag_4',
       'item_cnt_day_lag_4', 'month', 'is_NewYear', 'is_OctoberSales',
       'price_category', 'price_category_0', 'price_category_1',
       'price_category_2', 'price_category_3', 'city_name', 'group',
       'shop_type'],
      dtype='object')

In [None]:
from boruta.boruta_py import BorutaPy
from sklearn.ensemble import RandomForestRegressor

boruta = BorutaPy(RandomForestRegressor(max_depth=5, n_jobs=-1), n_estimators="auto", verbose=2, random_state=52)

X_train = boruta.fit_transform(X_train.loc[:, features_to_keep_agg.columns] , y=y_train, return_df=True)
X_train

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	9 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	10 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	11 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	12 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	13 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	14 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	15 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
Iteration: 	16 / 100
Confirmed: 	10
Tentative: 	4
Rejected: 	7
I

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group
0,0,221.0,221.0,6.0,221.0,6.0,221.0,6.0,221.0,6.0,1,0,1.741761
1,0,347.0,347.0,3.0,347.0,3.0,347.0,3.0,347.0,3.0,1,0,1.741761
2,0,247.0,247.0,1.0,247.0,1.0,247.0,1.0,247.0,1.0,1,0,1.741761
3,0,221.0,221.0,1.0,221.0,1.0,221.0,1.0,221.0,1.0,1,0,1.741761
4,0,127.0,127.0,2.0,127.0,2.0,127.0,2.0,127.0,2.0,1,0,1.379644
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1577578,32,119.0,119.0,2.0,119.0,5.0,119.0,1.0,119.0,2.0,9,0,4.901018
1577579,32,119.0,119.0,7.0,119.0,7.0,119.0,4.0,119.0,3.0,9,0,4.901018
1577580,32,179.0,159.0,1.0,139.0,1.0,139.0,10.0,109.0,1.0,9,0,4.901018
1577581,32,629.0,629.0,1.0,629.0,1.0,629.0,1.0,629.0,1.0,9,0,1.730931


## Hyperparameter Optimization  

In [16]:
features_to_keep_agg = ["date_block_num", "item_price", "item_price_lag_1", "item_cnt_day_lag_1", "item_price_lag_2",
                        "item_cnt_day_lag_2", "item_price_lag_3", "item_cnt_day_lag_3", "item_price_lag_4","item_cnt_day_lag_4", "month", "is_NewYear", "group"]

X_train = X_train.loc[:, features_to_keep_agg]
X_train

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group
0,0,221.0,221.0,6.0,221.0,6.0,221.0,6.0,221.0,6.0,1,0,1.741761
1,0,347.0,347.0,3.0,347.0,3.0,347.0,3.0,347.0,3.0,1,0,1.741761
2,0,247.0,247.0,1.0,247.0,1.0,247.0,1.0,247.0,1.0,1,0,1.741761
3,0,221.0,221.0,1.0,221.0,1.0,221.0,1.0,221.0,1.0,1,0,1.741761
4,0,127.0,127.0,2.0,127.0,2.0,127.0,2.0,127.0,2.0,1,0,1.379644
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1577578,32,119.0,119.0,2.0,119.0,5.0,119.0,1.0,119.0,2.0,9,0,4.901018
1577579,32,119.0,119.0,7.0,119.0,7.0,119.0,4.0,119.0,3.0,9,0,4.901018
1577580,32,179.0,159.0,1.0,139.0,1.0,139.0,10.0,109.0,1.0,9,0,4.901018
1577581,32,629.0,629.0,1.0,629.0,1.0,629.0,1.0,629.0,1.0,9,0,1.730931


In [23]:
import numpy as np
from hyperopt import hp, tpe, fmin, Trials
from xgboost import XGBRegressor

space = {
        'min_child_weight': hp.choice("min_child_weight", range(1, 20)),
        'gamma': hp.uniform("gamma ", 0.5, 10),
        'subsample': hp.uniform("subsample", 0.5, 1),
        'colsample_bytree': hp.uniform("colsample_bytree", 0.5, 1),
        'max_depth': hp.choice('max_depth', range(5, 1000))
        }

def objective(params):
    regressor = XGBRegressor(**params)
    score = ModelValidation(X_train, y_train, regressor, verbose=0).validate(5)
    return sum(score.scores) / len(score.scores)

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

print("Best hyperparams:", best)

100%|██████████| 50/50 [1:01:12<00:00, 73.46s/trial, best loss: 4.1365404061424025]
Best hyperparams: {'colsample_bytree': np.float64(0.7071327683598174), 'gamma ': np.float64(7.2639443211886245), 'max_depth': np.int64(8), 'min_child_weight': np.int64(17), 'subsample': np.float64(0.9126552650241303)}
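One caveat when reading this output: for parameters defined with `hp.choice`, `fmin` reports the index of the selected option rather than the option itself, so `'max_depth': 8` refers to the ninth element of `range(5, 1000)`. Hyperopt provides `space_eval(space, best)` for this conversion. The sketch below does the same lookup by hand for the two `hp.choice` parameters of the XGBoost search; the helper name `decode_choices` is an assumption made for illustration.

```python
# Options passed to hp.choice in the XGBoost search space above
choice_options = {
    "min_child_weight": range(1, 20),
    "max_depth": range(5, 1000),
}

# Values reported by fmin for those two parameters (indices, not values)
best_indices = {"min_child_weight": 17, "max_depth": 8}


def decode_choices(best, options):
    """Map each hp.choice index back to the option it points at (hypothetical helper)."""
    return {name: options[name][int(index)] for name, index in best.items()}


decoded = decode_choices(best_indices, choice_options)
print(decoded)  # {'min_child_weight': 18, 'max_depth': 13}
```

The continuous parameters (`subsample`, `colsample_bytree` and the gamma value) come from `hp.uniform` and are reported as sampled, so they need no decoding.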


In [30]:
import numpy as np
from hyperopt import hp, tpe, fmin, Trials
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

space = {
    'max_depth': hp.choice('max_depth', range(5, 1000)),
    'min_samples_split': hp.choice('min_samples_split', range(2, 10)),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(1, 10)),
}

def objective(params):
    regressor = DecisionTreeRegressor(**params)
    score = ModelValidation(X_train, y_train, regressor, verbose=0).validate(5)
    return sum(score.scores) / len(score.scores)

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

print("Best hyperparams:", best)


100%|██████████| 50/50 [17:06<00:00, 20.52s/trial, best loss: 4.268770003904591]
Best hyperparams: {'max_depth': np.int64(630), 'min_samples_leaf': np.int64(8), 'min_samples_split': np.int64(3)}


In [None]:
import numpy as np
from hyperopt import hp, tpe, fmin, Trials
from sklearn.ensemble import RandomForestRegressor

space = {
    'n_estimators' : hp.choice('n_estimators', range(5, 1000)),
    'max_depth': hp.choice('max_depth', range(5, 1000)),
    'min_samples_split': hp.choice('min_samples_split', range(2, 10)),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(1, 10)),
}



def objective(params):
    regressor = RandomForestRegressor(**params, n_jobs=-1)
    score = ModelValidation(X_train, y_train, regressor, verbose=0).validate(5)
    return sum(score.scores) / len(score.scores)

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)

print("Best hyperparams:", best)


100%|██████████| 50/50 [35:41:57<00:00, 2570.35s/trial, best loss: 3.8769593703599083]  
Best hyperparams: {'max_depth': np.int64(432), 'min_samples_leaf': np.int64(2), 'min_samples_split': np.int64(7), 'n_estimators': np.int64(333)}


## Creating Submissions

After testing different models, we can create our submissions. As noted earlier, the test set differs slightly from the train set, so we need to apply some transformations first.

In [17]:
test_X = pd.read_csv("../data/test.csv", index_col=0)
test_X

ID,shop_id,item_id
0,5,5037
1,5,5320
2,5,5233
3,5,5232
4,5,5268
...,...,...
214195,45,18454
214196,45,16188
214197,45,15757
214198,45,19648


As you can see, the test set has only the `shop_id` and `item_id` columns, with one row per shop-item pair for the month to predict. To feed it into our previous pipelines, we need to add some columns:
1. `date_block_num`: since we predict the month that follows the train set, this column is `34` for every record.
2. `date`: we simply use the first day of the month. Block 33 is October 2015, so block 34 is November 2015; the cell below sets 1 November 2014, and of the date-derived features only `month` and `is_NewYear` are kept for the final models.
3. `item_price`: this is a very important feature in our set, so we need to fill in its values. I've decided to fill it with the mode of the price in the train set, and to put -1 for shop-item pairs that appear only in the test set.

In [18]:
test_X["date_block_num"] = 34
test_X["date"] = pd.to_datetime("01.11.2014", dayfirst=True)

item_price_map = sales_train.loc[:, ["item_id", "shop_id", "item_price"]].groupby(["item_id", "shop_id"]).agg(lambda x : x.mode()[0]).to_dict()["item_price"]

test_X["item_price"] = test_X.apply(lambda x : item_price_map[(x["item_id"], x["shop_id"])] if (x["item_id"], x["shop_id"]) in item_price_map.keys() else -1, axis=1)
test_X

ID,shop_id,item_id,date_block_num,date,item_price
0,5,5037,34,2014-11-01,1999.0
1,5,5320,34,2014-11-01,-1.0
2,5,5233,34,2014-11-01,599.0
3,5,5232,34,2014-11-01,599.0
4,5,5268,34,2014-11-01,-1.0
...,...,...,...,...,...
214195,45,18454,34,2014-11-01,199.0
214196,45,16188,34,2014-11-01,-1.0
214197,45,15757,34,2014-11-01,199.0
214198,45,19648,34,2014-11-01,-1.0
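
The row-wise `apply` above does one dictionary lookup per test row. A left merge can give the same result in a vectorized way. The following is a minimal sketch on made-up data; the frames `toy_train` and `toy_test` are illustrative stand-ins, not the competition files.

```python
import pandas as pd

# Toy stand-ins for sales_train and test (values are invented for illustration)
toy_train = pd.DataFrame({
    "shop_id":    [5, 5, 5, 45],
    "item_id":    [5037, 5037, 5037, 18454],
    "item_price": [1999.0, 1999.0, 1499.0, 199.0],
})
toy_test = pd.DataFrame(
    {"shop_id": [5, 5, 45], "item_id": [5037, 5320, 18454]},
    index=pd.Index([0, 1, 2], name="ID"),
)

# Most frequent price per (shop, item) pair in the train set
price_mode = (
    toy_train.groupby(["shop_id", "item_id"])["item_price"]
    .agg(lambda prices: prices.mode()[0])
    .reset_index()
)

# Left merge keeps every test row; pairs unseen in train get NaN, then -1
merged = toy_test.reset_index().merge(price_mode, on=["shop_id", "item_id"], how="left")
merged["item_price"] = merged["item_price"].fillna(-1)
merged = merged.set_index("ID")

print(merged["item_price"].tolist())  # [1999.0, -1.0, 199.0]
```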


Then we can split the test set into two sets: known shop-item pairs (__*item_price != -1*__) and unknown ones (__*item_price == -1*__).

In [19]:
test_X_zeros = test_X[test_X["item_price"] == -1]
test_X_non_zeros = test_X[test_X["item_price"] != -1]

Next, in order to use our pipelines, we have to merge the test data with the train data, so that lags and other features are created correctly. To do this, we add a placeholder `item_cnt_day` feature filled with zeros, which gives the test set the same columns as the train set.

In [20]:
test_X_non_zeros.loc[:, "item_cnt_day"] = 0
test_X_non_zeros

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_X_non_zeros.loc[:, "item_cnt_day"] = 0


ID,shop_id,item_id,date_block_num,date,item_price,item_cnt_day
0,5,5037,34,2014-11-01,1999.0,0
2,5,5233,34,2014-11-01,599.0,0
3,5,5232,34,2014-11-01,599.0,0
5,5,5039,34,2014-11-01,2599.0,0
6,5,5041,34,2014-11-01,3999.0,0
...,...,...,...,...,...,...
214192,45,4352,34,2014-11-01,1499.0,0
214193,45,18049,34,2014-11-01,299.0,0
214195,45,18454,34,2014-11-01,199.0,0
214197,45,15757,34,2014-11-01,199.0,0
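
The `SettingWithCopyWarning` shown above appears because `test_X_non_zeros` was produced by boolean filtering, so pandas cannot tell whether the assignment is meant to reach the original frame. Taking an explicit `.copy()` when splitting is a common way to make the intent clear. A small sketch on invented data (the names `known_pairs` and `unknown_pairs` are illustrative):

```python
import pandas as pd

# Invented miniature of test_X after the price lookup
toy_test = pd.DataFrame(
    {"shop_id": [5, 5, 45], "item_id": [5037, 5320, 18454], "item_price": [1999.0, -1.0, 199.0]}
)

# .copy() makes each split an independent frame, so later assignments are unambiguous
unknown_pairs = toy_test[toy_test["item_price"] == -1].copy()
known_pairs = toy_test[toy_test["item_price"] != -1].copy()

# Placeholder target so the frame has the same columns as the train set
known_pairs["item_cnt_day"] = 0

print(known_pairs.shape)                    # (2, 4)
print("item_cnt_day" in toy_test.columns)   # False
```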


Next, we aggregate the train set by month.

In [21]:
date_range = pd.date_range(start="01.01.2013", periods=34, freq="MS")
date_blocks = [i for i in range(0, 34)]

dates_map = dict(zip(date_blocks, date_range))

sales_train = pd.read_csv("../data/sales_train.csv")
train = etl_pipeline.fit_transform(sales_train)
train

aggregated_train = train.drop(["date"], axis="columns")
aggregated_train = aggregated_train.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_price" : lambda x : x.mode()[0], "item_cnt_day": "sum"}).reset_index()
aggregated_train["date"] = aggregated_train["date_block_num"].apply(lambda x : dates_map[x])
aggregated_train

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day,date
0,0,0,32,221.0,6.0,2013-01-01
1,0,0,33,347.0,3.0,2013-01-01
2,0,0,35,247.0,1.0,2013-01-01
3,0,0,43,221.0,1.0,2013-01-01
4,0,0,51,127.0,2.0,2013-01-01
...,...,...,...,...,...,...
1609108,33,59,22087,119.0,6.0,2015-10-01
1609109,33,59,22088,119.0,2.0,2015-10-01
1609110,33,59,22091,179.0,1.0,2015-10-01
1609111,33,59,22100,629.0,1.0,2015-10-01
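
To make the aggregation step concrete, here is the same `groupby` applied to a few invented daily rows: the price becomes the most frequent value within the month and the daily counts are summed. The block-to-date mapping is built as in the cell above, with `Series.map` used as an alternative to `apply`.

```python
import pandas as pd

# Invented daily sales: one pair sold on three days of block 0, another once in block 1
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [0, 0, 0, 0],
    "item_id":        [32, 32, 32, 33],
    "item_price":     [221.0, 221.0, 200.0, 347.0],
    "item_cnt_day":   [3.0, 2.0, 1.0, 3.0],
})

# Block number -> first day of the corresponding month (block 0 is January 2013)
dates_map = dict(enumerate(pd.date_range(start="2013-01-01", periods=34, freq="MS")))

monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"])
    .agg({"item_price": lambda prices: prices.mode()[0], "item_cnt_day": "sum"})
    .reset_index()
)
monthly["date"] = monthly["date_block_num"].map(dates_map)

print(monthly["item_price"].tolist())    # [221.0, 347.0]
print(monthly["item_cnt_day"].tolist())  # [6.0, 3.0]
print(dates_map[33])                     # 2015-10-01 00:00:00
```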


Finally, we can concatenate these two dataframes into one.

In [22]:
merged_dfs = pd.concat([aggregated_train, test_X_non_zeros])
merged_dfs

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day,date
0,0,0,32,221.0,6.0,2013-01-01
1,0,0,33,347.0,3.0,2013-01-01
2,0,0,35,247.0,1.0,2013-01-01
3,0,0,43,221.0,1.0,2013-01-01
4,0,0,51,127.0,2.0,2013-01-01
...,...,...,...,...,...,...
214192,34,45,4352,1499.0,0.0,2014-11-01
214193,34,45,18049,299.0,0.0,2014-11-01
214195,34,45,18454,199.0,0.0,2014-11-01
214197,34,45,15757,199.0,0.0,2014-11-01


After these steps, we can transform the test set correctly, using slightly modified pipelines.


In [24]:
test_preprocessing_pipeline = Pipeline([
    ("etl", etl_eda_pipeline[0][1]),
    ("dtypes", etl_eda_pipeline[0][-1]),
    ("eda", etl_eda_pipeline[1][1:-1])
])

test_preprocessing_pipeline

X_test = test_preprocessing_pipeline.transform(merged_dfs)
X_test

(1720517, 5)
(1720517, 13)
(1720517, 13)


Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,0,221.0,6.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1,32,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
1,0,347.0,3.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,1+1 (BD),33,Кино - Blu-Ray,37,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
2,0,247.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,10 ЛЕТ СПУСТЯ,35,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
3,0,221.0,1.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 МИЛЛИОНОВ ЕВРО,43,Кино - DVD,40,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.741761,2.216136
4,0,127.0,2.0,2013-01-01,"!Якутск Орджоникидзе, 56 фран",0,100 лучших произведений классики (mp3-CD) (Dig...,51,Музыка - MP3,57,...,0,0,1,0.0,1.0,0.0,0.0,2.255956,1.379644,2.216136
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,"LEGO Marvel Super Heroes [PS Vita, русские суб...",4352,Игры - PSVita,22,...,0,0,3,0.0,0.0,0.0,1.0,1.728329,2.563908,2.208729
214193,34,299.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,Резинки для плетения силиконовые Неон желтый N...,18049,Подарки - Сувениры (в навеску),70,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,2.563287,2.208729
214195,34,199.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,СБ. Союз 55,18454,Музыка - CD локального производства,55,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,1.379644,2.208729
214197,34,199.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,НОВИКОВ АЛЕКСАНДР Новая коллекция,15757,Музыка - CD локального производства,55,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,1.379644,2.208729


After these transformations, we keep only the rows for month 34 and select the most useful features identified in the feature selection step. This gives us the final test set.

In [25]:
X_test = X_test[X_test["date_block_num"] == 34]
X_test

Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,34,1999.0,0.0,2014-11-01,"Вологда ТРЦ ""Мармелад""",5,"NHL 15 [PS3, русские субтитры]",5037,Игры - PS3,19,...,0,0,3,0.0,0.0,0.0,1.0,1.773768,2.563908,1.942374
2,34,599.0,0.0,2014-11-01,"Вологда ТРЦ ""Мармелад""",5,"Need for Speed Rivals (Essentials) [PS3, русск...",5233,Игры - PS3,19,...,0,0,0,1.0,0.0,0.0,0.0,1.773768,2.563908,1.942374
3,34,599.0,0.0,2014-11-01,"Вологда ТРЦ ""Мармелад""",5,"Need for Speed Rivals (Classics) [Xbox 360, ру...",5232,Игры - XBOX 360,23,...,0,0,0,1.0,0.0,0.0,0.0,1.773768,2.563908,1.942374
5,34,2599.0,0.0,2014-11-01,"Вологда ТРЦ ""Мармелад""",5,"NHL 15 [Xbox 360, русские субтитры]",5039,Игры - XBOX 360,23,...,0,0,3,0.0,0.0,0.0,1.0,1.773768,2.563908,1.942374
6,34,3999.0,0.0,2014-11-01,"Вологда ТРЦ ""Мармелад""",5,"NHL 16 [PS4, русские субтитры]",5041,Игры - PS4,20,...,0,0,3,0.0,0.0,0.0,1.0,1.773768,2.563908,1.942374
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,"LEGO Marvel Super Heroes [PS Vita, русские суб...",4352,Игры - PSVita,22,...,0,0,3,0.0,0.0,0.0,1.0,1.728329,2.563908,2.208729
214193,34,299.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,Резинки для плетения силиконовые Неон желтый N...,18049,Подарки - Сувениры (в навеску),70,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,2.563287,2.208729
214195,34,199.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,СБ. Союз 55,18454,Музыка - CD локального производства,55,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,1.379644,2.208729
214197,34,199.0,0.0,2014-11-01,"Самара ТЦ ""ПаркХаус""",45,НОВИКОВ АЛЕКСАНДР Новая коллекция,15757,Музыка - CD локального производства,55,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,1.379644,2.208729


In [26]:
column_dropper = ColumnDropper()
X_test = column_dropper.fit_transform(X_test).drop("item_cnt_day", axis="columns")
X_test


Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,34,1999.0,169.0,1.0,169.00,1.0,399.0,1.0,359.0,1.0,...,0,0,3,0.0,0.0,0.0,1.0,1.773768,2.563908,1.942374
2,34,599.0,399.0,1.0,415.92,1.0,699.0,1.0,698.5,1.0,...,0,0,0,1.0,0.0,0.0,0.0,1.773768,2.563908,1.942374
3,34,599.0,149.0,1.0,149.00,1.0,149.0,2.0,149.0,2.0,...,0,0,0,1.0,0.0,0.0,0.0,1.773768,2.563908,1.942374
5,34,2599.0,199.0,1.0,199.00,1.0,199.0,1.0,199.0,1.0,...,0,0,3,0.0,0.0,0.0,1.0,1.773768,2.563908,1.942374
6,34,3999.0,299.0,1.0,299.00,1.0,299.0,1.0,299.0,1.0,...,0,0,3,0.0,0.0,0.0,1.0,1.773768,2.563908,1.942374
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,549.0,1.0,549.00,3.0,549.0,3.0,549.0,3.0,...,0,0,3,0.0,0.0,0.0,1.0,1.728329,2.563908,2.208729
214193,34,299.0,999.0,1.0,999.00,1.0,999.0,1.0,999.0,1.0,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,2.563287,2.208729
214195,34,199.0,349.0,1.0,349.00,1.0,349.0,1.0,399.0,4.0,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,1.379644,2.208729
214197,34,199.0,699.0,1.0,699.00,2.0,749.0,1.0,749.0,2.0,...,0,0,1,0.0,1.0,0.0,0.0,1.728329,1.379644,2.208729


In [27]:
features = ["date_block_num", "item_price", "item_price_lag_1", "item_cnt_day_lag_1", "item_price_lag_2", "item_cnt_day_lag_2", "item_price_lag_3", "item_cnt_day_lag_3", "item_price_lag_4", "item_cnt_day_lag_4", "month", "is_NewYear", "group"]

X_test = X_test.loc[:, features]
X_test

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group
0,34,1999.0,169.0,1.0,169.00,1.0,399.0,1.0,359.0,1.0,11,0,2.563908
2,34,599.0,399.0,1.0,415.92,1.0,699.0,1.0,698.5,1.0,11,0,2.563908
3,34,599.0,149.0,1.0,149.00,1.0,149.0,2.0,149.0,2.0,11,0,2.563908
5,34,2599.0,199.0,1.0,199.00,1.0,199.0,1.0,199.0,1.0,11,0,2.563908
6,34,3999.0,299.0,1.0,299.00,1.0,299.0,1.0,299.0,1.0,11,0,2.563908
...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,549.0,1.0,549.00,3.0,549.0,3.0,549.0,3.0,11,0,2.563908
214193,34,299.0,999.0,1.0,999.00,1.0,999.0,1.0,999.0,1.0,11,0,2.563287
214195,34,199.0,349.0,1.0,349.00,1.0,349.0,1.0,399.0,4.0,11,0,1.379644
214197,34,199.0,699.0,1.0,699.00,2.0,749.0,1.0,749.0,2.0,11,0,1.379644


For the submission, it is better to keep only those shop-item pairs from the train set that occur in the test set, and then perform all transformations.

In [28]:
etl_pipeline = load(open("../pipelines/etl_pipeline_v2.pkl", "rb"))
eda_pipeline = load(open("../pipelines/eda_pipeline_agg.pkl", "rb"))

etl_eda_pipeline = Pipeline([
	("etl", etl_pipeline),
	("eda", eda_pipeline)
])


date_range = pd.date_range(start="01.01.2013", periods=34, freq="MS")
date_blocks = [i for i in range(0, 34)]

dates_map = dict(zip(date_blocks, date_range))

aggregated_train = sales_train.drop(["date"], axis="columns")
aggregated_train = aggregated_train.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_price" : lambda x : x.mode()[0], "item_cnt_day": "sum"}).reset_index()
aggregated_train["date"] = aggregated_train["date_block_num"].apply(lambda x : dates_map[x])

aggregated_train = etl_eda_pipeline.fit_transform(aggregated_train)
aggregated_train

(600062, 5)
(600062, 13)
(600062, 13)


Unnamed: 0,date_block_num,item_price,item_cnt_day,date,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,price_category_2,price_category_3,city_name,group,shop_type
0,0,499.0,1.0,2013-01-01,"Адыгея ТЦ ""Мега""",2,1+1 (BD),33,Кино - Blu-Ray,37,...,0,0,0,1.0,0.0,0.0,0.0,2.329050,1.916430,2.592228
1,0,3300.0,1.0,2013-01-01,"Адыгея ТЦ ""Мега""",2,1С:Бухгалтерия 8. Базовая версия,482,Программы - 1С:Предприятие 8,73,...,0,0,1,0.0,1.0,0.0,0.0,2.329050,2.811258,2.592228
2,0,600.0,1.0,2013-01-01,"Адыгея ТЦ ""Мега""",2,1С:Деньги 8,491,Программы - 1С:Предприятие 8,73,...,0,0,2,0.0,0.0,1.0,0.0,2.329050,2.811258,2.592228
3,0,3300.0,1.0,2013-01-01,"Адыгея ТЦ ""Мега""",2,1С:Упрощенка 8,839,Программы - 1С:Предприятие 8,73,...,0,0,1,0.0,1.0,0.0,0.0,2.329050,2.811258,2.592228
4,0,449.0,1.0,2013-01-01,"Адыгея ТЦ ""Мега""",2,3D Crystal Puzzle Замок XL,1007,Подарки - Развитие,67,...,0,0,2,0.0,0.0,1.0,0.0,2.329050,3.012172,2.592228
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
600154,33,119.0,6.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR03-BC2,22087,Элементы питания,83,...,0,1,0,1.0,0.0,0.0,0.0,2.282003,4.845829,2.592228
600155,33,119.0,2.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL LR06-BC2,22088,Элементы питания,83,...,0,1,0,1.0,0.0,0.0,0.0,2.282003,4.845829,2.592228
600156,33,179.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Элемент питания DURACELL TURBO LR 03 2*BL,22091,Элементы питания,83,...,0,1,0,1.0,0.0,0.0,0.0,2.282003,4.845829,2.592228
600157,33,629.0,1.0,2015-10-01,"Ярославль ТЦ ""Альтаир""",59,Энциклопедия Adventure Time,22100,"Книги - Артбуки, энциклопедии",42,...,0,1,2,0.0,0.0,1.0,0.0,2.282003,2.117657,2.592228
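
The cell above does not show explicitly how the train set is restricted to shop-item pairs from the test set (it may happen inside the loaded pipeline). One way to write such a filter, sketched here on invented data and not necessarily the approach the pipeline takes, is an inner merge on the pair columns:

```python
import pandas as pd

# Invented monthly train rows and test pairs
toy_train = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id":        [2, 2, 2, 59],
    "item_id":        [33, 482, 33, 22087],
    "item_cnt_day":   [1.0, 1.0, 2.0, 6.0],
})
toy_test = pd.DataFrame({"shop_id": [2, 59], "item_id": [33, 999]})

# Unique (shop, item) pairs that need a prediction
test_pairs = toy_test[["shop_id", "item_id"]].drop_duplicates()

# Inner merge keeps only train rows whose pair also occurs in the test set
filtered_train = toy_train.merge(test_pairs, on=["shop_id", "item_id"], how="inner")

print(filtered_train["item_cnt_day"].tolist())  # [1.0, 2.0]
```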


In [29]:
aggregated_train["is_NewYear"] = aggregated_train["date"].apply(lambda x : 1 if x.month == 12 else 0)
aggregated_train["is_OctoberSales"] = aggregated_train["date"].apply(lambda x : 1 if x.month == 10 else 0)

As mentioned before, monthly aggregation makes the `weekday` feature meaningless, so we can drop it (it is excluded when we select `features` below).

In [30]:
X_train = aggregated_train.drop("item_cnt_day", axis="columns")
y_train = aggregated_train.loc[:, "item_cnt_day"]

In [32]:
X_train = X_train.loc[:, features]
X_train

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group
0,0,499.0,499.0,1.0,499.0,1.0,499.0,1.0,499.0,1.0,1,0,1.916430
1,0,3300.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,1,0,2.811258
2,0,600.0,600.0,1.0,600.0,1.0,600.0,1.0,600.0,1.0,1,0,2.811258
3,0,3300.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,1,0,2.811258
4,0,449.0,449.0,3.0,449.0,3.0,449.0,3.0,449.0,3.0,1,0,3.012172
...,...,...,...,...,...,...,...,...,...,...,...,...,...
600154,33,119.0,119.0,3.0,119.0,2.0,119.0,5.0,119.0,1.0,10,0,4.845829
600155,33,119.0,119.0,1.0,119.0,7.0,119.0,7.0,119.0,4.0,10,0,4.845829
600156,33,179.0,179.0,3.0,159.0,1.0,139.0,1.0,139.0,10.0,10,0,4.845829
600157,33,629.0,629.0,1.0,629.0,1.0,629.0,1.0,629.0,1.0,10,0,2.117657


Finally, we can train our models and predict the results.

### XGBRegressor 

In [33]:
from xgboost import XGBRegressor

xgb_model = XGBRegressor(colsample_bytree=np.float64(0.7071327683598174), gamma=np.float64(7.2639443211886245), max_depth=np.int64(8), min_child_weight=np.int64(17), subsample=np.float64(0.9126552650241303))
xgb_model.fit(X_train, y_train)

In [34]:
pred = pd.concat([pd.Series(0, index=test_X_zeros.index), pd.Series(xgb_model.predict(X_test), X_test.index)]).sort_index()
pred.name = "item_cnt_month"
pred.index.name = "ID"
pred = pred.clip(lower=0, upper=20)  # bound predictions to [0, 20]
pred.to_csv("../utils/solution_xgboost.csv") #1.77607 score
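
The same assembly steps are repeated for each model in this section, so they could be wrapped in a small helper. The function name `build_submission` and the toy inputs are assumptions for illustration; `Series.clip` bounds the predictions to the `[0, 20]` range used in this notebook.

```python
import numpy as np
import pandas as pd


def build_submission(predictions, known_index, unknown_index, upper=20):
    """Combine model predictions with zeros for unseen shop-item pairs (hypothetical helper)."""
    known = pd.Series(predictions, index=known_index, dtype="float64")
    unknown = pd.Series(0.0, index=unknown_index)
    submission = pd.concat([unknown, known]).sort_index().clip(lower=0, upper=upper)
    submission.name = "item_cnt_month"
    submission.index.name = "ID"
    return submission


# Toy example: IDs 0, 2, 3 have a model prediction, IDs 1 and 4 are unseen pairs
toy_pred = np.array([1.4, -0.3, 27.0])
submission = build_submission(toy_pred, known_index=[0, 2, 3], unknown_index=[1, 4])
print(submission.tolist())  # [1.4, 0.0, 0.0, 20.0, 0.0]
```

With the notebook's objects, the call would take `model.predict(X_test)`, `X_test.index` and `test_X_zeros.index` as arguments.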

### DecisionTreeRegressor

In [35]:
from sklearn.tree import DecisionTreeRegressor

dtr_model = DecisionTreeRegressor(max_depth=np.int64(630), min_samples_leaf=np.int64(8), min_samples_split=np.int64(3))
dtr_model.fit(X_train, y_train)

In [36]:
pred = pd.concat([pd.Series(0, index=test_X_zeros.index), pd.Series(dtr_model.predict(X_test), X_test.index)]).sort_index()
pred.name = "item_cnt_month"
pred.index.name = "ID"
pred = pred.clip(lower=0, upper=20)  # bound predictions to [0, 20]
pred.to_csv("../utils/solution_dtr.csv") #1.80080 score

### RandomForestRegressor

In [47]:
from sklearn.ensemble import RandomForestRegressor

rfr_model = RandomForestRegressor(max_depth=np.int64(432), min_samples_leaf=np.int64(2), min_samples_split=np.int64(7), n_estimators=np.int64(333), n_jobs=-1)
rfr_model.fit(X_train, y_train)

In [37]:
pred = pd.concat([pd.Series(0, index=test_X_zeros.index), pd.Series(rfr_model.predict(X_test), X_test.index)]).sort_index()
pred.name = "item_cnt_month"
pred.index.name = "ID"
pred = pred.clip(lower=0, upper=20)  # bound predictions to [0, 20]
pred.to_csv("../utils/solution_rfr.csv") #1.80790 score

### Saving preprocessed data and models

In [44]:
train_preprocessed = pd.concat([X_train, y_train], axis="columns")
train_preprocessed.to_csv("../data/train_preprocessed.csv")
train_preprocessed

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group,item_cnt_day
0,0,499.0,499.0,1.0,499.0,1.0,499.0,1.0,499.0,1.0,1,0,1.916430,1.0
1,0,3300.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,1,0,2.811258,1.0
2,0,600.0,600.0,1.0,600.0,1.0,600.0,1.0,600.0,1.0,1,0,2.811258,1.0
3,0,3300.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,3300.0,1.0,1,0,2.811258,1.0
4,0,449.0,449.0,3.0,449.0,3.0,449.0,3.0,449.0,3.0,1,0,3.012172,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
600154,33,119.0,119.0,3.0,119.0,2.0,119.0,5.0,119.0,1.0,10,0,4.845829,6.0
600155,33,119.0,119.0,1.0,119.0,7.0,119.0,7.0,119.0,4.0,10,0,4.845829,2.0
600156,33,179.0,179.0,3.0,159.0,1.0,139.0,1.0,139.0,10.0,10,0,4.845829,1.0
600157,33,629.0,629.0,1.0,629.0,1.0,629.0,1.0,629.0,1.0,10,0,2.117657,1.0


In [46]:
X_test.to_csv("../data/test_preprocessed.csv")
X_test

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group
0,34,1999.0,169.0,1.0,169.00,1.0,399.0,1.0,359.0,1.0,11,0,2.563908
2,34,599.0,399.0,1.0,415.92,1.0,699.0,1.0,698.5,1.0,11,0,2.563908
3,34,599.0,149.0,1.0,149.00,1.0,149.0,2.0,149.0,2.0,11,0,2.563908
5,34,2599.0,199.0,1.0,199.00,1.0,199.0,1.0,199.0,1.0,11,0,2.563908
6,34,3999.0,299.0,1.0,299.00,1.0,299.0,1.0,299.0,1.0,11,0,2.563908
...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,549.0,1.0,549.00,3.0,549.0,3.0,549.0,3.0,11,0,2.563908
214193,34,299.0,999.0,1.0,999.00,1.0,999.0,1.0,999.0,1.0,11,0,2.563287
214195,34,199.0,349.0,1.0,349.00,1.0,349.0,1.0,399.0,4.0,11,0,1.379644
214197,34,199.0,699.0,1.0,699.00,2.0,749.0,1.0,749.0,2.0,11,0,1.379644


In [48]:
from pickle import dump

models = {"xgb_model": xgb_model, "rfr_model": rfr_model, "dtr_model": dtr_model}
for name, model in models.items():
    # the context manager closes each file once the model is written
    with open(f"../models/{name}.pkl", "wb") as f:
        dump(model, f)