# Baseline Modeling 

In this notebook, we will create a validation schema and train a simple baseline model against it.

## Loading Custom Modules

In this notebook, we will use the pipelines and transformers from the previous notebooks, so we need to install them first.

In [31]:
%pip install ..\scripts -q
print("Installation completed!")

Note: you may need to restart the kernel to use updated packages.
Installation completed!


## Importing Modules

In [32]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from etl.transformers import * # dependencies for etl pipeline

from pickle import dump, load

## Importing Data

In [33]:
item_categories = pd.read_csv("../data/item_categories.csv")
shops = pd.read_csv("../data/shops.csv")
items = pd.read_csv("../data/items.csv")

sales_train = pd.read_csv("../data/sales_train.csv")
test = pd.read_csv("../data/test.csv", index_col=0)

## Loading Pipelines

In [34]:
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline.pkl", "rb") as f:
    eda_pipeline = load(f)

## Data Preprocessing

We can reuse our pipelines for data preprocessing, but first, let's merge them into a single pipeline.

In [35]:
from sklearn.pipeline import Pipeline

etl_eda_pipeline = Pipeline([
	("etl", etl_pipeline),
	("eda", eda_pipeline)
])

etl_eda_pipeline
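
As a rough illustration of how nesting works, the sketch below chains two toy pipelines the same way. The names `toy_etl`, `toy_eda`, and the `revenue` column are hypothetical stand-ins, not the contents of the pickled pipelines.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-ins for the pickled ETL and EDA pipelines
toy_etl = Pipeline([
    ("drop_na", FunctionTransformer(lambda df: df.dropna())),
])
toy_eda = Pipeline([
    ("add_revenue", FunctionTransformer(
        lambda df: df.assign(revenue=df["item_price"] * df["item_cnt_day"])
    )),
])

# A pipeline can be used as a step of another pipeline
toy_pipeline = Pipeline([("etl", toy_etl), ("eda", toy_eda)])

toy_sales = pd.DataFrame({
    "item_price": [100.0, 250.0, None],
    "item_cnt_day": [1.0, 2.0, 1.0],
})

# Steps run in order: the ETL step drops the incomplete row,
# then the EDA step adds the derived column
toy_result = toy_pipeline.fit_transform(toy_sales)
print(toy_result)
```

The outer pipeline simply passes the output of one nested pipeline to the next, so each stage can still be developed and pickled separately.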

In [36]:
preprocessed_train = etl_eda_pipeline.transform(sales_train)
preprocessed_train.head()

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ,still_opened
0,2013-01-02,0,999.0,1.0,"Ярославль ТЦ ""Альтаир""",59,ЯВЛЕНИЕ 2012 (BD),22154,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
1,2013-01-03,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
2,2013-01-05,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
3,2013-01-06,0,1709.05,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE Who Do You Think We Are LP,2554,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4,2013-01-15,0,1099.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),2555,Музыка - CD фирменного производства,56,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0


## Part 1. Validation Schema and General approach to Validation

In this part, I will show the principle by which we will conduct validation and lay the foundation for building a machine learning model later.

__Model Validation__

Since we are working with a time series, validation must respect the order of time periods: a model should always be validated on data that is later than the data it was trained on. Therefore, we will use the following steps to validate the model:

1. We will split the dataset in an ~80:20 ratio into two parts, the `training/validation dataset` and the `test dataset`, by sorting on date and cutting at the beginning of a month.
2. On the training set, using the `Expanding Window` technique, specifically the `sklearn.model_selection.TimeSeriesSplit` splitter, we will generate the training and validation folds, train the models, and calculate the Mean Absolute Error (MAE) on each validation fold.
3. Based on this validation, we will select the best hyperparameters and the best model.
4. After selecting the best hyperparameters and model, we will determine the final result on the `test set`.
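
To make the expanding window in step 2 concrete, here is a minimal sketch on twelve toy rows that are assumed to be already sorted by date. The array `toy_rows` is illustrative only and is not the sales data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows already sorted by date (toy data, not the sales dataset)
toy_rows = np.arange(12)

# Each fold trains on all rows before the validation block
folds = list(TimeSeriesSplit(n_splits=3).split(toy_rows))
for fold, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

The training window grows with every fold while the validation block always lies strictly after it, so no fold is validated on rows that precede its training data.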

__Data Validation__

The data that I am about to provide to the model is produced by the EDA and DQC pipelines, which means that:

1. All datatypes are correct
2. There are no duplicates
3. There are no missing values
4. There are no outliers
5. There is no target leakage, because new features for each object were created from its own attributes, without lags and with only minimal use of aggregation
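
Some of these guarantees are cheap to re-check before modeling. The sketch below counts duplicate rows and missing values on a small, deliberately dirty toy frame. The helper `check_data_quality` is a hypothetical name and is not part of the `etl` package.

```python
import pandas as pd

def check_data_quality(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count duplicate rows and missing values."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
    }

# Toy frame with one repeated row and one missing price
toy_frame = pd.DataFrame({
    "shop_id": [59, 25, 25, 31],
    "item_price": [999.0, 899.0, 899.0, None],
    "item_cnt_day": [1.0, 1.0, 1.0, 1.0],
})

report = check_data_quality(toy_frame)
print(report)
```

On data that has passed through the pipelines, both counts would be expected to be zero.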

### Train/Test

In [37]:
preprocessed_train = preprocessed_train.sort_values(by="date")
preprocessed_train["date"].quantile(0.80)

Timestamp('2015-01-06 00:00:00')

As we can see, the 80th percentile corresponds to the beginning of 2015, so we can split our dataset into 2 parts:

	train - before 2015.01.01
	test - on or after 2015.01.01

In [24]:
Xy_train = preprocessed_train[preprocessed_train["date"] < pd.Timestamp("2015.01.01")]
Xy_test = preprocessed_train[preprocessed_train["date"] >= pd.Timestamp("2015.01.01")]
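
A quick sanity check for a date-based split is to confirm that the two parts do not overlap in time and that no rows were lost. The sketch below does this on a toy frame; `toy`, `cutoff`, and the monthly dates are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one row per month around the cutoff
toy = pd.DataFrame({
    "date": pd.date_range("2014-10-01", periods=6, freq="MS"),
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0, 1.0, 2.0],
})
cutoff = pd.Timestamp("2015-01-01")

toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every training row must precede every test row, and no row may be lost
assert toy_train["date"].max() < toy_test["date"].min()
assert len(toy_train) + len(toy_test) == len(toy)

# Share of rows that ended up in the test part
test_share = len(toy_test) / len(toy)
print(f"test share: {test_share:.2f}")
```

The same three checks can be run on `Xy_train` and `Xy_test` to confirm that the real split is close to the intended 80:20 ratio.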

### Feature Extraction Step

In this notebook, we focus on creating the validation schema, so let's assume that the pipelines we use for data preprocessing already produce useful features, and we only need to drop the features with unsuitable types (dates, text, etc.).

For this task, we will write a simple transformer.

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Keep only int64/float64 columns; drop dates, text, and other dtypes."""

    def __init__(self):
        self.columns_to_save = list()

    def fit(self, X, y=None):
        # Rebuild the list on every fit so that refitting does not accumulate duplicates
        numeric_dtypes = (np.dtype("int64"), np.dtype("float64"))
        self.columns_to_save = [
            feature for feature in X.columns
            if X[feature].dtype in numeric_dtypes
        ]
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns_to_save]
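
For a one-off check of which columns such a filter keeps, pandas offers `DataFrame.select_dtypes`. The sketch below applies it to a toy frame with made-up values. It is an alternative way to inspect the result, not a replacement for the transformer, which remembers the column list from `fit`.

```python
import pandas as pd

# Toy frame mixing a date, a text column, and two numeric columns
toy_frame = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03"]),
    "shop_name": ["shop A", "shop B"],
    "shop_id": [59, 25],
    "item_price": [999.0, 899.0],
})

# Keep only the int64 and float64 columns
numeric_only = toy_frame.select_dtypes(include=["int64", "float64"])
print(list(numeric_only.columns))
```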

In [45]:
Xy_train

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ,still_opened
57384,2013-01-01,0,149.0,1.0,"Казань ТЦ ""ПаркХаус"" II",14,ТАКИЕ РАЗНЫЕ БЛИЗНЕЦЫ (регион),19548,Кино - DVD,40,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
48401,2013-01-01,0,3889.5,1.0,"Калуга ТРЦ ""XXI век""",15,Win Home Basic 7 Russian Russia Only DVD,7814,Программы - Для дома и офиса,75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
74546,2013-01-01,0,349.0,1.0,"Химки ТЦ ""Мега""",54,ШАГ ВПЕРЕД 4,21808,Кино - DVD,40,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
48405,2013-01-01,0,2290.0,1.0,"Калуга ТРЦ ""XXI век""",15,Win Pro 8 32-bit/64-bit Russian VUP Russia Onl...,7820,Программы - Для дома и офиса,75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
74531,2013-01-01,0,149.0,1.0,"Химки ТЦ ""Мега""",54,ШЕРЛОК. СЕЗОН 1 (BD),21856,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,2014-12-31,23,849.0,1.0,"Коломна ТЦ ""Рио""",16,"Disney. Infinity 2.0 (Marvel). Персонаж ""Желез...",2867,Игры - Аксессуары для игр,25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2315780,2014-12-31,23,1799.0,1.0,"Москва МТРЦ ""Афи Молл""",21,"Disney. Infinity 2.0 (Marvel). Набор ""2+1"": ""С...",2860,Игры - Аксессуары для игр,25,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
2279605,2014-12-31,23,1799.0,1.0,"Уфа ТК ""Центральный""",52,"Sims 4 [PC, русская версия]",6503,Игры PC - Стандартные издания,30,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
2293961,2014-12-31,23,699.0,1.0,"Москва ТЦ ""МЕГА Белая Дача II""",27,Кулон на цепочке Minecraft Creeper,13746,Подарки - Сувениры (в навеску),70,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


In [46]:
feature_extraction = ColumnDropper()

Xy_train = feature_extraction.fit_transform(Xy_train)
Xy_train

Unnamed: 0,date_block_num,item_price,item_cnt_day,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,...,group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ,still_opened
57384,0,149.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
48401,0,3889.5,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
74546,0,349.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
48405,0,2290.0,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
74531,0,149.0,1.0,1,1,2013,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,1.0,2,12,2014,1,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2315780,23,1799.0,1.0,2,12,2014,1,0,1,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
2279605,23,1799.0,1.0,2,12,2014,1,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
2293961,23,699.0,1.0,2,12,2014,1,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


### Validation class

In [47]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

class ModelValidation:
    """Expanding-window validation; X and y must already be sorted by date."""

    def __init__(self, X, y, model):
        self.X = X
        self.y = y
        self.model = model

    def validate(self, n_splits):
        # One MAE score per fold, in chronological order
        self.scores = []

        tscv = TimeSeriesSplit(n_splits=n_splits)
        for i, (train_index, valid_index) in enumerate(tscv.split(self.X)):
            print(f"Model: {i}")
            X_train = self.X.iloc[train_index]
            y_train = self.y.iloc[train_index]

            X_valid = self.X.iloc[valid_index]
            y_valid = self.y.iloc[valid_index]

            # Refit on the current training window and score on the block that follows it
            self.model.fit(X_train, y_train)
            predictions = self.model.predict(X_valid)
            self.scores.append(mean_absolute_error(y_valid, predictions))

        print("Validation Completed!")

        return self
        

In [48]:
X_train = Xy_train.drop(["item_cnt_day"], axis="columns")
y_train = Xy_train.loc[:, "item_cnt_day"]
X_train

Unnamed: 0,date_block_num,item_price,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,price_category_1,...,group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ,still_opened
57384,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
48401,0,3889.5,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
74546,0,349.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
48405,0,2290.0,1,1,2013,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
74531,0,149.0,1,1,2013,0,0,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270005,23,849.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2315780,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
2279605,23,1799.0,2,12,2014,1,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
2293961,23,699.0,2,12,2014,1,0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


In [49]:
y_train

57384      1.0
48401      1.0
74546      1.0
48405      1.0
74531      1.0
          ... 
2270005    1.0
2315780    1.0
2279605    1.0
2293961    1.0
2202599    1.0
Name: item_cnt_day, Length: 2323364, dtype: float64

In [51]:
from sklearn.tree import DecisionTreeRegressor

validation = ModelValidation(X_train, y_train, DecisionTreeRegressor())
validation.validate(5)

Model: 0
Model: 1
Model: 2
Model: 3
Model: 4
Validation Completed!


<__main__.ModelValidation at 0x1822ac83950>

In [52]:
validation.scores

[np.float64(0.259038155371974),
 np.float64(0.35397357613793556),
 np.float64(0.3487159600553108),
 np.float64(0.30281569325111535),
 np.float64(0.400746138866492)]
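
To compare models later, it helps to reduce the fold scores to a single summary. A minimal sketch, assuming the mean and standard deviation of the MAE are the summary of interest, with the fold scores copied from the output above:

```python
import numpy as np

# Fold scores copied from the validation output above
fold_scores = [
    0.259038155371974,
    0.35397357613793556,
    0.3487159600553108,
    0.30281569325111535,
    0.400746138866492,
]

mean_mae = float(np.mean(fold_scores))
std_mae = float(np.std(fold_scores))
print(f"MAE: {mean_mae:.3f} +/- {std_mae:.3f}")
```

In the notebook itself, the same summary can be computed directly from `validation.scores`.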