# Baseline Modeling 

In this notebook, we will create a validation scheme and build a simple baseline model to run on it.

## Loading Custom Modules

This notebook uses the pipelines and transformers from the previous notebooks, so we need to install them first.

In [1]:
%pip install ..\scripts -q
print("Installation completed!")

Note: you may need to restart the kernel to use updated packages.
Installation completed!


## Importing Modules

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from etl.transformers import * # dependencies for etl pipeline

from pickle import dump, load

## Importing Data

In [3]:
item_categories = pd.read_csv("../data/item_categories.csv")
shops = pd.read_csv("../data/shops.csv")
items = pd.read_csv("../data/items.csv")

sales_train = pd.read_csv("../data/sales_train.csv")
test = pd.read_csv("../data/test.csv", index_col=0)

## Loading Pipelines

In [4]:
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline.pkl", "rb") as f:
    eda_pipeline = load(f)

## Data Preprocessing

We can use our pipelines for data preprocessing, but first, let's merge them into a single pipeline.

In [5]:
from sklearn.pipeline import Pipeline

etl_eda_pipeline = Pipeline([
	("etl", etl_pipeline),
	("eda", eda_pipeline)
])

etl_eda_pipeline
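
Because a `Pipeline` is itself an estimator, it can serve as a step inside another `Pipeline`. The sketch below illustrates this with two toy stages built from `FunctionTransformer`. The stage names and the functions `add_one` and `double` are hypothetical stand-ins, not the project's actual ETL and EDA steps.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_one(X):
    # Hypothetical stand-in for an ETL step
    return X + 1


def double(X):
    # Hypothetical stand-in for an EDA step
    return X * 2


stage_a = Pipeline([("add_one", FunctionTransformer(add_one))])
stage_b = Pipeline([("double", FunctionTransformer(double))])

# Nested pipelines run in order: stage_a first, then stage_b
combined = Pipeline([("stage_a", stage_a), ("stage_b", stage_b)])

X_toy = np.array([[1.0], [2.0]])
result = combined.fit_transform(X_toy)
print(result)
```

Each input value is first incremented and then doubled, which confirms that the outer pipeline applies its nested stages in the listed order.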

In [6]:
preprocessed_train = etl_eda_pipeline.transform(sales_train)
preprocessed_train.head()

Unnamed: 0,date,date_block_num,item_price,item_cnt_day,shop_name,shop_id,item_name,item_id,item_category_name,item_category_id,...,group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ,still_opened
0,2013-01-02,0,999.0,1.0,"Ярославль ТЦ ""Альтаир""",59,ЯВЛЕНИЕ 2012 (BD),22154,Кино - Blu-Ray,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
1,2013-01-03,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
2,2013-01-05,0,899.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE The House Of Blue Light LP,2552,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
3,2013-01-06,0,1709.05,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE Who Do You Think We Are LP,2554,Музыка - Винил,58,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4,2013-01-15,0,1099.0,1.0,"Москва ТРК ""Атриум""",25,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),2555,Музыка - CD фирменного производства,56,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0


## Feature Extraction Step

This notebook focuses on building the validation scheme, so let's assume that the preprocessing pipelines already produce useful features and that we only need to drop features with unsuitable types (dates, text, etc.).

For this task, we will write a simple pipeline.

In [7]:
preprocessed_train[preprocessed_train.columns[4]].dtype

dtype('O')

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Keep only int64 and float64 columns; drop dates, text, and other types."""

    def __init__(self):
        self.columns_to_save = list()

    def fit(self, X, y=None):
        # Rebuild the list on every fit so repeated fits do not append duplicates
        self.columns_to_save = [
            feature for feature in X.columns
            if X[feature].dtype in (np.dtype("int64"), np.dtype("float64"))
        ]
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns_to_save]
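
As a quick check of the idea, the toy frame below mixes numeric, text, and date columns. The column values are made up for illustration, and `DataFrame.select_dtypes` is shown only as a built-in way to apply the same int64/float64 filter, not as what the pipeline uses.

```python
import pandas as pd

# Synthetic frame with one date, one text, and two numeric columns
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03"]),
    "shop_name": ["shop A", "shop B"],
    "item_price": [999.0, 899.0],
    "date_block_num": pd.Series([0, 0], dtype="int64"),
})

# Keep only the int64 and float64 columns, preserving their order
numeric_only = toy.select_dtypes(include=["int64", "float64"])
kept_columns = list(numeric_only.columns)
print(kept_columns)
```

Only the two numeric columns survive; the date and text columns are dropped.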

In [9]:
feature_extraction_pipeline = Pipeline([
	("etl_eda_pipeline", etl_eda_pipeline),
	("feature_selector", ColumnDropper())
])

preprocessed_data = feature_extraction_pipeline.fit_transform(sales_train)
preprocessed_data

Unnamed: 0,date_block_num,item_price,item_cnt_day,weekday,month,year,is_NewYear,is_OctoberSales,price_category,price_category_0,...,group_Элементы питания,shop_type_Digital,shop_type_Event,shop_type_Other,shop_type_МТРЦ,shop_type_ТК,shop_type_ТРК,shop_type_ТРЦ,shop_type_ТЦ,still_opened
0,0,999.00,1.0,2,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
1,0,899.00,1.0,3,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
2,0,899.00,1.0,5,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
3,0,1709.05,1.0,6,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4,0,1099.00,1.0,1,1,2013,0,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2935844,33,299.00,1.0,5,10,2015,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2935845,33,299.00,1.0,4,10,2015,0,1,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2935846,33,349.00,1.0,2,10,2015,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2935847,33,299.00,1.0,3,10,2015,0,0,0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1


## Simple Model

For baseline modeling, I will use a `DecisionTreeRegressor` model and train it.

In [11]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model
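
A minimal sketch of fitting and scoring such a baseline follows. It runs on synthetic data; the feature matrix, the target, and the `max_depth` and `random_state` settings are assumptions for illustration, not values from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends linearly on the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.1, size=200)

# Keep row order: the first 150 rows train the model, the last 50 evaluate it
X_fit, X_eval = X_toy[:150], X_toy[150:]
y_fit, y_eval = y_toy[:150], y_toy[150:]

baseline = DecisionTreeRegressor(max_depth=5, random_state=0)
baseline.fit(X_fit, y_fit)

predictions = baseline.predict(X_eval)
rmse = float(np.sqrt(mean_squared_error(y_eval, predictions)))
print(f"RMSE: {rmse:.3f}")
```

Fixing `random_state` keeps the baseline reproducible, which makes later comparisons against stronger models easier to interpret.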

## Model Validation

This project works with time series data, so we need a validation approach that respects time order. In this notebook, I will use `TimeSeriesSplit` from the `sklearn.model_selection` module. It implements an expanding-window scheme: each successive split trains on all earlier samples and validates on the block that follows, which helps us track long-term trends in the data.


In [None]:
from sklearn.model_selection import TimeSeriesSplit
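
To see the expanding window in action, the sketch below splits twelve consecutive time steps with `TimeSeriesSplit`. The array is a synthetic stand-in for ordered periods such as `date_block_num` values, and `n_splits=3` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

periods = np.arange(12)  # synthetic, already sorted in time order

splitter = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, valid_idx in splitter.split(periods):
    # Store plain lists so the folds are easy to inspect
    folds.append((train_idx.tolist(), valid_idx.tolist()))
    print("train:", train_idx.tolist(), "valid:", valid_idx.tolist())
```

Each training window contains every earlier sample, and validation indices always come after the training indices. Since `TimeSeriesSplit` works on row positions, the rows must be sorted chronologically before splitting.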

## Data Validation


## Train / Test / Validation Split