# Predeployment Rituals


In this notebook, I describe the preparations for deployment: the package, the preprocessing pipelines, and data versioning with DVC.

## Package

First of all, I've created a package called `price_pred` (you can find it in the __scripts__ folder) with classes, transformers and pipelines that make the deployment process easier.

A short description of the modules in this package:

- `feature_selection.py` - module with a class for choosing the best features based on different statistics
- `pipelines.py` - module that contains the pipelines for train and test data
- `transformers.py` - module with the transformers that the pipelines are built from
- `validation.py` - module with a class for model validation

The modules `transformers.py` and `pipelines.py` were updated with new transformers and pipelines for test data, which weren't implemented before.

## Pipeline

The next topic is test preprocessing. The raw test data is a DataFrame with only two columns: `item_id` and `shop_id`.

So before passing the data into the model, we need to preprocess it. To do this, we can implement pipelines for the test data as we did for the train data, with a few modifications.

### Train

Although we already have a model trained on the train data, we still need to preprocess the train set before preprocessing the test data, because some test features are based on price and sales statistics from train.

So, for train we just clean it a bit and aggregate it by month:

__Train Pipeline Scheme__:

`Train` ---(ETL)---> `Cleaned Train` ---(Aggregation)---> `Aggregated Train`
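The aggregation step of this scheme can be sketched with plain pandas. This is only an illustration on toy data: the choice of mean price and summed sales per month is an assumption, and the actual transformers in `price_pred` may aggregate differently.

```python
import pandas as pd

# Toy daily sales; column names follow the frames used in this notebook.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [5, 5, 5, 5],
    "item_id": [32, 32, 33, 32],
    "item_price": [200.0, 240.0, 347.0, 221.0],
    "item_cnt_day": [2.0, 4.0, 3.0, 1.0],
})

# Assumption: one row per (month, shop, item) with the mean price
# and the total number of items sold in that month.
monthly = daily.groupby(
    ["date_block_num", "shop_id", "item_id"], as_index=False
).agg(
    item_price=("item_price", "mean"),
    item_cnt_day=("item_cnt_day", "sum"),
)
print(monthly)
```

The two daily rows for item 32 in month 0 collapse into a single monthly row, which is the shape the later lag features rely on.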


### Test

Once train is preprocessed, we can preprocess test and pass it into the model.

__Test Pipeline Scheme:__

1. `Test preprocessing` -> Add features from the raw train set (e.g. item price, sales)
2. `Merge with Train` -> Merge test with the preprocessed train
3. `Feature Extraction` -> Extract new features based on the EDA
4. `Feature Selection` -> Select the best features according to the earlier analysis + Boruta
5. `Test Set Extraction` -> Extract only the test rows from the dataset

Using all these pipelines, we get preprocessed test data that is ready to be passed into the model.
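To make the steps above more concrete, here is a rough pandas sketch of merging test with train, building lag features and extracting the test rows. The toy frames, the `toy_` names and the decision to drop pairs without history are assumptions for illustration. Step 1 is only approximated (test rows get the next month number and a zero sales placeholder) and step 4 is skipped, so this is not the code of `TestPreprocessingPipeline`.

```python
import pandas as pd

# Aggregated train (months 32 and 33) and a raw test frame; values are made up.
toy_train = pd.DataFrame({
    "date_block_num": [32, 33, 33],
    "shop_id": [5, 5, 45],
    "item_id": [5037, 5037, 18454],
    "item_price": [399.0, 169.0, 349.0],
    "item_cnt_day": [1.0, 1.0, 1.0],
})
toy_test = pd.DataFrame({"shop_id": [5, 45, 5], "item_id": [5037, 18454, 5320]})

# Steps 1-2: mark test rows as the next month and stack them under train.
test_block = toy_train["date_block_num"].max() + 1
stacked = pd.concat(
    [toy_train, toy_test.assign(date_block_num=test_block, item_cnt_day=0.0)],
    ignore_index=True,
).sort_values(["shop_id", "item_id", "date_block_num"])

# Step 3: lag features per (shop, item) pair. shift(k) looks k rows back,
# so this assumes a pair has no missing months inside its history.
keys = ["shop_id", "item_id"]
for lag in (1, 2):
    for col in ("item_price", "item_cnt_day"):
        stacked[f"{col}_lag_{lag}"] = stacked.groupby(keys)[col].shift(lag)

# Step 5: keep only the test month. Assumption: pairs that never appeared
# in train have no lag values and are dropped here.
is_test = stacked["date_block_num"] == test_block
test_features = stacked[is_test].dropna(subset=["item_price_lag_1"])
print(test_features)
```

In this sketch the pair (5, 5320) has no history in train, so it disappears from the result. Rows dropped this way need a default prediction later on.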

### Presentation

All these pipelines and the transformers they depend on are implemented in the `transformers.py` and `pipelines.py` modules of the __`price_pred`__ package. In this section I just show the results of their work.

In [9]:
%pip install price_predictions
print("Installation Finished")

Note: you may need to restart the kernel to use updated packages.
Installation Finished


In [None]:
from price_pred.pipelines import TrainPreprocessingPipeline
import pandas as pd

shops = pd.read_csv("../data/shops.csv")
items = pd.read_csv("../data/items.csv")
item_categories = pd.read_csv("../data/item_categories.csv")

sales_train = pd.read_csv("../data/sales_train.csv")
test = pd.read_csv("../data/test.csv")

merge_list = [(shops, "shop_id"), (items, "item_id"), (item_categories, "item_category_id")]
unique_features = ["date", "shop_id", "item_id"]

feature_map = {
    "date": "%d.%m.%Y",
    "date_block_num": "int",
    "shop_id": "O",
    "item_id": "O",
    "item_price": "float",
    "item_cnt_day": "float",
    "shop_name": "O",
    "item_name": "O",
    "item_category_name": "O",
    "item_category_id": "O",
}

start_date = "01.01.2013"
periods = 34

train_pipeline = TrainPreprocessingPipeline(
    unique_features, merge_list, feature_map, start_date, periods
).pipeline
preprocessed_train = train_pipeline.fit_transform(sales_train)
preprocessed_train

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day,date
0,0,0,32,221.0,6.0,2013-01-01
1,0,0,33,347.0,3.0,2013-01-01
2,0,0,35,247.0,1.0,2013-01-01
3,0,0,43,221.0,1.0,2013-01-01
4,0,0,51,127.0,2.0,2013-01-01
...,...,...,...,...,...,...
1609108,33,59,22087,119.0,6.0,2015-10-01
1609109,33,59,22088,119.0,2.0,2015-10-01
1609110,33,59,22091,179.0,1.0,2015-10-01
1609111,33,59,22100,629.0,1.0,2015-10-01


In [11]:
from price_pred.pipelines import TestPreprocessingPipeline
from price_pred.transformers import *
import pandas as pd
from sklearn.pipeline import Pipeline
from pickle import load

features = ["date_block_num", "item_price", "item_price_lag_1", "item_cnt_day_lag_1", "item_price_lag_2", "item_cnt_day_lag_2", "item_price_lag_3", "item_cnt_day_lag_3", "item_price_lag_4", "item_cnt_day_lag_4",
            "month", "is_NewYear", "group"] + ["item_cnt_day"]

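# Load the pickled ETL and EDA pipelines; `with` closes the files afterwards
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline_agg.pkl", "rb") as f:
    eda_pipeline = load(f)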


etl_eda_pipeline = Pipeline([
    ("etl", etl_pipeline),
    ("eda", eda_pipeline)
    ])

test_pipeline = TestPreprocessingPipeline(
    sales_train, start_date, periods, preprocessed_train, features, etl_eda_pipeline
).pipeline
preprocessed_test = test_pipeline.fit_transform(test)
preprocessed_test

(1720517, 5)
(1720517, 29)
(1720517, 29)


Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group,item_cnt_day
0,34,1999.0,169.0,1.0,169.00,1.0,399.0,1.0,359.0,1.0,1,0,2.394519,0.0
2,34,599.0,399.0,1.0,415.92,1.0,699.0,1.0,698.5,1.0,1,0,2.394519,0.0
3,34,599.0,149.0,1.0,149.00,1.0,149.0,2.0,149.0,2.0,1,0,2.394519,0.0
5,34,2599.0,199.0,1.0,199.00,1.0,199.0,1.0,199.0,1.0,1,0,2.394519,0.0
6,34,3999.0,299.0,1.0,299.00,1.0,299.0,1.0,299.0,1.0,1,0,2.394519,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,549.0,1.0,549.00,3.0,549.0,3.0,549.0,3.0,1,0,2.394519,0.0
214193,34,299.0,999.0,1.0,999.00,1.0,999.0,1.0,999.0,1.0,1,0,2.629247,0.0
214195,34,199.0,349.0,1.0,349.00,1.0,349.0,1.0,399.0,4.0,1,0,1.274670,0.0
214197,34,199.0,699.0,1.0,699.00,2.0,749.0,1.0,749.0,2.0,1,0,1.274670,0.0


In [12]:
test

Unnamed: 0,ID,shop_id,item_id,date_block_num,date,item_price,item_cnt_day
0,0,5,5037,34,2013-01-01,1999.0,0
1,1,5,5320,34,2013-01-01,-1.0,0
2,2,5,5233,34,2013-01-01,599.0,0
3,3,5,5232,34,2013-01-01,599.0,0
4,4,5,5268,34,2013-01-01,-1.0,0
...,...,...,...,...,...,...,...
214195,214195,45,18454,34,2013-01-01,199.0,0
214196,214196,45,16188,34,2013-01-01,-1.0,0
214197,214197,45,15757,34,2013-01-01,199.0,0
214198,214198,45,19648,34,2013-01-01,-1.0,0


In [13]:
preprocessed_test

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group,item_cnt_day
0,34,1999.0,169.0,1.0,169.00,1.0,399.0,1.0,359.0,1.0,1,0,2.394519,0.0
2,34,599.0,399.0,1.0,415.92,1.0,699.0,1.0,698.5,1.0,1,0,2.394519,0.0
3,34,599.0,149.0,1.0,149.00,1.0,149.0,2.0,149.0,2.0,1,0,2.394519,0.0
5,34,2599.0,199.0,1.0,199.00,1.0,199.0,1.0,199.0,1.0,1,0,2.394519,0.0
6,34,3999.0,299.0,1.0,299.00,1.0,299.0,1.0,299.0,1.0,1,0,2.394519,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,549.0,1.0,549.00,3.0,549.0,3.0,549.0,3.0,1,0,2.394519,0.0
214193,34,299.0,999.0,1.0,999.00,1.0,999.0,1.0,999.0,1.0,1,0,2.629247,0.0
214195,34,199.0,349.0,1.0,349.00,1.0,349.0,1.0,399.0,4.0,1,0,1.274670,0.0
214197,34,199.0,699.0,1.0,699.00,2.0,749.0,1.0,749.0,2.0,1,0,1.274670,0.0


In [14]:
with open("../models/rfr_model.pkl", "rb") as f:
    model = load(f)
preprocessed_test_X = preprocessed_test.drop("item_cnt_day", axis="columns")
preprocessed_test_X

Unnamed: 0,date_block_num,item_price,item_price_lag_1,item_cnt_day_lag_1,item_price_lag_2,item_cnt_day_lag_2,item_price_lag_3,item_cnt_day_lag_3,item_price_lag_4,item_cnt_day_lag_4,month,is_NewYear,group
0,34,1999.0,169.0,1.0,169.00,1.0,399.0,1.0,359.0,1.0,1,0,2.394519
2,34,599.0,399.0,1.0,415.92,1.0,699.0,1.0,698.5,1.0,1,0,2.394519
3,34,599.0,149.0,1.0,149.00,1.0,149.0,2.0,149.0,2.0,1,0,2.394519
5,34,2599.0,199.0,1.0,199.00,1.0,199.0,1.0,199.0,1.0,1,0,2.394519
6,34,3999.0,299.0,1.0,299.00,1.0,299.0,1.0,299.0,1.0,1,0,2.394519
...,...,...,...,...,...,...,...,...,...,...,...,...,...
214192,34,1499.0,549.0,1.0,549.00,3.0,549.0,3.0,549.0,3.0,1,0,2.394519
214193,34,299.0,999.0,1.0,999.00,1.0,999.0,1.0,999.0,1.0,1,0,2.629247
214195,34,199.0,349.0,1.0,349.00,1.0,349.0,1.0,399.0,4.0,1,0,1.274670
214197,34,199.0,699.0,1.0,699.00,2.0,749.0,1.0,749.0,2.0,1,0,1.274670


In [29]:
part_solution = pd.Series(model.predict(preprocessed_test_X), index=preprocessed_test_X.index)
solution = part_solution.reindex(test.index, fill_value=0)
solution


0         1.818427
1         0.000000
2         1.500610
3         1.688194
4         0.000000
            ...   
214195    1.292814
214196    0.000000
214197    1.705897
214198    0.000000
214199    1.303341
Length: 214200, dtype: float64
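The cell above predicts only for the rows that survived preprocessing and then uses `reindex` to restore the full test index, giving the missing rows a prediction of 0. A toy sketch of that step, with made-up numbers:

```python
import pandas as pd

# Predictions exist only for rows that survived preprocessing (here: 0, 2, 3).
partial = pd.Series([1.8, 1.5, 1.7], index=[0, 2, 3])

# The full test index has five rows; rows 1 and 4 were dropped earlier
# and receive the default prediction of 0.
full_index = pd.RangeIndex(5)
filled = partial.reindex(full_index, fill_value=0)
print(filled.tolist())  # [1.8, 0.0, 1.5, 1.7, 0.0]
```

This keeps the submission the same length as the raw test set, at the cost of predicting zero sales for every shop and item pair without usable features.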

## DVC

For deployment purposes, we will not store our data locally. For this project, I've created cloud storage on Google Drive and connected it to DVC. You can see an example below.

In [None]:
!dvc remote add -d myremote gdrive://1s6sFjCvbZRnmVlMlzyT9mZIZluNDPYvr

In [55]:
!dvc remote list

myremote	gdrive://1s6sFjCvbZRnmVlMlzyT9mZIZluNDPYvr


In the next step, we need to authorize with a Google account in order to pull data with DVC. Google changed some of its policies, so we need to use another method to connect to the Google Drive storage. I found a simple script that generates an authorization token. We set the generated token file as the Google Drive user credentials file (`gdrive_user_credentials_file`), and then it works. The script can be run with the cell below.

In [None]:
!python3 ../token_creator.py

In [7]:
!dvc remote modify myremote gdrive_user_credentials_file ../generated_token.json

Next, once all the configs are prepared, we can pull the data from the drive.

In [10]:
!dvc pull

A       ..\data\data\
1 file added


In [12]:
from os import listdir

listdir("..\\data\\data")

['items.csv',
 'item_categories.csv',
 'sales_train.csv',
 'sample_submission.csv',
 'shops.csv',
 'test.csv',
 'testset.csv',
 'test_preprocessed.csv',
 'trainset.csv',
 'train_preprocessed.csv']

#### Note!

In the previous cell you may notice that the data folders are nested. The reason is that I already have a `data` folder in my project, so pulling the data from the drive creates a new `data` folder inside my original `data` folder.
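If code has to work both with and without this nesting, one option is a small helper that picks whichever folder exists. The function name `resolve_data_dir` is hypothetical and not part of `price_pred`. This is just a sketch of the idea using `pathlib`, which also avoids hard-coded Windows separators.

```python
from pathlib import Path


def resolve_data_dir(project_root: Path) -> Path:
    """Return the nested data/data folder if it exists, otherwise data."""
    nested = project_root / "data" / "data"
    return nested if nested.is_dir() else project_root / "data"


# Example: list the CSV files in whichever folder was found.
# The result is empty if the folder does not exist.
data_dir = resolve_data_dir(Path(".."))
print(sorted(p.name for p in data_dir.glob("*.csv")))
```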