# Predeployment Rituals


In this notebook, I'll describe preparations for deployment, pipelines and other ideas.

## Package

First of all, I've created a package called `price_pred` (you can find it in the __scripts__ folder) with classes, transformers and pipelines that should make the deployment process easier.

A brief description of the modules in this package:

- `feature_selection.py` - module with a class for choosing the best features based on different statistics
- `pipelines.py` - module containing the pipelines for train and test data
- `transformers.py` - module with the transformers from which the pipelines are built
- `validation.py` - module with a class for model validation

The `transformers.py` and `pipelines.py` modules were updated with new transformers and pipelines for test data, which hadn't been implemented before.
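
To illustrate how transformers and pipelines of this kind fit together, here is a minimal sketch built on scikit-learn's `BaseEstimator` and `TransformerMixin`. The class names `DropDuplicatesTransformer` and `PositiveValuesTransformer` are hypothetical and are not taken from `price_pred`; the actual transformers in the package may look quite different.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropDuplicatesTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drops rows duplicated on a subset of columns."""

    def __init__(self, subset):
        self.subset = subset

    def fit(self, X, y=None):
        return self  # nothing to learn, the transformer is stateless

    def transform(self, X):
        return X.drop_duplicates(subset=self.subset).reset_index(drop=True)


class PositiveValuesTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keeps only rows where `column` is positive."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[X[self.column] > 0].reset_index(drop=True)


# Transformers are chained into a pipeline as named steps.
demo_pipeline = Pipeline([
    ("dedup", DropDuplicatesTransformer(subset=["date", "shop_id", "item_id"])),
    ("positive_price", PositiveValuesTransformer(column="item_price")),
])

# Made-up rows, only for demonstration.
demo = pd.DataFrame({
    "date": ["01.01.2013", "01.01.2013", "02.01.2013"],
    "shop_id": [0, 0, 1],
    "item_id": [32, 32, 33],
    "item_price": [221.0, 221.0, -1.0],
})
demo_clean = demo_pipeline.fit_transform(demo)
```

Keeping each step in its own class makes it possible to reuse the same steps in both the train and the test pipeline.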

## Pipeline

The next topic is test preprocessing. The raw test data is a DataFrame with only two columns: `item_id` and `shop_id`.

So before passing the data to the model, we need to preprocess it. To do that, we can implement pipelines for the test data as we did for the train data, with a few modifications.

### Train

Although we already have a model trained on the train data, we still need to preprocess the train set before preprocessing the test data, because some test features are based on statistics and information about prices and sales.

So, for the train set we just clean it a bit and aggregate it by month:

__Train Pipeline Scheme__:

`Train` ---(ETL)---> `Cleaned Train` ---(Aggregation)---> `Aggregated Train`
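
The two stages of this scheme can be sketched in plain pandas. This is only an illustration under assumptions: the cleaning rules (dropping duplicates and non-positive prices or counts) and the aggregation choices (mean price, summed sales) are my guesses, not necessarily what `TrainPreprocessingPipeline` does, and the helper names `clean_sales` and `aggregate_monthly` are hypothetical.

```python
import pandas as pd


def clean_sales(df):
    """Assumed cleaning rules: drop duplicate records and non-positive prices or counts."""
    df = df.drop_duplicates(subset=["date", "shop_id", "item_id"])
    return df[(df["item_price"] > 0) & (df["item_cnt_day"] > 0)]


def aggregate_monthly(df):
    """Assumed aggregation: mean price and total sales per (month, shop, item)."""
    return df.groupby(["date_block_num", "shop_id", "item_id"], as_index=False).agg(
        item_price=("item_price", "mean"),
        item_cnt_day=("item_cnt_day", "sum"),
    )


# Made-up daily records: one duplicate row and one row with a negative price.
raw = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "03.01.2013", "05.02.2013"],
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [0, 0, 0, 0],
    "item_id": [32, 32, 32, 32],
    "item_price": [221.0, 221.0, 221.0, -1.0],
    "item_cnt_day": [4.0, 2.0, 2.0, 1.0],
})
aggregated = aggregate_monthly(clean_sales(raw))
```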


### Test

Once the train set is preprocessed, we can preprocess the test set and pass it to the model.

__Test Pipeline Scheme:__

1. `Test preprocessing` -> Add features from the raw train set (e.g. item price, sales)
2. `Merge with Train` -> Merge the test set with the preprocessed train set
3. `Feature Extraction` -> Extract new features based on the EDA
4. `Feature Selection` -> Select the best features according to the earlier analysis + Boruta
5. `Test Set Extraction` -> Extract only the test data from the dataset

Using all these pipelines, we get preprocessed test data that is ready to be passed to the model.
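
To show why the test set has to be merged with the train set before feature extraction, here is a rough sketch of steps 2, 3 and 5 using lag features. Step 1 and feature selection are left out. The helper `add_lags` and the toy data are hypothetical, and the real `TestPreprocessingPipeline` may compute its features differently.

```python
import pandas as pd


def add_lags(df, columns, lags):
    """Add `<col>_lag_<n>` columns holding the value of the same shop/item n months earlier."""
    keys = ["date_block_num", "shop_id", "item_id"]
    out = df.copy()
    for lag in lags:
        shifted = df[keys + columns].copy()
        # Values of month m become the lag features of month m + lag.
        shifted["date_block_num"] += lag
        shifted = shifted.rename(columns={c: f"{c}_lag_{lag}" for c in columns})
        out = out.merge(shifted, on=keys, how="left")
    return out


# Toy aggregated train data and a raw test set with the two id columns only.
train = pd.DataFrame({
    "date_block_num": [32, 33],
    "shop_id": [5, 5],
    "item_id": [100, 100],
    "item_price": [399.0, 349.0],
    "item_cnt_day": [2.0, 1.0],
})
test = pd.DataFrame({"shop_id": [5], "item_id": [100]})

# Step 2: give the test rows the next month index and stack them under the train rows.
test_rows = test.assign(date_block_num=34)
full = pd.concat([train, test_rows], ignore_index=True)

# Step 3: extract lag features over the combined frame.
full = add_lags(full, ["item_price", "item_cnt_day"], lags=[1, 2])

# Step 5: keep only the rows that belong to the test month.
test_features = full[full["date_block_num"] == 34].reset_index(drop=True)
```

Without the merge in step 2, the test rows would have no history to take lagged prices and sales from.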

### Presentation

All these pipelines and the transformers they depend on are implemented in the `transformers.py` and `pipelines.py` modules of the __`price_pred`__ package. In this section I just show the results of their work.

In [1]:
%pip install ../scripts/
print("Installation Finished")

Processing c:\ds_project_milad_almasri\scripts
Note: you may need to restart the kernel to use updated packages.
Installation Finished

  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: price_preditions
  Building wheel for price_preditions (pyproject.toml): started
  Building wheel for price_preditions (pyproject.toml): finished with status 'done'
  Created wheel for price_preditions: filename=price_preditions-0.2-py3-none-any.whl size=7450 sha256=4e1ac88db306d91c073c21ce34be4f31eced3dbbd42cd49e50b305fbf4180b3b
  Stored in directory: C:\Users\masam\AppData\Local\Temp\pip-ephem-wheel-cache-ffrbtnvt\wheels\60\78\ad\b600c730e29514e74fceb5c2911e807137ab0c0d5d3fb7fc21
Successfu

In [2]:
from price_pred.pipelines import TrainPreprocessingPipeline
import pandas as pd

shops = pd.read_csv("../data/shops.csv")
items = pd.read_csv("../data/items.csv")
item_categories = pd.read_csv("../data/item_categories.csv")

sales_train = pd.read_csv("../data/sales_train.csv")
test = pd.read_csv("../data/test.csv")

# (table, key column) pairs for the auxiliary tables
merge_list = [(shops, "shop_id"), (items, "item_id"), (item_categories, "item_category_id")]
unique_features = ["date", "shop_id", "item_id"]

# Column name -> dtype (for `date`, a strptime format string)
feature_map = {"date": "%d.%m.%Y",
               "date_block_num": "int",
               "shop_id": "O",
               "item_id": "O",
               "item_price": "float",
               "item_cnt_day": "float",
               "shop_name": "O",
               "item_name": "O",
               "item_category_name": "O",
               "item_category_id": "O"}

# 34 periods starting from January 2013
start_date = "01.01.2013"
periods = 34

train_pipeline = TrainPreprocessingPipeline(unique_features, merge_list, feature_map, start_date, periods)
preprocessed_train = train_pipeline.pipeline.fit_transform(sales_train)
preprocessed_train

| | date_block_num | shop_id | item_id | item_price | item_cnt_day | date |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 32 | 221.0 | 6.0 | 2013-01-01 |
| 1 | 0 | 0 | 33 | 347.0 | 3.0 | 2013-01-01 |
| 2 | 0 | 0 | 35 | 247.0 | 1.0 | 2013-01-01 |
| 3 | 0 | 0 | 43 | 221.0 | 1.0 | 2013-01-01 |
| 4 | 0 | 0 | 51 | 127.0 | 2.0 | 2013-01-01 |
| ... | ... | ... | ... | ... | ... | ... |
| 1609108 | 33 | 59 | 22087 | 119.0 | 6.0 | 2015-10-01 |
| 1609109 | 33 | 59 | 22088 | 119.0 | 2.0 | 2015-10-01 |
| 1609110 | 33 | 59 | 22091 | 179.0 | 1.0 | 2015-10-01 |
| 1609111 | 33 | 59 | 22100 | 629.0 | 1.0 | 2015-10-01 |


In [4]:
from pickle import load

from sklearn.pipeline import Pipeline

from price_pred.pipelines import TestPreprocessingPipeline
from price_pred.transformers import *

# Features kept for the model, plus the target column
features = ["date_block_num", "item_price",
            "item_price_lag_1", "item_cnt_day_lag_1",
            "item_price_lag_2", "item_cnt_day_lag_2",
            "item_price_lag_3", "item_cnt_day_lag_3",
            "item_price_lag_4", "item_cnt_day_lag_4",
            "month", "is_NewYear", "group"] + ["item_cnt_day"]

# Load the pickled ETL and EDA pipelines; `with` closes the files afterwards
with open("../pipelines/etl_pipeline_v1.pkl", "rb") as f:
    etl_pipeline = load(f)
with open("../pipelines/eda_pipeline_agg.pkl", "rb") as f:
    eda_pipeline = load(f)

etl_eda_pipeline = Pipeline([
    ("etl", etl_pipeline),
    ("eda", eda_pipeline),
])

test_pipeline = TestPreprocessingPipeline(sales_train, start_date, periods, preprocessed_train, features, etl_eda_pipeline)
preprocessed_test = test_pipeline.pipeline.fit_transform(test)
preprocessed_test

(1720517, 5)
(1720517, 29)
(1720517, 29)


| | date_block_num | item_price | item_price_lag_1 | item_cnt_day_lag_1 | item_price_lag_2 | item_cnt_day_lag_2 | item_price_lag_3 | item_cnt_day_lag_3 | item_price_lag_4 | item_cnt_day_lag_4 | month | is_NewYear | group | item_cnt_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34 | 1999.0 | 169.0 | 1.0 | 169.00 | 1.0 | 399.0 | 1.0 | 359.0 | 1.0 | 1 | 0 | 2.394519 | 0.0 |
| 2 | 34 | 599.0 | 399.0 | 1.0 | 415.92 | 1.0 | 699.0 | 1.0 | 698.5 | 1.0 | 1 | 0 | 2.394519 | 0.0 |
| 3 | 34 | 599.0 | 149.0 | 1.0 | 149.00 | 1.0 | 149.0 | 2.0 | 149.0 | 2.0 | 1 | 0 | 2.394519 | 0.0 |
| 5 | 34 | 2599.0 | 199.0 | 1.0 | 199.00 | 1.0 | 199.0 | 1.0 | 199.0 | 1.0 | 1 | 0 | 2.394519 | 0.0 |
| 6 | 34 | 3999.0 | 299.0 | 1.0 | 299.00 | 1.0 | 299.0 | 1.0 | 299.0 | 1.0 | 1 | 0 | 2.394519 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 214192 | 34 | 1499.0 | 549.0 | 1.0 | 549.00 | 3.0 | 549.0 | 3.0 | 549.0 | 3.0 | 1 | 0 | 2.394519 | 0.0 |
| 214193 | 34 | 299.0 | 999.0 | 1.0 | 999.00 | 1.0 | 999.0 | 1.0 | 999.0 | 1.0 | 1 | 0 | 2.629247 | 0.0 |
| 214195 | 34 | 199.0 | 349.0 | 1.0 | 349.00 | 1.0 | 349.0 | 1.0 | 399.0 | 4.0 | 1 | 0 | 1.274670 | 0.0 |
| 214197 | 34 | 199.0 | 699.0 | 1.0 | 699.00 | 2.0 | 749.0 | 1.0 | 749.0 | 2.0 | 1 | 0 | 1.274670 | 0.0 |


In [None]:
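with open("../models/rfr_model.pkl", "rb") as f:
    model = load(f)
# Drop the target so that only feature columns are passed to the model; this also keeps the cell safe to re-run.
X_test = preprocessed_test.drop(columns="item_cnt_day")
model.predict(X_test)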

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- item_cnt_day
