## Challenge

In this challenge, we will work with the same [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) as in the Week 5 Day 3 challenge, **Prediction of Sales**. The main goal is to create a **pipeline** that covers all data preprocessing and modeling steps.


**TASK 1**: Build a pipeline that ends with a regression model predicting `Item_Outlet_Sales` from the dataset. The pipeline should have the following steps:

- split features into numerical and categorical (text)
- null value replacement
    - mean for numerical variables
    - the most frequent value for categorical variables
- create dummy variables from the categorical features
- use PCA to reduce the dummy variables to 3 principal components. PCA is applied directly after OneHotEncoder, which outputs a sparse matrix, so we need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
- select the 3 best original numeric features using SelectKBest
- fit Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of the models as well as the preprocessing steps and find the best solution.
- Try these models: Random Forest, Gradient Boosting Regressor, or Ridge Regression. Use the approach from the [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section **PIPELINE TUNING (ADVANCED VERSION)**, where different scalers were tried.

In [66]:
import numpy as np
import pandas as pd

# only regression-related imports are kept; the classifier imports were unused
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder

In [67]:
data = pd.read_csv("regression_exercise.csv", header = 0)

In [69]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [70]:
# creating target variable
y = data["Item_Outlet_Sales"]
df = data.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

In [72]:
y

0       3735.1380
1        443.4228
2       2097.2700
3        732.3800
4        994.7052
          ...    
8518    2778.3834
8519     549.2850
8520    1193.1136
8521    1845.5976
8522     765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

Split into train and test sets at the beginning. We should always do this before fitting the pipeline, so that no preprocessing step learns anything from the test data.

In [58]:
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [59]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]
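The same 80/20 split can also be written with `train_test_split`, which keeps features and target aligned and can be made reproducible. A minimal sketch on a made-up frame; the toy values and `random_state=42` are assumptions for illustration, not values from this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df and y (hypothetical values, not the real dataset).
toy_df = pd.DataFrame({
    "Item_MRP": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Item_Weight": [9.3, 5.9, 17.5, 19.2, 8.9],
})
toy_y = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0], name="Item_Outlet_Sales")

# test_size=0.2 mirrors frac=0.8 above; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df, toy_y, test_size=0.2, random_state=42
)

# Features and target are split together, so their indexes match.
print(X_tr.shape, X_te.shape)
print(y_tr.index.equals(X_tr.index))
```

Without a fixed seed, `df.sample(frac=0.8)` draws a different training set on every run, so the scores reported further down will vary slightly between runs.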

In [60]:
y_train.shape

(6818,)

In [61]:
type(y_train)

pandas.core.series.Series

# Task I

### Split Features to numerical and categorical

In [30]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()

In [31]:
from sklearn.preprocessing import FunctionTransformer

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [32]:
# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

### Null value replacement

In [33]:
df[df.isnull().any(axis=1)].head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store
7,,Low Fat,0.12747,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3
8,16.2,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1
9,19.2,Regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1
18,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3


In [19]:
#transformer_null = ColumnTransformer([('impute_mean', SimpleImputer(strategy='mean'), num_feats),
#                                      ('impute_most_freq', SimpleImputer(strategy='most_frequent'), cat_feats)])

In [34]:
null_replace_num = SimpleImputer(strategy="mean") 
null_replace_cat = SimpleImputer(strategy="most_frequent")
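To see what the two imputers do, here is a small sketch on made-up columns (the values are hypothetical; only the column names echo the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One numeric and one categorical column, each with a missing value.
num = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5]})
cat = pd.DataFrame(
    {"Outlet_Size": ["Medium", np.nan, "Medium", "High"]}, dtype=object
)

# Numeric gaps are filled with the column mean.
num_filled = SimpleImputer(strategy="mean").fit_transform(num)
# Categorical gaps are filled with the most frequent category.
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```

Inside a pipeline the fill values are learned from the training data only and then reused on the test data.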

### Creating dummy variables

In [37]:
# use OneHotEncoder
ohe = OneHotEncoder()

### PCA to reduce number of dummy variables to 3 principal components

In [38]:
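from sklearn.base import BaseEstimator, TransformerMixin

class ToDenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a scipy sparse matrix into a dense NumPy array."""

    # nothing to learn, so just return self
    def fit(self, X, y=None, **fit_params):
        return self

    # toarray() returns an ndarray; todense() returns np.matrix,
    # which recent scikit-learn versions reject
    def transform(self, X, y=None, **fit_params):
        return X.toarray()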

In [39]:
# don't forget ToDenseTransformer after the one-hot encoder
to_dense = ToDenseTransformer()
# Task 1 asks for 3 components; this cell uses 6, the value the grid search in Task II selects
pca = PCA(n_components = 6)
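The encode, densify, reduce chain can be checked in isolation. A minimal sketch on made-up categories; `handle_unknown="ignore"` is an assumption added here so that categories unseen during fitting do not raise an error, and it is not set in the notebook's `ohe`:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical rows: fat content and location tier.
cats = np.array(
    [
        ["Low Fat", "Tier 1"],
        ["Regular", "Tier 3"],
        ["Low Fat", "Tier 2"],
        ["Regular", "Tier 1"],
        ["Low Fat", "Tier 3"],
    ],
    dtype=object,
)

# OneHotEncoder returns a scipy sparse matrix by default.
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(cats)
print(sparse.issparse(encoded), encoded.shape)

# Convert to a dense ndarray before handing the data to PCA.
dense = encoded.toarray()
reduced = PCA(n_components=3).fit_transform(dense)
print(reduced.shape)
```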

### Select 3 best numeric features

In [40]:
# use SelectKBest
select_best = SelectKBest(k=3)
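One thing to be aware of: when no `score_func` is given, `SelectKBest` uses `f_classif`, which is designed for classification targets. For a continuous target such as `Item_Outlet_Sales`, `f_regression` is a more natural scoring function. A minimal sketch on synthetic data (the data and `k=2` are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# The target depends only on columns 0 and 2; columns 1 and 3 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Score each feature by its univariate linear relationship with y.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Indices of the columns that were kept.
print(selector.get_support(indices=True))
```

Swapping the scoring function in the notebook's `select_best` may change which features are kept and therefore the scores reported below.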

### Fitting models

In [41]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

### Building Pipeline

In [42]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [44]:
pipe_num = Pipeline([('num_feats',keep_num),('transformer_impute',null_replace_num),('select_best', select_best)])                  

In [45]:
pipe_cat = Pipeline([('num_feats',keep_cat),('transformer_impute',null_replace_cat),('encode', ohe),('dense',to_dense),('PCA', pca)])

In [46]:
pipe_concat = FeatureUnion([('num',pipe_num), ('cat',pipe_cat)])

In [47]:
pipe_main = Pipeline([('pipes',pipe_concat),('model',base_model)])
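An alternative layout, hinted at by the commented-out `ColumnTransformer` cell earlier, routes columns by name instead of using `FunctionTransformer` selectors inside a `FeatureUnion`. This is only a sketch on a made-up frame: the toy data, the `to_dense` helper, `k=1`, `n_components=2`, `f_regression`, and `handle_unknown="ignore"` are all assumptions, not the notebook's settings:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def to_dense(X):
    """Hypothetical helper: densify sparse input, pass arrays through."""
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)


toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, 8.9, 12.0],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 120.0],
    "Item_Fat_Content": pd.Series(
        ["Low Fat", "Regular", "Low Fat", "Regular", "Low Fat", "Regular"],
        dtype=object),
    "Outlet_Size": pd.Series(
        ["Medium", "Medium", np.nan, "High", "Small", "Medium"], dtype=object),
})
toy_y = pd.Series([3735.1, 443.4, 2097.3, 732.4, 994.7, 1500.0])

num_cols = ["Item_Weight", "Item_MRP"]
cat_cols = ["Item_Fat_Content", "Outlet_Size"]

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("select", SelectKBest(score_func=f_regression, k=1)),
])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("dense", FunctionTransformer(to_dense)),
    ("pca", PCA(n_components=2)),
])

# Each sub-pipeline only sees the columns listed for it.
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])
model = Pipeline([("prep", preprocess), ("model", Ridge())])
model.fit(toy, toy_y)
preds = model.predict(toy)
print(preds.shape)
```

Both layouts produce the same kind of feature matrix; the `ColumnTransformer` version avoids the global `num_feats` and `cat_feats` lists that the selector functions depend on.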

In [50]:
type(df_train)

pandas.core.frame.DataFrame

In [51]:
type(y_train)

pandas.core.series.Series

In [48]:
# model.score(df_test,y_test)
pipe_main.fit(df_train,y_train)

Pipeline(steps=[('pipes',
                 FeatureUnion(transformer_list=[('num',
                                                 Pipeline(steps=[('num_feats',
                                                                  FunctionTransformer(func=<function numFeat at 0x7f1e9512dca0>)),
                                                                 ('transformer_impute',
                                                                  SimpleImputer()),
                                                                 ('select_best',
                                                                  SelectKBest(k=3))])),
                                                ('cat',
                                                 Pipeline(steps=[('num_feats',
                                                                  FunctionTransformer(func=<function catFeat at 0x7f1e9512dd30>)),
                                                                 ('transformer_impute',
             

In [132]:
preds = pipe_main.predict(df_test)

In [133]:
pipe_main.score(df_test,y_test)

0.4742289783659418

# Task II

### Best params for PCA and SelectKBest

In [143]:
cat_features_to_test = np.arange(1, 7)
num_features_to_test = np.arange(1, 5)

In [144]:
params = {'pipes__cat__PCA__n_components': cat_features_to_test, 'pipes__num__select_best__k': num_features_to_test}

In [145]:
gridsearch = GridSearchCV(pipe_main, params, verbose=1).fit(df_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:   27.7s finished


In [146]:
gridsearch.best_params_

{'pipes__cat__PCA__n_components': 6, 'pipes__num__select_best__k': 4}
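The grid keys are built from the step names joined by double underscores, from the outermost pipeline inward. The sketch below rebuilds a small pipeline with the same step names on synthetic numeric data (the data, `f_regression`, and `cv=3` are assumptions), checks the keys against `get_params()`, and lists every candidate's score rather than only the winner:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

union = FeatureUnion([
    ("num", Pipeline([("select_best", SelectKBest(f_regression, k=2))])),
    ("cat", Pipeline([("PCA", PCA(n_components=2))])),
])
pipe = Pipeline([("pipes", union), ("model", Ridge())])

grid = {
    "pipes__num__select_best__k": [1, 2, 3],
    "pipes__cat__PCA__n_components": [1, 2],
}
# Every grid key must be a valid parameter name of the pipeline.
print(set(grid) <= set(pipe.get_params()))

search = GridSearchCV(pipe, grid, cv=3).fit(X, y)

# One row per candidate, best rank first.
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Looking at the full table helps when the best value sits at the edge of the tested range, as it does above for both parameters, which suggests the range may be worth widening where the data allows.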

### Gridsearch different models

In [147]:
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

In [160]:
xgb_reg = xgb.XGBRegressor()
rand_reg = RandomForestRegressor()
model_to_test = [xgb_reg, rand_reg, base_model]

In [151]:
model_params = {'model': model_to_test}

In [152]:
model_gridsearch = GridSearchCV(pipe_main, model_params, verbose=1).fit(df_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   14.8s finished


In [153]:
model_gridsearch.best_params_

{'model': RandomForestRegressor()}
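Model choice and model hyperparameters can also be searched together by passing a list of dictionaries, one per model, so that each model is only combined with its own parameters. A minimal sketch on synthetic data; the scaler step, the parameter values, and `cv=3` are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=150)

# The 'model' step is a placeholder that the grid replaces.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

param_grid = [
    # Ridge is only paired with alpha values.
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    # The forest is only paired with tree parameters.
    {
        "model": [RandomForestRegressor(random_state=0)],
        "model__n_estimators": [20, 50],
        "model__max_depth": [3, 5],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(type(search.best_params_["model"]).__name__)
print(search.best_score_)
```

Comparing models only at their default settings, as in the cell above, can favor whichever model happens to have good defaults for this dataset.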

### Gridsearch hyperparams

In [206]:
rand_reg = RandomForestRegressor(n_estimators = 100, max_depth=5, min_samples_split=2)

In [207]:
pipe_main = Pipeline([('pipes',pipe_concat),('model',rand_reg)])

In [208]:
n_estimators_test = [50,100]
max_depth_test = [5,6,7]
min_samples_split = [2,3,4]

In [210]:
hyper_params = {'model__n_estimators': n_estimators_test, 'model__max_depth': max_depth_test, 'model__min_samples_split': min_samples_split}

In [211]:
hyper_gridsearch = GridSearchCV(pipe_main, hyper_params, verbose=1).fit(df_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:  1.3min finished


In [212]:
hyper_gridsearch.best_params_

{'model__max_depth': 5,
 'model__min_samples_split': 3,
 'model__n_estimators': 50}
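By default `GridSearchCV` refits the best candidate on the whole training set (`refit=True`), so the tuned pipeline is available as `best_estimator_` without rebuilding it by hand. A minimal sketch on synthetic data (the data, scaler step, grid values, and `cv=3` are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=0)),
])
grid = {"model__max_depth": [3, 5], "model__n_estimators": [20, 50]}
search = GridSearchCV(pipe, grid, cv=3).fit(X_tr, y_tr)

# Already fitted on all of X_tr with the best parameters.
best_pipe = search.best_estimator_
test_r2 = best_pipe.score(X_te, y_te)
print(search.best_params_)
print(test_r2)
```

This avoids copying the best values into a new constructor call, where a typo or a stale value can slip in unnoticed.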

### Optimal model

In [213]:
rand_reg = RandomForestRegressor(n_estimators = 50, max_depth=5, min_samples_split= 3)

In [214]:
pipe_main = Pipeline([('pipes',pipe_concat),('model',rand_reg)])

In [215]:
pipe_main.fit(df_train,y_train)

Pipeline(steps=[('pipes',
                 FeatureUnion(transformer_list=[('num',
                                                 Pipeline(steps=[('num_feats',
                                                                  FunctionTransformer(func=<function numFeat at 0x7f9b215264c0>)),
                                                                 ('transformer_impute',
                                                                  SimpleImputer()),
                                                                 ('select_best',
                                                                  SelectKBest(k=3))])),
                                                ('cat',
                                                 Pipeline(steps=[('num_feats',
                                                                  FunctionTransformer(func=<function catFeat at 0x7f9b21526550>)),
                                                                 ('transformer_impute',
             

In [216]:
preds = pipe_main.predict(df_test)

In [217]:
pipe_main.score(df_test,y_test)

0.6194134607764028