## Challenge

In this challenge, we will work with the same [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) as in the Week 5 Day 3 challenge, **Prediction of Sales**. The main goal is to create a **pipeline** that covers all data preprocessing and modeling steps.


**TASK 1**: Build a pipeline that ends with a regression model predicting `Item_Outlet_Sales` from the dataset. The pipeline should have the following steps:

- split features into numerical and categorical (text)
- null value replacement
    - mean for numerical variables
    - the most frequent value for categorical variables
- create dummy variables from the categorical features
- use PCA to reduce the dummy variables to 3 principal components. PCA is applied directly after OneHotEncoder, which outputs a sparse matrix, so we need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
- select the 3 best candidates from the original numeric features using SelectKBest
- fit Ridge regression (the default alpha is fine for now)

**TASK 2**: Tune the parameters of the models as well as the preprocessing steps and find the best solution
- Try models: Random Forest, Gradient Boosting Regressor, or Ridge Regression. Use the approach from the [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), section **PIPELINE TUNING (ADVANCED VERSION)**, where we tried different scalers.
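
The model-swapping idea from Task 2 can be sketched as follows. This is a minimal illustration on synthetic data, not the sales dataset: the names `X_demo`, `y_demo`, `pipe`, and `param_grid_demo` are hypothetical, and the scaler step only stands in for the real preprocessing. The point is that `GridSearchCV` accepts a list of grids, and each grid can replace the whole `model` step.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# Each dict swaps one estimator into the "model" step and tunes only
# the hyperparameters that this estimator actually has.
param_grid_demo = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [RandomForestRegressor(random_state=0)],
     "model__n_estimators": [10, 25]},
    {"model": [GradientBoostingRegressor(random_state=0)],
     "model__learning_rate": [0.05, 0.1]},
]

search = GridSearchCV(pipe, param_grid_demo, cv=3).fit(X_demo, y_demo)
print(type(search.best_params_["model"]).__name__)
```

Splitting the grid into one dict per model keeps model-specific parameters such as `alpha` or `learning_rate` from being passed to an estimator that does not accept them.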

In [1]:
import pandas as pd
df = pd.read_csv("regression_exercise.csv")
df.head()

| Unnamed: 0 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 |  | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


Split into train and test sets at the beginning. We should always do this before fitting the Pipeline, so that no preprocessing step learns from the test data.

In [3]:
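# Assumption: the target was separated from the features in a cell that is not shown
y = df.pop('Item_Outlet_Sales')
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]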

In [4]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]
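
The same kind of split can also be sketched with `train_test_split`, which keeps features and target aligned in one call. The tiny frame below is made up for illustration (`feature_a`, `feature_b`, and `target` are hypothetical column names), and `random_state` is fixed only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the real data.
demo = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": list("abcdefghij"),
    "target": [float(v) * 10 for v in range(10)],
})

X_all = demo.drop(columns="target")
y_all = demo["target"]

# 80/20 split; rows of X and y stay paired because they are split together.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```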

# Task I

### Split Features to numerical and categorical

In [5]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()

In [6]:
from sklearn.preprocessing import FunctionTransformer

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [7]:
# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)
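
As an alternative to hard-coded column lists, scikit-learn's `make_column_selector` can pick columns by dtype at fit time. The sketch below uses a made-up three-column frame; `pick_num` and `pick_cat` are hypothetical names, and treating every non-numeric column as categorical is an assumption that happens to fit this kind of data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Made-up frame with two numeric columns and one text column.
demo = pd.DataFrame({
    "weight": [9.3, 5.9, 17.5],
    "size": ["Medium", "High", "Medium"],
    "year": [1999, 2009, 1999],
})

pick_num = make_column_selector(dtype_include=np.number)
# Assumption: everything that is not numeric is treated as categorical.
pick_cat = make_column_selector(dtype_exclude=np.number)

print(pick_num(demo))
print(pick_cat(demo))
```

These selectors are callables that return column names, so they can be passed as the column specification of a `ColumnTransformer`.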

In [37]:
keep_num.transform(df_train)

| Unnamed: 0 | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year |
| --- | --- | --- | --- | --- |
| 0 | 9.300 | 0.016047 | 249.8092 | 1999 |
| 1 | 5.920 | 0.019278 | 48.2692 | 2009 |
| 2 | 17.500 | 0.016760 | 141.6180 | 1999 |
| 3 | 19.200 | 0.000000 | 182.0950 | 1998 |
| 4 | 8.930 | 0.000000 | 53.8614 | 1987 |
| ... | ... | ... | ... | ... |
| 8518 | 6.865 | 0.056783 | 214.5218 | 1987 |
| 8519 | 8.380 | 0.046982 | 108.1570 | 2002 |
| 8520 | 10.600 | 0.035186 | 85.1224 | 2004 |
| 8521 | 7.210 | 0.145221 | 103.1332 | 2009 |


### Null value replacement

In [24]:
from sklearn.impute import SimpleImputer
import numpy as np 

num_impute = SimpleImputer(missing_values=np.nan, strategy='mean')

In [25]:
cat_impute = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
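
To see what the two strategies do, here is a small sketch on made-up values (the frames `num_demo` and `cat_demo` are hypothetical and only reuse two column names from the dataset). The mean of 9.0 and 15.0 fills the numeric gap, and the most common label fills the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; only the column names mirror the dataset.
num_demo = pd.DataFrame({"Item_Weight": [9.0, np.nan, 15.0]})
cat_demo = pd.DataFrame(
    {"Outlet_Size": ["Medium", np.nan, "Medium", "High"]}, dtype=object
)

# Both imputers return NumPy arrays, not DataFrames.
num_filled = SimpleImputer(strategy="mean").fit_transform(num_demo)
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat_demo)

print(num_filled.ravel())
print(cat_filled.ravel())
```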

### Creating dummy variables

In [26]:
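from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class ToDenseTransformer(BaseEstimator, TransformerMixin):

    # here you define the operation it should perform
    # toarray() returns a plain ndarray; todense() returns np.matrix,
    # which recent scikit-learn versions reject
    def transform(self, X, y=None, **fit_params):
        return X.toarray()

    # just return self
    def fit(self, X, y=None, **fit_params):
        return self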

# use OneHotEncoder
cat_ohe = OneHotEncoder()
to_dense = ToDenseTransformer()
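
The reason a densifying step is needed can be sketched on a toy column (the array `cats` is made up). `OneHotEncoder` returns a SciPy sparse matrix by default, and `toarray()` turns it into a regular NumPy array. The second part shows `handle_unknown="ignore"`, an optional setting not used in this notebook, which encodes a category unseen during `fit` as an all-zero row instead of raising an error.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column.
cats = np.array([["Low Fat"], ["Regular"], ["Low Fat"]], dtype=object)

encoded = OneHotEncoder().fit_transform(cats)   # sparse by default
dense = encoded.toarray()                       # plain ndarray
print(sparse.issparse(encoded), type(dense).__name__, dense.shape)

# Optional: tolerate categories that did not appear during fit.
enc = OneHotEncoder(handle_unknown="ignore").fit(cats)
unseen = enc.transform(np.array([["low fat"]], dtype=object)).toarray()
print(unseen)
```

With high-cardinality columns such as item identifiers, a test set may well contain values the encoder never saw, so this option is worth considering.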

### PCA to reduce number of dummy variables to 3 principal components

In [27]:
from sklearn.decomposition import PCA
# don't forget ToDenseTransformer after one hot encoder
cat_pca = PCA(n_components=3)
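
What PCA does to a block of dummy columns can be sketched on a made-up one-hot matrix (`dummies` below is hypothetical: 6 categories, 50 rows). The output has 3 columns regardless of how many dummies went in, and `explained_variance_ratio_` indicates how much of the variance those 3 components retain.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dense one-hot matrix: 50 rows, 6 dummy columns, one 1 per row.
labels = np.arange(50) % 6
dummies = np.eye(6)[labels]

pca = PCA(n_components=3)
reduced = pca.fit_transform(dummies)

print(reduced.shape)
print(pca.explained_variance_ratio_.sum())
```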

### Select 3 best numeric features

In [29]:
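from sklearn.feature_selection import SelectKBest, f_regression

# use SelectKBest; f_regression suits a continuous target
# (the default score function, f_classif, is meant for classification)
num_select = SelectKBest(score_func=f_regression, k=3)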
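
`SelectKBest` scores each feature against the target and keeps the top `k`. For a continuous target, `f_regression` is a suitable score function. The sketch below uses synthetic data in which only columns 0 and 2 influence the target, so those are the ones expected to survive; all names here are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# Only columns 0 and 2 drive the target; columns 1 and 3 are pure noise.
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Boolean mask over the input columns: True means the column is kept.
print(selector.get_support())
print(selector.transform(X_demo).shape)
```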

### Fitting models

In [30]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()
forest_model = RandomForestRegressor()
grad_boost_model = GradientBoostingRegressor()

### Building Pipeline

In [32]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [44]:
num_pipeline = Pipeline([('num', keep_num),('imputer',num_impute),('selection',num_select)])
cat_pipeline = Pipeline([('cat', keep_cat),('imputer',cat_impute),('ohe',cat_ohe), ('to_dense', to_dense),('pca', cat_pca)])
union = FeatureUnion([('num',num_pipeline),('cat',cat_pipeline)])
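
`FeatureUnion` fits each branch on the same input and concatenates the outputs column-wise. A minimal sketch on synthetic numeric data (hypothetical names; both branches read the same 5 columns here, unlike the numeric/categorical split above):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

union_demo = FeatureUnion([
    ("kbest", SelectKBest(score_func=f_regression, k=3)),
    ("pca", PCA(n_components=2)),
])

# 3 selected columns followed by 2 principal components.
combined = union_demo.fit_transform(X_demo, y_demo)
print(combined.shape)
```

In the notebook's pipeline the same logic gives up to 3 selected numeric columns plus 3 principal components as input to the model.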

In [50]:
model1 = Pipeline([('features',union),('model',base_model)]).fit(df_train,y_train)
model2 = Pipeline([('features',union),('model',forest_model)]).fit(df_train,y_train)
model3 = Pipeline([('features',union),('model',grad_boost_model)]).fit(df_train,y_train)

In [52]:
print(model1.score(df_test,y_test))
print(model2.score(df_test,y_test))
print(model3.score(df_test,y_test))

0.37561049095811316
0.5332145778354869
0.5897596319802384
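
For scikit-learn regressors, `.score` returns the coefficient of determination R², so the three numbers above are R² values on the test set. The sketch below checks this on synthetic data and also computes RMSE, which is in the units of the target; every name in it is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Simple holdout: first 160 rows for training, last 40 for testing.
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

reg = Ridge().fit(X_tr, y_tr)
pred = reg.predict(X_te)

r2_from_score = reg.score(X_te, y_te)
r2_from_metric = r2_score(y_te, pred)
rmse = np.sqrt(mean_squared_error(y_te, pred))
print(r2_from_score, r2_from_metric, rmse)
```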


# Task II

In [54]:
from sklearn.model_selection import GridSearchCV

In [56]:
param_grid = {'features__num__selection__k': [1, 2, 3],
              'features__cat__pca__n_components': [1, 2, 3],
              'model__n_estimators': [25, 50, 100],
              'model__max_depth': [None, 5, 10, 25]
              }
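
The double-underscore keys follow the step names of the nested pipeline. A quick way to find valid keys is `get_params()`. The sketch below rebuilds a stripped-down pipeline with the same step names (the imputers, encoder, and column selection are left out for brevity, so it is an illustration and not the full pipeline) and prints the parameters relevant to the grid.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion, Pipeline

# Stripped-down pipeline with the same step names as in the notebook.
num_branch = Pipeline([("selection", SelectKBest(score_func=f_regression, k=3))])
cat_branch = Pipeline([("pca", PCA(n_components=3))])
features = FeatureUnion([("num", num_branch), ("cat", cat_branch)])
demo_pipe = Pipeline([("features", features), ("model", RandomForestRegressor())])

# Every key returned here is a valid key for a GridSearchCV param grid.
tunable = sorted(demo_pipe.get_params().keys())
for name in tunable:
    if name.endswith(("__k", "__n_components", "__max_depth", "__n_estimators")):
        print(name)
```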

In [58]:
tuned_model = GridSearchCV(model2,param_grid,verbose=5, refit=True).fit(df_train,y_train)

...
[CV]  features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=10, model__n_estimators=100, score=0.587, total=   1.5s
[CV] features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=25, model__n_estimators=25 
[CV]  features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=25, model__n_estimators=25, score=0.516, total=   0.7s
[CV] features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=25, model__n_estimators=25 
[CV]  features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=25, model__n_estimators=25, score=0.499, total=   0.7s
[CV] features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=25, model__n_estimators=25 
[CV]  features__cat__pca__n_components=3, features__num__selection__k=2, model__max_depth=25, model__n_estimators=25, score=0.460, total=   0.7s
...

In [61]:
print('Final score is: ', tuned_model.score(df_test, y_test))
print('Best parameters are: ', tuned_model.best_params_)

Final score is:  0.5763102888853869
Best parameters are:  {'features__cat__pca__n_components': 1, 'features__num__selection__k': 3, 'model__max_depth': 5, 'model__n_estimators': 100}
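
Beyond `best_params_`, the fitted search object stores every cross-validation result in `cv_results_`, which is easy to inspect as a DataFrame. A minimal sketch on synthetic data with a deliberately tiny grid (all names hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=150)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    {"max_depth": [2, 5, None]},
    cv=3,
).fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
cols = ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols].sort_values("rank_test_score")
print(results)
print(search.best_score_)
```

`best_score_` is the mean cross-validated score of the winning combination, measured on the training folds only, so it is worth comparing with the score on the held-out test set.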


In [62]:
param_grid = {"features__num__selection__k":[1,2,3,4],
                'features__cat__pca__n_components': [1,2,3,4],
                'model__learning_rate':[0.01,0.1,1],
                'model__n_estimators':[25,50,100],
                'model__subsample':[.25,.50,1],
                'model__max_depth':[None, 5, 10, 25]
                }
tuned_model = GridSearchCV(model3,param_grid,verbose=5, refit=True).fit(df_train,y_train)

...
[CV]  features__cat__pca__n_components=4, features__num__selection__k=4, model__learning_rate=1, model__max_depth=10, model__n_estimators=100, model__subsample=0.25, score=-226557176.235, total=   0.9s
[CV] features__cat__pca__n_components=4, features__num__selection__k=4, model__learning_rate=1, model__max_depth=10, model__n_estimators=100, model__subsample=0.5 
[CV]  features__cat__pca__n_components=4, features__num__selection__k=4, model__learning_rate=1, model__max_depth=10, model__n_estimators=100, model__subsample=0.5, score=-9.427, total=   1.4s
[CV] features__cat__pca__n_components=4, features__num__selection__k=4, model__learning_rate=1, model__max_depth=10, model__n_estimators=100, model__subsample=0.5 
[CV]  features__cat__pca__n_components=4, features__num__selection__k=4, model__learning_rate=1, model__max_depth=10, model__n_estimators=100, model__subsample=0.5, score=-7.186, total=  
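
A grid of this size grows multiplicatively, which explains the long log. The number of candidates can be counted up front with `ParameterGrid`. The fold count below is an assumption: recent scikit-learn releases default to 5 folds, while older ones used 3.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "features__num__selection__k": [1, 2, 3, 4],
    "features__cat__pca__n_components": [1, 2, 3, 4],
    "model__learning_rate": [0.01, 0.1, 1],
    "model__n_estimators": [25, 50, 100],
    "model__subsample": [0.25, 0.50, 1],
    "model__max_depth": [None, 5, 10, 25],
}

n_candidates = len(ParameterGrid(grid))  # 4 * 4 * 3 * 3 * 3 * 4
n_folds = 5  # assumption: depends on the cv argument and library version
print(n_candidates, n_candidates * n_folds)
```

When the count gets this large, a randomized search over the same ranges (`RandomizedSearchCV`) is a common way to cap the number of fits.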

In [63]:

print('Final score is: ', tuned_model.score(df_test, y_test))
print('Best parameters are: ', tuned_model.best_params_)

Final score is:  0.587092249794028
Best parameters are:  {'features__cat__pca__n_components': 2, 'features__num__selection__k': 4, 'model__learning_rate': 0.1, 'model__max_depth': 5, 'model__n_estimators': 25, 'model__subsample': 0.5}
