## Pipelines Challenge

In this challenge, we will work with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) to predict sales.

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. PCA runs directly after the `OneHotEncoder`, which outputs a sparse matrix by default, so we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, use the same approach as this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where different kinds of scalers are tried. (Use the article as a reference.)

_________________________________

In [110]:
import pandas as pd
df = pd.read_csv("regression_exercise.csv")
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [111]:
# Separate the target variable from the features
y = df["Item_Outlet_Sales"]
# Drop the target and the item identifier from the feature frame
df = df.drop(["Item_Outlet_Sales", "Item_Identifier"], axis=1)

Split the dataset into a training set and a test set.

**Note:** Always split before fitting the pipeline, so that the imputation, scaling, and PCA steps learn their statistics from the training data only.

In [112]:
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [113]:
df_train.values

array([[9.3, 'Low Fat', 0.016047301, ..., 'Medium', 'Tier 1',
        'Supermarket Type1'],
       [5.92, 'Regular', 0.019278216, ..., 'Medium', 'Tier 3',
        'Supermarket Type2'],
       [17.5, 'Low Fat', 0.016760075, ..., 'Medium', 'Tier 1',
        'Supermarket Type1'],
       ...,
       [18.6, 'Low Fat', 0.118661426, ..., 'Medium', 'Tier 3',
        'Supermarket Type2'],
       [10.6, 'Low Fat', 0.035186271, ..., 'Small', 'Tier 2',
        'Supermarket Type1'],
       [7.21, 'Regular', 0.145220646, ..., 'Medium', 'Tier 3',
        'Supermarket Type2']], dtype=object)

In [114]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]
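
For comparison, the same 80/20 split can also be written with scikit-learn's `train_test_split`. The sketch below runs on a small synthetic frame; the `toy` data and `random_state=42` are assumptions chosen for illustration and are not part of the exercise data. Fixing `random_state` makes the split repeatable between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the exercise data: 10 rows, one feature, one target.
toy = pd.DataFrame({"Item_MRP": range(10), "Item_Outlet_Sales": range(100, 110)})
toy_y = toy.pop("Item_Outlet_Sales")

# test_size=0.2 mirrors frac=0.8 above; random_state (arbitrary) fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```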

---------------------
## Task I

### Split Features into numerical and categorical

In [115]:
# Object-dtype columns are treated as categorical; all other columns as numerical
cat_feats = df.select_dtypes(include='object').columns.tolist()
num_feats = df.select_dtypes(exclude='object').columns.tolist()

In [116]:
from sklearn.preprocessing import FunctionTransformer

# Custom column-selection functions, to be wrapped for use inside a Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [117]:
# Each selector starts its own pipeline, one per feature type
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)
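
To see what such a selector does in isolation, here is a minimal sketch on a hypothetical two-column frame. The names `toy`, `toy_num_feats`, and `select_numeric` are made up for this illustration. With the default `validate=False`, `FunctionTransformer` passes the DataFrame to the function unchanged, so column selection by name works.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frame with one numerical and one categorical feature.
toy = pd.DataFrame({"Item_Weight": [9.3, 5.92], "Outlet_Size": ["Medium", "High"]})
toy_num_feats = ["Item_Weight"]

def select_numeric(data):
    # Return only the numerical columns, keeping the DataFrame type.
    return data[toy_num_feats]

selector = FunctionTransformer(select_numeric)
selected = selector.fit_transform(toy)
print(list(selected.columns))
```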

### Replacing null values

In [118]:
# Use SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [119]:
numeric_transform = Pipeline([('get_nums', keep_num),
                              ('impute_mean', SimpleImputer(strategy='mean')), 
                              ('scaling', StandardScaler())])
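# Dense output lets PCA consume the dummy variables directly.
# scikit-learn >= 1.2 names this argument `sparse_output`; older releases call it `sparse`.
categorical_transform = Pipeline([('get_cats', keep_cat),
                                  ('impute_mode', SimpleImputer(strategy='most_frequent')),
                                  ('one-hot-encode', OneHotEncoder(sparse_output=False))])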


In [120]:
numeric_transform.fit_transform(df_train)

array([[-0.84668716, -0.96642147,  1.74097226,  0.13132612],
       [-1.64902076, -0.90351396, -1.47938543,  1.32756056],
       [ 1.09980266, -0.95254341,  0.0122119 ,  0.13132612],
       ...,
       [ 1.36091715,  1.03152614, -1.31177471,  1.32756056],
       [-0.53809731, -0.59377629, -0.89051728,  0.72944334],
       [-1.34280468,  1.54864725, -0.60272718,  1.32756056]])

In [121]:
categorical_transform.fit_transform(df_train)

array([[0., 1., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])
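
The two imputation strategies can be checked by hand on a tiny example. This is a sketch on made-up values (the frames `nums` and `cats` are assumptions for illustration): the mean of 1.0 and 3.0 fills the numerical gap, and the most frequent label fills the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical columns with one missing value each.
nums = pd.DataFrame({"Item_Weight": [1.0, np.nan, 3.0]})
cats = pd.DataFrame({"Outlet_Size": ["Medium", np.nan, "Medium", "Small"]}, dtype=object)

# Numerical gap is filled with the column mean: (1.0 + 3.0) / 2 = 2.0
mean_filled = SimpleImputer(strategy="mean").fit_transform(nums)
# Categorical gap is filled with the most frequent label: "Medium"
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(cats)

print(mean_filled.ravel())
print(mode_filled.ravel())
```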

### Creating dummy variables

In [122]:
# OneHotEncoder is already the last step of categorical_transform above
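
As a small illustration of what the encoder produces, the sketch below encodes a single made-up column. The `sizes` array is hypothetical, and `handle_unknown="ignore"` is an optional setting added here as an assumption: it makes categories that appear only in the test set encode as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column.
sizes = np.array([["Small"], ["Medium"], ["Small"], ["High"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
# The default output is a SciPy sparse matrix; toarray() densifies it for display.
dummies = encoder.fit_transform(sizes).toarray()

print(encoder.categories_[0])  # categories are stored in sorted order
print(dummies)
```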


### Use PCA to reduce the number of dummy variables to 3 principal components.

In [123]:
from sklearn.decomposition import PCA
# A dense-conversion step is needed after the one-hot encoder only when it outputs a sparse matrix

In [124]:
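from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a SciPy sparse matrix to a dense NumPy array."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; present only to satisfy the transformer interface
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray (todense() returns np.matrix);
        # input that is already dense passes through unchanged
        return X.toarray() if sparse.issparse(X) else X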

In [125]:
pca_transform = Pipeline([('cat_trans', categorical_transform),
                          # ('to_dense', DenseTransformer()),  # only needed when the encoder returns a sparse matrix
                          ('pca', PCA(n_components=3))])

In [126]:
pca_transform.fit_transform(df_train)

array([[-0.30931695, -0.00770143, -0.52623694],
       [ 1.29265541,  0.17533513,  0.81370671],
       [-0.29931858, -0.0072486 , -0.49836465],
       ...,
       [ 1.28162502,  0.07138006, -0.55498668],
       [-1.11054949, -0.26668961, -0.54939495],
       [ 1.28864999,  0.17628529,  0.88344675]])
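
If the encoder is left at its default sparse output, a dense-conversion step has to sit between it and PCA, as the task description says. The sketch below shows that arrangement on made-up categorical data. The helper `to_dense`, the step names, and the `toy_cats` array are assumptions for illustration; a `FunctionTransformer` stands in for the article's transformer class.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def to_dense(X):
    # Densify sparse input; leave dense input untouched.
    return X.toarray() if sparse.issparse(X) else np.asarray(X)

# Hypothetical data: 50 rows, 2 categorical columns with up to 4 levels each.
rng = np.random.default_rng(0)
toy_cats = rng.choice(["a", "b", "c", "d"], size=(50, 2)).astype(object)

sparse_pca = Pipeline([
    ("one_hot", OneHotEncoder()),                # sparse output by default
    ("to_dense", FunctionTransformer(to_dense)), # sparse -> dense ndarray
    ("pca", PCA(n_components=3)),
])
components = sparse_pca.fit_transform(toy_cats)
print(components.shape)
```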

### Select the 3 best numeric features

In [127]:
# use SelectKBest
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

In [128]:
# feature_union = FeatureUnion([('pca', PCA(n_components=3)), 
#                               ('select_best', SelectKBest(k=3))])

In [129]:
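from sklearn.feature_selection import f_regression

# SelectKBest defaults to f_classif, which is meant for classification targets;
# f_regression scores each feature against the continuous sales target instead
kbest_transform = Pipeline([('num_trans', numeric_transform),
                            ('k_best', SelectKBest(score_func=f_regression, k=3))])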

In [130]:
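from sklearn import set_config

# Each branch receives only its own columns from the ColumnTransformer
preprocessing = ColumnTransformer([('pca_trans', pca_transform, cat_feats),
                                   ('kbest_trans', kbest_transform, num_feats)])
set_config(display='diagram')  # render the estimator as a diagram in the notebook
preprocessing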

In [132]:
preprocessing.fit_transform(df_train, y_train)

array([[-0.30931678, -0.00770346, -0.52622962, -0.96642147,  1.74097226,
         0.13132612],
       [ 1.29265545,  0.17533549,  0.81370614, -0.90351396, -1.47938543,
         1.32756056],
       [-0.29931854, -0.00724808, -0.49836561, -0.95254341,  0.0122119 ,
         0.13132612],
       ...,
       [ 1.28162507,  0.07137987, -0.55498671,  1.03152614, -1.31177471,
         1.32756056],
       [-1.1105494 , -0.26668854, -0.54939625, -0.59377629, -0.89051728,
         0.72944334],
       [ 1.28864998,  0.17628487,  0.88344745,  1.54864725, -0.60272718,
         1.32756056]])
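
To make the feature-selection step concrete, here is a sketch of `SelectKBest` with the `f_regression` score function on synthetic data. The arrays `X_toy` and `y_toy` are made up for illustration: only the first two columns influence the target, so those are the ones we would expect to be kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
# Only the first two columns drive the target; the last two are pure noise.
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.5, size=500)

# Score each column against the continuous target and keep the best two.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)
kept = selector.get_support()
print(kept)
```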

### Fitting models

In [226]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

### Building a Pipeline

In [227]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [4]:
# model.score(df_test,y_test)
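
One possible shape for the complete Task I pipeline is sketched below on synthetic data, so that it runs on its own. Everything prefixed with `toy_`, the helper `to_dense`, and the step names are assumptions for illustration, not the exercise solution. With the exercise data, the analogous steps would be to wrap `preprocessing` and `base_model` in a `Pipeline`, fit it on `df_train` and `y_train`, and score it on `df_test` and `y_test`.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def to_dense(X):
    # Densify sparse one-hot output so PCA can consume it.
    return X.toarray() if sparse.issparse(X) else np.asarray(X)

# Hypothetical toy data standing in for the exercise frame.
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "n1": rng.normal(size=n), "n2": rng.normal(size=n),
    "n3": rng.normal(size=n), "n4": rng.normal(size=n),
    "c1": rng.choice(["a", "b", "c"], size=n),
    "c2": rng.choice(["x", "y", "z"], size=n),
})
toy_X.iloc[::25, 0] = np.nan  # a few missing values in column n1
toy_y = 5 * toy_X["n2"] + 2 * toy_X["n3"] + rng.normal(scale=0.1, size=n)
toy_num, toy_cat = ["n1", "n2", "n3", "n4"], ["c1", "c2"]

# Categorical branch: impute mode -> one-hot -> dense -> 3 principal components.
cat_branch = Pipeline([
    ("impute_mode", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ("to_dense", FunctionTransformer(to_dense)),
    ("pca", PCA(n_components=3)),
])
# Numerical branch: impute mean -> scale -> keep the 3 best features.
num_branch = Pipeline([
    ("impute_mean", SimpleImputer(strategy="mean")),
    ("scaling", StandardScaler()),
    ("k_best", SelectKBest(score_func=f_regression, k=3)),
])
toy_prep = ColumnTransformer([
    ("cat", cat_branch, toy_cat),
    ("num", num_branch, toy_num),
])
toy_model = Pipeline([("preprocessing", toy_prep), ("regressor", Ridge())])

# Fit on the first 160 rows and score (R^2) on the remaining 40.
toy_model.fit(toy_X.iloc[:160], toy_y.iloc[:160])
toy_r2 = toy_model.score(toy_X.iloc[160:], toy_y.iloc[160:])
print(f"R^2 on held-out toy rows: {toy_r2:.3f}")
```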

----------------------------
## Task II

In [208]:
from sklearn.model_selection import GridSearchCV

In [216]:
params = [
# 
]
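
A parameter grid for Task II could be laid out as a list of dictionaries, one per model, where the `regressor` step itself is treated as a tunable parameter. The sketch below shows the idea on synthetic data. The pipeline `toy_pipe`, its step names, and all grid values are assumptions for illustration, not recommended settings. In the exercise pipeline, parameter names would follow the nested step names, for example `preprocessing__pca_trans__pca__n_components` if the outer step were called `preprocessing` (a hypothetical name).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical regression data so the sketch runs on its own.
toy_X, toy_y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

toy_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# One dictionary per model; a step can be swapped by listing estimator objects.
toy_params = [
    {"scaler": [StandardScaler(), MinMaxScaler()],
     "regressor": [Ridge()],
     "regressor__alpha": [0.1, 1.0, 10.0]},
    {"regressor": [RandomForestRegressor(random_state=0)],
     "regressor__n_estimators": [20, 50],
     "regressor__max_depth": [3, None]},
    {"regressor": [GradientBoostingRegressor(random_state=0)],
     "regressor__n_estimators": [50],
     "regressor__learning_rate": [0.05, 0.1]},
]

toy_search = GridSearchCV(toy_pipe, toy_params, cv=3, scoring="r2")
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_["regressor"], round(toy_search.best_score_, 3))
```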

In [219]:
# print('Final score is: ', tuned_model.score(df_test, y_test))

Final score is:  0.6241741712069144
