## Machine Learning Pipeline - Feature Engineering with In-House Transformers
In the following notebooks, we will go through the implementation of each step in the Machine Learning Pipeline:

1. Data Analysis
2. **Feature Engineering**
3. Feature Selection
4. Model Training
5. Obtaining Predictions/Scoring

In this notebook, we will set up all the feature engineering steps within a Scikit-learn pipeline, using open-source transformers together with those developed in-house.

### Rossmann Store Sales Prediction
The aim of the project is to build an end-to-end machine learning model that predicts the sales of a given store from a set of inputs, including promotions, competition, school and state holidays, seasonality, and locality.

In [1]:
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for saving the pipeline
import joblib

# from Scikit-learn
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Binarizer
from sklearn.impute import SimpleImputer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)

from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder,
)

from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)

from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

import preprocessors as pp

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load dataset
df_sales = pd.read_csv('train.csv')
df_store = pd.read_csv('store.csv')

# rows and columns of the data
print(df_sales.shape)
print(df_store.shape)

(914629, 9)
(1115, 10)


In [3]:
# Merge
df_raw = pd.merge( df_sales, df_store, how = 'left', on = 'Store' )

print(df_raw.info())
df_raw.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 914629 entries, 0 to 914628
Data columns (total 18 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Store                      914629 non-null  int64  
 1   DayOfWeek                  914629 non-null  int64  
 2   Date                       914629 non-null  object 
 3   Sales                      914629 non-null  int64  
 4   Customers                  914629 non-null  int64  
 5   Open                       914629 non-null  int64  
 6   Promo                      914629 non-null  int64  
 7   StateHoliday               914629 non-null  object 
 8   SchoolHoliday              914629 non-null  int64  
 9   StoreType                  914629 non-null  object 
 10  Assortment                 914629 non-null  object 
 11  CompetitionDistance        912263 non-null  float64
 12  CompetitionOpenSinceMonth  623849 non-null  float64
 13  CompetitionOpenSinceYear   62

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,4,2015-04-30,6228,650,1,1,0,0,c,a,1270.0,9.0,2008.0,0,,,
1,2,4,2015-04-30,6884,716,1,1,0,0,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,4,2015-04-30,9971,979,1,1,0,0,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,4,2015-04-30,16106,1854,1,1,0,0,c,c,620.0,9.0,2009.0,0,,,
4,5,4,2015-04-30,6598,729,1,1,0,0,a,a,29910.0,4.0,2015.0,0,,,


In [4]:
# copy dataset
df1 = df_raw.copy()

# drop all rows with zero Sales amount
df1 = df1[df1['Sales'] > 0]

print(df1.shape)
df1.head()

(759848, 18)


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,4,2015-04-30,6228,650,1,1,0,0,c,a,1270.0,9.0,2008.0,0,,,
1,2,4,2015-04-30,6884,716,1,1,0,0,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,4,2015-04-30,9971,979,1,1,0,0,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,4,2015-04-30,16106,1854,1,1,0,0,c,c,620.0,9.0,2009.0,0,,,
4,5,4,2015-04-30,6598,729,1,1,0,0,a,a,29910.0,4.0,2015.0,0,,,


### Separate dataset into train and test
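When we engineer features, some techniques learn parameters from the data. These parameters must be learned from the train set only. Otherwise, information from the test set leaks into the transformations and the performance measured on the test set becomes over-optimistic.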

Our feature engineering techniques will learn:

* the mean
* the mode
* the Yeo-Johnson exponent (lambda)
* category frequencies
* category-to-number mappings

from the train set.
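
As a minimal sketch of this principle, the snippet below learns a mean from a toy train set and applies it to both sets. The data values are made up for illustration; only the column name comes from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real train/test split (values are invented).
train = pd.DataFrame({"CompetitionDistance": [100.0, 300.0, np.nan, 200.0]})
test = pd.DataFrame({"CompetitionDistance": [np.nan, 1000.0]})

# Learn the parameter from the train set only (NaN is skipped by default).
train_mean = train["CompetitionDistance"].mean()

# Apply the same learned value to both sets; the test set is never used for learning.
train_imputed = train.fillna({"CompetitionDistance": train_mean})
test_imputed = test.fillna({"CompetitionDistance": train_mean})
```

The transformers used later in this notebook follow the same pattern: `fit` learns from the train set and `transform` reuses the learned values on any other set.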

In [5]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    df1.drop(['Store', 'Sales'], axis=1), # predictive variables
    df1['Sales'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

X_train.shape, X_test.shape

((607878, 16), (151970, 16))

In [6]:
# for target, we apply the logarithm (log1p)
y_train = np.log1p(y_train) # = np.log(y_train + 1)
y_test = np.log1p(y_test)
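
Because the target is modelled on the log scale, predictions would need to be converted back to sales with the inverse function, `np.expm1`. A small round-trip sketch with illustrative values:

```python
import numpy as np

# Illustrative sales values, including zero, which log1p handles safely.
sales = np.array([0.0, 9.0, 6228.0])

log_sales = np.log1p(sales)      # forward transform: log(1 + x)
recovered = np.expm1(log_sales)  # inverse transform: exp(x) - 1
```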

In [7]:
# convert the Date variable to datetime
X_train['Date'] = pd.to_datetime(X_train['Date'])
X_test['Date'] = pd.to_datetime(X_test['Date'])

In [8]:
# extract the year, used later as the reference for the elapsed-time features
X_train['year'] = X_train['Date'].dt.year
X_test['year'] = X_test['Date'].dt.year

### Config

In [9]:

# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_MISSING = ['PromoInterval']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['Promo2SinceWeek', 'Promo2SinceYear',
                          'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear',
                          'CompetitionDistance']
NUMERICAL_VARS_IMPUTE_MEAN = ['CompetitionDistance']
NUMERICAL_VARS_IMPUTE_MODE = ['Promo2SinceWeek', 'Promo2SinceYear',
                          'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']


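# year variables to convert into elapsed time relative to REF_VAR
TEMPORAL_VARS = ['CompetitionOpenSinceYear', 'Promo2SinceYear']
REF_VAR = "year"

# variables to drop once the elapsed-time features have been computed
DROP_VARS = ['Date', 'Promo2SinceWeek', 'year', 'Promo2SinceWeek_na']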

# variables to log transform
NUMERICALS_LOG_VARS = ["Customers"]

NUMERICALS_YEO_VARS = ['CompetitionDistance']

# variables to map
ASSORT_VARS = ['Assortment']

# categorical variables to encode
CATEGORICAL_VARS = ['StateHoliday', 'StoreType', 'PromoInterval']

ASSORT_MAPPINGS = {'a':1, 'b':2, 'c': 3}

### Pipeline - Feature engineering

In [10]:
# set up pipeline for missing indicator
missing_ind_pipe = Pipeline([
    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),
])

In [11]:
# train the pipeline
missing_ind_pipe.fit(X_train, y_train)

Pipeline(steps=[('missing_indicator',
                 AddMissingIndicator(variables=['Promo2SinceWeek',
                                                'Promo2SinceYear',
                                                'CompetitionOpenSinceMonth',
                                                'CompetitionOpenSinceYear',
                                                'CompetitionDistance']))])

In [12]:
X_train = missing_ind_pipe.transform(X_train)
X_test = missing_ind_pipe.transform(X_test)
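
To make the effect of this step concrete, here is a rough pandas equivalent on toy data. It is only a sketch of the idea (a binary `<variable>_na` column per variable), not the feature-engine implementation.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (values are invented).
df = pd.DataFrame({"Promo2SinceYear": [2010.0, np.nan, 2011.0]})

# Add a binary flag per variable: 1 where the value was missing, 0 otherwise.
for var in ["Promo2SinceYear"]:
    df[var + "_na"] = df[var].isna().astype(int)
```

The flag is added before imputation, so the information that a value was missing survives once the gaps are filled.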

In [13]:
# set up pipeline for SimpleImputer
simple_imp_pipe = Pipeline([
    ('mode_imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
])

# learn the parameters from the train set
simple_imp_pipe = simple_imp_pipe.fit(X_train[NUMERICAL_VARS_IMPUTE_MODE])

X_train[NUMERICAL_VARS_IMPUTE_MODE] = simple_imp_pipe.transform(X_train[NUMERICAL_VARS_IMPUTE_MODE])
X_test[NUMERICAL_VARS_IMPUTE_MODE] = simple_imp_pipe.transform(X_test[NUMERICAL_VARS_IMPUTE_MODE])

In [15]:
# set up the pipeline
sales_pipe = Pipeline([
    # ===== IMPUTATION =====
    # impute categorical variables with string missing
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),

    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_IMPUTE_MEAN)),


    # == TEMPORAL VARIABLES ====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),

    ('drop_features', DropFeatures(features_to_drop=DROP_VARS)),


    # ==== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),


    # === mappers ===
    ('mapper_assort', pp.Mapper(
        variables=ASSORT_VARS, mappings=ASSORT_MAPPINGS)),

    # ==== CATEGORICAL ENCODING =====
    # group categories present in less than 1% of the observations
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical variables as integers ordered by the target mean per category
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])
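
The `preprocessors` module (`pp`) is not shown in this notebook. The sketch below is one hypothetical way the two in-house transformers could be written so that they work inside a Scikit-learn pipeline. The class bodies are assumptions inferred from how they are called above; in particular, computing elapsed time as reference year minus the variable is a guess.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TemporalVariableTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: replace year variables with years elapsed until a reference year."""

    def __init__(self, variables, reference_variable):
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # Nothing to learn; fit exists for pipeline compatibility.
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's dataframe
        for var in self.variables:
            X[var] = X[self.reference_variable] - X[var]
        return X


class Mapper(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: replace category labels using a fixed dictionary."""

    def __init__(self, variables, mappings):
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].map(self.mappings)
        return X


# Usage on toy data (values are invented).
demo = pd.DataFrame({
    "year": [2015, 2014],
    "CompetitionOpenSinceYear": [2008.0, 2014.0],
    "Assortment": ["a", "c"],
})
demo = TemporalVariableTransformer(["CompetitionOpenSinceYear"], "year").fit_transform(demo)
demo = Mapper(["Assortment"], {"a": 1, "b": 2, "c": 3}).fit_transform(demo)
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params`, `set_params` and `fit_transform`, which is what lets custom steps sit next to the feature-engine transformers in a `Pipeline`.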

In [16]:
# train the pipeline
sales_pipe.fit(X_train, y_train)

Pipeline(steps=[('missing_imputation',
                 CategoricalImputer(variables=['PromoInterval'])),
                ('mean_imputation',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['CompetitionDistance'])),
                ('elapsed_time',
                 TemporalVariableTransformer(reference_variable='year',
                                             variables=['CompetitionOpenSinceYear',
                                                        'Promo2SinceYear'])),
                ('drop_features',
                 DropFeatures(featur...
                 YeoJohnsonTransformer(variables=['CompetitionDistance'])),
                ('mapper_assort',
                 Mapper(mappings={'a': 1, 'b': 2, 'c': 3},
                        variables=['Assortment'])),
                ('rare_label_encoder',
                 RareLabelEncoder(n_categories=1, tol=0.01,
                                  variables=['StateHoliday', 'St

In [17]:
X_train = sales_pipe.transform(X_train)
X_test = sales_pipe.transform(X_test)
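
The `ordered` encoding used by the last pipeline step can be illustrated in plain pandas: categories are ranked by the mean of the target and replaced by their rank. This is a sketch of the idea on invented data, not the feature-engine code itself.

```python
import pandas as pd

# Toy categorical feature and target (values are invented).
X = pd.DataFrame({"StoreType": ["a", "a", "b", "b", "c", "c"]})
y = pd.Series([10.0, 12.0, 30.0, 34.0, 20.0, 22.0])

# Mean of the target per category, sorted from lowest to highest.
ordered_categories = y.groupby(X["StoreType"]).mean().sort_values().index

# Assign consecutive integers following that order.
mapping = {category: rank for rank, category in enumerate(ordered_categories)}
encoded = X["StoreType"].map(mapping)
```

Because the mapping depends on the target, it has to be learned on the train set and then reused unchanged on the test set.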

In [19]:
# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

[]

In [20]:
# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

[]

In [21]:
X_train.head()

Unnamed: 0,DayOfWeek,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceYear,PromoInterval,Promo2SinceYear_na,CompetitionOpenSinceMonth_na,CompetitionOpenSinceYear_na,CompetitionDistance_na
909242,6,6.09131,1,0,1,0,1,1,9.563479,9.0,0.0,1,2.0,1,0,1,1,0
865171,4,6.171701,1,0,1,0,1,3,7.70605,9.0,0.0,1,2.0,2,0,1,1,0
826279,4,6.54103,1,1,1,0,1,1,10.104742,5.0,6.0,1,0.0,2,0,0,0,0
524907,1,7.388328,1,1,1,0,0,3,16.814807,9.0,0.0,1,4.0,2,0,1,1,0
589075,6,6.082219,1,0,1,0,1,1,11.588098,9.0,0.0,0,2.0,3,1,1,1,0
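
`joblib` is imported at the top of the notebook for saving the pipeline, although no saving step appears here. A minimal sketch of persisting and reloading a fitted pipeline follows. The toy pipeline, the data and the file name `sales_pipe.joblib` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small stand-in pipeline fitted on toy data.
pipe = Pipeline([("scaler", MinMaxScaler())])
X = np.array([[0.0], [5.0], [10.0]])
pipe.fit(X)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales_pipe.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original transformation.
same_output = np.allclose(pipe.transform(X), restored.transform(X))
```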
