# Fast Moving Consumer Goods Sales Forecast - Part III

# Intermediate Machine Learning

Welcome to SCM.256 Week 2: **Intermediate Machine Learning**!

Now that you have had an introduction to machine learning last week, you will learn how to quickly improve the quality of your models!  

This week you will accelerate your machine learning expertise by learning how to:
- Tackle data types often found in real-world datasets (**categorical variables**)
- Design **pipelines** to improve the quality of your machine learning code
- Use advanced techniques for model validation (**cross-validation**)
- Build state-of-the-art ensemble models with gradient boosted trees (**XGBoost**)
- Avoid common and important data science mistakes (**leakage**).

You will apply your knowledge with data about [weekly retail sales at Walmart stores](https://www.kaggle.com/datasets/rutuspatel/walmart-dataset-retail). The example Walmart Retail dataset is at the file path **`Walmart_Store_sales.csv`**.

You will use different explanatory variables to forecast FMCG weekly sales.  

## Setting Up the Workspace

In [1]:
# Install packages
#!pip install pandas


In [2]:
# Import packages
import pandas as pd
from datetime import datetime

#import matplotlib.pyplot as plt
#import numpy as np 
#import seaborn as sns 

# Import required sklearn modules --------
# Split X and y into training and testing sets
from sklearn.model_selection import train_test_split

# Import the preprocessing class
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
#from sklearn.preprocessing import StandardScaler

# Impute values
from sklearn.impute import SimpleImputer 

# Create pipelines
from sklearn.pipeline import Pipeline 

# Transform columns
from sklearn.compose import ColumnTransformer 

# Import the model class
from sklearn.ensemble import RandomForestRegressor 

# Import the metrics class
from sklearn.metrics import mean_absolute_error 
from sklearn.metrics import mean_absolute_percentage_error 
from sklearn.metrics import mean_squared_error 

# Configures sklearn to display pipeline diagrams
from sklearn import set_config
set_config(display="diagram")




## Loading the Data

In [3]:
#import pandas as pd # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.model_selection import train_test_split # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Load data
walmart_file_path = 'https://www.dropbox.com/s/ns7envvzoqyypui/Walmart_Store_sales.csv?dl=1'
#data = pd.read_csv(walmart_file_path, dtype={'Store' : 'int'}) 
# Read the data into a DataFrame named walmart_data
# Parse the Date column from day-month-year strings into dates
walmart_data = pd.read_csv(walmart_file_path,parse_dates=['Date'], date_parser=lambda x: datetime.strptime(x, '%d-%m-%Y').date()) 
walmart_data = walmart_data.sort_values(['Date','Store'])
walmart_data.Store = walmart_data.Store.astype('category')

# Select target and predictors
y = walmart_data.Weekly_Sales
#walmart_features = ['Fuel_Price', 'Unemployment', 'CPI', 'Temperature', 'Holiday_Flag']
#X = data[walmart_features]
X = walmart_data.drop(['Weekly_Sales'], axis=1)

# Split data into training and validation subsets, for both features and target
# Because shuffle=False, the split is chronological: the first 80% of the date-sorted rows form the training subset and the last 20% the validation subset.
# random_state only matters when shuffle=True, so it has no effect on this split.
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=False,
                                                                random_state=123)
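
A minimal sketch of what `shuffle=False` does, using a hypothetical ten-row toy frame (`toy`, `toy_train`, and `toy_valid` are illustrative names, not part of the Walmart data). The rows are assumed to be sorted by date already, as `walmart_data` is after `sort_values`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 weekly observations, already sorted by date
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=10, freq="7D"),
    "Weekly_Sales": range(10),
})

# shuffle=False keeps the row order: the first 80% of rows go to training,
# the last 20% to validation
toy_train, toy_valid = train_test_split(toy, test_size=0.2, shuffle=False)

# Every training date precedes every validation date
print(toy_train["Date"].max() < toy_valid["Date"].min())
```

A chronological split like this keeps the validation rows later in time than the training rows, which mirrors how a forecast is used in practice.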

### Inspect the Features in the Training Data Subset

In [4]:
X_train_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5148 entries, 0 to 2545
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Store         5148 non-null   category      
 1   Date          5148 non-null   datetime64[ns]
 2   Holiday_Flag  5148 non-null   int64         
 3   Temperature   5148 non-null   float64       
 4   Fuel_Price    5148 non-null   float64       
 5   CPI           5148 non-null   float64       
 6   Unemployment  5148 non-null   float64       
dtypes: category(1), datetime64[ns](1), float64(4), int64(1)
memory usage: 288.0 KB


In [5]:
X_train_full.head()

Unnamed: 0,Store,Date,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-02-05,0,42.31,2.572,211.096358,8.106
143,2,2010-02-05,0,40.19,2.572,210.752605,8.324
286,3,2010-02-05,0,45.71,2.572,214.424881,7.368
429,4,2010-02-05,0,43.76,2.598,126.442065,8.623
572,5,2010-02-05,0,39.7,2.572,211.653972,6.566


In [6]:
X_train_full.tail()

Unnamed: 0,Store,Date,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1973,14,2012-04-13,0,51.83,4.044,190.759596,8.567
2116,15,2012-04-13,0,43.52,4.187,137.868,8.15
2259,16,2012-04-13,0,45.83,3.901,197.780931,6.169
2402,17,2012-04-13,0,46.94,3.833,131.108,6.235
2545,18,2012-04-13,0,47.75,4.025,137.868,8.304


In [7]:
X_train_full.describe(datetime_is_numeric=True).T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
Date,5148.0,2011-03-08 21:46:34.405594624,2010-02-05 00:00:00,2010-08-20 00:00:00,2011-03-11 00:00:00,2011-09-23 00:00:00,2012-04-13 00:00:00,
Holiday_Flag,5148.0,0.078671,0.0,0.0,0.0,0.0,1.0,0.269251
Temperature,5148.0,57.931014,-2.06,44.42,58.7,72.02,100.14,18.788346
Fuel_Price,5148.0,3.260448,2.472,2.837,3.236,3.644,4.294,0.446567
CPI,5148.0,170.536597,126.064,131.686,182.551954,211.406287,225.256244,38.923927
Unemployment,5148.0,8.179414,4.125,7.1805,8.021,8.625,14.313,1.879173


### Inspect the Features in the Validation Data Subset

In [8]:
X_valid_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1287 entries, 2688 to 6434
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Store         1287 non-null   category      
 1   Date          1287 non-null   datetime64[ns]
 2   Holiday_Flag  1287 non-null   int64         
 3   Temperature   1287 non-null   float64       
 4   Fuel_Price    1287 non-null   float64       
 5   CPI           1287 non-null   float64       
 6   Unemployment  1287 non-null   float64       
dtypes: category(1), datetime64[ns](1), float64(4), int64(1)
memory usage: 73.0 KB


In [9]:
X_valid_full.head()

Unnamed: 0,Store,Date,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
2688,19,2012-04-13,0,44.42,4.187,137.868,8.15
2831,20,2012-04-13,0,45.68,4.044,214.312703,7.139
2974,21,2012-04-13,0,69.03,3.891,221.148403,6.891
3117,22,2012-04-13,0,49.89,4.025,141.843393,7.671
3260,23,2012-04-13,0,41.81,4.025,137.868,4.125


In [10]:
X_valid_full.tail()

Unnamed: 0,Store,Date,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
5862,41,2012-10-26,0,41.8,3.686,199.219532,6.195
6005,42,2012-10-26,0,70.5,4.301,131.193097,6.943
6148,43,2012-10-26,0,69.17,3.506,214.741539,8.839
6291,44,2012-10-26,0,46.97,3.755,131.193097,5.217
6434,45,2012-10-26,0,58.85,3.882,192.308899,8.667


In [11]:
X_valid_full.describe(datetime_is_numeric=True).T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
Date,1287.0,2012-07-21 08:53:42.377622272,2012-04-13 00:00:00,2012-06-01 00:00:00,2012-07-20 00:00:00,2012-09-07 00:00:00,2012-10-26 00:00:00,
Holiday_Flag,1287.0,0.034965,0.0,0.0,0.0,0.0,1.0,0.183763
Temperature,1287.0,71.594856,36.9,64.255,72.9,80.53,100.07,11.824894
Fuel_Price,1287.0,3.751242,3.187,3.5795,3.73,3.921,4.468,0.251284
CPI,1287.0,175.745581,130.683,137.923067,191.00281,221.255812,227.232807,40.792466
Unemployment,1287.0,7.278099,3.879,6.17,7.139,8.253,11.627,1.679838


## RECAP: Fast Moving Consumer Goods Sales Forecast - Part II

---
# Fast Moving Consumer Goods Sales Forecast - Part III

In this section you will learn what a **categorical variable** is, along with three approaches for handling this type of data.

# 1. Data Handling: Categorical Variables

A **categorical variable** takes only a limited number of values.  

- Consider a web survey that asks an online consumer to rate satisfaction with a product on a scale ranging from 
    * Very satisfied
    * Somewhat satisfied
    * Neither satisfied nor dissatisfied
    * Somewhat dissatisfied
    * Very dissatisfied.  
    
In this case, the data is categorical, because responses fall into a fixed set of categories.

- In our FMCG data, the store identifier (ID) is a number ranging from 1 to 45.  This data is also categorical, because the ordering of the numbers carries no numerical meaning.

You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first.  In this section, we'll compare three approaches that you can use to prepare your categorical data.

## Three Approaches

### 1) Drop Categorical Variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset.  This approach will only work well if the columns do not contain useful information.

### 2) Ordinal Encoding

**Ordinal encoding** assigns each unique value to a different integer.

![ordinalencode](https://i.imgur.com/PlXjEbC.png)

This approach assumes an ordering of the categories: "Small" (0) < "Medium" (1) < "Large" (2) < "Hyper" (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories.  Not all categorical variables have a clear ordering in the values, but we refer to those that do as **ordinal variables**.  For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables. 
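
As a small illustration of the ordering above, the sketch below encodes a hypothetical store-size column (`sizes` and `size_encoder` are illustrative names; the Walmart data has no such column). Passing the category list explicitly is what fixes the order "Small" < "Medium" < "Large" < "Hyper".

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical store-size column (not part of the Walmart data)
sizes = pd.DataFrame({"Size": ["Medium", "Small", "Hyper", "Large", "Small"]})

# The explicit list sets the ranking: Small -> 0, Medium -> 1, Large -> 2, Hyper -> 3
size_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large", "Hyper"]])
encoded_sizes = size_encoder.fit_transform(sizes[["Size"]])

print(encoded_sizes.ravel())  # [1. 0. 3. 2. 0.]
```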

### 3) One-Hot Encoding

**One-hot encoding** creates new columns indicating the presence (or absence) of each possible value in the original data.  To understand this, we'll work through an example.

![onehot](https://i.imgur.com/tzM3nmt.png)

The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset.  

In contrast to ordinal encoding, one-hot encoding *does not* assume an ordering of the categories.  Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data.  We refer to categorical variables without an intrinsic ranking as **nominal variables**.

One-hot encoding generally does not perform well if the categorical variable takes on a very large number of values. 
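
The idea can be sketched on a hypothetical three-color column (`colors`, `color_encoder`, and `one_hot_df` are illustrative names). The encoder's default sparse output is converted with `.toarray()` so the result is easy to inspect.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal variable with three possible values
colors = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Yellow"]})

color_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array
one_hot = color_encoder.fit_transform(colors[["Color"]]).toarray()

# One column per category (sorted alphabetically), one row per original row
one_hot_df = pd.DataFrame(one_hot, columns=color_encoder.categories_[0], index=colors.index)
print(one_hot_df)
```

Each row contains exactly one 1, in the column of the category that row had. A variable with 45 stores therefore becomes 45 columns, which is why very high-cardinality variables become unwieldy.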



## Example

As last week, we will work with data about [weekly retail sales at Walmart stores](https://www.kaggle.com/datasets/rutuspatel/walmart-dataset-retail). 

We won't focus on the data loading step, since we discussed this last week. Instead, you can imagine you are at a point where you already have the training and validation data in `X_train_full`, `X_valid_full`, `y_train`, and `y_valid`, as loaded above at the top of the notebook.

In [12]:
# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# Select categorical columns
categorical_cols = [cname for cname in X_train_full.select_dtypes('category')]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.select_dtypes(['int64', 'float64'])]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()


In [13]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5148 entries, 0 to 2545
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   Store         5148 non-null   category
 1   Holiday_Flag  5148 non-null   int64   
 2   Temperature   5148 non-null   float64 
 3   Fuel_Price    5148 non-null   float64 
 4   CPI           5148 non-null   float64 
 5   Unemployment  5148 non-null   float64 
dtypes: category(1), float64(4), int64(1)
memory usage: 247.7 KB


In [14]:
X_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1287 entries, 2688 to 6434
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   Store         1287 non-null   category
 1   Holiday_Flag  1287 non-null   int64   
 2   Temperature   1287 non-null   float64 
 3   Fuel_Price    1287 non-null   float64 
 4   CPI           1287 non-null   float64 
 5   Unemployment  1287 non-null   float64 
dtypes: category(1), float64(4), int64(1)
memory usage: 63.0 KB


We take a peek at the training data with the `head()` method below. 

In [15]:
X_train.head()

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,0,42.31,2.572,211.096358,8.106
143,2,0,40.19,2.572,210.752605,8.324
286,3,0,45.71,2.572,214.424881,7.368
429,4,0,43.76,2.598,126.442065,8.623
572,5,0,39.7,2.572,211.653972,6.566


### Define Function to Measure Quality of Each Approach

We define a function `score_dataset()` to compare the three different approaches to dealing with categorical variables. This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.  In general, we want the MAE to be as low as possible!
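
To make the metric concrete, here is a small worked example on made-up numbers (the arrays `actual` and `predicted` are purely illustrative). It computes the MAE by hand and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative values only, not taken from the Walmart data
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# MAE is the average of the absolute errors: (10 + 10 + 30) / 3
mae_by_hand = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

The MAE is expressed in the same units as the target, so for `Weekly_Sales` it reads as an average error in sales currency per store-week.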

In [16]:
#from sklearn.ensemble import RandomForestRegressor # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.metrics import mean_absolute_error # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### Score from Approach 1: Drop Categorical Variables

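We drop the categorical columns (here, `Store`) with the [`drop()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method, using the `categorical_cols` list that we built earlier with [`select_dtypes()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html). 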

In [17]:
drop_X_train = X_train.drop(categorical_cols, axis=1)
drop_X_valid = X_valid.drop(categorical_cols, axis=1)

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
440064.5559917593


### Score from Approach 2: Ordinal Encoding

Scikit-learn has an [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) class that can be used to get ordinal encodings.  We fit the encoder on the training data and apply it to all columns in `categorical_cols` at once (here, only `Store`), passing the list of store IDs through the `categories` argument to fix the integer assigned to each store.

In [18]:
#from sklearn.preprocessing import OrdinalEncoder # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Make copy to avoid changing original data 
ordinal_X_train = X_train.copy()
ordinal_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder(categories=[list(walmart_data.Store.unique())])
ordinal_X_train[categorical_cols] = ordinal_encoder.fit_transform(X_train[categorical_cols])
ordinal_X_valid[categorical_cols] = ordinal_encoder.transform(X_valid[categorical_cols])

print("MAE from Approach 2 (Ordinal Encoding):") 
print(score_dataset(ordinal_X_train, ordinal_X_valid, y_train, y_valid))

MAE from Approach 2 (Ordinal Encoding):
102223.73129355072


In [19]:
ordinal_X_train.head()

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,0.0,0,42.31,2.572,211.096358,8.106
143,1.0,0,40.19,2.572,210.752605,8.324
286,2.0,0,45.71,2.572,214.424881,7.368
429,3.0,0,43.76,2.598,126.442065,8.623
572,4.0,0,39.7,2.572,211.653972,6.566


In the code cell above, the `categories` argument lists the store IDs in the order they first appear in the sorted data, so store 1 is encoded as 0, store 2 as 1, and so on.  Because a store ID has no intrinsic ranking, this assignment is arbitrary.  Using an arbitrary order is simpler than providing custom labels; however, we can expect an additional boost in performance if we provide better-informed labels for all ordinal variables.

### Score from Approach 3: One-Hot Encoding

We use the [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) class from scikit-learn to get one-hot encodings.  There are a number of parameters that can be used to customize its behavior.  
- We set `handle_unknown='error'` (the default), so the encoder raises an error if the validation data contains classes that aren't represented in the training data.  The commented-out alternative, `handle_unknown='ignore'`, would instead encode such classes as all zeros, and
- setting `sparse=False` ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix).  In scikit-learn 1.2 and later, this parameter is named `sparse_output`.

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded.  For instance, to encode the training data, we supply `X_train[categorical_cols]`. (`categorical_cols` in the code cell below is a list of the column names with categorical data, and so `X_train[categorical_cols]` contains all of the categorical data in the training set.)
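
The effect of `handle_unknown` can be sketched on toy data, assuming a hypothetical store ID 99 that never appears in training (`train_stores`, `new_stores`, and the encoder names are illustrative). The default sparse output is converted with `.toarray()`, so the sketch does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_stores = pd.DataFrame({"Store": [1, 2, 3]})
new_stores = pd.DataFrame({"Store": [2, 99]})  # 99 is a hypothetical unseen store

# 'ignore': an unseen category becomes a row of zeros
ignore_encoder = OneHotEncoder(handle_unknown="ignore").fit(train_stores)
ignored = ignore_encoder.transform(new_stores).toarray()
print(ignored)

# 'error': an unseen category raises a ValueError at transform time
strict_encoder = OneHotEncoder(handle_unknown="error").fit(train_stores)
try:
    strict_encoder.transform(new_stores)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Which setting is preferable depends on the use case: `'error'` surfaces unexpected categories immediately, while `'ignore'` keeps the prediction step running.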

In [20]:
#from sklearn.preprocessing import OneHotEncoder # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Apply one-hot encoder to each column with categorical data
#OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='error', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[categorical_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[categorical_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(categorical_cols, axis=1)
num_X_valid = X_valid.drop(categorical_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([OH_cols_train, num_X_train], axis=1)
OH_X_valid = pd.concat([OH_cols_valid, num_X_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train.values, OH_X_valid.values, y_train.values, y_valid.values))

MAE from Approach 3 (One-Hot Encoding):
90743.75075244748


### Which approach is best?

In this case, dropping the categorical columns (**Approach 1**) performed worst, with by far the highest MAE (about 440,000).  Of the other two approaches, one-hot encoding (**Approach 3**) achieved a lower MAE than ordinal encoding (**Approach 2**): about 90,700 versus about 102,200, or roughly 11% lower.  This is consistent with `Store` being a nominal variable with no intrinsic ordering.

One-hot encoding (**Approach 3**) typically performs best, and dropping the categorical columns (**Approach 1**) typically performs worst, but results vary on a case-by-case basis. 

## Take-away: Categorical Data

*The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!*

---
In this section, you will learn how to use **pipelines** to clean up your modeling code.

# 2. Pipelines

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized.  Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
1. **Cleaner Code:** Accounting for data at each step of preprocessing can get messy.  With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. **Fewer Bugs:** There are fewer opportunities to misapply a step or forget a preprocessing step.
3. **Easier to Productionize:** It can be surprisingly hard to transition a model from a prototype to something deployable at scale.  We won't go into the many related concerns here, but pipelines can help.
4. **More Options for Model Validation:** You will see an example in the next section, which covers cross-validation.
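
As a minimal sketch of the "single step" idea, the toy pipeline below bundles an imputer and a random forest. The arrays `X_toy` and `y_toy` are made up, and the mean-imputation step is an assumption chosen only so that the example has a preprocessing step to bundle.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Made-up data with one missing feature value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # fills NaN with the training mean
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# One call preprocesses the training data and fits the model
toy_pipeline.fit(X_toy, y_toy)

# One call preprocesses new data (including the NaN) and predicts
toy_preds = toy_pipeline.predict(np.array([[3.0], [np.nan]]))
print(toy_preds)
```

Note that the raw data, missing value included, is passed straight to `predict()`; the imputation learned during `fit()` is reapplied automatically.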

## Example

As in the previous week, we will work with the Walmart data.  

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in `X_train`, `X_valid`, `y_train`, and `y_valid`. 

In [21]:
walmart_data.dtypes

Store                 category
Date            datetime64[ns]
Weekly_Sales           float64
Holiday_Flag             int64
Temperature            float64
Fuel_Price             float64
CPI                    float64
Unemployment           float64
dtype: object

In [22]:
y_train.describe()

count    5.148000e+03
mean     1.049825e+06
std      5.721647e+05
min      2.099862e+05
25%      5.516958e+05
50%      9.572268e+05
75%      1.417949e+06
max      3.818686e+06
Name: Weekly_Sales, dtype: float64

We take a peek at the training data with the `head()` method below.  Notice that the data contains categorical data.  With a pipeline, it's easy to deal with this!

In [23]:
X_train.head()

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,0,42.31,2.572,211.096358,8.106
143,2,0,40.19,2.572,210.752605,8.324
286,3,0,45.71,2.572,214.424881,7.368
429,4,0,43.76,2.598,126.442065,8.623
572,5,0,39.7,2.572,211.653972,6.566


We construct the full pipeline in three steps.

### Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the `ColumnTransformer` class to bundle together different preprocessing steps.  The code below:
- applies a one-hot encoding to **_categorical_** data  
- passes **_numerical_** data through unchanged (`remainder='passthrough'`).

Since the Walmart data contains no missing values, the imputation and scaling steps are left commented out.
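
To see how `ColumnTransformer` routes columns, consider this sketch on a hypothetical four-row frame (`toy_X` and `toy_preprocessor` are illustrative names). Setting `sparse_threshold=0` is an assumption made here so that the result is always a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the feature table
toy_X = pd.DataFrame({
    "Store": [1, 2, 3, 1],
    "Temperature": [42.3, 40.2, 45.7, 38.5],
    "Holiday_Flag": [0, 0, 0, 1],
})

toy_preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["Store"])],  # one-hot encode Store only
    remainder="passthrough",  # keep the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)
toy_transformed = toy_preprocessor.fit_transform(toy_X)

# 3 one-hot columns for Store, followed by the 2 passed-through columns
print(toy_transformed.shape)  # (4, 5)
```

The transformed columns come first, in the order of the `transformers` list, and the passed-through columns are appended after them.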

In [24]:
#from sklearn.pipeline import Pipeline # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.impute import SimpleImputer # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.preprocessing import StandardScaler # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.preprocessing import OneHotEncoder # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.compose import ColumnTransformer # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Preprocessing for numerical data
#numerical_transformer = Pipeline(steps=[
#    ('imputer', SimpleImputer(strategy='constant')),
#    ('scaler', StandardScaler())
#])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
#    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='error', sparse=False))
])
# The Pipeline class is like a railway track with a list of different stations (steps).
# Each step is a tuple giving the name of the step and the transformer to apply.

categorical_transformer

In [25]:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
#        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ], remainder='passthrough') 
# The ColumnTransformer class is like a railway switch: it decides what happens to the specified train wagons (data columns).
# The transformers list gives the different branches where columns can go.
# Each transformer is a tuple giving the name of the transformer, the transformer to apply (e.g. the Pipeline defined above), and the columns to transform.
# By default, ColumnTransformer drops every column that is not explicitly listed in the transformers.
# With remainder='passthrough', the columns that you do not mention are kept and passed through untransformed.

preprocessor

### Step 2: Define the Model

Next, we define a random forest model with the familiar [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) class.

In [26]:
#from sklearn.ensemble import RandomForestRegressor # --> DONE UPFRONT (see top section "Setting Up the Workspace")

model = RandomForestRegressor(n_estimators=100, random_state=0)

### Step 3: Create and Evaluate the Pipeline

Finally, we use the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class to define a pipeline that bundles the preprocessing and modeling steps.  There are a few important things to notice:
- With the pipeline, we preprocess the training data and fit the model in a single line of code.  (_In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps.  This becomes especially messy if we have to deal with both numerical and categorical variables!_)
- With the pipeline, we supply the unprocessed features in `X_valid` to the `predict()` command, and the pipeline automatically preprocesses the features before generating predictions.  (_However, without a pipeline, we have to remember to preprocess the validation data before making predictions._)

In [27]:
X_train

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,0,42.31,2.572,211.096358,8.106
143,2,0,40.19,2.572,210.752605,8.324
286,3,0,45.71,2.572,214.424881,7.368
429,4,0,43.76,2.598,126.442065,8.623
572,5,0,39.70,2.572,211.653972,6.566
...,...,...,...,...,...,...
1973,14,0,51.83,4.044,190.759596,8.567
2116,15,0,43.52,4.187,137.868000,8.150
2259,16,0,45.83,3.901,197.780931,6.169
2402,17,0,46.94,3.833,131.108000,6.235


In [28]:
#from sklearn.metrics import mean_absolute_error # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.metrics import mean_absolute_percentage_error # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.metrics import mean_squared_error # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
# Here the Pipeline class is again like a railway track, with a higher-level list of different stations (steps).
# Each step is a tuple giving the name of the step and the transformer or model to apply.
my_pipeline

In [29]:
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE: OneHot encoder', score)
print('MAPE: OneHot encoder', mean_absolute_percentage_error(y_valid, preds))
print('RMSE: OneHot encoder', mean_squared_error(y_valid, preds,squared=False))

MAE: OneHot encoder 90743.75075244748
MAPE: OneHot encoder 0.09192409352102003
RMSE: OneHot encoder 135901.4699788533


Example of a pipeline with ordinal encoding (i.e. Approach 2 above). This can yield a useful, parsimonious model when the categories of a categorical variable have an indisputable ranked ordering.

In [30]:
#from sklearn.metrics import mean_absolute_error  # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.preprocessing import OrdinalEncoder  # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=[list(walmart_data.Store.unique())]))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
 #       ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ],remainder='passthrough')

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
my_pipeline

In [31]:
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE: Ordinal Encoder', score)
print('MAPE: Ordinal Encoder', mean_absolute_percentage_error(y_valid, preds))
print('RMSE: Ordinal encoder', mean_squared_error(y_valid, preds,squared=False))

MAE: Ordinal Encoder 102223.73129355072
MAPE: Ordinal Encoder 0.1023973195668454
RMSE: Ordinal encoder 182382.46364013187


Example of a pipeline with the categorical variables dropped (i.e. Approach 1 above). This will generally only work well if the dropped columns do not contain meaningful information.

In [32]:
#from sklearn.metrics import mean_absolute_error  # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.metrics import mean_absolute_percentage_error  # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.preprocessing import OrdinalEncoder  # --> DONE UPFRONT (see top section "Setting Up the Workspace")

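# No categorical columns are encoded here: `Store` is dropped before fitting.
# Note that this overwrites the categorical_cols list defined earlier.
categorical_cols = []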

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='error'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
   #     ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ], remainder='passthrough')

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
my_pipeline

In [33]:
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train.drop('Store',axis=1), y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid.drop('Store',axis=1))

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE: NO Categories', score)

print('MAPE: NO Categories', mean_absolute_percentage_error(y_valid, preds))
print('RMSE: NO Categories', mean_squared_error(y_valid, preds,squared=False))

MAE: NO Categories 440064.5559917593
MAPE: NO Categories 0.7195844501980123
RMSE: NO Categories 555118.620790233


## Take-away: Pipelines

*Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.*

In the next class, you will learn how to use **cross-validation** for better measures of model performance. 
While it's _possible_ to do cross-validation without pipelines, it is quite difficult!  Using a pipeline will make the code remarkably straightforward.
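
As a brief preview, and only a sketch on synthetic data, the snippet below passes a pipeline to scikit-learn's `cross_val_score`. The arrays `X_cv` and `y_cv`, the five folds, and the small forest are all assumptions chosen to keep the example fast; for time-ordered data such as weekly sales, a time-aware splitter may be more appropriate than plain k-fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(60, 3))
y_cv = 2.0 * X_cv[:, 0] + rng.normal(scale=0.1, size=60)

cv_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# scikit-learn reports the negated MAE so that higher is better; flip the sign back
fold_mae = -cross_val_score(cv_pipeline, X_cv, y_cv, cv=5,
                            scoring="neg_mean_absolute_error")
print(fold_mae.mean())
```

Because the whole pipeline is passed in, the preprocessing is refitted on each training fold, so no information from the held-out fold leaks into it.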