# Fast Moving Consumer Goods Sales Forecast - Part IV


Today:
- Use advanced techniques for model validation (**cross-validation**)

You will apply your knowledge with data about [weekly retail sales at Walmart stores](https://www.kaggle.com/datasets/rutuspatel/walmart-dataset-retail). The example Walmart Retail dataset is at the file path **`Walmart_Store_sales.csv`**.

You will use different explanatory variables to forecast FMCG weekly sales.  

## Setting Up the Workspace

In [1]:
# Install packages
#!pip install pandas
#!pip install xgboost

In [2]:
# Import packages
import pandas as pd
from datetime import datetime

#import matplotlib.pyplot as plt
#import numpy as np 
#import seaborn as sns 

# Import required sklearn modules --------
# Split X and y into training and testing sets
#from sklearn.model_selection import train_test_split

# Perform cross validation
from sklearn.model_selection import cross_val_score  

# Import the preprocessing class
#from sklearn.preprocessing import StandardScaler
#from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# Impute values
#from sklearn.impute import SimpleImputer 

# Create pipelines
from sklearn.pipeline import Pipeline 
#from sklearn.pipeline import make_pipeline

# Transform columns
from sklearn.compose import ColumnTransformer 

# Import the model class
from sklearn.ensemble import RandomForestRegressor 
#from sklearn.ensemble import RandomForestClassifier 

# Import the metrics class
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import mean_absolute_error 
from sklearn.metrics import mean_absolute_percentage_error 
from sklearn.metrics import mean_squared_error 

# Configures sklearn to display pipeline diagrams
from sklearn import set_config
set_config(display="diagram")

# Import required xgboost modules --------
#from xgboost import XGBRegressor



## Loading the Data

In [3]:
#import pandas as pd # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.model_selection import train_test_split # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Load data
walmart_file_path = 'https://www.dropbox.com/s/ns7envvzoqyypui/Walmart_Store_sales.csv?dl=1'
#data = pd.read_csv(walmart_file_path, dtype={'Store' : 'category'}) 
# Read the data and store it in a DataFrame named walmart_data
# Parse the Date column (day-month-year strings) into pandas datetimes
# (the `date_parser` argument of read_csv is deprecated in pandas 2.0+, so we convert explicitly)
walmart_data = pd.read_csv(walmart_file_path)
walmart_data['Date'] = pd.to_datetime(walmart_data['Date'], format='%d-%m-%Y')
walmart_data = walmart_data.sort_values(['Date','Store'])
walmart_data.Store = walmart_data.Store.astype('category')

# Select target and predictors
y = walmart_data.Weekly_Sales
#walmart_features = ['Fuel_Price', 'Unemployment', 'CPI', 'Temperature', 'Holiday_Flag']
#X = data[walmart_features]
X = walmart_data.drop(['Weekly_Sales'], axis=1)

# Split data into training and validation subsets, for both features and target
# The split is based on a random number generator. 
# Supplying a numeric value to the random_state argument guarantees we get the same split every time we run this script.
#X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
#                                                                random_state=123)

# Select categorical columns
#categorical_cols = [cname for cname in X_train_full.select_dtypes('category')]
categorical_cols = [cname for cname in X.select_dtypes('category')]

# Select numerical columns
#numerical_cols = [cname for cname in X_train_full.select_dtypes(['int64', 'float64'])]
numerical_cols = [cname for cname in X.select_dtypes(['int64', 'float64'])]

# Keep selected columns only
#my_cols = categorical_cols + numerical_cols
#X_train = X_train_full[my_cols].copy()
#X_valid = X_valid_full[my_cols].copy()

In [4]:
categorical_cols

['Store']

In [5]:
numerical_cols

['Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']

## RECAP: Fast Moving Consumer Goods Sales Forecast - Part III

## 1. Data Handling: Categorical Variables

### Approach 1: Drop Categorical Variables
The easiest approach to dealing with categorical variables is to simply remove them from the dataset.  This approach only works well if the columns do not contain useful information.

### Approach 2: Ordinal Encoding
This approach assumes an indisputable ordering of the categories.  

### Approach 3: One-Hot Encoding
This approach creates new columns indicating the presence (or absence) of each possible category in the original categorical variable. 
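
As a quick reminder of what the three approaches do to the data, here is a minimal sketch using pandas only. The `toy` DataFrame is a made-up stand-in (not the Walmart data), and `pd.get_dummies` is used here purely for illustration instead of the `OneHotEncoder` class used in the notebook.

```python
import pandas as pd

# Hypothetical toy data: a categorical store label and one numerical column
toy = pd.DataFrame({
    "Store": pd.Categorical([1, 2, 3, 1]),
    "Temperature": [42.3, 38.5, 39.9, 46.6],
})

# Approach 1: drop the categorical column entirely
dropped = toy.drop(columns=["Store"])

# Approach 2: ordinal encoding replaces each category with an integer code
ordinal = toy.assign(Store=toy["Store"].cat.codes)

# Approach 3: one-hot encoding creates one indicator column per category
one_hot = pd.get_dummies(toy, columns=["Store"])

print(list(dropped.columns))
print(list(ordinal["Store"]))
print(list(one_hot.columns))
```

Note that ordinal encoding implies an ordering (store 1 < store 2 < store 3) that may not be meaningful, which is why one-hot encoding is used for `Store` in this notebook.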

### Take-away: Categorical Data

*The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!*

## 2. Pipelines

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized.  Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
1. **Cleaner Code:** Accounting for data at each step of preprocessing can get messy.  With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. **Fewer Bugs:** There are fewer opportunities to misapply a step or forget a preprocessing step.
3. **Easier to Productionize:** It can be surprisingly hard to transition a model from a prototype to something deployable at scale.  We won't go into the many related concerns here, but pipelines can help.
4. **More Options for Model Validation:** You will see an example in the next section, which covers cross-validation.


We construct the full pipeline in three steps.

### Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the `ColumnTransformer` class to bundle together different preprocessing steps.  The preprocessing code from Part III:
- imputes missing values and applies scaling to **_numerical_** data  
- imputes missing values and applies a one-hot encoding to **_categorical_** data.

### Step 2: Define the Model

Next, we define a random forest model with the familiar [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) class.

### Step 3: Create and Evaluate the Pipeline

Finally, we use the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class to define a pipeline that bundles the preprocessing and modeling steps.  There are a few important things to notice:
- With the pipeline, we preprocess the training data and fit the model in a single line of code.  (_In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps.  This becomes especially messy if we have to deal with both numerical and categorical variables!_)
- With the pipeline, we supply the unprocessed features in `X_valid` to the `predict()` command, and the pipeline automatically preprocesses the features before generating predictions.  (_However, without a pipeline, we have to remember to preprocess the validation data before making predictions._)
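
The three steps recapped above can be sketched as follows. This is an illustrative example on synthetic data, not the Part III code itself: the column names, the imputation strategies, and the small `n_estimators` value are assumptions chosen to keep the example short and fast.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the Walmart dataset)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Store": rng.integers(1, 4, size=n),          # three store ids: 1, 2, 3
    "Temperature": rng.normal(60, 10, size=n),
})
X_demo.loc[::25, "Temperature"] = np.nan          # a few missing values to impute
y_demo = rng.normal(1_000_000, 50_000, size=n)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 1: define preprocessing for numerical and categorical columns
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, ["Temperature"]),
    ("cat", categorical_transformer, ["Store"]),
])

# Step 2: define the model
model = RandomForestRegressor(n_estimators=20, random_state=0)

# Step 3: bundle both into one pipeline, then fit and predict on raw features
demo_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
demo_pipeline.fit(X_train, y_train)
preds = demo_pipeline.predict(X_valid)   # X_valid is preprocessed automatically

print(preds.shape)
```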

## Take-away: Pipelines

*Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.*

---
# Fast Moving Consumer Goods Sales Forecast - Part IV

In this section, you will learn how to use **cross-validation** for better measures of model performance. Using a pipeline will make the code remarkably straightforward.

# 3. Cross Validation

Machine learning is an iterative process. 

You will face choices about what predictive variables to use, what types of models to use, what arguments to supply to those models, etc. So far, you have made these choices in a data-driven way by measuring model quality with a validation (or holdout) set.  

But there are some drawbacks to this approach.  To see this, imagine you have a dataset with 5000 rows.  You will typically keep about 20% of the data as a validation dataset, or 1000 rows.  But this leaves some random chance in determining model scores.  That is, a model might do well on one set of 1000 rows, even if it would be inaccurate on a different 1000 rows.  

At an extreme, you could imagine having only 1 row of data in the validation set. If you compare alternative models, which one makes the best predictions on a single data point will be mostly a matter of luck!

In general, the larger the validation set, the less randomness (aka "noise") there is in our measure of model quality, and the more reliable it will be.  Unfortunately, we can only get a large validation set by removing rows from our training data, and smaller training datasets mean worse models!
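
To make this randomness concrete, the sketch below scores the same kind of model on five different random 80/20 splits of one small synthetic dataset. The data, the linear model, and the seeds are all assumptions made for illustration; the point is only that the holdout MAE changes from split to split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship (NOT the Walmart dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=100)

holdout_maes = []
for seed in range(5):
    # Each seed produces a different random 80/20 split of the same data
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    fitted = LinearRegression().fit(X_tr, y_tr)
    holdout_maes.append(mean_absolute_error(y_va, fitted.predict(X_va)))

print(np.round(holdout_maes, 2))
```

The spread between the smallest and largest value is the "noise" that a single validation set leaves in the measure of model quality.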

## What is cross-validation?

In **cross-validation**, we run our modeling process on different subsets of the data to get multiple measures of model quality. 

For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset.  In this case, we say that we have broken the data into 5 "**folds**".  

![tut5_crossval](https://i.imgur.com/9k60cVA.png)

Then, we run one experiment for each fold:
- In **Experiment 1**, we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.  
- In **Experiment 2**, we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.
- We repeat this process, using every fold once as the holdout set.  Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don't use all rows simultaneously).
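
The experiments described above can be sketched with scikit-learn's `KFold` class, which produces the row indices for each fold. The ten-row array is a hypothetical placeholder for a real dataset; the loop only checks which rows are held out in each experiment.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10
rows = np.arange(n_rows)                      # placeholder for the dataset's rows
held_out_counts = np.zeros(n_rows, dtype=int)

kfold = KFold(n_splits=5)                     # 5 folds, each 20% of the rows
for i, (train_idx, valid_idx) in enumerate(kfold.split(rows), start=1):
    held_out_counts[valid_idx] += 1
    print(f"Experiment {i}: holdout rows {valid_idx.tolist()}, "
          f"{len(train_idx)} training rows")

# Every row is used as holdout exactly once
print(held_out_counts.tolist())
```

By default `KFold` does not shuffle, so each fold is a block of consecutive rows in the order the data is stored.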

## When should you use cross-validation?

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions.  However, it can take longer to run, because it fits multiple models (one for each fold).  

So, given these tradeoffs, when should you use each approach?
- _For small datasets_, where extra computational burden isn't a big deal, you should run cross-validation.
- _For larger datasets_, a single validation set is sufficient.  Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

There's no simple threshold for what constitutes a large vs. small dataset.  But if your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation.  

Alternatively, you can run cross-validation and see if the scores for each experiment seem close.  If each experiment yields similar results, a single validation set is probably sufficient.

## Example

As in the previous sessions, we continue working with data about [weekly retail sales at Walmart stores](https://www.kaggle.com/datasets/rutuspatel/walmart-dataset-retail). 

The feature columns are already loaded in the DataFrame `X` and the target variable to predict in the series `y` (see the top of the notebook). We now define the preprocessing steps.

In [6]:
#from sklearn.pipeline import Pipeline # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.impute import SimpleImputer # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.preprocessing import StandardScaler # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.preprocessing import OneHotEncoder # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.compose import ColumnTransformer # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Preprocessing for numerical data
#numerical_transformer = Pipeline(steps=[
#    ('imputer', SimpleImputer(strategy='constant')),
#    ('scaler', StandardScaler())
#])

# Preprocessing for categorical data
# Raise an error if validation data contains classes that aren't represented in the training data
categorical_transformer = Pipeline(steps=[
#    ('imputer', SimpleImputer(strategy='most_frequent')),
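    # Note: `sparse_output` requires scikit-learn >= 1.2; older versions name this argument `sparse`
    ('onehot', OneHotEncoder(handle_unknown='error', sparse_output=False))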
])
    # The Pipeline() function is like a railway track with a list of different stations (steps)
    # Each step is a tuple declaring the name of the step and then the function to apply

categorical_transformer

In [7]:
# Bundle preprocessing for numerical and categorical data
FMCG_preprocessor = ColumnTransformer(
    transformers=[
#        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ], remainder='passthrough') 
    # The ColumnTransformer() function is like a railway switch: it tells what to do with the specified train wagons (data columns).
    # The transformers list gives the different branches where columns can go.
    # Each transformer is a tuple declaring the name of the transformer, the transformer to apply (e.g. the Pipeline defined above), and which columns need to be transformed.
    # By default the ColumnTransformer() drops every column which is not explicitly specified in the list of transformers.
    # With the parameter remainder='passthrough', the columns that you do not mention will not be dropped (and also will not be transformed).

FMCG_preprocessor
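
The effect of `remainder='passthrough'` can be checked by counting output columns. The sketch below uses a hypothetical four-row DataFrame with column names borrowed from the Walmart data; the helper `n_output_columns` is introduced only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 3 distinct stores plus two numerical columns
toy = pd.DataFrame({
    "Store": [1, 2, 3, 1],
    "Temperature": [42.3, 38.5, 39.9, 46.6],
    "Holiday_Flag": [0, 1, 0, 0],
})

def n_output_columns(remainder):
    """Fit a ColumnTransformer on `toy` and return how many columns come out."""
    ct = ColumnTransformer(
        transformers=[("cat", OneHotEncoder(), ["Store"])],
        remainder=remainder,
    )
    return ct.fit_transform(toy).shape[1]

n_drop = n_output_columns("drop")            # only the one-hot columns survive
n_pass = n_output_columns("passthrough")     # one-hot columns + untouched columns

print(n_drop, n_pass)
```

With the default (`'drop'`), only the three one-hot columns for `Store` remain; with `'passthrough'`, `Temperature` and `Holiday_Flag` are appended unchanged, giving five columns.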

Then, we define a pipeline that uses our preprocessor and a random forest model with the familiar [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) class to make predictions. 

While it's _possible_ to do cross-validation without pipelines, it is quite difficult!  Using a pipeline will make the code remarkably straightforward.

In [8]:
#from sklearn.ensemble import RandomForestRegressor  # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.pipeline import Pipeline  # --> DONE UPFRONT (see top section "Setting Up the Workspace")
#from sklearn.impute import SimpleImputer  # --> DONE UPFRONT (see top section "Setting Up the Workspace")

FMCG_model = RandomForestRegressor(n_estimators=100, random_state=0)

FMCG_pipeline = Pipeline(steps=[('preprocessor', FMCG_preprocessor),
                              ('model', FMCG_model)
                             ])
                            # Here the Pipeline() function is again like a railway track, with a higher-level list of different stations (steps)
                            # Each step is a tuple declaring the name of the step and then the function to apply
FMCG_pipeline

We obtain the cross-validation scores with the [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from scikit-learn.  We set the number of folds with the `cv` parameter.

In [9]:
#from sklearn.model_selection import cross_val_score  # --> DONE UPFRONT (see top section "Setting Up the Workspace")

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(FMCG_pipeline, X.drop('Date',axis=1), y,
                              cv=5, verbose = 10,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] START .....................................................................
[CV] END .......................... score: (test=-190751.327) total time=   3.1s
[CV] START .....................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.1s remaining:    0.0s


[CV] END .......................... score: (test=-147265.532) total time=   2.9s
[CV] START .....................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.2s remaining:    0.0s


[CV] END ........................... score: (test=-59617.094) total time=   3.2s
[CV] START .....................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    9.5s remaining:    0.0s


[CV] END .......................... score: (test=-117766.673) total time=   3.1s
[CV] START .....................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   12.7s remaining:    0.0s


[CV] END ........................... score: (test=-90743.751) total time=   3.0s
MAE scores:
 [190751.32742448 147265.53192168  59617.09441041 117766.67272898
  90743.75075245]


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   15.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   15.8s finished


The `scoring` parameter chooses a measure of model quality to report: in this case, we chose negative mean absolute error (MAE).  The docs for scikit-learn show a [list of options](http://scikit-learn.org/stable/modules/model_evaluation.html).  

It is a little surprising that we specify *negative* MAE. Scikit-learn's scoring API follows a convention where all scorers are defined so that a higher number is better.  Using negatives here allows error metrics to be consistent with that convention, though negative MAE is almost unheard of elsewhere. 
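
The sign convention is easy to verify on a small example. The sketch below runs `cross_val_score` with `scoring='neg_mean_absolute_error'` on synthetic data with a linear model (both are assumptions for illustration, not part of the Walmart example) and flips the sign to recover ordinary MAE values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (NOT the Walmart dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)

# Raw scores follow the "higher is better" convention, so they are <= 0
raw_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=5, scoring="neg_mean_absolute_error")

# Flip the sign to get ordinary (non-negative) MAE values, one per fold
mae_scores = -raw_scores

print(raw_scores)
print(mae_scores)
```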

We typically want a single measure of model quality to compare alternative models.  So we take the average across experiments.

In [10]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
121228.87544759906
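
Besides the mean, it can be useful to look at how much the scores differ between experiments. The sketch below copies the five MAE values printed by cell [9] into an array and summarizes their spread; the choice of summary statistics (standard deviation and worst/best ratio) is just one reasonable option.

```python
import numpy as np

# MAE per fold, copied from the output of cell [9] above
fold_mae = np.array([190751.32742448, 147265.53192168, 59617.09441041,
                     117766.67272898, 90743.75075245])

mean_mae = fold_mae.mean()
std_mae = fold_mae.std()
worst_to_best = fold_mae.max() / fold_mae.min()

print(f"Mean MAE:          {mean_mae:,.0f}")
print(f"Std. dev. of MAE:  {std_mae:,.0f}")
print(f"Worst / best fold: {worst_to_best:.2f}")
```

Here the worst fold's MAE is roughly three times the best fold's, so the experiments are far from yielding similar results. A single validation set could have given a misleading picture for this dataset, which supports using cross-validation.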


## Take-away: Cross Validation

*Using cross-validation yields a much better measure of model quality, with the added benefit of cleaning up our code: note that we no longer need to keep track of separate training and validation sets.  So, especially for small datasets, it's a good improvement!*
