# Advanced Models and Hyperparameter Tuning

We tested several more advanced models and compared them against simple baselines.

The best-performing model was a gradient boosted tree model
from the LightGBM library.

We use the root mean square percentage error (RMSPE) as the evaluation metric; lower is better.

![](../assets/rmspe.png)
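
For reference, the standard definition of RMSPE for true values $y_i$ and predictions $\hat{y}_i$ over $n$ observations is written out below. The scores in this notebook appear to be reported in percent, i.e. multiplied by 100. That scaling, and how the project's `metric` function treats days with zero sales, are assumptions, since its implementation is not shown here.

$$
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^{2}}
$$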

| Model              | RMSPE |
|--------------------|-------|
| Naive model (mean) | 62.03 |
| Linear Regression  | 22.81 |
| Random Forest      | 17.07 |
| LightGBM           | 12.38 |

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.dpi']=150

# Load scripts from parent path
import sys, os
sys.path.insert(0, os.path.abspath('..'))

# Ignore some future warnings triggered when training
import warnings
warnings.filterwarnings(action="ignore", category=FutureWarning)

## Load Data

In [2]:
import scripts.processing as scr
train_raw = scr.load_train_data()
train = scr.add_week_month_info(train_raw)
train = scr.add_beginning_end_month(train)
train = scr.process_data(train)
train = scr.add_store_info(train)

train.head()

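|   | Store | DayOfWeek | Sales | Promo | StateHoliday | SchoolHoliday | week | month | end_of_month | beginning_of_month | StoreType | Assortment | CompetitionDistance | Store_Sales_mean | Store_Customers_mean |
|---|-------|-----------|-------|-------|--------------|---------------|------|-------|--------------|--------------------|-----------|------------|---------------------|------------------|----------------------|
| 0 | 353.0 | 2.0 | 3139.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | 1e-06 | 0.877078 | b | b | 900.0 | 4139.474576 | 1153.783333 |
| 1 | 335.0 | 2.0 | 2401.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | 1e-06 | 0.877078 | b | a | 90.0 | 12845.896552 | 2384.271186 |
| 2 | 512.0 | 2.0 | 2646.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | 1e-06 | 0.877078 | b | b | 590.0 | 3725.649123 | 888.627119 |
| 3 | 494.0 | 2.0 | 3113.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | 1e-06 | 0.877078 | b | a | 1260.0 | 7079.15 | 1010.583333 |
| 4 | 530.0 | 2.0 | 2907.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | 1e-06 | 0.877078 | a | c | 18160.0 | 2260.783333 | 333.610169 |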


## Prepare train/test data

In [3]:
X = train.copy(deep=True).drop(columns=["Sales"])
y = train.loc[:, "Sales"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Gradient Boosted Tree Model

A gradient boosted tree model gave the lowest RMSPE of the models we compared (see the table above).

This section shows the LightGBM configuration we arrived at.

### Feature Encoding

In [4]:
import category_encoders as ce
from sklearn.preprocessing import StandardScaler

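# Target-encode the store-level categorical features
target_encode = ce.TargetEncoder(cols=['Store', 'StoreType', 'Assortment'])
# Map each StateHoliday value to an integer code
ordinal_encode = ce.OrdinalEncoder(cols=['StateHoliday'])
# Used only in the grid-search pipeline further down
scaler = StandardScaler()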
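
To make the idea behind target encoding concrete, here is a minimal sketch in plain Python. It replaces each category with the mean of the target observed for that category and falls back to the global mean for unseen categories. This is a simplification: `category_encoders.TargetEncoder` additionally smooths each category mean toward the global mean. The helper names and the toy data are hypothetical and are not part of the project.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Return the mean target value observed for each category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


def apply_target_means(categories, means, default):
    """Replace each category by its learned mean; unseen categories get `default`."""
    return [means.get(category, default) for category in categories]


# Toy data (hypothetical): store types and their sales
store_type = ["a", "b", "a", "c", "b"]
sales = [2000.0, 3000.0, 4000.0, 5000.0, 1000.0]

means = fit_target_means(store_type, sales)
global_mean = sum(sales) / len(sales)

# "d" was never seen during fitting, so it falls back to the global mean
encoded = apply_target_means(["a", "b", "d"], means, default=global_mean)
print(means)
print(encoded)
```

Because the encoding is learned from the target, it should be fitted on training data only. Placing the encoder inside a scikit-learn `Pipeline`, as done in this notebook, takes care of that during cross-validation.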

### Create model, train and evaluate

We chain the encoders and the model into a single scikit-learn `Pipeline`, fit it on the training split, and score it on the held-out test split.

In [5]:
from lightgbm import LGBMRegressor

from sklearn.pipeline import Pipeline
from scripts.processing import metric

# Define the model
reg = LGBMRegressor(
    n_estimators=500,
    max_depth=25,
    num_leaves=80,
)

# Build the pipeline
pipe_lgbm = Pipeline(steps=[
    ('ordinal_encode', ordinal_encode),
    ('target_encode', target_encode),
    ('model', reg),
])

pipe_lgbm.fit(X_train, y_train)

y_pred_train = pipe_lgbm.predict(X_train)
y_pred = pipe_lgbm.predict(X_test)
metric(y_test.values, y_pred)


11.478737148311913

## Hyperparameter Optimization

We also performed hyperparameter optimization to fine-tune the model.

To select the best hyperparameters for our pipeline, we ran a grid search over several parameters.
The model hyperparameters we explored were `n_estimators`, `max_depth`, and `num_leaves`; we also varied the set of features to be target encoded.

**Example parameter combinations that were tested**

In [6]:
parameters = {
   'model__n_estimators': [100, 500, 900, 5000],
   'model__max_depth': [15, 30, 60, 100],
   'model__num_leaves': [30, 60, 120, 300],
   'target_encode__cols':
   [
      ['Store','StoreType','Assortment','StateHoliday','week','month','beginning_of_month','end_of_month'],
      ['Store','StoreType','Assortment','StateHoliday','week','month'],
      ['Store','StoreType','Assortment']
   ]
}
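
The cost of a grid search grows with the product of the option counts. The sketch below enumerates the grid from the cell above with `itertools.product` and counts the model fits needed under the 5-fold cross-validation used in the search. The names `candidates`, `n_candidates`, `n_fits`, and `cv_folds` are introduced here for illustration only.

```python
from itertools import product

# Same grid as in the cell above
parameters = {
    'model__n_estimators': [100, 500, 900, 5000],
    'model__max_depth': [15, 30, 60, 100],
    'model__num_leaves': [30, 60, 120, 300],
    'target_encode__cols': [
        ['Store', 'StoreType', 'Assortment', 'StateHoliday', 'week', 'month',
         'beginning_of_month', 'end_of_month'],
        ['Store', 'StoreType', 'Assortment', 'StateHoliday', 'week', 'month'],
        ['Store', 'StoreType', 'Assortment'],
    ],
}
cv_folds = 5  # mirrors cv=5 in the grid search

# One dict per parameter combination
candidates = [
    dict(zip(parameters, combo)) for combo in product(*parameters.values())
]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With 4 x 4 x 4 x 3 options this gives 192 candidates and 960 fits, so adding even one more value to any list noticeably increases the run time.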

**Final best parameter combination**

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from lightgbm import LGBMRegressor
model = LGBMRegressor()


pipe = Pipeline(steps=[ 
                ('target_encode', target_encode),
                ('scaler',scaler),
                ('model',model)])


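# No `scoring` argument is passed, so candidates are ranked by the pipeline's
# default score (R^2 for a regressor), not by RMSPE.
regLGBMGridSearch = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1, error_score='raise')
grid_search = regLGBMGridSearch.fit(X_train, y_train)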

Fitting 5 folds for each of 192 candidates, totalling 960 fits


In [8]:
print("Best parameter (CV score=%0.3f):" % grid_search.best_score_)
grid_search.best_estimator_

Best parameter (CV score=0.944):


Pipeline(steps=[('target_encode',
                 TargetEncoder(cols=['Store', 'StoreType', 'Assortment'])),
                ('scaler', StandardScaler()),
                ('model',
                 LGBMRegressor(max_depth=15, n_estimators=5000,
                               num_leaves=60))])
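
The CV score printed above is on scikit-learn's default scale for regressors, the coefficient of determination (R^2), because the search was run without a `scoring` argument. To rank candidates by RMSPE instead, one option is to wrap the metric with `sklearn.metrics.make_scorer`. The sketch below assumes an `rmspe` function that reports percent and skips zero targets. It stands in for the project's `metric`, whose implementation is not shown here, and the demo data is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def rmspe(y_true, y_pred):
    """RMSPE in percent; rows with a zero target are skipped (assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    relative_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return 100.0 * np.sqrt(np.mean(relative_error ** 2))


# Lower RMSPE is better, so scikit-learn negates the value internally
rmspe_scorer = make_scorer(rmspe, greater_is_better=False)

# Tiny demo: a constant prediction of 110 against true values of 100
X_demo = np.zeros((2, 1))
y_demo = np.array([100.0, 100.0])
dummy = DummyRegressor(strategy="constant", constant=110.0).fit(X_demo, y_demo)
score = rmspe_scorer(dummy, X_demo, y_demo)
print(score)  # negated RMSPE

# Possible use with the search in this notebook:
# GridSearchCV(pipe, parameters, cv=5, scoring=rmspe_scorer, n_jobs=-1)
```

With such a scorer, `best_score_` would be the negated RMSPE, so values closer to zero are better.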

### Evaluating the model

In [9]:
from scripts.processing import metric
y_pred = grid_search.best_estimator_.predict(X_test)
error_regLGBM = metric(y_test.values, y_pred)
print(f"RMSPE metric on test set: {error_regLGBM:.2f}")

RMSPE metric on test set: 10.67


### Final Training

We refit the best pipeline on the whole dataset (the train and test splits combined).

In [10]:
best_model = grid_search.best_estimator_
best_model_wholeDataRefitted = best_model.fit(X,y)

### Save Model

In [11]:
from scripts.pipeline import save_pipeline
pipe = save_pipeline(pipeline=best_model_wholeDataRefitted, name='LGBM_hyperparam_optim_5')

 - Saving pipeline "LGBM_hyperparam_optim_5" at:
../data/trained_pipelines/pipeline_LGBM_hyperparam_optim_5.p
