# Simple Models

In this notebook, we explore some simple baseline models for sales prediction:
 - Predicting the mean
 - Linear Regression
 - Random Forest Regressor

## Results

We use the root mean square percentage error (RMSPE) as the metric; lower is better.

![](../assets/rmspe.png)

| Model              | RMSPE |
|--------------------|-------|
| Naive model (mean) | 62.03 |
| Linear Regression  | 22.81 |
| Random Forest      | 17.07 |
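
For reference, the metric can be sketched in a few lines of NumPy. This assumes the common definition of RMSPE, expressed in percent and skipping rows whose true value is zero (where the percentage error is undefined). `rmspe_sketch` is a hypothetical name, and the project's own `metric` helper in `scripts.processing` may differ in detail.

```python
import numpy as np


def rmspe_sketch(y_true, y_pred):
    """Root mean square percentage error, in percent (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: rows with zero true sales are skipped to avoid division by zero
    mask = y_true != 0
    pct_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return 100.0 * np.sqrt(np.mean(pct_error ** 2))


# Two valid rows with a 10% error each; the zero-sales row is ignored
example_rmspe = rmspe_sketch([100.0, 200.0, 0.0], [110.0, 180.0, 50.0])
```

Because the error is divided by the true value, the metric is not symmetric in its arguments, so the order of `y_true` and `y_pred` matters.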

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# Load scripts from parent path
import sys, os
sys.path.insert(0, os.path.abspath('..'))

## Data Processing

### Load Data

Load the data and add features (see the detailed explanation in notebook `0_summary`).

In [2]:
from scripts.processing import load_train_data, process_data, add_store_info, add_week_month_info

train_raw = load_train_data()
train = add_week_month_info(train_raw)
train = process_data(train)
train = add_store_info(train)

train.head()



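|   | Store | DayOfWeek | Sales | Promo | StateHoliday | SchoolHoliday | week | month | StoreType | Assortment | CompetitionDistance | Store_Sales_mean | Store_Customers_mean |
|---|-------|-----------|-------|-------|--------------|---------------|------|-------|-----------|------------|---------------------|------------------|----------------------|
| 0 | 353.0 | 2.0 | 3139.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | b | 900.0 | 4139.474576 | 1153.783333 |
| 1 | 335.0 | 2.0 | 2401.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | a | 90.0 | 12845.896552 | 2384.271186 |
| 2 | 512.0 | 2.0 | 2646.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | b | 590.0 | 3725.649123 | 888.627119 |
| 3 | 494.0 | 2.0 | 3113.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | a | 1260.0 | 7079.15 | 1010.583333 |
| 4 | 530.0 | 2.0 | 2907.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | a | c | 18160.0 | 2260.783333 | 333.610169 |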


### Prepare train/test data

In [3]:
X = train.copy(deep=True).drop(columns=["Sales"])
y = train.loc[:, "Sales"]

# Mark categorical data
# X.loc[:, ['StoreType', 'Assortment']] = X.loc[:, ['StoreType', 'Assortment']].astype("category")

# Make train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
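
The split above draws a different random partition on every run, so the scores reported in this notebook can vary between runs. A minimal sketch of a reproducible split, on made-up toy data and with an arbitrarily chosen seed (`random_state=42` is an assumption, not the setting used for the reported results):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real feature matrix and target (values are made up)
X_toy = pd.DataFrame({"Promo": np.arange(10) % 2, "week": np.arange(10) + 1})
y_toy = pd.Series(np.arange(10) * 100.0, name="Sales")

# Fixing random_state makes the partition identical across runs
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
```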

In [4]:
X.head()

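|   | Store | DayOfWeek | Promo | StateHoliday | SchoolHoliday | week | month | StoreType | Assortment | CompetitionDistance | Store_Sales_mean | Store_Customers_mean |
|---|-------|-----------|-------|--------------|---------------|------|-------|-----------|------------|---------------------|------------------|----------------------|
| 0 | 353.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | b | 900.0 | 4139.474576 | 1153.783333 |
| 1 | 335.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | a | 90.0 | 12845.896552 | 2384.271186 |
| 2 | 512.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | b | 590.0 | 3725.649123 | 888.627119 |
| 3 | 494.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | b | a | 1260.0 | 7079.15 | 1010.583333 |
| 4 | 530.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1 | 1 | a | c | 18160.0 | 2260.783333 | 333.610169 |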


In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440048 entries, 0 to 440047
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Store                 440048 non-null  float64
 1   DayOfWeek             440048 non-null  float64
 2   Promo                 440048 non-null  float64
 3   StateHoliday          440048 non-null  float64
 4   SchoolHoliday         440048 non-null  float64
 5   week                  440048 non-null  int64  
 6   month                 440048 non-null  int64  
 7   StoreType             440048 non-null  object 
 8   Assortment            440048 non-null  object 
 9   CompetitionDistance   440048 non-null  float64
 10  Store_Sales_mean      440048 non-null  float64
 11  Store_Customers_mean  440048 non-null  float64
dtypes: float64(8), int64(2), object(2)
memory usage: 43.6+ MB


## Simple Models

In [6]:
from scripts.processing import metric

### Naive mean predictor

In [7]:
mean_predictor = pd.DataFrame(y_train.copy())

mean_predictor.loc[:, 'y_pred'] = mean_predictor.mean()['Sales']
mean_predictor.head()

|        | Sales   | y_pred      |
|--------|---------|-------------|
| 32402  | 5719.0  | 6836.900917 |
| 157830 | 2365.0  | 6836.900917 |
| 320221 | 5975.0  | 6836.900917 |
| 404627 | 3791.0  | 6836.900917 |
| 376992 | 11611.0 | 6836.900917 |


In [8]:
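# Note: this scores the baseline on the training targets and passes the prediction first,
# whereas the later cells call metric(y_test.values, y_pred) on the held-out split
rmspe_mean_predictor = metric(mean_predictor.loc[:, 'y_pred'].values, mean_predictor.loc[:, 'Sales'].values)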
print(f"RMSPE for mean predictor:\n{rmspe_mean_predictor:.2f}")


RMSPE for mean predictor:
62.03
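
The cell above scores the mean predictor on the training targets themselves. As an alternative sketch (not the method used for the reported 62.03), scikit-learn's `DummyRegressor` can serve as the same baseline and be scored on held-out data. The toy numbers below are made up.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up toy targets; the features are ignored by the dummy model
X_train_toy = np.zeros((4, 1))
y_train_toy = np.array([100.0, 200.0, 300.0, 400.0])
X_test_toy = np.zeros((2, 1))

# strategy="mean" always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)
```

The resulting predictions could then be passed to the same metric as the other models, so that all three scores come from the same held-out rows.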


### Linear Regression

In [9]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline


target_encoder = TargetEncoder(cols=['StoreType', 'Assortment', 'Store'])
scaler = StandardScaler()

reg = LinearRegression()
# reg = Ridge(alpha=40000)

pipe = Pipeline(steps=[
    ('target_encode', target_encoder),
    ('scaler', scaler),
    ('model', reg),
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)

rmspe = metric(y_test.values, y_pred)
print(f"\nRMSPE for Linear regression:\n{rmspe:.2f}")

RMSPE for Linear regression:
22.81
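
To illustrate what the target-encoding step does, the sketch below replaces each category with the mean target value observed for that category, using plain pandas on made-up data. This is a simplification: `category_encoders.TargetEncoder` additionally blends each category mean with the global mean (smoothing), so its output will generally differ from these raw means.

```python
import pandas as pd

# Made-up toy data: one categorical feature and the target
toy = pd.DataFrame({
    "StoreType": ["a", "a", "b", "b", "c"],
    "Sales": [100.0, 200.0, 300.0, 500.0, 50.0],
})

# Mean target per category, learned from the training rows only
category_means = toy.groupby("StoreType")["Sales"].mean()

# Replace each category label with its mean target value
toy["StoreType_encoded"] = toy["StoreType"].map(category_means)
```

Fitting the encoder inside the pipeline, as done above, means the category means are learned from `X_train` only and then applied to `X_test`.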


### Random Forest

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder
from sklearn.preprocessing import StandardScaler

target_encoder = TargetEncoder(cols=['StoreType', 'Assortment', 'Store'])
scaler = StandardScaler()

rf = RandomForestRegressor(max_depth=10, min_samples_split=100)

pipe = Pipeline(steps=[
    ('target_encode', target_encoder),
    ('scaler', scaler),
    ('model', rf),
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
metric(y_test.values, y_pred)

17.073040674742312
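
One way to see which features drive a fitted forest is its `feature_importances_` attribute. A minimal sketch on synthetic data (the feature names are borrowed from this notebook, the values are random, and `n_estimators=20` is chosen only to keep the example fast):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Promo": rng.integers(0, 2, size=200),
    "DayOfWeek": rng.integers(1, 8, size=200),
})
# Synthetic target that depends on Promo only, plus a little noise
y_toy = 1000.0 * X_toy["Promo"] + rng.normal(0.0, 10.0, size=200)

rf_toy = RandomForestRegressor(n_estimators=20, max_depth=10, random_state=0)
rf_toy.fit(X_toy, y_toy)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
```

In the pipeline above, the fitted forest would be reachable as `pipe.named_steps['model']`.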

In [11]:
from scripts.pipeline import save_pipeline, load_pipeline

# save_pipeline(pipe, name='random_forest_2')
pipe = load_pipeline(name='random_forest_2')


 - Loading pipeline "random_forest_2" at:
../data/trained_pipelines/pipeline_random_forest_2.p
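
The `.p` extension in the path above suggests that the pipeline is stored with `pickle`, but the implementation of `save_pipeline` and `load_pipeline` is not shown here. A minimal sketch of such a round trip, with hypothetical helper names and a temporary directory instead of `../data/trained_pipelines/`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression


def save_pipeline_sketch(model, path):
    """Hypothetical helper: serialize a fitted model to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_pipeline_sketch(path):
    """Hypothetical helper: load a pickled model (only from trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Tiny made-up regression problem standing in for the real pipeline
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_example.p")
    save_pipeline_sketch(model, path)
    restored = load_pipeline_sketch(path)

restored_pred = restored.predict(X_toy)
```

Pickled scikit-learn objects are generally only safe to load with the same library versions that produced them, so recording those versions alongside the file is a reasonable precaution.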
