<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S8_9_retail_analytics/DT_S8_Module1A_Retail_Demand_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

As in Week 7, we begin by importing pandas, NumPy, and the scikit-learn (sklearn) modules we need.

In [0]:
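import numpy
import pandas
import sklearn
# Import the submodules we use explicitly instead of relying on a wildcard import
from sklearn import linear_model, metrics, model_selection, tree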

# Block 1: Data input

In addition to the original data, we add a new variable, which is the squared price ('PRICE_p2').

In [2]:
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S8_9_retail_analytics/salesCereals.csv'

salesCereals = pandas.read_csv(url)
salesCereals['PRICE_p2'] = salesCereals['PRICE']**2
salesCereals.head()

Unnamed: 0.1,Unnamed: 0,WEEK_END_DATE,STORE_NUM,UPC,UNITS,VISITS,HHS,SPEND,PRICE,BASE_PRICE,FEATURE,DISPLAY,TPR_ONLY,Desc,Category,Sub-Category,SUMPRICE,COUNTPRICE,AVGPRICE,RELPRICE,PRICE_p2
0,6,2009-01-14,367.0,1111085319,14.0,13.0,13.0,26.32,1.88,1.88,0.0,0.0,0.0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL,19.54,7,2.791429,0.67349,3.5344
1,8,2009-01-14,367.0,1111085350,35.0,27.0,25.0,69.3,1.98,1.98,0.0,0.0,0.0,PL BT SZ FRSTD SHRD WHT,COLD CEREAL,ALL FAMILY CEREAL,19.54,7,2.791429,0.709314,3.9204
2,12,2009-01-14,367.0,1600027527,12.0,10.0,10.0,38.28,3.19,3.19,0.0,0.0,0.0,GM HONEY NUT CHEERIOS,COLD CEREAL,ALL FAMILY CEREAL,19.54,7,2.791429,1.142784,10.1761
3,13,2009-01-14,367.0,1600027528,31.0,26.0,19.0,142.29,4.59,4.59,0.0,0.0,0.0,GM CHEERIOS,COLD CEREAL,ALL FAMILY CEREAL,19.54,7,2.791429,1.644319,21.0681
4,14,2009-01-14,367.0,1600027564,56.0,48.0,42.0,152.32,2.72,3.07,1.0,0.0,0.0,GM CHEERIOS,COLD CEREAL,ALL FAMILY CEREAL,19.54,7,2.791429,0.974411,7.3984
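Adding the squared price lets an otherwise linear model capture curvature in the price response. The sketch below illustrates the idea on synthetic data: the demand curve and its coefficients are invented for illustration and are not taken from salesCereals.csv.

```python
import numpy
import pandas
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free demand curve (hypothetical coefficients, illustration only)
demo = pandas.DataFrame({'PRICE': numpy.linspace(1.0, 5.0, 50)})
demo['UNITS'] = 100 - 30 * demo['PRICE'] + 3 * demo['PRICE'] ** 2

# Same transformation as applied to salesCereals
demo['PRICE_p2'] = demo['PRICE'] ** 2

linear_only = LinearRegression().fit(demo[['PRICE']], demo['UNITS'])
with_square = LinearRegression().fit(demo[['PRICE', 'PRICE_p2']], demo['UNITS'])

# R^2 on the fitting data for both specifications
r2_linear = linear_only.score(demo[['PRICE']], demo['UNITS'])
r2_square = with_square.score(demo[['PRICE', 'PRICE_p2']], demo['UNITS'])
print(round(r2_linear, 3), round(r2_square, 3))
```

On this toy curve, the specification with 'PRICE_p2' fits the data exactly while the price-only model cannot. Real sales data is noisy, so the gain there may be much smaller.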


'UPC' stands for Universal Product Code; here, as in SCM terms generally, each UPC can be understood as one SKU. The code below lists the SKUs we want to forecast and the number of data instances available for each. The counts are similar across UPCs (133 to 156), and no UPC has only a few data points. This matters because a model trained on a small dataset may generalize poorly.

In [5]:
print(salesCereals.groupby('UPC').count())

            Unnamed: 0  WEEK_END_DATE  STORE_NUM  ...  AVGPRICE  RELPRICE  PRICE_p2
UPC                                               ...                              
1111085319         156            156        156  ...       156       156       156
1111085350         156            156        156  ...       156       156       156
1600027527         156            156        156  ...       156       156       156
1600027528         156            156        156  ...       156       156       156
1600027564         155            155        155  ...       155       155       155
3000006340         133            133        133  ...       133       133       133
3800031829         155            155        155  ...       155       155       155

[7 rows x 20 columns]
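The same check can be automated rather than read off the printed table. The following is a minimal sketch on toy data; the UPC values and the 'min_rows' threshold are hypothetical and only serve to show the pattern.

```python
import pandas

# Toy data: three hypothetical UPCs with 5, 4 and 1 observations
toy = pandas.DataFrame({'UPC': [111] * 5 + [222] * 4 + [333],
                        'UNITS': range(10)})

min_rows = 3  # hypothetical threshold; choose it to suit the model's complexity
rows_per_upc = toy.groupby('UPC').size()   # one count per UPC
too_small = rows_per_upc[rows_per_upc < min_rows]

print(rows_per_upc)
print('UPCs below threshold:', list(too_small.index))
```

Using 'size()' returns a single count per UPC, whereas 'count()' repeats the count of non-missing values for every column.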


# Block 2: Feature engineering & preparation

We then organize the data by UPC. The model uses only a predetermined subset of the variables, listed in 'feature_list'; you can add or remove explanatory variables based on your own judgment. For each UPC, 25% of the rows are held out as a test set. The last three lines print the data for the last UPC on the product list compiled above.

In [6]:
feature_list = ['PRICE', 'PRICE_p2', 'FEATURE', 'DISPLAY','TPR_ONLY','RELPRICE']

productList = salesCereals['UPC'].unique()
print(productList)

X, X_train, X_test = {}, {}, {}
y, y_train, y_test, y_pred = {}, {}, {}, {}

for upc in productList:
    # Select the rows for this UPC once and reuse the mask
    is_upc = salesCereals['UPC'] == upc
    X[upc] = salesCereals.loc[is_upc, feature_list]
    y[upc] = salesCereals.loc[is_upc, 'UNITS']
    # Split into training (75%) and testing (25%) data
    X_train[upc], X_test[upc], y_train[upc], y_test[upc] = sklearn.model_selection.train_test_split(
        X[upc], y[upc], test_size=0.25, random_state=0)

# After the loop, upc still holds the last UPC on the list
print(upc)
print(X[upc])
print(y[upc])

[1111085319 1111085350 1600027527 1600027528 1600027564 3000006340
 3800031829]
3800031829
      PRICE  PRICE_p2  FEATURE  DISPLAY  TPR_ONLY  RELPRICE
6      3.14    9.8596      0.0      0.0       0.0  1.124872
13     3.14    9.8596      0.0      0.0       0.0  1.095167
20     3.14    9.8596      0.0      0.0       0.0  1.060299
27     3.14    9.8596      0.0      0.0       0.0  1.060811
34     3.14    9.8596      0.0      0.0       0.0  1.377193
...     ...       ...      ...      ...       ...       ...
1042   3.89   15.1321      0.0      0.0       0.0  1.269859
1048   3.89   15.1321      0.0      0.0       0.0  1.225197
1054   3.89   15.1321      0.0      0.0       0.0  1.323129
1060   3.89   15.1321      0.0      0.0       0.0  1.294509
1066   3.89   15.1321      0.0      0.0       0.0  1.300279

[155 rows x 6 columns]
6       14.0
13      17.0
20      23.0
27      25.0
34      23.0
        ... 
1042    10.0
1048    13.0
1054    18.0
1060    29.0
1066    27.0
Name: UNITS, Length: 155, dtype: float64
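To make the effect of 'test_size=0.25' concrete, the sketch below applies the same call to placeholder data with 155 rows, the size of the last UPC's data set. The values are made up; only the row counts matter here.

```python
import numpy
import pandas
from sklearn.model_selection import train_test_split

# Placeholder data with 155 rows (made-up values, not the cereal data)
X_demo = pandas.DataFrame({'PRICE': numpy.linspace(3.0, 4.0, 155)})
y_demo = pandas.Series(numpy.arange(155, dtype=float), name='UNITS')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Number of training and testing rows
print(len(X_tr), len(X_te))
# Rows are shuffled before splitting, but X and y stay aligned by index
print(X_tr.index.equals(y_tr.index))
```

With 155 rows, the split yields 116 training rows and 39 test rows, and fixing 'random_state' makes the shuffle reproducible.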

# Block 3: Model & algorithm (training & testing)

In the next two cells, we train and test two different models: linear regression and decision tree regression. In each cell, we create a **for** loop over the UPCs on the product list. Inside the loop, the first line trains the model on the training set and the second line predicts on the held-out test set. The next three lines compute the performance metrics we want to measure: mean absolute error (MAE), mean absolute percentage error (MAPE, reported as a fraction, so 0.54 means 54%), and root mean squared error (RMSE).

We organize the linear regression results by UPC (rows) and performance metric (columns). The last line of the loop writes the three computed metrics into the row for the current UPC, in the same order as the column names.

In [7]:
#Linear model
regr = {}
regrSummary = pandas.DataFrame(columns=['MAE','MAPE', 'RMSE'], index = productList)

for upc in productList:
    regr[upc] = sklearn.linear_model.LinearRegression().fit(X_train[upc],y_train[upc])
    y_pred[upc] = regr[upc].predict(X_test[upc])
    testMAE = sklearn.metrics.mean_absolute_error(y_test[upc], y_pred[upc])
    testMAPE = numpy.mean(numpy.abs((y_test[upc] - y_pred[upc]) / y_test[upc]))
    testRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test[upc], y_pred[upc]))
    regrSummary.loc[upc] =  [testMAE, testMAPE, testRMSE]

print('Linear Regression Summary')
print(regrSummary)
print('average MAE:' + str(round(regrSummary['MAE'].mean(),2)))
print('average MAPE:' + str(round(regrSummary['MAPE'].mean(),2)))
print('average RMSE:' + str(round(regrSummary['RMSE'].mean(),2)))


Linear Regression Summary
                MAE      MAPE     RMSE
1111085319  6.56893  0.838554    8.233
1111085350  6.13527  0.748932  7.69617
1600027527  13.5468  0.547776  23.5199
1600027528  8.08809  0.251657  14.0652
1600027564  5.23954   0.27655  6.79671
3000006340  2.88635  0.720298   3.8743
3800031829  6.57475  0.379121  8.51852
average MAE:7.01
average MAPE:0.54
average RMSE:10.39
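As a quick check on what the three metrics measure, here is a small worked example with made-up numbers (three hypothetical actual values and forecasts, not taken from the cereal data). It uses the same formulas as the loop above, written out with NumPy only.

```python
import numpy

y_true = numpy.array([10.0, 20.0, 40.0])  # hypothetical actual units
y_hat = numpy.array([12.0, 18.0, 44.0])   # hypothetical forecasts

errors = y_true - y_hat
mae = numpy.mean(numpy.abs(errors))            # (2 + 2 + 4) / 3
mape = numpy.mean(numpy.abs(errors / y_true))  # (0.2 + 0.1 + 0.1) / 3
rmse = numpy.sqrt(numpy.mean(errors ** 2))     # sqrt((4 + 4 + 16) / 3)

print(round(mae, 4), round(mape, 4), round(rmse, 4))
# → 2.6667 0.1333 2.8284
```

RMSE is at least as large as MAE because squaring gives large errors more weight. MAPE divides by the actual value, so it is undefined for weeks with zero units sold and tends to be large for slow-moving items.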


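Likewise, we obtain the decision tree regression results by simply swapping in the decision tree estimator. Note that this cell reuses the names 'regr' and 'regrSummary', so it overwrites the linear regression models and results from the previous cell.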

In [8]:
#Tree model
regr = {}
regrSummary = pandas.DataFrame(columns=['MAE','MAPE', 'RMSE'], index = productList)

for upc in productList:
    regr[upc] = sklearn.tree.DecisionTreeRegressor(random_state=0).fit(X_train[upc], y_train[upc])
    y_pred[upc] = regr[upc].predict(X_test[upc])

    testMAE = sklearn.metrics.mean_absolute_error(y_test[upc], y_pred[upc])
    testMAPE = numpy.mean(numpy.abs((y_test[upc] - y_pred[upc]) / y_test[upc]))
    testRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test[upc], y_pred[upc]))
    regrSummary.loc[upc] =  [testMAE, testMAPE, testRMSE]

print('Regression Tree Summary')
print(regrSummary)
print('average MAE:' + str(round(regrSummary['MAE'].mean(),2)))
print('average MAPE:' + str(round(regrSummary['MAPE'].mean(),2)))
print('average RMSE:' + str(round(regrSummary['RMSE'].mean(),2)))

Regression Tree Summary
                MAE      MAPE     RMSE
1111085319  7.10256  0.841491  10.7596
1111085350  5.92308  0.625815  7.47131
1600027527  15.5641  0.520292  30.3302
1600027528  8.64103  0.286716   13.221
1600027564  7.89744   0.38854  10.4538
3000006340      4.5  0.992923  7.76209
3800031829  7.48718  0.421318  9.13152
average MAE:8.16
average MAPE:0.58
average RMSE:12.73
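Since the two cells differ only in the estimator, the shared steps can be factored into one function. The sketch below is one possible way to do this, not part of the original notebook: 'evaluate_per_upc' is a hypothetical helper, and the data is a small synthetic example with a single made-up UPC.

```python
import numpy
import pandas
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def evaluate_per_upc(make_model, X_train, y_train, X_test, y_test):
    """Fit one model per UPC and return a table of test-set MAE, MAPE and RMSE."""
    rows = {}
    for upc in X_train:
        model = make_model().fit(X_train[upc], y_train[upc])
        pred = model.predict(X_test[upc])
        rows[upc] = {
            'MAE': mean_absolute_error(y_test[upc], pred),
            'MAPE': numpy.mean(numpy.abs((y_test[upc] - pred) / y_test[upc])),
            'RMSE': numpy.sqrt(mean_squared_error(y_test[upc], pred)),
        }
    return pandas.DataFrame.from_dict(rows, orient='index')


# Synthetic data for one hypothetical UPC (999); not the cereal data
rng = numpy.random.default_rng(0)
price = rng.uniform(1.0, 5.0, size=40)
units = 50 - 8 * price + rng.normal(0, 1, size=40)

X_tr = {999: pandas.DataFrame({'PRICE': price[:30]})}
y_tr = {999: pandas.Series(units[:30])}
X_te = {999: pandas.DataFrame({'PRICE': price[30:]})}
y_te = {999: pandas.Series(units[30:])}

linear_summary = evaluate_per_upc(LinearRegression, X_tr, y_tr, X_te, y_te)
tree_summary = evaluate_per_upc(
    lambda: DecisionTreeRegressor(random_state=0), X_tr, y_tr, X_te, y_te)
print(linear_summary)
print(tree_summary)
```

Returning a separate table per model keeps both sets of results available for comparison instead of overwriting one with the other.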


# Block 4: Model Selection

By comparing the averages, we can see that linear regression outperforms decision tree regression on all three metrics (MAE 7.01 vs. 8.16, MAPE 0.54 vs. 0.58, RMSE 10.39 vs. 12.73). We therefore proceed with linear regression and refit it on the whole dataset by replacing 'X_train' and 'y_train' with 'X' and 'y'. Because the model is now evaluated on the same data it was trained on, the errors reported below are in-sample errors. They are normally lower than test errors and do not tell us how well the model generalizes. To check that, we save the trained models and apply them to the new data introduced in the second part of today's session.

In [10]:
# Best model
regr = {}
regrSummary = pandas.DataFrame(columns=['MAE','MAPE', 'RMSE'], index = productList)

for upc in productList:
    regr[upc] = sklearn.linear_model.LinearRegression().fit(X[upc],y[upc])
    y_pred[upc] = regr[upc].predict(X[upc])
    testMAE = sklearn.metrics.mean_absolute_error(y[upc], y_pred[upc])
    testMAPE = numpy.mean(numpy.abs((y[upc] - y_pred[upc]) / y[upc]))
    testRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y[upc], y_pred[upc]))
    regrSummary.loc[upc] =  [testMAE, testMAPE, testRMSE]

print('Best Model Summary')
print(regrSummary)
print('average MAE:' + str(round(regrSummary['MAE'].mean(),2)))
print('average MAPE:' + str(round(regrSummary['MAPE'].mean(),2)))
print('average RMSE:' + str(round(regrSummary['RMSE'].mean(),2)))

Best Model Summary
                MAE      MAPE     RMSE
1111085319  6.08166  0.643456  7.75673
1111085350  5.80448      0.64  7.31238
1600027527  9.77341  0.611309  16.9264
1600027528  7.05949  0.302731  10.4261
1600027564  5.99011  0.296792  8.30306
3000006340  2.86525  0.699564  4.11992
3800031829  6.16428  0.336052  7.81023
average MAE:6.25
average MAPE:0.5
average RMSE:8.95
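The comparison that led to choosing linear regression can also be made explicit in code. The sketch below copies the average test-set metrics from the two summaries printed in Block 3 into one table and picks the model with the lowest value for each metric.

```python
import pandas

# Average test-set metrics copied from the two summaries printed in Block 3
averages = pandas.DataFrame(
    {'MAE': [7.01, 8.16], 'MAPE': [0.54, 0.58], 'RMSE': [10.39, 12.73]},
    index=['Linear Regression', 'Regression Tree'],
)

# For every error metric, lower is better
best_per_metric = averages.idxmin()
print(averages)
print(best_per_metric)
```

Averages can hide differences between products. For UPC 1111085350, for instance, the regression tree has the lower test error on all three metrics (MAE 5.92 vs. 6.14), so choosing a model per UPC would be a possible refinement.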


In [11]:
# we need to remount Google Drive in order to save into it
import pickle

from google.colab import drive
drive.mount('/content/drive')
cwd = '/content/drive/My Drive/'

Mounted at /content/drive


In [0]:
# save the models to drive
for upc in productList:
    filename = cwd + str(upc) + '_demand_model.sav'
    # save the model to disk; the with-block closes the file after writing
    with open(filename, 'wb') as f:
        pickle.dump(regr[upc], f)
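For reference, a saved model can be read back with 'pickle.load'. The following is a minimal round-trip sketch under stated assumptions: it uses a stand-in model fitted on made-up numbers, a temporary folder in place of the Google Drive path, and a hypothetical UPC (999) in the file name.

```python
import os
import pickle
import tempfile

import numpy
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up numbers (not the cereal data)
X_demo = numpy.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = numpy.array([40.0, 30.0, 20.0, 10.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary folder used here in place of the Google Drive path
    filename = os.path.join(tmp_dir, '999_demand_model.sav')
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    with open(filename, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original forecasts
same_forecast = numpy.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_forecast)
```

Pickled models are best loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from sources you trust.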