<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S8_9_retail_analytics/S8_Module1A_Retail_Demand_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We begin by loading the required packages.

In [None]:
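import pandas
import numpy
import sklearn
# import the submodules used below explicitly instead of relying on a wildcard import
import sklearn.model_selection
import sklearn.linear_model
import sklearn.tree
import sklearn.ensemble
import sklearn.metrics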

## *Supplement - Plot functions (this is a pre-built plot function)*

*This function is used later on for visualization. There is no need to go through it in detail; you only need to run the code.*

In [None]:
import matplotlib.pyplot as plt

#See https://matplotlib.org/devdocs/gallery/subplots_axes_and_figures/subplots_demo.html

def plot_data_scatter(data_x, data_y, X_test, y_pred, feature_list):
    # Plot the results

    n_col_plot = 2 # 2 plots per row
    n_row_plot = (len(feature_list) + 1) // n_col_plot
    # squeeze=False keeps ax two-dimensional even when there is only one row of plots
    fig, ax = plt.subplots(n_row_plot, n_col_plot, figsize=(12, 12), squeeze=False)

    for count, feature in enumerate(feature_list):
        j, i = divmod(count, n_col_plot) # j: row index, i: column index of the plot
        # all observations of this feature against the observed demand
        ax[j, i].scatter(data_x[:, count], data_y, s=20, edgecolor="black",
                    c="darkorange", label="data")
        # test observations of this feature against the predicted demand
        ax[j, i].scatter(X_test.values[:, count], y_pred, s=30, marker="X",
                    c="royalblue", label="prediction")
        ax[j, i].set(title=feature, ylabel='UNITS')

    plt.show()

# Block 1: Data input

In addition to the original data, we add a new variable, which is the squared price ('PRICE_p2').

In [None]:
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S8_9_retail_analytics/salesCereals.csv'

salesCereals = pandas.read_csv(url)
salesCereals['PRICE_p2'] = salesCereals['PRICE']**2
salesCereals.head()

'UPC' stands for Universal Product Code, which can be understood as one SKU in this case and in our SCM terms in general. The code below lists the SKUs for which we want to forecast, together with their data size (number of data instances). We can see that the number of instances is similar across UPCs and that no UPC has only a few data points. This is important because a model trained on a small dataset may generalize poorly.

In [None]:
print(salesCereals.groupby('UPC').count())
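The same check can be made explicit by flagging any UPC that falls below a minimum number of rows. The following is a minimal sketch on made-up data: the `toy` dataframe and the `min_rows` threshold are illustrative assumptions, not part of the original dataset.

```python
import pandas

# Hypothetical toy data standing in for salesCereals (values are made up)
toy = pandas.DataFrame({
    'UPC':   [111, 111, 111, 222, 222, 333],
    'UNITS': [10, 12, 9, 30, 28, 5],
})

# Number of rows per UPC; size() returns a single count per group
rows_per_upc = toy.groupby('UPC').size()

# Flag UPCs with fewer rows than an assumed minimum sample size
min_rows = 2
too_small = rows_per_upc[rows_per_upc < min_rows].index.tolist()

print(rows_per_upc)
print('UPCs with too few rows:', too_small)
```

Unlike `count()`, which reports the non-missing values of every column, `size()` gives one row count per UPC, which is usually enough for this kind of check.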

# Block 2: Feature engineering & preparation

We then organize the data by 'UPC'. The model presented here only uses a predetermined subset of the variables in the data. You can add or remove explanatory variables based on your own judgment.

In [None]:
feature_list = ['PRICE', 'PRICE_p2', 'FEATURE', 'DISPLAY','TPR_ONLY','RELPRICE']

productList = salesCereals['UPC'].unique()
print(productList)

X, X_train, X_test = {}, {}, {}
y, y_train, y_test, y_pred = {}, {}, {}, {}

for upc in productList:
  
  # select rows and columns in a single .loc call
  X[upc] = salesCereals.loc[salesCereals['UPC']==upc, feature_list]
  y[upc] = salesCereals.loc[salesCereals['UPC']==upc, 'UNITS']
  # Split into training and testing data
  X_train[upc], X_test[upc], y_train[upc], y_test[upc] = sklearn.model_selection.train_test_split(X[upc], y[upc], test_size=0.25, random_state=0)
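To see what the split does for a single product, here is a small sketch on synthetic data. The `toy_X` and `toy_y` objects are made up for illustration; only `train_test_split` and its `test_size` and `random_state` arguments come from the cell above.

```python
import numpy
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical data for one product: 20 weekly observations (values are made up)
toy_X = pandas.DataFrame({'PRICE': numpy.linspace(2.0, 4.0, 20)})
toy_y = (100 - 20 * toy_X['PRICE']).rename('UNITS')

# 25% of the rows are held out for testing; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)
print('same test rows on both calls:', X_te.index.equals(X_te2.index))
```

Fixing `random_state` makes the reported test errors reproducible from run to run, and the features and the target stay aligned because both are split with the same row indices.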


# Block 3: Model & algorithm (training & testing)

In the next two cells, we train and test two different types of models, namely Linear Regression and Tree Regression. In each cell, we create a **for** loop over each UPC on the product list. Inside the loop, we first train the model on the training data and compute its training RMSE, then predict on the unseen test data. The next three lines compute the test performance metrics (MAE, MAPE, and RMSE).

We organize the linear regression results in a dataframe indexed by 'UPC' (rows) with one column per performance metric. The last line in each loop writes the computed metrics into the row for that UPC.

In [None]:
#Linear model
regr = {}
regrSummary = pandas.DataFrame(columns=['trainRMSE', 'testRMSE','testMAE','testMAPE'], index = productList)

for upc in productList:
    regr[upc] = sklearn.linear_model.LinearRegression().fit(X_train[upc],y_train[upc])
    trainRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y_train[upc], regr[upc].predict(X_train[upc])))
    y_pred[upc] = regr[upc].predict(X_test[upc])

    testMAE = sklearn.metrics.mean_absolute_error(y_test[upc], y_pred[upc])
    testMAPE = numpy.mean(numpy.abs((y_test[upc] - y_pred[upc]) / y_test[upc]))
    testRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test[upc], y_pred[upc]))
    regrSummary.loc[upc] =  [trainRMSE, testRMSE, testMAE, testMAPE]

print('Linear regression Summary')
print(regrSummary)
print('average training RMSE:' + str(round(regrSummary['trainRMSE'].mean(),2)))
print('average testing RMSE:' + str(round(regrSummary['testRMSE'].mean(),2)))
print('average testing MAE:' + str(round(regrSummary['testMAE'].mean(),2)))
print('average testing MAPE:' + str(round(regrSummary['testMAPE'].mean(),2)))
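The three test metrics used above can be verified by hand on a tiny example. This sketch uses made-up `actual` and `predicted` values and plain numpy, following the same formulas as the cell above.

```python
import numpy

# Made-up demand values for illustration
actual = numpy.array([100.0, 200.0, 400.0])
predicted = numpy.array([110.0, 180.0, 400.0])

errors = actual - predicted                      # [-10, 20, 0]

mae = numpy.mean(numpy.abs(errors))              # (10 + 20 + 0) / 3 = 10
rmse = numpy.sqrt(numpy.mean(errors ** 2))       # sqrt((100 + 400 + 0) / 3), about 12.91
mape = numpy.mean(numpy.abs(errors / actual))    # (0.10 + 0.10 + 0) / 3, about 0.067

print('MAE :', round(mae, 2))
print('RMSE:', round(rmse, 2))
print('MAPE:', round(mape, 3))
```

Note that MAPE computed this way is a fraction rather than a percentage, and that it is undefined whenever an actual value is zero. RMSE is never smaller than MAE because squaring gives large errors more weight.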


Here we visualize the data points and the predictions using the previously defined plot function.

In [None]:
# Plot prediction results for a product (UPC)
upc = productList[1]
data_y = salesCereals.loc[salesCereals['UPC']==upc]['UNITS'].values
data_x = salesCereals.loc[salesCereals['UPC']==upc][feature_list].values
plot_data_scatter(data_x, data_y, X_test[upc], y_pred[upc], feature_list)

To see the impact of the price on the demand, we use a simple plot function from matplotlib below to show how the predicted demand changes when the price changes.

For more details of the plot function, please see: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

In [None]:
upc = productList[1]
input_x = []
prices = [2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0]

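# generate inputs for the plot using simple feature values and varying price points
# (FEATURE, DISPLAY and TPR_ONLY set to 0, RELPRICE fixed at 1.0)
for p in prices:
  input_x.append([p, p**2, 0,0,0, 1.0])

# use the training column names so the inputs match the features the model was fitted on
input_x = pandas.DataFrame(input_x, columns=feature_list)

# obtain the predicted demands
predict_y = regr[upc].predict(input_x)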
plt.plot(prices, predict_y, marker='o')
plt.xlabel('Price')
plt.ylabel('Demand') 
plt.show()
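Because the model contains both 'PRICE' and 'PRICE_p2', the predicted demand along this curve is a quadratic function of price, d(p) = b0 + b1·p + b2·p², when the other features are held fixed. The sketch below illustrates this on synthetic, noise-free data; the coefficients 500, -150, and 10 are made up and are not estimates from the cereal data.

```python
import numpy
from sklearn.linear_model import LinearRegression

# Synthetic demand generated from a known quadratic (assumed coefficients)
prices = numpy.linspace(2.0, 4.0, 9)
units = 500 - 150 * prices + 10 * prices ** 2

# Same feature construction as in the notebook: price and squared price
features = numpy.column_stack([prices, prices ** 2])
model = LinearRegression().fit(features, units)

b0 = model.intercept_
b1, b2 = model.coef_

# Slope of the demand curve at a given price: d'(p) = b1 + 2 * b2 * p
slope_at_3 = b1 + 2 * b2 * 3.0

print('intercept:', round(b0, 2), 'b1:', round(b1, 2), 'b2:', round(b2, 2))
print('change in demand per unit of price at p = 3:', round(slope_at_3, 2))
```

The slope depends on the price level, which is exactly what the squared term adds: with 'PRICE' alone, the model would predict the same change in demand per unit of price everywhere.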

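Likewise, we obtain the tree regression results by simply changing the estimator and the printed title. Note that this cell reuses the names 'regr', 'regrSummary', and 'y_pred', so the linear regression results above are overwritten.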

In [None]:
#Tree models
regr = {}
regrSummary = pandas.DataFrame(columns=['trainRMSE', 'testRMSE','testMAE','testMAPE'], index = productList)

for upc in productList:
      
    regr[upc] = sklearn.tree.DecisionTreeRegressor(random_state = 0).fit(X_train[upc],y_train[upc]) # standard regression tree
    # regr[upc] = sklearn.ensemble.RandomForestRegressor(random_state = 0).fit(X_train[upc],y_train[upc]) # random forest tree
    trainRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y_train[upc], regr[upc].predict(X_train[upc])))
    y_pred[upc] = regr[upc].predict(X_test[upc])

    testMAE = sklearn.metrics.mean_absolute_error(y_test[upc], y_pred[upc])
    testMAPE = numpy.mean(numpy.abs((y_test[upc] - y_pred[upc]) / y_test[upc]))
    testRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test[upc], y_pred[upc]))
    regrSummary.loc[upc] =  [trainRMSE, testRMSE, testMAE, testMAPE]

print('Regression Tree Summary')
print(regrSummary)
print('average training RMSE:' + str(round(regrSummary['trainRMSE'].mean(),2)))
print('average testing RMSE:' + str(round(regrSummary['testRMSE'].mean(),2)))
print('average testing MAE:' + str(round(regrSummary['testMAE'].mean(),2)))
print('average testing MAPE:' + str(round(regrSummary['testMAPE'].mean(),2)))
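A gap between the training and testing RMSE of a tree is worth checking, because an unrestricted regression tree can memorize its training data. The following sketch shows this on synthetic data; the data and the `max_depth=2` setting are illustrative assumptions, not a tuning recommendation for the cereal data.

```python
import numpy
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy demand with 40 distinct price points (values are made up)
rng = numpy.random.default_rng(0)
prices = numpy.linspace(2.0, 4.0, 40).reshape(-1, 1)
units = 300 - 60 * prices.ravel() + rng.normal(0, 10, size=40)

# Default tree: grows until every training point sits in its own leaf
deep = DecisionTreeRegressor(random_state=0).fit(prices, units)
# Depth-limited tree: at most 2**2 = 4 leaves
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(prices, units)

deep_train_rmse = numpy.sqrt(numpy.mean((units - deep.predict(prices)) ** 2))
shallow_train_rmse = numpy.sqrt(numpy.mean((units - shallow.predict(prices)) ** 2))
n_levels = len(numpy.unique(shallow.predict(prices)))

print('deep tree training RMSE:', round(deep_train_rmse, 4))
print('shallow tree training RMSE:', round(shallow_train_rmse, 4))
print('distinct predictions of the shallow tree:', n_levels)
```

The deep tree reproduces the training data exactly, noise included, so its training RMSE says little about performance on new data. A tree also predicts a step function of price, which explains the flat segments in the price-demand plot for tree models.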


In [None]:
# Plot prediction results for a product (UPC)
upc = productList[1]
data_y = salesCereals.loc[salesCereals['UPC']==upc]['UNITS'].values
data_x = salesCereals.loc[salesCereals['UPC']==upc][feature_list].values
plot_data_scatter(data_x, data_y, X_test[upc], y_pred[upc], feature_list)

In [None]:
upc = productList[1]
input_x = []
prices = [2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0]

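# generate inputs for the plot using simple feature values and varying price points
# (FEATURE, DISPLAY and TPR_ONLY set to 0, RELPRICE fixed at 1.0)
for p in prices:
  input_x.append([p, p**2, 0,0,0, 1.0])

# use the training column names so the inputs match the features the model was fitted on
input_x = pandas.DataFrame(input_x, columns=feature_list)

# obtain the predicted demands
predict_y = regr[upc].predict(input_x)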
plt.plot(prices, predict_y, marker='o')
plt.xlabel('Price')
plt.ylabel('Demand') 
plt.show()

# Block 4: Model selection

By comparing the average results, we can see that the linear regression model generally outperformed the decision tree regression and did not overfit the data. Therefore, we proceed with the linear regression model and refit it on the whole dataset by replacing 'X_train' with 'X'. Because the model has now 'seen' the whole dataset, the errors reported below are in-sample errors and are normally lower than the earlier test errors. We then save the trained models so that they can be applied to new data in the optimization models in the next session.

In [None]:
# Best model
regr = {}
regrSummary = pandas.DataFrame(columns=['totalMAE','totalMAPE', 'totalRMSE'], index = productList)

for upc in productList:
    regr[upc] = sklearn.linear_model.LinearRegression().fit(X[upc],y[upc])
    y_pred[upc] = regr[upc].predict(X[upc])
    # these metrics are computed on the whole dataset, hence 'total' rather than 'test'
    totalMAE = sklearn.metrics.mean_absolute_error(y[upc], y_pred[upc])
    totalMAPE = numpy.mean(numpy.abs((y[upc] - y_pred[upc]) / y[upc]))
    totalRMSE = numpy.sqrt(sklearn.metrics.mean_squared_error(y[upc], y_pred[upc]))
    regrSummary.loc[upc] =  [totalMAE, totalMAPE, totalRMSE]

print('Best Model Summary')
print(regrSummary)
print('average overall MAE:' + str(round(regrSummary['totalMAE'].mean(),2)))
print('average overall MAPE:' + str(round(regrSummary['totalMAPE'].mean(),2)))
print('average overall RMSE:' + str(round(regrSummary['totalRMSE'].mean(),2)))

## Save trained models

### **Colab**: Save to Google Drive

In addition to downloading the results as we did in the previous session (not shown here), you can also save them to your Google Drive.

In [None]:
# we need to mount Google Drive in order to save into it
import pickle

from google.colab import drive
drive.mount('/content/drive')
cwd = '/content/drive/My Drive/'

# save to drive
for upc in productList:
    filename = cwd+str(upc)+'_demand_model.sav'
    # save the model to disk; the with-block closes the file once it is written
    with open(filename, 'wb') as f:
        pickle.dump(regr[upc], f)

### **Jupyter**: Save to a local folder

In [None]:
import pickle
cwd = './'

# save to the local folder
for upc in productList:
    filename = cwd+str(upc)+'_demand_model.sav'
    # save the model to disk; the with-block closes the file once it is written
    with open(filename, 'wb') as f:
        pickle.dump(regr[upc], f)
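A saved model can be read back with `pickle.load` and used for prediction right away. The sketch below is a self-contained round trip on a toy model written to a temporary folder; the file name `toy_demand_model.sav` and the toy data are hypothetical and only mirror the naming pattern used above.

```python
import os
import pickle
import tempfile

import numpy
from sklearn.linear_model import LinearRegression

# Toy model fitted on made-up data: demand falls by 20 units per unit of price
toy_X = numpy.array([[2.0], [3.0], [4.0]])
toy_y = numpy.array([60.0, 40.0, 20.0])
model = LinearRegression().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, 'toy_demand_model.sav')

    # save the model to disk
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

    # load the model back from disk
    with open(filename, 'rb') as f:
        loaded = pickle.load(f)

# the loaded model gives the same predictions as the original one
same = numpy.allclose(model.predict(toy_X), loaded.predict(toy_X))
print('predictions identical after reload:', same)
```

Two practical cautions: only unpickle files from a source you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that was used to save it where possible.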