<a href="https://colab.research.google.com/github/abunchoftigers/Prediction-of-Product-Sales/blob/main/Prediction_of_Product_Sales_Stack_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction of Product Sales Part 2
[Part One](https://colab.research.google.com/github/abunchoftigers/Prediction-of-Product-Sales/blob/main/Prediction_of_Product_Sales.ipynb)

# Run the first half of this project
(Includes data cleaning)

In [124]:
# %run '/content/drive/MyDrive/Coding Dojo - Data Science/01-Fundamentals/Colab Notebooks/Prediction of Product Sales.ipynb'

 - Author: David Dyer

## imports

In [125]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn import set_config
set_config(transform_output='pandas')

from google.colab import drive
import warnings

warnings.simplefilter('ignore')

## Define regression metrics

In [126]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def regression_metrics(y_true, y_pred, label='', verbose = True, output_dict=False):
  # Get metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  # RMSE is the square root of MSE (the `squared=False` shortcut is deprecated in recent scikit-learn releases)
  rmse = np.sqrt(mse)
  r_squared = r2_score(y_true, y_pred)
  if verbose == True:
    # Print Result with Label and Header
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict == True:
      metrics = {'Label':label, 'MAE':mae,
                 'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
      return metrics
def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  # Get predictions for training data
  y_train_pred = reg.predict(X_train)
  # Call the helper function to obtain regression metrics for training data
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = reg.predict(X_test)
  # Call the helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                  output_dict=output_frame,
                                    label='Test Data' )
  # Store results in a dataframe if output_frame is True
  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    # Set the label as the index
    results_df = results_df.set_index('Label')
    # Set index.name to none to get a cleaner looking result
    results_df.index.name=None
    # Return the dataframe
    return results_df.round(3)
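
To make the numbers these helpers print easier to read, here is a minimal worked example that computes the same four metrics by hand from their standard definitions. The four values below are made up for illustration and are not taken from the sales data.

```python
import math

# Illustrative values only (not taken from the sales data)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average size of the error, ignoring its sign
mae = sum(abs(e) for e in errors) / n
# MSE: average squared error (large misses are penalized more heavily)
mse = sum(e ** 2 for e in errors) / n
# RMSE: square root of MSE, back in the units of the target
rmse = math.sqrt(mse)
# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r_squared:.3f}")
```

An R^2 of 0 corresponds to always predicting the mean of `y_true`, which is why values near 0.5 to 0.6 later in this notebook read as "better than the mean, but far from perfect".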

## Get the data

In [127]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [128]:
fpath = '/content/drive/MyDrive/Coding Dojo - Data Science/01-Fundamentals/Week 3/Data/prediction of product sales.csv'
df = pd.read_csv(fpath)
df.head()

Unnamed: 0.1,Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [129]:
df.isnull().sum()

df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

## code

In [130]:
# Features and target
X = df.drop(columns=['Item_Outlet_Sales', 'Item_Identifier'])
y = df['Item_Outlet_Sales']

X

Unnamed: 0.1,Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,0,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,1,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2
2,2,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1
3,3,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store
4,4,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1
...,...,...,...,...,...,...,...,...,...,...,...
8518,8518,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1
8519,8519,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,,Tier 2,Supermarket Type1
8520,8520,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1
8521,8521,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2


In [131]:
# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [132]:
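# Categorical (object) feature columns, excluding the identifier
obj_cols = df.select_dtypes(include='object').drop(columns=['Item_Identifier']).columns
# Numeric feature columns, excluding the target
num_cols = df.select_dtypes(include='number').columns
num_cols = num_cols.drop(['Item_Outlet_Sales'])
# X was already copied from df, so this fill changes df only, not X_train/X_test;
# missing numeric features are handled by the mean imputer in the numeric pipeline
df[num_cols] = df[num_cols].fillna(value=-1)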
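
One thing worth checking in the cell above: `df.drop(...)` returns a new DataFrame, so filling `df` after `X` has been created and split does not reach `X_train` or `X_test`. A minimal sketch with a toy frame (the column names `weight` and `sales` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; column names are hypothetical
toy_df = pd.DataFrame({'weight': [9.3, np.nan, 17.5], 'sales': [10.0, 20.0, 30.0]})

# Same pattern as the notebook: features are derived from the frame first
toy_X = toy_df.drop(columns=['sales'])

# Filling the original frame afterwards...
toy_df[['weight']] = toy_df[['weight']].fillna(value=-1)

# ...leaves the derived feature frame untouched
print(toy_df['weight'].isna().sum())  # missing values left in toy_df
print(toy_X['weight'].isna().sum())   # missing values left in toy_X
```

In this notebook the numeric pipeline includes a `SimpleImputer`, so missing numeric features are still filled when the pipeline is fit.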

In [133]:
# Now that the data is split, it's safe to fill in missing categorical values in both sets
X_train[obj_cols] = X_train[obj_cols].fillna(value='MISSING')
X_test[obj_cols] = X_test[obj_cols].fillna(value='MISSING')

Numeric pipeline

In [134]:
scaler = StandardScaler()
mean_imputer = SimpleImputer(strategy="mean")

numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

Categorical pipeline

In [135]:
impute_missing = SimpleImputer(strategy='constant',fill_value='MISSING')
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# No standalone fit is needed here: the encoder is fit on the categorical
# columns only, when the ColumnTransformer is fit

ohe_pipe = make_pipeline(impute_missing, ohe_encoder)

Ordinal pipeline

In [136]:
ord_cols = df[['Item_Fat_Content', 'Outlet_Establishment_Year', 'Outlet_Size']]
ord_cols

Unnamed: 0,Item_Fat_Content,Outlet_Establishment_Year,Outlet_Size
0,Low Fat,1999,Medium
1,Regular,2009,Medium
2,Low Fat,1999,Medium
3,Regular,1998,
4,Low Fat,1987,High
...,...,...,...
8518,Low Fat,1987,High
8519,Regular,2002,
8520,Low Fat,2004,Small
8521,Regular,2009,Medium
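
The cell above only selects candidate ordinal columns. A minimal sketch of what an ordinal pipeline for `Outlet_Size` could look like follows. The ordering `Small < Medium < High` and the choice to treat `MISSING` as the lowest level are assumptions made for illustration, and the toy frame stands in for the real training data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for X_train[['Outlet_Size']]
toy_sizes = pd.DataFrame(
    {'Outlet_Size': ['Medium', 'High', np.nan, 'Small']}, dtype=object
)

# Assumed ordering, lowest to highest (MISSING placed first by assumption)
size_order = ['MISSING', 'Small', 'Medium', 'High']

ord_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='MISSING'),
    OrdinalEncoder(categories=[size_order]),
)

encoded = ord_pipe.fit_transform(toy_sizes)
encoded_values = np.asarray(encoded).ravel().tolist()
print(encoded_values)
```

A pipeline like this could be added to the `ColumnTransformer` as a third tuple, with `Outlet_Size` removed from the one-hot columns so it is not encoded twice.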


Create preprocessing object

In [137]:
num_tuple = ('numeric', numeric_pipe, num_cols)
ohe_tuple = ('categorical', ohe_pipe, obj_cols)

In [138]:
col_transformer = ColumnTransformer([num_tuple, ohe_tuple], verbose_feature_names_out=False)

In [139]:
col_transformer.fit(X_train)

# Project 1 - Part 6 (Core):
This week, you will add modeling to your sales prediction project. The goal of this is to help the retailer understand the properties of products and outlets that play crucial roles in predicting sales.


**CRISP-DM Phase 4 - Modeling**

1. Your first task is to build a linear regression model to predict sales.

 * Build a linear regression model.
 * Use the custom evaluation function to get the metrics for your model (on training and test data).


In [140]:
lin_reg = LinearRegression()
linreg_pipe = make_pipeline(col_transformer, lin_reg)

In [141]:
obj_cols

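# Rebuilding the encoder pipeline here does not change col_transformer, which was
# created earlier with the imputer + encoder pipeline and is the one used in linreg_pipe
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

ohe_pipe = make_pipeline(ohe_encoder)
ohe_tuple = ('categorical', ohe_pipe, obj_cols)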

In [142]:
linreg_pipe = make_pipeline(col_transformer, lin_reg)
linreg_pipe.fit(X_train, y_train)

## Evaluate Regression

In [143]:
evaluate_regression(linreg_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 847.117
- MSE = 1,297,543.414
- RMSE = 1,139.098
- R^2 = 0.562

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 804.202
- MSE = 1,194,424.476
- RMSE = 1,092.897
- R^2 = 0.567


 * Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?

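  * The R-squared values are almost exactly the same (0.562 on training vs. 0.567 on test) and not very high. The model generalizes consistently but misses much of the variance in sales, so it is likely underfit.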

## Create a Random Forest model


2. Your second task is to build a Random Forest model to predict sales.

 * Build a default Random Forest model.
 * Use the custom evaluation function to get the metrics for your model (on training and test data).

In [144]:
rf = RandomForestRegressor(random_state=42)

rf_pipe = make_pipeline(col_transformer, rf)

In [145]:
rf_pipe.fit(X_train, y_train)

In [146]:
evaluate_regression(linreg_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 847.117
- MSE = 1,297,543.414
- RMSE = 1,139.098
- R^2 = 0.562

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 804.202
- MSE = 1,194,424.476
- RMSE = 1,092.897
- R^2 = 0.567


In [147]:
evaluate_regression(rf_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 293.188
- MSE = 176,878.510
- RMSE = 420.569
- R^2 = 0.940

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 753.510
- MSE = 1,170,943.685
- RMSE = 1,082.102
- R^2 = 0.576


* Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?
  - The model fit the training data very well (R^2 of 0.940), but performed only marginally better (by 0.009) than the Linear Regression on the test data (0.576 vs. 0.567). This model is overfit.

* Compare this model's performance to the linear regression model: which model has the best test scores?
  - The Random Forest has slightly better test scores (R^2 of 0.576 vs. 0.567, MAE of 753.510 vs. 804.202).

3. Use GridSearchCV to tune at least two hyperparameters for a Random Forest model.


In [148]:
rf_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('numeric',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    Index(['Unnamed: 0', 'Item_Weight', 'Item_Visibility', 'Item_MRP',
          'Outlet_Establishment_Year'],
         dtype='object')),
                                   ('categorical',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(fill_value='MISSING',
                                                                   strategy='constant')),
                                                    ('onehotencoder',
                                                     OneHotEncoder(handle_unk

In [149]:
# param_grid = {'randomforestregressor__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
#               'randomforestregressor__min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
#               'randomforestregressor__min_samples_split': [2, 3, 4]}

param_grid = {'randomforestregressor__max_depth': [1, 3, 5, 7, 9, None],
              'randomforestregressor__min_samples_leaf': [2, 4, 6, 8, 10],
              'randomforestregressor__min_samples_split': [2, 3]}

# param_grid = {'randomforestregressor__max_depth': [2, 4, 6, 8, 10, None],
#               'randomforestregressor__min_samples_leaf': [1, 3, 5, 7, 9],
#               'randomforestregressor__min_samples_split': [2, 3, 4]}

In [150]:
grid_search = GridSearchCV(rf_pipe, param_grid, n_jobs=-1, verbose=1)

In [151]:
%%time
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
CPU times: user 8.22 s, sys: 913 ms, total: 9.13 s
Wall time: 7min 3s
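
The fit count in the log follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fit once per fold. A minimal sketch of that arithmetic, assuming the default 5-fold cross-validation reported in the log (the helper name `count_fits` is made up here, and the final refit on the full training set is not counted):

```python
import math

def count_fits(param_grid, n_folds=5):
    """Return (candidates, total cross-validation fits) for an exhaustive grid."""
    n_candidates = math.prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds

# The grid that was actually searched
searched_grid = {
    'randomforestregressor__max_depth': [1, 3, 5, 7, 9, None],
    'randomforestregressor__min_samples_leaf': [2, 4, 6, 8, 10],
    'randomforestregressor__min_samples_split': [2, 3],
}
# The first commented-out grid from the parameter cell
larger_grid = {
    'randomforestregressor__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
    'randomforestregressor__min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'randomforestregressor__min_samples_split': [2, 3, 4],
}

print(count_fits(searched_grid))  # (60, 300)
print(count_fits(larger_grid))    # (330, 1650)
```

Because the count grows multiplicatively, trimming a few values from each list shrinks the search far more than trimming one list alone.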


In [152]:
grid_search.best_params_

{'randomforestregressor__max_depth': 5,
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 2}

In [153]:
best_model = grid_search.best_estimator_
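
With the default `refit=True`, `GridSearchCV` refits the best parameter combination on all of the data passed to `fit`, so `best_estimator_` is already a model trained on the entire training set. A small self-contained sketch on synthetic data (the dataset and the tiny grid are illustrative assumptions, not the sales data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to keep the example fast
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=10.0,
                                 random_state=42)

demo_grid = {'max_depth': [2, None], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=42),
    demo_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

# refit defaults to True, so best_estimator_ was refit on all of X_demo
print(search.refit)
print(search.best_params_)
# One entry per candidate combination: 2 * 2 = 4
n_candidates = len(search.cv_results_['params'])
print(n_candidates)
# The refit model can predict directly, with no further fitting
demo_predictions = search.best_estimator_.predict(X_demo)
```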

In [154]:
evaluate_regression(best_model, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 755.015
- MSE = 1,151,313.162
- RMSE = 1,072.993
- R^2 = 0.611

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 728.441
- MSE = 1,095,269.748
- RMSE = 1,046.551
- R^2 = 0.603



 * After determining the best parameters from your GridSearch, fit and evaluate a final best model on the entire training set (no folds).
 * Compare your tuned model to your default Random Forest: did the performance improve?
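   - Yes. Test R^2 rose from 0.576 to 0.603, and the gap between training and test R^2 shrank from 0.364 (0.940 vs. 0.576) to 0.008 (0.611 vs. 0.603).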
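
The overfit/underfit comparisons in this notebook come down to the gap between training and test R^2. A minimal sketch of that arithmetic, using the R^2 values copied from the evaluation outputs above:

```python
# R^2 values copied from the evaluation outputs in this notebook
r2_scores = {
    'Linear Regression': {'train': 0.562, 'test': 0.567},
    'Default Random Forest': {'train': 0.940, 'test': 0.576},
    'Tuned Random Forest': {'train': 0.611, 'test': 0.603},
}

gaps = {}
for name, scores in r2_scores.items():
    # A large positive gap (train much higher than test) suggests overfitting;
    # a small gap with modest scores on both sets suggests underfitting
    gaps[name] = round(scores['train'] - scores['test'], 3)
    print(f"{name}: train={scores['train']:.3f}, "
          f"test={scores['test']:.3f}, gap={gaps[name]:+.3f}")
```

Only the default Random Forest shows a large gap; the other two models score about the same on both sets.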



**CRISP-DM Phase 5 - Evaluation**

4. You now have tried several different models on your data set. You need to determine which model to implement.

 * Overall, which model do you recommend?

    * The tuned Random Forest model
 * Justify your recommendation.
    * The tuned Random Forest had the best test scores of the three models (R^2 of 0.603, MAE of 728.441), with almost no gap between training and test performance. I cut almost half of the parameter values I wanted to test after my first attempt to call `grid_search.fit()` was still running 28 minutes later. The search now takes about 7 minutes, and by rerunning it with different value lists of the same size, I can probably find further improvements.

In [155]:
df['Item_Outlet_Sales'].mean(), 728.44 / df['Item_Outlet_Sales'].mean() * 100

(2181.288913575032, 33.394934319182894)


 * In a Markdown cell:
    * Interpret your model's performance based on R-squared in a way that your non-technical stakeholder can understand.
    * Select another regression metric (RMSE/MAE/MSE) to express the performance of your model to your stakeholder.
   * Include why you selected this metric to explain to your stakeholder.
   * Compare the training vs. test scores and answer the question: to what extent is this model overfit/underfit?

* **R^2**: "This model is able to explain (and therefore predict) a little more than 60% of the variance in product sales."

* **MAE**: "The model's average, or *mean*, error, counting predictions that were higher than the true number as well as those that were lower, is about 728, or 33% of the average sales figure (about 2,181). Put loosely, the model predicts sales figures about 66% as well as a time-traveller from Q1 of next year would be able to."

  *  I chose to compare the MAE with the (arithmetic) mean for three reasons:

    1. It's the simplest metric.

    2. 33% and 66% are easy fractions. I expect the stakeholder would already have a solid intuitive understanding of what these numbers mean.

    3. The stakeholder probably remembers what the *mean* of a set is from high school.

* The training and test scores of this model are similar (R^2 of 0.611 vs. 0.603) and in the middling range. This indicates the model is not overfit but is still somewhat underfit, and further tuning may help.

