<a href="https://colab.research.google.com/github/abunchoftigers/Prediction-of-Product-Sales/blob/main/Prediction_of_Product_Sales_Stack_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction of Product Sales Part 2
[Part One](https://colab.research.google.com/github/abunchoftigers/Prediction-of-Product-Sales/blob/main/Prediction_of_Product_Sales.ipynb)

 - Author: David Dyer

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, ColumnTransformer

from sklearn import set_config
set_config(transform_output='pandas')

from google.colab import drive
import warnings

warnings.simplefilter('ignore')

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
fpath = '/content/drive/MyDrive/Coding Dojo - Data Science/01-Fundamentals/Week 2/Data/sales_predictions_2023.csv'
df = pd.read_csv(fpath)
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [None]:
# Standardize inconsistent labels in the Item_Fat_Content column
item_fat_map = {
    'LF': 'Low Fat',
    'low fat': 'Low Fat',
    'reg': 'Regular'
}
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(item_fat_map)
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

In [None]:
# Features and target
X = df.drop(columns=['Item_Outlet_Sales', 'Item_Identifier'])
y = df['Item_Outlet_Sales']

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
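# Column groups, taken from the training features so that the target
# (Item_Outlet_Sales) is never passed to the ColumnTransformer
obj_cols = X_train.select_dtypes(include='object').columns
num_cols = X_train.select_dtypes(include='number').columns
# Missing numeric values stay as NaN here; the mean imputer in the
# numeric pipeline fills them using statistics from the training data only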

In [None]:
# After the split, fill missing categorical values in both sets
X_train[obj_cols] = X_train[obj_cols].fillna(value='MISSING')
X_test[obj_cols] = X_test[obj_cols].fillna(value='MISSING')

Numeric pipeline

In [None]:
scaler = StandardScaler()
mean_imputer = SimpleImputer(strategy="mean")

numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

Categorical pipeline

In [None]:
impute_missing = SimpleImputer(strategy='constant', fill_value='MISSING')
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# No separate fit here: the encoder is fit on the categorical columns only
# when the ColumnTransformer is fit
ohe_pipe = make_pipeline(impute_missing, ohe_encoder)

Create preprocessing object

In [None]:
num_tuple = ('numeric', numeric_pipe, num_cols)
ohe_tuple = ('categorical', ohe_pipe, obj_cols)

In [None]:
col_transformer = ColumnTransformer([num_tuple, ohe_tuple], verbose_feature_names_out=False)

# Project 1 - Part 6 (Core):
This week, you will add modeling to your sales prediction project. The goal is to help the retailer understand which properties of products and outlets play crucial roles in predicting sales.


**CRISP-DM Phase 4 - Modeling**

1. Your first task is to build a linear regression model to predict sales.

 * Build a linear regression model.
 * Use the custom evaluation function to get the metrics for your model (on training and test data).
 * Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?
2. Your second task is to build a Random Forest model to predict sales.

 * Build a default Random Forest model.
 * Use the custom evaluation function to get the metrics for your model (on training and test data).
 * Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?
 * Compare this model's performance to the linear regression model: which model has the best test scores?

3. Use GridSearchCV to tune at least two hyperparameters for a Random Forest model.

 * After determining the best parameters from your GridSearch, fit and evaluate a final best model on the entire training set (no folds).
 * Compare your tuned model to your default Random Forest: did the performance improve?
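
The three modeling tasks above can be sketched end to end as follows. This is only an illustration under stated assumptions: `evaluate_regression` is a hypothetical stand-in for the course's custom evaluation function, the data comes from `make_regression` rather than the sales dataset, and the grid values are arbitrary. In the notebook itself, each model would normally be wrapped with the preprocessor, for example `make_pipeline(col_transformer, RandomForestRegressor())`, in which case the grid keys take the step-name prefix (such as `randomforestregressor__max_depth`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def evaluate_regression(model, X_tr, y_tr, X_te, y_te):
    """Hypothetical helper: MAE, RMSE and R^2 for the train and test splits."""
    results = {}
    for split, X_part, y_part in [('train', X_tr, y_tr), ('test', X_te, y_te)]:
        preds = model.predict(X_part)
        results[split] = {
            'MAE': mean_absolute_error(y_part, preds),
            'RMSE': np.sqrt(mean_squared_error(y_part, preds)),
            'R2': r2_score(y_part, preds),
        }
    return results


# Synthetic stand-in data (assumption): NOT the sales dataset
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Task 1: linear regression
lin_reg = LinearRegression().fit(X_tr, y_tr)
lin_scores = evaluate_regression(lin_reg, X_tr, y_tr, X_te, y_te)

# Task 2: default random forest
rf_default = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
rf_scores = evaluate_regression(rf_default, X_tr, y_tr, X_te, y_te)

# Task 3: tune two hyperparameters with GridSearchCV (illustrative grid)
param_grid = {'max_depth': [3, 6, None], 'n_estimators': [50, 100]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    cv=3, scoring='r2')
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has already been refit on
# the entire training set using the best parameters
rf_tuned = grid.best_estimator_
tuned_scores = evaluate_regression(rf_tuned, X_tr, y_tr, X_te, y_te)

for name, scores in [('Linear regression', lin_scores),
                     ('Default random forest', rf_scores),
                     ('Tuned random forest', tuned_scores)]:
    print(f"{name}: train R2 = {scores['train']['R2']:.3f}, "
          f"test R2 = {scores['test']['R2']:.3f}")
print('Best parameters:', grid.best_params_)
```

A training R-squared far above the test R-squared suggests overfitting, while low scores on both splits suggest underfitting.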

**CRISP-DM Phase 5 - Evaluation**

4. You have now tried several different models on your data set. You need to determine which model to implement.

 * Overall, which model do you recommend?
 * Justify your recommendation.
 * In a Markdown cell:
    * Interpret your model's performance based on R-squared in a way that your non-technical stakeholder can understand.
    * Select another regression metric (RMSE/MAE/MSE) to express the performance of your model to your stakeholder.
    * Explain why you selected this metric to present to your stakeholder.
    * Compare the training vs. test scores and answer the question: to what extent is this model overfit/underfit?
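
When choosing which metric to present alongside R-squared, it can help to see how MAE and RMSE behave on the same predictions. The sketch below uses four made-up sales values (hypothetical, chosen only so the arithmetic is easy to follow) in which one prediction misses badly.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted sales (hypothetical values, not model output)
y_true = np.array([100, 200, 300, 400])
y_pred = np.array([110, 190, 310, 300])  # the last prediction is off by 100

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.2f}")   # average size of an error, in sales units
print(f"RMSE: {rmse:.2f}")  # also in sales units, weights large errors more
print(f"R2:   {r2:.3f}")    # share of the variation in sales that is explained
```

Here the single large miss pulls RMSE (about 50.7) well above MAE (32.5). MAE reads naturally as "on average, predictions are off by this much", while RMSE is the better choice when large errors are especially costly to the stakeholder.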