
# The goal of this project is to help the retailer understand which properties of products and outlets play crucial roles in predicting sales.

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from sklearn import set_config
set_config(display='diagram')

In [2]:
#Mount the Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#Load the Dataset
path = '/content/drive/MyDrive/CodingDojo Datascience/sales_predictions.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [5]:
# Let's clean up a few values first
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [10]:
# Map the Outlet_Size values to numbers, since they are ordinal (Small < Medium < High)
outlet_size_mapping = {'Small': 0,
                       'Medium': 1,
                       'High': 2
                       }

In [11]:
df['Outlet_Size'] = df['Outlet_Size'].replace(outlet_size_mapping)
df['Outlet_Size'].value_counts()

1.0    2793
0.0    2388
2.0     932
Name: Outlet_Size, dtype: int64
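The counts above show floats (`1.0`, `0.0`, `2.0`) rather than integers because the missing sizes stay as `NaN` after the mapping. The sketch below illustrates this on made-up values. It uses `Series.map` instead of the `replace` call in the notebook, and the names `sizes` and `size_order` are hypothetical. One consequence worth keeping in mind: the column now has a numeric dtype, so a `dtype_include='number'` selector will pick it up.

```python
import numpy as np
import pandas as pd

# Made-up sample, not taken from the sales dataset
sizes = pd.Series(['Small', 'Medium', 'High', np.nan, 'Medium'])

# Ordinal encoding: Small < Medium < High
size_order = {'Small': 0, 'Medium': 1, 'High': 2}

# map() leaves missing entries as NaN, which forces a float dtype
encoded = sizes.map(size_order)
print(encoded.tolist())
print(encoded.dtype)
```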

In [12]:
# Show the unique values in Item_Fat_Content
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [13]:
# Create a dictionary that maps the inconsistent labels to their standard form
item_fat_mapping = {'LF': 'Low Fat',
                    'reg': 'Regular',
                    'low fat': 'Low Fat'
                    }

# Replace the inconsistent values using the mapping
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(item_fat_mapping)
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

1) Your first task is to build a linear regression model to predict sales.

- Build a linear regression model.
- Evaluate the performance of your model based on r^2.
- Evaluate the performance of your model based on rmse.

In [14]:
# Define features X and target y
X = df.drop('Item_Outlet_Sales', axis=1)
y = df['Item_Outlet_Sales']

# Perform a train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

We want to fit a Linear Regression on the data set, so we'll create a column transformer that does the following:
- Simple Imputation
- Scale the Numerical Data
- OneHot Encode the Categorical Data

In [15]:
# Create Selectors for Categories and Numbers
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [16]:
# Imputers
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')
# Scaler
scaler = StandardScaler()
# One-hot encoder (scikit-learn >= 1.2 names this argument `sparse_output`; older releases call it `sparse`)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

In [17]:
# Numeric pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

# Categorical pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

In [19]:
#tuples for column transformer
number_tuple = (numeric_pipe,num_selector)
category_tuple = (categorical_pipe, cat_selector)

In [20]:
#ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple)
preprocessor

In [21]:
# fit on train
preprocessor.fit(X_train)



In [23]:
# transform train and test
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
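As an alternative to overwriting `X_train` and `X_test` with their transformed versions, the preprocessor and the model can be chained into one pipeline, so the raw frames stay available for other models. The sketch below shows the idea on a hypothetical toy frame (`toy_X`, `toy_y` are made up, with column names borrowed from the dataset). It selects categorical columns with `dtype_exclude='number'`, which is a small departure from the selectors used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, for illustration only
toy_X = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 19.2],
    'Item_MRP': [249.8, 48.3, 141.6, 182.1],
    'Outlet_Type': ['Supermarket Type1', 'Supermarket Type2',
                    'Supermarket Type1', 'Grocery Store'],
})
toy_y = pd.Series([3735.1, 443.4, 2097.3, 732.4])

# Same steps as in the notebook: impute, then scale or one-hot encode
numeric_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                 OneHotEncoder(handle_unknown='ignore'))

preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include='number')),
    (categorical_pipe, make_column_selector(dtype_exclude='number')),
)

# Preprocessing and model in a single estimator
model = make_pipeline(preprocessor, LinearRegression())
model.fit(toy_X, toy_y)
preds = model.predict(toy_X)
print(preds.shape)
```

With this layout, `model.fit` learns the imputation values, scaling statistics, and categories from the training rows only, and `model.predict` applies them to whatever frame it is given.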

# Let's now get into our Linear Regression

In [24]:
# Instantiate a linear regression model
lin_reg = LinearRegression()

In [25]:
# Fit the model on the transformed training data
lin_reg.fit(X_train, y_train)

In [26]:
# Create model predictions
train_pred = lin_reg.predict(X_train)
test_pred = lin_reg.predict(X_test)

Evaluate Model Using R^2

In [27]:
#Evaluate Model performance using R^2
train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test,test_pred)

print(f'Model Training R2: {train_r2}')
print(f'Model Testing R2: {test_r2}')

Model Training R2: 0.6716866170021532
Model Testing R2: -2.2000993154978634e+18
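A test R^2 this far below zero points to numerical trouble rather than an ordinary bad fit. One possible contributor, which this notebook does not verify, is that `Item_Identifier` is still among the features: it is an object column, so it gets one-hot encoded, and with `handle_unknown='ignore'` any identifier that appears only in the test split is encoded as a row of zeros. The sketch below shows that encoder behaviour on a few identifiers taken from the `df.head()` output; treating this as the cause of the score is an assumption.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on three identifiers only
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['FDA15'], ['DRC01'], ['FDN15']]))

# A category seen during fit gets exactly one active column
seen = enc.transform(np.array([['DRC01']])).toarray()

# A category not seen during fit is encoded as all zeros
unseen = enc.transform(np.array([['NCD19']])).toarray()

print(seen)
print(unseen)
```

Dropping the identifier column before the split and refitting would be one way to check whether it is responsible.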


Evaluate Model Using RMSE

In [28]:
# Calculate RMSE (square root of the mean squared error)
train_RMSE = np.sqrt(np.mean((train_pred - y_train)**2))
test_RMSE = np.sqrt(np.mean((test_pred - y_test)**2))


print(f'Model Train Root Mean Squared Error (RMSE): {train_RMSE}')
print(f'Model Test Root Mean Squared Error (RMSE): {test_RMSE}')

Model Train Root Mean Squared Error (RMSE): 985.7123891812413
Model Test Root Mean Squared Error (RMSE): 2463741878196.166
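The RMSE above is computed by hand with NumPy. As a cross-check, the same quantity can be obtained from `mean_squared_error`, which is already imported at the top of the notebook. The sketch below compares both routes on made-up values (`y_true` and `y_pred` are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

# Manual route: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```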


2) Your second task is to build a regression tree model to predict sales.

- Build a simple regression tree model.
- Compare the performance of your model based on r^2.
- Compare the performance of your model based on rmse.  

We shall use a Random Forest Model, which is an ensemble of regression trees.

In [29]:
rf = RandomForestRegressor(random_state=42)

In [30]:
rf.fit(X_train, y_train)

In [31]:
# Let's evaluate the model using R^2
rf_train_score = rf.score(X_train, y_train)
rf_test_score = rf.score(X_test, y_test)
print(rf_train_score)
print(rf_test_score)

0.9377504328209025
0.5509594794820931


The R^2 score for the **Random Forest Model (0.94 on the Train Set and 0.55 on the Test Set)** is much better than the R^2 score for the **Linear Regression Model (0.6716866170021532 on the Train Set and -2.2000993154978634e+18 on the Test Set)**.

The Random Forest Model performs better.

In [32]:
# Let's first predict the values using the Random Forest model
train_rf_pred = rf.predict(X_train)
test_rf_pred = rf.predict(X_test)

In [34]:
# Evaluate the model using RMSE
train_rf_RMSE = np.sqrt(np.mean((train_rf_pred - y_train)**2))
test_rf_RMSE = np.sqrt(np.mean((test_rf_pred - y_test)**2))


print(f'Model Train Root Mean Squared Error (RMSE): {train_rf_RMSE}')
print(f'Model Test Root Mean Squared Error (RMSE): {test_rf_RMSE}')

Model Train Root Mean Squared Error (RMSE): 429.214208400059
Model Test Root Mean Squared Error (RMSE): 1113.0555230597365



On the Test Set, the Linear Regression Model RMSE is 2463741878196.166, compared to the Random Forest Model RMSE of 1113.0555230597365.

3) You now have tried 2 different models on your data set. You need to determine which model to implement.

Overall, which model do you recommend?
Justify your recommendation.

I would recommend the Random Forest Model because, although it is still somewhat overfit (R^2 of 0.94 on the Train Set versus 0.55 on the Test Set), it has a better R^2 score than the Linear Regression Model.

The Linear Regression Model has a far larger Test Set RMSE (2463741878196.166) than the Random Forest Model (1113.0555230597365).

These results show that the Random Forest Model performs better than the Linear Regression model. 

It's possible to tune the Random Forest Model to get better results.
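One way such tuning could look is a small cross-validated grid search over tree depth and leaf size, both of which limit how closely the forest can fit the training rows. The sketch below runs on synthetic data from `make_regression`, not on the sales data, and the grid values are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the sales dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Illustrative grid: shallower trees and larger leaves reduce overfitting
param_grid = {'max_depth': [3, 5, None],
              'min_samples_leaf': [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=42),
    param_grid,
    cv=3,
    scoring='r2',
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# R^2 of the refit best model on the held-out split
print(search.score(X_te, y_te))
```

Comparing the train and test R^2 of the selected model would show whether the gap seen above has narrowed.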