
The goal of this step is to help the retailer by using machine learning to make predictions about future sales based on the data provided. 

- This notebook continues processing the data in order to build linear regression and regression tree models.

In [27]:
# Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
from sklearn.linear_model import LinearRegression
set_config(display='diagram')
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

In [28]:
# Load and view dataset; Since each row is identified by a unique item ID, I assigned the index as the 'Item_Identifier' column
filename = '/content/drive/MyDrive/Coding Dojo/Week 2: Pandas/sales_predictions.csv'
df = pd.read_csv(filename, index_col='Item_Identifier')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 8523 entries, FDA15 to DRG01
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                7060 non-null   float64
 1   Item_Fat_Content           8523 non-null   object 
 2   Item_Visibility            8523 non-null   float64
 3   Item_Type                  8523 non-null   object 
 4   Item_MRP                   8523 non-null   float64
 5   Outlet_Identifier          8523 non-null   object 
 6   Outlet_Establishment_Year  8523 non-null   int64  
 7   Outlet_Size                6113 non-null   object 
 8   Outlet_Location_Type       8523 non-null   object 
 9   Outlet_Type                8523 non-null   object 
 10  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 799.0+ KB


Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


- Missing values occur in the 'Item_Weight' and 'Outlet_Size' columns.
- Numeric columns: 'Item_Weight', 'Item_Visibility', 'Item_MRP', and 'Outlet_Establishment_Year'.
- Ordinal columns: 'Item_Fat_Content', 'Outlet_Size', and 'Outlet_Location_Type'.
- Nominal columns: 'Item_Type', 'Outlet_Type', and 'Outlet_Identifier'.
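The missing-value counts noted above can also be summarized programmatically. The following is a minimal sketch on a small made-up DataFrame; the helper name `missing_summary` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

On the real data, `missing_summary(df)` would be expected to list only 'Item_Weight' and 'Outlet_Size', in line with the `df.info()` output.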

In [29]:
# confirmed no duplicated rows
df.duplicated().sum()

0

In [30]:
# examining number of unique values in each column
df.nunique()

Item_Weight                   415
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     3
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
dtype: int64

In [31]:
# Ordinal Encoding for Item Fat Content
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [32]:
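# Assign the result back instead of using inplace=True on a column selection, which may not update df in newer pandas versions
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'Low Fat': 0, 'LF': 0, 'low fat': 0, 'Regular': 1, 'reg': 1})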

In [33]:
# Ordinal Encoding for Outlet Size
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [34]:
# Missing values stay as NaN here and are imputed later in the pipeline
df['Outlet_Size'] = df['Outlet_Size'].replace({'Small': 0, 'Medium': 1, 'High': 2})

In [35]:
# Ordinal Encoding for Outlet Location Type
df['Outlet_Location_Type'].value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

In [36]:
df['Outlet_Location_Type'] = df['Outlet_Location_Type'].replace({'Tier 1': 0, 'Tier 2': 1, 'Tier 3': 2})
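
The three ordinal encodings above all follow the same pattern: map each label to an integer rank. The sketch below shows one way to do this with a check for unexpected labels. The helper `encode_ordinal` is hypothetical and not part of the original notebook, and the toy Series only mimic the labels shown in the `value_counts()` outputs.

```python
import numpy as np
import pandas as pd


def encode_ordinal(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integer ranks, failing loudly on labels missing from the mapping."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Labels without a mapping: {sorted(unmapped)}")
    # Series.map leaves missing values as NaN so they can be imputed later
    return series.map(mapping)


fat_map = {"Low Fat": 0, "LF": 0, "low fat": 0, "Regular": 1, "reg": 1}
size_map = {"Small": 0, "Medium": 1, "High": 2}

toy_fat = pd.Series(["Low Fat", "LF", "reg", "Regular", "low fat"])
toy_size = pd.Series(["Small", np.nan, "High"])

encoded_fat = encode_ordinal(toy_fat, fat_map)
encoded_size = encode_ordinal(toy_size, size_map)
print(encoded_fat.tolist())
print(encoded_size.tolist())
```

Unlike `replace`, this version raises an error if a new spelling such as an unlisted variant of 'Low Fat' appears, rather than silently leaving a string in the column.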

In [37]:
# Validation split
X = df.drop('Item_Outlet_Sales', axis=1)
y = df['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
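
Because `train_test_split` is called without `test_size`, scikit-learn falls back to its default of holding out 25% of the rows for testing. A small sketch on made-up arrays (the `demo_` names are assumptions for illustration) shows the resulting proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each (hypothetical data)
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)
print(demo_X_train.shape, demo_X_test.shape)
```

Fixing `random_state` makes the split reproducible, so reruns produce the same train and test rows.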

In [38]:
# Instantiate Column Selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [39]:
# Instantiate Transformers

# For imputers, use the 'most_frequent' strategy for categorical values and 'median' for numerical values, since the numeric columns include both int and float data types
freq_imputer = SimpleImputer(strategy='most_frequent')
median_imputer = SimpleImputer(strategy='median')
# Scaler
scaler = StandardScaler()
# One Hot Encoder
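# scikit-learn 1.2 renamed 'sparse' to 'sparse_output'; on older versions use sparse=False instead
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)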

In [40]:
# Instantiate Pipelines

# Numeric pipeline
numeric_pipe = make_pipeline(median_imputer, scaler)
numeric_pipe

In [41]:
# Categorical pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

In [42]:
# Instantiate ColumnTransformer

# Tuples for Column Transformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)
# ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple)
preprocessor
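
To make the effect of the preprocessor concrete, here is a minimal sketch that applies the same kind of pipeline (median imputation plus scaling for numeric columns, most-frequent imputation plus one-hot encoding for object columns) to a tiny made-up DataFrame. The toy values and the `toy_` names are assumptions for illustration. The encoder is left at its default output type and converted to a dense array afterwards so the sketch does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) with one gap in a numeric and one in a categorical column
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
    "Outlet_Type": pd.Series(
        ["Grocery Store", "Supermarket Type1", np.nan, "Supermarket Type1"],
        dtype=object,
    ),
})

toy_numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_categorical_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
toy_preprocessor = make_column_transformer(
    (toy_numeric_pipe, make_column_selector(dtype_include="number")),
    (toy_categorical_pipe, make_column_selector(dtype_include="object")),
)

toy_out = toy_preprocessor.fit_transform(toy)
# Convert to a dense array if the transformer returned a sparse matrix
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()
print(toy_out.shape)
```

The output has two scaled numeric columns followed by one column per category, with no missing values left.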

# Linear Regression model

In [43]:
# Build a model pipeline that chains the preprocessor (numeric and categorical pipes) with a Linear Regression model
reg = LinearRegression()
reg_pipe = make_pipeline(preprocessor, reg)
reg_pipe

In [44]:
reg_pipe.fit(X_train, y_train)

In [45]:
# Make predictions using the training and testing data
training_predictions = reg_pipe.predict(X_train)
test_predictions = reg_pipe.predict(X_test)
training_predictions

array([3811. , 2656.5, 2608.5, ..., 3736.5, 1932.5, 1536.5])

- Now that we have our linear regression model, we can evaluate its performance using R2 and RMSE.
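
For reference, both metrics can be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper `rmse_and_r2` and three made-up values, and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and R2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return rmse, 1.0 - ss_res / ss_tot


demo_true = [3.0, 5.0, 7.0]
demo_pred = [2.0, 5.0, 9.0]
demo_rmse, demo_r2 = rmse_and_r2(demo_true, demo_pred)
print(demo_rmse, demo_r2)

# Cross-check against scikit-learn's implementations
assert np.isclose(demo_rmse, np.sqrt(mean_squared_error(demo_true, demo_pred)))
assert np.isclose(demo_r2, r2_score(demo_true, demo_pred))
```

RMSE is in the same units as the target (sales), while R2 is unitless, which is why the two are reported together.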

In [46]:
# Calculating MSE first
train_MSE = mean_squared_error(y_train, training_predictions)
test_MSE = mean_squared_error(y_test, test_predictions)
print(f'Training MSE: {train_MSE}')
print(f'Testing MSE: {test_MSE}')

Training MSE: 1297555.623978998
Testing MSE: 1194355.2714576465


In [47]:
# Now that we have the MSE, we can calculate RMSE
train_RMSE = np.sqrt(train_MSE)
test_RMSE = np.sqrt(test_MSE)
print(f'Training RMSE: {train_RMSE}')
print(f'Testing RMSE: {test_RMSE}')

Training RMSE: 1139.1029909446283
Testing RMSE: 1092.8656236965487


In [48]:
# Calculating R2
train_r2 = r2_score(y_train, training_predictions)
test_r2 = r2_score(y_test, test_predictions)
print(f'R2 Train score: {train_r2}')
print(f'R2 Test score: {test_r2}')

R2 Train score: 0.5615559908552253
R2 Test score: 0.5671021734263202


# Regression Tree Model

In [49]:
# Instantiate Decision Tree Regressor
dec_tree = DecisionTreeRegressor(random_state = 42)

In [50]:
# create model pipeline 
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)
# fit model
dec_tree_pipe.fit(X_train, y_train)

In [52]:
# Make predictions for Decision Tree Model using training and testing data
train_preds = dec_tree_pipe.predict(X_train)
test_preds = dec_tree_pipe.predict(X_test)

- Now that we have the Decision Tree model, let's calculate the model metrics.

In [53]:
# Calculate regression tree MSE
train_MSE = mean_squared_error(y_train, train_preds)
test_MSE = mean_squared_error(y_test, test_preds)
print(f'Regression Tree Training MSE: {train_MSE}')
print(f'Regression Tree Testing MSE: {test_MSE}')

Regression Tree Training MSE: 3.0330171474830394e-29
Regression Tree Testing MSE: 2229870.601748823


In [54]:
# Calculate regression tree RMSE
train_RMSE = np.sqrt(train_MSE)
test_RMSE = np.sqrt(test_MSE)
print(f'Regression Tree Training RMSE: {train_RMSE}')
print(f'Regression Tree Testing RMSE: {test_RMSE}')

Regression Tree Training RMSE: 5.50728349323243e-15
Regression Tree Testing RMSE: 1493.27512593923


In [56]:
# Calculate regression tree R2
train_r2 = r2_score(y_train, train_preds)
test_r2 = r2_score(y_test, test_preds)
print(f'Regression Tree Training R2: {train_r2}')
print(f'Regression Tree Testing R2: {test_r2}')


Regression Tree Training R2: 1.0
Regression Tree Testing R2: 0.19177638337083347


# Conclusion
After fitting both models, linear regression gave the better results. Its R2 is modest (about 0.56 on the training data and 0.57 on the test data), but the two scores are nearly equal, so the model is not overfit. The regression tree, by contrast, fit the training data perfectly (R2 of 1.0) yet reached an R2 of only about 0.19 on the test data, a clear sign of overfitting. The RMSE results show the same pattern: roughly 1139 (training) versus 1093 (testing) for linear regression, compared with effectively 0 versus about 1493 for the regression tree.
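
A common way to reduce this kind of overfitting is to limit the depth of the tree. The sketch below is not part of the original analysis: it uses synthetic data, and the `max_depth=3` setting is an arbitrary assumption chosen only to show how a fully grown tree and a depth-limited tree can be compared side by side.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a noisy linear signal in the first of three features
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=400)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# Fully grown tree (the default) versus a depth-limited tree
full_tree = DecisionTreeRegressor(random_state=42).fit(Xd_train, yd_train)
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(Xd_train, yd_train)

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(
        f"{name}: depth={model.get_depth()}, "
        f"train R2={model.score(Xd_train, yd_train):.3f}, "
        f"test R2={model.score(Xd_test, yd_test):.3f}"
    )
```

In the notebook itself, the same comparison could be run by passing `max_depth` to the `DecisionTreeRegressor` inside `dec_tree_pipe` and trying several values.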