<a href="https://colab.research.google.com/github/carlolopez03/Prediction-of-Product-Sales/blob/main/Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Prediction of Sales**
##Carlo Lopez

##**Load Data**

In [194]:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn import set_config
set_config(transform_output='pandas')

In [195]:
#Loading data
file = '/content/drive/MyDrive/CodingDojo/02-IntroML/Week05/Data/sales_predictions_2023.csv'
df = pd.read_csv(file)
df.head()


Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [196]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


##**Data Cleaning**

In [197]:
df.shape

(8523, 12)

In [198]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
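
To put the missing counts in context, the short sketch below converts them into percentages of the 8,523 rows. The counts are copied from the output above; the variable names are illustrative only.

```python
# Missing-value counts copied from the df.isna().sum() output above.
n_rows = 8523
missing_counts = {"Item_Weight": 1463, "Outlet_Size": 2410}

# Share of rows with a missing value in each column, in percent.
missing_pct = {col: round(100 * n / n_rows, 2) for col, n in missing_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct}% missing")
```

Both columns are missing in a sizeable share of rows, which is one reason to impute them rather than drop the affected rows.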

In [199]:
df.duplicated().sum()

0

In [200]:
str_cols = df.select_dtypes(object).columns
for col in str_cols:
  print(f'Value Counts for {col}')
  print(df[col].value_counts())
  print('\n')

Value Counts for Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64


Value Counts for Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64


Value Counts for Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64


Value Counts for Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930


In [201]:
df = df.drop(columns=['Item_Identifier', 'Outlet_Identifier', 'Outlet_Establishment_Year'])

In [202]:
# Assign the result back instead of using inplace=True on a selected column
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF':'Low Fat', 'low fat':'Low Fat', 'reg':'Regular'})
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
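
As a quick consistency check, the sketch below applies the same label mapping to the raw counts printed earlier and confirms that the merged totals match the output above. It uses plain Python rather than pandas, and the variable names are illustrative.

```python
from collections import Counter

# Raw label counts copied from the value_counts() output for Item_Fat_Content.
raw_counts = {"Low Fat": 5089, "Regular": 2889, "LF": 316, "reg": 117, "low fat": 112}

# Same mapping as the replace() call: inconsistent spellings -> canonical label.
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}

merged = Counter()
for label, count in raw_counts.items():
    # Labels that are not in the mapping are already canonical and kept as-is.
    merged[label_map.get(label, label)] += count

print(dict(merged))
```

The merged counts add up to the full 8,523 rows, so no label was lost in the cleanup.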

##**Machine Learning**

In [203]:
#Defining the features and target
y = df['Item_Outlet_Sales']
X = df.drop(columns = 'Item_Outlet_Sales')

#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type
4776,16.35,Low Fat,0.029565,Household,256.4646,Medium,Tier 3,Supermarket Type2
7510,15.25,Regular,0.0,Snack Foods,179.766,Medium,Tier 3,Supermarket Type2
5828,12.35,Regular,0.158716,Meat,157.2946,Medium,Tier 1,Supermarket Type1
5327,7.975,Low Fat,0.014628,Baking Goods,82.325,Small,Tier 2,Supermarket Type1
4810,19.35,Low Fat,0.016645,Frozen Foods,120.9098,,Tier 2,Supermarket Type1
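
Because `train_test_split` is called without `test_size`, scikit-learn's default of 0.25 applies. The sketch below estimates the resulting split sizes for 8,523 rows, assuming the test count is rounded up, as scikit-learn does for a fractional `test_size`.

```python
import math

n_rows = 8523
test_size = 0.25  # scikit-learn default when neither test_size nor train_size is given

n_test = math.ceil(test_size * n_rows)   # fractional test sizes are rounded up
n_train = n_rows - n_test                # the remainder goes to training

print(n_train, n_test)
```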


In [204]:
#Defining numeric features
num_cols = X_train.select_dtypes('number').columns
num_cols

Index(['Item_Weight', 'Item_Visibility', 'Item_MRP'], dtype='object')

In [205]:
#Processors
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

#Pipeline for numeric feature
num_pipe = make_pipeline(impute_mean, scaler)
num_pipe

In [206]:
#Defining ordinal features
ordinal_cols = ['Outlet_Location_Type', 'Outlet_Size']
loc_type_list = ['Tier 1', 'Tier 2', 'Tier 3']
size_list = ['Small', 'Medium', 'High']

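#Processors
# Named ord_encoder so the built-in ord() is not shadowed
ord_encoder = OrdinalEncoder(categories=[loc_type_list, size_list])
# fill_value only applies to strategy='constant', so it is omitted here
freq_imputer = SimpleImputer(strategy='most_frequent')

#Pipeline for ordinal features
ord_pipeline = make_pipeline(freq_imputer, ord_encoder)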
ord_pipeline
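
To illustrate what the explicit category order does, here is a minimal sketch that encodes three made-up rows with the same two category lists. The sample rows are hypothetical; only the category lists come from the cell above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Category orders as defined in the cell above (lowest to highest).
loc_type_list = ['Tier 1', 'Tier 2', 'Tier 3']
size_list = ['Small', 'Medium', 'High']

# Three made-up rows: [Outlet_Location_Type, Outlet_Size]
sample = np.array([['Tier 3', 'Medium'],
                   ['Tier 1', 'Small'],
                   ['Tier 2', 'High']], dtype=object)

# Each value is replaced by its position in the corresponding category list.
encoder = OrdinalEncoder(categories=[loc_type_list, size_list])
encoded = encoder.fit_transform(sample)
print(encoded)
```

Each value maps to its index in the list, so `'Tier 3'` becomes 2 and `'Medium'` becomes 1.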

In [207]:
#Defining nominal features
nominal_cols = X_train.select_dtypes('object').drop(columns=ordinal_cols).columns

#Processors
missing_imputer = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

#Pipeline for nominal features
nom_pipeline = make_pipeline(missing_imputer, ohe)
nom_pipeline

In [208]:
#Defining tuples
numeric_tuple = ('numeric', num_pipe, num_cols)
ohe_tuple = ('categorical', nom_pipeline, nominal_cols)
ord_tuple = ('ordinal', ord_pipeline, ordinal_cols)

#Making column transformer
col_transformer = ColumnTransformer([numeric_tuple,ord_tuple, ohe_tuple], verbose_feature_names_out=False)

#Fitting transformer
col_transformer.fit(X_train)

In [209]:
X_train_processed = col_transformer.transform(X_train)
X_train_processed.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Location_Type,Outlet_Size,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,...,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
4776,0.817249,-0.712775,1.828109,2.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7510,0.55634,-1.291052,0.603369,2.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
5828,-0.131512,1.813319,0.244541,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5327,-1.169219,-1.004931,-0.952591,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4810,1.528819,-0.965484,-0.33646,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [210]:
X_test_processed = col_transformer.transform(X_test)
X_test_processed.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Location_Type,Outlet_Size,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,...,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
7503,0.3310089,-0.776646,-0.998816,2.0,2.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2957,-1.179892,0.100317,-1.585194,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
7031,0.3784469,-0.482994,-1.595784,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1084,4.213344e-16,-0.41544,0.506592,2.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
856,-0.6426567,-1.047426,0.886725,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
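
Values such as 4.213344e-16 in the `Item_Weight` column are most likely rows whose weight was missing: mean imputation places them exactly at the column mean, and standardization then maps the mean to zero, up to floating-point error. The sketch below reproduces this with made-up weights and plain Python. It is an illustration of the idea, not the pipeline itself.

```python
import statistics

# Made-up weights; None marks a missing value.
weights = [9.3, None, 17.5, 19.2, 8.93]

# Step 1: mean imputation (what SimpleImputer(strategy='mean') does).
observed = [w for w in weights if w is not None]
mean_w = statistics.fmean(observed)
filled = [mean_w if w is None else w for w in weights]

# Step 2: standardization (StandardScaler uses the population standard deviation).
mu = statistics.fmean(filled)
sigma = statistics.pstdev(filled)
scaled = [(w - mu) / sigma for w in filled]

# The imputed row ends up at, or extremely close to, zero.
print(scaled[1])
```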


In [211]:
def regression_metrics(y_true, y_pred, label='', verbose=True, output_dict=False):
  #Metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  # RMSE is the square root of MSE (squared=False is deprecated in recent scikit-learn releases)
  rmse = np.sqrt(mse)
  r_squared = r2_score(y_true, y_pred)
  if verbose:
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict:
    # Returning the metrics as a dictionary
    metrics = {'Label':label, 'MAE':mae,
               'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
    return metrics

def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  #Predictions for training data
  y_train_pred = reg.predict(X_train)

  # Calling helper function to obtain regression metrics
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  #Predictions for test data
  y_test_pred = reg.predict(X_test)
  # Calling helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                  output_dict=output_frame,
                                    label='Test Data' )

  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    results_df = results_df.set_index('Label')
    results_df.index.name=None
    # Returning dataframe
    return results_df.round(3)

In [215]:
X_train_tf = col_transformer.transform(X_train)
X_test_tf = col_transformer.transform(X_test)
X_train_tf.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Location_Type,Outlet_Size,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,...,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
4776,0.817249,-0.712775,1.828109,2.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7510,0.55634,-1.291052,0.603369,2.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
5828,-0.131512,1.813319,0.244541,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5327,-1.169219,-1.004931,-0.952591,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4810,1.528819,-0.965484,-0.33646,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [219]:
lin_reg = LinearRegression()

In [222]:
lin_reg.fit(X_train_tf, y_train)


In [225]:
y_predictions_train = lin_reg.predict(X_train_tf)
y_predictions_test = lin_reg.predict(X_test_tf)

# Saving a copy of X_test_tf and adding the true and predicted price and the error
prediction_df = X_test_tf.copy()
prediction_df['True Price'] = y_test
prediction_df['Predicted Price'] = y_predictions_test.round(1)
prediction_df['Error'] = (y_predictions_test - y_test).round(1)
prediction_df.head(10)


Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Location_Type,Outlet_Size,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,True Price,Predicted Price,Error
7503,0.3310089,-0.776646,-0.998816,2.0,2.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1743.0644,1341.4,-401.7
2957,-1.179892,0.100317,-1.585194,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,356.8688,781.3,424.5
7031,0.3784469,-0.482994,-1.595784,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,377.5086,822.7,445.2
1084,4.213344e-16,-0.41544,0.506592,2.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,5778.4782,4233.8,-1544.7
856,-0.6426567,-1.047426,0.886725,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2356.932,3276.0,919.0
4304,-0.8075039,-0.470511,-1.748367,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,865.54,550.0,-315.6
2132,4.213344e-16,1.189692,1.070615,2.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4613.994,4758.7,144.7
1385,-0.5703138,-1.025995,0.000559,2.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,2410.8618,2066.6,-344.3
5239,0.2598518,-0.824923,-0.620321,2.0,1.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1948.1308,1402.7,-545.5
6516,-1.042322,-0.974654,0.801084,2.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1937.478,2817.8,880.3


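This model appears to be underfitting because the prediction errors are large. For example, row 1084 has true sales of 5,778.48 but a prediction of 4,233.8, an error of -1,544.7.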
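
To make the size of these errors concrete, the sketch below computes MAE, MSE, RMSE, and R^2 by hand for the ten rows shown in the table. Ten rows are far too few to judge the model, so this only illustrates how the metrics are defined. The full test-set numbers would come from calling `evaluate_regression` on `lin_reg`.

```python
import math

# (true, predicted) sales for the ten test rows shown in the table above.
rows = [(1743.0644, 1341.4), (356.8688, 781.3), (377.5086, 822.7),
        (5778.4782, 4233.8), (2356.932, 3276.0), (865.54, 550.0),
        (4613.994, 4758.7), (2410.8618, 2066.6), (1948.1308, 1402.7),
        (1937.478, 2817.8)]

errors = [pred - true for true, pred in rows]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of the true values.
mean_true = sum(true for true, _ in rows) / n
ss_tot = sum((true - mean_true) ** 2 for true, _ in rows)
r2 = 1 - sum(e ** 2 for e in errors) / ss_tot

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R^2={r2:.3f}")
```

RMSE is never smaller than MAE because squaring gives large errors more weight, which is why a single miss such as row 1084 pulls RMSE up noticeably.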

In [217]:
rf = RandomForestRegressor(random_state=42)
rf_pipe = make_pipeline(rf)
rf_pipe.fit(X_train_tf, y_train)

In [226]:
evaluate_regression(rf_pipe, X_train_tf, y_train, X_test_tf, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 301.168
- MSE = 187,648.542
- RMSE = 433.184
- R^2 = 0.937

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 780.360
- MSE = 1,266,770.237
- RMSE = 1,125.509
- R^2 = 0.541


In [227]:
rf_pipe.get_params()

{'memory': None,
 'steps': [('randomforestregressor', RandomForestRegressor(random_state=42))],
 'verbose': False,
 'randomforestregressor': RandomForestRegressor(random_state=42),
 'randomforestregressor__bootstrap': True,
 'randomforestregressor__ccp_alpha': 0.0,
 'randomforestregressor__criterion': 'squared_error',
 'randomforestregressor__max_depth': None,
 'randomforestregressor__max_features': 1.0,
 'randomforestregressor__max_leaf_nodes': None,
 'randomforestregressor__max_samples': None,
 'randomforestregressor__min_impurity_decrease': 0.0,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__min_samples_split': 2,
 'randomforestregressor__min_weight_fraction_leaf': 0.0,
 'randomforestregressor__n_estimators': 100,
 'randomforestregressor__n_jobs': None,
 'randomforestregressor__oob_score': False,
 'randomforestregressor__random_state': 42,
 'randomforestregressor__verbose': 0,
 'randomforestregressor__warm_start': False}

In [228]:
params = {'randomforestregressor__max_depth': [None,10,15,20],
          'randomforestregressor__n_estimators':[10,100,150,200],
          'randomforestregressor__min_samples_leaf':[2,3,4],
          'randomforestregressor__max_features':['sqrt','log2',None],
          'randomforestregressor__oob_score':[True,False],}
gridsearch = GridSearchCV(rf_pipe, params, n_jobs=-1, cv=2,verbose=1)
gridsearch.fit(X_train_tf, y_train)

Fitting 2 folds for each of 288 candidates, totalling 576 fits
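
The candidate and fit counts in this message follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fit once per fold. A small sketch of that arithmetic, using the same grid as the cell above:

```python
import math

# The same grid as in the cell above.
params = {'randomforestregressor__max_depth': [None, 10, 15, 20],
          'randomforestregressor__n_estimators': [10, 100, 150, 200],
          'randomforestregressor__min_samples_leaf': [2, 3, 4],
          'randomforestregressor__max_features': ['sqrt', 'log2', None],
          'randomforestregressor__oob_score': [True, False]}
cv_folds = 2

# Every combination of values is one candidate; each is trained once per fold.
n_candidates = math.prod(len(values) for values in params.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

Adding one more value to any list multiplies the total, so grids grow quickly as parameters are added.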


In [229]:
gridsearch.best_params_

{'randomforestregressor__max_depth': 15,
 'randomforestregressor__max_features': 'sqrt',
 'randomforestregressor__min_samples_leaf': 3,
 'randomforestregressor__n_estimators': 150,
 'randomforestregressor__oob_score': True}

In [232]:
best_rf = gridsearch.best_estimator_
evaluate_regression(best_rf, X_train_tf, y_train, X_test_tf, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 656.847
- MSE = 859,163.267
- RMSE = 926.911
- R^2 = 0.710

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 751.397
- MSE = 1,131,005.653
- RMSE = 1,063.487
- R^2 = 0.590


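This model is also underfitting: it fits the data only moderately well, with an R^2 of 0.710 on the training data and 0.590 on the test data. The gap between the two is much smaller than for the default random forest (0.937 and 0.541).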

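I believe the tuned random forest model has the better test scores: its test R^2 is 0.590 and its test MAE is 751.397, compared with 0.541 and 780.360 for the default random forest.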

In [236]:
model = DecisionTreeRegressor(random_state = 42)
model.fit(X_train_tf, y_train)
evaluate_regression(model, X_train_tf, y_train, X_test_tf, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 0.000
- MSE = 0.000
- RMSE = 0.000
- R^2 = 1.000

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 1,061.021
- MSE = 2,373,918.346
- RMSE = 1,540.753
- R^2 = 0.140


In [237]:
model.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 42,
 'splitter': 'best'}

In [241]:
param_grid = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None], 'min_samples_split': [2, 3, 4]}

In [243]:
grid_search = GridSearchCV(model, param_grid, n_jobs = -1, verbose= 1)
grid_search.fit(X_train_tf, y_train)

Fitting 5 folds for each of 33 candidates, totalling 165 fits


In [244]:
grid_search.best_params_

{'max_depth': 5, 'min_samples_split': 2}

In [246]:
best_model = grid_search.best_estimator_
evaluate_regression(best_model, X_train_tf, y_train, X_test_tf, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 762.610
- MSE = 1,172,122.773
- RMSE = 1,082.646
- R^2 = 0.604

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 738.317
- MSE = 1,118,185.973
- RMSE = 1,057.443
- R^2 = 0.595


I recommend the tuned decision tree model because it gave the best results on the test data (R^2 = 0.595, slightly ahead of the tuned random forest at 0.590).

The R^2 of 0.595 means the model explains about 59.5% of the variance in Item_Outlet_Sales on the test data.

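The training and test MAE for this model (762.610 and 738.317) are closer to each other than they were for the previous models. This tells us the model performs about as well on unseen data as on the data it was trained on, and its test MAE is the lowest of the models evaluated.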

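This model does not appear to be overfitting: its training and test scores are nearly the same (R^2 of 0.604 and 0.595), which suggests it has learned a general pattern in the data rather than memorizing the training set.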
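
As a cross-check on the comparison, the sketch below collects the metrics reported in the evaluation outputs of this notebook and ranks the models by test R^2. The dictionary layout and names are illustrative; the numbers are copied from the outputs above. The linear regression is omitted because no metrics were printed for it.

```python
# Metrics copied from the evaluation outputs in this notebook.
test_metrics = {
    'Random forest (default)': {'MAE': 780.360, 'RMSE': 1125.509, 'R2': 0.541, 'train_R2': 0.937},
    'Random forest (tuned)':   {'MAE': 751.397, 'RMSE': 1063.487, 'R2': 0.590, 'train_R2': 0.710},
    'Decision tree (default)': {'MAE': 1061.021, 'RMSE': 1540.753, 'R2': 0.140, 'train_R2': 1.000},
    'Decision tree (tuned)':   {'MAE': 738.317, 'RMSE': 1057.443, 'R2': 0.595, 'train_R2': 0.604},
}

for name, m in test_metrics.items():
    gap = m['train_R2'] - m['R2']   # a large train/test gap suggests overfitting
    print(f"{name:<26} test R^2={m['R2']:.3f}  test MAE={m['MAE']:,.3f}  R^2 gap={gap:.3f}")

# Rank by the two headline test metrics.
best_r2 = max(test_metrics, key=lambda name: test_metrics[name]['R2'])
best_mae = min(test_metrics, key=lambda name: test_metrics[name]['MAE'])
print('Highest test R^2:', best_r2)
print('Lowest test MAE:', best_mae)
```

The default models show the largest gaps between training and test R^2, while both tuned models trade some training accuracy for more consistent test performance.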