
# Project 1 - Part 5 (Core)
  - Author: Alaa Zagha
## Preparing Sales prediction dataset for Machine Learning
  We will continue to work on your sales prediction project. The goal of this step is to help the retailer by using machine learning to make predictions about future sales based on the data provided.

For Part 5, you will go back to your original, uncleaned, sales prediction dataset with the goal of preventing data leakage.

You should load a fresh version of the original data set here using pd.read_csv() and start your cleaning process over to ensure there is no data leakage!

 - Before splitting your data, you can drop duplicates and fix inconsistencies in categorical data. (There is a way to do this after the split, but for this project you may perform this step before the split.)
 - Identify the features (X) and target (y): Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.
 - Hint: We recommend you drop the "Item_Identifier" feature because it has very high cardinality.
 - Perform a train test split
 - Create a preprocessing object to prepare the dataset for Machine Learning
 - Make sure your imputation of missing values occurs after the train test split using SimpleImputer.
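
The reason imputation has to follow the split can be sketched on a toy column. The values and the names `toy`, `toy_train`, and `toy_test` below are hypothetical and are not part of the project data; the point is only that the fill value is learned from the training rows and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature with missing values (made-up numbers, not the sales data)
toy = pd.DataFrame({"weight": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0]})

# Split first, so the test rows stay unseen during fitting
toy_train, toy_test = train_test_split(toy, random_state=42)

# Fit on the training rows only; the test rows never influence the fill value
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

train_mean = toy_train["weight"].mean()  # pandas skips NaN by default
learned_fill = imputer.statistics_[0]

# The same training mean is used to fill gaps in the test rows
toy_test_filled = imputer.transform(toy_test)
print(f"fill value learned from train: {learned_fill:.3f}")
```

Fitting the imputer on the full dataset instead would let test-set values shift the mean, which is the leakage this part of the project is meant to avoid.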

## Importing Libraries

In [32]:
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
set_config(transform_output ='pandas')

## Loading Data

In [33]:
f_path = "/content/drive/MyDrive/CodingDojo/01-Fundamentals/sales_predictions_2023.csv"
df = pd.read_csv(f_path)
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


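The dataset has 8,523 rows and 12 columns: 5 are numeric and 7 are categorical (object dtype). Item_Weight (7,060 non-null) and Outlet_Size (6,113 non-null) contain missing values.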

## Duplicates Inspection

In [35]:
df_dup = df.duplicated().sum()
df_dup

0

The dataset contains no duplicate rows.

## Inconsistency Inspection

In [36]:
cat_cols = df.select_dtypes('object').columns
num_cols = df.select_dtypes('number').columns

In [37]:
for col in cat_cols:
  print(f"Value Counts for {col}")
  print(df[col].value_counts())
  # Increasing readability by adding an empty line
  print('\n')

Value Counts for Item_Identifier
Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: count, Length: 1559, dtype: int64


Value Counts for Item_Fat_Content
Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: count, dtype: int64


Value Counts for Item_Type
Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: count, dtype: int64


Value Counts for Outlet_Identifier
Outlet_Identifier
OUT027    935
OUT013

Item_Fat_Content is labelled inconsistently: 'LF' and 'low fat' are variants of 'Low Fat', and 'reg' is a variant of 'Regular'. Only 'Low Fat' and 'Regular' should remain.

In [38]:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF':"Low Fat",'low fat':'Low Fat'})
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'reg':"Regular"})
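
As a quick check of this cleanup, the sketch below applies the same replacements to a small stand-in Series holding the five spellings from the value counts. The names `fat`, `fat_map`, and `fat_clean` are illustrative, and folding both replacements into one mapping is a stylistic alternative rather than a required change.

```python
import pandas as pd

# Stand-in Series with the five spellings seen in the value counts
fat = pd.Series(["Low Fat", "Regular", "LF", "reg", "low fat"])

# One mapping covers every variant; values not listed pass through unchanged
fat_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(fat_map)

# Only the two intended labels should be left
remaining = sorted(fat_clean.unique())
print(remaining)
```

Running `df['Item_Fat_Content'].value_counts()` again on the real data after the replacement gives the same kind of confirmation.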

## Train/Test Split Data

In [39]:
# dropping Item_Identifier because of its high cardinality (1,559 unique values)
df = df.drop(columns= ['Item_Identifier'])
y = df['Item_Outlet_Sales']
X = df.drop(columns = ['Item_Outlet_Sales'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
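
Because no `test_size` is passed, `train_test_split` falls back to its default 25% test share. The sketch below reproduces the resulting row counts on a placeholder array with the same length as the dataset; `rows` is a hypothetical stand-in, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array with the same number of rows as the dataset (8523)
rows = np.arange(8523)

# No test_size given, so scikit-learn uses its default 25% test share
rows_train, rows_test = train_test_split(rows, random_state=42)

print(len(rows_train), len(rows_test))  # 6392 2131
```

These counts match the `count` rows reported for the processed training and test sets at the end of the notebook.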

## Numeric Pipeline

In [40]:
X_num_cols = num_cols.drop(['Item_Outlet_Sales'])

In [41]:
impute_num = SimpleImputer(strategy = 'mean')
num_scaler = StandardScaler()
num_pipe = make_pipeline(impute_num, num_scaler)

In [42]:
num_tuple = ('Numeric', num_pipe, X_num_cols)
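
To make the behaviour of this numeric pipeline concrete, here is a minimal sketch on made-up values. The frames `toy_train` and `toy_test` and the pipeline name `toy_num_pipe` are assumptions for illustration; the steps mirror the mean imputation and standard scaling used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up weights; the training mean of the observed values is 10.0
toy_train = pd.DataFrame({"Item_Weight": [5.0, np.nan, 15.0, 10.0]})
toy_test = pd.DataFrame({"Item_Weight": [np.nan, 20.0]})

toy_num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# fit_transform learns the fill value, mean, and std from the training rows
train_scaled = np.asarray(toy_num_pipe.fit_transform(toy_train))

# transform reuses those training statistics on the test rows
test_scaled = np.asarray(toy_num_pipe.transform(toy_test))

print(train_scaled.ravel().round(3))
print(test_scaled.ravel().round(3))
```

A missing test value is filled with the training mean and therefore scales to 0, while the test value 20.0 is expressed in training-set standard deviations.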

## Ordinal Pipeline

In [54]:
ord_cols = ['Outlet_Size','Outlet_Location_Type']
size_order = ['Small','Medium','High']
loc_order = ['Tier 1','Tier 2','Tier 3']
ordinal_category_orders = [size_order, loc_order]
impute_ord = SimpleImputer(strategy = 'most_frequent')
ord_encoder = OrdinalEncoder(categories = ordinal_category_orders)
ord_scaler = StandardScaler()
ord_pipe = make_pipeline(impute_ord, ord_encoder, ord_scaler)

In [55]:
ord_tuple = ('Ordinal', ord_pipe, ord_cols)
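
The effect of passing explicit category orders can be seen on a three-row toy frame. The frame `toy_ord` and the variable names are hypothetical; the category lists are the ones defined above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Three illustrative rows using the categories from the dataset
toy_ord = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", "Small"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 2"],
})

# Same orders as in the pipeline: position in each list becomes the code
size_order = ["Small", "Medium", "High"]
loc_order = ["Tier 1", "Tier 2", "Tier 3"]
toy_encoder = OrdinalEncoder(categories=[size_order, loc_order])

codes = np.asarray(toy_encoder.fit_transform(toy_ord))
print(codes)
```

Without the `categories` argument the encoder would sort labels alphabetically, which would place 'High' before 'Medium' and 'Small' and lose the intended ranking.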

## Categorical Pipeline

In [59]:
X_cat_cols = cat_cols.drop(['Outlet_Size','Outlet_Location_Type','Item_Identifier'])

In [60]:
impute_na = SimpleImputer(strategy = 'most_frequent')
ohe_encoder = OneHotEncoder(sparse_output = False, handle_unknown= 'ignore')
ohe_pipe = make_pipeline(impute_na, ohe_encoder)

In [61]:
ohe_tuple = ('Categorical', ohe_pipe, X_cat_cols)

## Column Transformer

In [62]:
col_trans = ColumnTransformer([num_tuple, ord_tuple, ohe_tuple], verbose_feature_names_out = False)

In [63]:
col_trans.fit(X_train)

In [64]:
X_train_processed = col_trans.transform(X_train)
X_test_processed = col_trans.transform(X_test)
display(X_train_processed.describe().round(2), X_test_processed.describe().round(2))

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Item_Fat_Content_Low Fat | Item_Fat_Content_Regular | Item_Type_Baking Goods | Item_Type_Breads | ... | Outlet_Identifier_OUT019 | Outlet_Identifier_OUT027 | Outlet_Identifier_OUT035 | Outlet_Identifier_OUT045 | Outlet_Identifier_OUT046 | Outlet_Identifier_OUT049 | Outlet_Type_Grocery Store | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | ... | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 | 6392.0 |
| mean | 0.0 | -0.0 | 0.0 | -0.0 | -0.0 | 0.0 | 0.65 | 0.35 | 0.07 | 0.03 | ... | 0.06 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.12 | 0.65 | 0.11 | 0.11 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.48 | 0.48 | 0.26 | 0.16 | ... | 0.24 | 0.32 | 0.31 | 0.31 | 0.31 | 0.31 | 0.33 | 0.48 | 0.31 | 0.32 |
| min | -1.98 | -1.29 | -1.77 | -1.53 | -1.38 | -1.38 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.81 | -0.76 | -0.76 | -1.29 | -1.38 | -1.38 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | -0.23 | 0.03 | 0.14 | 0.29 | -0.15 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 75% | 0.76 | 0.56 | 0.72 | 0.73 | 0.29 | 1.08 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 2.0 | 5.13 | 1.99 | 1.33 | 1.96 | 1.08 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Item_Fat_Content_Low Fat | Item_Fat_Content_Regular | Item_Type_Baking Goods | Item_Type_Breads | ... | Outlet_Identifier_OUT019 | Outlet_Identifier_OUT027 | Outlet_Identifier_OUT035 | Outlet_Identifier_OUT045 | Outlet_Identifier_OUT046 | Outlet_Identifier_OUT049 | Outlet_Type_Grocery Store | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | ... | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 | 2131.0 |
| mean | -0.04 | 0.01 | -0.06 | -0.01 | 0.01 | -0.04 | 0.65 | 0.35 | 0.08 | 0.04 | ... | 0.07 | 0.1 | 0.1 | 0.11 | 0.11 | 0.12 | 0.13 | 0.66 | 0.11 | 0.1 |
| std | 1.01 | 1.04 | 0.98 | 0.99 | 1.01 | 1.01 | 0.48 | 0.48 | 0.27 | 0.19 | ... | 0.25 | 0.3 | 0.3 | 0.31 | 0.31 | 0.32 | 0.34 | 0.47 | 0.31 | 0.3 |
| min | -1.97 | -1.29 | -1.75 | -1.53 | -1.38 | -1.38 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.89 | -0.76 | -0.78 | -1.29 | -1.38 | -1.38 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | -0.24 | -0.15 | 0.14 | 0.29 | -0.15 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 75% | 0.73 | 0.56 | 0.64 | 0.73 | 0.29 | 1.08 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 2.0 | 4.79 | 1.99 | 1.33 | 1.96 | 1.08 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
