<a href="https://colab.research.google.com/github/gitAlhajji/sales-prediction/blob/main/Preprocessing_for_machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#Mounting storage to allow for read and write operations 
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#Importing the Libraries to use for data manipulation, analysis and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn import set_config
set_config(display='diagram')

In [19]:
#Loading te dataset from the provided data
dataset = '/content/sales_predictions.csv'

#Creating a pandas dataframe from the dataset
df = pd.read_csv(dataset)
#Previewing the dataframe
df.head(3)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27


##Data Exploration
After loading  the data, it is important that to explore tha data to further understand it and get some rough insights before staring to manipulate it for machine learning.
This exploration involves checking the number of rows and columns in the dataframe, the datatypes of the data in the columns & the statisticla summaries of the data

In [4]:
#checking the number of rows and columns in the dataframe
df.shape

(8523, 12)

In [5]:
#checking the columns in the dataframe and the datatypes of their respective contents
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [6]:
#checking the statistical summaries of the data
df.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


##Data cleaning
Once the data is better understood, next is to check for duplicates and dealing with them, the categorical columns for inconsistencies. Once the inconsistencies are identified, the next step is to align them to the rest of the data in the same column.

In [8]:
#checking for duplicated records
df.duplicated().sum()

0

In [9]:
#Checking for the missing values
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

The output above shows that the 'Item_Weight' and 'Outlet_Size' have missing values. At this point, we leave these unresolved for later imputation using sklearn methods of puting missing values.
Proceeding on to check for inconsistencies in the categorical data

In [20]:
#Checking for inconsistencies in the Item_Fat_Content column
df['Item_Fat_Content'].value_counts()


Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

From the output, the data in the 'Item_Fat_Content' column are ordinal features - the data follows a certain order.
At this point, the data can be ordinal encoded to fix the inconsistencies and also to turn it into the numeric equivalents using a dictionary.

In [21]:
item_fat_dictionary = {'Low Fat':0,'Regular':1,'LF':0,'reg':1, 'low fat':0}
df['Item_Fat_Content'].replace(item_fat_dictionary,inplace = True)

In [22]:
#Checking for inconsistencies in the Item_Type column
df['Item_Type'].value_counts()

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

The output above shows that the data in the Item_Type column doesnot follw a specific order. This makes it norminal. This data will be hot-encoded later on during the proprocessing of the data.

In [23]:
 #Checking for inconsistencies in the Outlet_Identifier column
df['Outlet_Identifier'].value_counts()

OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: Outlet_Identifier, dtype: int64

The 'Outlet_Identifier' data also doesnt follow a given order. This will also be hot encoded during preprocessing

In [26]:
 #Checking for inconsistencies in the Outlet_Size column
df['Outlet_Size'].value_counts()

1.0    2793
0.0    2388
2.0     932
Name: Outlet_Size, dtype: int64

In [25]:
#Ordinal Encoding the data in the Outlet_Size column
outlet_size_dictionary = {'Small':0,'Medium':1,'High':2}
df['Outlet_Size'].replace(outlet_size_dictionary,inplace = True)

In [27]:
#Checking for inconsistencies in the Outlet_Location_Type column
df['Outlet_Location_Type'].value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

In [28]:
#Ordinal Encoding the data in the Outlet_Location_Type column
outlet_loc_dictionary = {'Tier 1':0,'Tier 2':1,'Tier 3':2}
df['Outlet_Location_Type'].replace(outlet_size_dictionary,inplace = True)

In [29]:
#Checking for inconsistencies in the Outlet_Type column
df['Outlet_Type'].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [30]:
#Ordinal Encoding the data in the Outlet_Type column
outlet_loc_dictionary = {'Grocery Store':0,'Supermarket Type1':1,'Supermarket Type2':2, 'Supermarket Type3':3}
df['Outlet_Type'].replace(outlet_size_dictionary,inplace = True)

Once all inconsistencies are dealt with, next is to define our prediction target as ascalar,y and the matrix, X of features that will determine its prediction.
These are used splitting the data into training and testing sets that will be used in model validation.

In [32]:
#defing matrix X and scalar y
X = df.drop(columns = 'Item_Outlet_Sales')
y = df['Item_Outlet_Sales']

#Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

##Data preprocessing
The next step is to prepare the data for machine learning through preprocessing. 
Operations at this level include;
1. Imputation for missing values
2. Norminal encoding of norminal features using OneHotEncoder
3. Scaling of nuerical data

To achieve this, pipelines are used inconjuction with column transformers

In [43]:
#Instantiating the column selectors
number_selector = make_column_selector(dtype_include = 'number')
category_selector = make_column_selector(dtype_include = 'object')

In [45]:
#instantiating the imputers
mean_imputer = SimpleImputer(strategy = 'mean')
constant_imputer  =SimpleImputer(strategy = 'constant',fill_value='missing')

In [46]:
#instantiating the OneHotEncoder
ohot_encoder = OneHotEncoder(handle_unknown = 'ignore',sparse = False)

In [47]:
#instantiating the Standard scaler
scaler = StandardScaler()

In [48]:
#instantiate  pipeline for numeric feature processing
number_pipeline = make_pipeline(mean_imputer,scaler)
number_pipeline

In [49]:
#instantiate a pipeline for categorical data
category_pipeline = make_pipeline(constant_imputer,ohot_encoder)
category_pipeline

In [50]:
#creating turples for the numeric and categoric data
number_turple = (number_pipeline,number_selector)
category_turple = (category_pipeline,category_selector)

In [51]:
#instantiating the column transformer
optimus = make_column_transformer(number_turple,category_turple)
optimus

Once the Pipelines for both the numerical and categorical data have been created and passed into a column transformer, next is to fit it the transformer onto the training data. If the transformer is fit on the testing set, this will result into data leakage which inturn will bias the prediction.

In [52]:
#fitting the transformer on the TRAINING data
optimus.fit(X_train)

In [53]:
#Transforming the train and test sets
X_train_processed = optimus.transform(X_train)
X_test_processed = optimus.transform(X_test)

In [54]:
# Check for missing values and that data is scaled and one-hot encoded
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6392, 1589)




array([[ 0.81724868, -0.7403206 , -0.71277507, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5563395 ,  1.35076614, -1.29105225, ...,  0.        ,
         1.        ,  0.        ],
       [-0.13151196,  1.35076614,  1.81331864, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.11373638, -0.7403206 , -0.92052713, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.76600931, -0.7403206 , -0.2277552 , ...,  1.        ,
         0.        ,  0.        ],
       [ 0.81724868, -0.7403206 , -0.95867683, ...,  1.        ,
         0.        ,  0.        ]])