<a href="https://colab.research.google.com/github/amnamalik1993/Sales-Prediction/blob/main/Project_1_Part_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 1 Part 5**

# **Preprocessing for Machine Learning**

For Part 5, we return to the original dataset and prepare it in a way that prevents data leakage.

* Before splitting the data, drop duplicates and fix inconsistencies in categorical data.
* Identify the features (X) and target (y): 'Item_Outlet_Sales' is the target, and the remaining relevant variables form the features matrix.
* Perform a train test split.
* Create a preprocessing pipeline to prepare the dataset for machine learning.
* Impute missing values with SimpleImputer only after the train test split, so the imputation statistics are learned from the training data alone.
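
The last point can be illustrated with a minimal sketch on made-up numbers. The arrays below are hypothetical and not taken from the sales data. The imputer learns its fill value from the training rows only and then reuses that value on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with missing values (not the sales data)
X_train_toy = np.array([[1.0], [3.0], [np.nan]])
X_test_toy = np.array([[np.nan], [100.0]])

# Fit on the training rows only, so the test rows cannot influence the mean
imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train_toy)

# The learned statistic is the mean of the non-missing training values
train_mean = imputer.statistics_[0]

# The missing test value is filled with the training mean, not the test mean
X_test_filled = imputer.transform(X_test_toy)

print('Mean learned from train:', train_mean)
print('Imputed test column:', X_test_filled.ravel())
```

Had the imputer been fitted on the full column, the value 100.0 from the test rows would have shifted the fill value, which is the kind of leakage the split is meant to prevent.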





**Importing**

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

**Reloading Data**

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
filename = '/content/sales_predictions.csv'
df = pd.read_csv(filename)
df

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,,Tier 2,Supermarket Type1,549.2850
8520,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976


In [6]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


**Performing Preprocessing Steps**

In [8]:
# Checking for duplicates
df.duplicated().sum()

0
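
No duplicates were found here, so nothing needs to be dropped. For reference, a small sketch of what the drop would look like if the count were nonzero, using a hypothetical three-row frame rather than the sales data:

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
toy = pd.DataFrame({'Item_Identifier': ['FDA15', 'DRC01', 'FDA15'],
                    'Item_MRP': [249.81, 48.27, 249.81]})

# Count fully identical rows (the first occurrence is not counted)
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, 'duplicate row(s) found;', len(deduped), 'rows kept')
```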

In [10]:
# Inspecting and addressing inconsistencies in categorical data
data_types = df.dtypes
str_cols = data_types[data_types=='object'].index
str_cols

Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [20]:
df['Outlet_Type'].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [11]:
for col in str_cols:
    print(f'- {col}:')
    print(df[col].value_counts(dropna=False))
    print("\n\n")

- Item_Identifier:
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64



- Item_Fat_Content:
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64



- Item_Type:
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64



- Outlet_Identifier:
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    9

After inspecting the value counts:

*   The column Item_Fat_Content spells the same categories in several ways. This needs to be addressed:
    *   reg should be Regular
    *   low fat and LF should be Low Fat
*   The column Outlet_Type lists three supermarket types. These are distinct labels rather than misspellings, but this notebook treats them as one category and merges them:
    *   Supermarket Type1, Supermarket Type2 and Supermarket Type3 all become Supermarket Type

In [12]:
Item_Fat_Content_Map = {'LF':'Low Fat',
                   'low fat':'Low Fat',
                   'reg':'Regular'}

df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(Item_Fat_Content_Map)

In [13]:
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

In [21]:
Outlet_Type_Map = {'Supermarket Type1':'Supermarket Type',
                   'Supermarket Type2':'Supermarket Type',
                   'Supermarket Type3':'Supermarket Type'}

df['Outlet_Type'] = df['Outlet_Type'].replace(Outlet_Type_Map)

In [22]:
df['Outlet_Type'].value_counts()

Supermarket Type    7440
Grocery Store       1083
Name: Outlet_Type, dtype: int64

**Ordinal Encoding**

In [24]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [25]:
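# Ordinal encoding 'Outlet_Size' (Small < Medium < High); missing values stay NaN
replacement_dictionary = {'High':2, 'Medium':1, 'Small':0}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Outlet_Size'] = df['Outlet_Size'].replace(replacement_dictionary)
df['Outlet_Size']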

0       1.0
1       1.0
2       1.0
3       NaN
4       2.0
       ... 
8518    2.0
8519    NaN
8520    0.0
8521    1.0
8522    0.0
Name: Outlet_Size, Length: 8523, dtype: float64
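
As a hedged alternative sketch, the same ordering can be applied with `Series.map`. The series below is a hypothetical stand-in for the real column. One difference worth knowing: `map` turns any label missing from the dictionary into NaN, whereas `replace` leaves such labels unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Outlet_Size' column
sizes = pd.Series(['Medium', 'High', np.nan, 'Small'], name='Outlet_Size')

# Same ordering as in the notebook: Small < Medium < High
size_order = {'Small': 0, 'Medium': 1, 'High': 2}

# Missing values stay NaN, so they can still be imputed after the split
encoded = sizes.map(size_order)

print(encoded.tolist())
```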

**Identify the Features (X) and Target (y)**

The 'Item_Outlet_Sales' column is the target (y); the remaining columns form the features matrix (X).

In [27]:
# Defining X
X = df.drop('Item_Outlet_Sales', axis=1)

# Defining y
y = df['Item_Outlet_Sales']

**Validation Split**

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
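
When no `test_size` is given, `train_test_split` holds out 25% of the rows. A quick sketch with placeholder arrays of the same length as the sales data (8523 rows) shows the resulting sizes; the array contents are made up and only the row count matters here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8523  # row count reported for the sales data above

# Placeholder features and target; only their length matters for this check
X_toy = np.arange(n_rows).reshape(-1, 1)
y_toy = np.arange(n_rows)

# Same call as in the notebook: default test_size of 0.25, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print('train rows:', X_tr.shape[0])
print('test rows:', X_te.shape[0])
```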

**Instantiate Column Selectors**

In [30]:
# Selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

cat_selector(X_train)

['Item_Identifier',
 'Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Location_Type',
 'Outlet_Type']

In [42]:
num_selector(X_train)

['Item_Weight',
 'Item_Visibility',
 'Item_MRP',
 'Outlet_Establishment_Year',
 'Outlet_Size']
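
Because 'Outlet_Size' was ordinal encoded earlier, it is now picked up by the numeric selector rather than the categorical one. The sketch below reproduces that behaviour on a hypothetical two-row frame; the column values are illustrative only.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame: one text column, one numeric column,
# and an already ordinal-encoded size column with a missing value
toy = pd.DataFrame({
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype=object),
    'Item_MRP': [249.8, 141.6],
    'Outlet_Size': [1.0, None],
})

# A selector is a callable that returns the matching column names
cat_cols = make_column_selector(dtype_include='object')(toy)
num_cols = make_column_selector(dtype_include='number')(toy)

print('categorical:', cat_cols)
print('numeric:', num_cols)
```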

**Instantiate Transformers**

In [31]:
# Imputers
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')

# Scaler
scaler = StandardScaler()

# One-hot encoder returning a dense array
# (in scikit-learn 1.2 and later this argument is named sparse_output)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
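
The `handle_unknown='ignore'` setting matters once the encoder meets a label in the test set that it never saw during fitting. A small sketch with made-up labels shows the effect: the unseen label is encoded as a row of zeros instead of raising an error. The dense conversion uses `.toarray()` here so the example does not depend on the name of the sparse argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels: the test set contains one label unseen in training
train_labels = np.array([['Low Fat'], ['Regular']], dtype=object)
test_labels = np.array([['Regular'], ['Unknown Label']], dtype=object)

# Learn the categories from the training labels only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_labels)

# The default output is sparse; convert to a dense array for inspection
encoded_test = encoder.transform(test_labels).toarray()

print(encoder.categories_[0])
print(encoded_test)
```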

**Instantiate Pipelines**

In [32]:
# Numeric pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [33]:
# Categorical pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

**Instantiate ColumnTransformer**

In [35]:
# Tuples for Column Transformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)

# ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple, remainder='passthrough')
preprocessor

**Transform Data**

In [37]:
# fit on train
preprocessor.fit(X_train)



In [39]:
# transform train and test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

X_train_processed

array([[ 0.81724868, -0.71277507,  1.82810922, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  1.        ,
         0.        ,  1.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  0.        ,
         0.        ,  1.        ]])

**Inspect the Result**

In [40]:
# Check for missing values and that data is scaled and one-hot encoded
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6392, 1588)




array([[ 0.81724868, -0.71277507,  1.82810922, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  1.        ,
         0.        ,  1.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  0.        ,
         0.        ,  1.        ]])
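
The processed array is wide because one-hot encoding creates one column per distinct label. 'Item_Identifier' alone has 1559 distinct values in the full data, so it likely accounts for most of the columns. A rough way to check where the width comes from is to count the labels per categorical column; the frame below is a hypothetical stand-in for `X_train`.

```python
import pandas as pd

# Hypothetical stand-in for X_train with two categorical and one numeric column
toy_train = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15', 'FDX07'],
    'Item_Fat_Content': ['Low Fat', 'Regular', 'Low Fat', 'Regular'],
    'Item_MRP': [249.8, 48.3, 141.6, 182.1],
})

cat_cols = ['Item_Identifier', 'Item_Fat_Content']
num_cols = ['Item_MRP']

# Each distinct label becomes one output column after one-hot encoding
labels_per_column = toy_train[cat_cols].nunique()

# Expected width: one column per label plus one per numeric feature
expected_width = int(labels_per_column.sum()) + len(num_cols)

print(labels_per_column)
print('expected number of processed columns:', expected_width)
```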

In [41]:
type(preprocessor)

sklearn.compose._column_transformer.ColumnTransformer