# Predictive Modelling on Big Mart Sales Data 

#### This notebook contains a full machine learning project, covering the pipeline of operations outlined below:

1. data preparation
2. choosing a model
3. training a model
4. evaluating the model
5. parameter tuning
6. making predictions



A machine learning pipeline codifies and automates the workflow needed to produce a machine learning model. It consists of sequential steps that cover everything from data extraction and pre-processing to model training and evaluation.
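As a rough illustration of the idea, the sketch below chains pre-processing and a model into a single scikit-learn `Pipeline`. The toy values are rounded from the first rows of the training data shown later in this notebook; the choice of columns, the mean imputation and the linear regression model are assumptions made for the example, not the model selected in this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Big Mart training data (illustrative values only).
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, np.nan, 19.2, 8.93],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2",
                    "Supermarket Type1", "Grocery Store", "Supermarket Type1"],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7],
})

numeric = ["Item_Weight", "Item_MRP"]
categorical = ["Outlet_Type"]

# Step 1: data preparation (fill missing numbers, one-hot encode categories).
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Steps 2-3: choose a model and train it as the last stage of the pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

X = toy[numeric + categorical]
y = toy["Item_Outlet_Sales"]
model.fit(X, y)

# Step 6: make predictions (here on the training rows, purely as a demo).
predictions = model.predict(X)
print(predictions)
```

Keeping the pre-processing inside the pipeline means the same transformations are applied automatically whenever the model is fitted or used for prediction.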

The packages used in this notebook are pandas, NumPy, Matplotlib and scikit-learn.

The *'Big Mart Predictive Sales Report'* accompanies this code and demonstrates understanding of **ethical design**.

This research article was used to support initial understanding of machine learning models:
[An overview of the supervised machine learning methods](https://www.researchgate.net/profile/Vladimir-Nasteski/publication/328146111_An_overview_of_the_supervised_machine_learning_methods/links/5c1025194585157ac1bba147/An-overview-of-the-supervised-machine-learning-methods.pdf)

In [1]:
# Import necessary libraries 

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline 

### Importing Data from CSV

In [2]:
# test_df = test data frame

test_df = pd.read_csv(r'C:\Users\karol\Documents\Keele University\Year 2\Semester 2\Software Development\Data\Test.csv')

# train_df = train data frame

train_df = pd.read_csv(r'C:\Users\karol\Documents\Keele University\Year 2\Semester 2\Software Development\Data\Train.csv')

In [3]:
# checking they have imported correctly
# (a notebook cell only renders its last expression, so only train_df.head() is shown below)

test_df.head()

train_df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [4]:
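# merging data frames (rows of test_df first, then train_df)

test_train = [test_df, train_df]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating the row labels of each source frame
df = pd.concat(test_train, ignore_index=True)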
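Once the two frames are stacked, nothing records which rows came from which file. A minimal sketch of one way to keep that information is shown below; the `source` column is a hypothetical helper introduced for this example, and the toy frames hold only a couple of values copied from the printed data.

```python
import pandas as pd

# Toy stand-ins for test_df and train_df (only the train data has the target).
toy_test = pd.DataFrame({
    "Item_Identifier": ["FDW58", "FDW14"],
    "Item_MRP": [107.8622, 87.3198],
})
toy_train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_MRP": [249.8092, 48.2692],
    "Item_Outlet_Sales": [3735.138, 443.4228],
})

# Tag each frame with its origin before concatenating.
combined = pd.concat(
    [toy_test.assign(source="test"), toy_train.assign(source="train")],
    ignore_index=True,
)

# The target column is NaN for the test rows because test data has no sales.
print(combined)

# Split the combined frame back into its parts after shared cleaning.
train_part = combined[combined["source"] == "train"].drop(columns="source")
test_part = combined[combined["source"] == "test"].drop(
    columns=["source", "Item_Outlet_Sales"]
)
```

This allows cleaning steps to be applied once to the combined frame and the train/test split to be recovered afterwards.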

### Data Cleaning

In [5]:
# Further checks on data imported 

first_5test = test_df.head()
first_5train = train_df.head()

test_df.info()
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5681 entries, 0 to 5680
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            5681 non-null   object 
 1   Item_Weight                4705 non-null   float64
 2   Item_Fat_Content           5681 non-null   object 
 3   Item_Visibility            5681 non-null   float64
 4   Item_Type                  5681 non-null   object 
 5   Item_MRP                   5681 non-null   float64
 6   Outlet_Identifier          5681 non-null   object 
 7   Outlet_Establishment_Year  5681 non-null   int64  
 8   Outlet_Size                4075 non-null   object 
 9   Outlet_Location_Type       5681 non-null   object 
 10  Outlet_Type                5681 non-null   object 
dtypes: float64(3), int64(1), object(7)
memory usage: 488.3+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 colu

In [6]:
# Print all data 

print(test_df)
print(train_df)

     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0              FDW58       20.750          Low Fat         0.007565   
1              FDW14        8.300              reg         0.038428   
2              NCN55       14.600          Low Fat         0.099575   
3              FDQ58        7.315          Low Fat         0.015388   
4              FDY38          NaN          Regular         0.118599   
...              ...          ...              ...              ...   
5676           FDB58       10.500          Regular         0.013496   
5677           FDD47        7.600          Regular         0.142991   
5678           NCO17       10.000          Low Fat         0.073529   
5679           FDJ26       15.300          Regular         0.000000   
5680           FDU37        9.500          Regular         0.104720   

               Item_Type  Item_MRP Outlet_Identifier  \
0            Snack Foods  107.8622            OUT049   
1                  Dairy   87.3198 

## Data Cleaning Plan

- Keep all columns; no column drops with df.drop() are needed
- Standardise/clean the content within the columns
- Check value counts with value_counts() to see what needs to be adjusted
- Replace all missing data points with NaN

Examples of changes:
- Item_Weight: empty points to NaN
- Item_Fat_Content: LF, Low Fat, low fat to Low; reg to Regular
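The inspect-then-standardise part of this plan can be sketched on a small toy column. The frame below is made up for illustration, and the `fat_map` dictionary is just one way to express the label mapping described above.

```python
import pandas as pd

# Toy column containing the label variants seen in Item_Fat_Content.
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "reg", "LF", "Regular", "low fat"],
})

# 1. Inspect the distinct labels to see what needs adjusting.
print(toy["Item_Fat_Content"].value_counts())

# 2. Map every variant onto one of two standard labels.
fat_map = {"Low Fat": "Low", "low fat": "Low", "LF": "Low", "reg": "Regular"}
toy["Item_Fat_Content"] = toy["Item_Fat_Content"].replace(fat_map)

# 3. Re-check: only the standard labels should remain.
standardised_counts = toy["Item_Fat_Content"].value_counts()
print(standardised_counts)
```

A dictionary keeps each old label next to its replacement, which makes the mapping easier to review than two parallel lists.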

In [7]:
# Cleaning column names 

def clean_col(col):
    """Return a tidied column name: shorter identifiers, underscores, title case."""
    col = col.strip()
    col = col.replace("Item_Identifier", "Item_Id")
    col = col.replace("Outlet_Identifier", "Outlet_Id")
    col = col.replace(" ", "_")
    # title() also lower-cases inner capitals, so "Item_MRP" becomes "Item_Mrp"
    col = col.title()
    return col

# Apply the cleaner to every column of the training data
train_df.columns = [clean_col(c) for c in train_df.columns]

print(train_df)

     Item_Id  Item_Weight Item_Fat_Content  Item_Visibility  \
0      FDA15        9.300          Low Fat         0.016047   
1      DRC01        5.920          Regular         0.019278   
2      FDN15       17.500          Low Fat         0.016760   
3      FDX07       19.200          Regular         0.000000   
4      NCD19        8.930          Low Fat         0.000000   
...      ...          ...              ...              ...   
8518   FDF22        6.865          Low Fat         0.056783   
8519   FDS36        8.380          Regular         0.046982   
8520   NCJ29       10.600          Low Fat         0.035186   
8521   FDN46        7.210          Regular         0.145221   
8522   DRG01       14.800          Low Fat         0.044878   

                  Item_Type  Item_Mrp Outlet_Id  Outlet_Establishment_Year  \
0                     Dairy  249.8092    OUT049                       1999   
1               Soft Drinks   48.2692    OUT018                       2009   
2        
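The cell above renames the columns of train_df only, so test_df keeps its original names. A possible way to keep both frames consistent is to pass the same function to `DataFrame.rename`, sketched here on empty toy frames that carry only a few of the column names.

```python
import pandas as pd

def clean_col(col):
    """Same cleaning rules as the notebook's clean_col function."""
    col = col.strip()
    col = col.replace("Item_Identifier", "Item_Id")
    col = col.replace("Outlet_Identifier", "Outlet_Id")
    col = col.replace(" ", "_")
    return col.title()

# Toy frames with a subset of the real column names and no rows.
toy_train = pd.DataFrame(
    columns=["Item_Identifier", "Item_MRP", "Outlet_Identifier", "Item_Outlet_Sales"]
)
toy_test = pd.DataFrame(columns=["Item_Identifier", "Item_MRP", "Outlet_Identifier"])

# rename(columns=callable) applies the function to every column label.
toy_train = toy_train.rename(columns=clean_col)
toy_test = toy_test.rename(columns=clean_col)

print(list(toy_train.columns))
print(list(toy_test.columns))
```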

In [8]:
print(train_df["Item_Weight"].value_counts())

12.150    86
17.600    82
13.650    77
11.800    76
15.100    68
          ..
7.560      2
9.420      1
5.400      1
6.520      1
7.685      1
Name: Item_Weight, Length: 415, dtype: int64


In [9]:
print(train_df["Item_Fat_Content"].value_counts())

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64


In [10]:
# Cleaning up irregularities in the column data

train_df['Item_Fat_Content'] = train_df['Item_Fat_Content'].replace(['Low Fat', 'low fat', 'LF', 'reg'], ['Low', 'Low', 'Low', 'Regular'])
print(train_df["Item_Fat_Content"].value_counts())

Low        5517
Regular    3006
Name: Item_Fat_Content, dtype: int64


In [11]:
print(train_df["Item_Visibility"].value_counts())

0.000000    526
0.076975      3
0.041283      2
0.085622      2
0.187841      2
           ... 
0.092576      1
0.067544      1
0.115168      1
0.146896      1
0.050902      1
Name: Item_Visibility, Length: 7880, dtype: int64


In [12]:
print(train_df["Item_Type"].value_counts())

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64


In [13]:
train_df['Item_Type'] = train_df['Item_Type'].replace(['Hard Drinks', 'Others'],['Alcohol', 'Other'])
print(train_df["Item_Type"].value_counts())

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Alcohol                   214
Other                     169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64


In [14]:
print(train_df["Item_Mrp"].value_counts())

172.0422    7
188.1872    6
170.5422    6
109.5228    6
196.5084    6
           ..
212.8218    1
190.3872    1
162.6868    1
189.1214    1
51.3008     1
Name: Item_Mrp, Length: 5938, dtype: int64


In [15]:
print(train_df["Outlet_Id"].value_counts())

OUT027    935
OUT013    932
OUT049    930
OUT035    930
OUT046    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: Outlet_Id, dtype: int64


In [16]:
print(train_df["Outlet_Establishment_Year"].value_counts())

1985    1463
1987     932
1999     930
1997     930
2004     930
2002     929
2009     928
2007     926
1998     555
Name: Outlet_Establishment_Year, dtype: int64


In [17]:
print(train_df["Outlet_Size"].value_counts())

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64


In [18]:
print(train_df["Outlet_Location_Type"].value_counts())

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64


In [19]:
print(train_df["Outlet_Type"].value_counts())

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64


In [20]:
print(train_df["Item_Outlet_Sales"].value_counts())

958.7520     17
1342.2528    16
1845.5976    15
703.0848     15
1278.3360    14
             ..
3167.8764     1
2226.4352     1
1684.4740     1
1574.6170     1
6692.6216     1
Name: Item_Outlet_Sales, Length: 3493, dtype: int64
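The per-column checks above could also be run in a single loop. The sketch below assumes a toy frame built from the first five training rows; passing `dropna=False` is an optional choice that makes missing values show up in the counts as well.

```python
import pandas as pd

# Toy frame using the Outlet_Size and Outlet_Location_Type values
# from the first five training rows (row 3 has no Outlet_Size).
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "Medium", "Medium", None, "High"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 1", "Tier 3", "Tier 3"],
})

counts = {}
for column in ["Outlet_Size", "Outlet_Location_Type"]:
    # dropna=False keeps a row for missing values in the result
    counts[column] = toy[column].value_counts(dropna=False)
    print(counts[column], end="\n\n")
```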


In [21]:
# Replace blank (empty or whitespace-only) values with NaN using DataFrame.replace()

train_df = train_df.replace(r'^\s*$', np.nan, regex=True)
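To show what the pattern `^\s*$` matches, the sketch below applies the same replacement to a toy frame and then counts missing values per column. The toy values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: one empty string, one whitespace-only string, one real missing value.
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "", "   ", None, "High"],
    "Item_Weight": [9.3, 5.92, np.nan, 19.2, 8.93],
})

# ^\s*$ matches strings that are empty or contain only whitespace,
# so genuine labels such as "Medium" are left untouched.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Count missing values per column after the replacement.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Note that `pd.read_csv` already reads completely empty fields as NaN by default, so on freshly loaded data this step mainly catches whitespace-only strings.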

In [22]:
print(train_df)

     Item_Id  Item_Weight Item_Fat_Content  Item_Visibility  \
0      FDA15        9.300              Low         0.016047   
1      DRC01        5.920          Regular         0.019278   
2      FDN15       17.500              Low         0.016760   
3      FDX07       19.200          Regular         0.000000   
4      NCD19        8.930              Low         0.000000   
...      ...          ...              ...              ...   
8518   FDF22        6.865              Low         0.056783   
8519   FDS36        8.380          Regular         0.046982   
8520   NCJ29       10.600              Low         0.035186   
8521   FDN46        7.210          Regular         0.145221   
8522   DRG01       14.800              Low         0.044878   

                  Item_Type  Item_Mrp Outlet_Id  Outlet_Establishment_Year  \
0                     Dairy  249.8092    OUT049                       1999   
1               Soft Drinks   48.2692    OUT018                       2009   
2        

In [None]:
# clean sales to remove decimals
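The cell above is left as a placeholder. One possible reading of "remove decimals" is rounding the sales figures to whole numbers, sketched below on the first five training values; whether rounding or truncation is intended is an assumption here.

```python
import pandas as pd

# First five Item_Outlet_Sales values from the training data.
toy = pd.DataFrame({
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

# Round to the nearest whole number, then store as integers.
toy["Item_Outlet_Sales"] = toy["Item_Outlet_Sales"].round().astype("int64")
print(toy)
```

If missing values were present in the column, they would need to be handled before the integer conversion.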

## Data Analysis (Cleaned Data Onwards)