# Bigmart Sales Data predictor


In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [34]:
trainData = pd.read_csv("bigmart-sales-data/Train.csv")
testData = pd.read_csv("bigmart-sales-data/Test.csv")


## Having a look at the data and the features...

Some of the features may be redundant or may not contribute significantly to the sales. To find out, we look at the features and their possible values.


In [35]:
print(trainData.head())
print(trainData.shape)

#print(testData.head())
print(testData.shape)

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0           FDA15         9.30          Low Fat         0.016047   
1           DRC01         5.92          Regular         0.019278   
2           FDN15        17.50          Low Fat         0.016760   
3           FDX07        19.20          Regular         0.000000   
4           NCD19         8.93          Low Fat         0.000000   

               Item_Type  Item_MRP Outlet_Identifier  \
0                  Dairy  249.8092            OUT049   
1            Soft Drinks   48.2692            OUT018   
2                   Meat  141.6180            OUT049   
3  Fruits and Vegetables  182.0950            OUT010   
4              Household   53.8614            OUT013   

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type  \
0                       1999      Medium               Tier 1   
1                       2009      Medium               Tier 3   
2                       1999      Medium               Tier

In [36]:
labels = trainData.columns.to_list()
print(labels)


['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']


An item's identifier is unlikely to affect the sales, so it should be safe to drop the identifier columns. The cell below also drops Item_Type (column index 4).

In [42]:
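# Identifier columns plus Item_Type (positions 0, 6 and 4 in trainData)
badCols = ['Item_Identifier', 'Outlet_Identifier', 'Item_Type']

# Drop by name rather than by position, so the intent is explicit
df = trainData.drop(columns=badCols)
df.head(20)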

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,249.8092,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,48.2692,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,141.618,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,182.095,1998,,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,53.8614,1987,High,Tier 3,Supermarket Type1,994.7052
5,10.395,Regular,0.0,51.4008,2009,Medium,Tier 3,Supermarket Type2,556.6088
6,13.65,Regular,0.012741,57.6588,1987,High,Tier 3,Supermarket Type1,343.5528
7,,Low Fat,0.12747,107.7622,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,16.2,Regular,0.016687,96.9726,2002,,Tier 2,Supermarket Type1,1076.5986
9,19.2,Regular,0.09445,187.8214,2007,,Tier 2,Supermarket Type1,4710.535


## Continuous and discrete features...

The features can be divided into two sets:

Continuous: {Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year}

Discrete: {Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type, Outlet_Type}, although Item_Type was dropped above and so is no longer in df



In [48]:
discCols = ['Item_Fat_Content','Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
contCols = ['Item_Weight','Item_Visibility','Item_MRP','Outlet_Establishment_Year']

# Let's look at the discrete features...
for column in discCols:
    print("\t"+column)
    print(df[column].unique())
    print("\n\n")

# Let's look at the continuous features...
print("\t\t\tMAX \t\t MIN")
for column in contCols:
    print(column +"\t"+ str(df[column].max())+"\t"+str(df[column].min()))

	Item_Fat_Content
['Low Fat' 'Regular' 'low fat' 'LF' 'reg']



	Outlet_Size
['Medium' nan 'High' 'Small']



	Outlet_Location_Type
['Tier 1' 'Tier 3' 'Tier 2']



	Outlet_Type
['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3']



			MAX 		 MIN
Item_Weight	21.35	4.555
Item_Visibility	0.328390948	0.0
Item_MRP	266.8884	31.29
Outlet_Establishment_Year	2009	1985


Of all the discrete features, only Outlet_Size contains NaN (missing) values. The unique values of Item_Fat_Content look inconsistent: "Low Fat", "low fat" and "LF" probably mean the same thing, as do "reg" and "Regular".
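As a minimal sketch of how the inconsistent fat-content labels could be merged, the snippet below uses a small stand-in frame (`toy`) holding the five values printed above. The mapping `fat_map` is an assumption: it treats "low fat" and "LF" as "Low Fat", and "reg" as "Regular", which the data suggests but does not confirm.

```python
import pandas as pd

# Toy frame standing in for df; the values are the unique labels printed above.
toy = pd.DataFrame({"Item_Fat_Content": ["Low Fat", "Regular", "low fat", "LF", "reg"]})

# Assumed mapping of the variant spellings onto two canonical labels.
fat_map = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
toy["Item_Fat_Content"] = toy["Item_Fat_Content"].replace(fat_map)

print(toy["Item_Fat_Content"].unique())
```

The same `replace` call could be applied to `df["Item_Fat_Content"]` if the mapping turns out to be right.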

Another interesting point is that Item_Visibility already lies between 0 and 1 (here, from 0.0 to about 0.33).
Outlet_Establishment_Year, although discrete, could perhaps still be used in the regression, for example in the form of a time series.
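One simple alternative to a full time-series treatment is to turn the year into a numeric "age" feature. The sketch below is only an illustration on a toy frame: the column name `Outlet_Age` is hypothetical, and measuring age relative to the newest outlet in the data is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Toy frame with a few of the establishment years shown earlier.
toy = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2009, 1998, 1987, 1985]})

# Assumption: age is measured relative to the most recent year in the data.
ref_year = toy["Outlet_Establishment_Year"].max()

# Hypothetical derived column: years between each outlet and the newest one.
toy["Outlet_Age"] = ref_year - toy["Outlet_Establishment_Year"]

print(toy)
```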


Speaking of which, continuous features might contain NaNs too, so let's find out which features have NaNs and how many.

In [50]:
df.isnull().sum(axis = 0)

Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_MRP                        0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

The dataset has about 8k examples, and Item_Weight and Outlet_Size are missing 1463 and 2410 values respectively. Dropping those rows would discard a large share of the data, so we need a way to fill the NaNs instead.
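A common baseline for filling such gaps, sketched here on a toy frame, is to use the median for a numeric column and the most frequent value for a categorical one. This is an assumed strategy for illustration, not necessarily the best choice for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df, with one missing value in each column.
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, np.nan, 19.2],
    "Outlet_Size": ["Medium", np.nan, "Medium", "High"],
})

# Numeric column: fill with the median of the observed values.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].median())

# Categorical column: fill with the most frequent observed value (the mode).
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])

print(toy.isnull().sum())
```

Both fills ignore any relationship between columns, so a group-wise fill (for example, by outlet type) may be worth trying afterwards.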