### Problem Statement - Bigmart Sales Data Set
Retail is another industry that uses analytics extensively to optimize business processes. Tasks such as product placement, inventory management, customized offers, and product bundling are handled using data science techniques. As the name suggests, this data set comprises transaction records of a sales store. This is a regression problem. The training data has 8523 rows and 12 variables.


Problem: Predict the sales of a store.

Start: Get Data - https://www.kaggle.com/devashish0507/big-mart-sales-prediction#Train.csv

Tutorial: Get Here - https://www.analyticsvidhya.com/blog/2016/02/bigmart-sales-solution-top-20/

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

## 1) The Hypotheses

Based on domain knowledge, we can propose the following hypotheses:

### Store Level Hypotheses

1) City Type -
    Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there.

2) Population Density -
    Stores located in densely populated areas should have higher sales because of greater demand.

3) Store Capacity -
    Very large stores should have higher sales because they act as one-stop shops and people prefer getting everything from one place.

4) Competitors -
    Stores with similar establishments nearby should have lower sales because of greater competition.

5) Marketing -
    Stores with a good marketing division should have higher sales because they can attract customers through the right offers and advertising.

6) Location -
    Stores located within popular marketplaces should have higher sales because of better access to customers.

7) Customer Behavior -
    Stores that stock the right set of products to meet local customers' needs will have higher sales.

8) Ambience -
    Stores that are well maintained and staffed by polite and humble people are expected to have higher sales.

### Product Level Hypotheses

1) Brand: Branded products should have higher sales because customers trust them more

2) Packaging: Products with good packaging can attract customers and sell more

3) Utility: Daily-use products should have a higher tendency to sell than specific-use products

4) Display Area: Products that are given bigger shelves in the store are likely to catch attention first and sell more

5) Visibility in Store: The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before those at the back

6) Advertising: Better advertising of products in the store should lead to higher sales in most cases

7) Promotional Offers: Products accompanied by attractive offers and discounts will sell more

## 2) Data Exploration

| Variable | Description | Relation to Hypothesis |
| --- | --- | --- |
| Item_Identifier | Unique Product ID | ID Variable |
| Item_Weight | Weight of Product | Not considered in Hypothesis |
| Item_Fat_Content | Whether the product is low fat or not  | Linked to 'Utility' hypothesis. Low fat items are generally used more than others |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product  | Linked to 'Display Area' hypothesis |
| Item_Type | The category to which the product belongs  | More inferences of 'Utility' can be derived from this |
| Item_MRP | List price of the Product | Not considered in Hypothesis |
| Outlet_Identifier | Unique Store ID | ID Variable |
| Outlet_Establishment_Year | The year in which the store was established | Not considered in Hypothesis |
| Outlet_Size | The ground area of the store | Linked to 'Store Capacity' hypothesis |
| Outlet_Location_Type | The type of city in which store is located | Linked to 'City Type' Hypothesis |
| Outlet_Type | Whether the outlet is a grocery store or a supermarket | Linked to 'Store Capacity' hypothesis |
| Item_Outlet_Sales | Sales of product in a store | Outcome variable to be predicted |

Of the 15 hypotheses, only a few can be tested: six features in the data are linked to them, and the rest have no associated data.

In [1]:
import numpy as np
import pandas as pd

# \033[1m     # Print in bold begin
# \033[0;0m   # Print in bold end

bmTrain_data = pd.read_csv("data/BigMartTrainData.csv")
bmTest_data = pd.read_csv("data/BigMartTestData.csv")

In [2]:
# Let us combine train and test data and fetch train and test data separately by program so that each run picks random sets
bmTrain_data['source'] = 'train' # to indicate it is training data
bmTest_data['source'] = 'test'   # to indicate it is testing data
bm_data = pd.concat([bmTrain_data, bmTest_data], ignore_index = True, sort=False)
print ("Size of Big Mart Sales Data = ", bm_data.shape, "Size of Big Mart Training Sales Data = ", bmTrain_data.shape, "Size of Big Mart Test Sales Data = ", bmTest_data.shape)
bm_data

Size of Big Mart Sales Data =  (14204, 13) Size of Big Mart Training Sales Data =  (8523, 13) Size of Big Mart Test Sales Data =  (5681, 12)


Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,source
0,FDA15,9.30,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380,train
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,train
2,FDN15,17.50,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700,train
3,FDX07,19.20,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800,train
4,NCD19,8.93,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14199,FDB58,10.50,Regular,0.013496,Snack Foods,141.3154,OUT046,1997,Small,Tier 1,Supermarket Type1,,test
14200,FDD47,7.60,Regular,0.142991,Starchy Foods,169.1448,OUT018,2009,Medium,Tier 3,Supermarket Type2,,test
14201,NCO17,10.00,Low Fat,0.073529,Health and Hygiene,118.7440,OUT045,2002,,Tier 2,Supermarket Type1,,test
14202,FDJ26,15.30,Regular,0.000000,Canned,214.6218,OUT017,2007,,Tier 2,Supermarket Type1,,test
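
The `source` column added above is what allows the combined frame to be separated again once shared preprocessing is done. A minimal sketch of that round trip on a toy frame follows; the toy values and the names `toy_train`, `toy_test`, `combined`, `train` and `test` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (illustrative values only)
toy_train = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01"],
                          "Item_Outlet_Sales": [3735.138, 443.4228]})
toy_test = pd.DataFrame({"Item_Identifier": ["FDB58"]})

# Tag each frame, then stack them as the notebook does
toy_train["source"] = "train"
toy_test["source"] = "test"
combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# Split back on the marker column and drop it again
train = combined[combined["source"] == "train"].drop(columns="source")
# The test rows never had a target, so drop that all-NaN column as well
test = combined[combined["source"] == "test"].drop(columns=["source", "Item_Outlet_Sales"])

print(train.shape, test.shape)
```

Keeping the marker as an ordinary column means any cleaning step applied to the combined frame reaches both sets without extra bookkeeping.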


In [3]:
# Let us find missing data
# Note Item_Outlet_Sales is the outcome variable; its missing values come from the test set, so we ignore them
#      Item_Weight, Outlet_Size need to be imputed
bm_data.isnull().sum()

Item_Identifier                 0
Item_Weight                  2439
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  4016
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales            5681
source                          0
dtype: int64
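
Raw counts are easier to judge as a share of the rows: 2439 missing weights out of 14204 rows is roughly 17%, and 4016 missing outlet sizes is roughly 28%. The sketch below shows one way to report both figures side by side; the toy frame and the name `missing` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only)
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
print(missing)
```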

In [4]:
# Let us examine data
# Note Item_Visibility has a min value of 0, which makes no sense for a product that is being sold in the store
#      Outlet_Establishment_Year should be converted to the age of the store
#      The lower 'count' of Item_Weight and Item_Outlet_Sales confirms the findings from the missing value check
bm_data.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,11765.0,14204.0,14204.0,14204.0,8523.0
mean,12.792854,0.065953,141.004977,1997.830681,2181.288914
std,4.652502,0.051459,62.086938,8.371664,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.71,0.027036,94.012,1987.0,834.2474
50%,12.6,0.054021,142.247,1999.0,1794.331
75%,16.75,0.094037,185.8556,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648
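
The two observations about `Item_Visibility` and `Outlet_Establishment_Year` can be sketched as follows. This is only one possible treatment: measuring store age against 2013 follows from the problem statement (2013 sales data), while treating a visibility of 0 as missing and filling it with the item's mean visibility is an assumption, as are the toy values and the column name `Outlet_Years`.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Item_Identifier": ["FDX07", "FDX07", "NCD19", "NCD19"],
    "Item_Visibility": [0.0, 0.02, 0.0, 0.04],
    "Outlet_Establishment_Year": [1998, 1999, 1987, 2009],
})

# Store age relative to 2013, the year the sales data was collected
toy["Outlet_Years"] = 2013 - toy["Outlet_Establishment_Year"]

# Assumption: a visibility of 0 means "not recorded", so treat it as missing
# and fill it with the mean visibility of the same item in other stores
visibility = toy["Item_Visibility"].replace(0, np.nan)
toy["Item_Visibility"] = visibility.fillna(
    visibility.groupby(toy["Item_Identifier"]).transform("mean")
)
print(toy)
```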


In [5]:
# Find out unique values of each variable
# Note there are 1559 products (stated also in the problem)
#                10 stores (stated also in the problem)
#                16 types of products
bm_data.apply(lambda x: len(x.unique()))

Item_Identifier               1559
Item_Weight                    416
Item_Fat_Content                 5
Item_Visibility              13006
Item_Type                       16
Item_MRP                      8052
Outlet_Identifier               10
Outlet_Establishment_Year        9
Outlet_Size                      4
Outlet_Location_Type             3
Outlet_Type                      4
Item_Outlet_Sales             3494
source                           2
dtype: int64

In [6]:
# Filter categorical variables
categorical_cols = [x for x in bm_data.dtypes.index if bm_data.dtypes[x] == 'object']
print (categorical_cols)

['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'source']


In [7]:
# Exclude ID cols
categorical_cols = [x for x in categorical_cols if x not in ['Item_Identifier', 'Outlet_Identifier', 'source']]
print (categorical_cols)

['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']


In [8]:
# Print frequency of categories
for col in categorical_cols:
    print ("Frequency of categories for variable\033[1m %s\033[0;0m is " %col )
    print(bm_data[col].value_counts())

Frequency of categories for variable Item_Fat_Content is
Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64
Frequency of categories for variable Item_Type is
Fruits and Vegetables    2013
Snack Foods              1989
Household                1548
Frozen Foods             1426
Dairy                    1136
Baking Goods             1086
Canned                   1084
Health and Hygiene        858
Meat                      736
Soft Drinks               726
Breads                    416
Hard Drinks               362
Others                    280
Starchy Foods             269
Breakfast                 186
Seafood                    89
Name: Item_Type, dtype: int64
Frequency of categories for variable Outlet_Size is
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64
Frequency of categories for variable Outlet_Location_Type is
Tier 3    5583
Tier 2    4641
Ti

In [9]:
# From the results above we notice the following:
#  1) Item_Fat_Content - 'Low Fat', 'LF' and 'low fat' are the same category, as are 'Regular' and 'reg'; these need to be merged
#  2) Item_Type - Some of the 16 categories have very few rows; we can regroup them into broader, more distinct categories
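
Both corrections can be sketched on a toy frame. The label merge follows directly from the frequency table above. The regrouping is only one possible scheme: it assumes the first two letters of `Item_Identifier` (`FD`, `DR`, `NC` in the rows shown earlier) stand for food, drinks and non-consumables, and the column name `Item_Type_Combined` and the dictionary `prefix_labels` are hypothetical.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDX07", "FDN15"],
    "Item_Fat_Content": ["Low Fat", "reg", "LF", "Regular", "low fat"],
})

# Collapse the spelling variants seen in the frequency table into two labels
toy["Item_Fat_Content"] = toy["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
)

# Assumption: the first two letters of the ID encode a broad category
prefix_labels = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
toy["Item_Type_Combined"] = toy["Item_Identifier"].str[:2].map(prefix_labels)

print(toy["Item_Fat_Content"].value_counts())
print(toy["Item_Type_Combined"].value_counts())
```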

## 3) Data Cleaning

Let us impute missing values:

Item_Weight - Let us impute it with the average weight of the particular item

Outlet_Size - Let us impute it with the mode of the Outlet_Size for the particular type of outlet
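
As a compact illustration of both strategies, the sketch below applies them to a toy frame using `groupby`. It is an alternative formulation under stated assumptions, not the notebook's own cells: the toy values and the names `item_mean` and `size_mode` are made up for the example, and it assumes every item and every outlet type has at least one non-missing value to learn from.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both columns (illustrative values only)
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01"],
    "Item_Weight": [9.3, np.nan, 5.92, np.nan],
    "Outlet_Type": ["Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", None, "Medium", None],
})

# Mean weight of each item, broadcast back to every row of that item
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)

# Most frequent size per outlet type (Series.mode skips missing values)
size_mode = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(size_mode))

print(toy)
```

If an item or outlet type had no observed value at all, the corresponding rows would stay missing (or the mode lookup would fail), so a fallback such as the overall mean or mode would be needed.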

In [10]:
# Imputing Item_Weight

# Determine average weight per item
item_avg_weight = bm_data.pivot_table(values='Item_Weight', index='Item_Identifier')
item_avg_weight

Item_Identifier,Item_Weight
DRA12,11.600
DRA24,19.350
DRA59,8.270
DRB01,7.390
DRB13,6.115
...,...
NCZ30,6.590
NCZ41,19.850
NCZ42,10.500
NCZ53,9.600


In [11]:
# Get a boolean variable specifying missing Item_Weight values
miss_bool = bm_data['Item_Weight'].isnull()
print(miss_bool)

0        False
1        False
2        False
3        False
4        False
         ...  
14199    False
14200    False
14201    False
14202    False
14203    False
Name: Item_Weight, Length: 14204, dtype: bool


In [12]:
# Impute data and check #missing values before and after imputation
print('Original #missing: %d' %sum(miss_bool))

Original #missing: 2439


In [13]:
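# item_avg_weight is a DataFrame, so item_avg_weight[x] looks up a column named x
# and raises KeyError: 'FDP10'. Map each Item_Identifier to its mean weight instead.
bm_data.loc[miss_bool, 'Item_Weight'] = bm_data.loc[miss_bool, 'Item_Identifier'].map(item_avg_weight['Item_Weight'])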

In [None]:
print('Final #missing: %d' %sum(bm_data['Item_Weight'].isnull()))

In [None]:
# Determine the most frequent Outlet_Size for each Outlet_Type.
# pandas' Series.mode() is used here instead of scipy.stats.mode because it
# skips NaN and works on strings.
outlet_size_mode = bm_data.groupby('Outlet_Type')['Outlet_Size'].agg(lambda x: x.mode().iloc[0])
#print ('Mode for each Outlet_Type: ')
#print(outlet_size_mode)

# Get a boolean variable specifying missing Outlet_Size values
miss_bool = bm_data['Outlet_Size'].isnull()

# Impute data and check #missing values before and after imputation
print('\nOriginal #missing: %d' %sum(miss_bool))
bm_data.loc[miss_bool, 'Outlet_Size'] = bm_data.loc[miss_bool, 'Outlet_Type'].map(outlet_size_mode)
print('Final #missing: %d' %sum(bm_data['Outlet_Size'].isnull()))

## 4) Feature Engineering

In [None]:
dict_a = [{'name': 'python', 'points': 10}, {'name': 'java', 'points': 8}]

map(lambda x : x['name'], dict_a)


In [None]:
print(list(dict_a))