### High Level Steps

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome

2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.

3. Data Cleaning – imputing missing values in the data and checking for outliers

4. Feature Engineering – modifying existing variables and creating new ones for analysis

5. Model Building – making predictive models on the data

### 1. Hypothesis Generation

This is a pivotal step in the process of analyzing data. It involves understanding the problem and forming hypotheses about which factors could have a meaningful impact on the outcome. This is done BEFORE looking at the data, and we end up with a laundry list of the analyses we could perform if the data is available.

**The Problem Statement**

Understanding the problem statement is the first and foremost step. You can view it on the competition page, but I’ll restate it here:

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

So the idea is to find the properties of a product and of a store that impact the sales of a product. Let’s think about some of the analyses that can be done and come up with some hypotheses.

**The Hypotheses**

I came up with the following hypotheses while thinking about the problem. These are just my thoughts, and you can come up with many more. Since we’re talking about stores and products, let’s make a separate set for each.

**Store Level Hypotheses:**

1. City type: Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there.

2. Population Density: Stores located in densely populated areas should have higher sales because of more demand.

3. Store Capacity: Stores which are very big in size should have higher sales as they act like one-stop-shops and people would prefer getting everything from one place.

4. Competitors: Stores having similar establishments nearby should have lower sales because of more competition.

5. Marketing: Stores which have a good marketing division should have higher sales as it will be able to attract customers through the right offers and advertising.

6. Location: Stores located within popular marketplaces should have higher sales because of better access to customers.

7. Customer Behavior: Stores keeping the right set of products to meet the local needs of customers will have higher sales.

8. Ambiance: Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.

**Product Level Hypotheses:**

1. Brand: Branded products should have higher sales because customers trust them more.

2. Packaging: Products with good packaging can attract customers and sell more.

3. Utility: Daily-use products should have a higher tendency to sell than specific-use products.

4. Display Area: Products which are given bigger shelves in the store are likely to catch attention first and sell more.

5. Visibility in Store: The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer’s eye before the ones at the back.

6. Advertising: Better advertising of products in the store should lead to higher sales in most cases.

7. Promotional Offers: Products accompanied with attractive offers and discounts will sell more.

These are just 15 basic hypotheses I have made, but you can think further and create some of your own. Remember that the data might not be sufficient to test all of these, but forming them gives us a better understanding of the problem, and we can even look for open source information if available.

Let’s move on to data exploration, where we will look at the data in detail.

### 2. Data Exploration

In [1]:
# Import libraries

import pandas as pd
import numpy as np

In [9]:
#Read files:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [10]:
# It's generally a good idea to combine the train and test data sets into one, 
# perform feature engineering on the combined data, and split them again later. 
# This saves the trouble of performing the same steps twice on test and train. 
# Let's combine them into a dataframe ‘data’ with a ‘source’ column specifying where each observation belongs.

train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
print (train.shape, test.shape, data.shape)

# We can see that data has the same columns as train, and as many rows as test and train together.
data

(8523, 13) (5681, 12) (14204, 13)


| | Item_Fat_Content | Item_Identifier | Item_MRP | Item_Outlet_Sales | Item_Type | Item_Visibility | Item_Weight | Outlet_Establishment_Year | Outlet_Identifier | Outlet_Location_Type | Outlet_Size | Outlet_Type | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Low Fat | FDA15 | 249.8092 | 3735.1380 | Dairy | 0.016047 | 9.300 | 1999 | OUT049 | Tier 1 | Medium | Supermarket Type1 | train |
| 1 | Regular | DRC01 | 48.2692 | 443.4228 | Soft Drinks | 0.019278 | 5.920 | 2009 | OUT018 | Tier 3 | Medium | Supermarket Type2 | train |
| 2 | Low Fat | FDN15 | 141.6180 | 2097.2700 | Meat | 0.016760 | 17.500 | 1999 | OUT049 | Tier 1 | Medium | Supermarket Type1 | train |
| 3 | Regular | FDX07 | 182.0950 | 732.3800 | Fruits and Vegetables | 0.000000 | 19.200 | 1998 | OUT010 | Tier 3 | NaN | Grocery Store | train |
| 4 | Low Fat | NCD19 | 53.8614 | 994.7052 | Household | 0.000000 | 8.930 | 1987 | OUT013 | Tier 3 | High | Supermarket Type1 | train |
| 5 | Regular | FDP36 | 51.4008 | 556.6088 | Baking Goods | 0.000000 | 10.395 | 2009 | OUT018 | Tier 3 | Medium | Supermarket Type2 | train |
| 6 | Regular | FDO10 | 57.6588 | 343.5528 | Snack Foods | 0.012741 | 13.650 | 1987 | OUT013 | Tier 3 | High | Supermarket Type1 | train |
| 7 | Low Fat | FDP10 | 107.7622 | 4022.7636 | Snack Foods | 0.127470 | NaN | 1985 | OUT027 | Tier 3 | Medium | Supermarket Type3 | train |
| 8 | Regular | FDH17 | 96.9726 | 1076.5986 | Frozen Foods | 0.016687 | 16.200 | 2002 | OUT045 | Tier 2 | NaN | Supermarket Type1 | train |
| 9 | Regular | FDU28 | 187.8214 | 4710.5350 | Frozen Foods | 0.094450 | 19.200 | 2007 | OUT017 | Tier 2 | NaN | Supermarket Type1 | train |
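
Since the combined frame has to be divided again before modelling, here is a minimal sketch of that round trip. The two tiny frames and the names `train_demo`, `test_demo`, `train_back` and `test_back` are made up for illustration; only the `source` column idea comes from the cell above.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
train_demo = pd.DataFrame({"Item_MRP": [249.8, 48.3],
                           "Item_Outlet_Sales": [3735.1, 443.4]})
test_demo = pd.DataFrame({"Item_MRP": [107.9]})

# Tag each frame with its origin, then stack them
train_demo["source"] = "train"
test_demo["source"] = "test"
combined = pd.concat([train_demo, test_demo], ignore_index=True)

# Later: split back using the 'source' column and drop the helper column.
# The test part also drops the target, which is all NaN there.
train_back = combined[combined["source"] == "train"].drop(columns="source")
test_back = combined[combined["source"] == "test"].drop(
    columns=["source", "Item_Outlet_Sales"])

print(train_back.shape, test_back.shape)
```

Keeping the `source` column until the very end means every cleaning step in between is applied once and reaches both parts.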


In [11]:
# One of the key challenges in any data set is missing values. 
# Let's start by checking which columns contain missing values

data.isnull().sum()

Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                  4016
Outlet_Type                     0
source                          0
dtype: int64
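
Raw counts are easier to judge next to the share of rows they represent. The following is a small sketch on a made-up frame (the name `demo` and its values are assumptions, not the BigMart data) showing both views side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few gaps
demo = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

missing_count = demo.isnull().sum()   # number of missing values per column
missing_share = demo.isnull().mean()  # fraction of rows that are missing

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
# Show only the columns that actually have gaps
print(summary[summary["missing"] > 0])
```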

In [12]:
# Note that Item_Outlet_Sales is the target variable and
# its missing values are the ones in the test set, so we need 
# not worry about it. But we’ll impute the missing values 
# in Item_Weight and Outlet_Size in the data cleaning section.

# Let's look at some basic statistics for numerical variables.

data.describe()

| | Item_MRP | Item_Outlet_Sales | Item_Visibility | Item_Weight | Outlet_Establishment_Year |
|---|---|---|---|---|---|
| count | 14204.0 | 8523.0 | 14204.0 | 11765.0 | 14204.0 |
| mean | 141.004977 | 2181.288914 | 0.065953 | 12.792854 | 1997.830681 |
| std | 62.086938 | 1706.499616 | 0.051459 | 4.652502 | 8.371664 |
| min | 31.29 | 33.29 | 0.0 | 4.555 | 1985.0 |
| 25% | 94.012 | 834.2474 | 0.027036 | 8.71 | 1987.0 |
| 50% | 142.247 | 1794.331 | 0.054021 | 12.6 | 1999.0 |
| 75% | 185.8556 | 3101.2964 | 0.094037 | 16.75 | 2004.0 |
| max | 266.8884 | 13086.9648 | 0.328391 | 21.35 | 2009.0 |


In [13]:
# Some observations:

# 1. Item_Visibility has a min value of zero. 
# This makes no practical sense because when a 
# product is being sold in a store, the visibility cannot be 0.
# 2. Outlet_Establishment_Year varies from 1985 to 2009.
# The raw year is awkward to use directly. If we convert 
# it to how old the particular store is, it should be a 
# more useful predictor of sales.
# 3. The lower ‘count’ of Item_Weight and Item_Outlet_Sales 
# confirms the findings from the missing value check.
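
The first two observations can be checked or acted on with a couple of lines. This is only a sketch on made-up rows: the column name `Outlet_Years` is hypothetical, and the reference year 2013 is an assumption taken from the problem statement (the sales data is from 2013).

```python
import pandas as pd

# Hypothetical rows mimicking the two columns discussed above
demo = pd.DataFrame({
    "Item_Visibility": [0.016047, 0.0, 0.0, 0.12747],
    "Outlet_Establishment_Year": [1999, 1998, 1987, 1985],
})

# Observation 1: how many rows report a visibility of exactly zero?
zero_visibility = (demo["Item_Visibility"] == 0).sum()
print("Rows with zero visibility:", zero_visibility)

# Observation 2: turn the establishment year into the store's age.
# ASSUMPTION: 2013 is used as the reference year.
REFERENCE_YEAR = 2013
demo["Outlet_Years"] = REFERENCE_YEAR - demo["Outlet_Establishment_Year"]
print(demo["Outlet_Years"].tolist())
```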

In [14]:
# Moving to nominal (categorical) variables, let's have a 
# look at the number of unique values in each of them.

data.apply(lambda x: len(x.unique()))

Item_Fat_Content                 5
Item_Identifier               1559
Item_MRP                      8052
Item_Outlet_Sales             3494
Item_Type                       16
Item_Visibility              13006
Item_Weight                    416
Outlet_Establishment_Year        9
Outlet_Identifier               10
Outlet_Location_Type             3
Outlet_Size                      4
Outlet_Type                      4
source                           2
dtype: int64
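
One caveat when reading these counts: `len(x.unique())` treats a missing value as one more distinct value, which is why a column with gaps such as Outlet_Size reports one more "category" than it really has. A small sketch on a made-up Series (the name `sizes` is an assumption) shows the difference from `nunique()`.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three real categories and one gap
sizes = pd.Series(["Medium", "Small", "High", np.nan, "Medium"])

with_nan = len(sizes.unique())     # NaN is counted as a distinct value
without_nan = sizes.nunique()      # NaN is ignored by default
explicit = sizes.nunique(dropna=False)  # opt back in to counting NaN

print(with_nan, without_nan, explicit)
```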

In [None]:
# This tells us that there are 1559 products and 
# 10 outlets/stores (which was also mentioned in 
# problem statement). Another thing that should 
# catch attention is that Item_Type has 16 unique values.
# Let’s explore further using the frequency of different 
# categories in each nominal variable. I’ll exclude the ID 
# and source variables for obvious reasons.

In [16]:
#Filter categorical variables
categorical_columns = [x for x in data.dtypes.index if data.dtypes[x]=='object']

categorical_columns

['Item_Fat_Content',
 'Item_Identifier',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Location_Type',
 'Outlet_Size',
 'Outlet_Type',
 'source']

In [18]:
#Exclude ID cols and source:
categorical_columns = [x for x in categorical_columns if x not in ['Item_Identifier','Outlet_Identifier','source']]

categorical_columns


['Item_Fat_Content',
 'Item_Type',
 'Outlet_Location_Type',
 'Outlet_Size',
 'Outlet_Type']

In [27]:
#Print frequency of categories
for col in categorical_columns:
    print ('Frequency of Categories for variable %s' % col)
    print (data[col].value_counts())
    print("")

Frequency of Categories for variable Item_Fat_Content
Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64

Frequency of Categories for variable Item_Type
Fruits and Vegetables    2013
Snack Foods              1989
Household                1548
Frozen Foods             1426
Dairy                    1136
Baking Goods             1086
Canned                   1084
Health and Hygiene        858
Meat                      736
Soft Drinks               726
Breads                    416
Hard Drinks               362
Others                    280
Starchy Foods             269
Breakfast                 186
Seafood                    89
Name: Item_Type, dtype: int64

Frequency of Categories for variable Outlet_Location_Type
Tier 3    5583
Tier 2    4641
Tier 1    3980
Name: Outlet_Location_Type, dtype: int64

Frequency of Categories for variable Outlet_Size
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64

Frequency of Categories for va

In [None]:
# The output gives us following observations:

# 1. Item_Fat_Content: Some ‘Low Fat’ values are mis-coded as ‘low fat’ and ‘LF’. Also, some ‘Regular’ values appear as ‘reg’.
# 2. Item_Type: Not all categories have substantial numbers. It looks like combining them can give better results.
# 3. Outlet_Type: Supermarket Type2 and Type3 can be combined. But we should check if that’s a good idea before doing it.
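
For the first observation, one possible cleanup is a plain label mapping. The sketch below runs on a made-up Series (`fat` and `cleaned` are hypothetical names); the label variants are the ones that appear in the frequency output.

```python
import pandas as pd

# Hypothetical sample containing every label variant seen in the output
fat = pd.Series(["Low Fat", "Regular", "LF", "reg", "low fat"])

# Map the mis-coded variants onto the two intended categories
cleaned = fat.replace({"LF": "Low Fat",
                       "low fat": "Low Fat",
                       "reg": "Regular"})

print(cleaned.value_counts())
```

An explicit mapping is preferable to lowercasing everything, because `LF` and `reg` are abbreviations rather than case variants.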

### 3. Data Cleaning 

This step typically involves imputing missing values and treating outliers. Though outlier removal is very important in regression techniques, advanced tree-based algorithms are largely robust to outliers, so I’ll leave it to you to try it out. We’ll focus here on imputation, which is a very important step.
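
If you do want to try the outlier check yourself, one common option (not used in this article) is the 1.5 × IQR rule. The sketch below applies it to five made-up sales values loosely based on the summary statistics shown earlier; the names `sales`, `lower`, `upper` and `outliers` are assumptions.

```python
import pandas as pd

# Hypothetical sales values (rounded from the describe() output above)
sales = pd.Series([33.29, 834.25, 1794.33, 3101.30, 13086.96])

# Interquartile range and the usual 1.5 * IQR fences
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers.tolist())
```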

Note: We’ll be using the Pandas library extensively here. If you’re new to Pandas, it is worth working through an introductory tutorial first.

**Imputing Missing Values**

We found two variables with missing values: Item_Weight and Outlet_Size. Let’s impute the former with the average weight of the particular item. This can be done as:

In [82]:
#Determine the average weight per item (pivot_table aggregates with the mean by default):
item_avg_weight = data.pivot_table(values='Item_Weight', index='Item_Identifier')

#Inspect the average weight for a single item
item_avg_weight.loc['DRA12']



Item_Weight    11.6
Name: DRA12, dtype: float64

In [53]:
#Get a boolean variable specifying missing Item_Weight values
miss_bool = data['Item_Weight'].isnull() 

#miss_bool

In [32]:
#Impute data and check #missing values before and after imputation to confirm
print ('Original #missing: %d'% sum(miss_bool))

Original #missing: 2439


In [84]:
#Look up each missing weight in the per-item averages (map returns one scalar per row)
data.loc[miss_bool,'Item_Weight'] = data.loc[miss_bool,'Item_Identifier'].map(item_avg_weight['Item_Weight'])

In [85]:
print ('Final #missing: %d'% sum(data['Item_Weight'].isnull()))

Final #missing: 0
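
The same per-item imputation can also be written without a separate lookup table. This is an alternative sketch, not the approach used above: it runs on a made-up frame (`demo` and `item_mean` are hypothetical names) and uses `groupby(...).transform('mean')` to broadcast each item's average back onto its rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two items, each with one missing weight
demo = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "DRC01"],
    "Item_Weight": [9.3, np.nan, 5.92, np.nan, 5.92],
})

# Mean weight of each item, aligned row by row with the original frame
item_mean = demo.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill only the gaps; existing weights are left untouched
demo["Item_Weight"] = demo["Item_Weight"].fillna(item_mean)

print(demo["Item_Weight"].isnull().sum())
```

Note that an item with no recorded weight at all would still be missing after this step, so the before/after count remains a useful check.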


This confirms that the column has no missing values now. Let’s impute Outlet_Size with the mode of the Outlet_Size for the particular type of outlet.

In [86]:
#Note: scipy.stats.mode is not imported here. Recent SciPy releases may reject
#string data, so the pandas Series.mode method is used in the next cell instead.

In [88]:
#Determine the mode of Outlet_Size for each Outlet_Type
outlet_size_mode = data.dropna(subset=['Outlet_Size']).pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=lambda x: x.mode().iloc[0])



In [104]:
print ('Mode for each Outlet_Type:')
print("")
outlet_size_mode

Mode for each Outlet_Type:



| Outlet_Type | Grocery Store | Supermarket Type1 | Supermarket Type2 | Supermarket Type3 |
|---|---|---|---|---|
| Outlet_Size | Small | Small | Medium | Medium |


In [92]:
#Get a boolean variable specifying missing Outlet_Size values
miss_bool = data['Outlet_Size'].isnull() 

In [103]:
#Impute data and check #missing values before and after imputation to confirm
print ('Original #missing: %d' % sum(miss_bool))

Original #missing: 4016


In [98]:
#Look up the mode for each row's Outlet_Type (the pivot table has a single 'Outlet_Size' row)
data.loc[miss_bool,'Outlet_Size'] = data.loc[miss_bool,'Outlet_Type'].map(outlet_size_mode.loc['Outlet_Size'])

In [100]:
print (sum(data['Outlet_Size'].isnull()))

0
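
For reference, the mode-based imputation can be reproduced end to end on a tiny made-up frame. This is a sketch under assumptions: `demo`, `size_mode` and `missing` are hypothetical names, and a `groupby` is used in place of the pivot table, which gives the same per-type lookup.

```python
import pandas as pd

# Hypothetical rows: two outlet types, each with one missing size
demo = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type2", "Supermarket Type2"],
    "Outlet_Size": ["Small", "Small", None, "Medium", None],
})

# Most frequent size per outlet type, computed on the known values only
size_mode = (demo.dropna(subset=["Outlet_Size"])
                 .groupby("Outlet_Type")["Outlet_Size"]
                 .agg(lambda s: s.mode().iloc[0]))

# Fill each gap with the mode of that row's outlet type
missing = demo["Outlet_Size"].isnull()
demo.loc[missing, "Outlet_Size"] = demo.loc[missing, "Outlet_Type"].map(size_mode)

print(demo["Outlet_Size"].tolist())
```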


### 4. Feature Engineering 