https://medium.com/analytics-vidhya/feature-engineering-using-featuretools-with-code-10f8c83e5f68

The objective is to build a predictive model that estimates the sales of each product at a particular store. This would help decision-makers at BigMart identify which properties of a product or store play a key role in increasing overall sales.

In [1]:
import featuretools as ft
import numpy as np
import pandas as pd

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
train.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
test.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3


## Cleaning

In [5]:
# test_item_id = test['Item_Identifier'].copy()
# test_outlet_id = test['Outlet_Identifier'].copy()

# target: set it aside so that train and test share the same columns
sales = train['Item_Outlet_Sales'].copy()
train.drop(['Item_Outlet_Sales'], axis=1, inplace=True)

In [6]:
# combining train and test to avoid performing the same steps in two dfs
train_test = pd.concat([train, test], ignore_index=True)

In [7]:
# missing data
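# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
train_test['Item_Weight'] = train_test['Item_Weight'].fillna(train_test['Item_Weight'].mean())
train_test['Outlet_Size'] = train_test['Outlet_Size'].fillna("missing")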
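The two imputation rules used here can be checked on a small frame. The sketch below is only an illustration: `toy` and its values are made up and stand in for `train_test`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_test; the values are invented for illustration.
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 5.2],
    'Outlet_Size': ['Medium', None, 'High', None],
})

# Numeric gap: replace NaN with the mean of the non-missing weights.
toy['Item_Weight'] = toy['Item_Weight'].fillna(toy['Item_Weight'].mean())

# Categorical gap: keep "missing" as its own level instead of guessing a size.
toy['Outlet_Size'] = toy['Outlet_Size'].fillna('missing')

print(toy)
```

Because the mean is computed on the combined frame, test rows contribute to the value used to fill train rows; that matches what the cell above does.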

In [8]:
train.shape

(8523, 11)

In [9]:
test.shape

(5681, 11)

In [10]:
train_test.shape

(14204, 11)

In [11]:
train_test['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [12]:
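# collapse the five spellings into a binary flag: 0 = low fat, 1 = regular
fat_content_dict = {'Low Fat': 0, 'Regular': 1, 'low fat': 0, 'LF': 0, 'reg': 1}
# exact-match lookup; regex=True is not needed and would treat the keys as patterns
train_test['Item_Fat_Content'] = train_test['Item_Fat_Content'].map(fat_content_dict)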
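As a quick sanity check of this encoding, the sketch below applies the same dictionary with `Series.map` to the five spellings listed above and verifies that none of them is left unmapped. The `labels`, `encoded`, and `unmapped` names are illustrative only.

```python
import pandas as pd

# The five spellings reported by unique() above.
labels = pd.Series(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'])
fat_content_dict = {'Low Fat': 0, 'Regular': 1, 'low fat': 0, 'LF': 0, 'reg': 1}

# Exact-match lookup: every label is replaced by its dictionary value.
encoded = labels.map(fat_content_dict)

# map() yields NaN for labels absent from the dict, which exposes unseen spellings.
unmapped = labels[encoded.isna()].unique()
assert len(unmapped) == 0, f'unmapped labels: {unmapped}'

print(encoded.tolist())  # [0, 1, 0, 0, 1]
```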

In [13]:
train_test['id'] = train_test['Item_Identifier'] + train_test['Outlet_Identifier']
train_test.drop(['Item_Identifier'], axis=1, inplace=True)

Next, we have to create an EntitySet. An EntitySet is a structure that contains multiple dataframes and the relationships between them. So, let's create an EntitySet and add the combined dataframe to it.

In [14]:
entity_set = ft.EntitySet(id='sales')

In [15]:
entity_set.entity_from_dataframe(entity_id='bigmart', dataframe=train_test, index='id')

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 11]
  Relationships:
    No relationships

In [38]:
train_test[
    ['Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']]\
    .duplicated().sum()

14194

In [16]:
entity_set.normalize_entity(
    base_entity_id='bigmart', new_entity_id='outlet', index='Outlet_Identifier',
    additional_variables=['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'])

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 7]
    outlet [Rows: 10, Columns: 5]
  Relationships:
    bigmart.Outlet_Identifier -> outlet.Outlet_Identifier
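To make the effect of this normalization step concrete, here is a rough pandas-only sketch of the same idea on four rows taken from `train.head()`. It is not how Featuretools implements `normalize_entity`; it only mimics the resulting table split, and the `toy`, `outlet`, and `bigmart` names are illustrative.

```python
import pandas as pd

# Four rows from train.head(), reduced to a few columns.
toy = pd.DataFrame({
    'id': ['FDA15OUT049', 'DRC01OUT018', 'FDN15OUT049', 'FDX07OUT010'],
    'Item_MRP': [249.8092, 48.2692, 141.618, 182.095],
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT049', 'OUT010'],
    'Outlet_Establishment_Year': [1999, 2009, 1999, 1998],
    'Outlet_Type': ['Supermarket Type1', 'Supermarket Type2',
                    'Supermarket Type1', 'Grocery Store'],
})

outlet_cols = ['Outlet_Establishment_Year', 'Outlet_Type']

# New table: one row per outlet, holding the outlet-level attributes.
outlet = (toy[['Outlet_Identifier'] + outlet_cols]
          .drop_duplicates(subset='Outlet_Identifier')
          .reset_index(drop=True))

# Base table: item-level rows keep only the key that points at the outlet table.
bigmart = toy.drop(columns=outlet_cols)

print(outlet.shape, bigmart.shape)
```

The outlet attributes repeat on every item row, which is why the duplicate count above is 14194: only 10 of the 14204 rows carry a distinct outlet description.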

In [17]:
feature_matrix, feature_names = ft.dfs(entityset=entity_set, target_entity='bigmart',
                                       max_depth=2, verbose=1, n_jobs=1)

Built 37 features
Elapsed: 00:00 | Progress: 100%


In [18]:
feature_matrix.columns

Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
       'Item_MRP', 'Outlet_Identifier', 'outlet.Outlet_Establishment_Year',
       'outlet.Outlet_Size', 'outlet.Outlet_Location_Type',
       'outlet.Outlet_Type', 'outlet.COUNT(bigmart)',
       'outlet.MAX(bigmart.Item_Fat_Content)', 'outlet.MAX(bigmart.Item_MRP)',
       'outlet.MAX(bigmart.Item_Visibility)',
       'outlet.MAX(bigmart.Item_Weight)',
       'outlet.MEAN(bigmart.Item_Fat_Content)',
       'outlet.MEAN(bigmart.Item_MRP)', 'outlet.MEAN(bigmart.Item_Visibility)',
       'outlet.MEAN(bigmart.Item_Weight)',
       'outlet.MIN(bigmart.Item_Fat_Content)', 'outlet.MIN(bigmart.Item_MRP)',
       'outlet.MIN(bigmart.Item_Visibility)',
       'outlet.MIN(bigmart.Item_Weight)', 'outlet.MODE(bigmart.Item_Type)',
       'outlet.NUM_UNIQUE(bigmart.Item_Type)',
       'outlet.SKEW(bigmart.Item_Fat_Content)',
       'outlet.SKEW(bigmart.Item_MRP)', 'outlet.SKEW(bigmart.Item_Visibility)',
       'outlet.SKEW(bi

In [42]:
feature_matrix['outlet.MAX(bigmart.Item_Visibility)'].unique()

array([0.18785023, 0.18832266, 0.31393469, 0.18591302, 0.18665037,
       0.187939  , 0.18861954, 0.18755863, 0.18752316, 0.32839095])
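The column has only ten distinct values because it is an outlet-level aggregate that is copied to every item sold at that outlet. A hedged pandas equivalent of two such features, on toy rows taken from `train.head()`, might look like this (the column names simply imitate the Featuretools naming):

```python
import pandas as pd

toy = pd.DataFrame({
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT049', 'OUT010'],
    'Item_Visibility': [0.016047, 0.019278, 0.01676, 0.0],
})

by_outlet = toy.groupby('Outlet_Identifier')['Item_Visibility']

# Aggregate per outlet, then broadcast the result back to every item row.
toy['outlet.MAX(bigmart.Item_Visibility)'] = by_outlet.transform('max')

# 'count' tallies non-null values; with no NaN here it equals the rows per outlet.
toy['outlet.COUNT(bigmart)'] = by_outlet.transform('count')

print(toy)
```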

In [20]:
feature_matrix = feature_matrix.reindex(index=train_test['id'])
feature_matrix = feature_matrix.reset_index()
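The reindex matters because the later train/test split is positional, so the rows must be back in the original `train_test` order. A minimal sketch of the reorder, assuming a hypothetical feature matrix `fm` whose rows came back shuffled:

```python
import pandas as pd

# Original row order, as in train_test['id'].
original_ids = pd.Series(['FDA15OUT049', 'DRC01OUT018', 'FDN15OUT049'], name='id')

# Hypothetical feature matrix indexed by id, with rows in a different order.
fm = pd.DataFrame(
    {'Item_MRP': [48.2692, 141.618, 249.8092]},
    index=pd.Index(['DRC01OUT018', 'FDN15OUT049', 'FDA15OUT049'], name='id'),
)

# reindex reorders rows by label; reset_index turns the 'id' index back into a column.
fm = fm.reindex(index=original_ids).reset_index()

print(fm)
```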

In [21]:
feature_matrix

,id,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,outlet.Outlet_Establishment_Year,outlet.Outlet_Size,outlet.Outlet_Location_Type,...,outlet.SKEW(bigmart.Item_Visibility),outlet.SKEW(bigmart.Item_Weight),outlet.STD(bigmart.Item_Fat_Content),outlet.STD(bigmart.Item_MRP),outlet.STD(bigmart.Item_Visibility),outlet.STD(bigmart.Item_Weight),outlet.SUM(bigmart.Item_Fat_Content),outlet.SUM(bigmart.Item_MRP),outlet.SUM(bigmart.Item_Visibility),outlet.SUM(bigmart.Item_Weight)
0,FDA15OUT049,9.30,0,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,...,0.790782,0.099024,0.478027,62.144594,0.043924,4.650796,547,218802.9588,91.450099,19844.655
1,DRC01OUT018,5.92,1,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,...,0.783017,0.102602,0.478308,62.022851,0.044489,4.650874,547,217987.3906,92.723425,19794.425
2,FDN15OUT049,17.50,0,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,...,0.790782,0.099024,0.478027,62.144594,0.043924,4.650796,547,218802.9588,91.450099,19844.655
3,FDX07OUT010,19.20,1,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,missing,Tier 3,...,0.776902,0.112759,0.479301,62.010835,0.073604,4.675070,330,130572.7618,94.293418,11768.655
4,NCD19OUT013,8.93,0,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,...,0.759033,0.104392,0.478213,62.140848,0.044005,4.650214,549,219172.4492,93.555174,19859.980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14199,FDB58OUT046,10.50,1,0.013496,Snack Foods,141.3154,OUT046,1997,Small,Tier 1,...,0.779426,0.094235,0.477828,62.106844,0.044303,4.653788,546,218274.0716,93.285857,19860.635
14200,FDD47OUT018,7.60,1,0.142991,Starchy Foods,169.1448,OUT018,2009,Medium,Tier 3,...,0.783017,0.102602,0.478308,62.022851,0.044489,4.650874,547,217987.3906,92.723425,19794.425
14201,NCO17OUT045,10.00,0,0.073529,Health and Hygiene,118.7440,OUT045,2002,missing,Tier 2,...,0.758935,0.098933,0.478168,62.262676,0.044373,4.649769,547,218047.1632,94.284513,19809.905
14202,FDJ26OUT017,15.30,1,0.000000,Canned,214.6218,OUT017,2007,missing,Tier 2,...,0.774783,0.106563,0.477922,62.295513,0.044152,4.655234,544,217561.3500,94.342210,19722.750


## Machine Learning

In [22]:
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

In [23]:
# converting categorical (object-dtype) features to string
categorical_features = feature_matrix.select_dtypes(include='object').columns
feature_matrix[categorical_features] = feature_matrix[categorical_features].astype(str)

In [24]:
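# split the combined df into train and test again (the first 8523 rows are train)
feature_matrix.drop(['id'], axis=1, inplace=True)
# .copy() gives independent frames, so later in-place edits do not act on a slice
train = feature_matrix[:8523].copy()
test = feature_matrix[8523:].copy()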

In [25]:
# reassign instead of dropping in place, which avoids SettingWithCopyWarning on sliced frames
train = train.drop(columns=['Outlet_Identifier'])
test = test.drop(columns=['Outlet_Identifier'])

In [30]:
categorical_features = np.where(train.dtypes == 'object')[0]
categorical_features

array([ 3,  6,  7,  8, 22])

In [31]:
x_train, x_val, y_train, y_val = train_test_split(train, sales, test_size=0.25, random_state=11)

In [32]:
model_cat = CatBoostRegressor(iterations=100, learning_rate=0.3, depth=6, eval_metric='RMSE', random_seed=7)

In [33]:
# use_best_model only takes effect when an eval_set is passed, so it is omitted here
model_cat.fit(x_train, y_train, cat_features=categorical_features)


0:	learn: 1440.1669686	total: 57.3ms	remaining: 5.67s
1:	learn: 1291.2555633	total: 67.2ms	remaining: 3.29s
2:	learn: 1198.9242824	total: 75.3ms	remaining: 2.43s
3:	learn: 1144.4826511	total: 79.1ms	remaining: 1.9s
4:	learn: 1110.1295069	total: 82.8ms	remaining: 1.57s
5:	learn: 1089.8059753	total: 86.3ms	remaining: 1.35s
6:	learn: 1081.0870251	total: 89ms	remaining: 1.18s
7:	learn: 1073.8363039	total: 91.2ms	remaining: 1.05s
8:	learn: 1068.1980841	total: 93.6ms	remaining: 946ms
9:	learn: 1065.3736138	total: 95.5ms	remaining: 860ms
10:	learn: 1062.6591514	total: 97.4ms	remaining: 788ms
11:	learn: 1058.9950763	total: 99.6ms	remaining: 730ms
12:	learn: 1058.2651366	total: 102ms	remaining: 680ms
13:	learn: 1057.5983637	total: 103ms	remaining: 633ms
14:	learn: 1056.4337486	total: 105ms	remaining: 593ms
15:	learn: 1055.8339476	total: 106ms	remaining: 557ms
16:	learn: 1053.4063958	total: 108ms	remaining: 526ms
17:	learn: 1053.3144343	total: 109ms	remaining: 497ms
18:	learn: 1052.0205175	total

<catboost.core.CatBoostRegressor at 0x7f5637c09b70>

In [34]:
# coefficient of determination (R^2) on the validation split
model_cat.score(x_val, y_val)

0.5792825973878382
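A value between 0 and 1 here is consistent with `score` reporting the coefficient of determination (R^2) rather than RMSE, which is what recent CatBoost versions document for regressors. As a hedged sketch of how the two metrics relate, the helper functions below (hypothetical names, toy numbers, not the model's actual predictions) compute both with NumPy:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

def rmse_manual(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print(r2_manual(y_true, y_pred))    # 0.8
print(rmse_manual(y_true, y_pred))  # 0.5
```

With the real model, the same functions could be called as `r2_manual(y_val, model_cat.predict(x_val))` to cross-check the score above.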