## Big Mart Sales Analysis

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model that estimates the sales of each product at a particular store.

*   Item_Identifier: Unique product ID

*   Item_Weight: Weight of product

*   Item_Fat_Content: Whether the product is low fat or not

*   Item_Visibility: The % of total display area of all products in a store allocated to the particular product

*   Item_Type: The category to which the product belongs

*   Item_MRP: Maximum Retail Price (list price) of the product

*   Outlet_Identifier: Unique store ID

*   Outlet_Establishment_Year: The year in which the store was established

*   Outlet_Size: The size of the store in terms of ground area covered

*   Outlet_Location_Type: The type of city in which the store is located

*   Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket

*   Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.

In [1]:
import pandas as pd
import math
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

## Import data

In [2]:
X=pd.read_csv('trainbig.csv')
y=X['Item_Outlet_Sales']
X=X.drop(['Item_Outlet_Sales'],axis=1)

In [3]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
dtypes: float64(3), int64(1), object(7)
memory usage: 732.6+ KB


In [4]:
X['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [5]:
X['Item_Type'].unique()

array(['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables',
       'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods',
       'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned',
       'Breads', 'Starchy Foods', 'Others', 'Seafood'], dtype=object)

In [6]:
X['Outlet_Size'].unique()

array(['Medium', nan, 'High', 'Small'], dtype=object)

In [7]:
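# Fill missing weights with the column mean and missing outlet sizes with 'Medium'.
# Assigning the result back avoids in-place modification of a column selection, which newer pandas versions warn about.
X['Item_Weight'] = X['Item_Weight'].fillna(X['Item_Weight'].mean())
X['Outlet_Size'] = X['Outlet_Size'].fillna('Medium')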
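
The cell above fills `Item_Weight` with the column mean and `Outlet_Size` with the constant `'Medium'`. As a minimal sketch of how to check such a fill, the snippet below counts missing values before and after applying the same strategy. The `toy` frame and its values are invented for illustration and are not taken from `trainbig.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns with gaps; the values are made up.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
})

# Count missing entries per column before imputing.
missing_before = toy.isna().sum()

# Same strategy as the notebook: column mean for weight, a constant for size.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())
toy["Outlet_Size"] = toy["Outlet_Size"].fillna("Medium")

# Count again to confirm nothing is left unfilled.
missing_after = toy.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is worth treating as a baseline rather than a final choice.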

## Fat Content column has irregular data

In [8]:
X['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [9]:
# Map the inconsistent spellings onto the two canonical categories.
X['Item_Fat_Content'] = X['Item_Fat_Content'].replace(['low fat','LF','reg'],['Low Fat','Low Fat','Regular'])
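
The cell above collapses five spellings into two categories. A small sketch of the same idea, using a dictionary so that each variant sits next to its canonical label, is shown here on a standalone series. The name `canonical` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# The five spellings reported by unique() in the notebook.
labels = pd.Series(["Low Fat", "Regular", "low fat", "LF", "reg"])

# Hypothetical mapping from each variant to its canonical label.
canonical = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}

# Values not listed in the mapping are left unchanged.
cleaned = labels.replace(canonical)
print(cleaned.unique())
```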

## One-Hot Encoding

In [10]:
fatcols=pd.get_dummies(X['Item_Fat_Content'], prefix='Item_Fat')
typecols=pd.get_dummies(X['Item_Type'], prefix='Item_Type')
yearcols=pd.get_dummies(X['Outlet_Establishment_Year'], prefix='Year')
sizecols=pd.get_dummies(X['Outlet_Size'], prefix='Size')
loccols=pd.get_dummies(X['Outlet_Location_Type'], prefix='Loc')
otypecols=pd.get_dummies(X['Outlet_Type'], prefix='Loctype')

In [11]:
temp=[fatcols,typecols,yearcols,sizecols,loccols,otypecols]
for df in temp:
    for col in df.columns:
        df[col]=df[col].astype(int)

In [12]:
X=pd.concat([X, fatcols,typecols,yearcols,sizecols,loccols,otypecols],axis=1)

In [13]:
X=X.drop(['Item_Identifier','Outlet_Identifier','Outlet_Establishment_Year','Item_Fat_Content','Item_Type','Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],axis=1)
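
Cells 10 to 13 build the dummy columns one feature at a time, cast them to integers, concatenate them, and drop the originals. As a sketch of a more compact route, `pd.get_dummies` can encode several columns in one call, drop the source columns itself, and emit integers through its `dtype` argument. The toy frame below is invented for illustration.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns; values are made up.
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Size": ["Medium", "High", "Small"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 1"],
})

# Encode both categorical columns at once; the originals are removed
# and the dummies are appended after the remaining columns.
encoded = pd.get_dummies(
    toy,
    columns=["Outlet_Size", "Outlet_Location_Type"],
    prefix=["Size", "Loc"],
    dtype=int,
)
print(encoded.columns.tolist())
```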

In [14]:
X['Item_Weight']=X['Item_Weight']-min(X['Item_Weight'])
X['Item_Visibility']=X['Item_Visibility']-min(X['Item_Visibility'])
X['Item_MRP']=X['Item_MRP']-min(X['Item_MRP'])

In [15]:
X['Item_Weight']=X['Item_Weight']/max(X['Item_Weight'])
X['Item_Visibility']=X['Item_Visibility']/max(X['Item_Visibility'])
X['Item_MRP']=X['Item_MRP']/max(X['Item_MRP'])
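
Cells 14 and 15 together implement min-max scaling, (x - min) / (max - min), using the minimum and maximum of the full data set. A common alternative is to learn those statistics from the training split only, so that nothing about the test rows enters preprocessing. The sketch below shows this with scikit-learn's `MinMaxScaler` on synthetic data; the array `features` and its value ranges are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for three numeric columns with different ranges.
rng = np.random.default_rng(0)
features = rng.uniform(low=[4.0, 0.0, 30.0], high=[22.0, 0.3, 270.0], size=(100, 3))

# Split first, so the scaler never sees the held-out rows.
train, test = train_test_split(features, test_size=0.1, random_state=0)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from train only
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

With this approach the scaled test values can fall slightly outside the [0, 1] interval, which is expected.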

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 40 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Item_Weight                      8523 non-null   float64
 1   Item_Visibility                  8523 non-null   float64
 2   Item_MRP                         8523 non-null   float64
 3   Item_Fat_Low Fat                 8523 non-null   int64  
 4   Item_Fat_Regular                 8523 non-null   int64  
 5   Item_Type_Baking Goods           8523 non-null   int64  
 6   Item_Type_Breads                 8523 non-null   int64  
 7   Item_Type_Breakfast              8523 non-null   int64  
 8   Item_Type_Canned                 8523 non-null   int64  
 9   Item_Type_Dairy                  8523 non-null   int64  
 10  Item_Type_Frozen Foods           8523 non-null   int64  
 11  Item_Type_Fruits and Vegetables  8523 non-null   int64  
 12  Item_Type_Hard Drink

In [17]:
xtrain, xtest, ytrain, ytest=train_test_split(X,y,random_state=0,test_size=0.1)


In [18]:
reg = LinearRegression()
reg.fit(xtrain, ytrain)
preds=reg.predict(xtest)
print("rms error = ",math.sqrt(mean_squared_error(ytest,preds)))
print('R2 score: %.2f' % r2_score(ytest, preds))

rms error =  1142.6865657688804
R2 score: 0.56
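
The two metrics printed above can be checked by hand on a tiny example. The sketch below computes RMSE and R2 from their definitions and compares them with the scikit-learn functions used in the notebook. The numbers in `y_true` and `y_pred` are invented.

```python
import math
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# RMSE: square root of the mean squared residual.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
rmse_manual = math.sqrt(sum(squared_errors) / len(y_true))

# R2: one minus residual sum of squares over total sum of squares.
mean_true = sum(y_true) / len(y_true)
ss_total = sum((t - mean_true) ** 2 for t in y_true)
r2_manual = 1 - sum(squared_errors) / ss_total

rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(rmse_manual, rmse_sklearn, r2_manual, r2_sklearn)
```

RMSE is expressed in the units of `Item_Outlet_Sales`, while R2 is unitless, which is why the notebook reports both.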


In [19]:
model = Lasso(alpha=0.5)
model.fit(xtrain, ytrain)
preds=model.predict(xtest)
print("rms error = ",math.sqrt(mean_squared_error(ytest,preds)))
print('R2 score: %.2f' % r2_score(ytest, preds))

rms error =  1142.0295275762949
R2 score: 0.56
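
Both linear models are scored on a single 10% hold-out split, so the reported figures depend on which rows landed in that split. A minimal sketch of k-fold cross-validation, which averages the score over several splits, follows. It runs on synthetic data from `make_regression` rather than on the Big Mart features, and the dictionary `results` is a hypothetical name.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the real features.
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

results = {}
for name, estimator in [("linear", LinearRegression()), ("lasso", Lasso(alpha=0.5))]:
    # Five folds: each row is used for testing exactly once.
    results[name] = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    print(name, "mean R2 = %.3f, std = %.3f" % (results[name].mean(), results[name].std()))
```

The standard deviation across folds gives a rough sense of how much a difference such as 1142.69 versus 1142.03 could be due to the split alone.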


In [20]:
depths=[1,2,3,4,5,6,7]
for d in depths:
    # 'squared_error' is the current name of the former 'mse' criterion, which scikit-learn 1.2 removed
    dtr = DecisionTreeRegressor(criterion='squared_error', max_depth=d)
    dtr.fit(xtrain, ytrain)
    preds=dtr.predict(xtest)
    print("rms error (depth=",d,") = ",math.sqrt(mean_squared_error(ytest,preds)))
    print('R2 score: %.2f' % r2_score(ytest, preds))
    

rms error (depth= 1 ) =  1513.262029862781
R2 score: 0.24
rms error (depth= 2 ) =  1284.6128570163416
R2 score: 0.45
rms error (depth= 3 ) =  1183.6971610287283
R2 score: 0.53
rms error (depth= 4 ) =  1115.654158683325
R2 score: 0.58
rms error (depth= 5 ) =  1100.389475745765
R2 score: 0.60
rms error (depth= 6 ) =  1095.967872301307
R2 score: 0.60
rms error (depth= 7 ) =  1114.3007726385044
R2 score: 0.59
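
The loop above picks the depth by looking at test-set error, which makes the chosen depth slightly optimistic. As a sketch of an alternative, `GridSearchCV` can run the same depth sweep with cross-validation on the training data only. The example uses synthetic data, and the fixed `random_state` is an addition for reproducibility that the notebook does not use.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features.
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

# Same depth grid as the notebook, scored by cross-validated RMSE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6, 7]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so the RMSE is stored with a negative sign.
best_depth = search.best_params_["max_depth"]
best_rmse = -search.best_score_
print(best_depth, best_rmse)
```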


In [21]:
rfr = RandomForestRegressor(n_estimators=200,max_depth=6)
rfr.fit(xtrain, ytrain)
preds=rfr.predict(xtest)
print("rms error = ",math.sqrt(mean_squared_error(ytest,preds)))
print('R2 score: %.2f' % r2_score(ytest, preds))

rms error =  1088.3428175374975
R2 score: 0.60
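
The forest above is built without a fixed `random_state`, so its bootstrap samples differ between runs and the reported error can shift slightly each time. A minimal sketch on synthetic data shows that fixing the seed makes two fits produce identical predictions. The seed value 0 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem standing in for the real features.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

# Two forests with the same hyperparameters and the same seed.
forest_a = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0)
forest_b = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0)

preds_a = forest_a.fit(X_demo, y_demo).predict(X_demo)
preds_b = forest_b.fit(X_demo, y_demo).predict(X_demo)

# With identical seeds the two forests agree exactly.
print(np.array_equal(preds_a, preds_b))
```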
