# Bigmart Sales Analysis
This notebook works with transaction records from a retail store chain. The training data has 8523 rows and 12 variables, and the goal is to predict the sales of each product in each store. A sample test data set is available
here: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

Variable - Description

Item_Identifier - Unique product ID

Item_Weight - Weight of product

Item_Fat_Content - Whether the product is low fat or not

Item_Visibility - The % of total display area of all products in a store allocated to the particular product

Item_Type - The category to which the product belongs

Item_MRP - Maximum Retail Price (list price) of the product

Outlet_Identifier - Unique store ID

Outlet_Establishment_Year - The year in which store was established

Outlet_Size - The size of the store in terms of ground area covered

Outlet_Location_Type - The type of city in which the store is located

Outlet_Type - Whether the outlet is just a grocery store or some sort of supermarket

Item_Outlet_Sales - Sales of the product in the particular store. This is the outcome variable to be predicted.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
Train = pd.read_csv("Train.csv",header=None)
Test = pd.read_csv("Test.csv",header=None)

In [3]:
headers = ['Item_Identifier','Item_Weight','Item_Fat_Content','Item_Visibility','Item_Type','Item_MRP','Outlet_Identifier','Outlet_Establishment_Year','Outlet_Size','Outlet_Location_Type','Outlet_Type','Item_Outlet_Sales']

In [4]:
Train.columns = headers
Test.columns = headers[:11]

In [5]:
Train['Source'] = 'Train'
Test['Source'] = 'Test'
final_df = Test[['Item_Identifier','Outlet_Identifier']].copy()

In [6]:
Data = pd.concat([Train,Test],ignore_index=True)

In [7]:
Data['Outlet_Establishment_Year'] = Data['Outlet_Establishment_Year'].apply(lambda x: 2017 - x)

In [8]:
Data['Item_Fat_Content'].replace('LF','Low',inplace = True)

In [9]:
Data['Item_Fat_Content'].replace('low fat','Low',inplace = True)

In [10]:
Data['Item_Fat_Content'].replace('reg','Regular',inplace = True)

Replace missing values in Item_Weight with the column mean

In [11]:
Item_Weight_Mean = Data['Item_Weight'].mean(axis=0)

In [12]:
Data['Item_Weight'].replace(np.nan,Item_Weight_Mean, inplace = True)
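
The mean imputation above can be sketched on a small made-up frame. This is only an illustration under assumed toy values, using `fillna` and assignment rather than an in-place `replace`; the names `toy` and `weight_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined frame; the weights are made up for illustration.
toy = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.2]})

# mean() skips NaN by default, so only the three known weights are averaged.
weight_mean = toy["Item_Weight"].mean()

# fillna returns a new Series; assigning it back avoids in-place edits on a column.
toy["Item_Weight"] = toy["Item_Weight"].fillna(weight_mean)

print(toy)
```

Every missing weight receives the same value, so this keeps the column mean unchanged but shrinks its variance.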

Treat zero values of Item_Visibility as missing, then replace them with the mean of the non-zero values

In [13]:
Data['Item_Visibility'].replace(0,np.nan,inplace = True)

In [14]:
Item_Visibility_Mean = Data['Item_Visibility'].mean(axis = 0)

In [15]:
Data['Item_Visibility'].replace(np.nan,Item_Visibility_Mean,inplace = True)
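
The order of the three steps matters: the zeros are turned into NaN first so that they do not pull the mean down. A minimal sketch on made-up visibility values (the frame `toy` and the numbers are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy visibility values; a value of 0 is treated as "not recorded".
toy = pd.DataFrame({"Item_Visibility": [0.02, 0.0, 0.06, 0.0, 0.04]})

# Step 1: mark zeros as missing so they are excluded from the mean.
vis = toy["Item_Visibility"].replace(0, np.nan)

# Step 2: mean of the remaining non-zero values.
vis_mean = vis.mean()

# Step 3: fill the former zeros with that mean.
toy["Item_Visibility"] = vis.fillna(vis_mean)

print(toy)
```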

Replace Item_Type with the first two characters of Item_Identifier (this reduces the number of types from 16 to 3)

In [16]:
Data['Item_Type'] = Data['Item_Identifier'].apply(lambda x : x[0:2])
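
The same prefix extraction can be written with the vectorised `.str` accessor. The sketch below uses made-up identifiers, and the readable labels in `prefix_labels` are an assumption about what the prefixes stand for, not something stated in this notebook.

```python
import pandas as pd

# Toy identifiers in the same style as the data set (illustrative only).
toy = pd.DataFrame({"Item_Identifier": ["FDW58", "NCN55", "DRI11", "FDQ58"]})

# First two characters of each identifier, equivalent to apply(lambda x: x[0:2]).
toy["Item_Type"] = toy["Item_Identifier"].str[:2]

# Hypothetical human-readable labels for the prefixes (an assumption).
prefix_labels = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
toy["Item_Type_Label"] = toy["Item_Type"].map(prefix_labels)

print(toy)
```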

Replace missing values in Outlet_Size with the most common size for the same Outlet_Type

In [17]:
from scipy.stats import mode

In [18]:
Outlet_Size_mode = Data.pivot_table(values = 'Outlet_Size',columns = 'Outlet_Type', aggfunc = (lambda x:x.mode().iat[0]))

In [19]:
miss_bool = Data['Outlet_Size'].isnull()

In [20]:
Data.loc[miss_bool,'Outlet_Size'] = Data.loc[miss_bool,'Outlet_Type'].apply(lambda x: Outlet_Size_mode[x])
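
The idea of filling each missing Outlet_Size with the most frequent size for its Outlet_Type can also be expressed with `groupby` and `map`. This is an alternative sketch on made-up rows, not the pivot-table approach used above; `toy` and `size_mode` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy outlets; two sizes are missing.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", np.nan, "Medium", np.nan, "Medium"],
})

# Most frequent non-missing size per outlet type (mode() ignores NaN).
# Note: a type whose sizes are all missing would raise here, since its mode is empty.
size_mode = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iat[0])

# Look up the mode by outlet type, but only for rows where the size is missing.
missing = toy["Outlet_Size"].isnull()
toy.loc[missing, "Outlet_Size"] = toy.loc[missing, "Outlet_Type"].map(size_mode)

print(toy)
```

Because `size_mode` is a Series indexed by Outlet_Type, `map` returns one scalar per row, which keeps the assignment aligned with the rows selected by `missing`.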

Convert categorical variables to numerical ones using dummy columns

In [21]:
dummies = ['Item_Fat_Content','Item_Type','Outlet_Location_Type','Outlet_Size','Outlet_Type']

In [22]:
Data = pd.get_dummies(Data, columns = dummies)
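
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame with one categorical column. The values are illustrative; only the column names are borrowed from the data set.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Size": ["Medium", "Small", "High"],
})

# Each category becomes its own indicator column named <column>_<category>;
# columns not listed in `columns` are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["Outlet_Size"])

print(encoded.columns.tolist())
```

Each row has exactly one indicator set among the new columns, and the original `Outlet_Size` column is dropped from the result.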

Drop the identifier columns, which are not used as features

In [23]:
Data.drop(['Outlet_Identifier','Item_Identifier'],axis=1, inplace=True)

Split the combined DataFrame back into train and test

In [24]:
# .copy() makes independent frames, so the in-place drops below do not
# trigger pandas' SettingWithCopyWarning.
Train = Data.loc[Data['Source']=='Train'].copy()
Test = Data.loc[Data['Source']=='Test'].copy()

In [25]:
Train.drop('Source', axis = 1, inplace = True)
Test.drop('Source', axis = 1, inplace = True)


In [26]:
x_train = np.array(Train.drop(['Item_Outlet_Sales'],axis=1))
y_train = np.array(Train['Item_Outlet_Sales'])

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [28]:
lr = LinearRegression(normalize = True)

In [29]:
lr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

In [30]:
lr.intercept_

674.1391857212407

In [31]:
lr.coef_

array([ 1.55630512e+01, -2.28198823e+02, -5.31490410e-01, -3.05858007e+01,
       -2.51266580e+01, -1.72419419e+01,  2.34181541e+01, -3.96478514e+00,
        1.31860617e+01, -1.52829669e+01,  2.04759834e+02,  4.05715666e+01,
       -2.10516511e+02,  6.08230705e+02, -1.07973150e+02, -1.44082824e+02,
       -1.56147946e+03, -3.73376904e+01, -3.01636970e+02,  2.15948767e+03])

In [32]:
y_train_pred = lr.predict(x_train)

In [33]:
# mean_squared_error returns the MSE (squared units), not the RMSE
mse = metrics.mean_squared_error(y_train,y_train_pred)

In [34]:
mse

1272540.3426513916
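
`metrics.mean_squared_error` returns the mean squared error, so the figure above is in squared sales units; its square root, roughly 1128, is the error in the units of Item_Outlet_Sales. A minimal sketch of the relationship on made-up numbers:

```python
import numpy as np
from sklearn import metrics

# Toy targets and predictions (illustrative values only).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Mean of the squared errors: (10^2 + 10^2 + 30^2) / 3.
mse = metrics.mean_squared_error(y_true, y_pred)

# RMSE is the square root of the MSE, back in the original units.
rmse = np.sqrt(mse)

print(mse, rmse)
```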

In [35]:
y_train_pred

array([4058.17643045,  572.44986169, 2369.87036589, ..., 1414.25092926,
       1414.02641472, 1220.94532403])

In [36]:
output_df = pd.DataFrame(y_train_pred)

In [37]:
final_df['Outlet_Sales'] = output_df

In [38]:
final_df

Unnamed: 0,Item_Identifier,Outlet_Identifier,Outlet_Sales
0,FDW58,OUT049,4058.176430
1,FDW14,OUT017,572.449862
2,NCN55,OUT010,2369.870366
3,FDQ58,OUT017,1021.240437
4,FDY38,OUT027,901.875118
...,...,...,...
5676,FDB58,OUT046,2224.449312
5677,FDD47,OUT018,4018.980412
5678,NCO17,OUT045,2631.990476
5679,FDJ26,OUT017,3138.364564
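
Note that `y_train_pred` was computed from `x_train`, so the values above are training-set predictions matched to the test identifiers by row index. A submission would normally be built from predictions on the test features instead. The sketch below shows that pattern on a tiny made-up data set; the frames `train` and `test`, the single feature, and the name `submission` are assumptions for illustration. In this notebook the test features would be the `Test` frame with its empty `Item_Outlet_Sales` column dropped.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data with a perfectly linear relationship (sales = 10 * MRP).
train = pd.DataFrame({
    "Item_MRP": [50.0, 100.0, 150.0, 200.0],
    "Item_Outlet_Sales": [500.0, 1000.0, 1500.0, 2000.0],
})

# Toy test data: identifiers plus the same feature, but no target column.
test = pd.DataFrame({"Item_Identifier": ["A", "B"], "Item_MRP": [120.0, 80.0]})

# Fit on the training features and target.
model = LinearRegression().fit(train[["Item_MRP"]], train["Item_Outlet_Sales"])

# Predict on the TEST features, so there is one prediction per test row.
submission = test[["Item_Identifier"]].copy()
submission["Item_Outlet_Sales"] = model.predict(test[["Item_MRP"]])

print(submission)
```

Checking that the number of predictions equals the number of test rows is a cheap guard against mixing up the two sets.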


In [39]:
final_df.to_csv("output.csv")