Walmart is a retail chain headquartered in the USA, and it now operates stores all over the world. Predicting future sales for a company like Walmart is one of the most important aspects of strategic planning: based on the sales forecast, they can make informed short-term and long-term decisions. The forecast also helps with hiring contract employees in line with expected demand, and it helps each Walmart store operate more efficiently.
Walmart runs many stores in different locations, and stores in different locations can show different sales patterns. To identify these patterns and predict future sales, we need a complete solution, and a machine learning approach is a natural fit. With a trained model, predicting a store's sales number is no longer a tedious task and saves a lot of human time.
- Part-1: Kaggle Data
- Part-2: Exploratory Data Analysis (EDA)
- Part-3: Data Pre-Processing & Feature Extraction
- Part-4: Machine Learning Regression Models
- Part-5: Evaluation of the Models
- Part-6: Future Work
We are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you are tasked with predicting the department-wide sales for each store.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
stores.csv
This file contains anonymized information about the 45 stores, indicating the type and size of the store.
train.csv
This is the historical training data covering 2010–02–05 to 2012–11–01. Within this file you will find the following fields:
- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week
test.csv
This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
features.csv
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
- Store - the store number
- Date - the week
- Temperature - the average temperature in the region
- Fuel_Price - the cost of fuel in the region
- MarkDown1–5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011 and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week
There are 4 holidays mentioned in the dataset:
- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
The performance metric of this competition is the Weighted Mean Absolute Error (WMAE). As per the Kaggle page, a week that contains a holiday gets a weight of 5, otherwise 1. That means Walmart is more focused on holiday weeks and wants the error on those weeks to be as small as possible, so we will pay special attention to them as well.
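For reference, as defined on the Kaggle page, the metric is

WMAE = (1 / Σ wᵢ) · Σ wᵢ · |yᵢ − ŷᵢ|

where yᵢ is the actual weekly sales, ŷᵢ is the predicted weekly sales, and the weight wᵢ is 5 if the week contains a holiday and 1 otherwise. The calculate_WMAE function in the evaluation section implements exactly this.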
First, let's load all the CSV files.
#Imports (pandas, numpy, seaborn, matplotlib, plotly and tabulate are used throughout this post)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from tabulate import tabulate
#Sales Data (parse Date as datetime so the .dt accessors used later work)
train_data = pd.read_csv("train/train.csv", parse_dates=["Date"])
test_data = pd.read_csv("test/test.csv", parse_dates=["Date"])
#Features.csv
feature_data = pd.read_csv("features.csv", parse_dates=["Date"])
#Stores.csv
store_data = pd.read_csv("stores.csv")
Currently, we have multiple data frames. Let's join those data frames.
#Join The Train data
train_all = train_data.set_index("Store").join(store_data.set_index("Store"))
train_all.reset_index(inplace=True)
#Drop the duplicate IsHoliday column from features before joining to avoid a column overlap
train_all = train_all.set_index(["Store","Date"]).join(feature_data.drop(columns=["IsHoliday"]).set_index(["Store","Date"]))
train_all.reset_index(inplace=True)
#Train data size
print(train_all.shape)
train_all.head()
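The test data frame (test_all) used below can be built in the same way; a minimal sketch mirroring the train join:
#Join the Test data with the store and feature information
test_all = test_data.set_index("Store").join(store_data.set_index("Store"))
test_all.reset_index(inplace=True)
test_all = test_all.set_index(["Store","Date"]).join(feature_data.drop(columns=["IsHoliday"]).set_index(["Store","Date"]))
test_all.reset_index(inplace=True)
print(test_all.shape)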
Now, check the number of NaN values for each column for both Train and Test data.
Train Data
print(train_all.isnull().sum())
print(train_all.isnull().sum()*100.0/train_all.shape[0])
Test Data
print(test_all.isnull().sum())
print(test_all.isnull().sum()*100.0/test_all.shape[0])
- More than 60 per cent of the markdown values are NULL. As per the competition page, markdowns are only available after Nov 2011, so the markdowns for the data before Nov 2011 can be treated as zero.
- For about one-third of the test data, CPI and Unemployment are missing. These depend on location and time but do not change drastically, so we can impute the store-wise mean of these columns.
Let's fill in the NaN values -
#Replace CPI & Unemployment with the store-wise mean from the train data
store_cpi = train_all.groupby('Store')['CPI'].mean()
store_unemployment = train_all.groupby('Store')['Unemployment'].mean()
test_all['CPI'] = test_all['CPI'].fillna(test_all['Store'].map(store_cpi))
test_all['Unemployment'] = test_all['Unemployment'].fillna(test_all['Store'].map(store_unemployment))
#Replace the Markdowns with zero
train_all.fillna(0, inplace=True)
test_all.fillna(0, inplace=True)
sns.set_style("whitegrid")
ax = sns.boxplot(x='Type', y='Size', data=train_all).set_title("Box Plot of Type vs Size")
- Store size varies a lot between the different store types.
train_all_new = train_all.sort_values('Weekly_Sales',ascending=False)
train_all_new = train_all_new.head(50)
train_all_new['Week'] = train_all_new['Date'].dt.week
train_all_new['Month'] = train_all_new['Date'].dt.month
train_all_new['Year'] = train_all_new['Date'].dt.year
#Print the top 50 rows (highest weekly sales)
print(tabulate(train_all_new[['Store','Dept','Size','Type','Week','Month','Year','Weekly_Sales']], headers='keys', tablefmt='psql', showindex=False))
- The highest weekly sales occur in the Thanksgiving week and the week before Christmas.
- Most of these top sales come from departments 7 and 72.
avg_sales = train_all.groupby('Date')[['Weekly_Sales']].mean().reset_index()
fig = plt.figure(figsize=(18,6))
plt.plot(avg_sales['Date'], avg_sales['Weekly_Sales'])
plt.xlabel("Time --->", fontsize=16)
plt.ylabel("Weekly Sales --->", fontsize=16)
plt.title("Average weekly sales over time", fontsize=18)
plt.show()
avg_sales = train_all.groupby('Date')[['Weekly_Sales']].agg(['mean','median']).reset_index()
avg_sales['Week'] = avg_sales['Date'].dt.week
avg_sales_2010 = avg_sales[avg_sales['Date'].dt.year == 2010]
avg_sales_2011 = avg_sales[avg_sales['Date'].dt.year == 2011]
avg_sales_2012 = avg_sales[avg_sales['Date'].dt.year == 2012]
#Plot the model
fig = plt.figure(figsize=(18,6))
plt.plot(avg_sales_2010['Week'], avg_sales_2010['Weekly_Sales']['mean'], label='Year 2010 Mean Sales')
plt.plot(avg_sales_2010['Week'], avg_sales_2010['Weekly_Sales']['median'], label='Year 2010 Median Sales')
plt.plot(avg_sales_2011['Week'], avg_sales_2011['Weekly_Sales']['mean'], label='Year 2011 Mean Sales')
plt.plot(avg_sales_2011['Week'], avg_sales_2011['Weekly_Sales']['median'], label='Year 2011 Median Sales')
plt.plot(avg_sales_2012['Week'], avg_sales_2012['Weekly_Sales']['mean'], label='Year 2012 Mean Sales')
plt.plot(avg_sales_2012['Week'], avg_sales_2012['Weekly_Sales']['median'], label='Year 2012 Median Sales')
plt.xticks(np.arange(1, 53, step=1))
plt.grid(axis='both',color='grey', linestyle='--', linewidth=1)
plt.legend()
plt.xlabel("Week --->", fontsize=16)
plt.ylabel("Weekly Sales --->", fontsize=16)
plt.title("Average/Median weekly sales of the Different Week Over a year", fontsize=18)
- Sales from late November through December are much higher than in a normal week.
- Sales drop to their minimum right after December.
- We can see a spike for the Super Bowl week.
- In January, the average sales are at their lowest.
Now, let's look at the weekly sales pattern of some departments.
avg_sales_depart = train_all.groupby(['Date','Dept'])[['Weekly_Sales']].mean().reset_index()
#PLot for department 1 to 10
fig = plt.figure(figsize=(18,6))
for depart in range(1,11):
avg_sales_depa_curr = avg_sales_depart[avg_sales_depart['Dept'] == depart]
plt.plot(avg_sales_depa_curr['Date'], avg_sales_depa_curr['Weekly_Sales'], label='Department ' + str(depart))
plt.xlabel("Time --->", fontsize=16)
plt.ylabel("Weekly Sales --->", fontsize=16)
plt.title("Average weekly sales over time for department 1 to department 10",, fontsize=18)
plt.legend()
plt.show()
Similarly, we can plot the same data for departments 11 to 20 and 21 to 30.
- Each Department has its own behaviour.
- For some departments, the sales pattern is similar to the average pattern across all departments.
- We should have different models for different departments.
Here I have decided to train a separate model for each store and department.
In the plot below, the average weekly sales are plotted per store; the size of the bubble denotes the size of the store and the colour denotes its type.
avg_sales = train_all.groupby(['Store','Size', 'Type'])[['Weekly_Sales']].mean().reset_index()
fig = px.scatter(avg_sales, x="Store", y="Weekly_Sales",
size="Size", color="Type",
hover_name="Store", log_y=True,
size_max=60, title="Average Weekly Sales number with size and category of the store")
fig.show()
- Some stores are in the A or B category even though they are smaller than any C category store.
- Broadly, we can see that larger stores tend to have higher sales numbers.
weekly_sales_dept = train_all.groupby(["Store","IsHoliday"])[["Weekly_Sales"]].mean().reset_index()
plt.figure(figsize=(20,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Store", y="Weekly_Sales",hue="IsHoliday", data=weekly_sales_dept)
plt.title('Average Sales per Store - Holiday vs Non-Holiday Weeks', fontsize=18)
plt.ylabel('Sales Number', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()
Here we compare the average sales of each store in holiday weeks and non-holiday weeks.
- Most stores have higher sales numbers in holiday weeks.
weekly_sales_dept = train_all.groupby(["Store", 'Type'])[["Weekly_Sales"]].mean().reset_index()
weekly_sales_total = weekly_sales_dept["Weekly_Sales"].sum()
weekly_sales_dept['Percentage'] = weekly_sales_dept['Weekly_Sales']/weekly_sales_total*100
weekly_sales_dept = weekly_sales_dept.sort_values('Percentage', ascending=False).reset_index()
weekly_sales_dept['Store Number'] = weekly_sales_dept['Store'].astype(str)
plt.figure(figsize=(24,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Store", y="Percentage", data=weekly_sales_dept, palette='deep',order=weekly_sales_dept['Store'],hue='Type', dodge=False )
for p in ax.patches:
ax.annotate(str(format(p.get_height(), '.2f'))+"%",
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center',
size=12,
xytext = (0, -20),
rotation=90,
weight='bold',
color='white',
textcoords = 'offset points')
plt.title('Percentage of Total Average Sales per Store', fontsize=18)
plt.ylabel('Percentage of Average Sales', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()
- Some A category stores have lower average sales than some B or C category stores.
- However, most A category stores have higher average sales than B or C category stores.
weekly_sales_dept = train_all.groupby(["Dept","IsHoliday"])[["Weekly_Sales"]].mean().reset_index()
plt.figure(figsize=(20,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Dept", y="Weekly_Sales",hue="IsHoliday", data=weekly_sales_dept)
plt.title('Average Sales - per Department', fontsize=18)
plt.grid(axis='y',color='gray', linestyle='--', linewidth=1)
plt.ylabel('Sales Number', fontsize=16)
plt.xlabel('Departments', fontsize=16)
plt.show()
Here we compare the average sales of each department in holiday weeks and non-holiday weeks.
- For some departments, the increase in holiday-week sales is much bigger than for others.
- But for most departments, holiday-week sales stay within the range of the average sales.
Let's find the number of distinct departments in each store.
store_Dept = train_all.groupby(['Store'])[['Dept']].nunique().reset_index().sort_values('Dept', ascending=False)
store_Dept.head()
- Here we have a critical observation: the number of departments varies from store to store. If we have a model for each store and department, there may be an issue if a new department is introduced in the future.
- We have to check whether this scenario is present in our test data.
The number of distinct Store-Department combinations in the training data is 3331, and in the test data it is 3169. Between these two lists, 3158 combinations are common. That means there are 11 Store-Department combinations that appear in the test data but are not available in the train data.
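These counts can be verified with a quick check along these lines:
train_combo = train_all[['Store', 'Dept']].drop_duplicates()
test_combo = test_all[['Store', 'Dept']].drop_duplicates()
common = train_combo.merge(test_combo, on=['Store', 'Dept'])
print(len(train_combo), len(test_combo), len(common))   #3331 3169 3158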
Let's find the average weekly sales for each week of each month. In the plots, months are shown as numbers, so Month 1 is January.
train_all['Month'] = train_all['Date'].dt.month
train_all['Week'] = train_all['Date'].dt.week
train_all['Week_Of_Month'] = (np.floor(train_all['Date'].dt.day/7+1))
avg_sale_week_month = train_all.groupby(['Month','Week_Of_Month'])[['Weekly_Sales']].mean().reset_index()
plt.figure(figsize=(20,12))
for k in range(1,13):
ax = plt.subplot(3,4,k)
avg_sale_curr = avg_sale_week_month[avg_sale_week_month['Month'] == k]
sns.set_style("whitegrid")
ax1 = sns.barplot(x="Week_Of_Month", y="Weekly_Sales", data=avg_sale_curr,ax=ax)
ax1.set_title("Average sales on month " + str(k));
plt.show()
- For most months, the first week has higher sales than the last week of that month. I added a feature for the week of the month.
- November behaves differently from the other months: its last week contains Thanksgiving, and the sales number is high for that week.
- In December, the Christmas holiday falls in the last week of the year, but sales in that week are fairly low; people do their Christmas shopping in the prior week.
# Find all records for last 2 weeks
train_all['Year'] = train_all['Date'].dt.year
train_all_Dec = train_all[(train_all['Month'] == 12) & ((train_all['Week'] == 51) | (train_all['Week'] == 52))]
train_all_Dec_grp1 = train_all_Dec.groupby('Date')[['Weekly_Sales']].mean().reset_index()
train_all_Dec_grp1['Date2'] = train_all_Dec_grp1['Date'].apply(lambda x : str(x.year) + "-" + str(x.month)+"-"+str(x.day))
sns.set_style("whitegrid")
ax = sns.barplot(x="Date2", y="Weekly_Sales", data=train_all_Dec_grp1)
for index, row in train_all_Dec_grp1.iterrows():
ax.text(row.name,row.Weekly_Sales, round(row.Weekly_Sales,2), color='black', ha="center")
plt.title('Average Sales of last 2 weeks of the year', fontsize=18)
plt.grid(axis='y',color='gray', linestyle='--', linewidth=1)
plt.ylabel('Sales Number', fontsize=16)
plt.xlabel('Date', fontsize=16)
plt.show()
- In 2010 there were no days between the second-to-last week and Christmas day, but in 2011 there was one such day. This is one of the reasons why the last-week sales number is bigger in 2011 than in 2010.
- The combined sales of the last two weeks of the year are similar from year to year.
train_all['Year'] = train_all['Date'].dt.year
train_all['Week'] = train_all['Date'].dt.week
all_holiday_week = train_all[train_all['IsHoliday'] == True]['Week'].unique()
train_all_filtered = train_all[train_all['Week'].isin(all_holiday_week) | (train_all['Week']+1).isin(all_holiday_week) |(train_all['Week'] == 1) | (train_all['Week']-1).isin(all_holiday_week)]
filtered_week_sum = train_all_filtered.groupby('Week')[['Weekly_Sales']].mean().reset_index()
#train_all_filtered['Week'].unique()
all_holiday = ['Super Bowl','Labour Day', 'ThanksGiving', 'Christmas']
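#NOTE: labelHoliday and preOrPost are not defined in the snippets above; the
#following is a hypothetical sketch of what they might look like, assuming the
#ISO week numbers of the holidays (Super Bowl ~6, Labour Day ~36,
#Thanksgiving ~47, Christmas ~52)
def labelHoliday(row):
    #Map a week number to the holiday it belongs to (holiday, pre or post week)
    if row['Week'] in (5, 6, 7):
        return 'Super Bowl'
    elif row['Week'] in (35, 36, 37):
        return 'Labour Day'
    elif row['Week'] in (46, 47, 48):
        return 'ThanksGiving'
    else:
        return 'Christmas'
def preOrPost(row):
    #Tag the week as the holiday week itself, the week before or the week after
    if row['Week'] in (6, 36, 47, 52):
        return 'HolidayWeek'
    elif row['Week'] in (5, 35, 46, 51):
        return 'PreWeek'
    else:
        return 'PostWeek'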
filtered_week_sum['HolidayCat'] = filtered_week_sum.apply(labelHoliday,axis=1)
filtered_week_sum['PreOrPost'] = filtered_week_sum.apply(preOrPost,axis=1)
#Taking the pivot
train_pivot = filtered_week_sum.pivot(index='HolidayCat', columns='PreOrPost', values='Weekly_Sales').reset_index()
train_pivot['Percentage_Sale_Increase_From_PreviousWeek'] = train_pivot.apply(lambda x: 100*(x['HolidayWeek'] - x['PreWeek'])/x['HolidayWeek'], axis=1)
train_pivot['Percentage_Sale_Decrease_In_PostWeek'] = train_pivot.apply(lambda x: 100*(x['HolidayWeek'] - x['PostWeek'])/x['HolidayWeek'], axis=1)
print(tabulate(train_pivot[['HolidayCat','Percentage_Sale_Increase_From_PreviousWeek','Percentage_Sale_Decrease_In_PostWeek']], headers='keys', tablefmt='psql', showindex=False))
- For Christmas Week, people do their shopping in the prior week.
- Thanksgiving week has an almost 30% sales jump compared to the previous week and an almost 25% drop in the week after.
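The plots below use a helper, CalculateAverege, which is not shown in this post. The following is only a hypothetical sketch of what such a helper might do, inferred from the axis labels and the observations that follow: for the chosen feature column it finds the month whose average value is the highest / lowest / closest to the median, plots the month-over-month percentage change of that feature and of the average weekly sales on the given subplot, and returns the month number.
#Hypothetical sketch of CalculateAverege (the actual helper is not shown in this post)
def CalculateAverege(col, mode, subplot_index):
    #Average the feature column and the weekly sales per month
    monthly = train_all.groupby(['Year', 'Month']).agg(
        feature_avg=(col, 'mean'), sales_avg=('Weekly_Sales', 'mean')).reset_index()
    monthly['feature_change'] = monthly['feature_avg'].pct_change() * 100
    monthly['sales_change'] = monthly['sales_avg'].pct_change() * 100
    monthly = monthly.dropna()
    #Pick the month with the highest / lowest / median value of the feature
    if mode == 'high':
        row = monthly.loc[monthly['feature_avg'].idxmax()]
    elif mode == 'min':
        row = monthly.loc[monthly['feature_avg'].idxmin()]
    else:
        row = monthly.loc[(monthly['feature_avg'] - monthly['feature_avg'].median()).abs().idxmin()]
    #Draw the month-over-month changes on the requested subplot and return the month
    plt.subplot(1, 3, subplot_index)
    sns.barplot(x=['Feature change %', 'Sales change %'],
                y=[row['feature_change'], row['sales_change']])
    return int(row['Month'])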
fig = plt.figure(figsize=(20,6))
sns.set_style("whitegrid")
col = 'Fuel_Price'
m = CalculateAverege(col, 'high', 1)
plt.title("For high Fuel Price Month " + str(m),fontsize = 16)
plt.ylabel("Percentage increase from previous month",fontsize = 16)
m = CalculateAverege(col, 'min', 2)
plt.title("For Min Fuel Price Month " + str(m),fontsize = 16)
plt.ylabel("Percentage increase from previous month",fontsize = 16)
m = CalculateAverege(col, 'median', 3)
plt.title("For Median Fuel Price Month " + str(m),fontsize = 16)
plt.ylabel("Percentage increase from previous month",fontsize = 16)
fig.suptitle("Variation of Sales with previous month with Fuel price",fontsize = 22)
plt.show()
- We have seen that average weekly sales increased from the previous month irrespective of the fuel price.
- In the month with the minimum fuel price, the price fell by 3.5 per cent while sales increased by 11.8 per cent. But we have to consider that this month contains a holiday, which may be the real driver rather than the fuel price drop.
- In the month with the minimum temperature, there is a huge decrease in sales compared to the previous month.
- When the temperature is at its highest, the sales number is also lower than in the previous month.
- There is not much variation in CPI; the sales number does not seem to depend on CPI.
- The unemployment rate is highest in December, and yet sales increase hugely in December compared to the previous month.
corr = train_all.corr()
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True, cmap="YlGnBu")
plt.title("Pearson Correlation Heatmap", fontsize=18)
plt.show()
- Most of the columns have a very low correlation with the sales number.
I am keeping all columns except the markdowns. We will also create some extra columns from the Date; you can find them in the data pre-processing section.
In this section, we have done the below steps -
- Dropped all markdowns.
- Extracted some properties from the Date column:
- Year
- Month
- Day
- Week of Month
- Week Number of the Year
- Days to Christmas
- One hot encoding for the Store type
- Drop the Date & Type
#This function does all the featurization and adds the new features
def add_drop_feature(df):
#dropping all markdowns
columns_drop = ['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']
for col in columns_drop:
if col in df.columns:
df.drop(col, axis=1, inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Week_Of_Month'] = (np.floor(df['Date'].dt.day/7+1))
df['Week'] = df['Date'].dt.week
df['Days_to_Christmas'] = (pd.to_datetime(df['Year'].astype(str)+"-12-25", format="%Y-%m-%d") - df['Date']).dt.days.astype(int)
ohe_type = pd.get_dummies(df.Type)
df = pd.concat([df, ohe_type], axis=1)
df = df.drop(['Date', 'Type'], axis=1)
return df
train_df_final = add_drop_feature(train_df)
train_df_final.head()
test_df_final = add_drop_feature(test_df)
test_df_final.head()
Now, save the train and test data to .csv files; we can use those files while training and testing the models.
train_df_final.to_csv('train_all_features.csv', index=False)
test_df_final.to_csv('test_all_features.csv', index=False)
I have trained a model for each Store and Department combination available in the train data. However, I noticed that for some Store-Department combinations there are very few records, so I decided to train a model only when at least 5 records are available. For each Store and Department, we split the data into train and test sets with a test size of 0.2. I have tried 4 different models -
- Random Forest Regressor
- XGBoost Regressor
- Extra Trees Regressor
- Prophet
I have prepared a function to calculate the WMAE; it is used extensively below.
def calculate_WMAE(dt, y , yHat):
'''
This is the method to calculate Performance metric
'''
w = dt['IsHoliday'].apply(lambda x : 5 if x else 1)
return np.round(np.sum(w*abs(y-yHat))/np.sum(w),4)
First, split the data into train and cross-validation sets. Train the Random Forest regressor on the training set, then compute the WMAE on the cross-validation set. Repeat these steps for multiple hyperparameter settings.
Then pick the best model based on the minimum WMAE score.
Train the Random Forest regressor again with the hyperparameters of the best model, this time on the complete data for the current Store and Department.
Save a Random Forest regressor model for every applicable Store & Department combination. After the training completes, there are 3227 trained Random Forest Regressor models stored on the hard disk.
All hyperparameters used to train the model are given below.
#Hyperparameter search space for the Random Forest
RF_HF = dict(n_estimators=[10,20,30,40,50,60,75],
max_depth=[3,5,7,9,11,15],
max_features = [2,3,4,5,6,7,8,9],
min_samples_split = [2,3,4],
min_samples_leaf = [1,2,3])
Next, randomly pick the hyperparameters from the above list and train the regressor model.
esti = choice(RF_HF['n_estimators'])
md = choice(RF_HF['max_depth'])
mf = choice(RF_HF['max_features'])
mss = choice(RF_HF['min_samples_split'])
msl = choice(RF_HF['min_samples_leaf'])
#Initialize the model
rf = RandomForestRegressor(n_estimators=esti
,max_depth=md
,max_features = mf
,min_samples_split = mss
,min_samples_leaf = msl
,n_jobs=-1)
#Train the Random Forest model
rf.fit(train_X,train_y)
Now, predict the sales number, calculate the WMAE, and keep this error number to pick the best-tuned model.
y_hat = rf.predict(test_X)
wmae_score = calculate_WMAE(test_X,test_y,y_hat)
After running these steps multiple times (20 iterations), pick the best model and train it on the complete data. After training, save the model in a pickle file.
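Putting these pieces together, the random search might look like the sketch below; best_score, best_model, full_X and full_y are bookkeeping names introduced here for illustration, the rest is as shown above.
#Sketch: 20 random hyperparameter draws, keep the model with the lowest CV WMAE
best_score, best_model = float('inf'), None
for _ in range(20):
    rf = RandomForestRegressor(n_estimators=choice(RF_HF['n_estimators']),
                               max_depth=choice(RF_HF['max_depth']),
                               max_features=choice(RF_HF['max_features']),
                               min_samples_split=choice(RF_HF['min_samples_split']),
                               min_samples_leaf=choice(RF_HF['min_samples_leaf']),
                               n_jobs=-1)
    rf.fit(train_X, train_y)
    score = calculate_WMAE(test_X, test_y, rf.predict(test_X))
    if score < best_score:
        best_score, best_model = score, rf
#Re-train the best configuration on the complete Store-Department data (full_X, full_y)
model = RandomForestRegressor(**best_model.get_params()).fit(full_X, full_y)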
filename = 'Models/RandomForest/RF_'+str(s)+'_'+str(d)+'V1.pkl'
with open(filename, 'wb') as file:
pickle.dump(model, file)
Let’s find the feature importance of the Random Forest regressor model for some random Store and Department.
with open('Models/RandomForest/RF_1_1V1.pkl', 'rb') as file:
pickle_model = pickle.load(file)
plt.figure(figsize=(20,10))
sns.set_style("whitegrid")
ax = sns.barplot(x=train_df_final.columns[3:], y=pickle_model.feature_importances_)
plt.title('Random Forest - Feature Importance Store 1 Dept 1 ', fontsize=18)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
We follow the same approach for the XGBoost Regressor as for the Random Forest regressor: split the data in the same ratio for training and cross-validation, train multiple models with different hyperparameters, pick the model with the minimum WMAE, re-train it on the complete data, and save it in a pickle file.
All hyperparameters used to train the model are given below.
XGB_HF = dict(n_estimators=[10,20,30,40,50,60,75],
max_depth=[2,3,5,7,9,11,15],
learning_rate = [0.001,0.003,0.01,0.03,0.1])
Next, randomly pick the hyperparameters from the above list and train the regressor model.
esti = choice(XGB_HF['n_estimators'])
md = choice(XGB_HF['max_depth'])
lr = choice(XGB_HF['learning_rate'])
#Initialize the model
rf = xgb.XGBRegressor(n_estimators=esti
,max_depth=md
,learning_rate = lr
,n_jobs=-1)
#Train the XGBoost model
rf.fit(train_X,train_y)
#Calculate WMAE
y_hat = rf.predict(test_X)
wmae_score = calculate_WMAE(test_X,test_y,y_hat)
After running these steps multiple times (20 iterations), pick the best model and train it on the complete data. After training, save the model in a pickle file.
filename = 'Models/XGBoost/XGB_'+str(s)+'_'+str(d)+'V'+str(v)+'.pkl'
with open(filename, 'wb') as file:
pickle.dump(model, file)
Let's look at the feature importance of the XGBoost regressor model for the same Store and Department that we examined for the Random Forest regressor.
with open('Models/XGBoost/XGB_1_1V1.pkl', 'rb') as file:
pickle_model = pickle.load(file)
plt.figure(figsize=(16,6))
sns.set_style("whitegrid")
ax = sns.barplot(x=train_df_final.columns[3:], y=pickle_model.feature_importances_)
plt.title('XGBoost - Feature Importance Store 1 Dept 1', fontsize=14)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
We follow the same approach as for the Random Forest and XGBoost regressors: split the data in the same ratio for training and cross-validation, train multiple models with different hyperparameters, pick the model with the minimum WMAE, re-train it on the complete data, and save it in a pickle file.
All hyperparameters used to train the model are given below.
#Hyper parameter for the Extra Trees
ET_HF = dict(n_estimators=[10,20,30,40,50,60,75],
max_depth=[3,5,7,9,11,15],
max_features = [2,3,4,5,6,7,8,9],
min_samples_split = [2,3,4],
min_samples_leaf = [1,2,3])
Next, randomly pick the hyperparameters from the above list and train the regressor model.
#Selecting some random hyper Parameter
esti = choice(ET_HF['n_estimators'])
md = choice(ET_HF['max_depth'])
mf = choice(ET_HF['max_features'])
mss = choice(ET_HF['min_samples_split'])
msl = choice(ET_HF['min_samples_leaf'])
#Initialize the model
rf = ExtraTreesRegressor(n_estimators=esti
,max_depth=md
,max_features = mf
,min_samples_split = mss
,min_samples_leaf = msl
,n_jobs=-1)
#Train the Extra Trees model
rf.fit(train_X,train_y)
#Calculate WMAE
y_hat = rf.predict(test_X)
wmae_score = calculate_WMAE(test_X,test_y,y_hat)
After running these steps multiple times (20 iterations), pick the best model and train it on the complete data. After training, save the model in a pickle file.
filename = 'Models/XtraTrees/ET_'+str(s)+'_'+str(d)+'V'+str(v)+'.pkl'
with open(filename, 'wb') as file:
pickle.dump(model, file)
Let's look at the feature importance of the Extra Trees regressor model for the same Store and Department for which we have seen the feature importance of the other models.
with open('Models/XtraTrees/ET_1_1V1.pkl', 'rb') as file:
pickle_model = pickle.load(file)
plt.figure(figsize=(16,6))
sns.set_style("whitegrid")
ax = sns.barplot(x=train_df_final.columns[3:], y=pickle_model.feature_importances_)
plt.title('Extra Trees - Feature Importance Store 1 Dept 1', fontsize=14)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
Let's compare the feature importance of the three tree-based ensembles for the same Store & Department that we have looked at previously.
s = 1
d = 1
with open('Models/RandomForest/RF_{}_{}V1.pkl'.format(s,d), 'rb') as file:
pickle_model1 = pickle.load(file)
with open('Models/XGBoost/XGB_{}_{}V1.pkl'.format(s,d), 'rb') as file:
pickle_model2 = pickle.load(file)
with open('Models/XtraTrees/ET_{}_{}V1.pkl'.format(s,d), 'rb') as file:
pickle_model3 = pickle.load(file)
col = list(train_df_final.columns[3:])
data = {'Feature': np.concatenate((col,col,col), axis=None),
'Feature_Importance': np.concatenate((pickle_model1.feature_importances_,pickle_model2.feature_importances_,pickle_model3.feature_importances_), axis=None),
'Model_Type':np.concatenate((['RandomForest']*15,['XGBoost']*15, ['XtraTrees']*15), axis=None)
}
df = pd.DataFrame(data=data)
plt.figure(figsize=(16,6))
sns.set_style("whitegrid")
ax = sns.barplot(x='Feature', y='Feature_Importance', hue='Model_Type', data=df)
plt.title('Feature Importance Store {} Dept {}'.format(s,d), fontsize=14)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
Prophet is a time series forecasting model developed by Facebook. It is based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. The documentation and the research paper are available on the Prophet project page.
Here, only date & holiday information are used to train a prophet model of a store and department. While training a Prophet model, you can select seasonality. Here, yearly seasonality is set to True and weekly and daily seasonality are set to False.
train_Data['Month'] = train_Data['Month'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
train_Data['Day'] = train_Data['Day'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
train_Data['Date'] = train_Data.apply(lambda x: str(x['Year'])+"-"+str(x['Month'])+"-"+str(x['Day']), axis=1)
train_Data['Date'] = pd.to_datetime(train_Data['Date'])
prophrt_data = train_Data[['Date', 'Weekly_Sales']]
prophrt_data = prophrt_data.rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'})
prophet = Prophet(holidays=holidays, yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
prophet.fit(prophrt_data)
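Once fitted, forecasts for the test weeks can be obtained by passing a frame with a ds column; a brief usage sketch, where test_dates stands for the list of dates to predict and is a name introduced here:
#Forecast the weekly sales for the test dates
future = pd.DataFrame({'ds': pd.to_datetime(test_dates)})
forecast = prophet.predict(future)
predicted_sales = forecast['yhat'].values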
Unlike the other models, a Prophet model should be saved to a JSON file. The Prophet library provides a function called model_to_json for this; please find the code below.
filename = 'Models/Prophet/PP_'+str(s)+'_'+str(d)+'V'+str(v)+'.json'
with open(filename, 'w') as file:
json.dump(model_to_json(model), file)
Now comes the most important part: what are the predicted sales numbers for the test data, and what is the WMAE score of these predictions?
The actual sales numbers of the test data are not available here. We have to create a CSV file with the Id column and the predicted sales number, as specified in the Kaggle problem, and submit it to Kaggle to get the score.
However, some Store & Department combinations of the test data are not available in our train data, yet we still have to predict their sales. For these, I have taken a very simple approach: find a similar store in terms of average sales number (or size) and use that store's model for the department.
def get_nearestStore(store, field='sales', model_info=None):
    '''
    Find a similar store in terms of the given field.
    By default the nearest store is chosen based on average weekly sales;
    model_info (if given) is a data frame listing the stores that have a trained model.
    '''
    avg_sales = train_df_final.groupby(['Store'])[['Weekly_Sales']].mean().reset_index()
    if field == 'size':
        avg_sales = train_df_final.groupby(['Store'])[['Size']].mean().reset_index()
    #Absolute difference from the requested store
    avg_sales['ABS_Diff'] = abs(avg_sales.iloc[:,1] - avg_sales[avg_sales['Store'] == store].reset_index().iloc[0,1])
    if model_info is not None:
        #Keep only the stores for which a trained model exists
        avg_sales = avg_sales.merge(model_info, on='Store')
    #Pick the closest store other than the requested one
    avg_sales = avg_sales.sort_values('ABS_Diff')
    alt_store = avg_sales[avg_sales['Store'] != store].reset_index()[['Store']].iloc[0,0]
    return int(alt_store)
For each Store and Department, we have to load the saved models and forecast the sales numbers.
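A minimal sketch of this prediction loop for the Random Forest models is given below; feature_cols (the model's input columns) and model_info (the frame of stores that have a trained model, as used by get_nearestStore above) are hypothetical names introduced here.
import os
predictions = []
for (s, d), grp in test_df_final.groupby(['Store', 'Dept']):
    model_path = 'Models/RandomForest/RF_' + str(s) + '_' + str(d) + 'V1.pkl'
    if not os.path.exists(model_path):
        #No model for this Store & Department: fall back to the most similar store
        alt_store = get_nearestStore(s, 'sales', model_info)
        model_path = 'Models/RandomForest/RF_' + str(alt_store) + '_' + str(d) + 'V1.pkl'
    with open(model_path, 'rb') as file:
        model = pickle.load(file)
    grp = grp.copy()
    grp['Weekly_Sales'] = model.predict(grp[feature_cols])
    predictions.append(grp)
weekly_sales_test = pd.concat(predictions)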
After that, I generated the CSV file to submit to Kaggle and get the score.
weekly_sales_test['Month'] = weekly_sales_test['Month'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
weekly_sales_test['Day'] = weekly_sales_test['Day'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
weekly_sales_test['Id'] = weekly_sales_test.apply(lambda x: str(x['Store']) + "_" + str(x['Dept'])+"_"+str(x['Year'])+"-"+str(x['Month'])+"-"+str(x['Day']), axis=1)
weekly_sales_test[['Id','Weekly_Sales']].to_csv('Submissions/Submission_v16_All_VAll.csv', index=False);
Here you can find the WMAE score of all models.
Models | Private Score | Public Score |
---|---|---|
Random Forest | 3089.84 | 2921.30448 |
Average of 2 Random Forests | 3041.27575 | 2892.53128 |
Best Model of 2 Random Forests | 3021.37934 | 2881.91710 |
XGBoost | 3269.41583 | 3065.33341 |
Average of XGBoost + Random Forest | 3011.23272 | 2875.34598 |
ExtraTrees | 3123.9821 | 2946.12504 |
Average of Random Forest + XGBoost + ExtraTrees | 2983.59973 | 2839.19310 |
Prophet | 3073.71809 | 2966.65869 |
Average of all Models | 2770.29707 | 2639.92581 |
Average of Random Forest + Prophet | 2736.33345 | 2627.53877 |
The best (lowest) WMAE comes from the Average of Random Forest + Prophet, followed by the Average of all Models.
- I want to try some deep learning based models to improve the score further.
- I would also like to try some statistical models like ARIMA.
Please find the GitHub link of the project.
Thank you for your attention and for reading my work. If you liked this story, share it with your friends and colleagues! Also, follow me on LinkedIn.
I want to thank all the team members of the Applied AI course, especially Srikanth sir. Please visit https://www.appliedaicourse.com/ for more information.