Walmart is a retail chain headquartered in the USA, and it now operates stores all over the world. Predicting future sales for a company like Walmart is one of the most important aspects of strategic planning: based on the sales forecast, they can make informed short-term and long-term decisions. The forecast also helps with hiring contract employees in line with expected demand, and it helps each Walmart store operate more efficiently.
Walmart runs many stores in different locations, and stores in different locations can show different sales patterns. To identify these patterns and predict future sales, we need a complete solution, and a machine learning approach is a natural fit. With a trained model, predicting a store's sales number is no longer a tedious task and saves a lot of human time.
- Part-1: Kaggle Data
- Part-2: Exploratory Data Analysis (EDA)
- Part-3: Data Pre-Processing & Feature Extraction
- Part-4: Machine Learning Regression Models
- Part-5: Evaluation of the Models
- Part-6: Future Work
We are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you are tasked with predicting the department-wide sales for each store.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
stores.csv
This file contains anonymized information about the 45 stores, indicating the type and size of the store.
train.csv
This is the historical training data covering 2010–02–05 to 2012–11–01. Within this file you will find the following fields:
- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week
test.csv
This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
features.csv
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
- Store - the store number
- Date - the week
- Temperature - the average temperature in the region
- Fuel_Price - the cost of fuel in the region
- MarkDown1–5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011 and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week
There are 4 holidays mentioned in the dataset:
- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
The performance metric of this competition is the Weighted Mean Absolute Error (WMAE). As per the Kaggle page, a week that contains a holiday gets a weight of 5, otherwise 1. That means Walmart is more focused on holiday weeks and wants the error on those weeks to be as small as possible, so we will pay special attention to them as well.
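For reference, as defined on the Kaggle page, the metric is

WMAE = (1 / Σ wᵢ) · Σ wᵢ · |yᵢ − ŷᵢ|

where yᵢ is the actual weekly sales, ŷᵢ is the predicted weekly sales, and the weight wᵢ is 5 if the week contains a holiday and 1 otherwise. The calculate_WMAE function in the evaluation section implements exactly this.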
First, let's load all the CSV files.
#Imports (pandas, numpy, seaborn, matplotlib, plotly and tabulate are used throughout this post)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from tabulate import tabulate
#Sales Data (parse Date as datetime so the .dt accessors used later work)
train_data = pd.read_csv("train/train.csv", parse_dates=["Date"])
test_data = pd.read_csv("test/test.csv", parse_dates=["Date"])
#Features.csv
feature_data = pd.read_csv("features.csv", parse_dates=["Date"])
#Stores.csv
store_data = pd.read_csv("stores.csv")
Currently, we have multiple data frames. Let's join those data frames.
#Join The Train data
train_all = train_data.set_index("Store").join(store_data.set_index("Store"))
train_all.reset_index(inplace=True)
#Drop the duplicate IsHoliday column from features before joining to avoid a column overlap
train_all = train_all.set_index(["Store","Date"]).join(feature_data.drop(columns=["IsHoliday"]).set_index(["Store","Date"]))
train_all.reset_index(inplace=True)
#Train data size
print(train_all.shape)
train_all.head()
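The test data frame (test_all) used below can be built in the same way; a minimal sketch mirroring the train join:
#Join the Test data with the store and feature information
test_all = test_data.set_index("Store").join(store_data.set_index("Store"))
test_all.reset_index(inplace=True)
test_all = test_all.set_index(["Store","Date"]).join(feature_data.drop(columns=["IsHoliday"]).set_index(["Store","Date"]))
test_all.reset_index(inplace=True)
print(test_all.shape)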
Now, check the number of NaN values for each column for both Train and Test data.
Train Data
print(train_all.isnull().sum())
print(train_all.isnull().sum()*100.0/train_all.shape[0])
Test Data
print(test_all.isnull().sum())
print(test_all.isnull().sum()*100.0/test_all.shape[0])
- More than 60 per cent of the markdown values are NULL. As per the competition page, markdowns are only available after Nov 2011, so the markdowns for the data before Nov 2011 can be treated as zero.
- For about one-third of the test data, CPI and Unemployment are missing. These depend on location and time but do not change drastically, so we can impute the store-wise mean of these columns.
Let's fill in the NaN values -
#Replace CPI & Unemployment with the store-wise mean from the train data
store_cpi = train_all.groupby('Store')['CPI'].mean()
store_unemployment = train_all.groupby('Store')['Unemployment'].mean()
test_all['CPI'] = test_all['CPI'].fillna(test_all['Store'].map(store_cpi))
test_all['Unemployment'] = test_all['Unemployment'].fillna(test_all['Store'].map(store_unemployment))
#Replace the Markdowns with zero
train_all.fillna(0, inplace=True)
test_all.fillna(0, inplace=True)
sns.set_style("whitegrid")
ax = sns.boxplot(x='Type', y='Size', data=train_all).set_title("Box Plot of Type vs Size")
- Store size varies a lot between the different store types.
train_all_new = train_all.sort_values('Weekly_Sales',ascending=False)
train_all_new = train_all_new.head(50)
train_all_new['Week'] = train_all_new['Date'].dt.week
train_all_new['Month'] = train_all_new['Date'].dt.month
train_all_new['Year'] = train_all_new['Date'].dt.year
#Print the top 50 rows (highest weekly sales)
print(tabulate(train_all_new[['Store','Dept','Size','Type','Week','Month','Year','Weekly_Sales']], headers='keys', tablefmt='psql', showindex=False))
- The highest weekly sales occur in the Thanksgiving week and the week before Christmas.
- Most of these top sales come from departments 7 and 72.
avg_sales = train_all.groupby('Date')[['Weekly_Sales']].mean().reset_index()
fig = plt.figure(figsize=(18,6))
plt.plot(avg_sales['Date'], avg_sales['Weekly_Sales'])
plt.xlabel("Time --->", fontsize=16)
plt.ylabel("Weekly Sales --->", fontsize=16)
plt.title("Average weekly sales over time", fontsize=18)
plt.show()
avg_sales = train_all.groupby('Date')[['Weekly_Sales']].agg(['mean','median']).reset_index()
avg_sales['Week'] = avg_sales['Date'].dt.week
avg_sales_2010 = avg_sales[avg_sales['Date'].dt.year == 2010]
avg_sales_2011 = avg_sales[avg_sales['Date'].dt.year == 2011]
avg_sales_2012 = avg_sales[avg_sales['Date'].dt.year == 2012]
#Plot the model
fig = plt.figure(figsize=(18,6))
plt.plot(avg_sales_2010['Week'], avg_sales_2010['Weekly_Sales']['mean'], label='Year 2010 Mean Sales')
plt.plot(avg_sales_2010['Week'], avg_sales_2010['Weekly_Sales']['median'], label='Year 2010 Median Sales')
plt.plot(avg_sales_2011['Week'], avg_sales_2011['Weekly_Sales']['mean'], label='Year 2011 Mean Sales')
plt.plot(avg_sales_2011['Week'], avg_sales_2011['Weekly_Sales']['median'], label='Year 2011 Median Sales')
plt.plot(avg_sales_2012['Week'], avg_sales_2012['Weekly_Sales']['mean'], label='Year 2012 Mean Sales')
plt.plot(avg_sales_2012['Week'], avg_sales_2012['Weekly_Sales']['median'], label='Year 2012 Median Sales')
plt.xticks(np.arange(1, 53, step=1))
plt.grid(axis='both',color='grey', linestyle='--', linewidth=1)
plt.legend()
plt.xlabel("Week --->", fontsize=16)
plt.ylabel("Weekly Sales --->", fontsize=16)
plt.title("Average/Median weekly sales of the Different Week Over a year", fontsize=18)
- Sales from late November through December are much higher than in a normal week.
- Sales drop to their minimum right after December.
- We can see a spike for the Super Bowl week.
- In January, the average sales are at their lowest.
Now, let's look at the weekly sales pattern of some departments.
avg_sales_depart = train_all.groupby(['Date','Dept'])[['Weekly_Sales']].mean().reset_index()
#PLot for department 1 to 10
fig = plt.figure(figsize=(18,6))
for depart in range(1,11):
avg_sales_depa_curr = avg_sales_depart[avg_sales_depart['Dept'] == depart]
plt.plot(avg_sales_depa_curr['Date'], avg_sales_depa_curr['Weekly_Sales'], label='Department ' + str(depart))
plt.xlabel("Time --->", fontsize=16)
plt.ylabel("Weekly Sales --->", fontsize=16)
plt.title("Average weekly sales over time for department 1 to department 10",, fontsize=18)
plt.legend()
plt.show()
Similarly, we can plot the same data for departments 11 to 20 and 21 to 30.
- Each Department has its own behaviour.
- For some departments, the sales pattern is similar to the average pattern across all departments.
- We should have different models for different departments.
Here I have decided to train a separate model for each store and department.
In the plot below, the average weekly sales are plotted per store; the size of the bubble denotes the size of the store and the colour denotes its type.
avg_sales = train_all.groupby(['Store','Size', 'Type'])[['Weekly_Sales']].mean().reset_index()
fig = px.scatter(avg_sales, x="Store", y="Weekly_Sales",
size="Size", color="Type",
hover_name="Store", log_y=True,
size_max=60, title="Average Weekly Sales number with size and category of the store")
fig.show()
- Some stores are in the A or B category even though they are smaller than any C category store.
- Broadly, we can see that larger stores tend to have higher sales numbers.
weekly_sales_dept = train_all.groupby(["Store","IsHoliday"])[["Weekly_Sales"]].mean().reset_index()
plt.figure(figsize=(20,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Store", y="Weekly_Sales",hue="IsHoliday", data=weekly_sales_dept)
plt.title('Average Sales per Store - Holiday vs Non-Holiday Weeks', fontsize=18)
plt.ylabel('Sales Number', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()
Here we compare the average sales of each store in holiday weeks and non-holiday weeks.
- Most stores have higher sales numbers in holiday weeks.
weekly_sales_dept = train_all.groupby(["Store", 'Type'])[["Weekly_Sales"]].mean().reset_index()
weekly_sales_total = weekly_sales_dept["Weekly_Sales"].sum()
weekly_sales_dept['Percentage'] = weekly_sales_dept['Weekly_Sales']/weekly_sales_total*100
weekly_sales_dept = weekly_sales_dept.sort_values('Percentage', ascending=False).reset_index()
weekly_sales_dept['Store Number'] = weekly_sales_dept['Store'].astype(str)
plt.figure(figsize=(24,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Store", y="Percentage", data=weekly_sales_dept, palette='deep',order=weekly_sales_dept['Store'],hue='Type', dodge=False )
for p in ax.patches:
ax.annotate(str(format(p.get_height(), '.2f'))+"%",
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center',
size=12,
xytext = (0, -20),
rotation=90,
weight='bold',
color='white',
textcoords = 'offset points')
plt.title('Percentage of Total Average Sales per Store', fontsize=18)
plt.ylabel('Percentage of Average Sales', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()
- Some A category stores have lower average sales than some B or C category stores.
- However, most A category stores have higher average sales than B or C category stores.
weekly_sales_dept = train_all.groupby(["Dept","IsHoliday"])[["Weekly_Sales"]].mean().reset_index()
plt.figure(figsize=(20,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Dept", y="Weekly_Sales",hue="IsHoliday", data=weekly_sales_dept)
plt.title('Average Sales - per Department', fontsize=18)
plt.grid(axis='y',color='gray', linestyle='--', linewidth=1)
plt.ylabel('Sales Number', fontsize=16)
plt.xlabel('Departments', fontsize=16)
plt.show()
Here we compare the average sales of each department in holiday weeks and non-holiday weeks.
- For some departments, the increase in holiday-week sales is much bigger than for others.
- But for most departments, holiday-week sales stay within the range of the average sales.
Let's find the number of distinct departments in each store.
store_Dept = train_all.groupby(['Store'])[['Dept']].nunique().reset_index().sort_values('Dept', ascending=False)
store_Dept.head()
- Here we have a critical observation: the number of departments varies from store to store. If we have a model for each store and department, there may be an issue if a new department is introduced in the future.
- We have to check whether this scenario is present in our test data.
The number of distinct Store-Department combinations in the training data is 3331, and in the test data it is 3169. Between these two lists, 3158 combinations are common. That means there are 11 Store-Department combinations that appear in the test data but are not available in the train data.
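These counts can be verified with a quick check along these lines:
train_combo = train_all[['Store', 'Dept']].drop_duplicates()
test_combo = test_all[['Store', 'Dept']].drop_duplicates()
common = train_combo.merge(test_combo, on=['Store', 'Dept'])
print(len(train_combo), len(test_combo), len(common))   #3331 3169 3158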
Let's find the average weekly sales for each week of each month. In the plots, months are shown as numbers, so Month 1 is January.
train_all['Month'] = train_all['Date'].dt.month
train_all['Week'] = train_all['Date'].dt.week
train_all['Week_Of_Month'] = (np.floor(train_all['Date'].dt.day/7+1))
avg_sale_week_month = train_all.groupby(['Month','Week_Of_Month'])[['Weekly_Sales']].mean().reset_index()
plt.figure(figsize=(20,12))
for k in range(1,13):
ax = plt.subplot(3,4,k)
avg_sale_curr = avg_sale_week_month[avg_sale_week_month['Month'] == k]
sns.set_style("whitegrid")
ax1 = sns.barplot(x="Week_Of_Month", y="Weekly_Sales", data=avg_sale_curr,ax=ax)
ax1.set_title("Average sales on month " + str(k));
plt.show()
- For most months, the first week has higher sales than the last week of that month. I added a feature for the week of the month.
- November behaves differently from the other months: its last week contains Thanksgiving, and the sales number is high for that week.
- In December, the Christmas holiday falls in the last week of the year, but sales in that week are fairly low; people do their Christmas shopping in the prior week.
# Find all records for last 2 weeks
train_all['Year'] = train_all['Date'].dt.year
train_all_Dec = train_all[(train_all['Month'] == 12) & ((train_all['Week'] == 51) | (train_all['Week'] == 52))]
train_all_Dec_grp1 = train_all_Dec.groupby('Date')[['Weekly_Sales']].mean().reset_index()
train_all_Dec_grp1['Date2'] = train_all_Dec_grp1['Date'].apply(lambda x : str(x.year) + "-" + str(x.month)+"-"+str(x.day))
sns.set_style("whitegrid")
ax = sns.barplot(x="Date2", y="Weekly_Sales", data=train_all_Dec_grp1)
for index, row in train_all_Dec_grp1.iterrows():
ax.text(row.name,row.Weekly_Sales, round(row.Weekly_Sales,2), color='black', ha="center")
plt.title('Average Sales of last 2 weeks of the year', fontsize=18)
plt.grid(axis='y',color='gray', linestyle='--', linewidth=1)
plt.ylabel('Sales Number', fontsize=16)
plt.xlabel('Date', fontsize=16)
plt.show()
- In 2010 there were no days between the second-to-last week and Christmas day, but in 2011 there was one such day. This is one of the reasons why the last-week sales number is bigger in 2011 than in 2010.
- The combined sales of the last two weeks of the year are similar from year to year.
train_all['Year'] = train_all['Date'].dt.year
train_all['Week'] = train_all['Date'].dt.week
all_holiday_week = train_all[train_all['IsHoliday'] == True]['Week'].unique()
train_all_filtered = train_all[train_all['Week'].isin(all_holiday_week) | (train_all['Week']+1).isin(all_holiday_week) |(train_all['Week'] == 1) | (train_all['Week']-1).isin(all_holiday_week)]
filtered_week_sum = train_all_filtered.groupby('Week')[['Weekly_Sales']].mean().reset_index()
#train_all_filtered['Week'].unique()
all_holiday = ['Super Bowl','Labour Day', 'ThanksGiving', 'Christmas']
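#NOTE: labelHoliday and preOrPost are not defined in the snippets above; the
#following is a hypothetical sketch of what they might look like, assuming the
#ISO week numbers of the holidays (Super Bowl ~6, Labour Day ~36,
#Thanksgiving ~47, Christmas ~52)
def labelHoliday(row):
    #Map a week number to the holiday it belongs to (holiday, pre or post week)
    if row['Week'] in (5, 6, 7):
        return 'Super Bowl'
    elif row['Week'] in (35, 36, 37):
        return 'Labour Day'
    elif row['Week'] in (46, 47, 48):
        return 'ThanksGiving'
    else:
        return 'Christmas'
def preOrPost(row):
    #Tag the week as the holiday week itself, the week before or the week after
    if row['Week'] in (6, 36, 47, 52):
        return 'HolidayWeek'
    elif row['Week'] in (5, 35, 46, 51):
        return 'PreWeek'
    else:
        return 'PostWeek'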
filtered_week_sum['HolidayCat'] = filtered_week_sum.apply(labelHoliday,axis=1)
filtered_week_sum['PreOrPost'] = filtered_week_sum.apply(preOrPost,axis=1)
#Taking the pivot
train_pivot = filtered_week_sum.pivot(index='HolidayCat', columns='PreOrPost', values='Weekly_Sales').reset_index()
train_pivot['Percentage_Sale_Increase_From_PreviousWeek'] = train_pivot.apply(lambda x: 100*(x['HolidayWeek'] - x['PreWeek'])/x['HolidayWeek'], axis=1)
train_pivot['Percentage_Sale_Decrease_In_PostWeek'] = train_pivot.apply(lambda x: 100*(x['HolidayWeek'] - x['PostWeek'])/x['HolidayWeek'], axis=1)
print(tabulate(train_pivot[['HolidayCat','Percentage_Sale_Increase_From_PreviousWeek','Percentage_Sale_Decrease_In_PostWeek']], headers='keys', tablefmt='psql', showindex=False))
- For Christmas Week, people do their shopping in the prior week.
- Thanksgiving week has an almost 30% sales jump compared to the previous week and an almost 25% drop in the week after.
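The plots below use a helper, CalculateAverege, which is not shown in this post. The following is only a hypothetical sketch of what such a helper might do, inferred from the axis labels and the observations that follow: for the chosen feature column it finds the month whose average value is the highest / lowest / closest to the median, plots the month-over-month percentage change of that feature and of the average weekly sales on the given subplot, and returns the month number.
#Hypothetical sketch of CalculateAverege (the actual helper is not shown in this post)
def CalculateAverege(col, mode, subplot_index):
    #Average the feature column and the weekly sales per month
    monthly = train_all.groupby(['Year', 'Month']).agg(
        feature_avg=(col, 'mean'), sales_avg=('Weekly_Sales', 'mean')).reset_index()
    monthly['feature_change'] = monthly['feature_avg'].pct_change() * 100
    monthly['sales_change'] = monthly['sales_avg'].pct_change() * 100
    monthly = monthly.dropna()
    #Pick the month with the highest / lowest / median value of the feature
    if mode == 'high':
        row = monthly.loc[monthly['feature_avg'].idxmax()]
    elif mode == 'min':
        row = monthly.loc[monthly['feature_avg'].idxmin()]
    else:
        row = monthly.loc[(monthly['feature_avg'] - monthly['feature_avg'].median()).abs().idxmin()]
    #Draw the month-over-month changes on the requested subplot and return the month
    plt.subplot(1, 3, subplot_index)
    sns.barplot(x=['Feature change %', 'Sales change %'],
                y=[row['feature_change'], row['sales_change']])
    return int(row['Month'])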
fig = plt.figure(figsize=(20,6))
sns.set_style("whitegrid")
col = 'Fuel_Price'
m = CalculateAverege(col, 'high', 1)
plt.title("For high Fuel Price Month " + str(m),fontsize = 16)
plt.ylabel("Percentage increase from previous month",fontsize = 16)
m = CalculateAverege(col, 'min', 2)
plt.title("For Min Fuel Price Month " + str(m),fontsize = 16)
plt.ylabel("Percentage increase from previous month",fontsize = 16)
m = CalculateAverege(col, 'median', 3)
plt.title("For Median Fuel Price Month " + str(m),fontsize = 16)
plt.ylabel("Percentage increase from previous month",fontsize = 16)
fig.suptitle("Variation of Sales with previous month with Fuel price",fontsize = 22)
plt.show()
- We have seen that average weekly sales increased from the previous month irrespective of the fuel price.
- In the month with the minimum fuel price, the price fell by 3.5 per cent while sales increased by 11.8 per cent. But we have to consider that this month contains a holiday, which may be the real driver rather than the fuel price drop.
- In the month with the minimum temperature, there is a huge decrease in sales compared to the previous month.
- When the temperature is at its highest, the sales number is also lower than in the previous month.
- There is not much variation in CPI; the sales number does not seem to depend on CPI.
- The unemployment rate is highest in December, and yet sales increase hugely in December compared to the previous month.
corr = train_all.corr()
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True, cmap="YlGnBu")
plt.title("Pearson Correlation Heatmap", fontsize=18)
plt.show()
- Most of the columns have a very low correlation with the sales number.
I am keeping all columns except the markdowns. We will also create some extra columns from the Date; you can find them in the data pre-processing section.
In this section, we have done the below steps -
- Dropped all markdowns.
- Extracted some properties from the Date column:
- Year
- Month
- Day
- Week of Month
- Week Number of the Year
- Days to Christmas
- One hot encoding for the Store type
- Drop the Date & Type
#This function does all the featurization and adds the new features
def add_drop_feature(df):
#dropping all markdowns
columns_drop = ['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']
for col in columns_drop:
if col in df.columns:
df.drop(col, axis=1, inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Week_Of_Month'] = (np.floor(df['Date'].dt.day/7+1))
df['Week'] = df['Date'].dt.week
df['Days_to_Christmas'] = (pd.to_datetime(df['Year'].astype(str)+"-12-25", format="%Y-%m-%d") - df['Date']).dt.days.astype(int)
ohe_type = pd.get_dummies(df.Type)
df = pd.concat([df, ohe_type], axis=1)
df = df.drop(['Date', 'Type'], axis=1)
return df
train_df_final = add_drop_feature(train_df)
train_df_final.head()
test_df_final = add_drop_feature(test_df)
test_df_final.head()
Now, save the train and test data to .csv files; we can use those files while training and testing the models.
train_df_final.to_csv('train_all_features.csv', index=False)
test_df_final.to_csv('test_all_features.csv', index=False)
I have trained a model for each Store and Department combination available in the train data. However, I noticed that for some Store-Department combinations there are very few records, so I decided to train a model only when at least 5 records are available. For each Store and Department, we split the data into train and test sets with a test size of 0.2. I have tried 4 different models -
- Random Forest Regressor
- XGBoost Regressor
- Extra Trees Regressor
- Prophet
I have prepared a function to calculate the WMAE; it is used extensively below.
def calculate_WMAE(dt, y , yHat):
'''
This is the method to calculate Performance metric
'''
w = dt['IsHoliday'].apply(lambda x : 5 if x else 1)
return np.round(np.sum(w*abs(y-yHat))/np.sum(w),4)
First, split the data into train and cross-validation sets. Train the Random Forest regressor on the training set, then compute the WMAE on the cross-validation set. Repeat these steps for multiple hyperparameter settings.
Then pick the best model based on the minimum WMAE score.
Train the Random Forest regressor again with the hyperparameters of the best model, this time on the complete data for the current Store and Department.
Save a Random Forest regressor model for every applicable Store & Department combination. After the training completes, there are 3227 trained Random Forest Regressor models stored on the hard disk.
All hyperparameters used to train the model are given below.
#Hyperparameter search space for the Random Forest
RF_HF = dict(n_estimators=[10,20,30,40,50,60,75],
max_depth=[3,5,7,9,11,15],
max_features = [2,3,4,5,6,7,8,9],
min_samples_split = [2,3,4],
min_samples_leaf = [1,2,3])
Next, randomly pick the hyperparameters from the above list and train the regressor model.
esti = choice(RF_HF['n_estimators'])
md = choice(RF_HF['max_depth'])
mf = choice(RF_HF['max_features'])
mss = choice(RF_HF['min_samples_split'])
msl = choice(RF_HF['min_samples_leaf'])
#Initialize the model
rf = RandomForestRegressor(n_estimators=esti
,max_depth=md
,max_features = mf
,min_samples_split = mss
,min_samples_leaf = msl
,n_jobs=-1)
#Train the Random Forest model
rf.fit(train_X,train_y)
Now, predict the sales number, calculate the WMAE, and keep this error number to pick the best-tuned model.
y_hat = rf.predict(test_X)
wmae_score = calculate_WMAE(test_X,test_y,y_hat)
After running these steps multiple times (20 iterations), pick the best model and train it on the complete data. After training, save the model in a pickle file.
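Putting these pieces together, the random search might look like the sketch below; best_score, best_model, full_X and full_y are bookkeeping names introduced here for illustration, the rest is as shown above.
#Sketch: 20 random hyperparameter draws, keep the model with the lowest CV WMAE
best_score, best_model = float('inf'), None
for _ in range(20):
    rf = RandomForestRegressor(n_estimators=choice(RF_HF['n_estimators']),
                               max_depth=choice(RF_HF['max_depth']),
                               max_features=choice(RF_HF['max_features']),
                               min_samples_split=choice(RF_HF['min_samples_split']),
                               min_samples_leaf=choice(RF_HF['min_samples_leaf']),
                               n_jobs=-1)
    rf.fit(train_X, train_y)
    score = calculate_WMAE(test_X, test_y, rf.predict(test_X))
    if score < best_score:
        best_score, best_model = score, rf
#Re-train the best configuration on the complete Store-Department data (full_X, full_y)
model = RandomForestRegressor(**best_model.get_params()).fit(full_X, full_y)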
filename = 'Models/RandomForest/RF_'+str(s)+'_'+str(d)+'V1.pkl'
with open(filename, 'wb') as file:
pickle.dump(model, file)
Let’s find the feature importance of the Random Forest regressor model for some random Store and Department.
with open('Models/RandomForest/RF_1_1V1.pkl', 'rb') as file:
pickle_model = pickle.load(file)
plt.figure(figsize=(20,10))
sns.set_style("whitegrid")
ax = sns.barplot(x=train_df_final.columns[3:], y=pickle_model.feature_importances_)
plt.title('Random Forest - Feature Importance Store 1 Dept 1 ', fontsize=18)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
We follow the same approach for the XGBoost Regressor as for the Random Forest regressor: split the data in the same ratio for training and cross-validation, train multiple models with different hyperparameters, pick the model with the minimum WMAE, re-train it on the complete data, and save it in a pickle file.
All hyperparameters used to train the model are given below.
XGB_HF = dict(n_estimators=[10,20,30,40,50,60,75],
max_depth=[2,3,5,7,9,11,15],
learning_rate = [0.001,0.003,0.01,0.03,0.1])
Next, randomly pick the hyperparameters from the above list and train the regressor model.
esti = choice(XGB_HF['n_estimators'])
md = choice(XGB_HF['max_depth'])
lr = choice(XGB_HF['learning_rate'])
#Initialize the model
rf = xgb.XGBRegressor(n_estimators=esti
,max_depth=md
,learning_rate = lr
,n_jobs=-1)
#Train the XGBoost model
rf.fit(train_X,train_y)
#Calculate WMAE
y_hat = rf.predict(test_X)
wmae_score = calculate_WMAE(test_X,test_y,y_hat)
After running these steps multiple times (20 iterations), pick the best model and train it on the complete data. After training, save the model in a pickle file.
filename = 'Models/XGBoost/XGB_'+str(s)+'_'+str(d)+'V'+str(v)+'.pkl'
with open(filename, 'wb') as file:
pickle.dump(model, file)
Let's look at the feature importance of the XGBoost regressor model for the same Store and Department that we examined for the Random Forest regressor.
with open('Models/XGBoost/XGB_1_1V1.pkl', 'rb') as file:
pickle_model = pickle.load(file)
plt.figure(figsize=(16,6))
sns.set_style("whitegrid")
ax = sns.barplot(x=train_df_final.columns[3:], y=pickle_model.feature_importances_)
plt.title('XGBoost - Feature Importance Store 1 Dept 1', fontsize=14)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
We follow the same approach as for the Random Forest and XGBoost regressors: split the data in the same ratio for training and cross-validation, train multiple models with different hyperparameters, pick the model with the minimum WMAE, re-train it on the complete data, and save it in a pickle file.
All hyperparameters used to train the model are given below.
#Hyper parameter for the Extra Trees
ET_HF = dict(n_estimators=[10,20,30,40,50,60,75],
max_depth=[3,5,7,9,11,15],
max_features = [2,3,4,5,6,7,8,9],
min_samples_split = [2,3,4],
min_samples_leaf = [1,2,3])
Next, randomly pick the hyperparameters from the above list and train the regressor model.
#Selecting some random hyper Parameter
esti = choice(ET_HF['n_estimators'])
md = choice(ET_HF['max_depth'])
mf = choice(ET_HF['max_features'])
mss = choice(ET_HF['min_samples_split'])
msl = choice(ET_HF['min_samples_leaf'])
#Initialize the model
rf = ExtraTreesRegressor(n_estimators=esti
,max_depth=md
,max_features = mf
,min_samples_split = mss
,min_samples_leaf = msl
,n_jobs=-1)
#Train the Extra Trees model
rf.fit(train_X,train_y)
#Calculate WMAE
y_hat = rf.predict(test_X)
wmae_score = calculate_WMAE(test_X,test_y,y_hat)
After running these steps multiple times (20 iterations), pick the best model and train it on the complete data. After training, save the model in a pickle file.
filename = 'Models/XtraTrees/ET_'+str(s)+'_'+str(d)+'V'+str(v)+'.pkl'
with open(filename, 'wb') as file:
pickle.dump(model, file)
Let's look at the feature importance of the Extra Trees regressor model for the same Store and Department for which we have seen the feature importance of the other models.
with open('Models/XtraTrees/ET_1_1V1.pkl', 'rb') as file:
pickle_model = pickle.load(file)
plt.figure(figsize=(16,6))
sns.set_style("whitegrid")
ax = sns.barplot(x=train_df_final.columns[3:], y=pickle_model.feature_importances_)
plt.title('Extra Trees - Feature Importance Store 1 Dept 1', fontsize=14)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
Let's compare the feature importance of the three tree-based ensembles for the same Store & Department that we have looked at previously.
s = 1
d = 1
with open('Models/RandomForest/RF_{}_{}V1.pkl'.format(s,d), 'rb') as file:
pickle_model1 = pickle.load(file)
with open('Models/XGBoost/XGB_{}_{}V1.pkl'.format(s,d), 'rb') as file:
pickle_model2 = pickle.load(file)
with open('Models/XtraTrees/ET_{}_{}V1.pkl'.format(s,d), 'rb') as file:
pickle_model3 = pickle.load(file)
col = list(train_df_final.columns[3:])
data = {'Feature': np.concatenate((col,col,col), axis=None),
'Feature_Importance': np.concatenate((pickle_model1.feature_importances_,pickle_model2.feature_importances_,pickle_model3.feature_importances_), axis=None),
'Model_Type':np.concatenate((['RandomForest']*15,['XGBoost']*15, ['XtraTrees']*15), axis=None)
}
df = pd.DataFrame(data=data)
plt.figure(figsize=(16,6))
sns.set_style("whitegrid")
ax = sns.barplot(x='Feature', y='Feature_Importance', hue='Model_Type', data=df)
plt.title('Feature Importance Store {} Dept {}'.format(s,d), fontsize=14)
plt.ylabel('Feature Importance', fontsize=16)
plt.xlabel('Feature', fontsize=16)
plt.show()
Prophet is a time series forecasting model developed by Facebook. It is based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. The documentation and the research paper are available on the Prophet project page.
Here, only date & holiday information are used to train a prophet model of a store and department. While training a Prophet model, you can select seasonality. Here, yearly seasonality is set to True and weekly and daily seasonality are set to False.
train_Data['Month'] = train_Data['Month'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
train_Data['Day'] = train_Data['Day'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
train_Data['Date'] = train_Data.apply(lambda x: str(x['Year'])+"-"+str(x['Month'])+"-"+str(x['Day']), axis=1)
train_Data['Date'] = pd.to_datetime(train_Data['Date'])
prophrt_data = train_Data[['Date', 'Weekly_Sales']]
prophrt_data = prophrt_data.rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'})
prophet = Prophet(holidays=holidays, yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
prophet.fit(prophrt_data)
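Once fitted, forecasts for the test weeks can be obtained by passing a frame with a ds column; a brief usage sketch, where test_dates stands for the list of dates to predict and is a name introduced here:
#Forecast the weekly sales for the test dates
future = pd.DataFrame({'ds': pd.to_datetime(test_dates)})
forecast = prophet.predict(future)
predicted_sales = forecast['yhat'].values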
Unlike the other models, a Prophet model should be saved to a JSON file. The Prophet library provides a function called model_to_json for this; please find the code below.
filename = 'Models/Prophet/PP_'+str(s)+'_'+str(d)+'V'+str(v)+'.json'
with open(filename, 'w') as file:
json.dump(model_to_json(model), file)
Now comes the most important part: what are the predicted sales numbers for the test data, and what is the WMAE score of these predictions?
The actual sales numbers of the test data are not available here. We have to create a CSV file with the Id column and the predicted sales number, as specified in the Kaggle problem, and submit it to Kaggle to get the score.
However, some Store & Department combinations of the test data are not available in our train data, yet we still have to predict their sales. For these, I have taken a very simple approach: find a similar store in terms of average sales number (or size) and use that store's model for the department.
def get_nearestStore(store, field='sales', model_info=None):
    '''
    Find a similar store in terms of the given field.
    By default the nearest store is chosen based on average weekly sales;
    model_info (if given) is a data frame listing the stores that have a trained model.
    '''
    avg_sales = train_df_final.groupby(['Store'])[['Weekly_Sales']].mean().reset_index()
    if field == 'size':
        avg_sales = train_df_final.groupby(['Store'])[['Size']].mean().reset_index()
    #Absolute difference from the requested store
    avg_sales['ABS_Diff'] = abs(avg_sales.iloc[:,1] - avg_sales[avg_sales['Store'] == store].reset_index().iloc[0,1])
    if model_info is not None:
        #Keep only the stores for which a trained model exists
        avg_sales = avg_sales.merge(model_info, on='Store')
    #Pick the closest store other than the requested one
    avg_sales = avg_sales.sort_values('ABS_Diff')
    alt_store = avg_sales[avg_sales['Store'] != store].reset_index()[['Store']].iloc[0,0]
    return int(alt_store)
For each Store and Department, we have to load the saved models and forecast the sales numbers.
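A minimal sketch of this prediction loop for the Random Forest models is given below; feature_cols (the model's input columns) and model_info (the frame of stores that have a trained model, as used by get_nearestStore above) are hypothetical names introduced here.
import os
predictions = []
for (s, d), grp in test_df_final.groupby(['Store', 'Dept']):
    model_path = 'Models/RandomForest/RF_' + str(s) + '_' + str(d) + 'V1.pkl'
    if not os.path.exists(model_path):
        #No model for this Store & Department: fall back to the most similar store
        alt_store = get_nearestStore(s, 'sales', model_info)
        model_path = 'Models/RandomForest/RF_' + str(alt_store) + '_' + str(d) + 'V1.pkl'
    with open(model_path, 'rb') as file:
        model = pickle.load(file)
    grp = grp.copy()
    grp['Weekly_Sales'] = model.predict(grp[feature_cols])
    predictions.append(grp)
weekly_sales_test = pd.concat(predictions)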
After that, I generated the CSV file to submit to Kaggle and get the score.
weekly_sales_test['Month'] = weekly_sales_test['Month'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
weekly_sales_test['Day'] = weekly_sales_test['Day'].apply(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x))
weekly_sales_test['Id'] = weekly_sales_test.apply(lambda x: str(x['Store']) + "_" + str(x['Dept'])+"_"+str(x['Year'])+"-"+str(x['Month'])+"-"+str(x['Day']), axis=1)
weekly_sales_test[['Id','Weekly_Sales']].to_csv('Submissions/Submission_v16_All_VAll.csv', index=False);
Here you can find the WMAE score of all models.
Models | Private Score | Public Score |
---|---|---|
Random Forest | 3089.84 | 2921.30448 |
Average of 2 Random Forests | 3041.27575 | 2892.53128 |
Best Model of 2 Random Forests | 3021.37934 | 2881.91710 |
XGBoost | 3269.41583 | 3065.33341 |
Average of XGBoost + Random Forest | 3011.23272 | 2875.34598 |
ExtraTrees | 3123.9821 | 2946.12504 |
Average of Random Forest + XGBoost + ExtraTrees | 2983.59973 | 2839.19310 |
Prophet | 3073.71809 | 2966.65869 |
Average of all Models | 2770.29707 | 2639.92581 |
Average of Random Forest + Prophet | 2736.33345 | 2627.53877 |
The best (lowest) WMAE comes from the Average of Random Forest + Prophet, followed by the Average of all Models.
- I want to try some deep learning based models to improve the score further.
- I would also like to try some statistical models like ARIMA.
Please find the GitHub link of the project.
Thank you for your attention and for reading my work. If you liked this story, share it with your friends and colleagues! Also, follow me on LinkedIn.
I want to thank all the team members of the Applied AI course, especially Srikanth sir. Please visit https://www.appliedaicourse.com/ for more information.