# Walmart Recruiting - Store Sales Forecasting

### <i>Use historical markdown data to predict store sales</i>

One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

<img src='https://logos-download.com/wp-content/uploads/2016/02/Walmart_logo_transparent_png_blue.png' width='600px' />

In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.

Want to work in a great environment with some of the world's largest data sets? This is a chance to display your modeling mettle to the Walmart hiring teams.

## Data Briefing
<hr>
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
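
The five-fold holiday weighting described above can be sketched as a weighted mean absolute error. This is a minimal sketch, not a confirmed scoring formula: the brief only states that holiday weeks are weighted five times higher, so the normalization by the sum of weights, the function name `weighted_mae`, and the toy values are all assumptions made for illustration.

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Mean absolute error in which holiday weeks count `holiday_weight` times.

    Assumption: errors are normalized by the sum of the weights.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Weight 5 for holiday weeks, 1 for all other weeks.
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy values, not taken from the dataset.
score = weighted_mae([100.0, 200.0], [110.0, 180.0], [False, True])
print(score)
```

With this weighting, an error of 20 in a holiday week contributes as much as an error of 100 in an ordinary week, so a model that misses holiday peaks is penalized heavily.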

##### stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

##### train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week

##### test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

##### features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

- Store - the store number
- Date - the week
- Temperature - average temperature in the region
- Fuel_Price - cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

<br>
<b>
<li>Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13</li>
<li>Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13</li>
<li>Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13</li>
<li>Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13</li>
</b>
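
One way the files described above could be combined is by joining on their shared keys: `Store` and `Date` for features.csv, `Store` for stores.csv. The sketch below uses small in-memory frames instead of the real files. The `Type` and `Size` column names for stores.csv and every numeric value other than the two sales figures are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv, features.csv and stores.csv.
train = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [24924.50, 46039.49],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [40.0, 38.0],                  # made-up values
    "MarkDown1": [float("nan"), float("nan")],    # NA before Nov 2011
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})  # assumed columns

# IsHoliday appears in both train and features; drop one copy so the merge
# does not create IsHoliday_x / IsHoliday_y columns.
merged = (
    train
    .merge(features.drop(columns="IsHoliday"), on=["Store", "Date"], how="left")
    .merge(stores, on="Store", how="left")
)
print(merged)
```

Left joins keep every training row even when a store or week has no matching feature record, which matters because the markdown columns are incomplete.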

<hr>

In [1]:
# Imported required packages for analysis and model creation.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import VarianceThreshold, RFECV, SelectKBest, SelectFromModel, RFE
from sklearn.metrics import mean_absolute_error

%matplotlib inline

In [2]:
import os
os.chdir("/Users/ashisht/Documents/NLP/wallmart_sales/")

In [3]:
# Load the training data with pandas.
train_df = pd.read_csv('train.csv')

In [4]:
# One glance at imported data set
train_df.head(10)

|   | Store | Dept | Date | Weekly_Sales | IsHoliday |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | False |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | False |
| 5 | 1 | 1 | 2010-03-12 | 21043.39 | False |
| 6 | 1 | 1 | 2010-03-19 | 22136.64 | False |
| 7 | 1 | 1 | 2010-03-26 | 26229.21 | False |
| 8 | 1 | 1 | 2010-04-02 | 57258.43 | False |
| 9 | 1 | 1 | 2010-04-09 | 42960.91 | False |


## <u>Data Inspection</u> 

In [5]:
train_df.shape

(421570, 5)

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  bool   
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB
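
The `info()` output above shows `Date` stored as `object`, meaning plain strings. A minimal sketch of converting it to a datetime column and deriving calendar parts follows. It uses a toy frame rather than `train_df`, and the `Year` and `Month` column names are assumptions for illustration.

```python
import pandas as pd

# Toy frame mimicking the first two rows of train.csv.
sample = pd.DataFrame({
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [24924.50, 46039.49],
})

# Parse the ISO-formatted strings into datetime64 values.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Calendar parts become available through the .dt accessor.
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month
print(sample.dtypes)
```

A datetime column also allows date-based sorting, filtering, and resampling, which string dates do not support reliably.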


In [7]:
train_df.describe()

|   | Store | Dept | Weekly_Sales |
|---|---|---|---|
| count | 421570.0 | 421570.0 | 421570.0 |
| mean | 22.200546 | 44.260317 | 15981.258123 |
| std | 12.785297 | 30.492054 | 22711.183519 |
| min | 1.0 | 1.0 | -4988.94 |
| 25% | 11.0 | 18.0 | 2079.65 |
| 50% | 22.0 | 37.0 | 7612.03 |
| 75% | 33.0 | 74.0 | 20205.8525 |
| max | 45.0 | 99.0 | 693099.36 |
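
The summary above reports a minimum `Weekly_Sales` of -4988.94, so at least some records are negative. A short sketch of how such rows could be counted before deciding how to treat them is shown below, using a toy frame. Why the values are negative (for example, returns exceeding sales) is not stated in the data description, so no cause is assumed here.

```python
import pandas as pd

# Toy values; only -4988.94 and 2079.65 are taken from the summary above.
sample = pd.DataFrame({"Weekly_Sales": [24924.50, -4988.94, 0.0, 2079.65]})

# Boolean mask of rows with negative weekly sales.
negative_mask = sample["Weekly_Sales"] < 0

n_negative = int(negative_mask.sum())        # number of negative rows
share_negative = float(negative_mask.mean()) # fraction of all rows
print(n_negative, share_negative)
```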


## <u>Data Cleaning</u>

In [9]:
# 1. Drop duplicate records, if any exist; feeding the same rows to an ML model more than once is not good practice.
train_df.drop_duplicates(keep='first',inplace=True)

In [10]:
train_df.shape

(421570, 5)
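
The unchanged shape shows there are no fully identical rows. A stricter check, sketched here on a toy frame, looks for repeated (`Store`, `Dept`, `Date`) triplets, since each triplet is expected to identify one weekly record. Treating the triplet as a unique key is an assumption based on the data description, and the values are made up.

```python
import pandas as pd

# Toy frame: the first two rows share a key but differ in Weekly_Sales.
sample = pd.DataFrame({
    "Store": [1, 1, 1],
    "Dept": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-05"],
    "Weekly_Sales": [100.0, 120.0, 50.0],
})

key_cols = ["Store", "Dept", "Date"]

# keep=False marks every row that shares its key with another row.
dup_keys = sample.duplicated(subset=key_cols, keep=False)
n_dup_key_rows = int(dup_keys.sum())

# Full-row de-duplication does not catch these, because the sales differ.
n_after_full_dedup = len(sample.drop_duplicates())
print(n_dup_key_rows, n_after_full_dedup)
```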

In [11]:
def check_data_missing_columns(dataframe):
    # Collect the names of columns in the given dataframe that contain missing values.
    data_missing_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    if len(data_missing_columns) == 0:
        print("No Data missing from Dataset.")
    else:
        print("Data missing in columns : ",",".join(data_missing_columns))
        
check_data_missing_columns(train_df)

No Data missing from Dataset.


## <u>Exploratory Data Analysis</u>