## Goal

We are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and we are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

## Evaluation Metric

This competition is evaluated on the weighted mean absolute error (WMAE):

$\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \lvert y_i - \hat{y}_i \rvert$

where:

$n$ is the number of rows<br>
$\hat{y}_i$ is the predicted sales<br>
$y_i$ is the actual sales<br>
$w_i$ is the weight: $w_i = 5$ if the week is a holiday week, 1 otherwise
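
The metric can be sketched in a few lines of NumPy. This is a minimal illustration rather than the official scoring code; the function name `wmae` and the `holiday_weight` parameter are assumptions introduced here.

```python
import numpy as np


def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error with holiday weeks weighted more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # w_i = holiday_weight for holiday weeks, 1 otherwise
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))


# Toy example: three weeks, the second one is a holiday week
score = wmae([100.0, 200.0, 300.0], [110.0, 180.0, 300.0], [False, True, False])
print(score)
```

In the toy example the absolute errors are 10, 20, and 0 with weights 1, 5, and 1, so the score is (10 + 100 + 0) / 7 = 110/7, roughly 15.71. The holiday-week error dominates even though only one of the three weeks is a holiday.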


## Read the Data

In [1]:
# import libraries
import os
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns 

# display settings & code formatting
pd.options.display.max_columns = 999
%matplotlib inline
%load_ext nb_black

# project paths
project_root_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)

data_path = os.path.join(project_root_dir, "data")
os.makedirs(data_path, exist_ok=True)

image_path = os.path.join(project_root_dir, "images")
os.makedirs(image_path, exist_ok=True)

# function for loading data
def load_data(filename, data_path=data_path):
    csv_path = os.path.join(data_path, filename)
    return pd.read_csv(csv_path)

# function for saving data as csv file
def save_dataframe(df, filename, file_path=data_path):
    path = os.path.join(file_path, filename)
    df.to_csv(path, index=False)



In [2]:
train = load_data("train.csv")
test = load_data("test.csv")
stores = load_data("stores.csv")
features = load_data("features.csv")
sample_submission = load_data("sampleSubmission.csv")


## Training Data

In [3]:
train.head()

| | Store | Dept | Date | Weekly_Sales | IsHoliday |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | False |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | False |



In [4]:
def describe_date(date):
    """
    Print the earliest and latest dates in a pandas date column,
    along with the number of months and days between them.
    """
    min_date = date.min()
    max_date = date.max()
    # Parse the two endpoints once instead of on every use
    start, end = pd.to_datetime(min_date), pd.to_datetime(max_date)
    total_months = (end.year - start.year) * 12 + (end.month - start.month)
    total_days = (end - start).days
    print("--------")
    print("Min date:", min_date)
    print("Max date:", max_date)
    print("Total Months:", total_months)
    print("Total Days:", total_days)


In [5]:
print("The Training set contains weekly sales data:")
describe_date(train["Date"])

The Training set contains weekly sales data:
--------
Min date: 2010-02-05
Max date: 2012-10-26
Total Months: 32
Total Days: 994



### Feature Descriptions

* Store - the store number
* Dept - the department number
* Date - the week
* Weekly_Sales - sales for the given department in the given store
* IsHoliday - whether the week is a special holiday week

## Test Data

In [6]:
test.head()

| | Store | Dept | Date | IsHoliday |
|---|---|---|---|---|
| 0 | 1 | 1 | 2012-11-02 | False |
| 1 | 1 | 1 | 2012-11-09 | False |
| 2 | 1 | 1 | 2012-11-16 | False |
| 3 | 1 | 1 | 2012-11-23 | True |
| 4 | 1 | 1 | 2012-11-30 | False |



In [7]:
print("The Test set")
describe_date(test["Date"])

The Test set
--------
Min date: 2012-11-02
Max date: 2013-07-26
Total Months: 8
Total Days: 266



### Feature Descriptions

This file is identical to train.csv, except the weekly sales are withheld. We must predict the sales for each triplet of store, department, and date in this file.

## Stores Data

In [8]:
stores.head()

| | Store | Type | Size |
|---|---|---|---|
| 0 | 1 | A | 151315 |
| 1 | 2 | A | 202307 |
| 2 | 3 | B | 37392 |
| 3 | 4 | A | 205863 |
| 4 | 5 | B | 34875 |



This file contains anonymized information about the 45 stores, indicating the type and size of each store.

In [9]:
stores["Type"].value_counts()

A    22
B    17
C     6
Name: Type, dtype: int64


In [10]:
stores["Store"].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45])


## Features Data

In [11]:
features.head()

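| | Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False |
| 1 | 1 | 2010-02-12 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.24217 | 8.106 | True |
| 2 | 1 | 2010-02-19 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False |
| 3 | 1 | 2010-02-26 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False |
| 4 | 1 | 2010-03-05 | 46.5 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False |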



In [12]:
print("Features set:")
describe_date(features["Date"])

Features set:
--------
Min date: 2010-02-05
Max date: 2013-07-26
Total Months: 41
Total Days: 1267



This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

* Store - the store number
* Date - the week
* Temperature - average temperature in the region
* Fuel_Price - cost of fuel in the region
* MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
* CPI - the consumer price index
* Unemployment - the unemployment rate
* IsHoliday - whether the week is a special holiday week
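
The note about incomplete MarkDown data can be quantified by measuring the share of missing values per column. The sketch below is one way to do that; `markdown_missing_share` is a hypothetical helper, and the toy frame holds made-up values that only mimic the layout of features.csv. In this notebook the helper would be called on `features` instead.

```python
import numpy as np
import pandas as pd


def markdown_missing_share(df):
    """Return the fraction of missing values in each MarkDown column."""
    markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
    return df[markdown_cols].isna().mean()


# Made-up rows for illustration only, not taken from features.csv
toy_features = pd.DataFrame(
    {
        "Store": [1, 1, 1, 1],
        "Date": ["2011-10-28", "2011-11-04", "2011-11-11", "2011-11-18"],
        "MarkDown1": [np.nan, np.nan, 100.0, 250.0],
        "MarkDown2": [np.nan, np.nan, 50.0, np.nan],
    }
)
missing_share = markdown_missing_share(toy_features)
print(missing_share)
```

A high missing share before November 2011 is expected from the description above, so any imputation strategy probably has to treat the early period differently from the later one.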

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

* Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
* Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
* Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
* Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
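
Since `IsHoliday` only says whether a week is a holiday week, not which one, the dates listed above can be turned into a lookup. The following is a minimal sketch under that assumption; `HOLIDAY_WEEKS`, `DATE_TO_HOLIDAY`, `holiday_name`, and the "Not a holiday" label are names introduced here for illustration.

```python
import pandas as pd

# Holiday weeks copied from the list above, written as ISO dates
HOLIDAY_WEEKS = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}

# Invert the mapping into a date -> holiday name lookup
DATE_TO_HOLIDAY = {
    pd.Timestamp(d): name for name, dates in HOLIDAY_WEEKS.items() for d in dates
}


def holiday_name(dates):
    """Map dates to a holiday name, or 'Not a holiday' for ordinary weeks."""
    parsed = pd.to_datetime(pd.Series(dates))
    return parsed.map(DATE_TO_HOLIDAY).fillna("Not a holiday")


labels = holiday_name(["2010-02-05", "2010-02-12", "2012-11-23"])
print(labels.tolist())
```

Applied to a date column such as `train["Date"]`, this would yield a categorical feature that lets a model distinguish, for example, Thanksgiving weeks from Super Bowl weeks.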

In [13]:
features["IsHoliday"].value_counts()

False    7605
True      585
Name: IsHoliday, dtype: int64


In [15]:
features["MarkDown1"].sample(10)

2405         NaN
1422     9652.09
3123         NaN
2044         NaN
137      6352.30
5079    14572.76
7610        9.53
1818     7816.71
6858      576.00
6686         NaN
Name: MarkDown1, dtype: float64


In [16]:
features["MarkDown5"].sample(10)

850      1246.89
7644         NaN
954          NaN
6954         NaN
1525         NaN
2845     1659.22
1197     6266.04
6078         NaN
5605    12568.29
706      3526.18
Name: MarkDown5, dtype: float64


## Submission Format

For each row in the test set (a store, department, and date triplet), you should predict the weekly sales of that department. The Id column is formed by concatenating the Store, Dept, and Date with underscores, e.g. Store_Dept_2012-11-02. The file should have a header and look like the following:

In [17]:
sample_submission.head()

| | Id | Weekly_Sales |
|---|---|---|
| 0 | 1_1_2012-11-02 | 0 |
| 1 | 1_1_2012-11-09 | 0 |
| 2 | 1_1_2012-11-16 | 0 |
| 3 | 1_1_2012-11-23 | 0 |
| 4 | 1_1_2012-11-30 | 0 |



In [18]:
test.shape

(115064, 4)


In [19]:
sample_submission.shape

(115064, 2)
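
The matching row counts suggest a submission can be assembled directly from the test set. Below is a rough sketch of building the Id column in the Store_Dept_Date format; `build_submission` is a hypothetical helper, and it assumes `Date` is stored as an ISO-formatted string, as it is when read with `pd.read_csv` without date parsing.

```python
import pandas as pd


def build_submission(test_df, predictions):
    """Assemble a submission frame with Ids in the Store_Dept_Date format."""
    ids = (
        test_df["Store"].astype(str)
        + "_"
        + test_df["Dept"].astype(str)
        + "_"
        + test_df["Date"].astype(str)
    )
    # to_numpy() drops the index so Ids and predictions pair up by position
    return pd.DataFrame({"Id": ids.to_numpy(), "Weekly_Sales": predictions})


# First two rows of the test set shown earlier, with placeholder predictions
toy_test = pd.DataFrame(
    {
        "Store": [1, 1],
        "Dept": [1, 1],
        "Date": ["2012-11-02", "2012-11-09"],
        "IsHoliday": [False, False],
    }
)
submission = build_submission(toy_test, [0, 0])
print(submission)
```

On the full data, the result could be written out with the `save_dataframe` helper defined at the top of the notebook, after checking that its shape matches `sample_submission.shape`.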
