# M5 Dataset

For our experiment, we use hierarchical sales data from Walmart. The dataset was previously used in the M5 forecasting competition on Kaggle. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it contains explanatory variables such as price, promotions, day of the week, and special events.
In this notebook, we preprocess the original data for our experiment. 

## Install and import packages

In [None]:
# pip install pandas numpy pyreadr tsfresh

In [None]:
import pandas as pd
import numpy as np
import pyreadr

from utilities import add_lag_features, day_to_string, month_to_string
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from tsfresh.utilities.dataframe_functions import roll_time_series

## Get data

In [None]:
# read dataset
result = pyreadr.read_r('raw/M5_dataset.Rdata')

In [None]:
# check available information
result.keys()

In [None]:
# we only consider the historical sales data in the train and test sets as well as the calendar data
calendar = result['calendar']
trainset = result['trainset']
testset = result['testset']

In [None]:
# look at trainset
trainset

In [None]:
# look at testset
testset

In [None]:
# concatenate the train and test sets column-wise to have sales data for all 1969 days
data = pd.concat([trainset, testset.iloc[:,5:]], axis=1)

In [None]:
# there are 10 different stores
data["store_id"].unique()

In [None]:
# we need to reshape the data from wide to long format
data["id"] = data.index
data = pd.wide_to_long(data, stubnames='d_', i=['id'], j='day')
data = data.reset_index()
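
The reshape above can be illustrated on a toy table. This is a minimal sketch with made-up item names and values (not taken from the M5 data); it only shows how `pd.wide_to_long` turns the `d_1`, `d_2`, ... columns into a `day` column and a single value column named after the stub (`d_`).

```python
import pandas as pd

# Toy wide table: one row per series, one column per day (hypothetical values)
wide = pd.DataFrame({
    "item_id": ["FOODS_X_001", "FOODS_X_002"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [0, 5],
})
wide["id"] = wide.index

# stubnames="d_" matches d_1, d_2, ...; the numeric suffix becomes the "day" column
long_df = pd.wide_to_long(wide, stubnames="d_", i="id", j="day").reset_index()

print(long_df.sort_values(["id", "day"]))
```

Each original row is repeated once per day, so two series with three days each yield six rows.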

## Select only top 10 food items

For our experiment, we only consider products from the food category, since they are most relevant for the newsvendor problem due to their perishable nature. Since we want to avoid intermittent demand, we only consider the 10 products with the fewest zero sales in the time series. Given that there are 10 different stores, our final dataset will consist of 100 different time series. 

In [None]:
# Select only foods
data = data[data["cat_id"]=="FOODS"]

In [None]:
data = data.drop(["dept_id", "id", "cat_id"], axis=1)

In [None]:
data

In [None]:
data.rename(columns={'d_':'demand'}, inplace=True)

In [None]:
# get the id of the 10 products with the fewest zero sales
non_zero = data[data["demand"]!=0]
non_zero_agg = non_zero.groupby(['item_id'])
size = non_zero_agg.size()
size = size.sort_values(ascending=False)
size = pd.DataFrame(size).reset_index()
top_products = size.head(10)["item_id"]

In [None]:
# select top 10 products 
data = data[data["item_id"].isin(top_products)]
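
The selection counts non-zero rows per item, which matches "fewest zero sales" as long as every item has the same number of rows. As a cross-check, the following sketch counts zeros directly on a toy frame; the item labels and demand values are invented for illustration.

```python
import pandas as pd

# Toy long-format demand data (hypothetical items and values)
toy = pd.DataFrame({
    "item_id": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "demand":  [1, 2, 0, 3, 0, 0, 1, 0, 4, 1, 2, 5],
})

# Count zero-demand rows per item and keep the n items with the fewest zeros
zero_counts = (toy["demand"] == 0).groupby(toy["item_id"]).sum()
top_items = zero_counts.nsmallest(2).index.tolist()

print(zero_counts)
print(top_items)
```

Counting zeros directly stays correct even if some items have fewer observations than others.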

## Add calendar features

In [None]:
# look at calendar data
calendar

In [None]:
calendar.reset_index(inplace=True)

In [None]:
# shift the 0-based row index to a 1-based day number so it lines up with the 'day' column
calendar['index'] = calendar['index']+1

### Encode events as one-hot indicators

In [None]:
# add indicator if there is an event on that day
calendar['is_event'] = calendar['event_name_1'].notnull().astype(int)

In [None]:
# add one indicator per event type (a day can carry up to two events)
calendar['is_sporting_event'] = ((calendar.event_type_1=='Sporting')|(calendar.event_type_2=='Sporting')).astype(int)
calendar['is_cultural_event'] = ((calendar.event_type_1=='Cultural')|(calendar.event_type_2=='Cultural')).astype(int)
calendar['is_national_event'] = ((calendar.event_type_1=='National')|(calendar.event_type_2=='National')).astype(int)
calendar['is_religious_event'] = ((calendar.event_type_1=='Religious')|(calendar.event_type_2=='Religious')).astype(int)
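
The four indicator columns follow the same pattern, so they could also be built in a loop. The sketch below does this on a toy calendar with invented rows; the column names mirror the ones used above, and the helper variables (`event_types`, `type_cols`) are illustrative.

```python
import pandas as pd

# Toy calendar with up to two event types per day (hypothetical rows)
cal = pd.DataFrame({
    "event_type_1": ["Sporting", None, "Cultural", "Religious"],
    "event_type_2": [None, None, "Religious", None],
})

event_types = ["Sporting", "Cultural", "National", "Religious"]
type_cols = cal[["event_type_1", "event_type_2"]]

for event_type in event_types:
    # 1 if either event column carries this type on that day, else 0
    cal[f"is_{event_type.lower()}_event"] = (
        type_cols.eq(event_type).any(axis=1).astype(int)
    )

print(cal)
```

Days without any event get 0 in every indicator column, and a day with two events of different types gets two 1s.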

### Merge 

In [None]:
data = data.merge(calendar.loc[:,['index', 'date', 'weekday', 'wday', 'month', 'year', 'is_sporting_event',
       'is_cultural_event', 'is_national_event', 'is_religious_event']], left_on='day', right_on='index')
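
Because the merge uses the default inner join, sales rows without a matching calendar day would be dropped silently. One way to guard against this, sketched here on toy frames with invented values, is a left join combined with pandas' `validate` argument and an explicit check for unmatched rows.

```python
import pandas as pd

# Toy sales and calendar frames (hypothetical values)
sales = pd.DataFrame({
    "day": [1, 1, 2, 3],
    "item_id": ["A", "B", "A", "A"],
    "demand": [3, 0, 1, 2],
})
cal = pd.DataFrame({"index": [1, 2, 3], "wday": [1, 2, 3], "month": [1, 1, 1]})

# validate="many_to_one" raises if a calendar day appears more than once
merged = sales.merge(cal, left_on="day", right_on="index",
                     how="left", validate="many_to_one")

# With a left join, unmatched sales rows show up as NaN instead of disappearing
unmatched = int(merged["wday"].isna().sum())
print(len(sales), len(merged), unmatched)
```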

### Pre-processing

In [None]:
# select columns
data = data[["index", "wday", "month", "year", "item_id", "store_id", "state_id", "is_sporting_event", "is_cultural_event", "is_national_event", "is_religious_event", "demand"]]

In [None]:
data

In [None]:
data = data.rename(columns={"wday": "weekday"})

In [None]:
data['month'] = data['month'].apply(month_to_string)
data['weekday'] = data['weekday'].apply(day_to_string)

In [None]:
data.rename(columns={"item_id": "item", "store_id": "store"}, inplace=True)

In [None]:
data

## Add SNAP feature

SNAP (Supplemental Nutrition Assistance Program) provides benefits that low-income Americans can spend on food. Benefits are paid out on different days depending on the state and other local factors.

In [None]:
snap = pd.concat([pd.DataFrame({'is_snap_day':calendar.snap_CA, 'state':'CA'}), 
                 pd.DataFrame({'is_snap_day':calendar.snap_TX, 'state':'TX'}), 
                 pd.DataFrame({'is_snap_day':calendar.snap_WI, 'state':'WI'})])

snap.reset_index(inplace=True)
snap['index'] = snap['index']+1

In [None]:
snap

In [None]:
data = data.merge(snap, left_on=['state_id', 'index'], right_on=['state', 'index'])

In [None]:
data.drop(columns=['state_id', 'state'], inplace=True)
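
The per-state SNAP columns can also be stacked with `DataFrame.melt` instead of three concatenated frames. The following is a sketch on a toy calendar with invented flags, assuming the day number is already stored in an `index` column as above.

```python
import pandas as pd

# Toy calendar with one SNAP flag per state (hypothetical values)
cal = pd.DataFrame({
    "index": [1, 2, 3],
    "snap_CA": [1, 0, 0],
    "snap_TX": [0, 1, 0],
    "snap_WI": [0, 0, 1],
})

# Stack the three state columns into (index, state, is_snap_day) rows
snap_long = cal.melt(id_vars="index",
                     value_vars=["snap_CA", "snap_TX", "snap_WI"],
                     var_name="state", value_name="is_snap_day")
snap_long["state"] = snap_long["state"].str.replace("snap_", "", regex=False)

# Toy sales rows keyed by day and state
sales = pd.DataFrame({
    "index": [1, 2, 3, 2],
    "state_id": ["CA", "CA", "WI", "TX"],
    "demand": [5, 2, 7, 1],
})
out = sales.merge(snap_long, left_on=["state_id", "index"],
                  right_on=["state", "index"], how="left")
print(out)
```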

## Add lag features

We add a number of lag features using the Python library tsfresh. The lag features contain basic statistics such as the median, mean, and standard deviation, computed over time windows of 7, 14, and 28 days.

In [None]:
# split into X and y
y = pd.DataFrame(data['demand'])
X = data.drop(columns=['demand'])

In [None]:
# set lag features
fc_parameters = MinimalFCParameters()

In [None]:
# delete length features
del fc_parameters['length']

In [None]:
# print all lag features
print("Lag features:", fc_parameters)

In [None]:
# create lag features
X, y = add_lag_features(X=X, y=y, column_id=['item', 'store'], column_sort='index',
                        feature_dict=fc_parameters, time_windows=[(7, 7), (14, 14), (28, 28)])

In [None]:
X.drop(columns=["index"],inplace=True)
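
The helper `add_lag_features` comes from the local `utilities` module and is not shown in this notebook, so its exact behavior (including how the `(7, 7)`-style tuples are interpreted) is not specified here. As a rough, pandas-only illustration of what rolling lag statistics per series look like, the sketch below computes the mean, median, and standard deviation over a short toy window. It is not the tsfresh-based implementation, and all names and values are assumptions.

```python
import pandas as pd

# Toy long-format data with two series (hypothetical values)
df = pd.DataFrame({
    "item": ["A"] * 5 + ["B"] * 5,
    "store": ["CA_1"] * 10,
    "index": list(range(1, 6)) * 2,
    "demand": [1, 2, 3, 4, 5, 10, 10, 10, 10, 10],
}).sort_values(["item", "store", "index"])

window = 3  # toy window; the notebook uses 7, 14, and 28
grouped = df.groupby(["item", "store"])["demand"]

for stat in ["mean", "median", "std"]:
    # shift(1) keeps the current day's demand out of its own features
    df[f"demand_lag{window}_{stat}"] = grouped.transform(
        lambda s, stat=stat: s.shift(1).rolling(window).agg(stat)
    )

print(df)
```

The first `window` rows of each series have no complete history and therefore contain `NaN`. Note that pandas' `std` uses the sample estimator (`ddof=1`) by default.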

## Save final data

In [None]:
X.to_csv("final/m5_data.csv.zip", index=False, compression="zip")
y.to_csv("final/m5_target.csv.zip", index=False, compression="zip")
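
To confirm that the zipped CSV files can be read back, a round trip like the following can be used. This sketch writes a toy frame with invented values to a temporary directory rather than to `final/`. Note that `to_csv` does not create missing directories, so `final/` has to exist before the cell above is run.

```python
import os
import tempfile

import pandas as pd

# Toy feature frame (hypothetical values)
X_toy = pd.DataFrame({
    "item": ["A", "B"],
    "weekday": ["Monday", "Tuesday"],
    "is_snap_day": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "m5_data.csv.zip")
    X_toy.to_csv(path, index=False, compression="zip")
    # read_csv infers zip compression from the file extension
    X_back = pd.read_csv(path)

print(X_back)
```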