# SID Dataset

For our experiment, we use a relatively simple and clean dataset containing five years of daily store-item sales data. The same dataset was used in the Store Item Demand Forecasting Challenge on Kaggle. In this notebook, we preprocess the raw data for our experiment.

## Install and import packages

In [None]:
# pip install numpy pandas tsfresh

In [1]:
import numpy as np
import pandas as pd
from datetime import date
import calendar
from utilities import add_lag_features, day_to_string, month_to_string
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from tsfresh.utilities.dataframe_functions import roll_time_series


## Get data

In [13]:
# load data
data = pd.read_csv("raw/SID_dataset.csv")

In [14]:
# look at data
data

Unnamed: 0,date,store,item,sales
0,2013-01-01,1,1,13
1,2013-01-02,1,1,11
2,2013-01-03,1,1,14
3,2013-01-04,1,1,13
4,2013-01-05,1,1,10
...,...,...,...,...
912995,2017-12-27,10,50,63
912996,2017-12-28,10,50,59
912997,2017-12-29,10,50,74
912998,2017-12-30,10,50,62


In [16]:
# rename the target column from "sales" to "demand"
data.rename(columns={"sales": "demand"}, inplace=True)

In [17]:
# split into features X and target y
X = data.drop(columns=["demand"])
y = data[["demand"]]

## Calendar features

In [18]:
# convert the date column to datetime
X['date'] = pd.to_datetime(X['date'], format='%Y-%m-%d')

In [19]:
# create columns for month, weekday (1 = Monday), and year
X['month'] = X['date'].dt.month
X["weekday"] = X["date"].dt.dayofweek+1
X['year'] = X['date'].dt.year

In [20]:
# map numeric month and weekday to string labels (e.g. JAN, TUE)
X['month'] = X['month'].apply(month_to_string)
X['weekday'] = X['weekday'].apply(day_to_string)

In [21]:
X

Unnamed: 0,date,store,item,month,weekday,year
0,2013-01-01,1,1,JAN,TUE,2013
1,2013-01-02,1,1,JAN,WED,2013
2,2013-01-03,1,1,JAN,THU,2013
3,2013-01-04,1,1,JAN,FRI,2013
4,2013-01-05,1,1,JAN,SAT,2013
...,...,...,...,...,...,...
912995,2017-12-27,10,50,DEC,WED,2017
912996,2017-12-28,10,50,DEC,THU,2017
912997,2017-12-29,10,50,DEC,FRI,2017
912998,2017-12-30,10,50,DEC,SAT,2017
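
The helpers `month_to_string` and `day_to_string` come from the local `utilities` module, which is not shown in this notebook. Judging only from the output above (month 1 becomes `JAN`, and `dayofweek + 1 = 2` becomes `TUE`), they might look roughly like the following sketch. The lookup tuples `MONTHS` and `WEEKDAYS` are assumptions for illustration, not the actual implementation.

```python
from datetime import date

# Assumed label sets, inferred from the notebook output (JAN, DEC, TUE, ...).
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
WEEKDAYS = ("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")


def month_to_string(month: int) -> str:
    """Map a month number (1-12) to a three-letter label."""
    return MONTHS[month - 1]


def day_to_string(weekday: int) -> str:
    """Map a weekday number (1 = Monday ... 7 = Sunday) to a three-letter label."""
    return WEEKDAYS[weekday - 1]


# date.weekday() uses Monday = 0, the same convention as pandas' dt.dayofweek.
first_day = date(2013, 1, 1)
label = day_to_string(first_day.weekday() + 1)  # 'TUE', as in the first row above
```

Keeping the labels as strings makes the columns categorical rather than ordinal, so a later encoding step has to turn them into model inputs.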


## Add lag features

We add a number of lag features using the Python library tsfresh. The lag features contain basic statistics such as the median, mean, and standard deviation, computed over time windows of 7, 14, and 28 days.
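
To make the idea of windowed lag statistics concrete, here is a minimal pandas-only sketch on made-up data. It is not the `add_lag_features` helper used below, which relies on tsfresh and whose internals are not shown here. The function name `rolling_lag_stats`, the output column names, and the choice to exclude the current day from the window are all assumptions.

```python
import pandas as pd


def rolling_lag_stats(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Add mean, median, and std of the previous `window` days per store-item series."""
    out = df.sort_values(["store", "item", "date"]).copy()
    keys = [out["store"], out["item"]]
    # Shift by one day within each series so a row only sees strictly earlier demand.
    past = out.groupby(["store", "item"])["demand"].shift(1)
    grouped = past.groupby(keys)
    out[f"demand_mean_{window}"] = grouped.transform(lambda s: s.rolling(window).mean())
    out[f"demand_median_{window}"] = grouped.transform(lambda s: s.rolling(window).median())
    # ddof=0 gives the population standard deviation (NumPy's default convention).
    out[f"demand_std_{window}"] = grouped.transform(lambda s: s.rolling(window).std(ddof=0))
    return out


# Toy series for a single store-item pair; the demand values are invented.
toy = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=10, freq="D"),
    "store": 1,
    "item": 1,
    "demand": [13, 11, 14, 13, 10, 12, 10, 9, 12, 9],
})

result = rolling_lag_stats(toy, window=7)
print(result[["date", "demand", "demand_mean_7", "demand_median_7"]])
```

The first seven rows have no complete 7-day history and therefore contain missing values; a real pipeline has to drop or impute such rows.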

In [22]:
# set lag features
fc_parameters = MinimalFCParameters()

In [23]:
# remove the length feature
del fc_parameters['length']

In [24]:
# print all lag features
print("Lag features:", fc_parameters)

Lag features: {'sum_values': None, 'median': None, 'mean': None, 'standard_deviation': None, 'variance': None, 'root_mean_square': None, 'maximum': None, 'absolute_maximum': None, 'minimum': None}


In [25]:
# create lag features
X, y = add_lag_features(X=X, y=y, column_id=['item', 'store'], column_sort='date',
                        feature_dict=fc_parameters, time_windows=[(7, 7), (14, 14), (28, 28)])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  y["time"] = X["time"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  y["id"] = X["id"]
Rolling: 100%|██████████| 20/20 [02:30<00:00,  7.52s/it]
Feature Extraction: 100%|██████████| 20/20 [03:41<00:00, 11.05s/it]
Rolling: 100%|██████████| 20/20 [02:56<00:00,  8.83s/it]
Feature Extraction: 100%|██████████| 20/20 [03:53<00:00, 11.65s/it]
Rolling: 100%|██████████| 20/20 [03:00<00:00,  9.03s/it]
Feature Extraction: 100%|██████████| 20/20 [03:54<00:00, 11.71s/it]


In [26]:
# drop the raw date column now that the calendar and lag features are built
X.drop(columns=["date"], inplace=True)

## Save final data

In [28]:
X.to_csv("final/SID_data.csv.zip", index=False, compression="zip")
y.to_csv("final/SID_target.csv.zip", index=False, compression="zip")
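
As a quick sanity check, the zipped CSV files can be read back with `pd.read_csv`, which infers the compression from the `.zip` suffix by default. The sketch below shows the round trip on a small made-up frame written to a temporary directory; the file name `toy_data.csv.zip` and the values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame; the real notebook writes X and y instead.
frame = pd.DataFrame({"store": [1, 1], "item": [1, 2], "demand": [13, 11]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_data.csv.zip"
    # index=False keeps the row index out of the file, so no extra
    # unnamed index column appears when the file is read back.
    frame.to_csv(path, index=False, compression="zip")
    # compression="infer" is the default, so the suffix is enough here.
    restored = pd.read_csv(path)

print(restored.equals(frame))
```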