<div style="background-color: lightblue; color: black; padding: 20px; font-weight: bold; font-size: 20px;">Feature Engineering</div><br>

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">
In this notebook we develop features to improve the performance of our prediction model.
Several types of features are generated. <br><br>
First, simple calendar-based features are derived from the date. In addition, dummy variables are created for categorical variables. Since decision-tree-based models tend to work poorly with many dummy variables, we keep them to a bare minimum.<br><br>
Then, features that inform the algorithm about certain seasonal patterns (here: Christmas and Thanksgiving) are generated. In addition to these seasonality indicators, we use lag features, i.e. delayed versions of our target variable. This lets us process our time series data with algorithms that are not designed for time series.
</div>

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Import modules</div>

In [1]:
import pandas as pd
import numpy as np

import warnings
# suppress all warnings to keep the notebook output readable (this already covers RuntimeWarning)
warnings.filterwarnings("ignore")

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Reading file</div>

In [2]:
df = pd.read_pickle('data/data_combined_clean_2.pkl')

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Columns for Date</div>

In [3]:
# create month and calendar week columns
df['Month'] = df['Date'].dt.month.astype(int)
# '%U' is the week of the year with Sunday as the first day of the week (00-53)
df['Week'] = df['Date'].dt.strftime('%U').astype(int)

# change year column dtype
df['Year'] = df['Year'].astype(int)
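
The `%U` week number used above differs from the ISO calendar week around the turn of the year. The following is a minimal sketch on three hand-picked dates (toy data, not taken from the dataset) that compares both conventions. The names `week_u` and `week_iso` are illustrative only.

```python
import pandas as pd

# Three Fridays chosen for illustration: start of year, mid-winter, end of year.
dates = pd.Series(pd.to_datetime(["2010-01-01", "2010-02-05", "2010-12-31"]))

# '%U': weeks start on Sunday; days before the first Sunday belong to week 0.
week_u = dates.dt.strftime("%U").astype(int)

# ISO 8601: weeks start on Monday; week 1 is the week containing the first Thursday.
week_iso = dates.dt.isocalendar().week.astype(int)

comparison = pd.DataFrame({"Date": dates, "Week_U": week_u, "Week_ISO": week_iso})
print(comparison)
```

With `%U`, 1 January 2010 falls into week 0, whereas the ISO calendar assigns it to week 53 of the previous year. Either convention can work as a feature as long as it is applied consistently.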

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Engineering features</div>

Next, indicator columns are created that flag whether a given week is a holiday week. The candidate holidays are Christmas, Super Bowl, Labor Day and Thanksgiving.<br>
Dates:<br>
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12 <br>
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12 <br>
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12 <br>
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12 <br>



In [4]:
df['Date'] = pd.to_datetime(df['Date'])

# flag the Thanksgiving weeks (1 = holiday week, 0 = otherwise)
thanksgiving_dates = pd.to_datetime(['2010-11-26', '2011-11-25', '2012-11-23'])
df['Thanksgiv'] = df['Date'].isin(thanksgiving_dates).astype(int)

# flag the Christmas weeks (1 = holiday week, 0 = otherwise)
christmas_dates = pd.to_datetime(['2010-12-31', '2011-12-30', '2012-12-28'])
df['Christmas'] = df['Date'].isin(christmas_dates).astype(int)


We only develop features for Christmas and Thanksgiving. We also built features for Labor Day and Super Bowl, but left them out of the final version because they did not pass our cost-benefit considerations: they were time-consuming to calculate and did not improve the model much.
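
As a rough illustration of how such holiday flags can be produced from the dates listed above, the sketch below drives the flags from a dictionary. The helper `add_holiday_flags`, the column names `SuperBowl` and `LaborDay`, and the toy frame are assumptions made for this example and are not part of the notebook.

```python
import pandas as pd

# Holiday week dates as listed above; 'SuperBowl' and 'LaborDay' are hypothetical column names.
HOLIDAY_WEEKS = {
    "SuperBowl": ["2010-02-12", "2011-02-11", "2012-02-10"],
    "LaborDay": ["2010-09-10", "2011-09-09", "2012-09-07"],
    "Thanksgiv": ["2010-11-26", "2011-11-25", "2012-11-23"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28"],
}


def add_holiday_flags(frame, holiday_weeks, date_col="Date"):
    """Return a copy of `frame` with one 0/1 column per holiday (hypothetical helper)."""
    out = frame.copy()
    for name, dates in holiday_weeks.items():
        # isin compares every row against the holiday dates in one vectorized step
        out[name] = out[date_col].isin(pd.to_datetime(dates)).astype(int)
    return out


# Toy frame with one ordinary week and two holiday weeks
toy = pd.DataFrame({"Date": pd.to_datetime(["2010-11-19", "2010-11-26", "2010-12-31"])})
flagged = add_holiday_flags(toy, HOLIDAY_WEEKS)
print(flagged)
```

Keeping the dates in one mapping makes it easy to switch individual holidays on or off when checking whether a flag is worth keeping.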

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Feature engineering with shifting</div>

In [5]:
df = df.sort_values(by=['Store', 'Dept', 'Date'])

In [6]:
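# Lag features: Weekly_Sales shifted by 1, 2, 4 and 52 rows (= weeks, if a series has no gaps).
# Shifting within each (Store, Dept) group keeps values from leaking across series boundaries.
# Note: despite the column names, these are plain lags, not rolling averages.
sales_by_series = df.groupby(['Store', 'Dept'])['Weekly_Sales']
df['Shifted_Rolling_Avg_4'] = sales_by_series.shift(periods=1)
df['Shifted_Rolling_Avg_5'] = sales_by_series.shift(periods=2)
df['Shifted_Rolling_Avg_6'] = sales_by_series.shift(periods=4)
df['Shifted_Rolling_Avg_7'] = sales_by_series.shift(periods=52)

# drop rows with missing values, e.g. rows without a complete lag history
df = df.dropna()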
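
When several stores and departments are stacked in one frame, it matters whether a lag is computed over the whole column or within each series. The following minimal sketch uses toy data (two stores with three weeks each, values invented for illustration) to compare a plain `shift` with a grouped `shift`. The column names `lag_1_plain` and `lag_1_grouped` are illustrative only.

```python
import pandas as pd

# Toy panel: two stores, one department, three weeks each (invented values)
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19"] * 2),
    "Weekly_Sales": [10.0, 11.0, 12.0, 100.0, 110.0, 120.0],
}).sort_values(["Store", "Dept", "Date"])

# Plain shift: the first week of store 2 receives the last value of store 1
toy["lag_1_plain"] = toy["Weekly_Sales"].shift(1)

# Grouped shift: each (Store, Dept) series starts with a missing value instead
toy["lag_1_grouped"] = toy.groupby(["Store", "Dept"])["Weekly_Sales"].shift(1)

print(toy)
```

In the plain variant, the first row of store 2 carries the value 12.0 from store 1. In the grouped variant that row is `NaN` and would be removed by a subsequent `dropna()`.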

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Creating dummies for the holiday and type columns</div>

In [7]:
df = pd.get_dummies(df, columns=['IsHoliday'], prefix='IsHoliday', drop_first=True, dtype=int)
df = pd.get_dummies(df, columns=['Type'], prefix='Type', drop_first=True, dtype=int)
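
To show what `drop_first=True` does, here is a minimal sketch on a toy frame. The category values `A`, `B`, `C` and the variable names are assumptions for illustration. For each encoded column, the first category becomes the implicit baseline and gets no dummy column of its own.

```python
import pandas as pd

# Toy frame with a three-level categorical and a boolean column (invented values)
toy = pd.DataFrame({
    "Type": ["A", "B", "C", "A"],
    "IsHoliday": [False, True, False, False],
})

# drop_first=True removes the first level of each column ('False' and 'A' here)
encoded = pd.get_dummies(toy, columns=["IsHoliday", "Type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in both `Type_B` and `Type_C` therefore represents the baseline category `A`, so no information is lost while one column per variable is saved.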

<div style="background-color: lightblue; color: black; padding: 10px; font-weight: bold; font-size: 15px;">Saving the new file</div>

In [12]:
df.to_pickle('data/data_combined_clean_features_9.pkl')