In [27]:
import numpy as np
import pandas as pd

# Data Collection

The bulk of our data comes from a Kaggle competition which we cite below:

    Alexis Cook, DanB, inversion, and Ryan Holbrook. Store Sales - Time Series Forecasting. 
    https://kaggle.com/competitions/store-sales-time-series-forecasting, 2021. Kaggle.
    

In [28]:
# Import the data
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
stores = pd.read_csv('../data/stores.csv')
transactions = pd.read_csv('../data/transactions.csv')
oil = pd.read_csv('../data/oil.csv')
holidays = pd.read_csv('../data/holidays_events.csv')

# Dictionary for all datasets
datasets = {'train':train, 'test':test, 'stores':stores, 'transactions':transactions, 'oil':oil, 'holidays':holidays}

# Convert dates to pandas Timestamp
for df in datasets.values():
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])

# Summary Statistics

In [29]:
def summary_stats(df_dict):
    """Print the columns, shape, and summary statistics of each DataFrame in df_dict.

    df_dict : dictionary mapping a dataset name to the DataFrame we want to summarize
    """
    for name, df in df_dict.items():
        summary = df.describe().transpose()              # basic stats for each numeric/datetime column
        summary.insert(0, 'dtype', df.dtypes)            # data type for each column
        summary.insert(1, '#null', df.isnull().sum())    # number of missing values in each column
        print(name, ":", df.columns.to_list())           # print list of features
        print("shape:", df.shape)                        # print shape of dataframe
        print(summary)                                   # print summary statistics
        print('------------------------------------------------------------', '\n')

In [30]:
summary_stats(datasets)

train : ['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion']
shape: (3000888, 6)
                      dtype  #null      count                           mean  \
id                    int64      0  3000888.0                      1500443.5   
date         datetime64[ns]      0    3000888  2015-04-24 08:27:04.703088384   
store_nbr             int64      0  3000888.0                           27.5   
sales               float64      0  3000888.0                     357.775749   
onpromotion           int64      0  3000888.0                        2.60277   

                             min                  25%                  50%  \
id                           0.0            750221.75            1500443.5   
date         2013-01-01 00:00:00  2014-02-26 18:00:00  2015-04-24 12:00:00   
store_nbr                    1.0                 14.0                 27.5   
sales                        0.0                  0.0                 11.0   
onpromotion                  0.0       

# Initial Feature List

|  train/test  |         |
|:--------|:--------|
`id`            |   an extra row index   
`date`          |   date
`store_nbr`     |   store number
`family`        |   family of good sold (e.g. beauty, food, etc.)
`sales`         |   quantity of items sold (may be float, e.g. 1.5 for 1.5kg of cheese sold)
`onpromotion`   |   number of items in family that had a promotion

|  stores  |         |
|:--------|:--------|
`store_nbr` |   store number
`city`      |   city store is located in
`state`     |   state store is located in
`type`      |   (?) guess: store type (?)
`cluster`   |   categorical variable grouping "similar" stores together

|  transactions  |         |
|:--------|:--------|
`date`              |   date
`store_nbr`         |   store number
`transactions`      |   (?) guess: number of transactions or amount of money earned (?)

|  oil  |         |
|:--------|:--------|
`date`          |   date
`dcoilwtico`    |   daily oil price (crude oil, West Texas Intermediate, Cushing, Oklahoma)

|  holidays  |         |
|:--------|:--------|
`date`          |   date of holiday
`type`          |   Categorical variable with values: Holiday, Transfer, Additional, Bridge, Work Day, or Event. <br> Holiday = holiday, day off <br> Transfer = holiday whose day off was moved <br> Additional = extra days off because of proximity to a big holiday <br> Bridge = extra day off for longer weekends or to "bridge over" between close holidays (e.g. Christmas and New Year's Eve) <br> Work Day = in 1-1 correspondence with Bridge; extra work day to offset the work days lost to a Bridge <br> Event = one-off special events (soccer championships, 2016 earthquake, Black Friday)
`locale`        |   Local, Regional, National
`locale_name`   |   name of town, city, state, or country where holiday applies (e.g. Ecuador for National holiday)
`description`   |   name of holiday
`transferred`   |   Boolean. If True, the "day off" for holiday was observed on a different day.


# Preliminary Analysis

### Feature Engineering

We agreed that we would break down the `date` data into smaller pieces (year, month, week, day, day_of_week).

These levels will likely aid in capturing different levels of seasonality in the data.
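
The decomposition described above could look roughly like the following. This is a minimal sketch on a toy frame rather than the final feature pipeline; the helper name `add_date_parts` and the frame `demo` are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for `train`; only the `date` column matters here.
demo = pd.DataFrame(
    {"date": pd.to_datetime(["2013-01-01", "2015-04-24", "2017-08-15"])}
)

def add_date_parts(df):
    """Return a copy of df with year, month, week, day and day_of_week columns."""
    out = df.copy()
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["week"] = out["date"].dt.isocalendar().week.astype(int)  # ISO week number
    out["day"] = out["date"].dt.day
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday=0, ..., Sunday=6
    return out

demo_parts = add_date_parts(demo)
print(demo_parts)
```

Note that the ISO week number can assign the first days of January to week 52 or 53 of the previous year, which may matter if `week` is later combined with `year`.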

### Training Set

We check that every value in the `id` column of the training set is unique, which is consistent with it being nothing but an extra row index that we can safely drop.

In [31]:
train.id.size == train.id.nunique()     # all the entries of train.id are different

True

### Oil Dataset

We notice below that the oil prices are missing on weekends.

In [32]:
# prints any dates in the oil dataset that land on a Saturday (day_of_week == 5) or Sunday (day_of_week == 6)
# nothing printed means oil prices are only recorded on weekdays
for date in oil.loc[oil.date.dt.dayofweek >= 5, 'date']:
    print(date)

The summary statistics also show that the oil dataset has 43 missing `dcoilwtico` values, shown below.

Checking the corresponding `date` for the 43 missing values, we see that virtually all of them are one-day gaps (the one exception is a two-day gap), so filling them by linear interpolation seems reasonable.

It makes sense to first add the weekend days to the oil dataframe, with the oil price set to NaN. When a one-day weekday gap sits next to a weekend, the result is a three-day gap (Fri-Sat-Sun or Sat-Sun-Mon). If we interpolate once after adding the weekend days, these gaps are filled in equal steps between the two neighbouring known values (Thu and Mon, or Fri and Tue, respectively).

In [33]:
oil[ oil['dcoilwtico'].isnull() ]

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
14,2013-01-21,
34,2013-02-18,
63,2013-03-29,
104,2013-05-27,
132,2013-07-04,
174,2013-09-02,
237,2013-11-28,
256,2013-12-25,
261,2014-01-01,
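
The plan above (add the missing calendar days, then interpolate once) might be sketched as follows. The frame `oil_demo` is a toy stand-in with invented prices, and this is one possible way to do it, not necessarily the cleaning code we end up using.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `oil`: weekdays only (Thu, Fri, Mon, Tue, Wed),
# with the Monday price missing. Prices are invented for illustration.
oil_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2013-01-03", "2013-01-04", "2013-01-07", "2013-01-08", "2013-01-09"]
        ),
        "dcoilwtico": [92.0, 93.0, np.nan, 95.0, 96.0],
    }
)

# Reindex onto every calendar day so weekends appear as NaN rows.
full_range = pd.date_range(oil_demo["date"].min(), oil_demo["date"].max(), freq="D")
oil_daily = oil_demo.set_index("date").reindex(full_range).rename_axis("date")

# A single linear interpolation fills the Sat-Sun-Mon gap in equal steps
# between the Friday and Tuesday values.
oil_daily["dcoilwtico"] = oil_daily["dcoilwtico"].interpolate(method="linear")
oil_daily = oil_daily.reset_index()
print(oil_daily)
```

With the default settings, `interpolate` only fills forward from the first known value, so a missing value in the very first row (as with 2013-01-01 in the real data) would stay NaN and needs separate handling, for example a backward fill.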


### Holiday Dataset

#### Holiday Time Span and Frequencies

One thing to note is that the holiday dataset starts in March 2012 and ends in December 2017. Deleting the holidays that fall outside the dates of the sales data is unnecessary during cleaning, since we will left or inner join along dates.
- Most annual holidays should therefore occur about 5 times.
- All holidays with 6 or 7 occurrences have an extra one coming from 2012.
- The two holidays with 7 occurrences have another extra one coming from improperly formatted transfer rows.
- All holidays with fewer than 5 occurrences are special (transfers, events, bridges). The exceptions are Black Friday and Cyber Monday, which are only on record in 2014, 2015, and 2016.

In [None]:
holidays['description'].value_counts().head(60)

description
Carnaval                                      10
Fundacion de Cuenca                            7
Fundacion de Ibarra                            7
Fundacion de Quito                             6
Provincializacion de Santo Domingo             6
Provincializacion Santa Elena                  6
Independencia de Guaranda                      6
Independencia de Latacunga                     6
Independencia de Ambato                        6
Fundacion de Quito-1                           6
Fundacion de Manta                             6
Dia de Difuntos                                6
Navidad-4                                      6
Cantonizacion de Salinas                       6
Navidad-3                                      6
Navidad-2                                      6
Navidad-1                                      6
Navidad                                        6
Navidad+1                                      6
Fundacion de Loja                              6
Independ

#### Bridge and Work Days

Holidays with `type` labelled "Bridge" (extra days off to "bridge" between weekends and other holidays) are in bijective correspondence with those labelled "Work Day" (extra work days to recover labor lost to bridge holidays).

To Do: Decide how to tag Bridge and Work Day holidays. For example, Work Day rows should perhaps be set to not-a-holiday: they are extra weekend work days, so we should not expect them to behave like holidays. Bridge holidays are days off, so perhaps these should remain holidays.

In [34]:
holidays[ holidays.type == 'Bridge']

Unnamed: 0,date,type,locale,locale_name,description,transferred
35,2012-12-24,Bridge,National,Ecuador,Puente Navidad,False
39,2012-12-31,Bridge,National,Ecuador,Puente Primer dia del ano,False
156,2014-12-26,Bridge,National,Ecuador,Puente Navidad,False
160,2015-01-02,Bridge,National,Ecuador,Puente Primer dia del ano,False
277,2016-11-04,Bridge,National,Ecuador,Puente Dia de Difuntos,False


In [35]:
holidays[ holidays.type == 'Work Day']

Unnamed: 0,date,type,locale,locale_name,description,transferred
42,2013-01-05,Work Day,National,Ecuador,Recupero puente Navidad,False
43,2013-01-12,Work Day,National,Ecuador,Recupero puente primer dia del ano,False
149,2014-12-20,Work Day,National,Ecuador,Recupero Puente Navidad,False
161,2015-01-10,Work Day,National,Ecuador,Recupero Puente Primer dia del ano,False
283,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False
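
One way to check the Bridge/Work Day correspondence programmatically is to strip the "Recupero" prefix and pair rows by description. The sketch below uses a handful of rows copied from the tables above; the helper `with_pair_key` and the pairing-by-occurrence rule are assumptions made for illustration, not an established part of our cleaning.

```python
import pandas as pd

rows = [
    ("2012-12-24", "Bridge", "Puente Navidad"),
    ("2012-12-31", "Bridge", "Puente Primer dia del ano"),
    ("2013-01-05", "Work Day", "Recupero puente Navidad"),
    ("2013-01-12", "Work Day", "Recupero puente primer dia del ano"),
    ("2014-12-20", "Work Day", "Recupero Puente Navidad"),
    ("2014-12-26", "Bridge", "Puente Navidad"),
]
hol_demo = pd.DataFrame(rows, columns=["date", "type", "description"])
hol_demo["date"] = pd.to_datetime(hol_demo["date"])

def with_pair_key(df, prefix=""):
    """Add a case-insensitive matching key and an occurrence counter per key."""
    out = df.sort_values("date").copy()
    key = out["description"].str.lower()
    if prefix:
        key = key.str.replace(prefix, "", regex=False)
    out["key"] = key
    out["occ"] = out.groupby("key").cumcount()  # 0 for first occurrence, 1 for second, ...
    return out[["date", "description", "key", "occ"]]

bridges = with_pair_key(hol_demo[hol_demo["type"] == "Bridge"])
work_days = with_pair_key(hol_demo[hol_demo["type"] == "Work Day"], prefix="recupero ")

# An outer merge with an indicator exposes any Bridge without a Work Day (or vice versa).
pairs = bridges.merge(
    work_days, on=["key", "occ"], how="outer",
    suffixes=("_bridge", "_work"), indicator=True,
)
pairs["gap_days"] = (pairs["date_work"] - pairs["date_bridge"]).dt.days
print(pairs[["date_bridge", "date_work", "gap_days", "_merge"]])
```

A negative `gap_days` means the recovery work day happened before the bridge, as in December 2014.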


#### Transfer Holidays

Transferred holidays are holidays that occur on a certain day, but remain work days. 

Their corresponding "day-off" is "transferred" to a different day to give people longer weekends/breaks.
- Transferred holidays have the `transferred` column set to True.
- The added "day-off" for the transferred holiday has its `transferred` column as False.
- With 2 exceptions, the added "day-off" for the transferred holiday has "Traslado" in its `description` (holiday name).
    - The exceptions are the transfer holidays in rows 303 and 328 (resp. Fundacion de Cuenca, Fundacion de Ibarra)
    - Their corresponding days-off occur in rows 304 and 329. Their description *should* read 'Traslado Fundacion de Cuenca' (for row 304) and 'Traslado Fundacion de Ibarra' (for row 329).

In [36]:
holidays[ holidays.transferred == True ]

Unnamed: 0,date,type,locale,locale_name,description,transferred
19,2012-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
72,2013-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
135,2014-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
255,2016-05-24,Holiday,National,Ecuador,Batalla de Pichincha,True
266,2016-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,True
268,2016-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,True
297,2017-01-01,Holiday,National,Ecuador,Primer dia del ano,True
303,2017-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,True
312,2017-05-24,Holiday,National,Ecuador,Batalla de Pichincha,True
324,2017-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,True


In [37]:
holidays[ holidays['description'].str.contains("Traslado")]

Unnamed: 0,date,type,locale,locale_name,description,transferred
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
73,2013-10-11,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
136,2014-10-10,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
256,2016-05-27,Transfer,National,Ecuador,Traslado Batalla de Pichincha,False
265,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False
269,2016-08-12,Transfer,National,Ecuador,Traslado Primer Grito de Independencia,False
298,2017-01-02,Transfer,National,Ecuador,Traslado Primer dia del ano,False
313,2017-05-26,Transfer,National,Ecuador,Traslado Batalla de Pichincha,False
325,2017-08-11,Transfer,National,Ecuador,Traslado Primer Grito de Independencia,False
342,2017-12-08,Transfer,Local,Quito,Traslado Fundacion de Quito,False


In [43]:
# finding the added day-off corresponding to the transfer holiday in row 303
holidays[ holidays['description'].str.contains("Fundacion de Cuenca")]
    # day off for row 303 seems to be in row 304

Unnamed: 0,date,type,locale,locale_name,description,transferred
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
48,2013-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
97,2014-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
167,2015-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
217,2016-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
303,2017-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,True
304,2017-04-13,Transfer,Local,Cuenca,Fundacion de Cuenca,False


In [15]:
# finding the added day-off corresponding to the transfer holiday in row 328
holidays[ holidays['description'].str.contains("Fundacion de Ibarra")]
    # day off for row 328 seems to be in row 329

Unnamed: 0,date,type,locale,locale_name,description,transferred
17,2012-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
70,2013-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
133,2014-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
188,2015-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
272,2016-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
328,2017-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,True
329,2017-09-29,Transfer,Local,Ibarra,Fundacion de Ibarra,False
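
The two inconsistent descriptions could be repaired with something like the following. It is a sketch on three rows copied from the tables above, and it assumes that every row with `type == 'Transfer'` is an added day off whose description should start with "Traslado".

```python
import pandas as pd

hol_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(["2017-04-12", "2017-04-13", "2017-05-26"]),
        "type": ["Holiday", "Transfer", "Transfer"],
        "description": [
            "Fundacion de Cuenca",
            "Fundacion de Cuenca",
            "Traslado Batalla de Pichincha",
        ],
        "transferred": [True, False, False],
    },
    index=[303, 304, 313],
)

# Transfer rows whose description lacks the "Traslado" prefix.
needs_prefix = (hol_demo["type"] == "Transfer") & ~hol_demo["description"].str.startswith("Traslado")

# Prepend the prefix only on those rows; all other rows are left untouched.
hol_demo.loc[needs_prefix, "description"] = "Traslado " + hol_demo.loc[needs_prefix, "description"]
print(hol_demo)
```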


#### Event Holidays

There are 56 event holidays:
- 31 come from the earthquake, 4/16/2016 to 5/16/2016 (description: 'Terremoto Manabi' or f'Terremoto Manabi+{day_count}').
- 14 come from the soccer World Cup in June/July 2014 (Ecuador was eliminated on June 25, the fourth event).
- 5 are the annual 'Dia de la Madre' holiday in early May.
- The remaining 6 are Black Friday and Cyber Monday in 2014, 2015, and 2016.

In [None]:
# number of earthquake holiday events
holidays[ holidays.description.str.contains('Terremoto') ].shape[0]

31

In [90]:
holidays[ (holidays.type == 'Event') & holidays.description.str.contains('Mundial') ].shape

(14, 6)

In [None]:
# mother's day holiday events
holidays[ (holidays.type == 'Event') & holidays.description.str.contains('Madre') ]

Unnamed: 0,date,type,locale,locale_name,description,transferred
55,2013-05-12,Event,National,Ecuador,Dia de la Madre,False
103,2014-05-11,Event,National,Ecuador,Dia de la Madre,False
172,2015-05-10,Event,National,Ecuador,Dia de la Madre,False
245,2016-05-08,Event,National,Ecuador,Dia de la Madre,False
311,2017-05-14,Event,National,Ecuador,Dia de la Madre,False


In [91]:
holidays[ holidays.type == 'Event']

Unnamed: 0,date,type,locale,locale_name,description,transferred
55,2013-05-12,Event,National,Ecuador,Dia de la Madre,False
103,2014-05-11,Event,National,Ecuador,Dia de la Madre,False
106,2014-06-12,Event,National,Ecuador,Inauguracion Mundial de futbol Brasil,False
107,2014-06-15,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Suiza,False
108,2014-06-20,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Honduras,False
113,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False
114,2014-06-28,Event,National,Ecuador,Mundial de futbol Brasil: Octavos de Final,False
115,2014-06-29,Event,National,Ecuador,Mundial de futbol Brasil: Octavos de Final,False
116,2014-06-30,Event,National,Ecuador,Mundial de futbol Brasil: Octavos de Final,False
117,2014-07-01,Event,National,Ecuador,Mundial de futbol Brasil: Octavos de Final,False


#### Additional Holidays

The 51 additional holidays are additional days off surrounding big holidays.
- 30 come from Christmas. Every Christmas gets 4 days off prior and one day off after ('Navidad-4', 'Navidad-3', 'Navidad-2', 'Navidad-1', 'Navidad+1').
- 5 come from the day before New Year's Day ('Primer dia del ano-1').
- 5 come from the day before Mother's Day in May ('Dia de la Madre-1').
- 6 come from 'Fundacion de Quito-1', a local Quito holiday (1 per year; the extra one comes from 2012).
- 4 come from 'Fundacion de Guayaquil-1' and 1 from a mislabelled 'Fundacion de Guayaquil' row; there are some labelling errors here:
    - (row 182) The 2015 'Fundacion de Guayaquil-1' is missing from the Additional rows because it is labelled Holiday; it should be labelled Additional.
    - (row 322) The 2017 'Fundacion de Guayaquil' holiday is labelled Additional; it should be labelled Holiday.
    - (row 266) The 2016 'Fundacion de Guayaquil' (on a Monday) was given the standard additional the day prior (Sunday) but also got *transferred* to Sunday, which is redundant. We can just drop row 264, which gets rid of the redundant Additional and keeps the transfer to Sunday.

In [130]:
holidays.loc[30:40]

Unnamed: 0,date,type,locale,locale_name,description,transferred
30,2012-12-08,Holiday,Local,Loja,Fundacion de Loja,False
31,2012-12-21,Additional,National,Ecuador,Navidad-4,False
32,2012-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
33,2012-12-22,Additional,National,Ecuador,Navidad-3,False
34,2012-12-23,Additional,National,Ecuador,Navidad-2,False
35,2012-12-24,Bridge,National,Ecuador,Puente Navidad,False
36,2012-12-24,Additional,National,Ecuador,Navidad-1,False
37,2012-12-25,Holiday,National,Ecuador,Navidad,False
38,2012-12-26,Additional,National,Ecuador,Navidad+1,False
39,2012-12-31,Bridge,National,Ecuador,Puente Primer dia del ano,False


In [129]:
holidays[ (holidays.type == 'Additional') & holidays.description.str.contains('Navidad') ]

Unnamed: 0,date,type,locale,locale_name,description,transferred
31,2012-12-21,Additional,National,Ecuador,Navidad-4,False
33,2012-12-22,Additional,National,Ecuador,Navidad-3,False
34,2012-12-23,Additional,National,Ecuador,Navidad-2,False
36,2012-12-24,Additional,National,Ecuador,Navidad-1,False
38,2012-12-26,Additional,National,Ecuador,Navidad+1,False
84,2013-12-21,Additional,National,Ecuador,Navidad-4,False
85,2013-12-22,Additional,National,Ecuador,Navidad-3,False
87,2013-12-23,Additional,National,Ecuador,Navidad-2,False
88,2013-12-24,Additional,National,Ecuador,Navidad-1,False
90,2013-12-26,Additional,National,Ecuador,Navidad+1,False


In [113]:
holidays[ (holidays.type == 'Additional') & holidays.description.str.contains('Primer dia del ano') ].shape

(5, 6)

In [114]:
holidays[ (holidays.type == 'Additional') & holidays.description.str.contains('Madre') ].shape

(5, 6)

In [None]:
holidays[ (holidays.type == 'Additional') & holidays.description.str.contains('Quito') ]
    # extra one comes from 2012

Unnamed: 0,date,type,locale,locale_name,description,transferred
28,2012-12-05,Additional,Local,Quito,Fundacion de Quito-1,False
81,2013-12-05,Additional,Local,Quito,Fundacion de Quito-1,False
146,2014-12-05,Additional,Local,Quito,Fundacion de Quito-1,False
200,2015-12-05,Additional,Local,Quito,Fundacion de Quito-1,False
286,2016-12-05,Additional,Local,Quito,Fundacion de Quito-1,False
339,2017-12-05,Additional,Local,Quito,Fundacion de Quito-1,False


In [None]:
holidays[ (holidays.type == 'Additional') & holidays.description.str.contains('Guayaquil') ]
    # missing 2015
    # also 2017 Fundacion de Guayaquil holiday (row 322) is erroneously labelled as Additional

Unnamed: 0,date,type,locale,locale_name,description,transferred
64,2013-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
127,2014-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
264,2016-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
321,2017-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
322,2017-07-25,Additional,Local,Guayaquil,Fundacion de Guayaquil,False


In [None]:
holidays[ holidays.description.str.contains('Fundacion de Guayaquil')]
    # The 2015 'Fundacion de Guayaquil-1' (row 182) is labelled Holiday but should be labelled Additional
    # The 2017 'Fundacion de Guayaquil' holiday (row 322) is labelled Additional but should be labelled Holiday
    # The 2016 'Fundacion de Guayaquil' (row 266, a Monday) was given the standard additional the day prior (Sunday)
    # but also got transferred to Sunday, which is redundant.

Unnamed: 0,date,type,locale,locale_name,description,transferred
64,2013-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
65,2013-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
127,2014-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
128,2014-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
182,2015-07-24,Holiday,Local,Guayaquil,Fundacion de Guayaquil-1,False
183,2015-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
264,2016-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
265,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False
266,2016-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,True
321,2017-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
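
The three Guayaquil corrections discussed in this section could be applied along these lines. The sketch works on a copy of the five relevant rows (values copied from the tables above) and assumes the default integer index of `holidays`, so that labels such as 182 refer to the same rows.

```python
import pandas as pd

guayaquil_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2015-07-24", "2016-07-24", "2016-07-24", "2016-07-25", "2017-07-25"]
        ),
        "type": ["Holiday", "Additional", "Transfer", "Holiday", "Additional"],
        "description": [
            "Fundacion de Guayaquil-1",
            "Fundacion de Guayaquil-1",
            "Traslado Fundacion de Guayaquil",
            "Fundacion de Guayaquil",
            "Fundacion de Guayaquil",
        ],
        "transferred": [False, False, False, True, False],
    },
    index=[182, 264, 265, 266, 322],
)

fixed = guayaquil_demo.copy()
fixed.loc[182, "type"] = "Additional"  # 2015 day-before row was labelled Holiday
fixed.loc[322, "type"] = "Holiday"     # 2017 holiday itself was labelled Additional
fixed = fixed.drop(index=264)          # redundant Additional; the transfer row 265 is kept
print(fixed)
```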


#### Locale

- Holidays with `locale`=='Local' are holidays occurring in certain cities.
- Holidays with `locale`=='Regional' are holidays occurring in certain states.
- Holidays with `locale`=='National' are holidays occurring in the whole country (`locale_name`=='Ecuador').
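
Given these three cases, holidays could be attached to stores by matching `locale_name` against `city`, `state`, or every store. The following is a sketch under that assumption; the function `holidays_by_store`, the toy store rows, and the regional example row are hypothetical and only illustrate the matching logic.

```python
import pandas as pd

# Toy stores table (illustrative values).
stores_demo = pd.DataFrame(
    {
        "store_nbr": [1, 2],
        "city": ["Quito", "Guayaquil"],
        "state": ["Pichincha", "Guayas"],
    }
)

# Toy holidays table; the Regional row is invented for illustration.
hol_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(["2012-12-25", "2017-06-01", "2017-07-25", "2017-12-05"]),
        "locale": ["National", "Regional", "Local", "Local"],
        "locale_name": ["Ecuador", "Pichincha", "Guayaquil", "Quito"],
        "description": [
            "Navidad",
            "Example regional holiday",
            "Fundacion de Guayaquil",
            "Fundacion de Quito-1",
        ],
    }
)

def holidays_by_store(holidays_df, stores_df):
    """Return one row per (holiday, store) pair in which the holiday applies to the store."""
    local = holidays_df[holidays_df["locale"] == "Local"].merge(
        stores_df, left_on="locale_name", right_on="city"
    )
    regional = holidays_df[holidays_df["locale"] == "Regional"].merge(
        stores_df, left_on="locale_name", right_on="state"
    )
    # National holidays apply to every store, so pair them with all stores.
    national = holidays_df[holidays_df["locale"] == "National"].merge(stores_df, how="cross")
    cols = ["date", "store_nbr", "locale", "description"]
    combined = pd.concat([local, regional, national])[cols]
    return combined.sort_values(["date", "store_nbr"]).reset_index(drop=True)

store_holidays = holidays_by_store(hol_demo, stores_demo)
print(store_holidays)
```

The resulting table can then be joined to the sales data on `date` and `store_nbr`.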