In [2]:
import numpy as np
import pandas as pd

In [3]:
# Import the data
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
stores = pd.read_csv('../data/stores.csv')
transactions = pd.read_csv('../data/transactions.csv')
oil = pd.read_csv('../data/oil.csv')
holidays = pd.read_csv('../data/holidays_events.csv')

# Dictionary for all datasets
datasets = {'train':train, 'test':test, 'stores':stores, 'transactions':transactions, 'oil':oil, 'holidays':holidays}

# Cleaning

We convert any dates in the datasets to pandas Timestamps.

In [4]:
for df in datasets.values():
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])

We drop the `id` column from the training set. The preliminary analysis showed that it is just a redundant row index.

In [5]:
train = train.drop('id',axis=1)

The preliminary analysis shows 43 missing daily oil price values. In addition, the oil dataset has no rows for weekend days.

We first add the weekend days as rows with null prices, then fill all null values by linear interpolation.

The oil price for the very first day (2013-01-01) is also missing, and interpolation cannot fill a leading gap. We set it manually to the average of the oil prices for 2012-12-31 and 2013-01-02, taken from https://fred.stlouisfed.org/data/DCOILWTICO. We separately verified that this is safe to do, since these oil prices match the ones in the oil dataset.

In [7]:
# Manually set the oil price for the first day (2013-01-01) to the
# average of the 2012-12-31 and 2013-01-02 oil prices
oil.iloc[0, 1] = 92.485

In [8]:
# Create DataFrame with all dates in desired range, including weekends
dates = pd.DataFrame(pd.date_range(start='2013-01-01', end='2017-08-31', freq='D'), columns=['date'])
# Merge with oil data set, so that weekend dates are added to oil with null values
oil = dates.merge(oil, how='left', on='date')
# Interpolate all missing values in oil (all but possibly one of the gaps are of size 1, 2, or 3)
oil['dcoilwtico'] = oil['dcoilwtico'].interpolate()
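
To make the merge-then-interpolate step concrete, here is a minimal sketch on made-up prices. The values and the names `toy`, `all_days`, and `filled` are illustrative assumptions, not taken from the oil dataset. A Friday and the following Monday are present, the weekend is missing, and linear interpolation fills the two weekend days with evenly spaced values.

```python
import pandas as pd

# Made-up prices: Friday, Monday, Tuesday (no weekend rows)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-04', '2013-01-07', '2013-01-08']),
    'dcoilwtico': [93.0, 96.0, 97.0],
})

# Every calendar day in the range, including Saturday and Sunday
all_days = pd.DataFrame({'date': pd.date_range('2013-01-04', '2013-01-08', freq='D')})

# The left merge adds the weekend rows with null prices
filled = all_days.merge(toy, how='left', on='date')

# Default linear interpolation fills the weekend with 94.0 and 95.0
filled['dcoilwtico'] = filled['dcoilwtico'].interpolate()
print(filled)
```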

# Merging

We first inner join the store and transaction data along the `store_nbr` column.

Because `stores` has one row per `store_nbr`, this introduces no duplicate rows or new null values.

In [9]:
# Merge stores with transactions on store_nbr
X = stores.merge(transactions, how='inner', on='store_nbr')
X = X.sort_values(by=['date','store_nbr'],axis=0).reset_index(drop=True)
X = X[['date','store_nbr','type','cluster','city','state','transactions']]
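
One way to have pandas enforce the no-duplicates expectation, rather than rely on it, is the `validate` argument of `merge`. The sketch below uses small made-up frames (`toy_stores` and `toy_tx` are hypothetical stand-ins for `stores` and `transactions`). With `validate='one_to_many'`, pandas raises a `MergeError` if `store_nbr` is not unique on the left side.

```python
import pandas as pd

# Hypothetical stand-ins for the stores and transactions tables
toy_stores = pd.DataFrame({'store_nbr': [1, 2], 'city': ['Quito', 'Manta']})
toy_tx = pd.DataFrame({
    'store_nbr': [1, 1, 2],
    'date': pd.to_datetime(['2013-01-02', '2013-01-03', '2013-01-02']),
    'transactions': [100, 120, 80],
})

# validate='one_to_many' checks that store_nbr is unique in toy_stores
merged = toy_stores.merge(toy_tx, how='inner', on='store_nbr', validate='one_to_many')

# Each transaction row should appear exactly once, with no nulls introduced
rows_preserved = len(merged) == len(toy_tx)
no_new_nulls = not merged.isna().any().any()
print(rows_preserved, no_new_nulls)
```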

Before merging more data, we break down the `date` column into year, month, week, day, and day of week.

In [10]:
# Extract the date parts with the vectorized .dt accessor
X = X.assign(
    year=X['date'].dt.year,
    month=X['date'].dt.month,
    week_number=X['date'].dt.isocalendar().week.astype(int),  # ISO week number
    day=X['date'].dt.day,
    day_of_week=X['date'].dt.dayofweek,  # Monday = 0
)
X = X[['date','year', 'month', 'week_number', 'day', 'day_of_week','store_nbr','type','cluster','city','state','transactions']]
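
As a quick sanity check of these date parts, the sketch below computes them for a single timestamp. pandas counts `dayofweek` from Monday = 0, so 2013-01-01 (a Tuesday) gives 1. The week number is read through the standard `isocalendar()` method, which returns the ISO week; the `parts` dictionary is only an illustrative name.

```python
import pandas as pd

ts = pd.Timestamp('2013-01-01')

# The same five date parts as in the cell above, for one date
parts = {
    'year': ts.year,
    'month': ts.month,
    'week_number': ts.isocalendar()[1],  # ISO week number
    'day': ts.day,
    'day_of_week': ts.dayofweek,         # Monday = 0, so Tuesday = 1
}
print(parts)
```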

Next, we left join oil prices onto `X` along the `date` column.

Since there is one oil price per date and we filled null values in oil, this won't give duplicates or new null values.

In [11]:
X = X.merge(oil, how='left', on='date')
X = X.sort_values(by=['date','store_nbr'],axis=0).reset_index(drop=True)
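
The join type only matters when some date in `X` has no matching row in `oil`. A minimal sketch with made-up frames (`toy_X` and `toy_oil` are hypothetical) shows the difference: a left join keeps the unmatched row with a null price, while an inner join drops it.

```python
import pandas as pd

# Hypothetical frames: toy_oil has no price for 2013-01-03
toy_X = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']),
    'store_nbr': [25, 25, 25],
})
toy_oil = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01', '2013-01-02']),
    'dcoilwtico': [92.0, 93.0],
})

left = toy_X.merge(toy_oil, how='left', on='date')    # keeps all 3 rows
inner = toy_X.merge(toy_oil, how='inner', on='date')  # keeps only matched rows

# Number of rows left without a price after the left join
missing_after_left = int(left['dcoilwtico'].isna().sum())
print(len(left), len(inner), missing_after_left)
```

Since the oil table was extended to every calendar day in the range and interpolated, both join types should return the same rows in this notebook.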

Here we will join the holiday data. This step is not implemented yet.

In [352]:
# to be determined

Finally, we inner join our training data along both `date` and `store_nbr`.

In [12]:
X = X.merge(train, how='inner', on=['date','store_nbr'])

In [354]:
X.to_csv("../data/combined.csv", index=False)

In [15]:
X

Unnamed: 0,date,year,month,week_number,day,day_of_week,store_nbr,type,cluster,city,state,transactions,dcoilwtico,family,sales,onpromotion
0,2013-01-01,2013,1,1,1,1,25,D,1,Salinas,Santa Elena,770,92.485,AUTOMOTIVE,0.000,0
1,2013-01-01,2013,1,1,1,1,25,D,1,Salinas,Santa Elena,770,92.485,BABY CARE,0.000,0
2,2013-01-01,2013,1,1,1,1,25,D,1,Salinas,Santa Elena,770,92.485,BEAUTY,2.000,0
3,2013-01-01,2013,1,1,1,1,25,D,1,Salinas,Santa Elena,770,92.485,BEVERAGES,810.000,0
4,2013-01-01,2013,1,1,1,1,25,D,1,Salinas,Santa Elena,770,92.485,BOOKS,0.000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2755099,2017-08-15,2017,8,33,15,1,54,C,3,El Carmen,Manabi,802,47.570,POULTRY,59.619,0
2755100,2017-08-15,2017,8,33,15,1,54,C,3,El Carmen,Manabi,802,47.570,PREPARED FOODS,94.000,0
2755101,2017-08-15,2017,8,33,15,1,54,C,3,El Carmen,Manabi,802,47.570,PRODUCE,915.371,76
2755102,2017-08-15,2017,8,33,15,1,54,C,3,El Carmen,Manabi,802,47.570,SCHOOL AND OFFICE SUPPLIES,0.000,0


# Holiday

Add the holiday data into the dataset. 

- Dongyu 


## Type of holidays 
Depending on the holiday type, a particular store on a particular day falls into one of the following cases:

- holiday, celebrated
- holiday, not celebrated
- not holiday, celebrated 
- not holiday, not celebrated
- event


We don't know whether a holiday being a day off affects sales, so I'll also add "whether it's celebrated (day off)" as a feature.
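
A minimal sketch of such a "celebrated (day off)" flag, using three rows from the holiday table. The rule itself is an assumption, not something the data documents: a row counts as a day off when its type is Holiday, Transfer, Additional, or Bridge and it has not been transferred to another date. The `DAY_OFF_TYPES` set and the `is_day_off` column name are hypothetical.

```python
import pandas as pd

# Three rows taken from the national holiday listing
sample = pd.DataFrame({
    'date': pd.to_datetime(['2012-10-09', '2012-10-12', '2012-11-02']),
    'type': ['Holiday', 'Transfer', 'Holiday'],
    'description': [
        'Independencia de Guayaquil',
        'Traslado Independencia de Guayaquil',
        'Dia de Difuntos',
    ],
    'transferred': [True, False, False],
})

# Assumed rule: these types mean a day off unless the row was transferred
DAY_OFF_TYPES = {'Holiday', 'Transfer', 'Additional', 'Bridge'}
sample['is_day_off'] = sample['type'].isin(DAY_OFF_TYPES) & ~sample['transferred']
print(sample[['date', 'type', 'transferred', 'is_day_off']])
```

Under this rule the original 2012-10-09 date is not a day off, while the 2012-10-12 Transfer row is.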

## Location and description 
There are three locale levels: local, regional, and national. Based on the locale, we can determine whether a particular store celebrates the holiday.

Different holidays will likely have different shopping cultures, so holidays should be considered separately.

So at first glance, I will add four columns:
- Local holiday (0 is none, 1 - 27 are the different holiday descriptions.)
- Regional holiday (0 is none, 1 - 4 are four different holidays.)
- National holiday (0 is none, 1 - 29 are different holidays.)
- National events (0 is none, 1 - 43 are different events.)
  - Events are not recurring, hence they should be labeled separately.
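
A minimal sketch of this integer encoding, assuming `pd.factorize` is an acceptable way to number the descriptions. The `calendar` frame, the column names `local_holiday` and `national_holiday`, and the non-holiday date 2012-03-01 are hypothetical. For brevity the sketch ignores `locale_name`, so it does not yet match a local holiday to the stores in that city.

```python
import pandas as pd

# A few rows from the holiday table
toy_holidays = pd.DataFrame({
    'date': pd.to_datetime(['2012-03-02', '2012-04-12', '2012-08-10']),
    'locale': ['Local', 'Local', 'National'],
    'description': ['Fundacion de Manta', 'Fundacion de Cuenca',
                    'Primer Grito de Independencia'],
})

# Hypothetical calendar of dates to label; 2012-03-01 has no holiday
calendar = pd.DataFrame({
    'date': pd.to_datetime(['2012-03-01', '2012-03-02', '2012-04-12', '2012-08-10']),
})

for locale, column in [('Local', 'local_holiday'), ('National', 'national_holiday')]:
    subset = toy_holidays.loc[toy_holidays['locale'] == locale, ['date', 'description']]
    joined = calendar[['date']].merge(subset, how='left', on='date')
    # factorize gives -1 for missing values, so adding 1 makes "no holiday" equal 0
    codes, _ = pd.factorize(joined['description'])
    calendar[column] = codes + 1

print(calendar)
```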

One problem: 2012-12-31 is both "Puente Primer dia del ano" (Bridge) and "Primer dia del ano-1" (Additional).
It turned out that "Puente Primer dia del ano" means "the bridge day for New Year's Day", while "Primer dia del ano-1" is the day before New Year's Day, i.e. New Year's Eve.

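Other overlaps are rare, but they do occur: in the national listing below, 2012-12-24 is both "Puente Navidad" (Bridge) and "Navidad-1" (Additional).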

Notice that "Bridge", "Work Day", and "Transfer" days change every year, depending on the day of the week on which the holiday falls,
while "Additional" days are always tied to the holiday itself.

# Different holidays 

Not every holiday falls on the same date every year.

In [58]:
holidays[holidays.locale == 'National']

Unnamed: 0,date,type,locale,locale_name,description,transferred
14,2012-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,False
19,2012-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
21,2012-11-02,Holiday,National,Ecuador,Dia de Difuntos,False
22,2012-11-03,Holiday,National,Ecuador,Independencia de Cuenca,False
31,2012-12-21,Additional,National,Ecuador,Navidad-4,False
33,2012-12-22,Additional,National,Ecuador,Navidad-3,False
34,2012-12-23,Additional,National,Ecuador,Navidad-2,False
35,2012-12-24,Bridge,National,Ecuador,Puente Navidad,False
36,2012-12-24,Additional,National,Ecuador,Navidad-1,False


In [52]:
pd.set_option('display.max_rows', None)
holidays[holidays.locale == 'Local']

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
5,2012-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
6,2012-06-23,Holiday,Local,Guaranda,Cantonizacion de Guaranda,False
8,2012-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
9,2012-06-25,Holiday,Local,Machala,Fundacion de Machala,False
10,2012-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
11,2012-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False


In [13]:
holidays

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
...,...,...,...,...,...,...
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
346,2017-12-23,Additional,National,Ecuador,Navidad-2,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
348,2017-12-25,Holiday,National,Ecuador,Navidad,False


In [22]:
holidays[holidays.type == 'Transfer']

Unnamed: 0,date,type,locale,locale_name,description,transferred
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
73,2013-10-11,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
136,2014-10-10,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
256,2016-05-27,Transfer,National,Ecuador,Traslado Batalla de Pichincha,False
265,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False
269,2016-08-12,Transfer,National,Ecuador,Traslado Primer Grito de Independencia,False
298,2017-01-02,Transfer,National,Ecuador,Traslado Primer dia del ano,False
304,2017-04-13,Transfer,Local,Cuenca,Fundacion de Cuenca,False
313,2017-05-26,Transfer,National,Ecuador,Traslado Batalla de Pichincha,False
325,2017-08-11,Transfer,National,Ecuador,Traslado Primer Grito de Independencia,False
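
The Transfer rows make the point concrete. As a sketch, one could count the distinct month-day combinations per description: a count above 1 means the date moves between years. The rows below are copied from the tables above, and the names `sample` and `moves_between_years` are hypothetical.

```python
import pandas as pd

# Rows copied from the holiday tables above
sample = pd.DataFrame({
    'date': pd.to_datetime(['2012-10-12', '2013-10-11', '2014-10-10',
                            '2012-12-24', '2017-12-24']),
    'description': ['Traslado Independencia de Guayaquil'] * 3 + ['Navidad-1'] * 2,
})

# Month-day string per row, e.g. '10-12'
month_day = sample['date'].dt.strftime('%m-%d')

# Number of distinct month-day values per holiday description
distinct_days = month_day.groupby(sample['description']).nunique()
moves_between_years = distinct_days > 1
print(moves_between_years)
```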
