## Objectives:
* Step 1: prepare the training and evaluation data sets
* Ignore the Air reservations for now (too noisy, except at the end of the data set)
* The approach: train an RNN with the following outputs: the visits, obviously, but also the total of reservations made at D-1, D-3, D-7 and D-15

Tricky: we only have good reservation data in the HPG data set, not in the Air data set, and only a few (150) restaurants are shared between the two:
* we probably have to learn the reservation time series on HPG
* we must train on Air for the visit prediction

So probably 2 neural networks with different objectives:
* One to predict the reservation time series (which could be trained on all Air and HPG restaurants), or maybe this needs to be split in 3: only Air, only HPG, both
* One RNN to predict the visits, taking the reservation predictions as inputs, but trainable only on the Air restaurants

##### Warning: hpg_reserve must be unzipped first

In [2]:
import pandas as pd
import numpy as np
import datetime
import calendar

In [35]:
# Load all files (they must be in an 'input' directory next to the notebook's directory, i.e. ../input)
data_load = {
    'air_reserve': pd.read_csv('../input/air_reserve.csv',parse_dates=['visit_datetime','reserve_datetime']), 
    'hpg_reserve': pd.read_csv('../input/hpg_reserve.csv',parse_dates=['visit_datetime','reserve_datetime']), 
    'air_store': pd.read_csv('../input/air_store_info.csv'),
    'hpg_store': pd.read_csv('../input/hpg_store_info.csv'),
    'air_visit': pd.read_csv('../input/air_visit_data.csv',parse_dates=['visit_date']),
    'store_id': pd.read_csv('../input/store_id_relation.csv'),
    'sample_sub': pd.read_csv('../input/sample_submission.csv'),
    'holiday_dates': pd.read_csv('../input/date_info.csv',parse_dates=['calendar_date']).rename(columns={'calendar_date':'visit_date'})
    }

###  First issue: retrieve the HPG stores that are also in the Air database (to get visits)

#### Get the list of stores in the AIR database, and associate the HPG id

In [36]:
air_list = data_load['air_visit'].groupby(['air_store_id'], as_index=False).first()['air_store_id'].tolist()

In [37]:
print("Number of restaurants in Air database:", len(air_list))

Number of restaurants in Air database: 829


In [38]:
data_load['store_id'].describe()

Unnamed: 0,air_store_id,hpg_store_id
count,150,150
unique,150,150
top,air_63b13c56b7201bd9,hpg_b3de3ba6437e5a11
freq,1,1


The number of restaurants present in both databases is very low: what to do with all the HPG data?
It means we have many more reservation series than visit series. Maybe we should split the learning in two, e.g. predict the reservation time series on one side and the visit time series on the other, the latter fed with the results of the former.
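
For the 150 shared restaurants, one way to reuse the HPG reservations is to translate each `hpg_store_id` into its `air_store_id` through the relation table. The sketch below uses toy frames with hypothetical IDs, not the real files, and only illustrates the merge.

```python
import pandas as pd

# Toy stand-ins for store_id_relation.csv and hpg_reserve.csv (hypothetical IDs)
store_id = pd.DataFrame({
    "air_store_id": ["air_A", "air_B"],
    "hpg_store_id": ["hpg_1", "hpg_2"],
})
hpg_reserve = pd.DataFrame({
    "hpg_store_id": ["hpg_1", "hpg_1", "hpg_3"],
    "reserve_visitors": [2, 4, 5],
})

# An inner merge keeps only the HPG reservations of restaurants that also exist in Air
shared = hpg_reserve.merge(store_id, on="hpg_store_id", how="inner")
per_air_store = shared.groupby("air_store_id")["reserve_visitors"].sum()
```

Restaurants without a counterpart (here `hpg_3`) simply drop out of the inner merge, so the question of what to do with the remaining HPG series stays open.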

#### Second issue: equivalence for the type between both restaurants (TO DO)

### Let's try to work only on the air database for now, to simplify

##### For this, we can try to work with a single RNN, fed with all data at once 
What I want to use for the features:
One line per restaurant and per day with:
* Week day > to convert to a number
* Holiday flag
* Area name > to convert to a number
* Genre name > to convert to a number
* Total of reservations at D-1
* Total of reservations at D-3
* Total of reservations at D-7
* Total of reservations at D-15
These reservation totals let us use the reservations already made as of April 24th. In the future, we may need much smaller intervals.

The labels would be:
* The number of visits
* Total of reservations at D-1
* Total of reservations at D-3
* Total of reservations at D-7
* Total of reservations at D-15
All these predictions are then used for the next step (in addition to the actual data when available)

The final matrix must have:
* along the first dimension: n_samples samples, each sample being a restaurant
* along the second dimension: n_timesteps steps, each step being a day of visit
* along the third dimension: n_features features (week day, genre, ...)

We will then take slices of this matrix along the second dimension to construct our training/evaluation/test sets
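
As a rough illustration of that layout, here is a minimal sketch that turns a long table (one row per restaurant and per day) into a `(n_samples, n_timesteps, n_features)` array. The toy values and the choice of filling missing days with 0 are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

# Toy long-format table: one row per (restaurant, day); values are made up
long_df = pd.DataFrame({
    "air_store_id": ["a", "a", "b", "b"],
    "visit_date": pd.to_datetime(["2017-01-01", "2017-01-02"] * 2),
    "visitors": [3.0, 5.0, 7.0, 9.0],
    "holiday_flg": [1, 0, 1, 0],
})
feature_cols = ["visitors", "holiday_flg"]

stores = sorted(long_df["air_store_id"].unique())
days = pd.date_range(long_df["visit_date"].min(), long_df["visit_date"].max())

# Reindex on the full (store, day) grid so every restaurant has the same number of timesteps
grid = pd.MultiIndex.from_product([stores, days], names=["air_store_id", "visit_date"])
dense = long_df.set_index(["air_store_id", "visit_date"])[feature_cols].reindex(grid).fillna(0)

# Rows are ordered store-major, so a plain reshape gives (n_samples, n_timesteps, n_features)
tensor = dense.to_numpy(dtype=float).reshape(len(stores), len(days), len(feature_cols))
```

Slicing `tensor[:, start:end, :]` then gives a window of days for all restaurants at once.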

In [39]:
Data_visit = data_load['air_visit'].copy()

In [40]:
Data_visit.groupby('air_store_id')['visitors'].sum().reset_index()[0:10]

Unnamed: 0,air_store_id,visitors
0,air_00a91d42b08b08d9,6051
1,air_0164b9927d20bcc3,1378
2,air_0241aa3964b7f861,3919
3,air_0328696196e46f18,921
4,air_034a3d5b40d5b1b1,3722
5,air_036d4f1ee7285390,6310
6,air_0382c794b73b51ad,7059
7,air_03963426c9312048,16877
8,air_04341b588bde96cd,16931
9,air_049f6d5b402a31b2,2975


In [41]:
Data_reserve = data_load['air_reserve'].copy()

# Create the date from date-time columns
Data_reserve['visit_date'] = Data_reserve['visit_datetime'].dt.date
Data_reserve['reserve_date'] = Data_reserve['reserve_datetime'].dt.date

# To test
Data_reserve.groupby('air_store_id')['reserve_visitors'].sum().reset_index().describe(include = 'all')

Unnamed: 0,air_store_id,reserve_visitors
count,314,314.0
unique,314,
top,air_6b15edd1b4fbb96a,
freq,1,
mean,,1318.519108
std,,1593.284506
min,,1.0
25%,,127.0
50%,,808.5
75%,,1762.25


##### We only have 314 restaurants with reservations. Also, some restaurants have reservations for a single day only, and a very old one => they are useless

#### Optional: Let's see examples of how many reservations we get per restaurant and per visit day

In [42]:
# Group the lines per restaurant and per visit date, and sum the number of reserved visitors
Data_reserve.groupby(['air_store_id','visit_date'])['reserve_visitors'].sum().reset_index()[0:20]

Unnamed: 0,air_store_id,visit_date,reserve_visitors
0,air_00a91d42b08b08d9,2016-10-31,2
1,air_00a91d42b08b08d9,2016-12-05,9
2,air_00a91d42b08b08d9,2016-12-14,18
3,air_00a91d42b08b08d9,2016-12-17,2
4,air_00a91d42b08b08d9,2016-12-20,4
5,air_00a91d42b08b08d9,2017-02-18,9
6,air_00a91d42b08b08d9,2017-02-23,12
7,air_00a91d42b08b08d9,2017-03-01,3
8,air_00a91d42b08b08d9,2017-03-14,4
9,air_00a91d42b08b08d9,2017-03-21,3


##### This is bad news: the data are full of gaps, and a lot of visit days are missing. How can we manage that if we have to predict time series?
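
One possible way to handle the gaps is to reindex each restaurant on a continuous calendar and fill the missing days. The sketch below uses toy values and assumes that a missing day means 0 reserved visitors, which is a modelling choice rather than something the data confirms.

```python
import pandas as pd

# Toy reservations for one restaurant with a gap on 2017-01-02 (made-up values)
reserve = pd.DataFrame({
    "air_store_id": ["a", "a"],
    "visit_date": pd.to_datetime(["2017-01-01", "2017-01-03"]),
    "reserve_visitors": [2, 9],
})

def fill_calendar(group):
    # Reindex one restaurant on every day between its first and last visit date
    days = pd.date_range(group["visit_date"].min(), group["visit_date"].max())
    series = group.set_index("visit_date")["reserve_visitors"]
    return series.reindex(days, fill_value=0)

filled = pd.concat(
    {store: fill_calendar(g) for store, g in reserve.groupby("air_store_id")}
).reset_index()
filled.columns = ["air_store_id", "visit_date", "reserve_visitors"]
```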

### Let's calculate the total of reservations done at D-1, D-3, D-7, D-15, D-38 for the training - evaluation set
Check this ref: https://stackoverflow.com/questions/17266129/python-pandas-conditional-sums
and https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

Optional: Here is a step by step example

In [43]:
l1 = Data_reserve.groupby(['air_store_id','visit_date'])

In [44]:
l2 = l1.get_group(('air_fea5dc9594450608',datetime.date(2017, 4, 21)))
l2

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors,visit_date,reserve_date
89482,air_fea5dc9594450608,2017-04-21 18:00:00,2017-04-03 18:00:00,6,2017-04-21,2017-04-03
89642,air_fea5dc9594450608,2017-04-21 19:00:00,2017-04-19 00:00:00,2,2017-04-21,2017-04-19
89643,air_fea5dc9594450608,2017-04-21 19:00:00,2017-04-12 20:00:00,2,2017-04-21,2017-04-12


In [45]:
l2[l2['visit_date']-l2['reserve_date']>datetime.timedelta(3)]

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors,visit_date,reserve_date
89482,air_fea5dc9594450608,2017-04-21 18:00:00,2017-04-03 18:00:00,6,2017-04-21,2017-04-03
89643,air_fea5dc9594450608,2017-04-21 19:00:00,2017-04-12 20:00:00,2,2017-04-21,2017-04-12


In [46]:
l2[l2['visit_date']-l2['reserve_date']>datetime.timedelta(3)]['reserve_visitors'].sum()

8

##### Clearer now! So, to count the reservations at different points before the visit date:

#### The merge is an 'outer' merge, which keeps all samples, even the ones with no visits (those dated after April 23rd)
It can take a while depending on your machine

In [47]:
# Adding the total number of reservations for each (air_store_id, visit_date )
print("Merging for total")
Data_visit = Data_visit.merge(  
                      Data_reserve.groupby(['air_store_id','visit_date']).apply((lambda x: x['reserve_visitors'].sum())).reset_index(name='total_reservation'), 
                      how='outer',
                      on=['air_store_id','visit_date'])

# Adding the number of reservations at D-1
print("Merging for D-1")
Data_visit = Data_visit.merge(  
                      Data_reserve.groupby(['air_store_id','visit_date']).apply((lambda x: x[x['visit_date']-x['reserve_date']>=datetime.timedelta(1)]['reserve_visitors'].sum())).reset_index(name='total_reservation_minus_1'), 
                      how='outer',
                      on=['air_store_id','visit_date'])

# Adding the number of reservations at D-3
print("Merging for D-3")
Data_visit = Data_visit.merge(
                      Data_reserve.groupby(['air_store_id','visit_date']).apply((lambda x: x[x['visit_date']-x['reserve_date']>=datetime.timedelta(3)]['reserve_visitors'].sum())).reset_index(name='total_reservation_minus_3'), 
                      how='outer',
                      on=['air_store_id','visit_date'])

# Adding the number of reservations at D-7
print("Merging for D-7")
Data_visit = Data_visit.merge( 
                      Data_reserve.groupby(['air_store_id','visit_date']).apply((lambda x: x[x['visit_date']-x['reserve_date']>=datetime.timedelta(7)]['reserve_visitors'].sum())).reset_index(name='total_reservation_minus_7'), 
                      how='outer',
                      on=['air_store_id','visit_date'])

# Adding the number of reservations at D-15
print("Merging for D-15")
Data_visit = Data_visit.merge( 
                      Data_reserve.groupby(['air_store_id','visit_date']).apply((lambda x: x[x['visit_date']-x['reserve_date']>=datetime.timedelta(15)]['reserve_visitors'].sum())).reset_index(name='total_reservation_minus_15'), 
                      how='outer',
                      on=['air_store_id','visit_date']) 
            
# Adding the number of reservations at D-38 (this value will always be available in the test set)
# Note: the column is named 'total_reservation_minus_30' although the threshold is 38 days
print("Merging for D-38")
Data_visit = Data_visit.merge(
                      Data_reserve.groupby(['air_store_id','visit_date']).apply((lambda x: x[x['visit_date']-x['reserve_date']>=datetime.timedelta(38)]['reserve_visitors'].sum())).reset_index(name='total_reservation_minus_30'), 
                      how='outer',
                      on=['air_store_id','visit_date'])            

Merging for total
Merging for D-1
Merging for D-3
Merging for D-7
Merging for D-15
Merging for D-38
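
Since the repeated groupby/apply merges can be slow, here is a sketch of a vectorized alternative that computes all horizons in a single groupby. It reuses the three rows of the step-by-step example above; the `minus_k` column names are hypothetical and differ from the ones used in the notebook.

```python
import pandas as pd

# Toy reservations mirroring the three rows of the step-by-step example
res = pd.DataFrame({
    "air_store_id": ["air_fea5dc9594450608"] * 3,
    "visit_date": pd.to_datetime(["2017-04-21"] * 3),
    "reserve_date": pd.to_datetime(["2017-04-03", "2017-04-19", "2017-04-12"]),
    "reserve_visitors": [6, 2, 2],
})

lead_days = (res["visit_date"] - res["reserve_date"]).dt.days

# One masked column per horizon: the party size counts only if booked at least k days ahead
horizons = [1, 3, 7, 15, 38]
for k in horizons:
    res[f"minus_{k}"] = res["reserve_visitors"].where(lead_days >= k, 0)

totals = (
    res.groupby(["air_store_id", "visit_date"])[[f"minus_{k}" for k in horizons]]
    .sum()
    .reset_index()
)
```

The resulting `totals` frame has one row per (restaurant, visit date) and could be merged once instead of six times.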


#### Add the visit and reservation statistics over a reference period (from 2017/01/02 to 2017/03/13) so that the learning can take the history into account

In [48]:
Data_visit.describe(include = 'all')

Unnamed: 0,air_store_id,visit_date,visitors,total_reservation,total_reservation_minus_1,total_reservation_minus_3,total_reservation_minus_7,total_reservation_minus_15,total_reservation_minus_30
count,253874,253874,252108.0,29830.0,29830.0,29830.0,29830.0,29830.0,29830.0
unique,829,517,,,,,,,
top,air_a083834e7ffe187e,2017-03-17 00:00:00,,,,,,,
freq,484,799,,,,,,,
first,,2016-01-01 00:00:00,,,,,,,
last,,2017-05-31 00:00:00,,,,,,,
mean,,,20.973761,13.879149,11.025478,8.449983,5.927489,3.287395,0.883875
std,,,16.757007,23.729264,23.607064,23.141884,22.509064,21.333184,18.930713
min,,,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,,,9.0,4.0,2.0,0.0,0.0,0.0,0.0


In [49]:
# Adding the mean, max and min of the daily visits over the reference period
# (strict bounds below, i.e. 2017-01-02 to 2017-03-13 inclusive)
# pd.Timestamp bounds compare cleanly with the datetime64 visit_date column
ref_start = pd.Timestamp(2017, 1, 1)
ref_end = pd.Timestamp(2017, 3, 14)

D_mean = Data_visit.groupby(['air_store_id']).apply(lambda x: x[(ref_start < x['visit_date']) & (x['visit_date'] < ref_end)]['visitors'].mean()).reset_index(name = 'visit_mean')
D_max = Data_visit.groupby(['air_store_id']).apply(lambda x: x[(ref_start < x['visit_date']) & (x['visit_date'] < ref_end)]['visitors'].max()).reset_index(name = 'visit_max')
D_min = Data_visit.groupby(['air_store_id']).apply(lambda x: x[(ref_start < x['visit_date']) & (x['visit_date'] < ref_end)]['visitors'].min()).reset_index(name = 'visit_min')

Data_visit = Data_visit.merge( 
                      D_mean,
                      how = 'left', 
                      on='air_store_id').merge(D_max,
                                               how = 'left',
                                               on='air_store_id').merge(D_min,
                                                                        how = 'left',
                                                                        on='air_store_id')

# We do not remove the lines with no visit for the reference period
# Data_visit = Data_visit[np.isfinite(Data_visit['visit_mean'])]
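
The three groupby/apply calls of this cell can also be written as a single aggregation. The sketch below shows the idea on toy data with made-up values, assuming `visit_date` is a datetime64 column.

```python
import pandas as pd

# Toy visits (made-up values); only the rows inside the reference window should count
visits = pd.DataFrame({
    "air_store_id": ["a", "a", "a", "b"],
    "visit_date": pd.to_datetime(["2016-12-31", "2017-01-02", "2017-03-13", "2017-02-01"]),
    "visitors": [100.0, 10.0, 20.0, 5.0],
})

start, end = pd.Timestamp("2017-01-02"), pd.Timestamp("2017-03-13")
in_window = visits["visit_date"].between(start, end)  # inclusive on both ends

stats = (
    visits[in_window]
    .groupby("air_store_id")["visitors"]
    .agg(visit_mean="mean", visit_max="max", visit_min="min")
    .reset_index()
)
with_stats = visits.merge(stats, how="left", on="air_store_id")
```

Restaurants with no row inside the window get NaN statistics after the left merge, as with the original approach.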

In [50]:
Data_visit.groupby('air_store_id').get_group('air_c3585b0fba3998d0')

Unnamed: 0,air_store_id,visit_date,visitors,total_reservation,total_reservation_minus_1,total_reservation_minus_3,total_reservation_minus_7,total_reservation_minus_15,total_reservation_minus_30,visit_mean,visit_max,visit_min
64693,air_c3585b0fba3998d0,2016-07-02,24.0,,,,,,,7.548387,35.0,1.0
64694,air_c3585b0fba3998d0,2016-07-03,37.0,,,,,,,7.548387,35.0,1.0
64695,air_c3585b0fba3998d0,2016-07-05,4.0,,,,,,,7.548387,35.0,1.0
64696,air_c3585b0fba3998d0,2016-07-07,3.0,,,,,,,7.548387,35.0,1.0
64697,air_c3585b0fba3998d0,2016-07-08,16.0,,,,,,,7.548387,35.0,1.0
64698,air_c3585b0fba3998d0,2016-07-09,18.0,,,,,,,7.548387,35.0,1.0
64699,air_c3585b0fba3998d0,2016-07-10,9.0,,,,,,,7.548387,35.0,1.0
64700,air_c3585b0fba3998d0,2016-07-11,5.0,,,,,,,7.548387,35.0,1.0
64701,air_c3585b0fba3998d0,2016-07-12,2.0,,,,,,,7.548387,35.0,1.0
64702,air_c3585b0fba3998d0,2016-07-14,3.0,,,,,,,7.548387,35.0,1.0


##### The problem at this stage is that we have (air_store_id, visit_date) pairs with no reservation data or, what is worse, with no visitor data
* Lines with no visitors cannot be used for learning (and are very dubious) => they are removed from the training set
* Lines with no reservation data are more difficult to interpret: it can be a weak day with really no reservation, or it can be that the restaurant never receives reservations through the site (only 314 do). => we will treat them as 0 and, to inform the model, we will add the reservation statistics over the reference period for each restaurant

We also have missing statistics for the visits: some restaurants had no visitors during the reference period!
=> we will also treat that as 0
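
Since a 0 can then mean either "no reservation that day" or "no reservation data at all", one optional complement is to keep a flag before filling. This is only a sketch on toy values; the `has_reservation_data` column is a hypothetical extra feature, not something used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): the second one has no reservation data at all
df = pd.DataFrame({
    "visitors": [12.0, 7.0],
    "total_reservation": [4.0, np.nan],
})

# Hypothetical extra feature: remember which rows were missing before zero-filling
df["has_reservation_data"] = df["total_reservation"].notna().astype(int)
df["total_reservation"] = df["total_reservation"].fillna(0)
```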

In [51]:
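# To add the reservation statistics over the reference period
# pd.Timestamp bounds compare cleanly with the datetime64 visit_date column
min_date = pd.Timestamp(2017, 1, 2)
max_date = pd.Timestamp(2017, 3, 13)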

D_mean = Data_visit.groupby(['air_store_id']).apply(lambda x: x[(min_date <= x['visit_date']) & (x['visit_date']<= max_date)]['total_reservation'].mean()).reset_index(name = 'reservation_mean')
D_max = Data_visit.groupby(['air_store_id']).apply(lambda x: x[(min_date <= x['visit_date']) & (x['visit_date']<= max_date)]['total_reservation'].max()).reset_index(name = 'reservation_max')
D_min = Data_visit.groupby(['air_store_id']).apply(lambda x: x[(min_date <= x['visit_date']) & (x['visit_date']<= max_date)]['total_reservation'].min()).reset_index(name = 'reservation_min')

Data_visit = Data_visit.merge( 
                      D_mean,
                      how = 'left', 
                      on='air_store_id').merge(D_max,
                                               how = 'left',
                                               on='air_store_id').merge(D_min,
                                                                        how = 'left',
                                                                        on='air_store_id')

In [52]:
Data_visit.describe(include='all')

Unnamed: 0,air_store_id,visit_date,visitors,total_reservation,total_reservation_minus_1,total_reservation_minus_3,total_reservation_minus_7,total_reservation_minus_15,total_reservation_minus_30,visit_mean,visit_max,visit_min,reservation_mean,reservation_max,reservation_min
count,253874,253874,252108.0,29830.0,29830.0,29830.0,29830.0,29830.0,29830.0,252812.0,252812.0,252812.0,84478.0,84478.0,84478.0
unique,829,517,,,,,,,,,,,,,
top,air_a083834e7ffe187e,2017-03-17 00:00:00,,,,,,,,,,,,,
freq,484,799,,,,,,,,,,,,,
first,,2016-01-01 00:00:00,,,,,,,,,,,,,
last,,2017-05-31 00:00:00,,,,,,,,,,,,,
mean,,,20.973761,13.879149,11.025478,8.449983,5.927489,3.287395,0.883875,20.075445,50.349212,3.574538,11.579239,38.992602,2.712126
std,,,16.757007,23.729264,23.607064,23.141884,22.509064,21.333184,18.930713,10.799205,43.3413,4.746645,9.589023,113.699372,4.006116
min,,,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.058824,2.0,1.0,1.0,1.0,1.0
25%,,,9.0,4.0,2.0,0.0,0.0,0.0,0.0,11.183333,30.0,1.0,5.76087,15.0,2.0


#### Joins made only at the end of the data preparation, so that the earlier merges manipulate smaller matrices

In [53]:
# Adding the day information
Data_visit = pd.merge( Data_visit, data_load['holiday_dates'], how = 'left', on='visit_date')

# Adding the store information
Data_visit = pd.merge( Data_visit, data_load['air_store'], how = 'left', on='air_store_id')

# Removing the latitude, longitude features
Data_visit = Data_visit.drop(['latitude', 'longitude'], axis = 1)

#### Transform the categorical features

In [54]:
Data_visit_1H = pd.get_dummies(Data_visit, columns=['air_genre_name', 'day_of_week', 'air_area_name' ])

In [55]:
Data_visit_1H.describe(include = 'all')

Unnamed: 0,air_store_id,visit_date,visitors,total_reservation,total_reservation_minus_1,total_reservation_minus_3,total_reservation_minus_7,total_reservation_minus_15,total_reservation_minus_30,visit_mean,...,air_area_name_Ōsaka-fu Sakai-shi Minamikawaramachi,air_area_name_Ōsaka-fu Suita-shi Izumichō,air_area_name_Ōsaka-fu Ōsaka-shi Fuminosato,air_area_name_Ōsaka-fu Ōsaka-shi Kyōmachibori,air_area_name_Ōsaka-fu Ōsaka-shi Kyūtarōmachi,air_area_name_Ōsaka-fu Ōsaka-shi Nakanochō,air_area_name_Ōsaka-fu Ōsaka-shi Nanbasennichimae,air_area_name_Ōsaka-fu Ōsaka-shi Shinmachi,air_area_name_Ōsaka-fu Ōsaka-shi Ōgimachi,air_area_name_Ōsaka-fu Ōsaka-shi Ōhiraki
count,253874,253874,252108.0,29830.0,29830.0,29830.0,29830.0,29830.0,29830.0,252812.0,...,253874.0,253874.0,253874.0,253874.0,253874.0,253874.0,253874.0,253874.0,253874.0,253874.0
unique,829,517,,,,,,,,,...,,,,,,,,,,
top,air_a083834e7ffe187e,2017-03-17 00:00:00,,,,,,,,,...,,,,,,,,,,
freq,484,799,,,,,,,,,...,,,,,,,,,,
first,,2016-01-01 00:00:00,,,,,,,,,...,,,,,,,,,,
last,,2017-05-31 00:00:00,,,,,,,,,...,,,,,,,,,,
mean,,,20.973761,13.879149,11.025478,8.449983,5.927489,3.287395,0.883875,20.075445,...,0.002462,0.001887,0.002478,0.002545,0.024615,0.002277,0.002103,0.011683,0.029905,0.004053
std,,,16.757007,23.729264,23.607064,23.141884,22.509064,21.333184,18.930713,10.799205,...,0.049556,0.043396,0.049714,0.05038,0.154948,0.047661,0.045815,0.107455,0.170324,0.063536
min,,,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.058824,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,,9.0,4.0,2.0,0.0,0.0,0.0,0.0,11.183333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [56]:
import pickle

Data_visit_1H.to_pickle("../../data/Data_visit-20171214")

##### We now have 140 features, which is probably not reasonable, in particular for the genre and area
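
A lighter alternative to one-hot encoding, closer to the "convert to a number" idea from the feature list, is to map each category to an integer code. The sketch below uses illustrative genre and area strings; note that integer codes introduce an arbitrary order, which tree models tolerate better than most other models.

```python
import pandas as pd

# Toy store table; the genre and area strings are only illustrative
stores = pd.DataFrame({
    "air_genre_name": ["Izakaya", "Cafe/Sweets", "Izakaya"],
    "air_area_name": ["Tokyo", "Osaka", "Tokyo"],
})

encoded = stores.copy()
for col in ["air_genre_name", "air_area_name"]:
    # factorize assigns 0, 1, 2, ... in order of first appearance
    encoded[col + "_code"], _ = pd.factorize(stores[col])

encoded = encoded.drop(columns=["air_genre_name", "air_area_name"])
```

This keeps one column per categorical feature instead of one column per category.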

## Let's try a simple decision tree only on those data

### Create the training and evaluation sets based on the period

In [3]:
Data_visit_1H = pd.read_pickle("../../data/Data_visit-20171214")

In [4]:
# pd.Timestamp bounds compare cleanly with the datetime64 visit_date column
learning_end_date = pd.Timestamp(2017, 3, 13)
evaluation_end_date = pd.Timestamp(2017, 4, 22)

# Create the test set
A = Data_visit_1H[Data_visit_1H['visit_date'] > evaluation_end_date]

X_test = A.drop(['visitors', 'air_store_id', 'visit_date'], axis = 1).fillna(0).to_numpy(dtype=float)
Id_test = A[['air_store_id', 'visit_date']]

# TO DO: complete this test set with the missing restaurants and visit dates

# Create the learning set

A = Data_visit_1H[Data_visit_1H['visit_date'] < learning_end_date]
# Remove the lines with missing visits values
A = A[np.isfinite(A['visitors'])]
X_learn = A.drop(['visitors', 'air_store_id', 'visit_date'], axis = 1).fillna(0).to_numpy(dtype=float)
# Keep the targets 1-D so that they line up with the output of predict()
y_learn = A['visitors'].to_numpy()
Id_learn = A[['air_store_id', 'visit_date']]

# Create the evaluation set

A = Data_visit_1H[(Data_visit_1H['visit_date'] >= learning_end_date) & (Data_visit_1H['visit_date'] <= evaluation_end_date)]
# Remove the lines with missing visits values
A = A[np.isfinite(A['visitors'])]
X_eval = A.drop(['visitors', 'air_store_id', 'visit_date'], axis = 1).fillna(0).to_numpy(dtype=float)
y_eval = A['visitors'].to_numpy()
Id_eval = A[['air_store_id', 'visit_date']]



In [5]:
import numpy as np

def rmsle(h, y): 
    """
    Compute the Root Mean Squared Log Error for hypothesis h and targets y
    
    Args:
        h - numpy array containing predictions with shape (n_samples, n_targets)
        y - numpy array containing targets with shape (n_samples, n_targets)
    """
    return np.sqrt(np.square(np.log(h + 1) - np.log(y + 1)).mean())
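
A point worth checking before trusting this metric: `h` and `y` must have the same shape. The sketch below restates the same formula with `np.log1p` and shows, on toy arrays, what happens when a 1-D prediction meets a column vector such as the one obtained from a one-column DataFrame.

```python
import numpy as np

def rmsle(h, y):
    # Same formula as above, written with log1p
    return np.sqrt(np.square(np.log1p(h) - np.log1p(y)).mean())

pred = np.array([1.0, 3.0, 7.0])              # shape (3,), like the output of predict()
target_col = np.array([[1.0], [3.0], [7.0]])  # shape (3, 1), a column vector

# Broadcasting (3,) against (3, 1) silently builds a 3x3 matrix of differences
wrong = rmsle(pred, target_col)
# Flattening the targets restores the element-wise comparison
right = rmsle(pred, target_col.ravel())
```

Here the predictions are perfect, so `right` is zero while `wrong` is not, which makes the shape mismatch easy to spot.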

In [None]:
from sklearn.tree import DecisionTreeRegressor

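# 'absolute_error' is the name used by scikit-learn >= 1.0 for the criterion formerly called 'mae'
tree_reg = DecisionTreeRegressor(criterion='absolute_error', max_depth=20)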
tree_reg.fit(X_learn, y_learn)

In [None]:
y_eval_pred = tree_reg.predict(X_eval)

In [None]:
rmsle(y_eval_pred, y_eval)