## Objectives

* Make the most of the data we have from the HPG site to predict the total number of reservations for each restaurant for which we have reservations up to April 23rd
* Use this prediction to feed the primary neural network
* Improve the final score by providing a relevant prediction for the second, third and later weeks of the test set

The distinctive feature of this prediction task is that we train on complete reservation histories and then try to predict the final totals from partial ones

### Hypotheses

* We can predict the total reservations for a specific day from the reservations already made 1, 2, ..., 39 days before that day
* We need no data beyond the reservations per store id and date (we may revise this by adding calendar data)
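
To make the first hypothesis concrete, here is a toy example with made-up numbers (not taken from the data set) for a single store and visit date. The names `bookings` and `known_at` are purely illustrative.

```python
import numpy as np

# Hypothetical bookings for one store and one visit date:
# (days between reservation and visit, number of visitors)
bookings = [(12, 4), (5, 2), (0, 3)]

def known_at(lead_days, bookings):
    """Total visitors already reserved `lead_days` days before the visit."""
    return sum(v for d, v in bookings if d >= lead_days)

# Partial totals seen 39, 38, ..., 1, 0 days before the visit
series = np.array([known_at(d, bookings) for d in range(39, -1, -1)])

print(series[-8:])   # what is known during the last week
print(series[-1])    # final total, the value we want to predict
```

The series never decreases, and the task is to guess its last value from an earlier one.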

### Create the data set

In [1]:
import pandas as pd
import numpy as np
import datetime
import calendar
from random import randint

# Load all files (they must be in the input directory, a sibling of the notebook's directory)
data_load = {
    'air_reserve': pd.read_csv('../input/air_reserve.csv',parse_dates=['visit_datetime','reserve_datetime']), 
    'hpg_reserve': pd.read_csv('../input/hpg_reserve.csv',parse_dates=['visit_datetime','reserve_datetime']), 
    'air_store': pd.read_csv('../input/air_store_info.csv'),
    'hpg_store': pd.read_csv('../input/hpg_store_info.csv'),
    'air_visit': pd.read_csv('../input/air_visit_data.csv',parse_dates=['visit_date']),
    'store_id': pd.read_csv('../input/store_id_relation.csv'),
    'sample_sub': pd.read_csv('../input/sample_submission.csv'),
    'holiday_dates': pd.read_csv('../input/date_info.csv',parse_dates=['calendar_date']).rename(columns={'calendar_date':'visit_date'})
    }

In [2]:
# Air and HPG sites reservations data
# partially from https://www.kaggle.com/zeemeen/weighted-mean-comparisons-lb-0-497-1st/code

Data = {}

Data['reserve_air'] = data_load['air_reserve'].copy()
Data['reserve_hpg'] = data_load['hpg_reserve'].copy()

#Data['reserve_hpg'] = pd.merge(Data['reserve_hpg'], data_load['store_id'], how='left', on=['hpg_store_id'])

for site in ['air', 'hpg']:

    df = 'reserve_'+site
    
    # Convert to date
    Data[df]['visit_date'] = Data[df]['visit_datetime'].apply(lambda d: d.date())
    Data[df]['reserve_date'] = Data[df]['reserve_datetime'].apply(lambda d: d.date())
        
    # Calculate the total of reservations per day for restaurants with reservations, for each site 
    Data[df] = Data[df].groupby([site + '_store_id','reserve_date','visit_date'],
                                as_index=False)\
                                    .reserve_visitors.sum()\
                                    .rename(columns={'reserve_visitors':'res_store_date_partial_sum'})
           
    # Add a column for the difference between visit and reservation date
    Data[df]['reserve_date_diff'] = Data[df].apply(lambda r: (r['visit_date'] - r['reserve_date']).days, axis = 1)
    Data[df] = Data[df].drop('reserve_date',axis = 1)
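
The row-wise `apply` calls above can be slow on the full HPG file. A possible vectorized variant is sketched below on a toy frame; it keeps `datetime64` values instead of Python `date` objects, so later comparisons with `datetime.date` would need adapting. The frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the reservation files
toy = pd.DataFrame({
    'visit_datetime': pd.to_datetime(['2017-04-25 19:00', '2017-04-26 12:00']),
    'reserve_datetime': pd.to_datetime(['2017-04-20 23:30', '2017-04-26 09:00']),
})

# Truncate both timestamps to midnight, then subtract: the result is a
# whole number of calendar days, as with the row-wise lambda
visit_day = toy['visit_datetime'].dt.normalize()
reserve_day = toy['reserve_datetime'].dt.normalize()
toy['reserve_date_diff'] = (visit_day - reserve_day).dt.days
```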
    

In [6]:
for site in ['air', 'hpg']:
    
    df = 'reserve_'+site
    df_res_series = site + '_res_series'

    Data[df_res_series] = Data[df].groupby([site + '_store_id', 'visit_date'], as_index = False)\
                                                    .res_store_date_partial_sum.sum()
    
    # WARNING: at this stage, we only have the (store, visit_date) pairs with non-zero reservations,
    # which biases the training set
    
    A = Data[df]
    
    # Add a column for each step, with the sum of the reservations at that date
    for i in range(50,-1,-1): # From 50 to 0
        #print('Processing column ', i) 
        Data[df_res_series] = Data[df_res_series].merge(
            A[A['reserve_date_diff']>=i]\
                        .groupby([site + '_store_id','visit_date'], as_index = False)\
                        .res_store_date_partial_sum.sum()\
                        .rename(columns={'res_store_date_partial_sum': 'res_store_date_partial_sum-'+ str(i)}),
            how = 'left',
            on = [site + '_store_id','visit_date'])

    # Add a column for site identification (optional)
    #for site2 in ['air', 'hpg']:
    #    if (site2==site):
    #        Data[df_res_series]['is'+site2] = 1  
    #    else:
    #        Data[df_res_series]['is'+site2] = 0
        
    Data[df_res_series] = Data[df_res_series]\
        .drop('res_store_date_partial_sum', axis = 1)\
        .rename(columns =  {site+'_store_id': 'store_id'})\
        .fillna(0) # Otherwise, columns with no reservation yet at a given step for a given visit date would hold NaN
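
The 51 successive merges above can also be written as one pivot followed by a cumulative sum. The sketch below runs on a toy frame with a generic `store_id` column; the names `toy`, `lead` and `wide` are assumptions, and I have not benchmarked it against the loop.

```python
import pandas as pd

# Toy version of the per-reservation frame built in the first cell
toy = pd.DataFrame({
    'store_id': ['a', 'a', 'a', 'b'],
    'visit_date': ['2017-01-10'] * 3 + ['2017-01-11'],
    'reserve_date_diff': [60, 3, 0, 1],
    'res_store_date_partial_sum': [2, 5, 1, 4],
})

max_lead = 50
# Reservations made more than max_lead days ahead count from column -50 on
toy['lead'] = toy['reserve_date_diff'].clip(upper=max_lead)

wide = (toy.pivot_table(index=['store_id', 'visit_date'], columns='lead',
                        values='res_store_date_partial_sum',
                        aggfunc='sum', fill_value=0)
           .reindex(columns=range(max_lead, -1, -1), fill_value=0)
           .cumsum(axis=1))   # running total from D-50 down to D-0
wide.columns = ['res_store_date_partial_sum-' + str(i) for i in wide.columns]
wide = wide.reset_index()
```

Each column `-i` then holds the sum of the reservations made at least `i` days ahead, which is what the loop computes with `reserve_date_diff >= i`.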

In [7]:
Data[df_res_series]

Unnamed: 0,store_id,visit_date,res_store_date_partial_sum-50,res_store_date_partial_sum-49,res_store_date_partial_sum-48,res_store_date_partial_sum-47,res_store_date_partial_sum-46,res_store_date_partial_sum-45,res_store_date_partial_sum-44,res_store_date_partial_sum-43,...,res_store_date_partial_sum-9,res_store_date_partial_sum-8,res_store_date_partial_sum-7,res_store_date_partial_sum-6,res_store_date_partial_sum-5,res_store_date_partial_sum-4,res_store_date_partial_sum-3,res_store_date_partial_sum-2,res_store_date_partial_sum-1,res_store_date_partial_sum-0
0,hpg_001112ef76b9802c,2016-02-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9
1,hpg_001112ef76b9802c,2016-03-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3
2,hpg_001112ef76b9802c,2016-03-31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,5.0,5.0,5.0,5.0,5.0,5
3,hpg_001112ef76b9802c,2016-04-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,13.0,13.0,13.0,13.0,13
4,hpg_001112ef76b9802c,2016-04-18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,9.0,9.0,9.0,9
5,hpg_001112ef76b9802c,2016-07-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3
6,hpg_001112ef76b9802c,2016-07-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,5.0,5.0,5.0,5.0,5.0,5
7,hpg_001112ef76b9802c,2016-07-21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3
8,hpg_001112ef76b9802c,2016-07-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3
9,hpg_001112ef76b9802c,2016-07-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3


#### Create the training and test sets; we disregard the HPG values after April 22nd

In [31]:
threshold_date = datetime.date(2017,4, 22)

Data['train'] = pd.concat(
    [Data['air_res_series'][Data['air_res_series']['visit_date'] <= threshold_date],
    Data['hpg_res_series'][Data['hpg_res_series']['visit_date'] <= threshold_date]],
    axis = 0, 
    ignore_index = True)

# Keep only the rows where the reservations are non-zero at D-7 (to avoid distorting the set, see the note at the bottom)
Data['train'] = Data['train'][Data['train']['res_store_date_partial_sum-7']>0]

Data['test'] = []
Data['test'] = Data['air_res_series'][Data['air_res_series']['visit_date'] > threshold_date].reset_index()

# For each test row, record the last step for which reservations are known (a negative day offset)

Data['test'].loc[:,'relevant_last_step'] = Data['test']['visit_date'].apply(lambda d: (threshold_date-d).days)
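
As a quick check of the sign convention, the offset can be computed by hand for two of the visit dates that appear in the test set. The helper name `relevant_last_step` is only illustrative.

```python
import datetime

threshold_date = datetime.date(2017, 4, 22)

def relevant_last_step(visit_date, threshold=threshold_date):
    """Negative number of days between the cutoff and the visit date."""
    return (threshold - visit_date).days

print(relevant_last_step(datetime.date(2017, 4, 25)))
print(relevant_last_step(datetime.date(2017, 5, 13)))
```

A visit three days after the cutoff gets `-3`: the columns up to `res_store_date_partial_sum-3` are known, the later ones are not.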

In [32]:
Data['train']

Unnamed: 0,store_id,visit_date,res_store_date_partial_sum-50,res_store_date_partial_sum-49,res_store_date_partial_sum-48,res_store_date_partial_sum-47,res_store_date_partial_sum-46,res_store_date_partial_sum-45,res_store_date_partial_sum-44,res_store_date_partial_sum-43,...,res_store_date_partial_sum-9,res_store_date_partial_sum-8,res_store_date_partial_sum-7,res_store_date_partial_sum-6,res_store_date_partial_sum-5,res_store_date_partial_sum-4,res_store_date_partial_sum-3,res_store_date_partial_sum-2,res_store_date_partial_sum-1,res_store_date_partial_sum-0
7,air_00a91d42b08b08d9,2017-03-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3
12,air_0164b9927d20bcc3,2016-10-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,2.0,2.0,2.0,12.0,12.0,12.0,12
15,air_0164b9927d20bcc3,2016-11-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9
16,air_0164b9927d20bcc3,2016-11-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6
18,air_0164b9927d20bcc3,2016-11-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,12.0,12.0,12.0,12.0,14.0,14.0,14.0,14.0,17.0,17
21,air_0164b9927d20bcc3,2016-11-18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,13
23,air_0164b9927d20bcc3,2016-11-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9
24,air_0164b9927d20bcc3,2016-11-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2
25,air_0164b9927d20bcc3,2016-11-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10
28,air_0164b9927d20bcc3,2016-12-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,4.0,4


In [33]:
from sklearn.model_selection import train_test_split

X = Data['train'].drop(['store_id', 'visit_date', 'res_store_date_partial_sum-0'], axis = 1).values
y = Data['train'][['res_store_date_partial_sum-0']].values

X_learn, X_eval, y_learn, y_eval = train_test_split(X, y, test_size=0.25, random_state=42)

### Create the RNN

In [11]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

import tensorflow as tf 
from tensorflow.contrib import rnn
from tensorflow.python.ops import variable_scope
from tensorflow.python.framework import dtypes

import datetime

# Common imports
import numpy as np
import os    

In [42]:
# Graph construction, starting from 
# https://github.com/aaxwaz/Multivariate-Time-Series-forecast-using-seq2seq-in-TensorFlow/blob/master/build_model_multi_variate.py
tf.reset_default_graph()

# Number of time steps (days)
n_steps = 50

# Number of neurons per cell
n_hidden = 32

n_stacked_layers = 1 

# Gradient clipping, to avoid exploding gradients
GRADIENT_CLIPPING = 5.0 

#dropout_keep_proba = 0.5

learning_rate = 0.005
lambda_l2_reg = 0.0

def build_graph(relevant_step = -1,
               n_inputs = 1,
               n_outputs = 1):
    
    if n_outputs > n_inputs:
        print('Error: the number of input features must be at least the number of output features')
        return 0
    
    # For the logs
    now = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    root_logdir = "../../tf_logs"
    logdir = "{}/run-{}/".format(root_logdir, now)
    
    tf.reset_default_graph()
    
    global_step = tf.Variable(
                  initial_value=0,
                  name="global_step",
                  trainable=False,
                  collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES])
    
    # Weights and biases for the fully connected output layer (shared across all steps)
    weights = {
        'out': tf.get_variable('Weights_out', \
                               shape = [n_hidden, n_outputs], \
                               dtype = tf.float32, \
                               initializer = tf.truncated_normal_initializer()),
    }
    biases = {
        'out': tf.get_variable('Biases_out', \
                               shape = [n_outputs], \
                               dtype = tf.float32, \
                               initializer = tf.constant_initializer(0.)),
    }
    
    # If we want to use dropout (unsuccessful so far)
    #input_keep_prob = tf.placeholder_with_default(tf.constant(1.0, dtype=tf.float32), ())
    #output_keep_prob = tf.placeholder_with_default(tf.constant(1.0, dtype=tf.float32), ())
    #state_keep_prob = tf.placeholder_with_default(tf.constant(1.0, dtype=tf.float32), ())
    
    with tf.variable_scope('Seq2vec'):
        # Sequence inputs
        seq_inp = [
            tf.placeholder(tf.float32, shape=(None, n_inputs), name="inp_{}".format(t))
               for t in range(n_steps)
        ]

        # Vector output
        y = tf.placeholder(tf.float32, shape=(None, n_outputs), name="y")

        with tf.variable_scope('LSTMCell'): 
                   
#            def base_cell():
#                  return tf.contrib.rnn.LayerNormBasicLSTMCell(num_units=n_hidden, 
#                                                                activation = tf.nn.relu) 
            def base_cell():
                base_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_hidden, 
                                                                activation = tf.nn.relu)
                return base_cell
            
            # This is the official and only good way to declare several layers of cells    
            cell = tf.contrib.rnn.MultiRNNCell(
                [base_cell() for _ in range(n_stacked_layers)])            


        def _rnn_seq_to_vec(seq_inputs,
                                cell,
                                relevant_step,
                                loop_function,
                                scope=None):
            """RNN for the sequence-to-vec model.
            Args:
            seq_inputs: A list of 2D Tensors [batch_size x n_inputs], one per time step.
            cell: rnn_cell.RNNCell defining the cell function and size.
            relevant_step: Negative offset of the last step whose input is used;
                later steps are fed from the previous output through loop_function.
            loop_function: Function mapping the previous cell output
                [batch_size x n_hidden] to a Tensor [batch_size x n_outputs].
            scope: VariableScope for the created subgraph; defaults to "rnn_seq_to_vec".
            Returns:
            A tuple of the form (final_output, state), where:
              final_output: The output of the last step after the fully connected layer,
                a 2D Tensor of shape [batch_size x n_outputs].
              state: The state of each cell at the final time-step.
                It is a 2D Tensor of shape [batch_size x cell.state_size].
                (Note that in some cases, like basic RNN cell or GRU cell, outputs and
                 states can be the same. They are different for LSTM cells though.)
            """ 
            with variable_scope.variable_scope(scope or "rnn_seq_to_vec"):
                state = cell.zero_state(tf.shape(seq_inputs[0])[0], tf.float32)
                prev = None
                for i, inp in enumerate(seq_inputs):
                    # If the step is irrelevant, use the output of previous step instead                                                         
                    if i-n_steps > relevant_step:                                                
                        with variable_scope.variable_scope("loop_function", reuse=True):
                            if n_inputs > n_outputs:
                                inp = tf.concat([loop_function(prev), inp[:,n_outputs:]], axis=1)
                            else:
                                inp = loop_function(prev)
                    if i > 0:
                        variable_scope.get_variable_scope().reuse_variables()
                    output, state = cell(inp, state)
                    prev = output

                # We take the last step output and apply the fully connected layer to get the final output
                final_output = tf.matmul(output, weights['out']) + biases['out'] 

            return final_output, state

        def _loop_function(prev):
                  '''Naive implementation of loop function for _rnn_seq_to_vec. Transform prev from 
                  dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be
                  used as input of next time step '''
                  return tf.matmul(prev, weights['out']) + biases['out']
                                                 

        final_output, seq_memory = _rnn_seq_to_vec(
                seq_inp,
                cell,         
                relevant_step,
                loop_function = _loop_function
            )
                                                          
    # Training loss and optimizer
    with tf.variable_scope('Loss'):
        # L1 loss
        vector_error = tf.abs(final_output-y)
        output_loss = tf.reduce_mean(vector_error)

        # L2 regularization for weights and biases: should I keep that
        reg_loss = 0
        for tf_var in tf.trainable_variables():
            if 'Biases_' in tf_var.name or 'Weights_' in tf_var.name:
                reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var))

        # To regularize the weights and biases: strange idea...
        loss = output_loss + lambda_l2_reg * reg_loss
        
        #loss = output_loss
        tf.summary.scalar('output_loss',output_loss)
        tf.summary.scalar('reg_loss',reg_loss)

    with tf.variable_scope('Optimizer'):
        optimizer = tf.contrib.layers.optimize_loss(
                loss=loss,
                learning_rate=learning_rate,
                global_step=global_step,
                optimizer='Adam',
                clip_gradients=GRADIENT_CLIPPING)

    saver = tf.train.Saver
      
    file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())  
    merged_summary = tf.summary.merge_all()
    
    return dict(
        seq_inp = seq_inp, 
        y = y,
        final_output = final_output, 
        train_op = optimizer, 
        loss=loss,
        output_loss = output_loss,
        saver = saver,
        file_writer = file_writer,
        merged_summary = merged_summary
        )        
                                                          
#build_graph()
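
The feedback mechanism controlled by `relevant_step` is easier to see outside TensorFlow. Below is a minimal NumPy sketch with a plain tanh RNN instead of the LSTM cell and random, untrained weights; everything in it (`seq_to_vec`, the weight names, the sizes) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_hidden = 50, 8

# Randomly initialised parameters of a plain tanh RNN (illustration only;
# the notebook uses an LSTM cell and trained weights)
W_in = rng.normal(scale=0.1, size=(1, n_hidden))
W_rec = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.1, size=(n_hidden, 1))
b_out = np.zeros(1)

def seq_to_vec(x, relevant_step):
    """x: [batch, n_steps]; inputs after `relevant_step` are never read."""
    h = np.zeros((x.shape[0], n_hidden))
    for i in range(n_steps):
        inp = x[:, i:i + 1]
        if i - n_steps > relevant_step:
            # Unknown step: feed back the projection of the previous output
            inp = h @ W_out + b_out
        h = np.tanh(inp @ W_in + h @ W_rec)
    return h @ W_out + b_out

x = rng.normal(size=(4, n_steps))
print(seq_to_vec(x, relevant_step=-7).shape)
```

With `relevant_step = -7`, the steps at offsets -6 to -1 are replaced by the network's own running prediction, so whatever those columns contain has no effect on the result.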

##### Train the RNN

In [43]:
n_epochs = 150
batch_size = 2500


LOAD_FILE = "../../data/RNN-20180108-3.ckpt"
SAVE_FILE = "../../data/RNN-20180108-3.ckpt"

best_error = np.inf
checks_without_progress = 0
max_checks_without_progress = 25

rnn_model = build_graph(relevant_step = -7 , n_inputs = 2)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()

    temp_saver = rnn_model['saver']()    
    
    # To continue the training
    temp_saver.restore(sess, LOAD_FILE)

    
    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_learn))
        i = 0
        for rnd_indices in np.array_split(rnd_idx, len(X_learn) // batch_size):
            X_batch, y_batch = X_learn[rnd_indices], y_learn[rnd_indices]
            
            input_idx = np.ones((X_batch.shape[0],1))
      
            feed_dict = {rnn_model['seq_inp'][t]: np.concatenate([X_batch[:,t].reshape(-1,1),
                                                                  input_idx*(t-50)],
                                                                 axis = 1) for t in range(n_steps)}
            # input_idx*(t-50),
            # X_batch[:,-10:]
        
            feed_dict.update({rnn_model['y']: y_batch})

            _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict)
            
            if (i%50 == 0):
                merged_summary_str = sess.run(rnn_model['merged_summary'], feed_dict)
                rnn_model['file_writer'].add_summary(merged_summary_str, epoch*len(X_learn) // batch_size + i)
            
            i = i+1
            
        input_idx = np.ones((X_eval.shape[0],1))        
        feed_dict = {rnn_model['seq_inp'][t]: np.concatenate([X_eval[:,t].reshape(-1,1),
                                                              input_idx*(t-50)],
                                                             axis = 1) for t in range(n_steps)}
        feed_dict.update({rnn_model['y']: y_eval})
        eval_error = sess.run(rnn_model['output_loss'], feed_dict)
        print(epoch, "Eval error:", eval_error)

        if eval_error < best_error:
                #Saving the last version of the network
                checks_without_progress = 0
                save_path = temp_saver.save(sess, SAVE_FILE)
                best_error = eval_error

        else: 
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                stopping = True
                break

with tf.Session() as sess:
    temp_saver.restore(sess, SAVE_FILE)
    print("Final eval error: {:.3f}".format(best_error))

ValueError: linear is expecting 2D arguments: [TensorShape([Dimension(None), Dimension(2), Dimension(1)]), TensorShape([Dimension(None), Dimension(32)])]
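
As the message indicates, the cell only accepts 2D tensors, so each step placeholder has to receive an array of shape `[batch_size, n_inputs]`. The sketch below shows one way to assemble such arrays for `n_inputs = 2` on a toy batch; it illustrates the expected shapes only and is not a diagnosis of this particular traceback.

```python
import numpy as np

n_steps = 50
X_batch = np.arange(3 * n_steps, dtype=float).reshape(3, n_steps)  # toy batch
input_idx = np.ones((X_batch.shape[0], 1))

# One [batch, 2] array per time step: the reservation count and the
# (negative) day offset of that step
step_inputs = [
    np.concatenate([X_batch[:, t].reshape(-1, 1), input_idx * (t - n_steps)], axis=1)
    for t in range(n_steps)
]

print(step_inputs[0].shape)
```

Both arrays must sit inside the list passed to `np.concatenate`; the second positional argument of that function is the axis.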

##### Several runs
* RNN-20180106-1: 64 hidden neurons, 2 layers, with 50% dropout during training and none during evaluation. Best result: 4.775, very stable
* RNN-20180106-2: same, without dropout during training and none during evaluation. Best result: no convergence
Following the addition of a dropout layer: bad results
* RNN-20180106-6: gradient clipping at 2.5, no regularization, 1 layer, 32 neurons, dropout wrapper removed. Best result: 0.343, finally better than the stupid predictor! But what did the trick?
* RNN-20180106-7: same without gradient clipping. The behaviour is more erratic and the training does not converge
* RNN-20180106-8: same with a higher gradient clipping threshold (5) and a higher learning rate (alpha = 0.005). Seems very similar, maybe a bit faster
* RNN-20180107-1: bringing back the regularization (0.001). The system does regularize, but what is the point? Best: 0.348, with 
* RNN-20180107-2: with a secondary input for the days before the prediction date. Best: 0.343, not better
* RNN-20180107-3: with 10 more inputs: day of week (one-hot), holiday flag, air/hpg flag. Best: 0.342, it does not seem to improve the prediction much!
* RNN-20180107-4: with 1 input only, training with relevant_step = -2. Best: 0.839 (better than the stupid prediction (1.2), slightly better than without retraining: 0.845)
* RNN-20180107-7: after removing the rows with no reservation at -1 (to reflect the prediction set conditions). Best: 0.232 (not fully converged), worse than the stupid prediction: 
* RNN-20180108-1: same with 2 layers. Worse result: 0.236
* RNN-20180108-2: 1 layer, prediction at -7. Best: 2.130 (vs 2.152 for the stupid prediction), a 1% error improvement :(
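
Several of these runs differ only in the gradient clipping threshold. Assuming the float passed as `clip_gradients` acts as a limit on the global gradient norm (which is my understanding of `tf.contrib.layers.optimize_loss`), the operation can be sketched in NumPy as follows. The function `clip_by_global_norm` below is a stand-alone illustration, not the TensorFlow implementation.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1 when the norm is already below the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=5.0)
print(norm_before, clipped)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.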

#### Evaluate the prediction with a single network training

In [40]:
LOAD_FILE = "../../data/RNN-20180108-2.ckpt"

eval_index = -28
# For an evaluation at eval_index
indices = [i for (i,v) in enumerate(X_eval[:,n_steps + eval_index]) if v==0]

X_eval_reduced = np.delete(X_eval, indices, axis = 0)
y_eval_reduced = np.delete(y_eval, indices, axis = 0)
print ("Complete matrix shape", X_eval.shape, "Reduced matrix shape: ", X_eval_reduced.shape)

rnn_model = build_graph(relevant_step = eval_index)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()

    temp_saver = rnn_model['saver']()
    temp_saver.restore(sess, LOAD_FILE)
       
    feed_dict = {rnn_model['seq_inp'][t]: X_eval_reduced[:,t].reshape(-1,1) for t in range(n_steps)}
    pred = sess.run(rnn_model['final_output'], feed_dict)

error = np.mean(abs(y_eval_reduced-pred))
print("Mean absolute error on the total of reservations: ", error)

stupid_error =  np.mean(abs(X_eval_reduced[:,n_steps + eval_index].reshape(-1,1)-y_eval_reduced))
print("Stupid mean absolute error using the value at day", eval_index, ":" , stupid_error)

Complete matrix shape (144691, 50) Reduced matrix shape:  (16494, 50)
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180108-2.ckpt
Mean absolute error on the total of reservations:  5.96540876294
Stupid mean absolute error using the value at day -28 : 6.00612343883
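
To put the two figures above in perspective, the relative gain of the network over the stupid predictor at day -28 can be computed directly. The helper `relative_improvement` is an illustrative name.

```python
def relative_improvement(baseline_error, model_error):
    """Fraction of the baseline error removed by the model."""
    return (baseline_error - model_error) / baseline_error

# Values copied from the cell output above
stupid_error = 6.00612343883
rnn_error = 5.96540876294

gain = relative_improvement(stupid_error, rnn_error)
print("Relative improvement: {:.2%}".format(gain))
```

The gain stays below 1%, in line with what the -7 run showed.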


#### Predict the reservations with a single network training

In [47]:
LOAD_FILE = "../../data/RNN-20180106-6.ckpt"
    
Data['test']['reservations_prediction'] = 0
    
for i in range(-39, 0, 1):
    print("Processing the values missing at:", i)
    idx = (Data['test']['relevant_last_step']==i)
    X_test = Data['test'][idx].drop(['index', 'store_id', 'visit_date'], axis = 1).values
    rnn_model = build_graph(relevant_step = i)
    init = tf.global_variables_initializer()
    
    with tf.Session() as sess:
        init.run()

        temp_saver = rnn_model['saver']()
        temp_saver.restore(sess, LOAD_FILE)

        feed_dict = {rnn_model['seq_inp'][t]: X_test[:,t].reshape(-1,1) for t in range(n_steps)}
        pred = sess.run(rnn_model['final_output'], feed_dict)
    
    Data['test'].loc[idx,['reservations_prediction']] = pred

Processing the values missing at: -39
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -38
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -37
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -36
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -35
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -34
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -33
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -32
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
Processing the values missing at: -31
INFO:tensorflow:Restoring parameters from ../../data/RNN-20180106-6.ckpt
...

In [48]:
Data['test']

Unnamed: 0,index,store_id,visit_date,res_store_date_partial_sum-50,res_store_date_partial_sum-49,res_store_date_partial_sum-48,res_store_date_partial_sum-47,res_store_date_partial_sum-46,res_store_date_partial_sum-45,res_store_date_partial_sum-44,...,res_store_date_partial_sum-7,res_store_date_partial_sum-6,res_store_date_partial_sum-5,res_store_date_partial_sum-4,res_store_date_partial_sum-3,res_store_date_partial_sum-2,res_store_date_partial_sum-1,res_store_date_partial_sum-0,relevant_last_step,reservations_prediction
0,90,air_0164b9927d20bcc3,2017-04-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,-3,-3,3.004171
1,91,air_0164b9927d20bcc3,2017-04-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,-4,-4,2.001787
2,92,air_0164b9927d20bcc3,2017-04-27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,4.0,4.0,4.0,4.0,4.0,4.0,-5,-5,4.006970
3,93,air_0164b9927d20bcc3,2017-05-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,-9,-9,3.002837
4,94,air_0164b9927d20bcc3,2017-05-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,-10,-10,2.005226
5,95,air_0164b9927d20bcc3,2017-05-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,-13,-13,2.005579
6,96,air_0164b9927d20bcc3,2017-05-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,-17,-17,3.015620
7,97,air_0164b9927d20bcc3,2017-05-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,-21,-21,10.109029
8,379,air_03963426c9312048,2017-04-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,8.0,8.0,8.0,8.0,-2,-2,8.007951
9,380,air_03963426c9312048,2017-04-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,24.0,24.0,24.0,24.0,24.0,24.0,24.0,-4,-4,23.896526


##### Conclusion:
The network very opportunistically chose the following logic: if there is a reservation at the last relevant step, return it; otherwise return an average (2.95)

This reveals that I made several mistakes:
* I should have added to the training set and the prediction set the days with no reservation before the cutoff and no reservation at the end (here the network always finds that there is a reservation at the end, by construction of the training set)
* Similarly, when training the main network, I should distinguish the restaurants that never have a reservation (missing data, -1) from the restaurants that have some reservations (in which case a day without reservation should be treated as a real 0)

Another way would be to keep in the training set only the cases where a reservation has already been made, so as to match the prediction set, and, for the other cases, simply use the restaurant's average reservations on days with no reservation at -D
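
The rule the network appears to have learned is simple enough to write by hand. The sketch below uses the 2.95 average quoted above; the function name `opportunistic_predictor` and the toy input are assumptions.

```python
import numpy as np

def opportunistic_predictor(X, relevant_step, fallback=2.95, n_steps=50):
    """Return the last known count when it is positive, else a fixed average."""
    last_known = X[:, n_steps + relevant_step]
    return np.where(last_known > 0, last_known, fallback)

# Two toy series: one with 3 visitors reserved since D-10, one still empty
X_toy = np.zeros((2, 50))
X_toy[0, 40:] = 3
print(opportunistic_predictor(X_toy, relevant_step=-7))
```

Comparing the network with this hand-written rule on the evaluation set would show how much it has learned beyond it.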

In [None]:
import pickle

DATA_PREDICTION_FILE = "../../data/Data_visit-20180107-1"

tmp = Data['test'][['store_id','visit_date','reservations_prediction']]

tmp.to_pickle(DATA_PREDICTION_FILE)

#### Data augmentation
The idea is to create several incomplete series for training from each complete series.
The idea was abandoned after realizing that the graph cannot depend dynamically on the relevant step.

In [None]:
# Number of incomplete series generated for each  complete series
n_inc_series = 5

Data['train'] = pd.concat([Data['train']]*n_inc_series, ignore_index=True)
Data['train'].loc[:,'relevant_last_step'] = Data['train']['res_store_date_partial_sum-0'].apply(lambda n: randint(-39, -1))
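
One way around the static graph would be to do the truncation on the data side rather than in the graph. The sketch below is an untested idea, not something used in this notebook: every value after the drawn step is overwritten with the last known one. The function `truncate_series` and the toy matrix are assumptions.

```python
import numpy as np

def truncate_series(X, steps):
    """Replace every value after column `n_steps + step` by the last known one."""
    n_rows, n_steps = X.shape
    cols = np.arange(n_steps)
    last_idx = n_steps + steps                    # per-row index of the last known column
    known = cols[None, :] <= last_idx[:, None]
    last_val = X[np.arange(n_rows), last_idx]
    return np.where(known, X, last_val[:, None])

rng = np.random.default_rng(0)
X_toy = np.tile(np.arange(50.0), (3, 1))
steps = rng.integers(-39, 0, size=X_toy.shape[0])   # random offsets in [-39, -1]
X_truncated = truncate_series(X_toy, steps)
```

A single graph could then be trained on rows truncated at different steps, at the cost of teaching it that flat tails mean "unknown".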

In [39]:
i = [t for t in range(5)] 

SyntaxError: invalid syntax (<ipython-input-39-e79cbe3d932f>, line 1)

In [33]:
X_eval.shape

(312017, 50)

#### Add the day of week (optional: does not bring any improvement)

In [5]:
Data['holidays'] = data_load['holiday_dates'].copy()

wkend_holidays = Data['holidays'].apply(
    (lambda x:(x.day_of_week=='Sunday' or x.day_of_week=='Saturday') and x.holiday_flg==1), axis=1)
Data['holidays'].loc[wkend_holidays, 'holiday_flg'] = 0

Data['holidays']['visit_date'] =  Data['holidays']['visit_date'].apply(lambda d:d.date())

# Day-of-week info is added as one-hot columns, considering its importance
Data['holidays'] = pd.get_dummies(Data['holidays'], columns=['day_of_week'])

for site in ['air', 'hpg']:
    df_res_series = site + '_res_series'
    Data[df_res_series] = pd.merge(Data[df_res_series], Data['holidays'], how='left', on=['visit_date'])  
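
For reference, here is what the calendar preparation above produces on a two-row toy frame. The frame is hypothetical; the column names follow `date_info.csv` after the rename done at load time.

```python
import datetime
import pandas as pd

toy = pd.DataFrame({
    'visit_date': [datetime.date(2017, 4, 29), datetime.date(2017, 5, 1)],
    'day_of_week': ['Saturday', 'Monday'],
    'holiday_flg': [1, 0],
})

# Holidays falling on a weekend are not flagged, as in the cell above
is_weekend = toy['day_of_week'].isin(['Saturday', 'Sunday'])
toy.loc[is_weekend & (toy['holiday_flg'] == 1), 'holiday_flg'] = 0

# One indicator column per day name present in the data
toy = pd.get_dummies(toy, columns=['day_of_week'])
print(toy.columns.tolist())
```

With the full calendar, `get_dummies` yields seven `day_of_week_*` columns, which together with the other flags would account for the additional inputs tried in the RNN-20180107-3 run.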
