## Advanced use of RNNs

There are three advanced techniques for improving the performance and generalization power of RNNs:

- Recurrent dropout
- Stacking recurrent layers
- Bidirectional recurrent layers

### A temperature forecasting problem

In [None]:
import os
data_dir = '/kaggle/input/weather-archive-jena/'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

# Read the whole CSV file; the context manager closes it automatically
with open(fname) as f:
    data = f.read()

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]

print(header)
print(len(lines))

We have 420551 lines of data. We should convert them to numpy arrays so we can work with them.

In [None]:
import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values

Now we have a numpy array (matrix) organized as `lines x values`.  
We can plot the temperature:

In [None]:
import matplotlib.pyplot as plt

temp = float_data[:, 1]  # column 1 holds the temperature
plt.plot(range(len(temp)), temp)

In [None]:
# One data point was recorded every 10 minutes, so the first 1440 points
# cover the first 10 days of the recording

plt.plot(range(1440), temp[:1440])

The full plot clearly shows the yearly periodicity of the temperature, which is not visible over a timespan of only a few days.

Given a window of past measurements, we want to predict the temperature 24 hours in the future. Let's try that with neural networks.

### Preparing the data

First, we normalize the data, using only the first 200000 timesteps (the training data) to compute the mean and standard deviation.

In [None]:
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
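
The statistics come from the training rows only, so that nothing from the validation and test segments leaks into the preprocessing. A minimal sketch on synthetic data (the array `toy` and the split index `n_train` are made up for illustration) shows the effect: the training slice ends up with zero mean and unit standard deviation per column, and the later rows are scaled with the same constants.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1000 timesteps, 3 features with very different scales
toy = rng.normal(loc=[10.0, -5.0, 100.0], scale=[2.0, 0.5, 30.0], size=(1000, 3))
n_train = 600  # hypothetical split: only these rows define the statistics

mean = toy[:n_train].mean(axis=0)
std = toy[:n_train].std(axis=0)
toy_norm = (toy - mean) / std

# Per-column statistics of the normalized training slice
train_mean = toy_norm[:n_train].mean(axis=0)
train_std = toy_norm[:n_train].std(axis=0)
print(train_mean.round(6), train_std.round(6))
```

The rows after `n_train` will generally not have exactly zero mean or unit variance, which is expected.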

Now we create a generator that yields batches of timeseries samples together with an array of target temperatures:

In [None]:
def generator(data, lookback, delay, min_index, max_index, shuffle=False, batch_size=128, step=6):
    '''
    - data : The original array of floating-point data, normalized in the previous cell.
    - lookback : How many timesteps back the input data should go.
    - delay : How many timesteps in the future the target should be.
    - min_index and max_index : Indices in the data array that delimit which timesteps to draw from. 
                                This is useful for keeping a segment of the data for validation and 
                                another for testing.
    - shuffle : Whether to shuffle the samples or draw them in chronological order.
    - batch_size : The number of samples per batch.
    - step : The period, in timesteps, at which you sample data. You'll set it to 6 in order to draw 
            one data point every hour.
    '''
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    
    while True:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
            
        samples = np.zeros((len(rows), 
                            lookback // step, 
                            data.shape[-1]))
        
        targets = np.zeros((len(rows), ))
        
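        for j, row in enumerate(rows):
            # Take every `step`-th row of the lookback window that ends just before `row`
            indices = range(row - lookback, row, step)
            samples[j] = data[indices]
            # The target is the temperature (column 1) `delay` timesteps after `row`
            targets[j] = data[row + delay][1]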
            
        yield samples, targets
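
To make the indexing inside the generator easier to follow, here is a small sketch that builds a single sample the same way the inner loop does. The helper `make_sample` and the array `toy` are hypothetical names introduced only for this illustration; every column of `toy` stores the row index, so the printed values show which rows were picked.

```python
import numpy as np

def make_sample(data, row, lookback, delay, step):
    """Build one (sample, target) pair like the generator's inner loop."""
    indices = range(row - lookback, row, step)  # every `step`-th row of the window
    sample = data[indices]                      # shape: (lookback // step, n_features)
    target = data[row + delay][1]               # column 1 holds the temperature
    return sample, target

# Hypothetical stand-in for float_data: 14 columns, each storing the row index
n_rows, n_features = 3000, 14
toy = np.tile(np.arange(n_rows, dtype=float)[:, None], (1, n_features))

sample, target = make_sample(toy, row=1440, lookback=1440, delay=144, step=6)
print(sample.shape)                 # (240, 14)
print(sample[0, 1], sample[-1, 1])  # 0.0 1434.0
print(target)                       # 1584.0
```

The window covers rows 0 to 1434 in steps of 6, and the target sits 144 rows after `row`.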

We will use the generator function to instantiate 3 generators: one for training, one for validation, and one for testing. Each one will take different temporal segments of the data. 

In [None]:
lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, 
                      lookback=lookback, 
                      delay=delay, 
                      min_index=0, 
                      max_index=200000, 
                      shuffle=True, 
                      step=step, 
                      batch_size=batch_size)

val_gen = generator(float_data, 
                    lookback=lookback, 
                    delay=delay, 
                    min_index=200001, 
                    max_index=300000, 
                    step=step, 
                    batch_size=batch_size)

test_gen = generator(float_data, 
                     lookback=lookback, 
                     delay=delay, 
                     min_index=300001, 
                     max_index=None, 
                     step=step, 
                     batch_size=batch_size)

val_steps = (300000 - 200001 - lookback) // batch_size

test_steps = (len(float_data) - 300001 - lookback) // batch_size

In [None]:
val_steps
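
The constants used above are easier to read once converted into time units. The following arithmetic sketch assumes one record every 10 minutes and the 420551 data lines reported earlier; `minutes_per_step` and `n_rows` are names introduced here for illustration.

```python
minutes_per_step = 10  # one record every 10 minutes
lookback, step, delay, batch_size = 1440, 6, 144, 128

print(lookback * minutes_per_step / (60 * 24))  # 10.0 days of history
print(step * minutes_per_step / 60)             # 1.0 hour between sampled points
print(delay * minutes_per_step / 60)            # 24.0 hours ahead
print(lookback // step)                         # 240 timesteps per sample

n_rows = 420551  # number of data lines reported earlier
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (n_rows - 300001 - lookback) // batch_size
print(val_steps, test_steps)                    # 769 930
```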

To have a baseline against which to compare our networks, we create a naive method that predicts the temperature 24 hours from now to be exactly the same as it is right now. We evaluate this method with the mean absolute error (MAE):

In [None]:
def evaluate_naive_method():
    batch_maes = [] 
    for step in range(val_steps): 
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1]
        mae = np.mean(np.abs(preds - targets)) 
        batch_maes.append(mae)
    print(np.mean(batch_maes))
    
evaluate_naive_method()
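
The same idea can be checked by hand on a tiny example. This sketch uses made-up normalized temperatures and a delay of 2 steps instead of 144, so every number can be verified on paper.

```python
import numpy as np

# Hypothetical normalized temperatures, one value per timestep
temps = np.array([0.10, 0.30, 0.20, -0.10, 0.00, 0.40])
delay = 2  # predict 2 steps ahead (the notebook uses 144 steps = 24 hours)

preds = temps[:-delay]   # "the future value equals the current value"
targets = temps[delay:]  # what was actually observed `delay` steps later
mae = np.mean(np.abs(preds - targets))
print(round(float(mae), 3))  # 0.3
```

The absolute errors are 0.1, 0.4, 0.2 and 0.5, which average to 0.3.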

In [None]:
celsius_mae = 0.29 * std[1]
print(celsius_mae)

We have an average absolute error of about `2.57°C` as a baseline.
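
Multiplying the normalized MAE by `std[1]` works because normalization is `(T - mean) / std`: the mean cancels when two normalized values are subtracted, so only the scale factor remains. A short sketch with made-up statistics (`t_mean` and `t_std` are not the real Jena values) confirms this.

```python
import numpy as np

# Hypothetical statistics for the temperature column (not the real Jena values)
t_mean, t_std = 9.0, 8.5

true_c = np.array([12.0, 3.5, -1.0])  # temperatures in °C
pred_c = np.array([10.0, 5.0, 1.5])   # some predictions in °C

# Normalize both with the same statistics
true_n = (true_c - t_mean) / t_std
pred_n = (pred_c - t_mean) / t_std

mae_c = np.mean(np.abs(pred_c - true_c))  # error in °C
mae_n = np.mean(np.abs(pred_n - true_n))  # error in normalized units
print(np.isclose(mae_n * t_std, mae_c))   # True
```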

### A basic machine-learning approach

Before using an RNN, let's try a simple model built from Dense layers:

In [None]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
import time

model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, 
                                      float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')

start = time.perf_counter()

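# Note: fit_generator is deprecated in recent Keras releases, where model.fit accepts generators directly
history = model.fit_generator(train_gen, 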
                              steps_per_epoch=500, 
                              epochs=20, 
                              validation_data=val_gen, 
                              validation_steps=val_steps)

elapsed = time.perf_counter() - start
print('Elapsed %.3f seconds.' % elapsed)

Local GPU time: 351.782 seconds -> 5.86 minutes  
Kaggle GPU time: 400.122 seconds -> 6.67 minutes

Now we plot the loss curves:

In [None]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'r*', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

We can compare the validation loss in this plot with the MAE obtained from the baseline calculation (~0.29). The simple neural network ends up at about the same value, which shows how strong our common-sense baseline is compared with this basic machine-learning approach.
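
To get a feel for what the dense model works with, it helps to count its inputs and weights. `Flatten` turns each sample into one long vector, so the order of the timesteps only survives as a position in that vector. The sketch below assumes 14 feature columns (the value of `float_data.shape[-1]` for this dataset after dropping the timestamp).

```python
lookback, step = 1440, 6
n_features = 14  # assumption: 14 measurement columns after dropping the timestamp

flat_inputs = (lookback // step) * n_features  # size of the Flatten output
dense_32 = flat_inputs * 32 + 32               # weights + biases of Dense(32)
dense_1 = 32 * 1 + 1                           # weights + bias of Dense(1)
print(flat_inputs, dense_32, dense_1, dense_32 + dense_1)  # 3360 107552 33 107585
```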

### First recurrent baseline

An RNN reads the data as a sequence, which suits this kind of time-ordered data well.

We'll be using the Gated Recurrent Unit ([GRU](https://arxiv.org/abs/1412.3555)) layer, which follows the same principle as LSTM but is cheaper to run.
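
To make the layer less of a black box, here is a minimal NumPy sketch of one GRU update with an update gate, a reset gate and a candidate state. It follows the commonly cited formulation with bias terms added, it is not Keras's internal implementation, and the weights are random placeholders rather than trained values. The helper names `sigmoid` and `gru_step` and the parameter dictionary `p` are introduced only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update for a single timestep (illustrative sketch only)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate decides how much of the old state is used
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    # The update gate blends the old state with the candidate
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
n_features, units = 14, 32  # sizes mirror the notebook; weights are random
p = {}
for gate in "zrh":
    p["W" + gate] = rng.normal(scale=0.1, size=(units, n_features))
    p["U" + gate] = rng.normal(scale=0.1, size=(units, units))
    p["b" + gate] = np.zeros(units)

sequence = rng.normal(size=(240, n_features))  # hypothetical input sample
h = np.zeros(units)
for x in sequence:
    h = gru_step(x, h, p)
print(h.shape)  # (32,)
```

Only the final state `h` is kept here, which matches a recurrent layer that returns its last output and feeds it to `Dense(1)`.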

In [None]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
import time

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')

start = time.perf_counter()

history = model.fit_generator(train_gen, 
                              steps_per_epoch=500, 
                              epochs=20, 
                              validation_data=val_gen, 
                              validation_steps=val_steps)

elapsed = time.perf_counter() - start
print('Elapsed %.3f seconds.' % elapsed)

Local GPU time: 6552.676 seconds -> 1.82 hours!  
Kaggle GPU time: 3694.019 seconds -> 61.6 minutes

In [None]:
from keras import models

# models.save_model(model, 'first_rnn_jena_climate.h5')
models.save_model(model, '/kaggle/working/first_rnn_jena_climate_k.h5')

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'r*', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Here we can see that the validation loss reached a lower value than the baseline. 

In [None]:
min(val_loss)

In [None]:
celsius_mae_simple_rnn = min(val_loss) * std[1]
print('Mean absolute error = {:.3}°C'.format(celsius_mae_simple_rnn))

We have an MAE of about `2.32°C`, better than the baseline of `2.57°C`.

### Using recurrent dropout to fight overfitting

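Recurrent layers support a second kind of dropout, recurrent dropout, which is applied to the recurrent activations of the layer and uses the same dropout mask at every timestep instead of a new random mask per timestep. According to this [thesis](http://mlg.eng.cam.ac.uk/yarin/blog_2248.html), this is the proper way to use dropout with RNNs. In Keras, `dropout` sets the drop rate for the layer inputs and `recurrent_dropout` sets it for the recurrent state.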
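
The key point, one dropout mask shared by all timesteps, can be sketched with a toy tanh recurrence in NumPy. This is only an illustration of the masking idea under simplified assumptions, not how Keras implements it: the helper `run_rnn`, the weights `W` and the inputs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, units, rate = 5, 8, 0.2

# One mask for the whole sequence (inverted dropout: kept units are rescaled)
keep = rng.random(units) >= rate
mask = keep / (1.0 - rate)
fixed_masks = np.tile(mask, (timesteps, 1))  # the same row repeated per timestep

W = rng.normal(scale=0.3, size=(units, units))  # hypothetical recurrent weights
inputs = rng.normal(size=(timesteps, units))    # hypothetical projected inputs

def run_rnn(inputs, W, masks):
    """Toy tanh recurrence where the previous state is masked at every step."""
    h = np.zeros(W.shape[0])
    for x_t, m in zip(inputs, masks):
        h = np.tanh(x_t + W @ (h * m))
    return h

h = run_rnn(inputs, W, fixed_masks)
print(h.shape)                                   # (8,)
print(bool((fixed_masks == fixed_masks[0]).all()))  # True
```

Drawing a fresh mask per timestep would mean replacing `fixed_masks` with an array of independently sampled rows, which is the variant that recurrent dropout avoids.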

In [None]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
import time 

model = Sequential()
model.add(layers.GRU(32, 
                     dropout=0.2, 
                     recurrent_dropout=0.2, 
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')

# start = time.perf_counter()

history = model.fit_generator(train_gen, 
                              steps_per_epoch=500, 
                              epochs=2, 
                              validation_data=val_gen, 
                              validation_steps=val_steps)

# elapsed = time.perf_counter() - start
# print('Elapsed %.3f seconds.' % elapsed)

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'r*', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()