# Chapter 6

### Using recurrent dropout to fight overfitting

We’re already familiar with a classic technique for fighting overfitting: dropout, which randomly zeros out input units of a layer in order to break happenstance correlations in the training data that the layer is exposed to.

But how to correctly apply dropout in recurrent networks isn’t a trivial question. It has long been known that applying dropout before a recurrent layer hinders learning rather than helping with
regularization. In 2015, Yarin Gal, as part of his PhD thesis on Bayesian deep learning, determined the proper way to use dropout with a recurrent network: 
* The same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep.

What’s more, in order to regularize the representations formed by the recurrent gates
of layers such as `GRU` and `LSTM`, a temporally constant dropout mask should be applied
to the inner recurrent activations of the layer (a recurrent dropout mask). 

Using the same dropout mask at every timestep allows the network to properly propagate its
learning error through time; a temporally random dropout mask would disrupt this
error signal and be harmful to the learning process.
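
To make the difference concrete, the following NumPy sketch contrasts a mask that is resampled at every timestep with one that is sampled once and reused across all timesteps. It is an illustration only, not how Keras implements the mechanism internally; the array shapes, the `rate` value, and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

timesteps, features = 5, 4   # hypothetical sequence shape
rate = 0.5                   # hypothetical dropout rate
sequence = np.ones((timesteps, features))

# Per-timestep mask: a fresh pattern of dropped units at every timestep.
varying_mask = rng.random((timesteps, features)) >= rate

# Temporally constant mask: one pattern, sampled once and reused at every timestep.
constant_mask = np.broadcast_to(rng.random(features) >= rate, (timesteps, features))

# Inverted dropout: scale the surviving units so the expected activation is unchanged.
varying_out = sequence * varying_mask / (1.0 - rate)
constant_out = sequence * constant_mask / (1.0 - rate)

# With the constant mask, the same units are dropped at every timestep.
same_every_step = all(
    np.array_equal(constant_out[0], constant_out[t]) for t in range(timesteps)
)
print(same_every_step)
```

Printing `varying_out` and `constant_out` side by side shows the contrast: the rows of `constant_out` are identical, while the rows of `varying_out` generally differ from one timestep to the next.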
 
Yarin Gal did his research using Keras and helped build this mechanism directly
into Keras recurrent layers. Every recurrent layer in Keras has two dropout-related
arguments: 
* `dropout`, a float specifying the dropout rate for input units of the layer, and
* `recurrent_dropout`, specifying the dropout rate of the recurrent units. 

Let’s add `dropout` and `recurrent_dropout` to the `GRU` layer and see how doing so impacts overfitting. Because networks being regularized with dropout always take longer to fully converge, we’ll train the network for twice as many epochs.

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

In [11]:
# Inspecting the data of the Jena weather dataset

import os

data_dir = 'jena_climate'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

In [12]:
# The context manager closes the file even if reading fails
with open(fname) as f:
    data = f.read()

In [13]:
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]

In [14]:
# Parsing the data

import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))

for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values

In [15]:
# Normalizing the data

mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std

In [16]:
# Training and evaluating a dropout-regularized GRU-based model

model = Sequential()
model.add(layers.GRU(32,dropout=0.2,recurrent_dropout=0.2,input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')

In [17]:
# Generator yielding timeseries samples and their targets

def generator(data, lookback, delay, min_index, max_index,shuffle=False, batch_size=128, step=6):
    
    if max_index is None:
        max_index = len(data) - delay - 1
    
    i = min_index + lookback
    
    while 1:
        
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        
        samples = np.zeros((len(rows),lookback // step,data.shape[-1]))
        targets = np.zeros((len(rows),))
        
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        
        yield samples, targets

In [18]:
# Preparing the training, validation, and test generators

lookback = 1440
step = 6
delay = 144
batch_size = 128
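
As a quick sanity check on these settings, the arithmetic below converts them into wall-clock terms. It assumes the Jena dataset's recording interval of one record every 10 minutes, which is not stated in this section; the constant `MINUTES_PER_RECORD` and the derived variable names are made up for this sketch.

```python
MINUTES_PER_RECORD = 10  # assumption: one record every 10 minutes

lookback = 1440  # timesteps of history available to each sample
step = 6         # keep one record out of every 6
delay = 144      # how far ahead the target lies, in timesteps

lookback_days = lookback * MINUTES_PER_RECORD / (60 * 24)  # history window in days
delay_hours = delay * MINUTES_PER_RECORD / 60              # forecast horizon in hours
sampling_minutes = step * MINUTES_PER_RECORD               # spacing between kept records
timesteps_per_sample = lookback // step                    # sequence length seen by the model

print(lookback_days, delay_hours, sampling_minutes, timesteps_per_sample)
```

Under that assumption, each sample covers 10 days of history sampled once per hour (240 timesteps), and the target lies 24 hours in the future.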

In [19]:
train_gen = generator(float_data,lookback=lookback,delay=delay,min_index=0,
                      max_index=200000,shuffle=True,step=step,batch_size=batch_size)

val_gen = generator(float_data,lookback=lookback,delay=delay,min_index=200001,
                    max_index=300000,step=step,batch_size=batch_size)

test_gen = generator(float_data,lookback=lookback,delay=delay,min_index=300001,
                     max_index=None,step=step,batch_size=batch_size)

In [21]:
# How many steps to draw from val_gen in order to see the entire validation set
# (each step yields one batch of batch_size samples)
val_steps = (300000 - 200001 - lookback) // batch_size

# How many steps to draw from test_gen in order to see the entire test set
test_steps = (len(float_data) - 300001 - lookback) // batch_size

In [23]:
# model.fit accepts Python generators directly; fit_generator is deprecated since TensorFlow 2.1
history = model.fit(train_gen, steps_per_epoch=500, epochs=40,
                    validation_data=val_gen, validation_steps=val_steps)

In [None]:
# Plotting results

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

The figure above shows the results. Success! We’re no longer overfitting during the first 30
epochs. But although we have more stable evaluation scores, our best scores aren’t
much lower than they were previously. 

### Stacking recurrent layers

Because we’re no longer overfitting but seem to have hit a performance bottleneck,
we should consider increasing the capacity of the network. Recall the description of
the universal machine-learning workflow: it’s generally a good idea to increase the
capacity of our network until overfitting becomes the primary obstacle (assuming we’re already taking basic steps to mitigate overfitting, such as using dropout). 

As long as we aren’t overfitting too badly, we’re likely under capacity. Increasing network capacity is typically done by 
* Increasing the number of units in the layers or 
* Adding more layers. 

Recurrent layer stacking is a classic way to build more powerful recurrent networks: for instance, what currently powers the Google Translate algorithm is a stack of seven large `LSTM` layers, which is huge.

To stack recurrent layers on top of each other in TensorFlow/Keras, all intermediate layers
should return their full sequence of outputs (a 3D tensor) rather than their output at
the last timestep. This is done by specifying `return_sequences=True`.
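
The shape difference can be illustrated without Keras. The sketch below uses a toy tanh RNN written in NumPy, not a `GRU` and not the Keras implementation; the function `simple_rnn`, its weight initialization, and the batch shape are all assumptions made for this example. Only the meaning of the `return_sequences` flag mirrors the Keras argument.

```python
import numpy as np

def simple_rnn(inputs, units, return_sequences=False, seed=0):
    """Toy tanh RNN (not a GRU) over inputs of shape (batch, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch, timesteps, features = inputs.shape
    W = rng.normal(scale=0.1, size=(features, units))  # input weights
    U = rng.normal(scale=0.1, size=(units, units))     # recurrent weights
    state = np.zeros((batch, units))
    outputs = []
    for t in range(timesteps):
        state = np.tanh(inputs[:, t, :] @ W + state @ U)
        outputs.append(state)
    if return_sequences:
        return np.stack(outputs, axis=1)  # 3D: (batch, timesteps, units)
    return state                          # 2D: (batch, units)

x = np.zeros((8, 240, 14))  # hypothetical batch: 8 samples, 240 timesteps, 14 features

full = simple_rnn(x, units=32, return_sequences=True)  # output at every timestep
last = simple_rnn(x, units=32)                         # output at the last timestep only
second = simple_rnn(full, units=64)                    # a second layer stacked on the full sequence

print(full.shape, last.shape, second.shape)
```

The second layer can consume `full` because it is still a sequence. Passing `last` instead would fail in this sketch, since a 2D array has no timestep axis to iterate over.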

In [11]:
# Training and evaluating a dropout-regularized, stacked GRU model

model = Sequential()
model.add(layers.GRU(32,dropout=0.1,recurrent_dropout=0.5,return_sequences=True,
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64, activation='relu',dropout=0.1,recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')

In [None]:
# model.fit accepts Python generators directly; fit_generator is deprecated since TensorFlow 2.1
history = model.fit(train_gen, steps_per_epoch=500, epochs=40,
                    validation_data=val_gen, validation_steps=val_steps)

In [None]:
# Plotting results

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

The figure above shows the results. We can see that the added layer does improve the
results a bit, though not significantly. We can draw two conclusions:
* Because we’re still not overfitting too badly, we could safely increase the size of our layers in a quest for validation-loss improvement. This has a non-negligible computational cost, though.
* Adding a layer didn’t help by a significant factor, so we may be seeing diminishing returns from increasing network capacity at this point.
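
To get a rough feel for the computational cost of wider layers, the sketch below estimates how the number of trainable weights in a single GRU layer grows with its size. It assumes the textbook GRU formulation: three weight sets (update gate, reset gate, candidate state), each with an input kernel, a recurrent kernel, and one bias vector. Some implementations keep a second bias vector per weight set, so the counts a framework reports may be slightly higher. The helper `gru_param_count` and the input width of 14 features are assumptions made for this example.

```python
def gru_param_count(input_dim, units):
    """Approximate weight count of one GRU layer (textbook formulation, one bias per gate)."""
    per_gate = input_dim * units + units * units + units  # input kernel + recurrent kernel + bias
    return 3 * per_gate                                   # update gate, reset gate, candidate state

input_dim = 14  # assumption: number of input features per timestep

for units in (32, 64, 128):
    print(units, gru_param_count(input_dim, units))
```

Because of the `units * units` recurrent kernel, doubling the layer width more than triples the weight count in this range, which is part of why larger layers are noticeably more expensive to train.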