# Chapter 6

### Using bidirectional RNNs

A bidirectional RNN is a common RNN variant that can offer greater performance than a regular RNN on certain tasks. It’s frequently used in natural-language processing—we
could call it the Swiss Army knife of deep learning for natural-language processing.

RNNs are notably order dependent, or time dependent: they process the timesteps of their input sequences in order, and shuffling or reversing the timesteps can completely change the representations the RNN extracts from the sequence. This is precisely the reason they perform well on problems where order is meaningful, such as the temperature-forecasting problem.

A bidirectional RNN exploits the order sensitivity of RNNs: it consists of using two regular RNNs, such as the `GRU` and `LSTM` layers, each of which processes the input sequence in one direction (chronologically and antichronologically), and then merging their representations. 
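The mechanism can be sketched in plain NumPy. This is only an illustration under simplifying assumptions: `simple_rnn` is a hypothetical minimal tanh RNN (not a `GRU` or `LSTM`), the weights are random, and the two directions are merged by concatenation, which is one possible merge strategy.

```python
import numpy as np

def simple_rnn(inputs, W, U, b):
    """Run a minimal tanh RNN over inputs of shape (timesteps, features).

    Returns the final state, of shape (units,).
    """
    state = np.zeros(U.shape[0])
    for x_t in inputs:
        state = np.tanh(x_t @ W + state @ U + b)
    return state

rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4
inputs = rng.normal(size=(timesteps, features))

# Two independent sets of weights, one per direction.
fwd = (rng.normal(size=(features, units)), rng.normal(size=(units, units)), np.zeros(units))
bwd = (rng.normal(size=(features, units)), rng.normal(size=(units, units)), np.zeros(units))

forward_state = simple_rnn(inputs, *fwd)         # chronological order
backward_state = simple_rnn(inputs[::-1], *bwd)  # antichronological order

# Merge the two representations by concatenating them.
merged = np.concatenate([forward_state, backward_state])
print(merged.shape)  # (8,)
```

The merged vector has twice as many entries as a single direction would produce, because each direction contributes its own `units`-sized representation.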

By processing a sequence both ways, a bidirectional RNN can catch patterns that may be overlooked by a unidirectional RNN. Remarkably, the fact that the RNN layers have so far processed sequences in chronological order (older timesteps first) may have been an arbitrary decision. At least, it’s a decision we have made no attempt to question so far.

Could the RNNs have performed
well enough if they processed input sequences in antichronological order, for instance
(newer timesteps first)? Let’s try this in practice and see what happens. 

All we need to do is write a variant of the data generator where the input sequences are reversed along the time dimension (replace the last line with `yield samples[:, ::-1, :], targets`).
Training the same one-GRU-layer network that we used earlier, we get the results shown in the figure below.
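To see what the slice `[:, ::-1, :]` does, here is a small check on a made-up batch of shape `(samples, timesteps, features)`. The array contents are arbitrary; only the indexing pattern matters.

```python
import numpy as np

# Hypothetical batch: 2 samples, 3 timesteps, 2 features per timestep.
samples = np.arange(12).reshape(2, 3, 2)

# Reverse only the time axis; the sample and feature axes are untouched.
reversed_samples = samples[:, ::-1, :]

print(samples[0].tolist())           # [[0, 1], [2, 3], [4, 5]]
print(reversed_samples[0].tolist())  # [[4, 5], [2, 3], [0, 1]]
```

The timesteps now run from newest to oldest, while the order of the features within each timestep is preserved.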

![image.png](attachment:image.png)

The **reversed-order GRU** strongly underperforms even the common-sense baseline,
indicating that in this case, chronological processing is important to the success of our
approach. 

This makes perfect sense: the underlying GRU layer will typically be better at remembering the recent past than the distant past, and naturally the more recent weather data points are more predictive than older data points for the problem (that’s what makes the common-sense baseline fairly strong).

Thus the chronological version of the layer is bound to outperform the reversed-order version. Importantly, this isn’t true for many other problems, including natural language: intuitively, the importance of a word in understanding a sentence isn’t usually dependent on its position in the sentence. Let’s try the same trick on the LSTM IMDB example.

In [25]:
# Training and evaluating an LSTM using reversed sequences

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

In [26]:
max_features = 10000 # Number of words to consider as features
maxlen = 500 # Cuts off texts after this number of words (among the max_features most common words)

In [27]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

In [4]:
# Reverses sequences

x_train = [x[::-1] for x in x_train]
x_test = [x[::-1] for x in x_test]

In [5]:
# Pads sequences
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
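The order of the two steps matters. The sketch below mimics reverse-then-pad in pure Python, assuming `pad_sequences` is used with its default settings, which pad and truncate at the start of each sequence. `pad_pre` is a hypothetical stand-in written for illustration, not the Keras function, and the word indices are made up.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: keep the last `maxlen` items, then left-pad with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

long_review = [11, 12, 13, 14, 15, 16]  # made-up word indices, longer than maxlen
short_review = [21, 22]                 # shorter than maxlen

print(pad_pre(long_review[::-1], maxlen=4))   # [14, 13, 12, 11]
print(pad_pre(short_review[::-1], maxlen=4))  # [0, 0, 22, 21]
```

Under these assumptions, a review longer than `maxlen` keeps its first `maxlen` words (in reversed order) when it is reversed before padding, whereas the unreversed pipeline would keep its last `maxlen` words. The two models therefore may not see exactly the same part of long reviews.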

In [7]:
model = Sequential()
model.add(layers.Embedding(max_features, 128))
model.add(layers.LSTM(32))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [8]:
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)


We get performance nearly identical to that of the chronological-order LSTM.
On such a text dataset, reversed-order processing works just as well as
chronological processing, confirming the hypothesis that, although word order does
matter in understanding language, which order we use isn’t crucial. 

Importantly, an
RNN trained on reversed sequences will learn different representations than one
trained on the original sequences, much as we would have different mental models if
time flowed backward in the real world—if we lived a life where we died on our first
day and were born on our last day. 

In machine learning, representations that are different yet useful are always worth exploiting, and the more they differ, the better: they
offer a new angle from which to look at our data, capturing aspects of the data that
were missed by other approaches, and thus they can help boost performance on a
task. This is the intuition behind ensembling, a concept we’ll explore later.
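A toy numerical illustration of this intuition, under strong assumptions: two hypothetical models whose errors are independent and equally large. Real models rarely have fully independent errors, so the gain shown here is an upper-end case rather than a typical result.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = rng.normal(size=10_000)

# Two hypothetical models: equally accurate, but their errors are independent.
pred_a = truth + rng.normal(scale=0.5, size=truth.shape)
pred_b = truth + rng.normal(scale=0.5, size=truth.shape)

# The simplest possible ensemble: average the two predictions.
ensemble = (pred_a + pred_b) / 2

def mae(pred):
    """Mean absolute error against the synthetic ground truth."""
    return np.abs(pred - truth).mean()

print(mae(pred_a), mae(pred_b), mae(ensemble))
```

The averaged prediction has a lower error than either model alone because the independent errors partly cancel out. The more the two sets of errors overlap, the smaller this benefit becomes.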

A bidirectional RNN exploits this idea to improve on the performance of chronological-order RNNs. It looks at its input sequence both ways, obtaining potentially richer representations and capturing patterns that may have been missed by the chronological-order version alone.

![image.png](attachment:image.png)

To instantiate a bidirectional RNN, we use the `Bidirectional` layer, which takes
as its first argument a recurrent layer instance. `Bidirectional` creates a second, separate
instance of this recurrent layer and uses one instance for processing the input sequences
in chronological order and the other instance for processing the input sequences in
reversed order. Let’s try it on the IMDB sentiment-analysis task.


In [9]:
# Training and evaluating a bidirectional LSTM

model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [10]:
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)


* It performs slightly better than the regular LSTM we tried earlier, achieving over `89%` validation accuracy.
* It also seems to overfit more quickly, which is unsurprising because a bidirectional layer has twice as many parameters as a chronological LSTM.
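The parameter doubling can be checked with a little arithmetic. The sketch below uses the standard LSTM weight count (four weight blocks, each with an input kernel, a recurrent kernel, and a bias) and assumes a 32-dimensional input feeding 32 units. `lstm_params` is a hypothetical helper, not a Keras function.

```python
def lstm_params(input_dim, units):
    """Trainable weights in a standard LSTM layer.

    Four weight blocks (three gates plus the candidate cell state), each with
    an input kernel, a recurrent kernel, and a bias vector.
    """
    return 4 * (units * (input_dim + units) + units)

embedding_dim, units = 32, 32

single = lstm_params(embedding_dim, units)
bidirectional = 2 * single  # one independent copy of the layer per direction

print(single)         # 8320
print(bidirectional)  # 16640
```

Note that the doubling applies to the recurrent layer only. The embedding layer is shared, and a `Dense` layer on top sees a wider input if the two directions are concatenated.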

With some regularization, the bidirectional approach would likely be a
strong performer on this task.
 
Now let’s try the same approach on the temperature-prediction task.

In [12]:
import os

data_dir = 'jena_climate'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

In [13]:
# Read the whole file; the context manager closes it even if reading fails
with open(fname) as f:
    data = f.read()

In [14]:
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]

In [15]:
# Parsing the data

import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))

for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values

In [16]:
# Normalizing the data

mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
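The statistics are computed on the training portion only, so that no information from the validation or test data leaks into the preprocessing. A minimal sketch of the same idea on synthetic data follows. The array, its scales, and the split point `n_train` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather data: 1,000 timesteps, 3 features
# on very different scales.
data = rng.normal(loc=[10.0, -5.0, 1000.0], scale=[2.0, 0.5, 30.0], size=(1000, 3))
n_train = 600  # hypothetical size of the training portion

# Statistics come from the training slice only...
mean = data[:n_train].mean(axis=0)
std = data[:n_train].std(axis=0)

# ...but are applied to the whole array.
normalized = (data - mean) / std

print(normalized[:n_train].mean(axis=0).round(6))
print(normalized[:n_train].std(axis=0).round(6))
```

The training slice ends up with zero mean and unit standard deviation per feature. The remaining rows are scaled with the same constants, so they will only be approximately standardized.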

In [17]:
# Training a bidirectional GRU

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Bidirectional(layers.GRU(32), input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')

In [19]:
# Generator yielding timeseries samples and their targets

def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    # Default to the last index that still has a target `delay` steps ahead
    if max_index is None:
        max_index = len(data) - delay - 1

    i = min_index + lookback

    while True:
        if shuffle:
            # Draw random window end points
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            # Walk through the data in order, wrapping around at the end
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))

        for j, row in enumerate(rows):
            # One timestep every `step` records over the previous `lookback` records
            indices = range(row - lookback, row, step)
            samples[j] = data[indices]
            targets[j] = data[row + delay][1]  # Column 1 holds the target value

        yield samples, targets

In [20]:
# Preparing the training, validation, and test generators

lookback = 1440
step = 6
delay = 144
batch_size = 128
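These settings are easier to read once converted to wall-clock units. The conversion below assumes one record every 10 minutes. That interval is not stated in this section and is encoded in the hypothetical constant `minutes_per_record`.

```python
minutes_per_record = 10  # assumption: the dataset has one record every 10 minutes

lookback, step, delay = 1440, 6, 144

lookback_days = lookback * minutes_per_record / (60 * 24)  # history seen per sample
sample_interval_minutes = step * minutes_per_record        # spacing between kept timesteps
delay_hours = delay * minutes_per_record / 60              # how far ahead the target lies
timesteps_per_sample = lookback // step                    # sequence length fed to the RNN

print(lookback_days)            # 10.0
print(sample_interval_minutes)  # 60
print(delay_hours)              # 24.0
print(timesteps_per_sample)     # 240
```

Under that assumption, each sample covers the previous 10 days at one point per hour, and the target is the value 24 hours later.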

In [21]:
train_gen = generator(float_data,lookback=lookback,delay=delay,min_index=0,
                      max_index=200000,shuffle=True,step=step,batch_size=batch_size)

val_gen = generator(float_data,lookback=lookback,delay=delay,min_index=200001,
                    max_index=300000,step=step,batch_size=batch_size)

test_gen = generator(float_data,lookback=lookback,delay=delay,min_index=300001,
                     max_index=None,step=step,batch_size=batch_size)

In [22]:
# How many steps to draw from val_gen in order to see the entire validation set
# (each step yields one batch of batch_size samples)
val_steps = (300000 - 200001 - lookback) // batch_size

# How many steps to draw from test_gen in order to see the entire test set
test_steps = (len(float_data) - 300001 - lookback) // batch_size

In [24]:
# In TensorFlow 2.x, fit() accepts generators directly; fit_generator() is deprecated
history = model.fit(train_gen, steps_per_epoch=500, epochs=40,
                    validation_data=val_gen, validation_steps=val_steps)

This performs about as well as the regular GRU layer. It’s easy to understand why: all the predictive capacity must come from the chronological half of the network, because the antichronological half is known to be severely underperforming on this task (again, because the recent past matters much more than the distant past in this case).

### Next Steps

There are many other things we could try in order to improve performance on the temperature-forecasting problem:
* Adjust the number of units in each recurrent layer in the stacked setup. The
current choices are largely arbitrary and thus probably suboptimal.
* Adjust the learning rate used by the RMSprop optimizer.
* Try using LSTM layers instead of GRU layers.
* Try using a bigger densely connected regressor on top of the recurrent layers:
that is, a bigger Dense layer or even a stack of Dense layers.
* Don’t forget to eventually run the best-performing models (in terms of **validation MAE**) on the test set! Otherwise, we’ll develop architectures that are overfitting to the validation set.

As always, deep learning is more an art than a science. We can provide guidelines that
suggest what is likely to work or not work on a given problem, but, ultimately, every
problem is unique; we’ll have to evaluate different strategies empirically. There is
currently no theory that will tell us in advance precisely what we should do to optimally solve a problem. We must iterate.

### Conclusion

* As we first learned in chapter 4, when approaching a new problem, it’s good to
first establish common-sense baselines for our metric of choice. If we don’t
have a baseline to beat, we can’t tell whether we’re making real progress.
* Try simple models before expensive ones, to justify the additional expense.
Sometimes a simple model will turn out to be our best option.
* When we have data where temporal ordering matters, recurrent networks are
a great fit and easily outperform models that first flatten the temporal data.
* To use dropout with recurrent networks, we should use a time-constant dropout mask and a recurrent dropout mask. These are built into Keras recurrent layers, so all we have to do is use the `dropout` and `recurrent_dropout` arguments of recurrent layers.
* Stacked RNNs provide more representational power than a single RNN layer.
They’re also much more expensive and thus not always worth it. Although they
offer clear gains on complex problems (such as machine translation), they may
not always be relevant to smaller, simpler problems.
* Bidirectional RNNs, which look at a sequence both ways, are useful on natural-language processing problems. But they aren’t strong performers on sequence data where the recent past is much more informative than the beginning of the sequence.

**NOTE** There are two important concepts we won’t cover in detail here: recurrent attention and sequence masking. Both tend to be especially relevant for natural-language processing, and they aren’t particularly applicable to the temperature-forecasting problem. We’ll leave them for future study outside of this book.