# Bidirectional LSTM on IMDB

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2020/05/03<br>
**Last modified:** 2020/05/03<br>
**Description:** Train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset.

## Setup

In [None]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Pad or truncate each review to 200 tokens (pad_sequences keeps the last 200 by default)


## Build the model

In [None]:
# Input for fixed-length (padded) sequences of integers; Flatten needs a known length
inputs = keras.Input(shape=(maxlen,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)

# MLP variant (Task 3): flatten the embeddings and apply dense layers
x = layers.Flatten()(x)
x = layers.Dense(64)(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation="relu")(x)

# Alternative (Task 2): a single bidirectional LSTM instead
# x = layers.Bidirectional(layers.LSTM(64))(x)

# Alternative (original model): two stacked bidirectional LSTMs
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# x = layers.Bidirectional(layers.LSTM(64))(x)

# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()


Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_10 (InputLayer)       [(None, 200)]             0         
                                                                 
 embedding_9 (Embedding)     (None, 200, 128)          2560000   
                                                                 
 flatten_5 (Flatten)         (None, 25600)             0         
                                                                 
 dense_16 (Dense)            (None, 64)                1638464   
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_17 (Dense)            (None, 64)                4160      
                                                                 
 dense_18 (Dense)            (None, 1)                 65  

The output of model.summary() describes the model architecture, not the training data. The summary printed above is for the MLP variant used in Task 3. For the original two-layer bidirectional LSTM, it lists the layers input_1, embedding, bidirectional, bidirectional_1 and dense, along with the output shape and parameter count of each. At the bottom, it reports the total, trainable and non-trainable parameter counts.

For the two-layer model, the total parameter count is 2,757,761. The trainable parameter count is the same, so there are no non-trainable parameters.

Reducing the model to a single bidirectional LSTM layer changes the summary. The listed layers are now input_7, embedding_6, bidirectional_12 and dense_6, and the total drops to 2,658,945 parameters.
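The parameter totals quoted here can be checked by hand. The sketch below assumes the usual parameter formulas (an LSTM has four gates, each with input weights, recurrent weights and a bias; a bidirectional wrapper doubles that; a dense layer has one weight per input-output pair plus a bias). The helper names are made up for this illustration and are not Keras functions.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    # A forward and a backward LSTM with separate weights
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

emb = embedding_params(20000, 128)

# A bidirectional LSTM with 64 units outputs 2 * 64 = 128 features,
# so the second LSTM and the classifier both see 128 inputs.
two_layer = emb + bilstm_params(128, 64) + bilstm_params(128, 64) + dense_params(128, 1)
one_layer = emb + bilstm_params(128, 64) + dense_params(128, 1)

# MLP variant: Flatten turns 200 x 128 embeddings into 25,600 features
mlp = emb + dense_params(200 * 128, 64) + dense_params(64, 64) + dense_params(64, 1)

print(two_layer)  # 2757761
print(one_layer)  # 2658945
print(mlp)        # 4202689
```

Most of the parameters in all three models sit in the embedding layer (2,560,000), which is why dropping one LSTM layer changes the total by less than 4%.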

## Load the IMDB movie review sentiment data

In [None]:
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)


25000 Training sequences
25000 Validation sequences
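To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does to a single review with its default settings (`padding="pre"`, `truncating="pre"`, `value=0`). The function `pad_or_truncate` is a hypothetical stand-in written for this note, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence.

    Short sequences are left-padded with `value`; long sequences
    keep only their last `maxlen` items.
    """
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, 5))  # [0, 0, 5, 6, 7]
print(pad_or_truncate(long_review, 5))   # [3, 4, 5, 6, 7]
```

With these defaults a review longer than `maxlen` keeps its final tokens, so the end of the review is what the model sees.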


## Train and evaluate the model

You can use the trained model hosted on [Hugging Face Hub](https://huggingface.co/keras-io/bidirectional-lstm-imdb) and try the demo on [Hugging Face Spaces](https://huggingface.co/spaces/keras-io/bidirectional_lstm_imdb).

In [None]:
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])

# Stop when validation accuracy has not improved for 2 epochs in a row
early_stopping = EarlyStopping(
    monitor="val_accuracy",
    patience=2,
    restore_best_weights=True,
)

history = model.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=3,
    validation_data=(x_val, y_val),
    callbacks=[early_stopping],
)


Epoch 1/3
Epoch 2/3
Epoch 3/3


Default settings: batch size = 32, epochs = 2.

Training accuracy: 0.8248, 0.9230

Validation accuracy: 0.8530, 0.8514

**Task 1**

Looking at the results below, training accuracy increases from the first to the second epoch for both batch size = 16 and batch size = 64. This makes sense, as each epoch lets the model fit the training data better. Validation accuracy, however, doesn't always increase. It dropped from 0.8394 to 0.7962 for batch size = 16, which is a sign of overfitting. It rose from 0.8541 to 0.8657 for batch size = 64, but this increase is very small and doesn't necessarily mean the model is better.

Batch size 16:

training accuracy = 0.8204, 0.8969

validation accuracy = 0.8394, 0.7962

Batch size 64:

training accuracy = 0.8222, 0.9137

validation accuracy = 0.8541, 0.8657

With more epochs, the final training accuracy becomes very high because the model can fit the training data very closely. However, the final validation accuracy is actually lower than in earlier epochs. As we saw in the first part of Task 1, more epochs don't necessarily mean a higher validation score. Each epoch beyond the first can lead to overfitting, so more epochs can actually be a bad thing. One remedy is early stopping, which halts training once validation performance stops improving.

Holding batch size = 32 and epochs = 5:
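The idea behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the patience rule, not the Keras `EarlyStopping` code; the function name and the second validation curve are made up for the example, while the first curve uses the batch size 16 validation accuracies reported above.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` epochs in a row fail to beat
    the best validation score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_scores), best_epoch

# Batch size 16 run: validation accuracy fell in epoch 2
print(early_stopping_epoch([0.8394, 0.7962], patience=2))  # (2, 1)

# Hypothetical longer run: best at epoch 2, stopped at epoch 4
print(early_stopping_epoch([0.85, 0.86, 0.855, 0.85, 0.84], patience=2))  # (4, 2)
```

With `patience=2`, a run needs at least two non-improving epochs after its best one before it stops, so very short runs always complete; the benefit in that case comes from keeping the weights of the best epoch.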

Final training accuracy = 0.9713

Final validation accuracy = 0.8618

The highest training accuracy came from the 5-epoch run with batch size 32, but what really matters is validation accuracy. The best validation accuracies were 0.8635 (batch size 32, third epoch) and 0.8657 (batch size 64, second epoch). Since the two are close, we'll stick with batch size 32 and three epochs.

**Task 2**

With only one layer, we know from class that the model will be less complex. This isn't necessarily a bad thing, because it can lower training time and reduce overfitting if the data is not that complex. Having implemented that, we have the following:

Training accuracy is nearly the same with 1 or 2 layers at batch size 32. With 1 layer, training accuracy was 0.9228 on the third epoch. With 2 layers, it was 0.9272 on the third epoch.

Validation accuracy was 0.8635 with 2 layers and 0.8580 with 1 layer. This difference is too small for us to conclude that 2 layers is better.

**Task 3**

After replacing the LSTM layers with an MLP built from dense layers, both the layers and the parameter count change. The layers are now input_2, embedding_1, flatten_1, dense_1, dense_2 and dense_3. The total is 4,202,689 parameters, which is more than the 2,757,761 parameters of the original two-layer bidirectional LSTM.

The model runs because I added a Flatten layer, which converts each review's 2D embedding output (200 × 128) into a 1D vector of 25,600 values for the Dense layers. Flatten only works here because every input sequence is padded or truncated to the same length. Without it, a Keras Dense layer would still accept the 3D tensor, but it would be applied to the last axis at each position separately, so the model would output one prediction per token position rather than one per review.
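As a shape check, the flattening step can be mimicked with NumPy. This only illustrates the reshape, using a made-up batch of 4 reviews filled with zeros; it does not call Keras.

```python
import numpy as np

batch_size, maxlen, embed_dim = 4, 200, 128

# Stand-in for the embedding output: one 128-d vector per token
embedded = np.zeros((batch_size, maxlen, embed_dim), dtype="float32")

# Flatten keeps the batch axis and merges all remaining axes
flattened = embedded.reshape(batch_size, -1)

print(embedded.shape)   # (4, 200, 128)
print(flattened.shape)  # (4, 25600)

# The first Dense(64) layer then needs 25600 * 64 weights plus 64 biases
first_dense_params = flattened.shape[1] * 64 + 64
print(first_dense_params)  # 1638464
```

The 1,638,464 figure matches the first dense layer in the summary above, and it explains why the MLP ends up with more parameters than the LSTM models even though it has no recurrent layers.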






**Wrap up**

To find the best hyperparameters and layer setup, I compared the MLP to the original LSTM models. Looking only at the MLP, early stopping combined with a batch size of 64 performed slightly better than with a batch size of 32. However, the validation accuracies are too close for us to say one is better than the other.

Trial 1: early stopping LSTM 2 layer, batch size 32

accuracy: 0.8278

val_accuracy: 0.8686

Trial 2: early stopping LSTM 2 layer, batch size 64

accuracy: 0.8302

val_accuracy: 0.8631

Trial 3: early stopping MLP, batch size 32

accuracy: 0.8145

val_accuracy: 0.8671

Trial 4: early stopping MLP, batch size 64

accuracy: 0.7888

val_accuracy: 0.8705

**Summary**

To summarize what we've learned, I organized the performance of different models and hyperparameters:

**Batch Sizes and Epochs:**
- A batch size of 32 generally provided a good balance between training speed and model performance.
- A batch size of 16 showed signs of slower learning and potentially noisier updates.
- Larger batch sizes (64) did not consistently improve validation performance and could lead to faster overfitting.
- Training for more epochs (e.g. 5) without regularization led to overfitting.

**Model Layout:**
- Two-layer bidirectional LSTM: can capture complex patterns in sequence data, but tended to overfit, as shown by the divergence of training and validation accuracy.
- Single-layer bidirectional LSTM: comparable validation accuracy, with less overfitting than the two-layer model.
- MLP with dense layers: reached near-perfect training accuracy, but without a matching improvement in validation accuracy, which suggests overfitting.

**Things that could be improved:**
- Adding regularization such as dropout or L1/L2 penalties to address overfitting.
  - However, after adding dropout the validation accuracy didn't improve by much, so it might take a combination of regularization techniques to do better.
- Adding early stopping helped find the best validation accuracy faster, but the validation accuracy itself didn't improve.
- Changing the learning rates could improve validation accuracy.
- Data preprocessing could improve model generalization.
- Cross-validation could better show model performance.
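For the cross-validation suggestion in the list above, a minimal sketch of a k-fold split is shown below. It only generates index arrays with NumPy; the function name, the choice of `k=5` and the seed are assumptions for illustration, and training a fresh model on each fold is left out.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        # Every fold except the i-th goes into the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 25,000 matches the size of the IMDB training set loaded earlier
splits = list(k_fold_indices(25000, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(fold, len(train_idx), len(val_idx))
```

Averaging validation accuracy over the folds would give a steadier estimate than a single train/validation split, at the cost of training the model k times.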

**Best Performing Trials:**
- The two-layer LSTM with early stopping and a batch size of 32 had the best initial validation accuracy, but showed overfitting by the third epoch.
- MLP trained very fast and had high training accuracies but showed significant overfitting, as seen by the large difference between training and validation accuracy.

**Organized Data:**

| Model                          | Batch Size | Epochs | Max Training Accuracy | Max Validation Accuracy | Notes                                 |
|--------------------------------|------------|--------|----------------------|-------------------------|---------------------------------------|
| Default LSTM (2 layers)        | 32         | 2      | 92.30%               | 85.30%                  | Initial good val_accuracy             |
| Default LSTM (2 layers)        | 16         | 2      | 89.69%               | 83.94%                  | Noisy updates, lower val_accuracy     |
| Default LSTM (2 layers)        | 64         | 2      | 91.37%               | 86.57%                  | Best initial val_accuracy             |
| LSTM (2 layers, extended)      | 32         | 5      | 97.13%               | 86.72%                  | Overfitting after 2 epochs            |
| Single-layer LSTM              | 32         | 3      | 92.28%               | 86.40%                  | Comparable to two layers              |
| MLP (2 dense layers)           | 32         | 3      | 99.70%               | 86.45%                  | Significant overfitting               |
| Early Stopping LSTM (2 layers) | 32         | 3      | 95.55%               | 86.86%                  | Early stopping helps                  |
| Early Stopping LSTM (2 layers) | 64         | 3      | 96.10%               | 86.31%                  | Slightly lower val_accuracy           |
| Early Stopping MLP             | 32         | 3      | 99.74%               | 86.71%                  | Early stopping, still overfitting     |
| Early Stopping MLP             | 64         | 3      | 99.91%               | 87.05%                  | Highest initial val_accuracy, overfit |