# Lecture 71: Sequence Models in NLP

This notebook introduces **sequence models** in Natural Language Processing (NLP), focusing on **Recurrent Neural Networks (RNNs)** and **Long Short-Term Memory (LSTM)** networks. We'll use these models for a binary text classification task (sentiment analysis) on the IMDb movie review dataset. The notebook covers:

- Understanding RNNs and LSTMs for sequential data
- Loading and preprocessing the IMDb dataset
- Building and training a simple RNN model
- Building and training an LSTM model
- Comparing model performance
- Visualizing training results

RNNs process a sequence one step at a time, carrying a hidden state forward from each step to the next. LSTMs add gated memory cells that mitigate the vanishing gradient problem, which makes them better suited to long sequences.

## Setup and Imports

Let's import the necessary libraries and set up the environment for reproducibility.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Dense
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

## Understanding Sequence Models

- **RNNs (Recurrent Neural Networks)**:
  - Process sequences by maintaining a hidden state passed from one time step to the next.
  - Pros: Suitable for sequential data like text or time series.
  - Cons: Suffers from vanishing/exploding gradients, struggles with long-term dependencies.
- **LSTMs (Long Short-Term Memory)**:
  - An advanced RNN variant with memory cells and gates (input, forget, output).
  - Pros: Captures long-term dependencies, mitigates vanishing gradient problem.
  - Cons: More computationally expensive than simple RNNs.

We'll use both for sentiment analysis to compare their performance.
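To make the "hidden state passed from one time step to the next" concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The function name `simple_rnn_forward`, the weight names, and the toy dimensions are illustrative assumptions, not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over one sequence and return the final hidden state.

    inputs: array of shape (timesteps, input_dim)
    W_x:    input-to-hidden weights, shape (input_dim, units)
    W_h:    hidden-to-hidden weights, shape (units, units)
    b:      bias, shape (units,)
    """
    h = np.zeros(W_h.shape[0])      # initial hidden state
    for x_t in inputs:              # one update per time step
        # The new state mixes the current input with the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

# Toy dimensions chosen only for illustration
rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 3, 4
inputs = rng.normal(size=(timesteps, input_dim))
W_x = rng.normal(size=(input_dim, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

final_state = simple_rnn_forward(inputs, W_x, W_h, b)
print(final_state.shape)  # (4,)
```

Because the same `W_h` is applied at every step, gradients flowing back through many steps are multiplied by it repeatedly, which is where vanishing or exploding gradients come from.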

## Loading and Preprocessing the IMDb Dataset

The IMDb dataset contains 50,000 movie reviews (25,000 train, 25,000 test) labeled as positive (1) or negative (0). We'll preprocess the data by limiting the vocabulary size, padding sequences to a fixed length, and preparing it for the models.
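Before running the real preprocessing, it helps to see what padding does on a toy input. The sketch below mimics the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`) in plain NumPy; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                  # drop tokens from the front if too long
        if trimmed:
            out[row, -len(trimmed):] = trimmed   # right-align, padding on the left
    return out

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
# [[0 0 5 8]
#  [3 4 5 6]]
```

With these defaults, a review longer than `maxlen` loses its beginning rather than its end, which is why a decoded sample can start mid-sentence.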

In [2]:
# Parameters
max_features = 10000  # Number of words to consider (top 10,000 most frequent)
maxlen = 200         # Maximum sequence length

# Load IMDb dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad short reviews with leading zeros and truncate long ones (keeping the last `maxlen` tokens)
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

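# Example of a preprocessed review
# Indices 0-2 are reserved (padding, start-of-sequence, out-of-vocabulary), so word ranks are offset by 3
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
# Skip padding (index 0); reserved and unknown tokens are shown as '?'
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in X_train[0] if i != 0])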
print("\nSample Decoded Review:")
print(decoded_review[:100] + "...")

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 1s 0us/step
Training data shape: (25000, 200)
Test data shape: (25000, 200)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 1s 1us/step

Sample Decoded Review:
and you could just imagine being there robert ? is an amazing actor and now the same being director ...
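The `- 3` in the decoding step comes from the dataset's index convention. With the default `imdb.load_data` settings, 0 is padding, 1 marks the start of a review, 2 replaces out-of-vocabulary words, and real words start at rank + 3. The sketch below illustrates this with a tiny made-up vocabulary; `toy_word_index`, `SPECIAL`, and `decode` are hypothetical names used only here.

```python
# Hypothetical rank-based vocabulary, shaped like imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "great": 3}

INDEX_FROM = 3  # offset applied to word ranks when the dataset is encoded
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # reserved indices

# Map encoded ids (rank + offset) back to words
id_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}

def decode(encoded):
    """Turn a list of encoded ids back into a readable string."""
    words = []
    for i in encoded:
        if i in SPECIAL:
            words.append(SPECIAL[i])
        else:
            words.append(id_to_word.get(i, "?"))
    return " ".join(words)

encoded = [0, 0, 1, 4, 5, 2, 6]
print(decode(encoded))  # <pad> <pad> <start> the movie <unk> great
```

In the sample review above, the `?` after "robert" is one of these reserved tokens, most likely a word outside the top 10,000.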


## Building and Training a Simple RNN Model

We'll create a simple RNN model with an embedding layer to convert words into dense vectors, a SimpleRNN layer for sequence processing, and a dense output layer for binary classification.
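As a sanity check on what `summary()` should report, the parameter counts can be worked out by hand from the layer sizes used in this notebook (vocabulary 10,000, embedding size 128, 64 recurrent units). The variable names below are just for this calculation.

```python
vocab_size, embed_dim, units = 10000, 128, 64

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias, per unit
rnn_params = units * (embed_dim + units + 1)

# Dense(1): one weight per RNN unit plus a bias
dense_params = units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 1280000 12352 65 1292417
```

Almost all of the parameters sit in the embedding table, so the recurrent layer itself is comparatively small.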

In [3]:
# Build RNN model
rnn_model = Sequential([
    keras.Input(shape=(maxlen,)),  # declare the input shape up front (`input_length` is deprecated in recent Keras)
    Embedding(max_features, 128),
    SimpleRNN(64, return_sequences=False),
    Dense(1, activation='sigmoid')
])

# Compile the model
rnn_model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Model summary
rnn_model.summary()

# Train the model
rnn_history = rnn_model.fit(X_train, y_train,
                            epochs=5,
                            batch_size=128,
                            validation_split=0.2,
                            verbose=1)



Epoch 1/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 8s 32ms/step - accuracy: 0.5784 - loss: 0.6555 - val_accuracy: 0.7580 - val_loss: 0.5052
Epoch 2/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - accuracy: 0.8311 - loss: 0.3849 - val_accuracy: 0.8302 - val_loss: 0.4064
Epoch 3/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 5s 19ms/step - accuracy: 0.9208 - loss: 0.2050 - val_accuracy: 0.7432 - val_loss: 0.5806
Epoch 4/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - accuracy: 0.9554 - loss: 0.1331 - val_accuracy: 0.7934 - val_loss: 0.5827
Epoch 5/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 5s 20ms/step - accuracy: 0.9833 - loss: 0.0572 - val_accuracy: 0.7826 - val_loss: 0.6885
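One way to read a log like this is to look for the epoch with the lowest validation loss and the gap between training and validation accuracy at the end. The sketch below does that with the values copied from the RNN log above; `summarize` is a hypothetical helper, and the same dictionary layout is what `rnn_history.history` provides.

```python
# Values copied from the RNN training log above
rnn_log = {
    "accuracy":     [0.5784, 0.8311, 0.9208, 0.9554, 0.9833],
    "val_accuracy": [0.7580, 0.8302, 0.7432, 0.7934, 0.7826],
    "loss":         [0.6555, 0.3849, 0.2050, 0.1331, 0.0572],
    "val_loss":     [0.5052, 0.4064, 0.5806, 0.5827, 0.6885],
}

def summarize(history):
    """Return the best epoch by validation loss and the final train/val accuracy gap."""
    val_loss = history["val_loss"]
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = history["accuracy"][-1] - history["val_accuracy"][-1]
    return {
        "best_epoch": best_idx + 1,          # epochs are reported 1-based
        "best_val_loss": val_loss[best_idx],
        "final_gap": gap,
    }

summary = summarize(rnn_log)
print(summary["best_epoch"], summary["best_val_loss"])  # 2 0.4064
print(f"{summary['final_gap']:.4f}")                    # 0.2007
```

Validation loss is lowest at epoch 2 and rises afterwards while training accuracy keeps climbing, which is the usual signature of overfitting.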


## Building and Training an LSTM Model

Next, we'll create an LSTM model with a similar architecture but using an LSTM layer instead of SimpleRNN to better handle long-term dependencies.
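To see what the gates actually do, here is a minimal NumPy sketch of a single LSTM time step, using the same sizes as the model below (128-dimensional inputs, 64 units). The function names and the gate ordering inside the stacked weight matrices are assumptions made for illustration; this is not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: input weights, shape (input_dim, 4 * units)
    U: recurrent weights, shape (units, 4 * units)
    b: bias, shape (4 * units,)
    The four blocks are taken as input gate, forget gate, candidate, output gate.
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates take values in (0, 1)
    g = np.tanh(g)                                  # candidate cell values
    c = f * c_prev + i * g                          # keep part of the old memory, add new
    h = o * np.tanh(c)                              # expose part of the memory as output
    return h, c

input_dim, units = 128, 64
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units), W, U, b)
n_params = W.size + U.size + b.size
print(h.shape, n_params)  # (64,) 49408
```

The cell state `c` is updated additively rather than being squashed through a nonlinearity at every step, which is what helps gradients survive over long sequences. The price is four times as many recurrent-layer parameters as the SimpleRNN with the same number of units.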

In [None]:
# Build LSTM model
lstm_model = Sequential([
    keras.Input(shape=(maxlen,)),  # declare the input shape up front (`input_length` is deprecated in recent Keras)
    Embedding(max_features, 128),
    LSTM(64, return_sequences=False),
    Dense(1, activation='sigmoid')
])

# Compile the model
lstm_model.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

# Model summary
lstm_model.summary()

# Train the model
lstm_history = lstm_model.fit(X_train, y_train,
                              epochs=5,
                              batch_size=128,
                              validation_split=0.2,
                              verbose=1)

Epoch 1/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 7s 16ms/step - accuracy: 0.6978 - loss: 0.5464 - val_accuracy: 0.8532 - val_loss: 0.3437
Epoch 2/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.8968 - loss: 0.2623 - val_accuracy: 0.8532 - val_loss: 0.3595
Epoch 3/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.9217 - loss: 0.2065 - val_accuracy: 0.8470 - val_loss: 0.4393
Epoch 4/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.9248 - loss: 0.1963 - val_accuracy: 0.8322 - val_loss: 0.3826
Epoch 5/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - accuracy: 0.9461 - loss: 0.1508 - val_accuracy: 0.8614 - val_loss: 0.3753


## Evaluating and Comparing Model Performance

We'll evaluate both models on the test set and visualize their training and validation accuracy/loss to compare performance.
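The two numbers reported by `evaluate` are the binary cross-entropy loss and the accuracy at a 0.5 threshold. The NumPy sketch below shows how each is computed on a handful of made-up probabilities; the function names and the sample values are hypothetical and only meant to illustrate the formulas.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

# Made-up labels and sigmoid outputs for four reviews
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print(f"loss={loss:.4f} accuracy={acc:.2f}")  # loss=0.5403 accuracy=0.50
```

Accuracy only counts which side of 0.5 a prediction falls on, while the loss also penalizes confident mistakes, so the two can move in different directions from one epoch to the next.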

In [None]:
# Evaluate RNN model
rnn_test_loss, rnn_test_accuracy = rnn_model.evaluate(X_test, y_test, verbose=0)
print(f"RNN Test Accuracy: {rnn_test_accuracy:.4f}")
print(f"RNN Test Loss: {rnn_test_loss:.4f}")

# Evaluate LSTM model
lstm_test_loss, lstm_test_accuracy = lstm_model.evaluate(X_test, y_test, verbose=0)
print(f"LSTM Test Accuracy: {lstm_test_accuracy:.4f}")
print(f"LSTM Test Loss: {lstm_test_loss:.4f}")

# Plot training history
plt.figure(figsize=(12, 8))

# Plot accuracy
plt.subplot(2, 2, 1)
plt.plot(rnn_history.history['accuracy'], label='RNN Training Accuracy')
plt.plot(rnn_history.history['val_accuracy'], label='RNN Validation Accuracy')
plt.title('RNN Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(2, 2, 2)
plt.plot(lstm_history.history['accuracy'], label='LSTM Training Accuracy')
plt.plot(lstm_history.history['val_accuracy'], label='LSTM Validation Accuracy')
plt.title('LSTM Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

# Plot loss
plt.subplot(2, 2, 3)
plt.plot(rnn_history.history['loss'], label='RNN Training Loss')
plt.plot(rnn_history.history['val_loss'], label='RNN Validation Loss')
plt.title('RNN Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 2, 4)
plt.plot(lstm_history.history['loss'], label='LSTM Training Loss')
plt.plot(lstm_history.history['val_loss'], label='LSTM Validation Loss')
plt.title('LSTM Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.savefig('sequence_models_history.png')

## Making Predictions

Let's make predictions on a few test reviews to see how the LSTM model classifies sentiment. We use the LSTM because it reached the higher validation accuracy in the training logs above (0.8614 vs. 0.7826 after epoch 5).

In [None]:
# Select a few test reviews
num_samples = 5
sample_indices = np.random.choice(X_test.shape[0], num_samples, replace=False)
sample_reviews = X_test[sample_indices]
sample_labels = y_test[sample_indices]

# Predict with LSTM model
predictions = lstm_model.predict(sample_reviews)
predicted_classes = (predictions > 0.5).astype(int).flatten()

# Decode reviews for readability
print("\nSample Predictions (LSTM Model):")
for i, idx in enumerate(sample_indices):
    # Skip padding (index 0) so short reviews don't print as a run of '?'
    decoded_review = ' '.join([reverse_word_index.get(word - 3, '?') for word in sample_reviews[i] if word != 0])
    print(f"\nReview {i+1}: {decoded_review[:100]}...")
    print(f"True Sentiment: {'Positive' if sample_labels[i] == 1 else 'Negative'}")
    print(f"Predicted Sentiment: {'Positive' if predicted_classes[i] == 1 else 'Negative'}")
    print(f"Prediction Probability: {predictions[i][0]:.4f}")
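To classify a brand-new review, raw text first has to be turned into the same padded integer format the model was trained on. The sketch below shows one way to do that, assuming a simple lowercase regex tokenizer (which may not match the dataset's original tokenization exactly) and a tiny made-up vocabulary. `encode_review` and `toy_word_index` are hypothetical; in practice you would pass the real `word_index` from `imdb.get_word_index()`.

```python
import re
import numpy as np

def encode_review(text, word_index, max_features=10000, maxlen=200, index_from=3):
    """Encode raw text as a (1, maxlen) array using the IMDb index convention."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())   # assumed tokenizer
    ids = [1]                                          # 1 = start-of-sequence marker
    for tok in tokens:
        rank = word_index.get(tok)
        idx = rank + index_from if rank is not None else 2   # 2 = out-of-vocabulary
        ids.append(idx if idx < max_features else 2)
    ids = ids[-maxlen:]                                # keep the last maxlen tokens
    padded = np.zeros((1, maxlen), dtype=np.int32)     # 0 = padding
    padded[0, -len(ids):] = ids                        # left-pad, as in training
    return padded

toy_word_index = {"the": 1, "movie": 2, "great": 3}    # made-up vocabulary
x = encode_review("The movie was GREAT!", toy_word_index, maxlen=8)
print(x)  # [[0 0 0 1 4 5 2 6]]
```

With the real vocabulary and `maxlen=200`, the resulting array could be passed to `lstm_model.predict` in the same way as the test reviews above.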

## Explanation

- **Sequence Models**:
  - **RNNs**: Process sequences but struggle with long-term dependencies due to vanishing gradients.
  - **LSTMs**: Use memory cells and gates to retain long-term information, making them more effective for NLP tasks.
- **Dataset**: IMDb dataset with 50,000 reviews, preprocessed to limit vocabulary and pad sequences.
- **Models**:
  - RNN: Simple architecture with embedding and SimpleRNN layers.
  - LSTM: Similar architecture but with an LSTM layer for better sequence modeling.
- **Training**: Both models trained for 5 epochs with Adam optimizer and binary crossentropy loss.
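- **Evaluation**: Compared test accuracy and loss, and visualized training history to assess overfitting. In the logged run, the RNN's training accuracy climbed to 0.98 while its validation accuracy stayed near 0.78, whereas the LSTM finished at 0.86 validation accuracy.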
- **Predictions**: Demonstrated LSTM predictions on sample reviews, showing practical application.

To extend this work, consider:
- Using bidirectional LSTMs or GRUs for improved performance
- Adding dropout or regularization to prevent overfitting
- Incorporating pre-trained embeddings (e.g., GloVe, Word2Vec)
- Exploring attention mechanisms or transformers for advanced sequence modeling
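As a starting point for the first two suggestions, here is a minimal sketch of a bidirectional LSTM with dropout. It reuses the vocabulary size and sequence length from this notebook; the dropout rate of 0.5 and the name `bi_lstm_model` are assumptions, and the model has not been trained or tuned here.

```python
from tensorflow import keras
from tensorflow.keras import layers

max_features = 10000  # same vocabulary size as above
maxlen = 200          # same sequence length as above

bi_lstm_model = keras.Sequential([
    keras.Input(shape=(maxlen,)),
    layers.Embedding(max_features, 128),
    # Reads the sequence in both directions and concatenates the two final states
    layers.Bidirectional(layers.LSTM(64)),
    # Randomly zeroes half of the features during training to reduce overfitting
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])

bi_lstm_model.compile(optimizer="adam",
                      loss="binary_crossentropy",
                      metrics=["accuracy"])

print(bi_lstm_model.output_shape)  # (None, 1)
```

It can be trained with the same `fit` call used for the other two models. Whether it improves on the plain LSTM for this dataset is something to check empirically.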