# Chapter 16: Natural Language Processing with RNNs and Attention

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

When Alan Turing imagined his famous Turing test in 1950, his objective was to evaluate a machine's ability to match human intelligence. He chose a linguistic task: devising a chatbot capable of fooling its interlocutor into thinking it was human. This highlights that mastering language is arguably Homo sapiens's greatest cognitive ability.

A common approach for natural language tasks is to use Recurrent Neural Networks (RNNs). In this chapter, we will explore RNNs for NLP, starting with a **Character RNN** trained to predict the next character in a sentence (allowing us to generate Shakespearean text). Then we will move to **Sentiment Analysis** (classifying movie reviews) using word embeddings. Finally, we will tackle **Neural Machine Translation (NMT)** using Encoder-Decoder architectures, Attention Mechanisms, and ultimately the **Transformer** architecture (introduced in the paper "Attention Is All You Need").

## 2. Generating Shakespearean Text using a Character RNN

A Character RNN predicts the next character in a sequence. By feeding the predicted character back into the network, we can generate text.

### Creating the Training Dataset

First, we download the Shakespeare text, tokenize it (map characters to integers), and create a `tf.data.Dataset` of windowed sequences.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

# 1. Download Data
shakespeare_url = "https://homl.info/shakespeare"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

# 2. Tokenize (Char Level)
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

# Encode the full text as integers
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
max_id = len(tokenizer.word_index) # Number of distinct characters
dataset_size = len(encoded)

print(f"Dataset Size: {dataset_size}")
print(f"Distinct Characters: {max_id}")
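
To make the encoding step concrete, here is a minimal pure-Python sketch of character-level tokenization. It is not how `Tokenizer` is implemented. The helper names `build_char_index`, `encode`, and `decode` are hypothetical, the alphabetical tie-break is an arbitrary choice made for determinism, and IDs start at 0 to mirror the `- 1` shift applied above.

```python
from collections import Counter

def build_char_index(text):
    """Map each distinct (lowercased) character to an integer ID, most frequent first."""
    counts = Counter(text.lower())
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: idx for idx, ch in enumerate(ordered)}

def encode(text, char_to_id):
    """Turn a string into a list of character IDs."""
    return [char_to_id[ch] for ch in text.lower()]

def decode(ids, char_to_id):
    """Turn a list of character IDs back into a string."""
    id_to_char = {idx: ch for ch, idx in char_to_id.items()}
    return "".join(id_to_char[idx] for idx in ids)

sample = "First Citizen"
char_to_id = build_char_index(sample)
ids = encode(sample, char_to_id)
print(ids)
print(decode(ids, char_to_id))
```

Because the text is lowercased before encoding, decoding returns the lowercase version of the input.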

### Splitting a Sequential Dataset

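Training an RNN on a single sequence of about 1 million characters is infeasible: backpropagation through time (BPTT) would have to unroll the network over 1 million steps. Instead, we keep the first 90% of the text for training and use the `window()` method to split it into many smaller windows of text.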

In [None]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

n_steps = 100 # Window length
window_length = n_steps + 1 # Target is input shifted by 1

dataset = dataset.window(window_length, shift=1, drop_remainder=True)
# window() returns a dataset of datasets (one nested dataset per window).
# flat_map() with batch(window_length) converts each window into a single tensor.
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# Shuffle and batch the windows, then split each one into inputs (X) and targets (Y)
dataset = dataset.shuffle(10000).batch(32)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

# One-hot encode the inputs and targets
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id),
                              tf.one_hot(Y_batch, depth=max_id)))
dataset = dataset.prefetch(1)
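
The windowing logic above can be hard to picture, so here is a small pure-Python sketch of the same idea on a toy sequence. The function name `make_windows` is hypothetical. It mimics `window(window_length, shift=1, drop_remainder=True)` followed by the input/target split, without shuffling or batching.

```python
def make_windows(sequence, n_steps):
    """Return (inputs, targets) pairs built from overlapping windows of n_steps + 1 items."""
    window_length = n_steps + 1
    pairs = []
    # shift=1: a new window starts at every position; incomplete windows are dropped.
    for start in range(len(sequence) - window_length + 1):
        window = sequence[start:start + window_length]
        # The target is the input shifted one step into the future.
        pairs.append((window[:-1], window[1:]))
    return pairs

toy_sequence = list(range(6))
toy_pairs = make_windows(toy_sequence, n_steps=3)
for inputs, targets in toy_pairs:
    print(inputs, "->", targets)
```

Each target sequence is the matching input sequence shifted by one step, so the model learns to predict the next character at every position of the window.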

### Building and Training the Char-RNN Model

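We stack two GRU layers with 128 units each and 20% dropout on their inputs. Because the model must output a probability distribution over the characters at every time step, the final layer is a time-distributed Dense layer with `softmax` activation. Since the targets are one-hot encoded, the loss is `categorical_crossentropy`.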

In [None]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2), # dropout for inputs
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

model.compile(loss="categorical_crossentropy", optimizer="adam")
# history = model.fit(dataset, epochs=5) # Training takes time

### Generating Fake Shakespeare

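To generate text, we feed the model some seed text, take the predicted probabilities for the next character, sample a character from them (using `tf.random.categorical`), and append it to the text. Repeating these steps extends the text one character at a time. A **temperature** parameter, which divides the log-probabilities before sampling, controls randomness:
* Low temperature (close to 0) -> The model almost always picks the most likely character (safer but more repetitive text).
* High temperature -> All characters become closer to equally likely, so the model picks less likely characters more often (more creative but prone to errors).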
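
The effect of the temperature can be checked numerically. The sketch below is a pure-Python illustration, not part of the chapter's code: the helper `apply_temperature` and the toy distribution are hypothetical. It divides the log-probabilities by the temperature and renormalizes with a softmax, which is the distribution that sampling from the rescaled logits draws from.

```python
import math

def apply_temperature(probas, temperature):
    """Divide the log-probabilities by the temperature, then renormalize with a softmax."""
    scaled = [math.log(p) / temperature for p in probas]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.6, 0.3, 0.1]  # toy next-character distribution
for temperature in (0.2, 1.0, 2.0):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1 leaves the distribution unchanged, values below 1 sharpen it toward the most likely character, and values above 1 flatten it.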

In [None]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :] # Get probas for the last character
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

# print(complete_text("t", temperature=0.2)) # Example usage

### Stateful RNN

A standard RNN is *stateless*: the hidden state is reset to zero at the beginning of every batch. A **Stateful RNN** preserves the final state of a batch and uses it as the initial state for the next batch. This allows the model to learn long-term patterns even if the training sequences are short.

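**Constraint:** Batches must be sequential: each input sequence in a batch must start exactly where the corresponding sequence in the previous batch left off (sequence $i$ of batch $n$ continues sequence $i$ of batch $n-1$). We therefore cannot shuffle the dataset, and the windows must not overlap.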
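
One possible way to satisfy this constraint is sketched below in pure Python. It is an illustration under assumptions, not the chapter's pipeline: the function `stateful_batches` is hypothetical, and it simply cuts the sequence into `batch_size` contiguous chunks and reads non-overlapping windows from each chunk in lockstep.

```python
def stateful_batches(sequence, batch_size, n_steps):
    """Yield (inputs, targets) batches where row i continues row i of the previous batch."""
    chunk_len = len(sequence) // batch_size
    # One contiguous chunk of the sequence per batch row.
    chunks = [sequence[i * chunk_len:(i + 1) * chunk_len] for i in range(batch_size)]
    # Non-overlapping windows: n_steps inputs plus one extra item for the shifted target.
    for start in range(0, chunk_len - n_steps, n_steps):
        windows = [chunk[start:start + n_steps + 1] for chunk in chunks]
        inputs = [window[:-1] for window in windows]
        targets = [window[1:] for window in windows]
        yield inputs, targets

toy_sequence = list(range(20))
toy_batches = list(stateful_batches(toy_sequence, batch_size=2, n_steps=3))
for inputs, targets in toy_batches:
    print(inputs, "->", targets)
```

Row 0 of consecutive batches walks through the first half of the sequence and row 1 walks through the second half, so the state carried over for each row always matches the text that comes next.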

## 3. Sentiment Analysis

We will classify movie reviews from the IMDB dataset as positive or negative.

In [None]:
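# Keep only the 10,000 most frequent words so that every word ID fits the
# Embedding layer (vocab_size) defined later; rarer words are replaced by <unk>.
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)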

# Visualize a review (decoded)
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
print(" ".join([id_to_word[id_] for id_ in X_train[0][:10]]))

### Preprocessing and Embeddings

The reviews have different lengths. We must trim long reviews and pad short ones using `pad_sequences`. Then, we use an **Embedding Layer**. An embedding layer maps each word index to a dense vector (e.g., 128 dimensions). This dense representation captures semantic meaning (e.g., "king" and "queen" are close in vector space).

In [None]:
vocab_size = 10000
embed_size = 128

model_sentiment = keras.models.Sequential([
    keras.layers.Embedding(vocab_size, embed_size, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])

model_sentiment.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
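
The trimming and padding step described above can be sketched in pure Python as follows. The helper `pad_or_trim` is hypothetical. It assumes the default behavior of Keras `pad_sequences`, which pads and truncates at the start of each sequence, and it uses ID 0 as the padding token.

```python
def pad_or_trim(sequences, maxlen, pad_id=0):
    """Left-pad short sequences with pad_id and keep only the last maxlen tokens of long ones."""
    result = []
    for seq in sequences:
        trimmed = seq[-maxlen:]                     # drop tokens from the start if too long
        padding = [pad_id] * (maxlen - len(trimmed))
        result.append(padding + trimmed)            # pad at the start if too short
    return result

toy_reviews = [[5, 8, 2], [7], [4, 9, 1, 3, 6]]
padded = pad_or_trim(toy_reviews, maxlen=4)
print(padded)
```

After this step every review has the same length, so the reviews can be stacked into a single 2D array and fed to the Embedding layer.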

## 4. Encoder-Decoder Network for Neural Machine Translation

For tasks like translation (English to Spanish), the input and output sequences generally have different lengths, so the model cannot simply emit one output per input step. We use an **Encoder-Decoder** architecture.

* **Encoder:** An RNN that reads the input sentence and compresses it into a single context vector (the final hidden state).
* **Decoder:** An RNN that takes the context vector as its initial state and generates the translation step-by-step.

### Bidirectional RNNs
A regular RNN only sees the past. A Bidirectional RNN runs two RNNs: one left-to-right, one right-to-left, and concatenates their outputs. This allows the model to understand the context of a word based on what comes *after* it.
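
The idea can be illustrated with a tiny NumPy sketch. This is a simplified illustration with a basic tanh RNN cell and random weights, not the LSTM used in this chapter. The function `run_rnn` and all variable names are hypothetical.

```python
import numpy as np

def run_rnn(inputs, W_x, W_h):
    """Run a basic tanh RNN over inputs of shape [n_steps, n_inputs]; return all hidden states."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in inputs:
        h = np.tanh(x @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(42)
n_steps, n_inputs, n_units = 5, 3, 4
inputs = rng.normal(size=(n_steps, n_inputs))
# Each direction has its own weights.
fwd_W_x, fwd_W_h = rng.normal(size=(n_inputs, n_units)), rng.normal(size=(n_units, n_units))
bwd_W_x, bwd_W_h = rng.normal(size=(n_inputs, n_units)), rng.normal(size=(n_units, n_units))

fwd_outputs = run_rnn(inputs, fwd_W_x, fwd_W_h)
# Read the sequence right-to-left, then flip the outputs back into the original time order.
bwd_outputs = run_rnn(inputs[::-1], bwd_W_x, bwd_W_h)[::-1]
bi_outputs = np.concatenate([fwd_outputs, bwd_outputs], axis=-1)
print(bi_outputs.shape)
```

At every time step, the first `n_units` values summarize the words seen so far and the last `n_units` values summarize the words still to come.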

In [None]:
# Example architecture only: it is not trained here, since that requires a prepared translation dataset.
# It reuses vocab_size and embed_size from the sentiment analysis model.
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
encoder_embeddings = keras.layers.Embedding(vocab_size, embed_size)(encoder_inputs)

# Bidirectional Encoder
encoder = keras.layers.Bidirectional(
    keras.layers.LSTM(256, return_state=True))
encoder_outputs, state_h_fwd, state_c_fwd, state_h_bwd, state_c_bwd = encoder(encoder_embeddings)

# Concatenate states for the Decoder
state_h = keras.layers.Concatenate()([state_h_fwd, state_h_bwd])
state_c = keras.layers.Concatenate()([state_c_fwd, state_c_bwd])
encoder_states = [state_h, state_c]

decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_embedding = keras.layers.Embedding(vocab_size, embed_size)(decoder_inputs)
decoder_lstm = keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(vocab_size, activation="softmax")
output = decoder_dense(decoder_outputs)

model_nmt = keras.models.Model([encoder_inputs, decoder_inputs], output)
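
The model above takes two inputs, and the notes do not show how the decoder inputs are prepared. A common approach, assumed here rather than taken from the code above, is teacher forcing: during training the decoder receives the target sentence shifted one step to the right, starting with a start-of-sequence token. The token ID `SOS_ID` and the helper `make_decoder_inputs` are hypothetical.

```python
SOS_ID = 1  # hypothetical start-of-sequence token ID

def make_decoder_inputs(target_ids, sos_id=SOS_ID):
    """Shift the target sequence right by one step and prepend the SOS token."""
    return [sos_id] + target_ids[:-1]

target = [12, 7, 45, 3]  # toy target sentence as word IDs
decoder_in = make_decoder_inputs(target)
print(decoder_in, "->", target)
```

At each step the decoder sees the word it should have produced at the previous step and is trained to predict the next one.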

## 5. Attention Mechanisms

The Encoder-Decoder model has a bottleneck: the single context vector must hold the meaning of the entire sentence. For long sentences, this fixed-size vector cannot retain everything, and translation quality degrades.

**Attention** allows the Decoder to look at *all* the Encoder's hidden states (not just the last one) at every step. It computes a weighted sum of these states, focusing (attending) on the words relevant to the current word being translated.

**Bahdanau Attention (Additive):**
$$ e_{(t, i)} = \mathbf{v}^T \tanh(\mathbf{W} \mathbf{s}_{(t-1)} + \mathbf{V} \mathbf{h}_{(i)}) $$
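$$ \alpha_{(t, i)} = \frac{\exp(e_{(t, i)})}{\sum_{i'} \exp(e_{(t, i')})} $$
Here $\mathbf{s}_{(t-1)}$ is the decoder's previous hidden state, $\mathbf{h}_{(i)}$ is the encoder's hidden state at input step $i$, and $\mathbf{W}$, $\mathbf{V}$, and $\mathbf{v}$ are learned parameters. The attention weights $\alpha_{(t, i)}$ are obtained by applying a softmax to the scores $e_{(t, i)}$ over all input steps $i$.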
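
The following NumPy sketch computes these quantities for a single decoder step, along with the weighted sum of encoder states described earlier. It is an illustration with random parameters, not a trained layer. The function `additive_attention`, the dimension names, and the row-vector convention (`s_prev @ W` instead of $\mathbf{W}\mathbf{s}$) are choices made for this example.

```python
import numpy as np

def additive_attention(s_prev, H, W, V, v):
    """Bahdanau-style attention for one decoder step.

    s_prev: previous decoder state, shape [d_dec]
    H:      encoder hidden states, shape [n_steps, d_enc]
    W, V:   projection matrices, shapes [d_dec, d_att] and [d_enc, d_att]
    v:      scoring vector, shape [d_att]
    """
    scores = np.tanh(s_prev @ W + H @ V) @ v      # e_(t,i) for every input step i
    scores = scores - scores.max()                # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # alpha_(t,i), softmax over i
    context = weights @ H                         # weighted sum of the encoder states
    return weights, context

rng = np.random.default_rng(0)
n_steps, d_enc, d_dec, d_att = 6, 4, 5, 3
H = rng.normal(size=(n_steps, d_enc))
s_prev = rng.normal(size=d_dec)
W = rng.normal(size=(d_dec, d_att))
V = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

weights, context = additive_attention(s_prev, H, W, V, v)
print(weights.round(3), weights.sum())
print(context.shape)
```

The weights are non-negative and sum to 1, and the context vector has the same size as one encoder state.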

### Attention Is All You Need: The Transformer

In 2017, the Transformer architecture revolutionized NLP by eliminating RNNs entirely. It relies solely on Attention mechanisms (Self-Attention and Cross-Attention), allowing for massive parallelization.
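
To show why attention parallelizes well, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the building block used by the Transformer. It is a simplified illustration with random weights and no masking or multiple heads. The names `softmax`, `self_attention`, `W_q`, `W_k`, and `W_v` are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over X of shape [n_steps, d_model]."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every position is compared with every other position in one matrix product.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(1)
n_steps, d_model, d_head = 5, 8, 4
X = rng.normal(size=(n_steps, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

outputs, attn_weights = self_attention(X, W_q, W_k, W_v)
print(outputs.shape, attn_weights.shape)
```

Unlike an RNN, no step depends on the result of the previous step, so all positions are processed at once.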

**Positional Encodings:**
Since Transformers have no recurrence, they have no sense of order. We add positional encodings (sine/cosine waves) to the embeddings to give the model information about the position of each word.
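
For reference, the fixed sine/cosine encodings from the Transformer paper can be written as follows, where $p$ is the position of the word in the sentence, $i$ indexes pairs of dimensions, and $d$ is the embedding size (the symbol names are chosen here for illustration):

$$ P_{(p, 2i)} = \sin\left(\frac{p}{10000^{2i/d}}\right), \quad P_{(p, 2i+1)} = \cos\left(\frac{p}{10000^{2i/d}}\right) $$

Even dimensions use the sine and odd dimensions use the cosine, with wavelengths that grow with $i$.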

In [None]:
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))
    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_embedding[:, :shape[1], :shape[2]]

# Multi-Head Attention is available in Keras
# attention = keras.layers.MultiHeadAttention(num_heads=8, key_dim=embed_size)