<a href="https://colab.research.google.com/github/alvitohawari/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-TensorFlow/blob/main/Chapter_16_nlp_with_rnns_and_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 16 (NLP, RNNs, Attention) — Explanation & Implementation Notebook

This notebook accompanies **Chapter 16: NLP with RNNs & Attention** and implements its key ideas:
- IMDb sentiment classification with **Embedding → (Bi)GRU → Dense**
- **Masking** padded tokens
- **Attention pooling** for sequence summarization
- **Multi-Head Self-Attention** classifier
- A small **Transformer encoder** block with positional embeddings

> Designed to be GitHub-friendly: clean structure, runnable cells, and clear comments.


## 0) Setup

We will use TensorFlow/Keras and the IMDb dataset included in Keras.  
The dataset is already tokenized into integer word IDs. We'll pad/truncate sequences so we can batch them.

**Key idea:** Reviews are variable-length sequences → padding makes them the same length, and **masking** ensures the model ignores `<pad>` tokens.


In [None]:
import random

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

print("TensorFlow:", tf.__version__)


## 1) Load and prepare IMDb

We limit the vocabulary (`max_features`) and pad sequences to a fixed length (`maxlen`).
- `0` is reserved for padding.
- Setting `mask_zero=True` in the Embedding layer will create and propagate a mask (`token_id != 0`).
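
To make the padding and masking conventions concrete, here is a minimal pure-Python sketch. The helper `pad_pre` and the word IDs are hypothetical; the helper only mimics the default behaviour of `pad_sequences` (`padding="pre"`, `truncating="pre"`) and is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: left-pad with `value`, or keep only the last `maxlen` ids."""
    seq = list(seq)[-maxlen:]                    # "pre" truncation drops the earliest tokens
    return [value] * (maxlen - len(seq)) + seq   # "pre" padding fills on the left

short_review = [1, 14, 22]          # made-up word ids, shorter than maxlen
long_review = [1, 5, 6, 7, 8, 9]    # made-up word ids, longer than maxlen

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)

# The mask derived by mask_zero=True is simply token_id != 0
mask_short = [token_id != 0 for token_id in padded_short]

print(padded_short)  # [0, 0, 1, 14, 22]
print(padded_long)   # [5, 6, 7, 8, 9]
print(mask_short)    # [False, False, True, True, True]
```

Note that with these defaults a long review keeps its last `maxlen` tokens, not its first.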


In [None]:
max_features = 10_000   # vocabulary size
maxlen = 200           # sequence length after padding/truncation

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features)

# Defaults: pad with 0 at the front ("pre") and truncate from the front ("pre")
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_test  = keras.utils.pad_sequences(x_test,  maxlen=maxlen)

print("Train:", x_train.shape, y_train.shape)
print("Test :", x_test.shape,  y_test.shape)


## 2) Training utility

To keep this notebook lightweight, we train for a single epoch by default.
For better accuracy, increase `EPOCHS`.


In [None]:
EPOCHS = 1          # change to 3–10 for better accuracy
BATCH_SIZE = 128

def compile_and_train(model, name, epochs=EPOCHS):
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"]
    )
    print("\n" + "="*80)
    print("Model:", name)
    model.summary()
    history = model.fit(
        x_train, y_train,
        validation_split=0.2,
        epochs=epochs,
        batch_size=BATCH_SIZE,
        verbose=2
    )
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test accuracy: {test_acc:.4f}")
    return history, test_acc


## 3) Baseline RNN model: Embedding → GRU → Dense

**Concept mapping (Chapter 16):**
- **Embedding** converts word IDs into dense vectors.
- **GRU** reads the sequence and learns temporal dependencies.
- A final **sigmoid** unit outputs the probability of positive sentiment.

We enable masking with `mask_zero=True` so the RNN ignores padding tokens.
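
One way to picture what "ignores padding tokens" means for a recurrent layer: at a masked time step, the previous state is carried forward unchanged. The sketch below illustrates this with a toy scalar recurrence. It is a stand-in, not a GRU, and the weights `w`, `u` and all input values are made up for illustration.

```python
import math

def run_masked_rnn(inputs, mask, w=0.5, u=0.8):
    """Toy recurrence h_t = tanh(w * x_t + u * h_{t-1}); w and u are made-up weights."""
    h = 0.0
    for x_t, keep in zip(inputs, mask):
        if keep:
            h = math.tanh(w * x_t + u * h)
        # masked (padding) step: h is carried forward unchanged
    return h

real_values = [0.3, -1.2, 0.7]        # stand-ins for embedded tokens
pad_value = 0.9                       # the embedding of id 0 is learned, so generally non-zero
padded_values = [pad_value, pad_value] + real_values
padded_mask = [False, False, True, True, True]

h_unpadded = run_masked_rnn(real_values, [True] * 3)
h_masked = run_masked_rnn(padded_values, padded_mask)
h_unmasked = run_masked_rnn(padded_values, [True] * 5)

print(h_unpadded == h_masked)    # True: masked padding leaves the final state untouched
print(h_unpadded == h_unmasked)  # False: without the mask, padding leaks into the state
```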


In [None]:
def build_gru_baseline():
    inputs = keras.Input(shape=(maxlen,), dtype="int32")
    x = layers.Embedding(max_features, 128, mask_zero=True)(inputs)
    x = layers.GRU(64)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name="gru_baseline")

gru_baseline = build_gru_baseline()
_ = compile_and_train(gru_baseline, "GRU baseline")


## 4) Bidirectional GRU: reading forward + backward

**Concept mapping (Chapter 16):** Bidirectional RNNs can use both past and future context in the sequence.
This often helps NLP classification tasks because the meaning of a word can depend on what comes after it.
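
As a rough illustration of the idea, the sketch below runs a toy scalar recurrence (a stand-in for a GRU, with made-up weights and inputs) once left to right and once right to left, then concatenates the two final states. Concatenation is the default `merge_mode` of `Bidirectional`, which is why wrapping `GRU(64)` yields 128 features.

```python
import math

def last_state(inputs, w=0.5, u=0.8):
    """Toy recurrence standing in for a GRU; returns the final state."""
    h = 0.0
    for x_t in inputs:
        h = math.tanh(w * x_t + u * h)
    return h

sequence = [0.3, -1.2, 0.7]                   # stand-ins for embedded tokens

forward_state = last_state(sequence)          # reads left to right
backward_state = last_state(sequence[::-1])   # reads right to left

# Concatenating both summaries doubles the feature size (here 1 + 1 = 2).
summary = [forward_state, backward_state]
print(len(summary))  # 2
```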


In [None]:
def build_bigru():
    inputs = keras.Input(shape=(maxlen,), dtype="int32")
    x = layers.Embedding(max_features, 128, mask_zero=True)(inputs)
    x = layers.Bidirectional(layers.GRU(64))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name="bigru")

bigru = build_bigru()
_ = compile_and_train(bigru, "Bidirectional GRU")


## 5) Attention pooling: learn which time steps matter

Instead of compressing the whole sequence using the final hidden state, we can learn **attention weights** over time steps:
1. Compute a score for each time step.
2. Softmax over time → weights sum to 1.
3. Weighted sum of hidden states → a single vector representation.

This is a simple "attention for pooling" mechanism for classification.
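
The three steps above can be traced by hand on a tiny example. This is a pure-Python sketch with made-up hidden states and scores; in the notebook's layer the scores come from a learned `Dense(1)`, which is skipped here.

```python
import math

def attention_pool(hidden_states, scores, mask):
    """Masked softmax over time, then a weighted sum of the hidden states."""
    # Step 1 is assumed done: `scores` holds one scalar per time step.
    # Padded steps get a large negative offset so softmax drives them to ~0.
    masked = [s + (0.0 if keep else -1e9) for s, keep in zip(scores, mask)]
    # Step 2: softmax over time (subtract the max for numerical stability).
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of hidden states, one feature at a time.
    n_features = len(hidden_states[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden_states))
              for j in range(n_features)]
    return weights, pooled

hidden_states = [[9.0, 9.0], [1.0, 0.0], [0.0, 1.0]]  # (time=3, features=2)
scores = [5.0, 2.0, 2.0]                               # step 0 is padding
mask = [False, True, True]

weights, pooled = attention_pool(hidden_states, scores, mask)
print(weights)  # [0.0, 0.5, 0.5]
print(pooled)   # [0.5, 0.5]
```

Even though the padded step has the highest raw score, it receives zero weight, so its hidden state does not reach the pooled vector.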


In [None]:
class AttentionPooling(layers.Layer):
    """Simple attention pooling over time steps for classification."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score_dense = layers.Dense(1)

    def call(self, inputs, mask=None):
        # inputs: (batch, time, features)
        scores = self.score_dense(inputs)                 # (batch, time, 1)

        if mask is not None:
            # mask: (batch, time) boolean; cast to float and add a large negative
            # value (-1e9) to padded scores so softmax gives them ~0 weight
            mask = tf.cast(mask, tf.float32)
            scores += (1.0 - tf.expand_dims(mask, -1)) * (-1e9)

        weights = tf.nn.softmax(scores, axis=1)           # (batch, time, 1)
        return tf.reduce_sum(weights * inputs, axis=1)    # (batch, features)

def build_bigru_attention_pool():
    inputs = keras.Input(shape=(maxlen,), dtype="int32")
    x = layers.Embedding(max_features, 128, mask_zero=True)(inputs)
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    x = AttentionPooling()(x)  # uses the propagated mask from Embedding
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name="bigru_attention_pool")

bigru_attpool = build_bigru_attention_pool()
_ = compile_and_train(bigru_attpool, "BiGRU + AttentionPooling")


## 6) Multi-Head Self-Attention classifier

**Self-attention** lets every token attend to every other token.
To avoid attending to padding, we build an explicit `attention_mask`.

Pipeline:
- Embedding (no mask propagation needed here; we use attention_mask)
- MultiHeadAttention (self-attention)
- Residual + LayerNorm
- GlobalAveragePooling (no mask reaches this layer, so it averages over all positions, padding included)
- Dense classifier
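
To show what the attention mask does, here is a minimal single-head sketch in pure Python. It assumes `Q = K = x` with no learned projections and uses made-up embeddings, so it illustrates the masking idea only, not the internals of Keras `MultiHeadAttention`.

```python
import math

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(x, key_mask):
    """Scaled dot-product attention weights with Q = K = x (no projections)."""
    scale = math.sqrt(len(x[0]))
    rows = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        # The same key mask is applied to every query row, i.e. it broadcasts
        # like a (1, key_len) mask over the query axis.
        scores = [s if keep else -1e9 for s, keep in zip(scores, key_mask)]
        rows.append(softmax(scores))
    return rows

token_ids = [0, 7, 3]                                  # made-up ids; 0 is padding
x = [[0.2, 0.1], [1.0, 0.0], [0.0, 1.0]]               # made-up embeddings (time=3, dim=2)
key_mask = [token_id != 0 for token_id in token_ids]   # [False, True, True]

weights = self_attention_weights(x, key_mask)
for row in weights:
    print([round(w, 3) for w in row])  # first column (the padding key) is 0.0 in every row
```

The padded position still produces an output row as a query; the mask only stops other positions from reading it as a key.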


In [None]:
def build_mha_classifier(embed_dim=128, num_heads=4, ff_dim=128, dropout=0.1):
    inputs = keras.Input(shape=(maxlen,), dtype="int32")

    # Token embedding (we'll handle masking manually for attention)
    x = layers.Embedding(max_features, embed_dim)(inputs)

    # Build attention mask: True for real tokens, False for padding.
    # Keras MHA expects a shape broadcastable to (batch, query_len, key_len).
    # Wrapped in a Lambda layer because Keras 3 rejects raw tf ops on symbolic inputs.
    attn_mask = layers.Lambda(
        lambda ids: tf.expand_dims(tf.not_equal(ids, 0), axis=1)  # (batch, 1, time)
    )(inputs)

    attn_out = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(
        x, x, attention_mask=attn_mask
    )
    x = layers.Add()([x, attn_out])
    x = layers.LayerNormalization()(x)

    # Small feed-forward block
    ff = keras.Sequential([
        layers.Dense(ff_dim, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(embed_dim),
    ])
    ff_out = ff(x)
    x = layers.Add()([x, ff_out])
    x = layers.LayerNormalization()(x)

    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name="mha_classifier")

mha_model = build_mha_classifier()
_ = compile_and_train(mha_model, "Multi-Head Self-Attention classifier")


## 7) Transformer encoder block (mini)

**Concept mapping (Chapter 16):**
- Attention alone has no notion of order → add **positional embeddings**
- Transformer encoder = Multi-Head Attention + Feed-Forward + Residual + LayerNorm

Here we implement:
- Learned positional embeddings
- One encoder block
- Pool + classify
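
The effect of adding positional embeddings can be seen on a toy example. In this sketch both lookup tables hold made-up values; in the notebook they are learned `Embedding` layers.

```python
# Made-up lookup tables standing in for the two learned Embedding layers.
token_table = {7: [1.0, 0.0], 3: [0.0, 1.0]}               # token id -> vector
position_table = [[0.25, 0.25], [0.5, 0.5], [0.75, 0.75]]  # position -> vector

def embed(token_ids):
    """Add the position vector to each token vector, element-wise."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[pos])]
        for pos, token_id in enumerate(token_ids)
    ]

with_positions = embed([7, 3, 7])
print(with_positions[0])  # [1.25, 0.25]
print(with_positions[2])  # [1.75, 0.75]  same token id 7, different position
```

The same token now maps to a different vector at each position, which gives the attention layers a way to distinguish word order.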


In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)
        self.maxlen = maxlen

    def call(self, x):
        positions = tf.range(start=0, limit=self.maxlen, delta=1)
        positions = self.pos_emb(positions)  # (time, embed_dim)
        x = self.token_emb(x)                # (batch, time, embed_dim)
        return x + positions                 # broadcast add

def transformer_encoder(x, attn_mask, embed_dim=128, num_heads=4, ff_dim=256, dropout=0.1):
    attn_out = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(
        x, x, attention_mask=attn_mask
    )
    attn_out = layers.Dropout(dropout)(attn_out)
    x = layers.Add()([x, attn_out])
    x = layers.LayerNormalization()(x)

    ff = keras.Sequential([
        layers.Dense(ff_dim, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(embed_dim),
    ])
    ff_out = ff(x)
    x = layers.Add()([x, ff_out])
    x = layers.LayerNormalization()(x)
    return x

def build_transformer_classifier(embed_dim=128, num_heads=4, ff_dim=256, dropout=0.1):
    inputs = keras.Input(shape=(maxlen,), dtype="int32")
    x = PositionalEmbedding(maxlen, max_features, embed_dim)(inputs)

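    # True for real tokens, False for padding. Wrapped in a Lambda layer
    # because Keras 3 rejects raw tf ops on symbolic inputs.
    attn_mask = layers.Lambda(
        lambda ids: tf.expand_dims(tf.not_equal(ids, 0), axis=1)  # (batch, 1, time)
    )(inputs)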

    x = transformer_encoder(x, attn_mask, embed_dim=embed_dim, num_heads=num_heads, ff_dim=ff_dim, dropout=dropout)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name="transformer_classifier")

transformer_model = build_transformer_classifier()
_ = compile_and_train(transformer_model, "Transformer encoder classifier")


## 8) Summary

What you implemented here corresponds to key ideas in the chapter:

- **RNN-based sentiment classifier:** Embedding → (Bi)GRU → Dense(sigmoid)
- **Masking:** ignore padding tokens (either via `mask_zero=True` or explicit attention masks)
- **Attention pooling:** scores → softmax weights → weighted sum for a single sequence representation
- **Self-attention / Transformer blocks:** Multi-head attention + feed-forward + residual + layer norm + positional information

If you publish this notebook on GitHub, you can add a short README describing:
- dataset used (IMDb)
- model variants compared
- any training results you observe when increasing epochs
