<a href="https://colab.research.google.com/github/farrelrassya/python-natural-language-Processing-cookbook/blob/main/chapter%2008%20-%20Transformers%20%20/%20Chapter_08_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 8 — Transformers and Their Applications

The **transformer architecture** has fundamentally reshaped NLP. Unlike the feature-engineering pipelines of earlier chapters (bag-of-words, TF-IDF, even static word embeddings), transformers learn contextual representations end-to-end, enabling a single pre-trained model to be adapted to dozens of downstream tasks.

This chapter walks through the practical workflow of using transformers via the Hugging Face ecosystem:

| # | Recipe | Key Concept |
|---|--------|-------------|
| 1 | **Loading a Dataset** | Hugging Face `datasets` library |
| 2 | **Tokenizing Text** | Sub-word tokenization, input IDs, attention masks |
| 3 | **Classification** | Sentiment analysis with a fine-tuned RoBERTa model |
| 4 | **Zero-Shot Classification** | Classify without task-specific training data (BART-MNLI) |
| 5 | **Text Generation** | Autoregressive decoding with GPT-2 |
| 6 | **Language Translation** | Encoder-decoder generation with Google T5 |

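After loading data (recipe 1), we work with **encoder** models (recipes 2--3), then an encoder-decoder model repurposed as a classifier (recipe 4), then the **decoder** side (recipe 5), and finally the full **encoder-decoder** architecture used for generation (recipe 6).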

## 0 — Environment Setup

In [1]:

# 0.1  Install packages

import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"]       = "false"

!pip install -q \
    datasets \
    evaluate \
    transformers \
    accelerate \
    sentencepiece \
    protobuf \
    torch



In [3]:

# 0.2  Core imports & configuration

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Patch jupyter_client to silence datetime.utcnow() spam
from datetime import datetime, timezone
try:
    import jupyter_client.session as _jcs
    _jcs.utcnow = lambda: datetime.now(timezone.utc)
except Exception:
    pass

import torch
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Compute device: {device}")
if device.type == "cuda":
    print(f"  GPU: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("  (Models will run on CPU — inference will be slower)")

print("Setup complete.")


Compute device: cuda
  GPU: Tesla T4
  Memory: 15.6 GB
Setup complete.


We set `HF_HUB_DISABLE_PROGRESS_BARS=1` to prevent Hugging Face progress bars from polluting the notebook metadata (a known issue with GitHub rendering). The `jupyter_client.session` patch silences the `datetime.utcnow()` deprecation warnings that flood Colab output when transformer models run inference.

---

## Recipe 1 — Loading a Dataset

The Hugging Face `datasets` library provides a unified API to thousands of public datasets. It handles downloading, caching, memory-mapping (so large datasets do not need to fit in RAM), and standardized train/validation/test splits.

In [4]:

# 1.1  Load the Rotten Tomatoes dataset

from datasets import load_dataset, get_dataset_split_names

dataset = load_dataset("rotten_tomatoes")

print("Available splits:", get_dataset_split_names("rotten_tomatoes"))
print()
print(dataset)





Available splits: ['train', 'validation', 'test']

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


The Rotten Tomatoes dataset contains $5{,}331$ positive and $5{,}331$ negative movie review sentences, split into train, validation, and test partitions. It is a **binary sentiment classification** benchmark -- simple enough to demonstrate transformer workflows without lengthy training times.

In [5]:

# 1.2  Inspect the training split

training_data = dataset["train"]

print("Description:")
print(training_data.description[:200], "...")
print()
print("Features:", training_data.features)
print(f"Number of examples: {len(training_data):,}")


Description:
 ...

Features: {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
Number of examples: 8,530


In [6]:

# 1.3  Sample first 5 sentences

sentences = training_data["text"][:5]

for i, sentence in enumerate(sentences):
    print(f"[{i}] {sentence}")


[0] the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
[1] the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
[2] effective but too-tepid biopic
[3] if you sometimes like to go to the movies to have fun , wasabi is a good place to start .
[4] emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .


Each example has two fields: `text` (the review string) and `label` (0 for negative, 1 for positive). The reviews are single sentences -- short and punchy, typical of critics' one-liners. This brevity makes the dataset ideal for testing whether a model can capture sentiment from limited context.

**Production note:** In real projects you rarely work with clean, pre-split datasets. The `datasets` library also supports loading from CSV, JSON, Parquet, and SQL -- use `load_dataset("csv", data_files="path/to/file.csv")`.

---

## Recipe 2 — Tokenizing Text

Transformers do not see words -- they see **token IDs**. The tokenizer converts raw text into a sequence of integer IDs that index into the model's vocabulary. BERT-family models use **WordPiece** tokenization: common words get a single token, while rare words are split into sub-word units.

The tokenizer also adds special tokens:

$$\text{[CLS]} \;\; w_1 \;\; w_2 \;\; \cdots \;\; w_n \;\; \text{[SEP]}$$

where `[CLS]` (classification) marks the start and `[SEP]` (separator) marks the end. For BERT, the `[CLS]` token's representation is typically used as the sentence embedding for classification tasks.

In [7]:

# 2.1  Initialize a BERT tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sentences = [
    "The first sentence, which is the longest one in the list.",
    "The second sentence is not that long.",
    "A very short sentence."
]

tokenized_input = tokenizer(sentences)

print("Keys:", list(tokenized_input.keys()))
print()
for i, sent in enumerate(sentences):
    ids = tokenized_input["input_ids"][i]
    print(f'Sentence {i}: "{sent}"')
    print(f"  Token IDs ({len(ids)} tokens): {ids}")
    print(f"  Attention mask: {tokenized_input['attention_mask'][i]}")
    print()


Keys: ['input_ids', 'token_type_ids', 'attention_mask']

Sentence 0: "The first sentence, which is the longest one in the list."
  Token IDs (15 tokens): [101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102]
  Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sentence 1: "The second sentence is not that long."
  Token IDs (10 tokens): [101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102]
  Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sentence 2: "A very short sentence."
  Token IDs (7 tokens): [101, 138, 1304, 1603, 5650, 119, 102]
  Attention mask: [1, 1, 1, 1, 1, 1, 1]



The tokenizer returns three parallel lists:

**`input_ids`** — Integer IDs mapping each token to the model's vocabulary. Token 101 is `[CLS]` and 102 is `[SEP]`.

**`token_type_ids`** — All zeros for single-sentence input. For sentence-pair tasks (e.g., natural language inference), the second sentence gets type ID 1.

**`attention_mask`** — Binary mask indicating real tokens (1) vs. padding (0). Since we have not padded yet, all values are 1.

Notice the sentences have **different lengths** (15, 10, and 7 tokens). When batching, we would need to pad the shorter sentences to match the longest one -- the attention mask then tells the model to ignore the padding positions.
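The padding step described above can be sketched in plain Python, using the token IDs printed earlier. This is only an illustration of the idea; in practice the tokenizer does it for you when called with `padding=True`. The pad ID of 0 is an assumption about this vocabulary, and `pad_batch` is a hypothetical helper name.

```python
# Token IDs copied from the tokenizer output above (bert-base-cased).
batch = [
    [101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102],
    [101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102],
    [101, 138, 1304, 1603, 5650, 119, 102],
]

PAD_ID = 0  # assumption: ID 0 is the [PAD] token in this vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


padded_ids, masks = pad_batch(batch)
for ids, mask in zip(padded_ids, masks):
    print(f"length={len(ids)}  real_tokens={sum(mask)}  mask={mask}")
```

After padding, all three rows have the same length, so they can be stacked into a single tensor; the mask preserves the information about which positions are real.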

In [8]:

# 2.2  Convert IDs back to tokens

tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"][0])

print("Token IDs:", tokenized_input["input_ids"][0])
print("Tokens:   ", tokens)
print()
print(f"Vocabulary size: {tokenizer.vocab_size:,} tokens")


Token IDs: [101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102]
Tokens:    ['[CLS]', 'The', 'first', 'sentence', ',', 'which', 'is', 'the', 'longest', 'one', 'in', 'the', 'list', '.', '[SEP]']

Vocabulary size: 28,996 tokens


Converting IDs back to tokens confirms the round-trip: every word in this sentence maps to a single whole-word token, with `[CLS]` and `[SEP]` bookending the sequence. The `bert-base-cased` vocabulary contains exactly $28{,}996$ tokens -- a mix of whole words and sub-word pieces (continuation pieces are prefixed with `##`).

**Sub-word tokenization** is the key insight that makes transformers handle open vocabularies: even words never seen during pre-training can be represented as a sequence of known sub-word units. For example, "unbelievable" might become `["un", "##bel", "##ie", "##va", "##ble"]`.
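To make the splitting rule concrete, here is a minimal sketch of greedy longest-match-first splitting, the general strategy WordPiece uses at tokenization time. The vocabulary below is a toy one invented for illustration, so the resulting pieces will not match what `bert-base-cased` actually produces, and `wordpiece_tokenize` is a hypothetical helper, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"un", "##believ", "##able", "sentence", "##s"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("sentences", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a word the "model" has never stored whole is still covered by known pieces, and only a word with no matching pieces at all falls back to the unknown token.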

---

## Recipe 3 — Classifying Text with Transformers

The Hugging Face `pipeline` abstraction wraps tokenization, model inference, and post-processing into a single callable. For sentiment analysis, we use a RoBERTa model fine-tuned on the Rotten Tomatoes dataset itself.

The architecture is:

$$\text{Input} \xrightarrow{\text{Tokenizer}} \text{Token IDs} \xrightarrow{\text{RoBERTa Encoder}} \mathbf{h}_{\text{[CLS]}} \xrightarrow{\text{Linear + Softmax}} P(\text{pos}), P(\text{neg})$$

RoBERTa names its sequence-start token `<s>` rather than `[CLS]`, but it plays the same role: the encoder produces a contextual representation $\mathbf{h}_{\text{[CLS]}} \in \mathbb{R}^{768}$ at that position, which a classification head maps to class probabilities.
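The final "Linear + Softmax" step can be sketched without any deep-learning library. The hidden vector below is a 4-dimensional stand-in for the real 768-dimensional one, and the weights are made up for illustration; the actual RoBERTa head has additional layers that this sketch omits.

```python
import math


def linear(hidden, weights, bias):
    """One output logit per class: dot(weights[c], hidden) + bias[c]."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values (assumptions): a 4-dim "sentence vector" and a 2-class head.
hidden = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.4, -0.2, -0.3],    # row for class 0 (neg)
           [-0.1, -0.4, 0.2, 0.3]]    # row for class 1 (pos)
bias = [0.0, 0.0]

logits = linear(hidden, weights, bias)
probs = softmax(logits)
print(f"logits={logits}")
print(f"P(neg)={probs[0]:.4f}  P(pos)={probs[1]:.4f}")
```

The pipeline's reported `score` corresponds to the larger of these two probabilities.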

In [9]:

# 3.1  Initialize sentiment analysis pipeline

from transformers import pipeline

roberta_pipe = pipeline(
    "sentiment-analysis",
    model="textattack/roberta-base-rotten-tomatoes",
    device=device)

print(f"Model: textattack/roberta-base-rotten-tomatoes")
print(f"Running on: {device}")


RobertaForSequenceClassification LOAD REPORT from: textattack/roberta-base-rotten-tomatoes
Key                         | Status     |  | 
----------------------------+------------+--+-
roberta.pooler.dense.weight | UNEXPECTED |  | 
roberta.pooler.dense.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model: textattack/roberta-base-rotten-tomatoes
Running on: cuda


In [12]:
# 3.2  Predict on a small sample

from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

# Load dataset
sample = load_dataset("rotten_tomatoes", split="test").select(range(5))

# KeyDataset extracts the "text" column and yields plain strings to the pipeline
predictions = []
for out in tqdm(roberta_pipe(KeyDataset(sample, "text"), batch_size=8)):
    predictions.append(out)

# Print the results
for idx, text in enumerate(sample["text"]):
    actual = sample["label"][idx]
    pred_label = 1 if predictions[idx]["label"] == "LABEL_1" else 0
    score = predictions[idx]["score"]
    match = "Y" if actual == pred_label else "X"

    print(f'[{match}] actual={actual}  pred={pred_label}  conf={score:.3f}  "{text[:70]}..."')


[Y] actual=1  pred=1  conf=0.965  "lovingly photographed in the manner of a golden book sprung to life , ..."
[Y] actual=1  pred=1  conf=0.996  "consistently clever and suspenseful ...."
[X] actual=1  pred=0  conf=0.916  "it's like a " big chill " reunion of the baader-meinhof gang , only th..."
[Y] actual=1  pred=1  conf=0.996  "the story gives ample opportunity for large-scale action and suspense ..."
[Y] actual=1  pred=1  conf=0.879  "red dragon " never cuts corners ...."


The pipeline correctly classifies four of the five samples. Each prediction comes with a confidence score: the softmax probability of the chosen label. A score near 1.0 indicates high confidence; scores around 0.5 suggest the model is uncertain.

Confidence is not a guarantee of correctness, though. The misclassified example carries a score of 0.916, higher than one of the correct predictions (0.879), so the model can be confidently wrong. A **confidence threshold** below which predictions are routed to human review is still a useful production pattern, but the threshold should be chosen by checking calibration on held-out data.
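A threshold-based routing rule might look like the sketch below. It reuses the five scores printed above; the threshold of 0.95 and the `route` helper are hypothetical choices for illustration, not values recommended by the model's authors.

```python
# The five pipeline outputs from the sample above.
predictions = [
    {"label": "LABEL_1", "score": 0.965},
    {"label": "LABEL_1", "score": 0.996},
    {"label": "LABEL_0", "score": 0.916},
    {"label": "LABEL_1", "score": 0.996},
    {"label": "LABEL_1", "score": 0.879},
]


def route(preds, threshold):
    """Split prediction indices into auto-accepted and human-review groups."""
    auto, review = [], []
    for idx, pred in enumerate(preds):
        if pred["score"] >= threshold:
            auto.append(idx)
        else:
            review.append(idx)
    return auto, review


# Hypothetical threshold; in practice, tune it on held-out labelled data.
auto, review = route(predictions, threshold=0.95)
print("auto-accepted:", auto)
print("sent to review:", review)
```

With this particular threshold, both the misclassified example (index 2) and one correct prediction (index 4) are sent to review, which illustrates the cost of thresholding: some correct predictions are flagged along with the errors.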

In [14]:
# 3.3  Evaluate on the full test set (Robust Approach)

from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset
from evaluate import combine
from tqdm.auto import tqdm
import time

test_data = load_dataset("rotten_tomatoes", split="test")

# 1. Load metrics
metrics = combine(["accuracy", "precision", "recall", "f1"])

# 2. Run batched inference
print("Running inference on test set...")
start_time = time.time()
predictions = []

# Use a batch_size that fits your T4 GPU's available memory (e.g., 16 or 32)
for out in tqdm(roberta_pipe(KeyDataset(test_data, "text"), batch_size=16, truncation=True)):
    # Map the model's string label back to an integer 0/1
    pred_label = 1 if out["label"] == "LABEL_1" else 0
    predictions.append(pred_label)

end_time = time.time()

# 3. Compute metrics
references = test_data["label"]
eval_results = metrics.compute(predictions=predictions, references=references)

# Compute throughput manually
total_samples = len(test_data)
throughput = total_samples / (end_time - start_time)

# 4. Print results
print("\nFull test set evaluation:")
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(f"  {metric:>10}: {eval_results[metric]:.4f}")
print(f"  {'throughput':>10}: {throughput:.1f} samples/sec")

Running inference on test set...




Full test set evaluation:
    accuracy: 0.8874
   precision: 0.9223
      recall: 0.8462
          f1: 0.8826
  throughput: 316.1 samples/sec


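The RoBERTa model reaches about 88.7% accuracy on the Rotten Tomatoes test set. Precision (0.9223) is noticeably higher than recall (0.8462), which means the model is conservative about predicting the positive class: when it says "positive" it is usually right, but it misses roughly 15% of the truly positive reviews. The $F_1$ score combines the two as their harmonic mean: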

$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

**Precision** measures "of all reviews predicted positive, how many actually are?" **Recall** measures "of all actually positive reviews, how many did we catch?" An $F_1$ near 0.88 is strong for single-sentence sentiment analysis, where context is limited and sarcasm is common.
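The sketch below computes these metrics from raw confusion-matrix counts. The counts are not printed anywhere in the notebook; they are reconstructed from the reported metrics under the assumption that the 1,066-example test split contains 533 positive and 533 negative reviews, so treat them as an inference rather than logged output.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Assumed counts (reconstructed, not taken from the notebook output):
# 533 positives = 451 caught + 82 missed; 533 negatives = 495 correct + 38 false alarms.
accuracy, precision, recall, f1 = classification_metrics(tp=451, fp=38, fn=82, tn=495)

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall   : {recall:.4f}")
print(f"f1       : {f1:.4f}")
```

Under that assumption the counts reproduce all four reported numbers, and they make the imbalance visible: the model produces 38 false positives but 82 false negatives.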

**Throughput** matters in production: on the T4 GPU used here, the model processes roughly 300 reviews per second, and CPU inference is typically much slower. For batch inference on millions of reviews, GPU acceleration makes a substantial difference.

---

## Recipe 4 — Zero-Shot Classification

**Zero-shot classification** lets you classify text into categories the model was never explicitly trained on. The model (here, `facebook/bart-large-mnli`) was trained on **Natural Language Inference (NLI)** — given a premise and hypothesis, predict whether the hypothesis is entailed, contradicted, or neutral.

The trick: for each candidate label, the pipeline builds a hypothesis from a template (by default *"This example is {label}."*, configurable via `hypothesis_template`) and checks whether the input text entails it. The label with the highest entailment score wins:

$$\hat{y} = \arg\max_{c \in \mathcal{C}} P(\text{entailment} \mid \text{premise}=\mathbf{x}, \; \text{hypothesis}=\texttt{"This example is } c\texttt{."})$$

This means you can define **any set of labels at inference time** — no retraining required.
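The scoring logic can be sketched in a few lines. The template string is an assumption about the pipeline's default wording, and the logits are invented numbers standing in for the NLI model's entailment outputs; in single-label mode the entailment logits are normalized across the candidate labels with a softmax.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """One NLI hypothesis per candidate label (template wording is an assumption)."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by score."""
    m = max(entailment_logits)
    exps = [math.exp(z - m) for z in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["technology", "gaming", "hobby"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per (premise, hypothesis) pair.
made_up_logits = [0.5, 3.0, 0.7]
ranked = rank_labels(labels, made_up_logits)

for hypothesis in hypotheses:
    print(hypothesis)
for label, score in ranked:
    print(f"{label:<12} {score:.3f}")
```

Because the labels only enter through the hypothesis text, swapping in a new label set requires no change to the model.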

In [15]:

# 4.1  Initialize zero-shot pipeline

from transformers import pipeline

zero_shot_pipe = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=device)

print("Model: facebook/bart-large-mnli")
print(f"Running on: {device}")


Model: facebook/bart-large-mnli
Running on: cuda


In [17]:

# 4.2  Classify with custom labels

result1 = zero_shot_pipe(
    "I am so hooked to video games as I cannot get any work done!",
    candidate_labels=["technology", "gaming", "hobby", "art", "computer"])

print(f'Text: "{result1["sequence"]}"')  # inner double quotes keep this valid before Python 3.12
print()
for label, score in zip(result1["labels"], result1["scores"]):
    bar = "#" * int(score * 40)
    print(f"  {label:<12} {score:.3f}  {bar}")


Text: "I am so hooked to video games as I cannot get any work done!"

  gaming       0.847  #################################
  hobby        0.082  ###
  technology   0.068  ##
  computer     0.002  
  art          0.001  


The model assigns **gaming** the highest probability, well above the other candidates. Notice that "hobby" and "technology" have modest scores — they are plausible but less specific. The key advantage: we defined these labels ourselves, with zero training data. The model generalizes from its NLI training to reason about label relevance.

In [19]:

# 4.3  Second example — health domain

result2 = zero_shot_pipe(
    "A early morning exercise regimen can drive many diseases away!",
    candidate_labels=["health", "medical", "weather", "geography", "politics"])

print(f'Text: "{result2["sequence"]}"')  # inner double quotes keep this valid before Python 3.12
print()
for label, score in zip(result2["labels"], result2["scores"]):
    bar = "#" * int(score * 40)
    print(f"  {label:<12} {score:.3f}  {bar}")
print(f"\nPredicted class: {result2['labels'][0]} "
      f"(confidence: {result2['scores'][0]:.2f})")


Text: "A early morning exercise regimen can drive many diseases away!"

  health       0.907  ####################################
  medical      0.069  ##
  weather      0.011  
  geography    0.009  
  politics     0.005  

Predicted class: health (confidence: 0.91)


The model confidently classifies the exercise sentence as **health** with high probability, clearly distinguishing it from the semantically adjacent "medical" label. This distinction is impressive: "health" implies wellness and prevention, while "medical" implies diagnosis and treatment. The model captures this nuance without any labeled examples.

**When to use zero-shot classification:**
- **Prototyping:** Test whether a classification approach works before investing in labeled data
- **Low-resource domains:** When you have fewer than ~100 labeled examples per class
- **Dynamic label sets:** When categories change frequently (e.g., customer support routing)

**When NOT to use it:** When you have abundant labeled data and need maximum accuracy. In that setting, a fine-tuned model will typically outperform zero-shot classification.

---

## Recipe 5 — Generating Text with GPT-2

GPT-2 is a **decoder-only** transformer that generates text autoregressively — predicting one token at a time, each conditioned on all previous tokens:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

At each step, the model produces a probability distribution over its entire vocabulary ($50{,}257$ tokens for GPT-2). The **decoding strategy** determines how we pick the next token from this distribution, which dramatically affects output quality.
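The chain-rule factorization is easy to check numerically. The per-step probabilities below are invented for illustration; in practice each one would come from the model's softmax output at that position. Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_1..w_{t-1}) for a 4-token sequence.
step_probs = [0.20, 0.50, 0.10, 0.80]

# Direct product, as in the formula.
sequence_prob = math.prod(step_probs)

# Equivalent sum of log-probabilities, the form used in practice.
sequence_log_prob = sum(math.log(p) for p in step_probs)

print(f"P(sequence)     = {sequence_prob:.6f}")
print(f"log P(sequence) = {sequence_log_prob:.4f}")
print(f"exp(log P)      = {math.exp(sequence_log_prob):.6f}")
```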

In [20]:

# 5.1  Basic text generation

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=device)

seed_text = "The cat had no business entering the neighbors garage, but"

# Beam search combined with sampling (num_beams=5 with do_sample=True)
results_basic = generator(
    seed_text,
    do_sample=True,
    max_length=30,
    num_return_sequences=3,
    num_beams=5,
    pad_token_id=50256)

print("=== Basic beam search ===")
for i, r in enumerate(results_basic):
    print(f"[{i+1}] {r['generated_text']}")
    print()


GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Passing `generation_config` together with generation-related arguments=({'max_length', 'pad_token_id', 'do_sample', 'num_beams', 'num_return_sequences'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


=== Basic beam search ===
[1] The cat had no business entering the neighbors garage, but when he got there, he saw a man with a gun.

"He said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and he said, 'I'm going to kill you,' and I said, 'I'm going to kill you,' and

[2] The cat had no business entering the neighbors garage, but when he got there, he saw a man with a gun.

"He sa

The basic output falls into a repetition loop. It is also far longer than 30 tokens: as the warning above reports, the pipeline's default `max_new_tokens=256` took precedence over `max_length=30`. Beam search keeps the top-$B$ most probable sequences at each step, and it tends to produce **generic, repetitive** text because high-probability tokens are often common words.
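A toy version of beam search shows what "keeping the top-$B$ sequences" means. The next-token table is invented and depends only on the previous token, which is a simplification (a real model conditions on the whole prefix), and the sketch omits length normalization and sampling.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=5):
    """Return the final beams as (tokens, cumulative log-prob), best first."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams highest-scoring candidates.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams


greedy_tokens, greedy_score = beam_search(num_beams=1)[0]
beam_tokens, beam_score = beam_search(num_beams=2)[0]

print("greedy:", " ".join(greedy_tokens), f"p={math.exp(greedy_score):.2f}")
print("beam=2:", " ".join(beam_tokens), f"p={math.exp(beam_score):.2f}")
```

In this toy table, greedy decoding commits to the locally best first token and ends with a lower-probability sequence, while a beam of two recovers the globally more probable one. The same mechanism favors "safe" high-probability continuations, which is where the generic output comes from.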

We can improve quality with several decoding parameters:

- `no_repeat_ngram_size=2` — Prevents any bigram from appearing twice
- `top_k=50` — Samples from only the 50 highest-probability tokens
- `top_p=0.85` — **Nucleus sampling**: samples from the smallest set of tokens whose cumulative probability exceeds $p$
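The two filtering rules can be sketched on a toy next-token distribution. The tokens and probabilities are invented, and the helper names are hypothetical; the library applies the same idea to logits over the full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Toy distribution over five candidate next tokens (an assumption for illustration).
toy = {"the": 0.50, "a": 0.25, "his": 0.15, "zebra": 0.07, "qux": 0.03}

print("top_k=2   :", top_k_filter(toy, 2))
print("top_p=0.85:", top_p_filter(toy, 0.85))
```

Here `top_k=2` always keeps two tokens, whereas `top_p=0.85` keeps three because the first two only cover 75% of the probability mass. On a sharply peaked distribution, top-p would keep fewer tokens, which is what makes it adaptive.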

In [21]:

# 5.2  Improved generation with sampling controls

results_improved = generator(
    seed_text,
    do_sample=True,
    max_length=50,
    num_return_sequences=3,
    num_beams=5,
    no_repeat_ngram_size=2,
    top_k=50,
    top_p=0.85,
    pad_token_id=50256)

print("=== Improved (no-repeat + top-k + top-p) ===")
for i, r in enumerate(results_improved):
    print(f"[{i+1}] {r['generated_text']}")
    print()


Passing `generation_config` together with generation-related arguments=({'max_length', 'pad_token_id', 'do_sample', 'top_p', 'no_repeat_ngram_size', 'num_beams', 'num_return_sequences', 'top_k'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


=== Improved (no-repeat + top-k + top-p) ===
[1] The cat had no business entering the neighbors garage, but when he came inside, he found the cat lying on the floor.

"It was just a cat," he said. "I was like, 'What the hell is going on here?' And he was laying there on his back, and I thought, Oh my God, I can't believe this is happening. I'm so scared. It's like I've never seen anything like this in my life."

[2] The cat had no business entering the neighbors garage, but when he came inside, he found the cat lying on the floor.

"It was just a cat," he said. "I was like, 'What the hell is going on here?' And he was laying there on his back, and I thought, Oh my God, I can't believe this is happening. I'm so scared. It's like I've never seen anything like this in my entire life."

[3] The cat had no business entering the neighbors garage, but when he came inside, he found the cat lying on the floor.

"It was just a cat," he said. "I was like, 'What the hell is going on here?' And he 

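The improved settings remove the repetition loop and produce a more coherent continuation. The returned sequences are nearly identical, however: they come from the same beam search, and the first two differ by a single word, so `num_return_sequences` adds little diversity here. The key trade-off in text generation is **quality vs. diversity:**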

| Parameter | Effect | Trade-off |
|-----------|--------|-----------|
| `num_beams` | Higher = more search | Slower, more generic |
| `top_k` | Limits vocabulary per step | Too low = repetitive, too high = random |
| `top_p` | Dynamic vocabulary cutoff | More adaptive than fixed `top_k` |
| `no_repeat_ngram_size` | Prevents loops | May block legitimate repetition |
| `temperature` | Scales logits before softmax | $<1$ = conservative, $>1$ = creative |
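The effect of `temperature` in the last row can be seen on a toy set of logits. The logit values are invented; the function divides them by the temperature before applying softmax, which is the standard formulation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy logits for three candidate tokens

cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # unchanged softmax
hot = softmax_with_temperature(logits, 2.0)      # flatter distribution

for name, dist in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

Lower temperatures concentrate probability on the top token, so sampling behaves more like greedy decoding; higher temperatures spread probability toward unlikely tokens.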

**Note:** GPT-2 is a 2019 model, and the `gpt2` checkpoint used here has 124M parameters. Modern LLMs (GPT-4, Claude, Llama) generate text with the same autoregressive, token-by-token approach, but with orders of magnitude more parameters and training data, producing far more coherent output.

In [22]:

# 5.3  Longer generation

results_long = generator(
    seed_text,
    do_sample=True,
    max_length=150,
    num_return_sequences=1,
    num_beams=5,
    no_repeat_ngram_size=2,
    top_k=50,
    top_p=0.85,
    pad_token_id=50256)

print("=== Extended generation (150 tokens) ===")
print(results_long[0]["generated_text"])


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


=== Extended generation (150 tokens) ===
The cat had no business entering the neighbors garage, but when she got there, she found it was empty.

"I was like, 'Oh my God, what the hell is going on?'" she said. "It was just a pile of garbage. It was really hard to get it out of there. I didn't know what to do with it."


Longer generation reveals both the strengths and limitations of GPT-2: it maintains local coherence (each sentence is grammatical) but may lose global coherence (the narrative can drift). This is a fundamental challenge of autoregressive generation — the model has no explicit "plan" for where the text should go.

---

## Recipe 6 — Language Translation with T5

Google's **T5 (Text-to-Text Transfer Transformer)** treats every NLP task as a text-to-text problem. For translation, the input is formatted as:

```
translate English to French: It's such a beautiful morning today!
```

The model uses the full **encoder-decoder** architecture:

$$\underbrace{\text{Encoder}(\mathbf{x})}_{\text{contextual representation}} \;\xrightarrow{\text{cross-attention}}\; \underbrace{\text{Decoder}}_{\text{autoregressive generation}} \;\rightarrow\; \mathbf{y}$$

The encoder processes the source language and produces contextualized representations. The decoder generates the target language one token at a time, attending to the encoder's output via **cross-attention** at every layer.
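A single cross-attention step can be sketched as follows: one decoder query scores every encoder position, the scores become weights via softmax, and the weights mix the encoder values into a context vector. This is the generic scaled dot-product form with toy 2-dimensional vectors; it omits learned projections and multiple heads, and it is not a reproduction of T5's exact attention implementation.

```python
import math


def cross_attention(query, encoder_keys, encoder_values):
    """Return attention weights over encoder positions and the mixed context vector."""
    d = len(query)
    # Similarity between the decoder query and each encoder key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in encoder_keys]
    # Softmax turns scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the encoder values.
    dim = len(encoder_values[0])
    context = [sum(w * v[j] for w, v in zip(weights, encoder_values))
               for j in range(dim)]
    return weights, context


# Toy encoder output for a 3-token source sentence (values are assumptions).
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_query = [2.0, 0.0]  # the decoder state at the current target position

weights, context = cross_attention(decoder_query, encoder_keys, encoder_values)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The query aligns with the first and third source positions, so they receive most of the weight. During translation, this computation is repeated for every generated target token.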

In [23]:

# 6.1  Initialize T5 model and tokenizer

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer_t5 = T5Tokenizer.from_pretrained(
    "t5-base", model_max_length=200)
model_t5 = T5ForConditionalGeneration.from_pretrained(
    "t5-base", return_dict=True)
model_t5 = model_t5.to(device)

print(f"T5-base loaded on {device}")
print(f"  Encoder layers: {model_t5.config.num_layers}")
print(f"  Decoder layers: {model_t5.config.num_decoder_layers}")
print(f"  Hidden size: {model_t5.config.d_model}")
print(f"  Parameters: {sum(p.numel() for p in model_t5.parameters()):,}")


T5-base loaded on cuda
  Encoder layers: 12
  Decoder layers: 12
  Hidden size: 768
  Parameters: 222,903,552


In [24]:

# 6.2  Translate English to French

source_text = "It's such a beautiful morning today!"

input_ids = tokenizer_t5(
    "translate English to French: " + source_text,
    return_tensors="pt",
    truncation=True
).input_ids.to(device)

output_ids = model_t5.generate(input_ids, max_new_tokens=200)

translation = tokenizer_t5.decode(output_ids[0], skip_special_tokens=True)

print(f"Source (EN):  {source_text}")
print(f"Target (FR):  {translation}")


Source (EN):  It's such a beautiful morning today!
Target (FR):  C'est un beau matin aujourd'hui!


T5 produces a clean French translation. The task prefix `"translate English to French:"` is what steers the model — the same T5 checkpoint can also summarize, answer questions, or classify, simply by changing the prefix. This **multi-task framing** is one of T5's key innovations.

**How translation differs from generation:** In GPT-2 (recipe 5), the decoder generates text from a seed with no structured input. In T5, the encoder first builds a **deep understanding** of the source text, and the decoder uses cross-attention to "look back" at the source while generating each target token. This two-stage process is essential for tasks where the output must be faithful to the input.

In [25]:

# 6.3  Try more translations

test_sentences = [
    ("translate English to French:",
     "Machine learning is transforming the way we process text."),
    ("translate English to German:",
     "The weather is quite nice in Berlin this time of year."),
    ("translate English to Romanian:",
     "I would like to order a coffee and a croissant please."),
]

for prefix, text in test_sentences:
    input_ids = tokenizer_t5(
        prefix + " " + text,
        return_tensors="pt", truncation=True
    ).input_ids.to(device)

    output_ids = model_t5.generate(input_ids, max_new_tokens=200)
    translation = tokenizer_t5.decode(output_ids[0], skip_special_tokens=True)

    lang = prefix.split("to ")[-1].rstrip(":")
    print(f"EN -> {lang}:")
    print(f"  Source:  {text}")
    print(f"  Target:  {translation}")
    print()


EN -> French:
  Source:  Machine learning is transforming the way we process text.
  Target:  L'apprentissage automatique transforme la façon dont nous traitons le texte.

EN -> German:
  Source:  The weather is quite nice in Berlin this time of year.
  Target:  Das Wetter ist in Berlin zu dieser Jahreszeit recht schön.

EN -> Romanian:
  Source:  I would like to order a coffee and a croissant please.
  Target:  Aş dori să comand o cafea şi o ceară, vă rog.



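T5-base handles multiple language pairs from a single checkpoint. The quality varies by language pair: the French and German outputs here are fluent, while the Romanian output renders "croissant" as "ceară" (wax). French and German are well-represented in T5's training data, so their translations are typically better than those for Romanian or other lower-resource languages.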

**Production considerations for translation:**
- **T5-base** (220M parameters) provides decent quality for common language pairs
- **T5-large** or **T5-3B** improve quality but require more GPU memory
- For production-grade translation, dedicated models like **MarianMT** or **NLLB (No Language Left Behind)** often outperform general-purpose T5
- Always have native speakers validate translations for critical applications

---

## Summary and Key Takeaways

This chapter introduced the practical transformer workflow through six recipes that cover the three main architectural patterns:

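**Encoder-only (BERT, RoBERTa)** -- Recipes 2--3 used encoder models, which produce rich contextual representations for classification tasks. The encoder processes the entire input bidirectionally, making it ideal for understanding tasks. Recipe 4 also performed classification, but its BART-MNLI model is architecturally an encoder-decoder with a classification head.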

**Decoder-only (GPT-2)** — Recipe 5 used a decoder model for autoregressive text generation. The decoder sees only left context (previous tokens), making it naturally suited for generation.

**Encoder-Decoder (T5)** — Recipe 6 combined both components for translation, where the encoder understands the source and the decoder generates the target.

The Hugging Face `pipeline` abstraction made all of this accessible in just a few lines of code — but understanding what happens underneath (tokenization, attention, beam search, cross-attention) is essential for debugging, optimizing, and choosing the right model for your task.

**Looking ahead:** The techniques in this chapter use pre-trained models as-is. In production, you would typically **fine-tune** these models on your specific dataset to achieve higher accuracy, which is the natural next step beyond what we covered here.