# 🤖 Beyond the Vanilla Transformer: Meet the Family (BERT, GPT, T5)

In our last notebook, we built the original "vanilla" Transformer, the Encoder-Decoder model introduced in "Attention Is All You Need." This architecture was a revolution.

However, researchers soon discovered that the **Encoder** and **Decoder** components were incredibly powerful on their own. By isolating and modifying them, they created a "zoo" of new models, each specialized for different tasks.

This notebook is a practical tour of the three main families of Transformer-based models. We won't build them from scratch; instead, we'll use the powerful **Hugging Face `transformers` library** to load pre-trained versions and understand:

1.  **Encoder-Only (like BERT):** The "Understander."
2.  **Decoder-Only (like GPT):** The "Generator."
3.  **Encoder-Decoder (like T5):** The "Translator" or "Swiss Army Knife."

**Our Goal:** To understand the conceptual differences between these architectures and see practical examples of what each is best suited for.

## 1. Setup: Installing the Transformers Library

First, we need to install the `transformers` library from Hugging Face. This library gives us easy access to thousands of pre-trained models. We'll also install `datasets` for some examples.

In [1]:
!pip install -q transformers datasets
!pip install -q sentencepiece # Required for T5

import torch
from transformers import pipeline

print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Using device: cuda


## 2. A High-Level Overview

Before we dive in, here is the key difference, one line per family:

* **Encoder-Only (BERT):** Sees the *entire* input sentence at once (bidirectional). It's pre-trained to "fill in the blanks," so it's a master of understanding context.
* **Decoder-Only (GPT):** Sees only the text that came *before* it (left-to-right). It's pre-trained to "predict the next word," so it's a master of generation.
* **Encoder-Decoder (T5):** Has both. The encoder "understands" the input, and the decoder "generates" the output based on that understanding.
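
To make "sees everything" versus "sees only the past" concrete, here is a minimal sketch in plain Python. The helper `visibility_mask` is a name made up for this illustration (it is not part of any library); it builds a grid in which a `1` means "the token in this row may attend to the token in this column."

```python
def visibility_mask(seq_len, causal):
    """Return a seq_len x seq_len grid: 1 if position i may attend to position j.

    Hypothetical helper for illustration only.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["The", "cat", "sat", "down"]
bidirectional_mask = visibility_mask(len(tokens), causal=False)  # encoder-style
causal_mask = visibility_mask(len(tokens), causal=True)          # decoder-style

for name, mask in [("Bidirectional (encoder)", bidirectional_mask),
                   ("Causal (decoder)", causal_mask)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<5} {row}")
```

In the encoder-style grid every row is all ones; in the decoder-style grid each row only has ones up to and including its own position. An encoder-decoder model uses the first pattern in its encoder and the second in its decoder.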

## 3. 🧠 Encoder-Only: The "Understander" (BERT)

The **BERT** (Bidirectional Encoder Representations from Transformers) family models are essentially just the **Encoder** stack from the original Transformer.

### How it Works:
* **Context:** **Bidirectional.** When processing a word, a BERT-style model can "see" all the words that come *before* it and *after* it in the sequence. This is its superpower.
* **Pre-training:** BERT is pre-trained using **Masked Language Modeling (MLM)**. This means it's given sentences with random words "masked out" (e.g., `The [MASK] sat on the mat.`) and its job is to use the full context to guess the masked word. This makes it an expert at building rich, contextual "understanding" of a sentence.
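
The masking step described above can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens` is made up, it splits on whitespace instead of using a real tokenizer, and it always substitutes `[MASK]` (the original BERT recipe selects about 15% of tokens and sometimes swaps in a random token or leaves the token unchanged).

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return (model inputs, training labels).

    Labels are None at positions the model is not asked to predict.
    Hypothetical, simplified helper for illustration only.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(token)   # the model must recover this token
        else:
            inputs.append(token)
            labels.append(None)    # no loss is computed here
    return inputs, labels


tokens = "the cat sat on the mat because it was tired".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print("inputs:", inputs)
print("labels:", labels)
```

Because the model sees the unmasked tokens on *both* sides of each `[MASK]`, it has to use bidirectional context to fill in the blank.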



### What it's Best For:
Tasks that require a deep understanding of the *entire* sentence. These are often called **Natural Language Understanding (NLU)** tasks.

* **Sentiment Analysis:** Is this review positive or negative?
* **Text Classification:** What topic is this news article about?
* **Named Entity Recognition (NER):** Find all the "people," "places," and "organizations" in this text.
* **Extractive Question Answering:** Given a text, find the *exact span* of text that answers a question.

### Practical Examples

We can use the Hugging Face `pipeline` to easily use these models.

#### Example 1: Masked Language Model (BERT's "native" task)
Let's see BERT do what it was trained to do: fill in the blanks.

In [4]:
# Load a fill-mask pipeline with a BERT model
# 'bert-base-uncased' is a good, standard model
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

sentence = "The capital of France is [MASK]."

results = fill_mask(sentence)

print(f"\nOriginal: {sentence}\n")
for result in results:
    print(f"Token: {result['token_str']:<10} | Score: {result['score']:.4f}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Original: The capital of France is [MASK].

Token: paris      | Score: 0.4168
Token: lille      | Score: 0.0714
Token: lyon       | Score: 0.0634
Token: marseille  | Score: 0.0444
Token: tours      | Score: 0.0303


> **Observation:** Notice how it knows "paris" is the most likely word, but it also suggests other plausible (but incorrect) city names. It's using the context of "capital" and "France."

#### Example 2: Sentiment Analysis (A common fine-tuned task)
Here, a BERT-style model has been "fine-tuned" on a labeled dataset for sentiment analysis. The core model "understands" the sentence, and a small classification "head" is added on top to output "Positive" or "Negative."

In [6]:
# We'll use 'distilbert', a smaller, faster version of BERT.
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

text1 = "I love this notebook! It's so clear and helpful."
text2 = "This is the worst movie I have ever seen. It was boring and poorly acted."

print(classifier(text1))
print(classifier(text2))

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.999875545501709}]
[{'label': 'NEGATIVE', 'score': 0.999819815158844}]
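
To see roughly what a classification "head" does, here is a toy sketch. Everything in it is an assumption made for illustration: the three-number `sentence_vector` stands in for the encoder's much larger output, and the `weights` and `biases` are invented rather than learned. The helper names `softmax` and `classify` are also ours, not the library's.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vector, weights, biases, labels):
    """One linear layer plus softmax: the simplest possible classification head."""
    logits = [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return {"label": labels[best], "score": probs[best]}


# Made-up numbers: a real encoder output has hundreds of dimensions.
sentence_vector = [0.9, -0.2, 0.4]
weights = [[-1.0, 0.5, -0.5],   # row for NEGATIVE
           [1.5, -0.5, 1.0]]    # row for POSITIVE
biases = [0.0, 0.1]

result = classify(sentence_vector, weights, biases, ["NEGATIVE", "POSITIVE"])
print(result)
```

The output has the same shape as the pipeline's result: a label and a probability-like score. During fine-tuning, it is the head's weights (and usually the encoder's too) that get adjusted on the labeled data.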


## 4. ✍️ Decoder-Only: The "Generator" (GPT)

The **GPT** (Generative Pre-trained Transformer) family models are essentially the **Decoder** stack from the original Transformer, minus the cross-attention sub-layer, since there is no encoder output to attend to.

### How it Works:
* **Context:** **Causal (or "Masked")**. This is the key. A Decoder-Only model can only see words to its *left* (the past). It *cannot* see future words. This is enforced by the "look-ahead mask" we built in the Transformer notebook.
* **Pre-training:** GPT is pre-trained on a simple **Causal Language Modeling** task: predict the *very next word* in a sequence. It's fed massive amounts of text from the internet and just learns to predict what comes next, over and over.
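
The "predict the next word" objective turns a single sentence into many training examples. The sketch below shows this with a whitespace-split sentence; `next_token_pairs` is a hypothetical helper name, and real models work on subword tokens and process all positions in parallel using the causal mask.

```python
def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it.

    Hypothetical helper for illustration only.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["In", "a", "world", "where", "dragons", "are", "real"]
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(f"{' '.join(context):<32} -> {target}")
```

Each line is one prediction problem: the context on the left is all the model may look at, and the token on the right is what it is trained to predict.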



### What it's Best For:
Tasks that involve creating new, coherent text from a prompt. These are called **Natural Language Generation (NLG)** tasks.

* **Text Generation:** Continue a story, write a poem, complete a prompt.
* **Chatbots:** Generating a response in a conversation.
* **Summarization (Abstractive):** Writing a *new* summary (not just copying sentences).
* **Generative Question Answering:** Answering a question by generating a new sentence.

### Practical Example

Let's use the `pipeline` to generate text with **GPT-2**, a famous model from this family.

#### Example 1: Text Generation

In [8]:
# Load a text-generation pipeline with GPT-2
generator = pipeline('text-generation', model='gpt2')

prompt = "In a world where dragons and magic are real,"

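# Ask for 2 completions of the prompt.
# Note: in recent `transformers` releases the pipeline's default max_new_tokens
# (256) takes precedence over max_length, as the warning in the output shows.
# To cap the continuation reliably, pass max_new_tokens=50 instead of max_length.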
results = generator(prompt, max_length=50, num_return_sequences=2)

for i, result in enumerate(results):
    print(f"--- Completion {i+1} ---")
    print(result['generated_text'])
    print("\n")

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


--- Completion 1 ---
In a world where dragons and magic are real, it may seem like the best option, as we can see from the video below.

The next thing is to see whether the dragons or magic of the world actually work. The video shows them in their natural state. The video shows them in their magic state.

It's not clear what makes these dragons, the magic of the world, work. But if the ability to fly is a thing you can see in the video, then the ability to fly is probably better than the ability to fly. The dragons may be able to fly in a different way, but they're still dragons.

In the video, the dragon is shown flying in a way that is the same as its natural state, a way that doesn't involve a lot of movement. But it does have a move. It also has a move that is in real time, such as a circle that changes direction.

The video does show dragons moving in a way that doesn't involve a lot of movement. But it does have a move that is in real time, such as a circle that changes directio

> **Observation**: Notice how the model "auto-regressively" generates text, token by token, with each new word being conditioned on the words that came before it. It's just "predicting the next word" on a loop.
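
The loop itself is simple enough to sketch without a neural network. In the toy below, the table `NEXT_TOKEN_SCORES` is invented and stands in for the model; it looks only at the *last* token, whereas GPT-2 conditions on the whole prefix. The function name `generate` and the `<eos>` marker are likewise assumptions of this sketch.

```python
# Invented scores: for each token, how likely each candidate next token is.
NEXT_TOKEN_SCORES = {
    "the": {"dragon": 0.6, "castle": 0.4},
    "dragon": {"flew": 0.7, "slept": 0.3},
    "flew": {"over": 0.9, "<eos>": 0.1},
    "over": {"the": 1.0},
    "castle": {"<eos>": 1.0},
    "slept": {"<eos>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: append the top-scoring next token, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # the toy "model" has no prediction for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence marker: stop generating
        tokens.append(next_token)  # the new token becomes part of the context
    return tokens


print(" ".join(generate(["the"], max_new_tokens=6)))
# prints: the dragon flew over the dragon flew
```

Always taking the top-scoring token sends this toy into a cycle, which echoes the repetitive phrasing in the GPT-2 completion above. Sampling from the scores instead of taking the maximum is one common way to get more varied text.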

## 5. 🔁 Encoder-Decoder: The "Swiss Army Knife" (T5)

This family brings us back to the original architecture, but with a twist. Models like **T5** (Text-to-Text Transfer Transformer) and **BART** use *both* the Encoder and Decoder.

### How it Works:
* **Context:** The Encoder has **bidirectional** context (like BERT) to "understand" the input. The Decoder has **causal** context (like GPT) to "generate" the output.
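* **Pre-training:** T5's clever idea is to frame *every* NLP task as a **"text-to-text"** problem. It's pre-trained with **span corruption**: spans of the input are replaced with sentinel tokens, and the model is trained to generate only the missing spans, each introduced by its sentinel. For example, `The cat sat on the soft mat.` becomes the input `The cat <X> on the <Y> mat.`, and the target output is `<X> sat <Y> soft <Z>`, where the final sentinel `<Z>` marks the end.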
* **Task Prefixes:** To tell the model *what* to do, you add a **prefix** to the input text.
    * `"summarize: [article text]..."`
    * `"translate English to German: [English text]..."`
    * `"cola sentence: [sentence]..."` (for a classification task)
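
Here is a minimal sketch of how a span-corruption training pair could be built. It is an illustration under simplifying assumptions, not T5's actual preprocessing: `span_corrupt` is a made-up helper, the spans to drop are chosen by hand rather than sampled, and the sentinels are written `<X>`, `<Y>`, `<Z>` as in this notebook (the released T5 vocabulary names them `<extra_id_0>`, `<extra_id_1>`, and so on).

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Build a (corrupted input, target) pair in the span-corruption style.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Hypothetical helper for illustration only.
    """
    if len(spans) >= len(sentinels):
        raise ValueError("need one sentinel per span plus a closing sentinel")
    corrupted, target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # replace the span in the input
        target.append(sentinel)                 # announce the span in the target
        target.extend(tokens[start:end])        # then spell out what was dropped
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(sentinels[len(spans)])        # closing sentinel ends the target
    return " ".join(corrupted), " ".join(target)


tokens = "The cat sat on the soft mat .".split()
model_input, model_target = span_corrupt(tokens, spans=[(2, 3), (5, 6)])
print("input :", model_input)   # The cat <X> on the <Y> mat .
print("target:", model_target)  # <X> sat <Y> soft <Z>
```

Note that the target contains only the dropped spans, not the whole sentence, which keeps the decoder's output short during pre-training.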



### What it's Best For:
Tasks that **transform** an input sequence into a *new* output sequence. This makes them incredibly versatile.

* **Machine Translation:** (The original Transformer task).
* **Abstractive Summarization:** (Often the best models for this).
* **Generative Question Answering:** Taking a question and context and generating a free-form answer.

### Practical Examples

Let's use **T5-small** (a smaller version) to perform two different tasks with the *same* model, just by changing the pipeline.

#### Example 1: Summarization

In [11]:
# Load a summarization pipeline with T5
summarizer = pipeline('summarization', model='t5-small')

long_text = """
The Transformer is a deep learning model architecture introduced in 2017.
It is notable for its use of self-attention mechanisms, which allow it to weigh the
importance of different words in a sequence. Unlike Recurrent Neural Networks (RNNs),
the Transformer does not process data sequentially, enabling significant parallelization
and faster training. This architecture is split into an Encoder and a Decoder.
The Encoder's job is to build a rich, contextual representation of the input, while
the Decoder's job is to generate an output sequence based on that representation.
This design proved to be highly effective for machine translation and has since become
the foundation for most modern large language models, including BERT and GPT.
"""

summary = summarizer(long_text, max_length=50, min_length=10)
print(f"\n{summary[0]['summary_text']}")

Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



the Transformer is a deep learning model architecture introduced in 2017 . it uses self-attention mechanisms to weigh the importance of different words in a sequence . the Transformer does not process data sequentially, enabling significant parallelization and faster training .


#### Example 2: Translation
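Now we use the exact same `t5-small` checkpoint for a different task. The released T5 checkpoints were trained on a multi-task mixture that included English-to-German translation, so no separate fine-tuned model is needed. The pipeline adds the correct prefix (`"translate English to German: "`) for us.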

In [14]:
# Load a translation pipeline, also using a T5 model
translator = pipeline('translation_en_to_de', model='t5-small')

text = "This notebook is a great introduction to the different Transformer architectures."

translation = translator(text)
print(f"\n{translation[0]['translation_text']}")

Device set to use cuda:0



Dieses Notebook ist eine großartige Einführung in die verschiedenen Transformer-Architekturen.
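
Conceptually, the prefix handling amounts to string concatenation before tokenization. The sketch below is a hypothetical helper, not the pipeline's actual implementation; the table `TASK_PREFIXES`, its keys, and the function `to_text_to_text` are names chosen for this illustration, and the prefix strings are the ones listed earlier in this notebook.

```python
# Prefix strings taken from the task-prefix list earlier in this notebook.
# The dictionary keys are arbitrary names chosen for this sketch.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_to_de": "translate English to German: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task, text):
    """Prepend the task prefix so one model can tell which job it is being asked to do."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None
    return prefix + text.strip()


sentence = "This notebook is a great introduction."
print(to_text_to_text("translate_en_to_de", sentence))
print(to_text_to_text("summarize", sentence))
```

The same input sentence yields two different model inputs, and therefore two different behaviors, without touching the model's weights.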


## 6. Summary: A Quick Comparison

Here's a simple table to help you remember the differences.

| Feature | Encoder-Only (e.g., BERT) | Decoder-Only (e.g., GPT) | Encoder-Decoder (e.g., T5) |
| :--- | :--- | :--- | :--- |
| **Architecture** | Bidirectional Encoder | Causal Decoder | Encoder-Decoder |
| **Context** | **Bidirectional** (sees all) | **Causal** (sees past) | Bidirectional (Encoder), Causal (Decoder) |
| **Analogy** | The "Understander" | The "Generator" | The "Translator" / "Swiss Army Knife" |
| **Pre-training** | Masked Language Model (MLM) | Causal Language Model (Next Token) | "Text-to-Text" (Span Corruption) |
| **Best For...** | **NLU** (Understanding) | **NLG** (Generating) | **Seq2Seq** (Transforming) |
| **Example Tasks** | Classification, NER | Chatbots, Story Writing | Translation, Summarization |
| **Models** | BERT, RoBERTa, ALBERT | GPT, LLaMA, Mistral | T5, BART, "Vanilla" Transformer |

## 7. Conclusion

You've now met the three main families of Transformer models and seen them in action.

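The key takeaway is that **the architecture shapes the task**: each family's attention pattern and pre-training objective make it a natural fit for a different kind of problem.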
* Need to **understand** a sentence for classification? Use an **Encoder (BERT)**.
* Need to **generate** creative text from a prompt? Use a **Decoder (GPT)**.
* Need to **transform** an input sequence into a new one? Use an **Encoder-Decoder (T5)**.

This powerful ecosystem, all stemming from the original 2017 paper, is what powers almost all of modern NLP.