
# Week 15: Transformers & BERT (Modern NLP)

# **SECTION 1: Welcome & Objectives**

In [None]:
print("Welcome to Week 15!")
print("This week, you'll:")
print("- Understand the Transformer architecture")
print("- Learn how BERT works for NLP tasks")
print("- Use Hugging Face Transformers for real-world NLP")

Welcome to Week 15!
This week, you'll:
- Understand the Transformer architecture
- Learn how BERT works for NLP tasks
- Use Hugging Face Transformers for real-world NLP


# **SECTION 2: What Are Transformers?**

### 🤖 What Are Transformers?
Transformers are deep learning architectures designed to handle **sequential data** like text.

They use:
- **Self-Attention** to relate all words to each other
- **Positional Encoding** to retain order
- A stack of **Encoder** blocks (as in BERT), **Decoder** blocks (as in GPT), or both (as in the original Transformer)


## 🔍 Why Transformers?

Traditional models (RNNs, LSTMs) process sequences **step-by-step**, which:
- Slows down training
- Makes it hard to capture long-range dependencies

**Transformers** process all tokens **in parallel**, using **attention mechanisms** to focus on relevant words.  
That parallelism makes training on very large corpora practical, which enabled large pretrained models such as BERT, GPT, and T5.

---

# **SECTION 3: Self-Attention Intuition**

### 🧠 Why Self-Attention?
In traditional RNNs/LSTMs:
- Information flows step-by-step (sequentially)
- Hard to model long-range dependencies

Transformers:
- Each word attends to all others in parallel
- Learn what to "focus" on during prediction

Example:
> "The animal didn't cross the street because **it** was too tired."
> → What does **it** refer to? Self-attention helps figure that out.

## 🔗 Core Idea: Self-Attention

Self-attention helps the model learn **which words to attend to**, regardless of how far apart they are.

> "The **cat** sat on the mat because **it** was tired."

A trained model such as BERT can link “it” to “cat” rather than “mat”, because its attention weights learn to emphasize the tokens that matter for each word.

---

Each word attends to **all other words** in the sentence.
This helps capture:
- Context
- Word relationships
- Long-range dependencies

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)V
$$

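Where:
- Q = Query matrix (what each token is looking for)
- K = Key matrix (what each token offers for matching)
- V = Value matrix (the information that gets mixed together)
- d_k = dimension of the key vectors; dividing by its square root keeps the scores in a range where softmax stays well-behaved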
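
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used inside any library: the matrices `Q`, `K`, and `V` are random placeholders (in a real model they come from learned linear projections of the token embeddings), and the sizes (3 tokens, 4 dimensions) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_tokens, n_tokens) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted mix of the value vectors

# Placeholder inputs: 3 tokens, 4 dimensions each (assumed sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
print(output.shape)          # (3, 4)
```

Row `i` of `weights` says how much token `i` draws from every token in the sequence, which is the "focus" described earlier.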

---

### 📦 Key Components

| Component             | Role                                                         |
|-----------------------|--------------------------------------------------------------|
| Multi-head Attention  | Captures relationships in different representation subspaces |
| Positional Encoding   | Injects token order into the input embeddings                |
| Feed-Forward Layer    | Transforms each token independently                          |
| Layer Norm + Residual | Stabilizes training of deep stacks                           |
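
To make the positional encoding row concrete, here is a sketch of the fixed sinusoidal scheme from the original Transformer paper. Treat the choice of scheme as an assumption for illustration: BERT itself uses learned position embeddings instead. The function name and the sizes are made up for this example, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]    # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)

# The encoding is added to the token embeddings before the first block.
token_embeddings = np.zeros((6, 8))  # placeholder embeddings
model_input = token_embeddings + pe
```

Because every position gets a distinct vector, the model can tell "dog bites man" from "man bites dog" even though attention itself ignores order.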

---

## 📐 Transformer Architecture

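A single **Transformer block** includes:
- Multi-head Self-Attention
- Feedforward Layers
- A residual connection and Layer Norm around each of the two

Positional Encoding is not part of every block: it is added once to the input embeddings, before the first block.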

Stack many such blocks → Transformer model.
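
A rough NumPy sketch of one encoder-style block and a small stack of them follows. It is heavily simplified and makes several assumptions that are not from this notebook: a single attention head instead of multiple heads, ReLU in the feed-forward layer, no learned scale or shift in the layer norm, random untrained weights, and made-up sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    # Applied to every token position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, p):
    # Sub-layer 1: self-attention, then residual connection and layer norm.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: feed-forward, then residual connection and layer norm.
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

def init_params(rng, d_model, d_ff):
    return {
        "Wq": rng.normal(scale=0.1, size=(d_model, d_model)),
        "Wk": rng.normal(scale=0.1, size=(d_model, d_model)),
        "Wv": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, seq_len, n_blocks = 8, 32, 5, 3   # assumed toy sizes
blocks = [init_params(rng, d_model, d_ff) for _ in range(n_blocks)]

x = rng.normal(size=(seq_len, d_model))  # stand-in for embeddings + positions
for params in blocks:
    x = encoder_block(x, params)
print(x.shape)  # (5, 8)
```

The input and output of a block have the same shape, which is what makes stacking possible.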

---

# **SECTION 4: Load Pretrained BERT (Text Classification)**

## 🧠 What is BERT?

**BERT = Bidirectional Encoder Representations from Transformers**

It:
- Uses only the **encoder** part of the Transformer
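- Reads text in **both directions**: every token attends to both its left and right context
- Pretrained on:
  - **Masked Language Modeling** (predict randomly masked tokens, i.e. fill in the blanks)
  - **Next Sentence Prediction** (predict whether one sentence follows another)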
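
To show what "fill in the blanks" means as a training signal, here is a simplified sketch of how masked-language-modeling examples can be built. The helper `mask_tokens` is hypothetical, written only for this illustration. It replaces each token with `[MASK]` with probability 0.15; BERT's actual recipe is slightly more involved (of the selected tokens, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)   # this position is not scored
    return masked, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the entries of `labels` that are not `None`, using context from both sides of each blank.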

---


## 🧱 Encoder vs Decoder

| Encoder (BERT)       | Decoder (GPT)       |
|----------------------|---------------------|
| Bidirectional        | Autoregressive       |
| Looks at full context | Left-to-right only |
| Ideal for classification, QA | Ideal for text generation |
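
One way to picture the difference in the table is through the attention mask. The sketch below is illustrative NumPy with equal placeholder scores, not code from any library: an encoder lets every token see every other token, while a decoder applies a causal (lower-triangular) mask so each token only sees itself and earlier tokens.

```python
import numpy as np

def masked_softmax(scores, mask):
    # Positions where mask == 0 get -inf, so softmax gives them zero weight.
    scores = np.where(mask == 1, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)     # encoder style
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))   # decoder style

scores = np.zeros((seq_len, seq_len))  # equal raw scores, for illustration only
encoder_weights = masked_softmax(scores, bidirectional_mask)
decoder_weights = masked_softmax(scores, causal_mask)

print(encoder_weights[0])  # [0.25 0.25 0.25 0.25]
print(decoder_weights[0])  # [1. 0. 0. 0.]
```

With the causal mask, the first token can attend only to itself, which is why decoders suit left-to-right generation.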

---

## 🔧 Using BERT via Hugging Face

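Hugging Face 🤗 Transformers makes it easy to use pretrained BERT-family models. With no model specified, the `sentiment-analysis` pipeline falls back to a default checkpoint, which in the run recorded in this notebook was a DistilBERT model fine-tuned on SST-2 (a smaller, distilled relative of BERT).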

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I love this movie!"))
```

In [None]:
from transformers import pipeline

In [None]:
# Use a simple sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers are revolutionizing NLP!")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9956631064414978}]


In [None]:
# Try another sentence
print(classifier("I hate this so much."))

[{'label': 'NEGATIVE', 'score': 0.9995205402374268}]
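
Where does the `score` come from? For a two-label classifier like this one, the pipeline's score is, to the best of our understanding of its default behavior, the softmax probability of the winning class. The sketch below reproduces that last step in plain Python. The logits are invented for illustration, and the label order is an assumption about this particular checkpoint.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [-2.5, 3.0]               # hypothetical raw model outputs

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A score near 1.0 means the two logits were far apart, not that the model is guaranteed to be right.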


# **SECTION 5: Tokenization (How Transformers See Text)**

In [None]:
from transformers import AutoTokenizer

In [None]:
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
example = "Transformers are powerful models."
tokens = tokenizer.tokenize(example)
ids = tokenizer.convert_tokens_to_ids(tokens)

In [None]:
print("Tokens:", tokens)
print("Token IDs:", ids)

Tokens: ['transformers', 'are', 'powerful', 'models', '.']
Token IDs: [19081, 2024, 3928, 4275, 1012]
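
In the output above every word happened to be in the vocabulary as a whole. BERT's tokenizer is a WordPiece tokenizer, so rarer words are split into subword pieces, with `##` marking a piece that continues a word. The sketch below shows the greedy longest-match-first idea on a tiny invented vocabulary. It is a teaching toy, not the Hugging Face implementation, and `toy_vocab` is made up.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"token", "##ization", "##ize", "##s", "play", "##ing"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Splitting into pieces lets a fixed-size vocabulary cover words the model never saw whole during pretraining.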



## 🔍 Real-World Use Cases of BERT

| Task                      | Description                                      |
|---------------------------|--------------------------------------------------|
| Sentiment Classification  | Predict sentiment from reviews                   |
| Named Entity Recognition  | Identify names, places, etc.                     |
| Question Answering        | Extract answers from passages                    |
| Semantic Similarity       | Compare sentence meanings                        |
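
For the semantic similarity row, a common recipe is to average a sentence's token embeddings into one vector and compare vectors with cosine similarity. The sketch below uses tiny invented embeddings so it runs without downloading a model. Real BERT-base token vectors have 768 dimensions, and mean pooling is just one of several pooling choices.

```python
import numpy as np

def mean_pool(token_embeddings):
    # Average over the token axis to get one vector per sentence.
    return token_embeddings.mean(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional token embeddings for three short sentences.
sent_a = np.array([[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]])
sent_b = np.array([[0.9, 0.1, 1.0], [1.1, 0.0, 0.9]])
sent_c = np.array([[-1.0, 0.5, -0.9], [-0.8, 0.3, -1.0]])

sim_ab = cosine_similarity(mean_pool(sent_a), mean_pool(sent_b))
sim_ac = cosine_similarity(mean_pool(sent_a), mean_pool(sent_c))
print(sim_ab > sim_ac)  # True
```

Cosine similarity ranges from -1 to 1, and higher values mean the two sentence vectors point in more similar directions.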

---

## 🧠 Concept Check: BERT vs GPT
| Aspect             | BERT               | GPT                   |
|--------------------|--------------------|-----------------------|
| Direction          | Bidirectional      | Left-to-right         |
| Training Objective | MLM + NSP          | Next token prediction |
| Strengths          | Classification, QA | Text generation       |

## 📘 Summary
Transformers replaced recurrence with attention, and that changed how NLP models are built:

- BERT → bidirectional understanding of language

- Hugging Face → pretrained models usable in a few lines of code

- One pretrained model can be adapted to many tasks



# **SECTION 6: Fine-Tuning BERT (on Custom Dataset - Optional)**

### 🧪 Want to Go Further?
Load BERT with `AutoModelForSequenceClassification` and train it with `Trainer` from Hugging Face
to fine-tune it on your own text classification dataset 🚀

We'll revisit this in the capstone projects.

# **SECTION 7: Exercises**

### 📝 Exercises:
1. Try the `zero-shot-classification` pipeline.
2. Tokenize a custom sentence and decode the IDs.
3. Use `AutoModel` and `AutoTokenizer` to extract embeddings.
4. Read BERT's original paper or Hugging Face docs.

**👋 Next week**: We'll explore modern **NLP architectures like T5 and BART** — powerful encoder-decoder models for summarization, rephrasing, and generation!