# 🤖 Unit 4.2: Attention & Transformers

**Course:** Advanced Machine Learning (AICC 303)  
**Topics:**
*   4.6 Attention Mechanisms
*   4.7 Types of Attention
*   4.8 Transformer

**The Revolution:** In 2017, the paper "Attention Is All You Need" introduced the Transformer, an architecture that removes recurrence (RNNs) and relies entirely on attention mechanisms. It reshaped NLP.

---

In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import seaborn as sns
import matplotlib.pyplot as plt

# Setup
sns.set(style="whitegrid")
torch.manual_seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. The Attention Mechanism

In a basic Seq2Seq (Encoder-Decoder) model, the Encoder has to compress the entire input sentence into a *single fixed-size vector*. For long sentences, this bottleneck loses information.

**Attention** allows the Decoder to "look back" at all Encoder hidden states and focus on relevant words for the current prediction.
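
The "look back" step can be sketched with a few lines of NumPy. This is a minimal illustration, not a full Seq2Seq model: the hidden states are random, the sizes (5 source tokens, hidden size 8) are arbitrary, and plain dot-product scoring is assumed (other scoring functions exist).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # one hidden state per source token (assumed sizes)
decoder_state = rng.normal(size=(8,))     # current decoder hidden state

# Score each encoder state against the decoder state (dot-product scoring assumed).
scores = encoder_states @ decoder_state   # shape (5,)
weights = softmax(scores)                 # how much to focus on each source token

# Context vector: weighted average of all encoder states.
context = weights @ encoder_states        # shape (8,)

print("Attention weights:", weights.round(3))
print("Context shape:", context.shape)
```

The decoder receives a fresh `context` at every step, so it is no longer limited to one fixed summary of the whole sentence.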

### 1.1 Self-Attention (The Core of Transformers)
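In self-attention, every word in the sentence attends to every word in the same sentence (including itself) to understand context, so Q, K, and V are all computed from the same input sequence.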

**Formula:**
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

*   **Q (Query):** What am I looking for?
*   **K (Key):** What do I have to offer?
*   **V (Value):** What is the actual content?

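If a query in $Q$ matches a key in $K$ (high dot product), the softmax gives the corresponding row of $V$ a larger weight in the output. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension $d_k$, which would otherwise saturate the softmax.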
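
As a rough sketch of where Q, K, and V come from in self-attention, the snippet below projects one input tensor with three separate linear layers and then applies the formula. The layer names `w_q`, `w_k`, `w_v` and the size `d_model = 4` are illustrative assumptions, and the weights are randomly initialized rather than trained.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 4                        # embedding size (assumed for illustration)
x = torch.randn(1, 3, d_model)     # (batch, seq_len, d_model): one 3-token sentence

# Hypothetical projection layers; in a trained model these weights are learned.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

# Self-attention: all three come from the same input x.
q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (1, 3, 3)
weights = F.softmax(scores, dim=-1)                        # each row sums to 1
out = weights @ v                                          # (1, 3, d_model)

print("Weights shape:", weights.shape)
print("Output shape:", out.shape)
```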

In [3]:
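def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Compute scaled dot-product attention.

    q, k, v must have matching leading (batch) dimensions.
    q, k must have the same last dimension (d_k).
    k, v must have matching penultimate dimension, i.e. seq_len_k == seq_len_v.
    mask (optional) must broadcast to (..., seq_len_q, seq_len_k);
    1 marks a position to block, 0 a position to keep.

    Returns (output, attention_weights).
    """
    # Similarity of every query with every key: (..., seq_len_q, seq_len_k)
    matmul_qk = torch.matmul(q, k.transpose(-2, -1))

    # Scale by sqrt(d_k) so the logits do not grow with the key dimension
    dk = k.size(-1)
    scaled_attention_logits = matmul_qk / np.sqrt(dk)

    # Add a large negative value at masked positions (e.g. so the decoder cannot look ahead)
    if mask is not None:
        scaled_attention_logits = scaled_attention_logits + (mask * -1e9)

    # Softmax over the key axis turns the logits into weights that sum to 1
    attention_weights = F.softmax(scaled_attention_logits, dim=-1)

    # Weighted sum of the values
    output = torch.matmul(attention_weights, v)

    return output, attention_weights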

# Example Usage
torch.manual_seed(42)
temp_q = torch.randn(1, 3, 4)  # (Batch, Seq_Len, Dim)
temp_k = torch.randn(1, 3, 4)
temp_v = torch.randn(1, 3, 4)

output, weights = scaled_dot_product_attention(temp_q, temp_k, temp_v)
print("Output Shape:", output.shape)
print("Attention Weights:\n", weights.numpy())

Output Shape: torch.Size([1, 3, 4])
Attention Weights:
 [[[0.3017341  0.30983964 0.3884262 ]
  [0.2450955  0.3801395  0.37476498]
  [0.29378855 0.2293217  0.4768897 ]]]
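
The `mask` argument is not exercised in the example above. The sketch below shows one way to build a causal (look-ahead) mask with `torch.triu`, following the convention in the code that 1 means "blocked". The helper `attention` is a compact copy of the same computation so that the block runs on its own, and `causal_mask` is just an illustrative name.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Same computation as scaled_dot_product_attention, repeated so this block runs alone.
    logits = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    if mask is not None:
        logits = logits + mask * -1e9  # 1 = blocked position
    weights = F.softmax(logits, dim=-1)
    return torch.matmul(weights, v), weights

torch.manual_seed(42)
seq_len = 3
q = torch.randn(1, seq_len, 4)  # (Batch, Seq_Len, Dim)
k = torch.randn(1, seq_len, 4)
v = torch.randn(1, seq_len, 4)

# Ones strictly above the diagonal: position i may not attend to positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)

_, masked_weights = attention(q, k, v, mask=causal_mask)
print("Causal mask:\n", causal_mask)
print("Masked attention weights:\n", masked_weights)
```

With this mask, the weight matrix is lower triangular: the first token can only attend to itself, the second to the first two tokens, and so on.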


## 2. Using Transformers (Hugging Face)

Instead of training from scratch (which requires massive amounts of data and compute), we use pre-trained models such as BERT or GPT.

**Install the required library:** `pip install transformers`

In [4]:
# !pip install transformers
from transformers import pipeline

# 1. Sentiment Analysis (using a DistilBERT model by default)
classifier = pipeline("sentiment-analysis")
result = classifier("I absolutely loved the advanced machine learning course!")
print(f"Sentiment: {result}")

# 2. Masked Language Modeling (BERT)
# BERT is trained to predict missing words.
unmasker = pipeline('fill-mask', model='bert-base-uncased')
result_mask = unmasker("Artificial Intelligence is the [MASK] of the future.")

print("\nBERT Predictions for [MASK]:")
for r in result_mask[:3]:
    print(f"{r['token_str']}: {r['score']:.4f}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Sentiment: [{'label': 'POSITIVE', 'score': 0.999861478805542}]


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERT Predictions for [MASK]:
technology: 0.3353
science: 0.2559
reality: 0.0211
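
The log from the sentiment pipeline warns against relying on the default model. One possible way to address this is to pass the model name and revision explicitly. The values below are copied from that warning message; the dictionary `PINNED_SENTIMENT` and the function `build_pinned_classifier` are illustrative names, not part of the library.

```python
# Model name and revision copied from the pipeline warning in the cell output.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}

def build_pinned_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily: requires `pip install transformers`.
    from transformers import pipeline
    return pipeline(**PINNED_SENTIMENT)

# Usage (downloads the model on first call, so it needs network access):
# classifier = build_pinned_classifier()
# print(classifier("I absolutely loved the advanced machine learning course!"))
```

Pinning both values means a later change to the library's default model will not silently change the results.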


## 3. Tokenizer Visualization
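Transformer models use subword tokenizers: a word that is not in the vocabulary is split into smaller pieces that are. BERT uses WordPiece, which marks continuation pieces with a `##` prefix and wraps the input in the special tokens `[CLS]` and `[SEP]`.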

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text = "Transformer models are powerful."
encoded_input = tokenizer(text)

print("Original:", text)
print("Token IDs:", encoded_input['input_ids'])
print("Decoded Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))

Original: Transformer models are powerful.
Token IDs: [101, 13809, 23763, 3584, 1132, 3110, 119, 102]
Decoded Tokens: ['[CLS]', 'Trans', '##former', 'models', 'are', 'powerful', '.', '[SEP]']
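
To see how a split like `'Trans', '##former'` can arise, here is a simplified sketch of greedy longest-match-first subword splitting in the style of WordPiece. It is a toy under stated assumptions: the vocabulary `toy_vocab` is invented for this example (the real `bert-base-cased` vocabulary is far larger), and the function `wordpiece_tokenize` is a hypothetical helper, not the Hugging Face implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, invented for illustration only.
toy_vocab = {"Trans", "##former", "models", "are"}

print(wordpiece_tokenize("Transformer", toy_vocab))
print(wordpiece_tokenize("models", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, "Transformer" is not a single entry, so the longest matching prefix `Trans` is taken first and the remainder is matched as the continuation piece `##former`.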
