<a href="https://colab.research.google.com/github/divya2212001/colabs/blob/main/Transformer_Question.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!nvidia-smi

Thu Feb 26 04:39:22 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   64C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

In [2]:
import torch
torch.cuda.is_available()

True

# Transformer Translation: English to Hindi

**Goal:** Build a Transformer model to translate English to Hindi using **only PyTorch nn layers**.

**Dataset:** `Dataset_English_Hindi.csv` containing thousands of English-Hindi translation pairs.

### What You'll Learn
- How to build a translation model around PyTorch's built-in `nn.Transformer`
- Learned positional embeddings (like GPT-2)
- End-to-end sequence-to-sequence translation
- Training on one sentence pair at a time, so no padding is needed

### Architecture Overview
```
English → Token Embedding → + Position Embedding → Encoder
                                                      ↓
Hindi ← Output Projection ← Decoder ← + Position Embedding ← Token Embedding
```

### nn Layers Used
| Layer | Purpose |
|-------|--------|
| `nn.Embedding` | Token embeddings (words → vectors) |
| `nn.Embedding` | Learned positional embeddings |
| `nn.Transformer` | Complete encoder-decoder with attention |
| `nn.Linear` | Output projection to vocabulary |

In [3]:
import torch
import torch.nn as nn # Hint: PyTorch's neural network module
import pandas as pd

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 2.10.0+cu128


## 1. Import Libraries

* **Import Libraries:** Load the necessary libraries for building our Transformer model.

**What each library does:**
| Library | Description |
|---------|-------------|
| `torch` | Core PyTorch library for tensor operations and neural networks |
| `torch.nn` | Neural network building blocks (layers, loss functions, etc.) |
| `pandas` | Data manipulation library for loading CSV files |

**Documentation:**
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html) - Official PyTorch docs
- [torch.nn](https://pytorch.org/docs/stable/nn.html) - Neural network module
- [pandas](https://pandas.pydata.org/docs/) - Data analysis library

In [4]:
# Load dataset from CSV
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1KRn1LucS8rhz-VTZrdyb-sSDETmWCNhD")  # Hint: Function to read CSV files

# Remove rows with NaN values
df = df.dropna()  # Hint: Remove NaN rows

# Remove duplicate rows
df = df.drop_duplicates()  # Hint: Remove duplicate rows

# Convert to list of tuples
data = list(zip(df['English'].tolist(), df['Hindi'].tolist()))

# Use a subset for faster training (adjust as needed)
MAX_SAMPLES = 5000
data = data[:MAX_SAMPLES]

print(f"Loaded {len(data)} translation pairs")
print("\nSample Examples:")
for i, (eng, hin) in enumerate(data[:5], 1):
    print(f"{i}. {eng[:40]:40} → {hin[:40]}")

Loaded 5000 translation pairs

Sample Examples:
1. Help!                                    → बचाओ!
2. Jump.                                    → उछलो.
3. Jump.                                    → कूदो.
4. Jump.                                    → छलांग.
5. Hello!                                   → नमस्ते।


## 2. Load Dataset from CSV

* **Load Translation Data:** This cell loads the English-Hindi dataset and displays sample translations.

**Hints:**
- Use `pd.read_csv()` to load CSV files
- Use `.dropna()` to remove missing values
- Use `.drop_duplicates()` to remove duplicate rows
- Adjust `MAX_SAMPLES` to control dataset size

**What each function does:**
| Function | Description |
|----------|-------------|
| `pd.read_csv('file.csv')` | Reads a CSV file and returns a DataFrame with rows and columns |
| `df.dropna()` | Removes rows that have NaN (missing) values |
| `df.drop_duplicates()` | Removes duplicate rows from the DataFrame |
| `list(zip(a, b))` | Combines two lists into list of tuples [(a1,b1), (a2,b2), ...] |
| `len(data)` | Returns the number of items in the list |

**Documentation:**
- [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) - Read CSV files
- [DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) - Remove missing values
- [DataFrame.drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) - Remove duplicates
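
The cleaning steps can be tried on a tiny in-memory table before running them on the full CSV. This is a minimal sketch: the column names match the dataset, but the rows and the `clean` and `toy_pairs` names are made up for illustration.

```python
import pandas as pd

# Toy rows: one missing value and one exact duplicate
df = pd.DataFrame({
    "English": ["Jump.", "Jump.", "Jump.", "Hello!", None],
    "Hindi":   ["उछलो.", "कूदो.", "कूदो.", "नमस्ते।", "बचाओ!"],
})

# Drop rows with missing values, then rows that repeat an earlier row exactly
clean = df.dropna().drop_duplicates().reset_index(drop=True)
toy_pairs = list(zip(clean["English"], clean["Hindi"]))

print(len(df), "->", len(toy_pairs))  # 5 -> 3
print(toy_pairs[:2])
```

`drop_duplicates()` compares whole rows, so the same English sentence with two different Hindi translations is kept twice. That is why "Jump." appears several times in the sample output above.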

In [5]:
def build_vocab(sentences):
    vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2}
    for sentence in sentences:
        for word in sentence.split():  # Hint: Split sentence into words
            if word not in vocab:  # Hint: Check if word is NOT in dictionary
                vocab[word] = len(vocab)
    return vocab

# Create vocabularies
eng_vocab = build_vocab([p[0] for p in data])
hin_vocab = build_vocab([p[1] for p in data])

# Reverse mappings
eng_idx2word = {v: k for k, v in eng_vocab.items()}  # Hint: Get key-value pairs
hin_idx2word = {v: k for k, v in hin_vocab.items()}  # Hint: Get key-value pairs

print(f"English vocab: {len(eng_vocab)} words")
print(f"Hindi vocab: {len(hin_vocab)} words")

English vocab: 11213 words
Hindi vocab: 10865 words


## 3. Build Vocabulary

* **Create Word-to-Index Mappings:** This cell builds vocabularies that map words to unique numbers.

**Hints:**
- Start with special tokens: `<SOS>=0` (start), `<EOS>=1` (end), `<UNK>=2` (unknown)
- Use `word not in vocab` to detect a new word, then assign it the next free index with `len(vocab)`
- Build reverse mapping with dictionary comprehension

**What each function does:**
| Function | Description |
|----------|-------------|
| `dict.get(key, default)` | Gets value for key if exists, otherwise returns default value |
| `str.split()` | Splits string into list of words by whitespace |
| `len(vocab)` | Returns number of unique words in vocabulary |
| `{v: k for k, v in dict.items()}` | Creates reverse mapping (swap keys and values) |

**Special Tokens:**
| Token | Index | Purpose |
|-------|-------|----------|
| `<SOS>` | 0 | Start of sentence - signals beginning of translation |
| `<EOS>` | 1 | End of sentence - signals translation is complete |
| `<UNK>` | 2 | Unknown word - used for words not in vocabulary |

**Documentation:**
- [dict.get](https://docs.python.org/3/library/stdtypes.html#dict.get) - Safe dictionary lookup
- [str.split](https://docs.python.org/3/library/stdtypes.html#str.split) - Split string
- [Dictionary comprehension](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) - Create dicts
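
To see how indices are assigned, the same vocabulary-building logic can be run on two toy sentences. This is a small sketch with invented sentences; `idx2word` is an illustrative name for the reverse mapping.

```python
def build_vocab(sentences):
    vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free index
    return vocab

vocab = build_vocab(["I like tea", "I like coffee"])
idx2word = {idx: word for word, idx in vocab.items()}

print(vocab["tea"], vocab["coffee"])  # 5 6
print(idx2word[4])                    # like
print(len(vocab))                     # 7
```

Repeated words ("I", "like") keep the index they received the first time, so the vocabulary grows only when a new word appears.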

## 4. Convert Sentences to Tensors

* **Text to Numbers:** This cell converts sentences from words into sequences of numbers (indices).

**Hints:**
- Use `vocab.get(word, 2)` to lookup word index (2 is UNK token)
- Truncate long sentences to `MAX_SEQ_LEN - 2` (leave room for special tokens)
- Add `<SOS>` at start of Hindi, `<EOS>` at end of both
- Use `torch.tensor()` to create PyTorch tensors
- Use `list.append()` to add items to list

**What each function does:**
| Function | Description |
|----------|-------------|
| `str.split()` | Splits sentence into list of words |
| `vocab.get(word, 2)` | Looks up word's index, returns 2 (UNK) if word not found |
| `list[:n]` | Truncates list to first n elements |
| `list.append(item)` | Adds item to end of list |
| `torch.tensor(data, dtype)` | Creates PyTorch tensor from Python list |
| `tensor.tolist()` | Converts tensor back to Python list for printing |

**Why truncate?**
Positional embeddings have a maximum length (`MAX_SEQ_LEN=200`). A longer sequence would produce position indices outside the embedding table, and the `nn.Embedding` lookup would fail.

**Documentation:**
- [torch.tensor](https://pytorch.org/docs/stable/generated/torch.tensor.html) - Create tensors
- [Tensor.tolist](https://pytorch.org/docs/stable/generated/torch.Tensor.tolist.html) - Tensor to list
- [list.append](https://docs.python.org/3/tutorial/datastructures.html) - Add to list
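
The truncation arithmetic is easier to check with plain lists and a small limit. The sketch below mirrors the steps in the hints but returns Python lists instead of tensors. The `encode` helper, the toy vocabulary, and the limit of 8 are assumptions made for illustration (the notebook uses 200).

```python
SOS, EOS, UNK = 0, 1, 2
MAX_SEQ_LEN = 8  # small value for illustration only

def encode(sentence, vocab, add_sos=False):
    ids = [vocab.get(word, UNK) for word in sentence.split()]
    ids = ids[:MAX_SEQ_LEN - 2]   # leave room for SOS and EOS
    if add_sos:
        ids = [SOS] + ids
    return ids + [EOS]

vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2, "go": 3, "away": 4}

print(encode("go away", vocab))                    # [3, 4, 1]
print(encode("go far away", vocab, add_sos=True))  # [0, 3, 2, 4, 1]

long_ids = encode("go " * 50, vocab, add_sos=True)
print(len(long_ids))                               # 8
```

Even a 50-word sentence ends up with exactly `MAX_SEQ_LEN` indices once both special tokens are added, so every position stays inside the positional embedding table.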

In [6]:
MAX_SEQ_LEN = 200

def sentence_to_tensor(sentence, vocab, add_sos=False, add_eos=True):
    tokens = str(sentence).strip().split()
    indices = [vocab.get(word, 2) for word in tokens]  # Hint: Safe dict lookup with default

    # Truncate if too long (leave room for SOS/EOS tokens)
    max_tokens = MAX_SEQ_LEN - 2
    if len(indices) > max_tokens:
        indices = indices[:max_tokens]

    if add_sos:
        indices = [0] + indices  # SOS token index is 0
    if add_eos:
        indices.append(1)  # Hint: Add item to end of list (EOS token)
    return torch.tensor(indices, dtype=torch.long)  # Hint: Create PyTorch tensor

# Convert all data
pairs = []
for eng, hin in data:
    src = sentence_to_tensor(eng, eng_vocab, add_sos=False, add_eos=True)
    tgt = sentence_to_tensor(hin, hin_vocab, add_sos=True, add_eos=True)
    pairs.append((src, tgt))  # Hint: Add tuple to list

print(f"Converted {len(pairs)} sentence pairs to tensors")
print(f"Max sequence length: {MAX_SEQ_LEN}")
print(f"Example: '{data[0][0]}' → {pairs[0][0].tolist()}")

Converted 5000 sentence pairs to tensors
Max sequence length: 200
Example: 'Help!' → [3, 1]


## 5. Build Transformer Model

* **Define Architecture:** This cell creates the complete Transformer model using PyTorch's `nn` layers.

**Hints:**
- Use `nn.Embedding` for both token and position embeddings
- Use `nn.Transformer` for the encoder-decoder architecture
- Use `nn.Linear` for the final output projection
- Use `torch.arange()` to create position indices

**What each layer does:**
| Layer | Description |
|-------|-------------|
| `nn.Module` | Base class for all neural networks in PyTorch |
| `nn.Embedding(vocab_size, d_model)` | Converts word indices to dense vectors of size d_model |
| `nn.Transformer(...)` | Complete encoder-decoder with multi-head attention |
| `nn.Linear(d_model, vocab_size)` | Projects hidden states to one logit (unnormalized score) per vocabulary word |
| `torch.arange(n, device=device)` | Creates tensor [0, 1, 2, ..., n-1] on specified device |
| `nn.Transformer.generate_square_subsequent_mask()` | Creates causal mask (prevents looking at future tokens) |

**Transformer Components:**
1. **Token Embeddings**: Convert word indices to vectors
2. **Position Embeddings**: Add position information (learned, not fixed)
3. **Encoder**: Processes English sentence with self-attention
4. **Decoder**: Generates Hindi translation with cross-attention to encoder
5. **Output Projection**: Converts hidden states to per-word logits (softmax turns these into probabilities)

**Documentation:**
- [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) - Base neural network class
- [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) - Embedding layer
- [nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) - Transformer model
- [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) - Linear layer
- [torch.arange](https://pytorch.org/docs/stable/generated/torch.arange.html) - Create sequence
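
Two of the pieces above, learned positional embeddings and the causal mask, can be inspected in isolation. This is a minimal sketch with toy sizes rather than the notebook's. The mask is built by hand with `torch.triu`, which is assumed here to match the pattern that `generate_square_subsequent_mask` returns (zeros on and below the diagonal, `-inf` above it).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_len, d_model = 10, 6, 8  # toy sizes for illustration

tok_emb = nn.Embedding(vocab_size, d_model)  # one vector per word index
pos_emb = nn.Embedding(max_len, d_model)     # one learned vector per position

tokens = torch.tensor([[3, 7, 1]])                     # shape (batch=1, seq=3)
positions = torch.arange(tokens.size(1)).unsqueeze(0)  # [[0, 1, 2]]
x = tok_emb(tokens) + pos_emb(positions)               # shape (1, 3, 8)

# Causal mask: -inf above the diagonal blocks attention to later positions
seq_len = tokens.size(1)
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

print(x.shape)  # torch.Size([1, 3, 8])
print(causal_mask)
```

Row `i` of the mask describes what position `i` may attend to: itself and everything before it, never anything after.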

In [7]:
class TranslationTransformer(nn.Module):  # Hint: Base class for PyTorch models
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=128, nhead=4,
                 num_layers=3, dim_feedforward=256, dropout=0.1, max_len=200):
        super().__init__()
        self.d_model = d_model

        # Token Embeddings (nn.Embedding)
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Hint: Converts indices to vectors
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Hint: Same layer type

        # Learned Positional Embeddings (nn.Embedding) - like GPT-2
        self.src_pos_embedding = nn.Embedding(max_len, d_model)  # Hint: Position embeddings
        self.tgt_pos_embedding = nn.Embedding(max_len, d_model)  # Hint: Same layer type


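        # PyTorch's built-in Transformer (nn.Transformer)
        # Note: dim_feedforward and dropout are not forwarded here, so
        # nn.Transformer falls back to its defaults (2048 and 0.1). Pass
        # dim_feedforward=dim_feedforward and dropout=dropout to honor them.
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True
        )  # Hint: Complete encoder-decoder architecture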

        # Output projection (nn.Linear)
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)  # Hint: Fully connected layer

    def forward(self, src, tgt):
        src_seq_len = src.size(1)
        tgt_seq_len = tgt.size(1)

        # Create position indices
        src_positions = torch.arange(src_seq_len, device=src.device).unsqueeze(0)  # Hint: Create sequence 0,1,2,...
        tgt_positions = torch.arange(tgt_seq_len, device=tgt.device).unsqueeze(0)  # Hint: Same function

        # Token embeddings + Positional embeddings
        src_emb = self.src_embedding(src) + self.src_pos_embedding(src_positions)
        tgt_emb = self.tgt_embedding(tgt) + self.tgt_pos_embedding(tgt_positions)

        # Create causal mask for decoder (prevents looking at future tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_seq_len).to(tgt.device)

        # Transformer forward pass (no padding masks needed)
        output = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask)

        return self.fc_out(output)

# Create model
model = TranslationTransformer(
    src_vocab_size=len(eng_vocab),
    tgt_vocab_size=len(hin_vocab),
    d_model=128,
    nhead=4,
    num_layers=3,
    dim_feedforward=256,
    dropout=0.1,
    max_len=MAX_SEQ_LEN
)

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model created with {total_params:,} parameters")
print(f"Using device: {device}")
print(f"\nnn Layers used:")
print(f"  - nn.Embedding (token + position embeddings)")
print(f"  - nn.Transformer (encoder-decoder)")
print(f"  - nn.Linear (output projection)")
print(f"  - nn.Dropout")

Model created with 8,036,337 parameters
Using device: cuda

nn Layers used:
  - nn.Embedding (token + position embeddings)
  - nn.Transformer (encoder-decoder)
  - nn.Linear (output projection)
  - nn.Dropout
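
The reported parameter count can be reproduced by hand. The constructor accepts `dim_feedforward=256` but never passes it to `nn.Transformer`, so the sketch below assumes the library default of 2048 for the feedforward width. The helper functions are illustrative names, not PyTorch APIs.

```python
def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

def layer_norm(d):
    return 2 * d                 # scale + shift

def attention(d):
    # packed Q/K/V projection plus the output projection
    return linear(d, 3 * d) + linear(d, d)

def feedforward(d, d_ff):
    return linear(d, d_ff) + linear(d_ff, d)

d, d_ff, layers = 128, 2048, 3
src_vocab, tgt_vocab, max_len = 11213, 10865, 200

encoder_layer = attention(d) + feedforward(d, d_ff) + 2 * layer_norm(d)
decoder_layer = 2 * attention(d) + feedforward(d, d_ff) + 3 * layer_norm(d)
# The encoder and decoder stacks each end with one extra LayerNorm
transformer = layers * (encoder_layer + decoder_layer) + 2 * layer_norm(d)

embeddings = (src_vocab + tgt_vocab + 2 * max_len) * d
total = embeddings + transformer + linear(d, tgt_vocab)
print(f"{total:,}")  # 8,036,337
```

Setting `d_ff = 256` in the same arithmetic gives 5,273,073, so the match with the printed figure indicates the feedforward layers are running at width 2048. Over half of the parameters sit in the two token embeddings and the output projection, which all scale with vocabulary size.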


## 6. Train the Model

* **Training Loop:** This cell trains the Transformer model on the translation pairs.

**Hints:**
- Use `nn.CrossEntropyLoss()` to measure prediction error
- Use `torch.optim.Adam()` for adaptive learning rate optimization
- Use `optimizer.zero_grad()` before each forward pass
- Use `loss.backward()` to compute gradients
- Use `optimizer.step()` to update weights
- Use `torch.nn.utils.clip_grad_norm_()` to prevent exploding gradients

**What each function does:**
| Function | Description |
|----------|-------------|
| `nn.CrossEntropyLoss()` | Loss between predicted logits and target word indices (applies log-softmax internally) |
| `torch.optim.Adam(params, lr)` | Optimizer that updates weights; lr=learning rate |
| `optimizer.zero_grad()` | Clears old gradients (required before each training step) |
| `model(src, tgt)` | Forward pass through the model |
| `loss.backward()` | Computes gradients via backpropagation |
| `torch.nn.utils.clip_grad_norm_(params, max_norm)` | Rescales gradients so their total norm does not exceed `max_norm` |
| `optimizer.step()` | Updates model weights using computed gradients |
| `tensor.unsqueeze(0)` | Adds batch dimension (e.g., [5] → [1,5]) |

**Training Process:**
```
For each epoch:
  For each sentence pair:
    1. Forward pass → get predictions
    2. Calculate loss (error)
    3. Backward pass → compute gradients
    4. Update weights
```

**Documentation:**
- [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) - Loss function
- [torch.optim.Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) - Adam optimizer
- [Tensor.backward](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html) - Backpropagation
- [clip_grad_norm_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html) - Gradient clipping
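
Steps 1 and 2 of the training process rely on shifting the target by one position: the decoder is fed everything except the last token and is asked to predict everything except the first. The sketch below shows that shift on a made-up four-token target. It then computes the loss for all-zero logits, a stand-in (an assumption for illustration) for a model that guesses every Hindi word with equal probability.

```python
import math
import torch
import torch.nn as nn

tgt = torch.tensor([[0, 5, 6, 1]])  # <SOS>, two word indices, <EOS>
tgt_input = tgt[:, :-1]             # [[0, 5, 6]] is fed to the decoder
tgt_output = tgt[:, 1:]             # [[5, 6, 1]] is what it should predict

vocab_size = 10865                  # Hindi vocabulary size printed earlier
logits = torch.zeros(1, tgt_input.size(1), vocab_size)  # uniform guess

criterion = nn.CrossEntropyLoss()
# Flatten to (positions, vocab) and (positions,) as the loss expects
loss = criterion(logits.reshape(-1, vocab_size), tgt_output.reshape(-1))

print(tgt_input.tolist(), tgt_output.tolist())
print(round(loss.item(), 2), round(math.log(vocab_size), 2))  # both about 9.29
```

A uniform guess costs `ln(vocab_size)`, roughly 9.29 here, which gives a reference point for reading the loss values printed during training.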

In [8]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()  # Hint: Loss function for classification
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)  # Hint: Adaptive optimizer

def train_step(src, tgt):
    model.train()
    optimizer.zero_grad()  # Hint: Clear old gradients

    # Decoder input: all except last token
    # Target: all except first token (SOS)
    tgt_input = tgt[:, :-1]
    tgt_output = tgt[:, 1:]

    output = model(src, tgt_input)
    loss = criterion(output.reshape(-1, output.size(-1)), tgt_output.reshape(-1))

    loss.backward()  # Hint: Compute gradients via backpropagation
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # Prevent exploding gradients
    optimizer.step()  # Hint: Update weights
    return loss.item()

# Training loop (sample by sample - no padding needed)
EPOCHS = 1  # A single epoch is only a quick smoke test; increase for better convergence
print(f"Training for {EPOCHS} epochs on {len(pairs)} pairs...\n")

for epoch in range(1, EPOCHS + 1):
    total_loss = 0
    for src, tgt in pairs:
        src = src.unsqueeze(0).to(device)  # Hint: Add batch dimension
        tgt = tgt.unsqueeze(0).to(device)  # Hint: Same function
        loss = train_step(src, tgt)
        total_loss += loss

    avg_loss = total_loss / len(pairs)

    if epoch % 10 == 0 or epoch == 1:
        print(f"Epoch {epoch:3d}: Loss = {avg_loss:.4f}")

print("\nTraining completed!")

Training for 1 epochs on 5000 pairs...

Epoch   1: Loss = 6.7454

Training completed!


## 7. Translation Function

* **Define Inference:** This cell creates a function to translate English sentences to Hindi.

**Hints:**
- Use `model.eval()` to switch to evaluation mode
- Use `torch.no_grad()` to disable gradient computation (faster)
- Start with `<SOS>` token (index 0)
- Use `torch.softmax()` to convert logits to probabilities
- Use `.argmax()` to get the most likely word
- Stop when `<EOS>` token (index 1) is predicted
- Use `torch.cat()` to append new token to sequence

**What each function does:**
| Function | Description |
|----------|-------------|
| `model.eval()` | Switch to evaluation mode (disables dropout, etc.) |
| `torch.no_grad()` | Context manager that disables gradient tracking (faster, less memory) |
| `torch.softmax(x, dim)` | Converts logits to probabilities (sum to 1.0) |
| `tensor.argmax(dim)` | Returns index of highest value (the prediction) |
| `tensor.item()` | Converts single-element tensor to Python number |
| `dict.get(key, default)` | Safe dictionary lookup, returns default if key not found |
| `torch.cat([a, b], dim)` | Concatenates tensors along specified dimension |
| `' '.join(list)` | Combines list of words into single string with spaces |

**Translation Process:**
```
1. Encode English sentence
2. Start with <SOS> token
3. Loop:
   - Predict next Hindi word
   - If <EOS>, stop
   - Otherwise, append word and continue
4. Return Hindi sentence
```

**Documentation:**
- [torch.no_grad](https://pytorch.org/docs/stable/generated/torch.no_grad.html) - Disable gradients
- [torch.softmax](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) - Softmax function
- [Tensor.argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html) - Get max index
- [torch.cat](https://pytorch.org/docs/stable/generated/torch.cat.html) - Concatenate tensors
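
The decoding loop itself does not depend on the Transformer, so it can be sketched with a stand-in scoring function. Here `toy_scores` is a hypothetical replacement for the model: it simply "wants" to emit tokens 3, 4 and then `<EOS>`. Plain lists are used instead of tensors.

```python
SOS, EOS = 0, 1

def toy_scores(prefix):
    """Hypothetical stand-in for the model: scores over a 5-token vocabulary."""
    script = [3, 4, EOS]    # tokens the toy "model" emits, in order
    step = len(prefix) - 1  # the prefix always starts with SOS
    target = script[min(step, len(script) - 1)]
    return [1.0 if tok == target else 0.0 for tok in range(5)]

def greedy_decode(score_fn, max_len=10):
    prefix = [SOS]
    for _ in range(max_len):
        scores = score_fn(prefix)
        next_tok = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_tok == EOS:
            break
        prefix.append(next_tok)  # feed the prediction back in
    return prefix[1:]            # drop SOS

print(greedy_decode(toy_scores))  # [3, 4]
```

Softmax preserves the ordering of the logits, so taking `argmax` of the raw logits selects the same token as taking it after `torch.softmax`. The softmax step matters only when the probabilities themselves are needed.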






In [9]:
def translate(sentence, max_len=50):
    model.eval()  # Hint: Switch to evaluation mode
    src = sentence_to_tensor(sentence, eng_vocab).unsqueeze(0).to(device)
    tgt = torch.tensor([[0]]).to(device)  # Start with SOS token

    words = []
    with torch.no_grad():  # Hint: Disable gradient tracking
        for i in range(max_len):
            output = model(src, tgt)

            # Get probabilities and pick the most likely token
            probs = torch.softmax(output[:, -1, :], dim=-1)  # Hint: Convert to probabilities
            next_token = probs.argmax(dim=-1)  # Hint: Get index of max value

            # Stop if EOS is predicted (but not on first token if we have nothing)
            if next_token.item() == 1 and len(words) > 0:
                break

            # Skip EOS/SOS tokens in output
            if next_token.item() not in [0, 1]:
                words.append(hin_idx2word.get(next_token.item(), '<UNK>'))

            tgt = torch.cat([tgt, next_token.unsqueeze(0)], dim=1)  # Hint: Concatenate tensors

            # Safety: stop if we keep getting special tokens
            if i > 5 and len(words) == 0:
                break

    return ' '.join(words) if words else "(no translation)"  # Hint: Join list into string

# Test on a fixed slice of the dataset (pairs 10-14, which the model also saw during training)
test_samples = data[10:15]

print("Translation Results (random samples):")
print("=" * 80)
for eng, hin in test_samples:
    pred = translate(eng)
    print(f"EN: {eng[:40]:40}")
    print(f"HI: {hin[:50]}")
    print(f"PR: {pred[:50]}")
    print("-" * 80)

Translation Results (random samples):
EN: Awesome!                                
HI: बहुत बढ़िया!
PR: समालोचना
--------------------------------------------------------------------------------
EN: Come in.                                
HI: अंदर आ जाओ।
PR: इस
--------------------------------------------------------------------------------
EN: Get out!                                
HI: बाहर निकल जाओ!
PR: इस
--------------------------------------------------------------------------------
EN: Go away!                                
HI: चले जाओ!
PR: इस
--------------------------------------------------------------------------------
EN: Goodbye!                                
HI: ख़ुदा हाफ़िज़।
PR: समालोचना
--------------------------------------------------------------------------------


## 8. Try Your Own Translations!

* **Interactive Testing:** Try translating your own English sentences to Hindi.

**Hints:**
- Modify `test_sentences` to add your own examples
- Model works best on short, simple sentences
- Words missing from the vocabulary are mapped to `<UNK>`, so the model cannot translate them

**Best practices:**
- Use simple grammar
- Keep sentences short (< 10 words)
- Use common words from everyday conversation

**Things to try:**
- Experiment with different sentence structures
- Compare translations to see what the model learned
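
Before testing a new sentence, it can help to check which of its words would fall back to `<UNK>`. The helper below is a small sketch; `unknown_words` and `toy_vocab` are illustrative names, and the real check would use `eng_vocab`.

```python
def unknown_words(sentence, vocab):
    """Return the whitespace-separated tokens that are not in the vocabulary."""
    return [word for word in sentence.split() if word not in vocab]

toy_vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2,
             "Hello!": 3, "How": 4, "are": 5, "you?": 6}

print(unknown_words("Hello!", toy_vocab))       # []
print(unknown_words("Hello", toy_vocab))        # ['Hello']
print(unknown_words("How are you", toy_vocab))  # ['you']
```

Because sentences are split on whitespace only, punctuation stays attached to words: "Hello!" and "Hello" are different vocabulary entries, and dropping the question mark is enough to turn a known word into an unknown one.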

In [10]:
# Try your own translation
test_sentences = [
    "Hello!",
    "How are you?",
    "Thank you.",
    "Good morning!",
    "What is the issue"
]

print("Custom Translations:")
print("=" * 60)
for test in test_sentences:
    result = translate(test)  # Hint: Call the translation function
    print(f"Input:  {test}")
    print(f"Output: {result}")
    print("-" * 60)

print(f"\nVocabulary sizes:")
print(f"  English: {len(eng_vocab)} words")
print(f"  Hindi: {len(hin_vocab)} words")

Custom Translations:
Input:  Hello!
Output: समालोचना
------------------------------------------------------------
Input:  How are you?
Output: इस
------------------------------------------------------------
Input:  Thank you.
Output: इस
------------------------------------------------------------
Input:  Good morning!
Output: यह
------------------------------------------------------------
Input:  What is the issue
Output: यह
------------------------------------------------------------

Vocabulary sizes:
  English: 11213 words
  Hindi: 10865 words



## Summary

### What We Built
**A complete Transformer-based English-Hindi translation system using pure PyTorch `nn` layers!**

### How It Works
```
English: "Hello"
   ↓
Tokens: [word_idx, EOS]  
   ↓
Embeddings: [128-dim vectors]
   ↓
+ Position: [learned position vectors]
   ↓
Encoder: [multi-head attention + feedforward]
   ↓
Encoder Output: [one context-aware vector per input token]
   ↓
Decoder: [generates Hindi word-by-word]
   ↓
Hindi: "नमस्ते"
```

### PyTorch `nn` Layers Used
| Layer | Purpose |
|-------|--------|
| `nn.Embedding` | Token embeddings (words → vectors) |
| `nn.Embedding` | Learned positional embeddings |
| `nn.Transformer` | Complete encoder-decoder with attention |
| `nn.Linear` | Project to vocabulary size |
| `nn.CrossEntropyLoss` | Training loss function |

### Key Concepts
1. **Token Embeddings**: Words → Dense vectors
2. **Positional Embeddings**: Position → Learned vectors (like GPT-2)
3. **Attention**: Model can "look at" any input word when generating output
4. **Causal Masking**: Decoder can't cheat by looking at future output
5. **Autoregressive**: Generate one word at a time

### Training Tips
- Adjust `MAX_SAMPLES` to train on more/fewer examples
- Increase `EPOCHS` for better convergence
- Use GPU (`cuda`) for faster training on large datasets

### Resources
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Original Transformer paper
- [PyTorch nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) - Official docs
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) - Visual guide
- [GPT-2 Positional Embeddings](https://openai.com/blog/better-language-models/) - Learned positions

**Congratulations!** You built a Transformer translation system using pure PyTorch nn layers!