## Objective

Implement a **Sequence-to-Sequence (Seq2Seq)** model with **Bahdanau attention** that learns to **reverse the word order of sentences**. You may use PyTorch.

> Example task:  
> **Input:** `"the cat sat"`  
> **Output:** `"sat cat the"`
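
Training pairs for this task can be generated synthetically. The sketch below is one possible way to do it; the toy vocabulary, the helper names `reverse_words` and `make_pair`, and the length limits are illustrative assumptions, not part of the assignment.

```python
import random
from typing import Tuple

# Hypothetical toy vocabulary; any word list would work.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]


def reverse_words(sentence: str) -> str:
    """Return the sentence with its word order reversed."""
    return " ".join(reversed(sentence.split()))


def make_pair(rng: random.Random, min_len: int = 2, max_len: int = 6) -> Tuple[str, str]:
    """Sample a random source sentence and build its reversed target."""
    length = rng.randint(min_len, max_len)
    source = " ".join(rng.choice(VOCAB) for _ in range(length))
    return source, reverse_words(source)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled pairs are reproducible
    for _ in range(3):
        print(make_pair(rng))
```

Seeding the generator separately for the train, validation, and test splits keeps the splits reproducible.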

---

## Part 1 — Model Architecture

### Requirements
1. **Encoder**
   - Implement a GRU-based encoder.  
   - Input: tokenized source sentence.  
   - Output: sequence of hidden states.

2. **Attention Mechanism (Bahdanau)**
   - Compute alignment scores between the current decoder hidden state and all encoder outputs.  
   - Apply softmax to get attention weights.  
   - Derive a context vector as the weighted sum of encoder outputs.

3. **Decoder**
   - Implement a GRU decoder that uses the context vector at each step.  
   - At each step, predict the next word of the reversed sequence.
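
The three components above could be sketched in PyTorch as follows. This is a minimal sketch under several assumptions: the class names, the single-layer unidirectional GRU (the original Bahdanau setup uses a bidirectional encoder), a padding index of 0, and the additive score `v^T tanh(W_q s + W_k h)` are choices made here for brevity. Padded positions are masked in the attention but the encoder is not packed, so a real solution may want to handle padding in the encoder as well.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len)
        # outputs: (batch, src_len, hid), hidden: (1, batch, hid)
        outputs, hidden = self.gru(self.embedding(src))
        return outputs, hidden


class BahdanauAttention(nn.Module):
    """Additive attention: score = v^T tanh(W_q s + W_k h)."""

    def __init__(self, hid_dim):
        super().__init__()
        self.W_q = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_k = nn.Linear(hid_dim, hid_dim, bias=False)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, query, keys, mask):
        # query: (batch, hid), keys: (batch, src_len, hid)
        # mask: (batch, src_len) bool, True for real tokens
        energy = torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys))
        scores = self.v(energy).squeeze(-1)                 # (batch, src_len)
        scores = scores.masked_fill(~mask, float("-inf"))   # ignore padding
        weights = F.softmax(scores, dim=-1)                 # attention weights
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # (batch, hid)
        return context, weights


class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.attention = BahdanauAttention(hid_dim)
        self.gru = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, hidden, enc_outputs, mask):
        # token: (batch,) previous target token, hidden: (1, batch, hid)
        context, weights = self.attention(hidden[-1], enc_outputs, mask)
        rnn_in = torch.cat([self.embedding(token), context], dim=-1).unsqueeze(1)
        output, hidden = self.gru(rnn_in, hidden)
        logits = self.out(output.squeeze(1))                # (batch, vocab)
        return logits, hidden, weights
```

The decoder processes one step per call, which makes it easy to collect the attention weights of every step for the heatmap in Part 3.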

---

## Part 2 — Training Loop

### Requirements
Implement a full **training loop** that includes:

- **Loss:** Cross-entropy loss with padding mask (ignore padded tokens).  
- **Optimization:** Implement the **Adam optimizer** manually rather than using a built-in optimizer class.  
- **Gradient Clipping:** Apply **max-norm clipping** so that the gradient norm is at most 1.0.  
- **Teacher Forcing:** Use teacher forcing during training.  
- **Model Saving:** Save the best model based on validation loss.  
- **Logging:** Print training and validation loss for each epoch.
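
A possible shape for the core of this loop is sketched below: a hand-written Adam update, a cross-entropy loss that ignores padding, gradient-norm clipping, and one teacher-forced step. The names `ManualAdam` and `train_step`, the padding index `PAD_IDX = 0`, and the `model(src, dec_in)` call signature are assumptions made for illustration; the step counter is shared across parameters, which is a simplification.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index


class ManualAdam:
    """Adam with bias correction, written out instead of using torch.optim."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = [p for p in params if p.requires_grad]
        self.lr, self.betas, self.eps = lr, betas, eps
        self.t = 0  # number of update steps taken so far
        self.m = [torch.zeros_like(p) for p in self.params]  # first moments
        self.v = [torch.zeros_like(p) for p in self.params]  # second moments

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for p, m, v in zip(self.params, self.m, self.v):
            if p.grad is None:
                continue
            g = p.grad
            m.mul_(b1).add_(g, alpha=1 - b1)          # m = b1*m + (1-b1)*g
            v.mul_(b2).addcmul_(g, g, value=1 - b2)   # v = b2*v + (1-b2)*g^2
            m_hat = m / (1 - b1 ** self.t)            # bias-corrected moments
            v_hat = v / (1 - b2 ** self.t)
            p.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)


def train_step(model, optimizer, criterion, src, tgt, max_norm=1.0):
    """One teacher-forced update; `model(src, dec_in)` is a hypothetical signature."""
    model.train()
    optimizer.zero_grad()
    # Teacher forcing: the decoder is fed the gold prefix and predicts the next token.
    dec_in, labels = tgt[:, :-1], tgt[:, 1:]
    logits = model(src, dec_in)  # expected shape: (batch, tgt_len - 1, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(optimizer.params, max_norm)
    optimizer.step()
    return loss.item()


# Padded target positions contribute nothing to the loss.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
```

For model saving, one option is to call `torch.save(model.state_dict(), path)` whenever the validation loss of the current epoch is lower than the best value seen so far, and to print both losses at the end of each epoch.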

---

## Part 3 — Evaluation & Visualization

After training, evaluate the model on a test set and report:

1. **Qualitative Examples**
   Show at least **10 examples** in the following format:

   > Input: "the cat sat"  
   > Output: "sat cat the"  
   > Reference: "sat cat the"  
   > `match` or `no match`

2. **Quantitative Metric**
   - Compute **exact match accuracy** across the test set.

3. **Attention Visualization**
   - Plot a **heatmap** showing attention weights.  
   - X-axis → encoder tokens  
   - Y-axis → decoder steps  
   - Save as `attention_heatmap.png`
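
The reporting steps above could be covered by helpers along these lines. This is a sketch, not a required interface: the function names are hypothetical, exact match is assumed to compare whitespace-normalized token sequences, and `matplotlib` is assumed to be available for the heatmap (it is imported lazily so the metric helpers work without it).

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose tokens equal the reference tokens exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)


def format_example(source: str, prediction: str, reference: str) -> str:
    """Render one qualitative example in the Input/Output/Reference format."""
    verdict = "match" if prediction.split() == reference.split() else "no match"
    return (
        f'Input: "{source}"\n'
        f'Output: "{prediction}"\n'
        f'Reference: "{reference}"\n'
        f"{verdict}"
    )


def plot_attention(weights, src_tokens, tgt_tokens, path="attention_heatmap.png"):
    """Save a heatmap; `weights` has shape (len(tgt_tokens), len(src_tokens))."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    im = ax.imshow(weights, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45, ha="right")
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)
    ax.set_xlabel("Encoder tokens")
    ax.set_ylabel("Decoder steps")
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


if __name__ == "__main__":
    print(format_example("the cat sat", "sat cat the", "sat cat the"))
    print(exact_match_accuracy(["sat cat the", "a b"], ["sat cat the", "b a"]))
```

If the attention weights come from a PyTorch model, converting them with `weights.detach().cpu().numpy()` before plotting avoids device and autograd issues.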

---

## Part 4 — Analysis

Write a short analysis answering the following questions:

- What patterns do you observe in the attention weights?  
- Does the attention align input and output tokens correctly?  
- How does attention help the model learn to reverse sequences?  
- What happens at the beginning and end of sequences?
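
One way to back the analysis with a number is to check where each decoder step puts most of its attention. The sketch below assumes, as a hypothesis to test rather than a guaranteed result, that a model solving the reversal task attends to the mirrored source position (decoder step `t` to encoder position `n - 1 - t`). The function names are hypothetical, and `weights` is assumed to contain only rows and columns for word tokens, with any start, end, or padding positions already removed.

```python
from typing import List, Sequence


def argmax_alignment(weights: Sequence[Sequence[float]]) -> List[int]:
    """For each decoder step (row), return the index of the most-attended encoder token."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in weights]


def reversal_alignment_rate(weights: Sequence[Sequence[float]], src_len: int) -> float:
    """Fraction of decoder steps whose attention peak lies on the mirrored source position."""
    peaks = argmax_alignment(weights)[:src_len]
    if not peaks:
        return 0.0
    hits = sum(peak == src_len - 1 - t for t, peak in enumerate(peaks))
    return hits / len(peaks)


if __name__ == "__main__":
    # Hand-made anti-diagonal pattern for a three-word sentence.
    anti_diagonal = [
        [0.1, 0.2, 0.7],
        [0.2, 0.6, 0.2],
        [0.8, 0.1, 0.1],
    ]
    print(argmax_alignment(anti_diagonal))
    print(reversal_alignment_rate(anti_diagonal, src_len=3))
```

Comparing this rate for the first and last decoder steps against the middle ones is one way to approach the question about sequence boundaries.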

---
