# ✅ **Encoder–Decoder Architecture**

The Encoder–Decoder model is the backbone of many sequence-to-sequence tasks like **machine translation**, **text summarization**, and **speech-to-text**.

---

## **1. Core Idea**

We want to **map an input sequence** $x = (x_1, x_2, ..., x_T)$ to an **output sequence** $y = (y_1, y_2, ..., y_{T'})$, where $T$ and $T'$ can be different.

- **Encoder**: Reads and compresses the input into a fixed-size context vector.
- **Decoder**: Expands that context into the output sequence.

---

## **2. Architecture Overview**
Input → [Encoder] → Context Vector → [Decoder] → Output


### **Encoder**

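- Often built with **LSTMs** or **GRUs**; **Transformer** encoders are also used, though they process tokens in parallel rather than sequentially.
- A recurrent encoder reads the input tokens step-by-step.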
- Produces hidden states and a **final context vector** $(h_T, c_T)$ (for LSTM).

### **Decoder**

- Takes the **context vector** as the initial hidden state.
- Predicts the output sequence **one token at a time**.
- At each step, uses its previous output as the next input.

---

## **3. Encoder (Math)**

Let $x_t$ be the embedding of the $t^{th}$ input token.

**For an LSTM encoder:**

$$
h_t, c_t = \text{LSTM}_{enc}(x_t, h_{t-1}, c_{t-1})
$$

Where:

- $h_t$ = hidden state at time $t$
- $c_t$ = cell state (memory)
- $h_T, c_T$ = final states passed to the decoder
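
The recurrence above can be sketched in plain Python. This is a minimal illustration, not a trained model: the weight layout (one matrix `W` over the concatenated $[x_t; h_{t-1}]$ with gates ordered input, forget, output, candidate), the random weights, and the helper names `lstm_step` and `encode` are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: returns (h_t, c_t) from x_t, h_{t-1}, c_{t-1}."""
    z_in = x + h_prev                      # concatenate [x_t; h_{t-1}]
    n = len(h_prev)
    # Pre-activations for all four gates, n values each.
    pre = [sum(w * v for w, v in zip(row, z_in)) + b_k for row, b_k in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:n]]             # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]         # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]     # output gate
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]   # candidate cell values
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def encode(xs, hidden_size, W, b):
    """Run the LSTM over all input embeddings and return the final (h_T, c_T)."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in xs:
        h, c = lstm_step(x, h, c, W, b)
    return h, c

random.seed(0)
embed_size, hidden_size = 3, 2
# Illustrative random weights; a real encoder learns these during training.
W = [[random.uniform(-0.5, 0.5) for _ in range(embed_size + hidden_size)]
     for _ in range(4 * hidden_size)]
b = [0.0] * (4 * hidden_size)

xs = [[0.5, 0.1, 0.4], [0.9, 0.3, 0.8], [0.2, 0.7, 0.5]]  # three token embeddings
h_T, c_T = encode(xs, hidden_size, W, b)
print(h_T, c_T)
```

Whatever the input length $T$, the encoder hands over only the pair $(h_T, c_T)$, whose size is fixed by `hidden_size`.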

---

## **4. Decoder (Math)**

Let $y_{t-1}$ be the embedding of the previous output token.

**For an LSTM decoder:**

$$
s_t, c_t' = \text{LSTM}_{dec}(y_{t-1}, s_{t-1}, c_{t-1}')
$$

Where:

- $s_t$ = decoder hidden state
- $c_t'$ = decoder cell state
- Initial states: $s_0 = h_T$, $c_0' = c_T$ (from encoder)

The probability of the next token:

$$
P(y_t | y_{<t}, x) = \text{Softmax}(W_o \cdot s_t + b_o)
$$

Where:

- $W_o, b_o$ are learned output projection parameters.
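
As a rough sketch of the output projection, the snippet below turns a decoder state into a distribution over a toy three-word vocabulary. The values of `s_t`, `W_o`, and `b_o` are made up for illustration, and `next_token_distribution` is a hypothetical helper name.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(s_t, W_o, b_o):
    """Softmax(W_o . s_t + b_o): one probability per vocabulary entry."""
    logits = [sum(w * s for w, s in zip(row, s_t)) + b for row, b in zip(W_o, b_o)]
    return softmax(logits)

s_t = [0.2, -0.1]                            # toy decoder hidden state
W_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # one row per vocabulary entry
b_o = [0.0, 0.0, 0.0]

probs = next_token_distribution(s_t, W_o, b_o)
print(probs)
```

`W_o` has one row per vocabulary entry, so its shape is (vocabulary size × hidden size).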

---

## **5. Training (Teacher Forcing)**

During training:

- At step $t$, we feed the **ground-truth token** $y_{t-1}$ instead of the predicted one to the decoder.
- Loss is **cross-entropy** between predicted probabilities and actual tokens.

$$
\mathcal{L} = - \sum_{t=1}^{T'} \log P(y_t^{true} | y_{<t}^{true}, x)
$$
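
A small sketch of both ideas, assuming token ids instead of embeddings and made-up decoder probabilities. The helper names `teacher_forcing_inputs` and `sequence_nll` are hypothetical.

```python
import math

def teacher_forcing_inputs(target_ids, sos_id):
    """Decoder inputs under teacher forcing: <SOS> followed by the true tokens shifted right."""
    return [sos_id] + target_ids[:-1]

def sequence_nll(step_probs, target_ids):
    """Cross-entropy loss: negative sum of log-probabilities of the true tokens."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids))

target_ids = [0, 1, 2]
# Illustrative per-step distributions over a 3-token vocabulary.
step_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
]

print(teacher_forcing_inputs([5, 6, 7], sos_id=1))  # [1, 5, 6]
loss = sequence_nll(step_probs, target_ids)
print(loss)  # equals -log(0.7 * 0.8 * 0.5)
```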

---

## **6. Inference (Generation)**

- Start with a special `<SOS>` token.
- Feed the predicted token back into the decoder for the next step.
- Stop at `<EOS>` or max length.
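
The loop above can be sketched as greedy decoding, which always picks the most probable token. Here `toy_step` is a hypothetical stand-in for a trained decoder step, and the token ids for `<SOS>` and `<EOS>` are assumptions chosen for the example.

```python
def greedy_decode(step_fn, init_state, sos_id, eos_id, max_len):
    """Generate tokens one at a time, feeding each prediction back in."""
    tokens = []
    state = init_state
    prev = sos_id
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        prev = max(range(len(probs)), key=lambda i: probs[i])  # argmax
        if prev == eos_id:
            break
        tokens.append(prev)
    return tokens

def toy_step(prev_token, state):
    """Stand-in decoder step: always predicts prev_token + 1 (capped at the last id)."""
    vocab_size = 5
    nxt = min(prev_token + 1, vocab_size - 1)
    probs = [0.0] * vocab_size
    probs[nxt] = 1.0
    return probs, state + 1

print(greedy_decode(toy_step, 0, sos_id=0, eos_id=4, max_len=10))  # [1, 2, 3]
print(greedy_decode(toy_step, 0, sos_id=0, eos_id=4, max_len=2))   # [1, 2]
```

The second call shows the max-length stop: generation ends after two tokens even though `<EOS>` was never produced.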

---

## **7. Key Limitations**

- **Fixed-size context vector bottleneck**: The encoder must compress all input info into a single vector — hard for long sequences.
- **Solution**: Attention mechanisms (introduced later) let the decoder look back at all encoder states.

---

## **8. Visual Summary**

**Encoder**:  
[LSTM] → [LSTM] → ... → Final state $(h_T, c_T)$

**Decoder**:  
Initial state from encoder → Predict token → Feed back → Repeat.

---

## **Notes:**
The Encoder–Decoder architecture converts sequences into sequences by first **encoding meaning** into a hidden representation, then **decoding** it step-by-step.

---

# ✅ **Attention in Encoder–Decoder Models**

## **Why Attention?**

- Vanilla Encoder–Decoder squeezes all input info into a **single context vector** → bottleneck for long sequences.
- **Attention** allows the decoder to **look at all encoder states** and decide which parts of the input are important at each output step.

---

## **1. Bahdanau Attention (Additive Attention, 2015)**

**Key idea**:  
At each decoder step $t$, compare the **previous decoder hidden state** $s_{t-1}$ with **each encoder hidden state** $h_i$ to find relevance.

**Steps:**

1. **Score (alignment model)**:

   $$
   e_{t,i} = v_a^\top \tanh(W_s s_{t-1} + W_h h_i)
   $$

2. **Attention weights**:

   $$
   \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^T \exp(e_{t,k})}
   $$

3. **Context vector**:

   $$
   c_t = \sum_{i=1}^T \alpha_{t,i} h_i
   $$

4. **Decoder update**:  
   Use $c_t$ with $y_{t-1}$ to produce $s_t$ and predict $y_t$.
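
Steps 1 to 3 can be sketched as follows. The identity matrices for $W_s$ and $W_h$, the vector $v_a$, and the toy states are illustrative values, not learned parameters, and `additive_attention` is a hypothetical helper name.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, encoder_states, W_s, W_h, v_a):
    """Return (attention weights, context vector) for one decoder step."""
    proj_s = matvec(W_s, s_prev)
    scores = []
    for h_i in encoder_states:
        proj_h = matvec(W_h, h_i)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(v * u for v, u in zip(v_a, hidden)))   # e_{t,i}
    alphas = softmax(scores)                                     # alpha_{t,i}
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states))
               for d in range(dim)]                              # c_t
    return alphas, context

identity = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [1.0, 0.0]                                    # previous decoder state s_{t-1}
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v_a = [1.0, 0.0]

alphas, context = additive_attention(s_prev, encoder_states, identity, identity, v_a)
print(alphas, context)
```

With these toy values the first encoder state gets the largest weight, because its projection adds most strongly to the projection of $s_{t-1}$.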

**Characteristics**:

- Computes scores **before** updating decoder state for step $t$.
- “Additive” because score uses addition + nonlinearity.
- Good for small to medium hidden sizes.

---

## **2. Luong Attention (Multiplicative Attention, 2015)**

**Key idea**:  
Simpler scoring, computed **after** the current decoder hidden state $s_t$ is obtained.

**Variants of score function**:

1. **Dot**:

   $$
   \text{score}(s_t, h_i) = s_t^\top h_i
   $$

2. **General**:

   $$
   \text{score}(s_t, h_i) = s_t^\top W_a h_i
   $$

3. **Concat**:

   $$
   \text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])
   $$

**Steps**:

1. Compute score between $s_t$ and each $h_i$.
2. Softmax over scores → attention weights $\alpha_{t,i}$.
3. Weighted sum → context vector $c_t$.
4. Combine $c_t$ with $s_t$ for output prediction.
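
A sketch of steps 1 to 3 with the dot and general score functions. The toy states and the matrix `W_a` are made-up values, and `luong_attention` is a hypothetical helper name. Step 4 (combining $c_t$ with $s_t$) is left out.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(s_t, h_i):
    """score = s_t^T h_i"""
    return sum(a * b for a, b in zip(s_t, h_i))

def general_score(s_t, h_i, W_a):
    """score = s_t^T W_a h_i"""
    projected = [sum(w * x for w, x in zip(row, h_i)) for row in W_a]
    return dot_score(s_t, projected)

def luong_attention(s_t, encoder_states, score_fn):
    """Score every encoder state, softmax, then take the weighted sum."""
    alphas = softmax([score_fn(s_t, h_i) for h_i in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states))
               for d in range(dim)]
    return alphas, context

s_t = [1.0, 0.0]                                   # current decoder state
encoder_states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
W_a = [[0.0, 1.0], [1.0, 0.0]]                     # illustrative matrix that swaps the two dimensions

dot_alphas, dot_context = luong_attention(s_t, encoder_states, dot_score)
gen_alphas, gen_context = luong_attention(
    s_t, encoder_states, lambda s, h: general_score(s, h, W_a))
print(dot_alphas, gen_alphas)
```

The dot score needs no extra parameters but requires $s_t$ and $h_i$ to have the same dimension. The general score adds `W_a`, which here shifts the focus from the first encoder state to the second.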

**Characteristics**:

- Faster (especially dot-product version).
- Works well for large hidden dimensions.

---

## **Bahdanau vs. Luong**

| Feature                   | Bahdanau (2015)                                                                | Luong (2015)                                                            |
|---------------------------|--------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| **Score type**            | **Additive** — uses a small feedforward neural network (ANN) to compare states | **Multiplicative** — uses dot product or weighted dot product (General) |
| **How score is computed** | $e_{t,i} = v_a^\top \tanh(W_s s_{t-1} + W_h h_i)$                              | Dot: $s_t^\top h_i$ <br> General: $s_t^\top W_a h_i$                    |
| **Score timing**          | Before decoder state update ($s_{t-1}$ used)                                   | After decoder state update ($s_t$ used)                                 |
| **Computation cost**      | Higher (extra ANN adds parameters & computation)                               | Lower (simple multiplication)                                           |
| **Best for**              | Small/medium hidden sizes — more flexible matching                             | Large hidden sizes — faster and efficient                               |

---

## **Notes:**

- **Bahdanau** = more complex, better alignments for small/medium models.
- **Luong** = simpler, faster, works well for larger models.
- Both replace the fixed context vector with a **dynamic context** per output step, improving Seq2Seq performance.

---

# ✅ **Transformers**

Transformers are a type of deep learning architecture introduced in the 2017 paper *“Attention Is All You Need”* by Vaswani et al.  
They revolutionized **Natural Language Processing (NLP)** by replacing recurrent architectures (RNNs, LSTMs) with **self-attention mechanisms** that can:

- Process all tokens in a sequence **in parallel** (faster training)
- Capture **long-range dependencies** between words without losing context
- Scale well with large datasets and models

**Key idea:**  
Instead of processing words one-by-one, Transformers let every word **look at** every other word in the sequence, deciding *how much each should matter* for the current word’s meaning.

---
## **Main Components of Transformers**

Transformers are built from modular components that work together to process and generate sequences. They are as follows:

---

### **1. Self-Attention**

- Allows each token to **attend to every other token** in the same sequence.
- Enables context-aware representations.
- Used in both encoder and decoder.
- **Masked Self-Attention** is a variant used in the decoder to prevent future token access.

---

### **2. Multi-Head Attention**

- Runs multiple self-attention mechanisms (called "heads") **in parallel**.
- Each head captures different types of relationships (e.g., syntactic, semantic).
- Outputs are concatenated and linearly transformed.

---

### **3. Positional Encoding**

- Adds **position information** to token embeddings.
- Since self-attention is **order-agnostic**, positional encodings help the model understand token order.
- Can be **sinusoidal** (fixed) or **learned**.

---

### **4. Layer Normalization**

- Normalizes activations across features.
- Helps stabilize training and improve convergence.
- Applied before or after sublayers (depending on implementation).

---

### **5. Masked Self-Attention**

- Used in **decoder** during training and inference.
- Prevents a token from attending to **future tokens**.
- Ensures autoregressive generation (left-to-right prediction).
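
A minimal sketch of the masking idea: for the token at position `t`, only scores for positions up to `t` enter the softmax, so future positions get weight 0. Implementations commonly get the same effect by setting future scores to negative infinity before the softmax. The name `causal_softmax` is a hypothetical helper.

```python
import math

def causal_softmax(scores):
    """scores: n x n raw attention scores. Row t is normalized over positions 0..t only."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[:t + 1]                     # current and earlier positions
        m = max(visible)
        exps = [math.exp(z - m) for z in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - t - 1)  # future positions get zero weight
        weights.append([e / total for e in exps] + masked_tail)
    return weights

uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_softmax(uniform_scores)
for row in weights:
    print(row)
# Row 0 is [1.0, 0.0, 0.0]; row 1 is [0.5, 0.5, 0.0]; row 2 spreads weight over all three.
```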

---

### **6. Cross-Attention**

- Used in **decoder** to attend to **encoder outputs**.
- Essential for **sequence-to-sequence tasks** like translation.
- Helps decoder incorporate source sentence information.

---

### **7. Encoder Architecture**

- Composed of multiple identical layers.
- Each layer includes:
  - Self-Attention
  - Feed-Forward Network (FFN)
  - Layer Normalization
  - Residual Connections

---

### **8. Decoder Architecture**

- Similar to encoder but with additional components:
  - **Masked Self-Attention**
  - **Cross-Attention** (to encoder outputs)
  - Feed-Forward Network
  - Layer Normalization
  - Residual Connections

---

### **9. Transformer Inference**

- During inference (e.g., text generation):
  - Decoder generates tokens **one at a time**.
  - Uses **masked self-attention** to prevent future leakage.
  - May use techniques like **beam search** or **sampling** for output generation.

---

# ✅ **1. Self-Attention**

Self-attention transforms **input word representations** into **context-aware representations**.  
Before we get into it, let’s build up the underlying concepts step by step.

---

## **a. Vectorisation in NLP**

We can’t feed text directly to a model — we must **convert words into numerical vectors**.

**Common vectorisation methods:**

1. **OHE (One-Hot Encoding):**
   - Each word gets a unique vector with 1 in one position and 0 elsewhere.
   - Problem: extremely sparse, no notion of similarity (e.g., “king” and “queen” are as different as “king” and “banana”).

2. **Bag-of-Words / TF-IDF:**
   - Counts occurrences of words in a document.
   - Loses order/context information.

3. **Word2Vec and similar distributed representations:**
   - Learn dense vector representations where similar words have similar vectors.
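
The similarity problem with one-hot vectors is easy to check directly. The three-word vocabulary below is a toy assumption, and `one_hot` and `cosine` are hypothetical helper names.

```python
import math

def one_hot(word, vocab):
    """Vector with 1 at the word's index and 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "banana"]
king, queen, banana = (one_hot(w, vocab) for w in vocab)

print(cosine(king, queen))   # 0.0
print(cosine(king, banana))  # 0.0
print(cosine(king, king))    # 1.0
```

Every pair of distinct one-hot vectors is orthogonal, so the encoding cannot say that “king” is closer to “queen” than to “banana”. Dense embeddings are learned precisely so that this similarity can differ between word pairs.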

---

## **b. Word Embeddings**

- Dense, low-dimensional representations of words.
- Capture semantic similarity:

  - Example:  
    **vec(king) - vec(man) + vec(woman) ≈ vec(queen)**

---

## **c. Problem of Word Embeddings**

- **Static embeddings** (Word2Vec, GloVe) give **one fixed vector per word**.
- Meaning is **averaged** across contexts:

  - "Apple" (fruit) and "Apple" (company) get the same embedding.
  - This causes ambiguity in downstream tasks.

---

## **d. Static vs Contextual Embeddings**

- **Static Embeddings:** Same vector for a word everywhere.
- **Contextual (Dynamic) Embeddings:** Word vectors depend on surrounding words (context).

  - Example:  
    In "Apple launched a phone" vs. "I ate an apple", the vector for "Apple" will differ.

---

## **e. General vs Task-Specific Contextual Embeddings**

- **General Contextual Embeddings:**  
  Produced by large pre-trained models (e.g., BERT, GPT) on massive corpora; can be used for many tasks.

- **Task-Specific Embeddings:**  
  Fine-tuned on a particular task (e.g., sentiment analysis) so they encode information relevant to that task.

---

## **f. Why Is Self-Attention Called "Self"?**

Self-attention is called “self” because each token (word) in a sequence looks at — or attends to — **other tokens in the same sequence** (including itself) to gather context, rather than attending to a different sequence.

---

## **g. How Self-Attention Creates Contextual Embeddings**

1. Start with **static embeddings** (e.g., from a lookup table).
2. Apply **Self-Attention**:
   - Each word representation gets updated based on weighted contributions from other words in the sentence.
3. The result is a **contextual embedding**:
   - Now "Apple" in a tech sentence will differ from "Apple" in a food sentence.

---


## **h. Scaled Dot-Product Attention (Core math of Self-Attention)**

The self-attention mechanism uses three vectors per word:

- **Q (Query)** – what this word is looking for
- **K (Key)** – what this word offers to others
- **V (Value)** – the actual information content

**Computation:**

1. Compute similarity between Q and K:

   $$
   \text{score}(Q, K) = \frac{Q \cdot K^T}{\sqrt{d_k}}
   $$

   (Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension $d_k$; very large scores would saturate the softmax and leave very small gradients.)

2. Apply **Softmax** to get attention weights.

3. Multiply weights by **V** to get the updated representation.

This lets each token decide how much attention to pay to every other token, producing context-rich vectors.
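
The three steps can be sketched in plain Python. Here `Q`, `K`, and `V` are passed in directly as toy matrices with one row per token; in a Transformer they come from learned linear projections of the token representations. The function name is a hypothetical helper.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, attention weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Step 1: scaled similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 2: softmax turns scores into attention weights.
        w = softmax(scores)
        weights.append(w)
        # Step 3: weighted sum of the value vectors.
        outputs.append([sum(w_j * v[d] for w_j, v in zip(w, V))
                        for d in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0]]
K = [[0.0, 0.0], [0.0, 0.0]]   # identical keys, so both get equal weight
V = [[1.0, 2.0], [3.0, 4.0]]

out, attn = scaled_dot_product_attention(Q, K, V)
print(attn)  # [[0.5, 0.5]]
print(out)   # [[2.0, 3.0]]
```

With identical keys the weights are uniform and the output is simply the average of the value vectors. Different keys would shift the weights toward the values whose keys match the query best.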

---

## **Full Process of Self-Attention**

<div style="text-align: center;">
    <img src="Screenshots/T1.png" style="width: 800px;"/>
</div>

---

##  **Notes:**

- Transformers use **Self-Attention** to build **contextual embeddings** that adapt to the meaning of each word in its sentence.  
- This solves the ambiguity problem of static embeddings and enables powerful, scalable NLP models.
- Video 76 from the playlist serves as a revision of the encoder–decoder architecture, Bahdanau attention, Luong attention, and self-attention.

---

# ✅ **2. Multi-Head Attention**
Multi-Head Attention means running **multiple self-attention mechanisms in parallel**, each called a *head*.

- Each head learns to focus on different relationships or patterns in the sequence.
- The outputs of all heads are **combined** to give a richer, more nuanced representation.

---

## **Why Multiple Heads?**

A single attention head computes one set of attention weights per token, so it tends to capture **one type of relationship** at a time.  
Multiple heads allow the model to:

- Capture **different kinds of dependencies** between words
- Attend to **different positions** and **different aspects** of meaning simultaneously

---

## **Example**

Sentence:

> "The bank can ensure your deposits are safe."

**What each head might focus on for "bank" (illustrative):**

- **Head 1**: Focuses on financial meaning — attends to *deposits*, *safe*
- **Head 2**: Focuses on grammatical structure — attends to *can ensure*
- **Head 3**: Tracks other context cues — maybe *your*, *are*

Each head processes its own **Query–Key–Value** attention and returns context-aware vectors.  
These vectors are then **concatenated and linearly projected** into a single representation.
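
A compact sketch of that pipeline, assuming two heads, random (untrained) projection matrices, and toy sizes. The names `multi_head_attention`, `random_matrix`, `heads`, and `W_o` are illustrative choices, not taken from a specific library.

```python
import math
import random

def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) per head; W_o: final linear projection."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head vectors for each token, then project.
    concat = [[v for head in head_outputs for v in head[t]] for t in range(len(X))]
    return matmul(concat, W_o)

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, num_heads = 4, 2
d_head = d_model // num_heads
X = random_matrix(3, d_model)                         # 3 tokens, d_model features each
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model)

out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))  # 3 4
```

Each head works in a smaller subspace (`d_head = d_model / num_heads`), so the concatenated result has width `d_model` again before the final projection.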

---

## **Analogy**

Think of MHA like having a **panel of experts** reading the same sentence:

- One expert focuses on **grammar**
- Another on **financial context**
- Another on **safety/security meaning**

After they all share their insights, you **combine** them into a final, comprehensive understanding.

---

<div style="text-align: center;">
    <img src="Screenshots/T2.png" style="width: 800px;"/>
</div>

<div style="text-align: center;">
    <img src="Screenshots/T1.png" style="width: 800px;"/>
</div>

---

# ✅ **3. Positional Encoding**

Self-attention in Transformers looks at all tokens in a sequence **simultaneously** — it does **not** know the order of the words by default.  
**Positional Encoding** adds information about the position of each token so the model can understand sequence order.

---

## **Why We Need It**

Without positional information:

> "Dog bites man" and "Man bites dog"  
> would look the same to the model — just a bag of words.

Sequence order is critical for meaning, and positional encoding solves this.

---

## **How It Works**

- A **positional vector** is added to each word embedding.
- This positional vector is **unique for each position** in the sequence.
- It can be generated using:

  - **Fixed sinusoidal functions** (used in the original Transformer paper)
  - **Learned position embeddings** (trainable like word embeddings)
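
For the fixed sinusoidal option, the original Transformer paper defines $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. A minimal sketch follows, with positions counted from 0; the function name is a hypothetical helper.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(num_positions=3, d_model=4)
for row in pe:
    print(row)
# Position 0 is [0.0, 1.0, 0.0, 1.0] because sin(0) = 0 and cos(0) = 1.
```

Each position gets a different vector, and no training is needed to produce the table.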

---

## **Example**

Sentence:

> "I love NLP"

Let’s say the word embeddings are:

- "I" → [0.5, 0.1, 0.4]
- "love" → [0.9, 0.3, 0.8]
- "NLP" → [0.2, 0.7, 0.5]

Positional encodings might be:

- Position 1 → [0.01, 0.99, 0.05]
- Position 2 → [0.02, 0.98, 0.10]
- Position 3 → [0.03, 0.97, 0.15]

**Final input to the model** = word embedding + positional encoding  
For example:  
"I" → [0.5 + 0.01, 0.1 + 0.99, 0.4 + 0.05] = [0.51, 1.09, 0.45]
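
The same element-wise addition for all three words, using the illustrative numbers from this example. Results are rounded to two decimals to hide floating-point noise, and `add_positional` is a hypothetical helper name.

```python
def add_positional(embeddings, positional):
    """Element-wise sum of each word embedding with its positional vector."""
    return [[round(e + p, 2) for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, positional)]

words = ["I", "love", "NLP"]
embeddings = [[0.5, 0.1, 0.4], [0.9, 0.3, 0.8], [0.2, 0.7, 0.5]]
positional = [[0.01, 0.99, 0.05], [0.02, 0.98, 0.10], [0.03, 0.97, 0.15]]

model_inputs = add_positional(embeddings, positional)
for word, vec in zip(words, model_inputs):
    print(word, vec)
# I [0.51, 1.09, 0.45]
# love [0.92, 1.28, 0.9]
# NLP [0.23, 1.67, 0.65]
```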

---

## **Analogy**

Imagine you have three photos (words) in a pile:

- The photos alone tell you *what* is in each picture (semantic meaning).
- The captions (positional encodings) tell you *when* each photo was taken (order).

Without captions, you could shuffle the pictures and lose the story.

---

## **Notes**

Positional Encoding gives Transformers the ability to **understand word order**, which is essential for capturing meaning in sequences.  
It’s a simple yet powerful fix for the order-agnostic nature of self-attention.

---