# **Seq2Seq: Encoder-Decoder with NumPy**

## **From Scratch Implementation in NumPy**

This notebook implements a **Sequence-to-Sequence (Seq2Seq)** model using the **Encoder-Decoder** architecture of **Cho et al., 2014**. Unlike a standard language model, which predicts the next word in a continuous stream, this architecture predicts the next sentence given a context sentence.

### **Architecture Overview:**
1.  **Encoder:** A GRU that digests the input sentence and compresses it into a fixed-size **Context Vector** (Hidden State).
2.  **The Bridge:** A learned non-linear layer that transforms the Encoder's final representation into the Decoder's initial state.
3.  **Decoder:** A second GRU that generates the target sentence one word at a time, initialized by the Bridge.

### **Key Features:**
- **Teacher Forcing:** Used during training to stabilize convergence.
- **Autoregressive Inference:** Used during prediction (feeding the model's own output back in).
- **Parameter Sharing:** Encoder and Decoder share the same Embedding Matrix.

---
*Notebook by*: Ahmad Raza [@ahmadrazacdx](https://github.com/ahmadrazacdx)<br>
*Date: 2025* *License: MIT*

In [None]:
import re
import random
import numpy as np
np.random.seed(42)

In [None]:
with open('../data/thirsty_crow.txt', 'r') as f:  # close the file handle after reading
    data = f.read().lower()
sentences = re.split(r'(?<=[.!?])\s+', data)
sentences = [s.strip() for s in sentences if len(s.strip()) > 0]
print(f"Found {len(sentences)} sentences.")
print(f"Example sentences: {sentences[:2]}")

Found 12 sentences.
Example sentences: ['once upon a time, on a very hot day, a thirsty crow was flying in search of water.', 'the sun was shining brightly, and the poor crow was feeling tired and weak.']


In [None]:
#Build Vocab
words = []
for sent in sentences:
    word = re.findall(r"\w+|[.,!?'\";:]", sent)
    words.extend(word)

SOS_TOKEN = '<SOS>'  # Start of Sequence
EOS_TOKEN = '<EOS>'  # End of Sequence
UNK_TOKEN = '<UNK>'  # Unknown word

vocab = [SOS_TOKEN, EOS_TOKEN, UNK_TOKEN] + sorted(list(set(words)))
vocab_size = len(vocab)
word_to_ix = {w: i for i, w in enumerate(vocab)}
ix_to_word = {i: w for i, w in enumerate(vocab)}

print(f"Vocab Size: {vocab_size}")

Vocab Size: 90
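The vocabulary reserves `<UNK>`, but a direct lookup such as `word_to_ix[w]` raises a `KeyError` for any word that was not in the training text. The following is a minimal sketch of an encoder helper with an `<UNK>` fallback. The helper name `encode` and the toy vocabulary are assumptions made for illustration and are not part of the notebook.

```python
import re

TOKEN_PATTERN = r"\w+|[.,!?'\";:]"  # same token pattern as the notebook

def encode(sentence, word_to_ix, unk_token="<UNK>"):
    """Hypothetical helper: map a sentence to indices, falling back to <UNK>."""
    tokens = re.findall(TOKEN_PATTERN, sentence.lower())
    unk_ix = word_to_ix[unk_token]
    return [word_to_ix.get(tok, unk_ix) for tok in tokens]

# Toy vocabulary for illustration only (not the notebook's 90-word vocabulary)
toy_vocab = ["<SOS>", "<EOS>", "<UNK>"] + sorted({"the", "crow", "was", "thirsty", "."})
toy_word_to_ix = {w: i for i, w in enumerate(toy_vocab)}

# "hungry" is not in the toy vocabulary, so it maps to the <UNK> index (2)
encoded = encode("The crow was hungry.", toy_word_to_ix)
print(encoded)
```

A fallback like this mainly matters at inference time, when a user-supplied source sentence may contain unseen words.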


In [None]:
#Create Training Pairs (Encoder(Input), Decoder(Target))  with Teacher Forcing
training_pairs = []

for i in range(len(sentences) - 1):
    src_text = sentences[i]
    src_words = re.findall(r"\w+|[.,!?'\";:]", src_text)
    src_indices = [word_to_ix[w] for w in src_words]

    trg_text = sentences[i+1]
    trg_words = re.findall(r"\w+|[.,!?'\";:]", trg_text)

    dec_input = [word_to_ix[SOS_TOKEN]] + [word_to_ix[w] for w in trg_words] # Decoder Input: <SOS> + sentence
    dec_target = [word_to_ix[w] for w in trg_words] + [word_to_ix[EOS_TOKEN]] # Decoder Target: sentence + <EOS>

    training_pairs.append({
        'src': src_indices,
        'dec_input': dec_input,
        'dec_target': dec_target
    })

print(f"\nExample Pair 0:")
print(f"Encoder Input (Indices): {training_pairs[0]['src']}")
print(f"Decoder Input (Indices): {training_pairs[0]['dec_input']}")
print(f"Decoder Target (Indices): {training_pairs[0]['dec_target']}")
print(f"Original Source: {sentences[0]}")
print(f"Original Target: {sentences[1]}")


Example Pair 0:
Encoder Input (Indices): [54, 85, 6, 78, 4, 53, 6, 86, 39, 22, 4, 6, 76, 21, 87, 32, 41, 65, 52, 88, 5]
Decoder Input (Indices): [0, 72, 71, 87, 66, 15, 4, 9, 72, 59, 21, 87, 28, 79, 9, 89, 5]
Decoder Target (Indices): [72, 71, 87, 66, 15, 4, 9, 72, 59, 21, 87, 28, 79, 9, 89, 5, 1]
Original Source: once upon a time, on a very hot day, a thirsty crow was flying in search of water.
Original Target: the sun was shining brightly, and the poor crow was feeling tired and weak.
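The decoder input and decoder target above are the same sentence shifted by one position: at step `t` the decoder reads the true previous token and is asked to predict the current one. A small sketch of that alignment follows. The helper `make_decoder_pair` is a hypothetical name, and the three indices are taken from the start of Example Pair 0.

```python
SOS, EOS = 0, 1  # same reserved indices as the notebook's vocabulary

def make_decoder_pair(target_indices):
    """Hypothetical helper: build the teacher-forcing input and target."""
    dec_input = [SOS] + list(target_indices)    # <SOS> + sentence
    dec_target = list(target_indices) + [EOS]   # sentence + <EOS>
    return dec_input, dec_target

dec_input, dec_target = make_decoder_pair([72, 71, 87])
for step, (x, y) in enumerate(zip(dec_input, dec_target)):
    print(f"step {step}: input={x:>2} -> target={y:>2}")
```

Both lists always have the same length, which is why the forward pass can iterate over them with a single index.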


### __HYPER-PARAMETERS__

In [None]:
lr = 1e-3            # Learning rate
hidden_size = 100    # Size of hidden state (h)
embed_size = 100     # Size of embedding vector (e)
MAX_LEN = 25         # Max length for generation
clip_value = 5.0     # Gradient clipping threshold

### __MODEL PARAMETER INITIALIZATION__

**1. Shared Embeddings:**
- $\mathbf{W}_{emb} \in \mathbb{R}^{V \times E}$: Shared lookup table for both source and target words.
- Each GRU consumes the concatenated input $[\mathbf{h}_{t-1}; \mathbf{e}_t]\in \mathbb{R}^{H+E}$, where $\mathbf{e}_t$ is the embedding of the current word.

**2. Encoder Parameters:**
- $\mathbf{W}^{enc}_u, \mathbf{W}^{enc}_r, \mathbf{W}^{enc}_h \in \mathbb{R}^{H \times (H+E)}$: Weights for Update, Reset, and Candidate gates.
- $\mathbf{b}^{enc}_u, \mathbf{b}^{enc}_r, \mathbf{b}^{enc}_h \in \mathbb{R}^{H \times 1}$: Biases for the Encoder.

**3. The Bridge:**
- $\mathbf{W}_{bridge} \in \mathbb{R}^{H \times H}$: Learns how to translate the Encoder's final thought into the Decoder's starting thought.
- $\mathbf{b}_{bridge} \in \mathbb{R}^{H \times 1}$: Bias for the bridge.

**4. Decoder Parameters:**
- $\mathbf{W}^{dec}_u, \mathbf{W}^{dec}_r, \mathbf{W}^{dec}_h \in \mathbb{R}^{H \times (H+E)}$: Weights for the Decoder GRU.
- $\mathbf{b}^{dec}_u, \mathbf{b}^{dec}_r, \mathbf{b}^{dec}_h \in \mathbb{R}^{H \times 1}$: Biases for the Decoder.
- $\mathbf{W}_y\in \mathbb{R}^{V \times H}$: Hidden-to-Output weight matrix (Vocabulary projection).
- $\mathbf{b}_y\in \mathbb{R}^{V \times 1}$: Output bias.

**Where:**  
- $V$ = vocabulary size  
- $E$ = embedding dimension (100)  
- $H$ = hidden size (100)
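As a quick sanity check on the shapes listed above, the total number of trainable parameters can be computed directly. This is a sketch using the notebook's sizes ($V = 90$, $E = 100$, $H = 100$); the helper `gru_param_count` is a hypothetical name.

```python
V, E, H = 90, 100, 100  # vocabulary, embedding, and hidden sizes from this notebook

def gru_param_count(hidden, embed):
    """Three gate matrices of shape (H, H+E) plus three (H, 1) biases."""
    return 3 * hidden * (hidden + embed) + 3 * hidden

counts = {
    "embeddings": V * E,              # shared lookup table
    "encoder": gru_param_count(H, E),
    "bridge": H * H + H,              # W_bridge and b_bridge
    "decoder": gru_param_count(H, E),
    "output": V * H + V,              # Wy and by
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:<10} {n:>7,}")
print(f"{'total':<10} {total:>7,}")
```

The two GRUs account for most of the parameters, while the shared embedding table is counted once.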

In [None]:
# 1. Shared Embeddings
Wemb = np.random.randn(vocab_size, embed_size) * 0.01 # Word Embeddings (V,E)
# 2. Encoder Parameters
Wu_enc = np.random.randn(hidden_size, hidden_size + embed_size) * 0.01 # Update Gate weights (H, H+E)
Wr_enc = np.random.randn(hidden_size, hidden_size + embed_size) * 0.01 # Reset Gate weights (H, H+E)
Wh_enc = np.random.randn(hidden_size, hidden_size + embed_size) * 0.01 # Candidate Hidden weights (H, H+E)

bu_enc = np.zeros((hidden_size, 1)) # Update Gate bias (H, 1)
br_enc = np.zeros((hidden_size, 1)) # Reset Gate bias (H, 1)
bh_enc = np.zeros((hidden_size, 1)) # Candidate Hidden bias (H, 1)

# 3. Bridge Parameters
W_bridge = np.random.randn(hidden_size, hidden_size) * 0.01 # Bridge Weights (H,H)
b_bridge = np.zeros((hidden_size, 1)) # Bridge bias (H,1)

# 4. Decoder Parameters
Wu_dec = np.random.randn(hidden_size, hidden_size + embed_size) * 0.01 # Update Gate weights (H, H+E)
Wr_dec = np.random.randn(hidden_size, hidden_size + embed_size) * 0.01 # Reset Gate weights (H, H+E)
Wh_dec = np.random.randn(hidden_size, hidden_size + embed_size) * 0.01 # Candidate Hidden weights (H, H+E)

bu_dec = np.zeros((hidden_size, 1)) # Update Gate bias (H, 1)
br_dec = np.zeros((hidden_size, 1)) # Reset Gate bias (H, 1)
bh_dec = np.zeros((hidden_size, 1)) # Candidate Hidden bias (H, 1)

# 5. Output Layer (Only the Decoder makes predictions)
Wy = np.random.randn(vocab_size, hidden_size) * 0.01
by = np.zeros((vocab_size, 1))

In [None]:
print(f"""
Wemb: Word Embeddings        : {Wemb.shape}
=========================================
         ENCODER PARAMS
=========================================
Wu_enc: Update Gate Weights  : {Wu_enc.shape}
Wr_enc: Reset Gate Weights   : {Wr_enc.shape}
Wh_enc: CHS Weights          : {Wh_enc.shape}
bu_enc: Update Gate bias     : {bu_enc.shape}
br_enc: Reset Gate bias      : {br_enc.shape}
bh_enc: CHS bias             : {bh_enc.shape}
=========================================
         BRIDGE PARAMS
=========================================
W_bridge: Bridge Weights    : {W_bridge.shape}
b_bridge: Bridge bias       : {b_bridge.shape}
=========================================
         DECODER PARAMS
=========================================
Wu_dec: Update Gate Weights  : {Wu_dec.shape}
Wr_dec: Reset Gate Weights   : {Wr_dec.shape}
Wh_dec: CHS Weights          : {Wh_dec.shape}
bu_dec: Update Gate bias     : {bu_dec.shape}
br_dec: Reset Gate bias      : {br_dec.shape}
bh_dec: CHS bias             : {bh_dec.shape}
=========================================
Wy: Prediction Weights       : {Wy.shape}
by: Prediction bias          : {by.shape}
""")


Wemb: Word Embeddings        : (90, 100)
         ENCODER PARAMS
Wu_enc: Update Gate Weights  : (100, 200)
Wr_enc: Reset Gate Weights   : (100, 200)
Wh_enc: CHS Weights          : (100, 200)
bu_enc: Update Gate bias     : (100, 1)
br_enc: Reset Gate bias      : (100, 1)
bh_enc: CHS bias             : (100, 1)
         BRIDGE PARAMS
W_bridge: Bridge Weights    : (100, 100)
b_bridge: Bridge bias       : (100, 1)
         DECODER PARAMS
Wu_dec: Update Gate Weights  : (100, 200)
Wr_dec: Reset Gate Weights   : (100, 200)
Wh_dec: CHS Weights          : (100, 200)
bu_dec: Update Gate bias     : (100, 1)
br_dec: Reset Gate bias      : (100, 1)
bh_dec: CHS bias             : (100, 1)
Wy: Prediction Weights       : (90, 100)
by: Prediction bias          : (90, 1)



### __ADAM OPTIMIZER INITIALIZATION__

In [None]:
# Adam hyperparameters
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8

# 1. Embeddings
mWemb = np.zeros_like(Wemb); vWemb = np.zeros_like(Wemb)

# 2. Encoder
mWu_enc = np.zeros_like(Wu_enc); vWu_enc = np.zeros_like(Wu_enc)
mWr_enc = np.zeros_like(Wr_enc); vWr_enc = np.zeros_like(Wr_enc)
mWh_enc = np.zeros_like(Wh_enc); vWh_enc = np.zeros_like(Wh_enc)
mbu_enc = np.zeros_like(bu_enc); vbu_enc = np.zeros_like(bu_enc)
mbr_enc = np.zeros_like(br_enc); vbr_enc = np.zeros_like(br_enc)
mbh_enc = np.zeros_like(bh_enc); vbh_enc = np.zeros_like(bh_enc)

# 3. Bridge
mW_bridge = np.zeros_like(W_bridge); vW_bridge = np.zeros_like(W_bridge)
mb_bridge = np.zeros_like(b_bridge); vb_bridge = np.zeros_like(b_bridge)

# 4. Decoder
mWu_dec = np.zeros_like(Wu_dec); vWu_dec = np.zeros_like(Wu_dec)
mWr_dec = np.zeros_like(Wr_dec); vWr_dec = np.zeros_like(Wr_dec)
mWh_dec = np.zeros_like(Wh_dec); vWh_dec = np.zeros_like(Wh_dec)
mbu_dec = np.zeros_like(bu_dec); vbu_dec = np.zeros_like(bu_dec)
mbr_dec = np.zeros_like(br_dec); vbr_dec = np.zeros_like(br_dec)
mbh_dec = np.zeros_like(bh_dec); vbh_dec = np.zeros_like(bh_dec)

# 5. Output Layer
mWy = np.zeros_like(Wy); vWy = np.zeros_like(Wy)
mby = np.zeros_like(by); vby = np.zeros_like(by)

# Timestep counter
t_adam = 0
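To make the role of the `m`/`v` buffers and the timestep counter concrete, here is a sketch of a single Adam update on one array. The function `adam_step` is a hypothetical stand-alone helper, not the notebook's `update_parameters`. It illustrates that, with bias correction, the very first step has magnitude close to the learning rate in every coordinate, regardless of the gradient's scale.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single array; returns the new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0, -2.0, 3.0])
g = np.array([0.5, -100.0, 1e-3])   # gradients of very different magnitudes
w_new, m, v = adam_step(w, g, np.zeros_like(w), np.zeros_like(w), t=1)
step = w - w_new
print(step)  # each entry is approximately lr * sign(g)
```

This is why the counter must start at 0 and be incremented before the first update: at `t = 1` the correction factors exactly undo the zero initialization of the buffers.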

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)
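The `sigmoid` above is mathematically correct, but `np.exp(-z)` overflows for large negative `z`, and NumPy then emits a `RuntimeWarning`. The sketch below shows one common way to avoid that by branching on the sign of `z`. The name `stable_sigmoid` is an assumption, and the `softmax` variant here subtracts a per-column maximum, which is equivalent to the notebook's version for a single column vector.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # exp of a non-positive number
    exp_z = np.exp(z[~pos])                    # safe: z < 0 here, so exp(z) < 1
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def softmax(z):
    """Column-wise softmax, shifted by the per-column maximum for stability."""
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

z = np.array([[-1000.0], [0.0], [1000.0]])
print(stable_sigmoid(z).ravel())
print(softmax(z).ravel())
```

With weights initialized at scale 0.01 the pre-activations in this notebook stay small, so the simpler version is unlikely to overflow in practice; the stable form is mostly a safeguard.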

### __ENCODER & DECODER GRU CELLS__
**Note:** I use two separate GRU cells with distinct weights for the Encoder and Decoder. Both follow standard GRU dynamics, but only the Decoder projects to the vocabulary.

**1. Shared Embeddings (Both Cells):**
$$\mathbf{e}_t = \mathbf{W}_{emb}[word\_idx]$$


#### **A. Encoder GRU**
The Encoder updates its hidden state to consume information but does not make predictions.

**Update Gate:**
$$\mathbf{z}_u = \mathbf{W}^{enc}_u[\mathbf{h}_{t-1}; \mathbf{e}_t] + \mathbf{b}^{enc}_u$$
$$\mathbf{u}_t = \sigma(\mathbf{z}_u)$$

**Reset Gate:**
$$\mathbf{z}_r = \mathbf{W}^{enc}_r[\mathbf{h}_{t-1}; \mathbf{e}_t] + \mathbf{b}^{enc}_r$$
$$\mathbf{r}_t = \sigma(\mathbf{z}_r)$$

**Candidate Hidden State:**
$$\mathbf{z}_h =  \mathbf{W}^{enc}_{h,x} \mathbf{e}_t  + \mathbf{r}_t \odot ( \mathbf{W}^{enc}_{h,h} \mathbf{h}_{t-1}) + \mathbf{b}^{enc}_{h}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{z}_h)$$

**Final Hidden State:**
$$\mathbf{h}_t = (1 - \mathbf{u}_t) \odot \tilde{\mathbf{h}}_t  + \mathbf{u}_t \odot \mathbf{h}_{t-1}$$


#### **B. Decoder GRU**
The Decoder updates its hidden state *and* projects it to the vocabulary size to predict the next word.

**Update & Reset Gates:**
$$\mathbf{z}_u = \mathbf{W}^{dec}_u[\mathbf{h}_{t-1}; \mathbf{e}_t] + \mathbf{b}^{dec}_u$$
$$\mathbf{u}_t = \sigma(\mathbf{z}_u)$$
$$\mathbf{z}_r = \mathbf{W}^{dec}_r[\mathbf{h}_{t-1}; \mathbf{e}_t] + \mathbf{b}^{dec}_r$$
$$\mathbf{r}_t = \sigma(\mathbf{z}_r)$$

**Candidate & Final Hidden State:**
$$\mathbf{z}_h =  \mathbf{W}^{dec}_{h,x} \mathbf{e}_t  + \mathbf{r}_t \odot ( \mathbf{W}^{dec}_{h,h} \mathbf{h}_{t-1}) + \mathbf{b}^{dec}_{h}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{z}_h)$$
$$\mathbf{h}_t = (1 - \mathbf{u}_t) \odot \tilde{\mathbf{h}}_t  + \mathbf{u}_t \odot \mathbf{h}_{t-1}$$

**Output Layer (Projection):**
$$\mathbf{z}_y = \mathbf{W}_y\mathbf{h}_t + \mathbf{b}_y$$
$$\mathbf{p}_t = \text{softmax}(\mathbf{z}_y)$$

**Note:**
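This implementation follows the encoder-decoder of the original paper **(Cho et al., 2014)**, but the candidate hidden state uses the formulation found in PyTorch's GRU, which differs slightly: the reset gate is applied after the hidden-to-hidden product, $\mathbf{r}_t \odot (\mathbf{W}_{h,h}\mathbf{h}_{t-1})$, rather than before it, $\mathbf{W}_{h,h}(\mathbf{r}_t \odot \mathbf{h}_{t-1})$, as in the paper.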
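The two placements of the reset gate are not numerically equivalent in general. The sketch below compares them on random stand-in values; the variable names and the small size `H = 4` are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4
W_hh = rng.standard_normal((H, H))   # hidden-to-hidden block of the candidate weights
h_prev = rng.standard_normal((H, 1))
r = rng.uniform(size=(H, 1))         # stand-in for a reset gate activation in (0, 1)

after_matmul = r * (W_hh @ h_prev)   # gate applied after the product (this notebook)
before_matmul = W_hh @ (r * h_prev)  # gate applied before the product (Cho et al., 2014)

print(np.allclose(after_matmul, before_matmul))

# With the gate fully open (r = 1) the two variants coincide.
ones = np.ones((H, 1))
print(np.allclose(ones * (W_hh @ h_prev), W_hh @ (ones * h_prev)))
```

Both variants let the gate suppress the influence of the previous state; they differ in whether the gating acts on the inputs or on the outputs of the hidden-to-hidden product, which also changes the backward-pass formulas.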

**References**

- Cho et al. *Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation*, EMNLP 2014. [https://aclanthology.org/D14-1179/](https://aclanthology.org/D14-1179/)

- PyTorch GRUCell documentation. [https://docs.pytorch.org/stable/generated/torch.nn.GRUCell.html](https://docs.pytorch.org/docs/stable/generated/torch.nn.GRU.html)

In [None]:
# ENCODER GRU CELL
def encoder(h_prev, word_idx):
    """
    Single GRU step for the Encoder. Uses _enc weights and DOES NOT compute output yt.
    """
    et = Wemb[word_idx].reshape(-1, 1)  # (E, 1)
    zt = np.concatenate((h_prev, et), axis=0) # (H+E, 1)
    zu = np.dot(Wu_enc, zt) + bu_enc #(H,H+E)@(H+E,1)->(H,1)+(H,1)=(H,1)
    ut = sigmoid(zu) # (H,1)
    zr = np.dot(Wr_enc, zt) + br_enc #(H,H+E)@(H+E,1)->(H,1)+(H,1)=(H,1)
    rt = sigmoid(zr) # (H,1)
    Wh_h = Wh_enc[:, :hidden_size] # (H,H)
    Wh_x = Wh_enc[:, hidden_size:] # (H,E)
    zcht = np.dot(Wh_x, et) + rt * np.dot(Wh_h, h_prev) + bh_enc #(H,1)
    cht = np.tanh(zcht) # (H,1)
    ht = (1 - ut) * cht + ut * h_prev # (H,1)*(H,1) + (H,1)*(H,1)=(H,1)
    return et, ut, rt, cht, ht

# DECODER GRU CELL
def decoder(h_prev, word_idx):
    """
    Single GRU step for the Decoder. Uses _dec weights and COMPUTES output yt.
    """
    et = Wemb[word_idx].reshape(-1, 1)
    zt = np.concatenate((h_prev, et), axis=0)
    zu = np.dot(Wu_dec, zt) + bu_dec
    ut = sigmoid(zu)
    zr = np.dot(Wr_dec, zt) + br_dec
    rt = sigmoid(zr)
    Wh_h = Wh_dec[:, :hidden_size]
    Wh_x = Wh_dec[:, hidden_size:]
    zcht = np.dot(Wh_x, et) + rt * np.dot(Wh_h, h_prev) + bh_dec
    cht = np.tanh(zcht)
    ht = (1 - ut) * cht + ut * h_prev
    yt = np.dot(Wy, ht) + by #(V,H)@(H,1)->(V,1)+(V,1)=(V,1)
    return et, ut, rt, cht, ht, yt

### __UNDERSTANDING THE GRU CELLS__

Since the architecture is split, the two GRU cells behave differently:

**1. Encoder GRU (The Reader)**
- **Inputs:** Current word embedding $\mathbf{e}_t$ and Previous hidden state $\mathbf{h}_{t-1}$.
- **Action:** Updates its internal memory to "understand" the sequence.
- **Outputs:**
  1. New hidden state $\mathbf{h}_t$.
  2. Cache values $(\mathbf{e}_t, \mathbf{u}_t, \mathbf{r}_t, \tilde{\mathbf{h}}_t)$ for backprop.
- **Note:** It does **not** produce output logits ($\mathbf{y}_t$).

**2. Decoder GRU (The Writer)**
- **Inputs:** Current word embedding $\mathbf{e}_t$ (from Teacher Forcing or Prediction) and Previous hidden state $\mathbf{h}_{t-1}$.
- **Action:** Updates memory *and* predicts the next word.
- **Outputs:**
  1. New hidden state $\mathbf{h}_t$.
  2. **Logits $\mathbf{y}_t$** (Projected to Vocabulary size).
  3. Cache values $(\mathbf{e}_t, \mathbf{u}_t, \mathbf{r}_t, \tilde{\mathbf{h}}_t)$.

### __FORWARD PASS__


In [None]:
def forward(src_inputs, dec_inputs, dec_targets):
    """
    Forward pass for the Encoder-Decoder Architecture.

    Inputs:
        - src_inputs: List of indices for Source Sentence (Encoder)
        - dec_inputs: List of indices for Decoder Inputs (Teacher Forcing: <SOS> + words)
        - dec_targets: List of indices for Decoder Targets (words + <EOS>)

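    Returns:
        - loss: Scalar Cross-Entropy Loss (summed over decoder timesteps)
        - enc_cache: Tuple of dictionaries needed for Encoder Backward
        - bridge_cache: Tuple (h_enc_final, bridge_z, h_dec_init) needed for Bridge Backward
        - dec_cache: Tuple of dictionaries needed for Decoder Backward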
    """

    # ENCODER
    enc_et, enc_ut, enc_rt, enc_cht, enc_ht = {}, {}, {}, {}, {}
    h_enc_init = np.zeros((hidden_size, 1))
    enc_ht[-1] = h_enc_init
    # Run Encoder Loop
    for t in range(len(src_inputs)):
        word_idx = src_inputs[t]
        # Run single step
        enc_et[t], enc_ut[t], enc_rt[t], enc_cht[t], enc_ht[t] = encoder(enc_ht[t-1], word_idx)

    # BRIDGE
    h_enc_final = enc_ht[len(src_inputs) - 1] # (H,1)
    bridge_z = np.dot(W_bridge, h_enc_final) + b_bridge # (H,H)@(H,1)+(H,1)=(H,1)
    h_dec_init = np.tanh(bridge_z) #(H,1)
    bridge_cache = (h_enc_final, bridge_z, h_dec_init)

    # DECODER
    dec_et, dec_ut, dec_rt, dec_cht, dec_ht, dec_yt = {}, {}, {}, {}, {}, {}
    dec_probt = {}
    dec_ht[-1] = np.copy(h_dec_init)
    loss = 0
    # Run Decoder Loop
    for t in range(len(dec_inputs)):
        word_idx = dec_inputs[t]   # Current input (from Teacher Forcing)
        target_idx = dec_targets[t] # True label
        dec_et[t], dec_ut[t], dec_rt[t], dec_cht[t], dec_ht[t], dec_yt[t] = decoder(dec_ht[t-1], word_idx)
        dec_probt[t] = softmax(dec_yt[t])
        loss += -np.log(dec_probt[t][target_idx, 0] + epsilon)

    enc_cache = (enc_et, enc_ut, enc_rt, enc_cht, enc_ht)
    dec_cache = (dec_et, dec_ut, dec_rt, dec_cht, dec_ht, dec_yt, dec_probt)

    return loss, enc_cache, bridge_cache, dec_cache
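A useful sanity check for the forward pass: with weights initialized at scale 0.01, the logits start near zero, so the softmax is close to uniform and the summed loss should be roughly $T_{dec} \cdot \ln V$. The sketch below computes that reference value, assuming exactly uniform predictions and the decoder length of Example Pair 0.

```python
import numpy as np

V = 90        # vocabulary size from this notebook
T_dec = 17    # decoder length of Example Pair 0 (16 tokens + <EOS>)

uniform_probs = np.full((V, 1), 1.0 / V)         # idealized initial prediction
per_token_loss = -np.log(uniform_probs[0, 0])    # equals ln(V)
expected_initial_loss = T_dec * per_token_loss

print(f"per-token loss: {per_token_loss:.4f}")
print(f"sequence loss : {expected_initial_loss:.2f}")
```

If the first reported loss is far from this value, the initialization or the loss indexing is worth re-checking before looking at the backward pass.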

## __BACKWARD PASS (BPTT)__

**Backpropagation Through Time Equations for Encoder-Decoder Architecture**


### **DECODER BACKWARD PASS**

#### **Step 1: Output Layer Gradient (Softmax + Cross-Entropy)**

$$\frac{\partial \mathcal{L}_t}{\partial \mathbf{y}_t} = \mathbf{p}_t - \mathbf{1}_{y^*_t}$$

**Output layer weight gradients:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_y} = \sum_{t=0}^{T_{dec}-1} \frac{\partial \mathcal{L}_t}{\partial \mathbf{y}_t} \left(\mathbf{h}_t^{dec}\right)^T$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_y} = \sum_{t=0}^{T_{dec}-1} \frac{\partial \mathcal{L}_t}{\partial \mathbf{y}_t}$$
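The identity $\partial \mathcal{L}_t / \partial \mathbf{y}_t = \mathbf{p}_t - \mathbf{1}_{y^*_t}$ can be verified numerically. The sketch below compares it with central finite differences on random logits; the sizes, the seed, and the helper `cross_entropy` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

def cross_entropy(logits, target):
    return -np.log(softmax(logits)[target, 0])

rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 1))
target = 2

# Analytic gradient: p - one_hot(target)
analytic = softmax(logits)
analytic[target] -= 1

# Central finite differences, one logit at a time
numeric = np.zeros_like(logits)
h = 1e-5
for i in range(logits.shape[0]):
    bump = np.zeros_like(logits)
    bump[i] = h
    numeric[i] = (cross_entropy(logits + bump, target)
                  - cross_entropy(logits - bump, target)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The entries of the analytic gradient also sum to zero, since the probabilities sum to one and exactly one entry is reduced by one.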

#### **Step 2: Decoder Hidden State Gradient**

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_t^{dec}} = \mathbf{W}_y^T \frac{\partial \mathcal{L}_t}{\partial \mathbf{y}_t} + \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{t+1}^{dec}}$$

The gradient flows from two sources:
- Current timestep's output loss (first term)
- Future timesteps: the gradient propagated back from step $t+1$ through the GRU (second term, initialized as zeros for the last timestep)


### **GRU GATE GRADIENTS (Applied to both Encoder & Decoder)**

**Note:** The following gradient computations (Steps 3-7) apply to both the Encoder and Decoder GRUs. For clarity:
- **Decoder**: Use $\mathbf{h}^{dec}$, $\mathbf{u}^{dec}$, $\mathbf{r}^{dec}$, $\tilde{\mathbf{h}}^{dec}$, and corresponding weight matrices $\mathbf{W}_u^{dec}$, $\mathbf{W}_r^{dec}$, $\mathbf{W}_h^{dec}$
- **Encoder**: Use $\mathbf{h}^{enc}$, $\mathbf{u}^{enc}$, $\mathbf{r}^{enc}$, $\tilde{\mathbf{h}}^{enc}$, and corresponding weight matrices $\mathbf{W}_u^{enc}$, $\mathbf{W}_r^{enc}$, $\mathbf{W}_h^{enc}$

For brevity, I write equations without superscripts below, understanding they apply to whichever RNN is being processed.

#### **Step 3: Update Gate Gradients**

**Gradient w.r.t. update gate (after sigmoid):**
$$\frac{\partial \mathbf{h}_t}{\partial \mathbf{u}_t} = -\tilde{\mathbf{h}}_t + \mathbf{h}_{t-1} = \mathbf{h}_{t-1} - \tilde{\mathbf{h}}_t$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{u}_t} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} \odot (\mathbf{h}_{t-1} - \tilde{\mathbf{h}}_t)$$

**Gradient w.r.t. update gate pre-activation:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}_u} = \frac{\partial \mathcal{L}}{\partial \mathbf{u}_t} \odot \mathbf{u}_t \odot (1 - \mathbf{u}_t)$$

#### **Step 4: Candidate Hidden State Gradients**

**Gradient w.r.t. candidate hidden state (after tanh):**
$$\frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{h}}_t} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} \odot (1 - \mathbf{u}_t)$$

**Key Insight:** The coefficient of $\tilde{\mathbf{h}}_t$ in the hidden state equation is $(1 - \mathbf{u}_t)$.

**Gradient w.r.t. candidate pre-activation:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}_h} = \frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{h}}_t} \odot (1 - \tilde{\mathbf{h}}_t^2)$$

#### **Step 5: Reset Gate Gradients**
**Gradient w.r.t. reset gate (after sigmoid):**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{r}_t} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_h} \odot (\mathbf{W}_{h,h} \mathbf{h}_{t-1})$$

**Key Insight:** The reset gate multiplies the hidden contribution $\mathbf{W}_{h,h} \mathbf{h}_{t-1}$, so its gradient is the element-wise product with this term.

**Gradient w.r.t. reset gate pre-activation:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}_r} = \frac{\partial \mathcal{L}}{\partial \mathbf{r}_t} \odot \mathbf{r}_t \odot (1 - \mathbf{r}_t)$$

#### **Step 6: Weight Matrix Gradients**

**Update Gate:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_u} = \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_u} [\mathbf{h}_{t-1}; \mathbf{e}_t]^T$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_u} = \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_u}$$

**Reset Gate:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_r} = \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_r} [\mathbf{h}_{t-1}; \mathbf{e}_t]^T$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_r} = \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_r}$$

**Candidate Hidden State (Split Computation):**

For the hidden part $\mathbf{W}_{h,h}$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{h,h}} = \sum_{t=0}^{T-1} (\frac{\partial \mathcal{L}}{\partial \mathbf{z}_h} \odot \mathbf{r}_t) \mathbf{h}_{t-1}^T$$

For the input part $\mathbf{W}_{h,x}$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{h,x}} = \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_h} \mathbf{e}_t^T$$

Bias gradient:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_h} = \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_h}$$

**Embedding Matrix:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{emb}[i]} \mathrel{+}= \sum_{t: w_t=i} \left( \left[\mathbf{W}_u^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}_u}\right]_{[H:]} + \left[\mathbf{W}_r^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}_r}\right]_{[H:]} + \mathbf{W}_{h,x}^T\frac{\partial \mathcal{L}}{\partial \mathbf{z}_h} \right)$$
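Because the same word can appear at several timesteps, its embedding row must receive the sum of all per-timestep gradients. A per-timestep loop does this naturally. If the accumulation were vectorized instead, note that fancy-indexed `+=` in NumPy applies only one update per repeated index, whereas `np.add.at` accumulates all of them. The sketch below uses toy shapes and stand-in gradients to show the difference.

```python
import numpy as np

V, E = 5, 3
word_indices = np.array([2, 4, 2])   # word 2 appears at two timesteps
step_grads = np.ones((3, E))         # stand-in per-timestep embedding gradients

# Fancy-indexed += is buffered: the repeated index 2 is only updated once.
buffered = np.zeros((V, E))
buffered[word_indices] += step_grads

# np.add.at performs unbuffered accumulation, matching a per-timestep loop.
accumulated = np.zeros((V, E))
np.add.at(accumulated, word_indices, step_grads)

print(buffered[2], accumulated[2])
```

Updating one row per timestep inside the backward loop avoids this pitfall, because each statement touches a single index.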


#### **Step 7: Gradient to Previous Hidden State**

The gradient flows back to $t-1$ via four paths (same for both Encoder and Decoder logic):

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{h}_{t-1}} =
\underbrace{\left[\mathbf{W}_u^T \delta_u\right]_{[:H]}}_{\text{Update}} +
\underbrace{\left[\mathbf{W}_r^T \delta_r\right]_{[:H]}}_{\text{Reset}} +
\underbrace{\mathbf{W}_{h,h}^T (\delta_h \odot \mathbf{r}_t)}_{\text{Candidate}} +
\underbrace{(\delta_{\text{next}} \odot \mathbf{u}_t)}_{\text{Direct}}
$$
**Where** ($[\cdot]_{[:H]}$ keeps the first $H$ rows, the part that multiplies $\mathbf{h}_{t-1}$):
$$
\delta_k = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_k} \quad (\text{for } k \in \{u, r, h\})
$$
$$
\delta_{\text{next}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} \quad (\text{the total gradient arriving at } \mathbf{h}_t)
$$
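The four-path formula for the gradient with respect to $\mathbf{h}_{t-1}$ can be checked against finite differences on a single GRU step. The sketch below uses small random weights and a toy linear loss; the function names `gru_step` and `grad_h_prev`, the sizes, and the loss are assumptions made for illustration, while the equations mirror Steps 3-7.

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 4, 3
Wu, Wr, Wh = (rng.standard_normal((H, H + E)) * 0.5 for _ in range(3))
bu, br, bh = (rng.standard_normal((H, 1)) * 0.1 for _ in range(3))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(h_prev, e):
    """Forward step with the same equations as the notebook's GRU cells."""
    z = np.concatenate((h_prev, e), axis=0)
    u = sigmoid(Wu @ z + bu)
    r = sigmoid(Wr @ z + br)
    c = np.tanh(Wh[:, H:] @ e + r * (Wh[:, :H] @ h_prev) + bh)
    h = (1 - u) * c + u * h_prev
    return h, (u, r, c)

def grad_h_prev(dh, h_prev, cache):
    """Step 7: gradient w.r.t. h_{t-1} given dh = dL/dh_t."""
    u, r, c = cache
    dzu = dh * (h_prev - c) * u * (1 - u)            # Step 3
    dzh = dh * (1 - u) * (1 - c ** 2)                # Step 4
    dzr = dzh * (Wh[:, :H] @ h_prev) * r * (1 - r)   # Step 5
    return ((Wu.T @ dzu)[:H] + (Wr.T @ dzr)[:H]
            + Wh[:, :H].T @ (dzh * r) + dh * u)

h_prev = rng.standard_normal((H, 1))
e = rng.standard_normal((E, 1))
weights = rng.standard_normal((H, 1))   # toy loss: L = sum(weights * h_t)

h, cache = gru_step(h_prev, e)
analytic = grad_h_prev(weights, h_prev, cache)

# Central finite differences on each coordinate of h_prev
numeric = np.zeros_like(h_prev)
eps = 1e-5
for i in range(H):
    bump = np.zeros_like(h_prev)
    bump[i] = eps
    loss_plus = np.sum(weights * gru_step(h_prev + bump, e)[0])
    loss_minus = np.sum(weights * gru_step(h_prev - bump, e)[0])
    numeric[i] = (loss_plus - loss_minus) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

Dropping any one of the four terms in `grad_h_prev` makes the difference jump by several orders of magnitude, which is a convenient way to confirm that each path is needed.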


### **BRIDGE BACKWARD PASS**

#### **Step 8: Bridge Layer Gradients**

The final gradient from the decoder $\frac{\partial \mathcal{L}}{\partial \mathbf{h}_{-1}^{dec}}$ (gradient w.r.t. decoder's initial state) flows through the bridge. Recall: $\mathbf{h}_{-1}^{dec} = \tanh(\mathbf{W}_{bridge} \mathbf{h}_{final}^{enc} + \mathbf{b}_{bridge})$

**Gradient w.r.t. bridge pre-activation:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}_{bridge}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{-1}^{dec}} \odot (1 - (\mathbf{h}_{-1}^{dec})^2)$$

**Bridge weight gradients:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{bridge}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{bridge}} (\mathbf{h}_{final}^{enc})^T$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_{bridge}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{bridge}}$$

**Gradient to encoder's final hidden state:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_{final}^{enc}} = \mathbf{W}_{bridge}^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{bridge}}$$


### **ENCODER BACKWARD PASS**

#### **Step 9: Encoder Hidden State Gradient**

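The Encoder has no output layer, so the only gradient reaching $\mathbf{h}_t^{enc}$ is the one propagated back from timestep $t+1$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_t^{enc}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{t+1}^{enc}}$$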

For the final timestep $T_{enc}-1$, initialize with gradient from bridge:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_{T_{enc}-1}^{enc}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{final}^{enc}}$$

**Steps 3-7 from the GRU Gate Gradients section above are applied here for the Encoder**, using $\mathbf{h}^{enc}$, $\mathbf{W}_u^{enc}$, $\mathbf{W}_r^{enc}$, $\mathbf{W}_h^{enc}$, and accumulating gradients over $t = 0$ to $T_{enc}-1$.


#### **Step 10: Gradient Clipping**

To prevent exploding gradients, clip all parameter gradients:

$$\text{clip}(\nabla \theta, -\tau, \tau)$$

Common clipping threshold: $\tau = 5$
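The clipping above is element-wise: each gradient entry is limited to $[-\tau, \tau]$ independently, which can change the direction of the overall gradient. A frequently used alternative, shown here only for comparison and not used in this notebook, rescales all gradients jointly when their global L2 norm exceeds $\tau$. Both helper names in this sketch are hypothetical.

```python
import numpy as np

def clip_elementwise(grads, tau):
    """Clip every entry to [-tau, tau] in place, as in this notebook."""
    for g in grads:
        np.clip(g, -tau, tau, out=g)
    return grads

def clip_by_global_norm(grads, tau):
    """Alternative (not used here): rescale if the joint L2 norm exceeds tau."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, tau / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, -40.0]), np.array([0.5])]

by_value = clip_elementwise([g.copy() for g in grads], tau=5.0)
by_norm = clip_by_global_norm(grads, tau=5.0)

print(by_value[0], by_value[1])
print(by_norm[0], by_norm[1])
```

Element-wise clipping leaves small entries untouched and truncates only the outliers, while norm clipping shrinks every entry by the same factor and preserves the gradient direction.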


**Notation:**
- $T_{enc}$ = encoder sequence length
- $T_{dec}$ = decoder sequence length
- $T$ = sequence length (generic, when referring to either encoder or decoder)
- $H$ = hidden dimension
- $E$ = embedding dimension
- $V$ = vocabulary size
- $\mathbf{W}_{h,h}$ = hidden part of $\mathbf{W}_h$ (first $H$ columns)
- $\mathbf{W}_{h,x}$ = input part of $\mathbf{W}_h$ (last $E$ columns)
- $[\cdot; \cdot]$ = concatenation along dimension 0
- Superscripts $enc$ and $dec$ denote encoder and decoder components respectively when disambiguation is needed

In [None]:
def backward(src_inputs, dec_inputs, dec_targets, enc_cache,
            bridge_cache, dec_cache):
    """
    Backpropagation Through Time (BPTT) for the Encoder-Decoder Architecture (Decoder -> Bridge -> Encoder).
    """
    # Unpack caches
    enc_et, enc_ut, enc_rt, enc_cht, enc_ht = enc_cache
    h_enc_final, bridge_z, h_dec_init = bridge_cache
    dec_et, dec_ut, dec_rt, dec_cht, dec_ht, dec_yt, dec_probt = dec_cache

    # Initialize Gradients
    dWemb = np.zeros_like(Wemb) # Shared embeddings
    # Encoder
    dWu_enc, dWr_enc, dWh_enc = np.zeros_like(Wu_enc), np.zeros_like(Wr_enc), np.zeros_like(Wh_enc)
    dbu_enc, dbr_enc, dbh_enc = np.zeros_like(bu_enc), np.zeros_like(br_enc), np.zeros_like(bh_enc)
    # Bridge
    dW_bridge = np.zeros_like(W_bridge)
    db_bridge = np.zeros_like(b_bridge)
    # Decoder
    dWu_dec, dWr_dec, dWh_dec = np.zeros_like(Wu_dec), np.zeros_like(Wr_dec), np.zeros_like(Wh_dec)
    dbu_dec, dbr_dec, dbh_dec = np.zeros_like(bu_dec), np.zeros_like(br_dec), np.zeros_like(bh_dec)
    dWy, dby = np.zeros_like(Wy), np.zeros_like(by)

    # DECODER BACKWARD PASS
    dh_next = np.zeros_like(dec_ht[0]) # Gradient from future timestep (0 for last step)- (H,1)

    for t in reversed(range(len(dec_inputs))):
        # A. Gradient of Loss w.r.t Output
        dy = np.copy(dec_probt[t]) # (V,1)
        dy[dec_targets[t]] -= 1  # (V,1)
        dWy += np.dot(dy, dec_ht[t].T) # Accumulate output layer gradients (V,1)@(1,H)=(V,H)
        dby += dy #(V,1)

        # B. Gradient w.r.t Hidden State
        dh = np.dot(Wy.T, dy) + dh_next #(H,V)@(V,1):(H,1)+(H,1)=(H,1)

        # C. Compute all gate gradients (GRU)
        du = dh * (dec_ht[t-1] - dec_cht[t]) #(H,1)*(H,1)=(H,1)
        dzu = du * dec_ut[t] * (1 - dec_ut[t]) # (H,1)*(H,1)=(H,1)

        dcht = dh * (1 - dec_ut[t]) # (H,1)*(H,1)=(H,1)
        dzh = dcht * (1 - dec_cht[t]**2) #(H,1)*(H,1)=(H,1)

        Wh_h_dec = Wh_dec[:, :hidden_size] #(H,H)
        Wh_x_dec = Wh_dec[:, hidden_size:] #(H,E)

        dr = dzh * np.dot(Wh_h_dec, dec_ht[t-1]) #(H,H)@(H,1):(H,1)*(H,1)=(H,1)
        dzr = dr * dec_rt[t] * (1 - dec_rt[t]) #(H,1)*(H,1)*(H,1)=(H,1)

        # Accumulate Weight Gradients
        z_cat = np.concatenate((dec_ht[t-1], dec_et[t]), axis=0) #(H,1)+(E,1)=(H+E,1)
        dWu_dec += np.dot(dzu, z_cat.T) #(H,1)@(1,H+E)=(H,H+E)
        dbu_dec += dzu #(H,1)

        dWr_dec += np.dot(dzr, z_cat.T) #(H,1)@(1,H+E)=(H,H+E)
        dbr_dec += dzr #(H,1)

        # Candidate State Weights
        dWh_dec[:, :hidden_size] += np.dot(dzh * dec_rt[t], dec_ht[t-1].T) # Hidden part (H,1)*(H,1):(H,1)@(1,H)=(H,H)
        dWh_dec[:, hidden_size:] += np.dot(dzh, dec_et[t].T)               # Input part (H,1)@(1,E)=(H,E)
        dbh_dec += dzh #(H,1)

        # Embeddings Gradient
        de = (np.dot(Wu_dec.T, dzu) + np.dot(Wr_dec.T, dzr) + np.dot(Wh_dec.T, dzh))[hidden_size:, :] #(H+E,1)+(H+E,1)+(H+E,1)=(H+E,1):[(H+E)-H,1]=(E,1)
        dWemb[dec_inputs[t]] += de.ravel() #(E,)

        # Compute dh_next for the previous timestep
        dh_from_zu = np.dot(Wu_dec.T, dzu)[:hidden_size, :] #(H+E,H)@(H,1):(H+E,1)=>(H,1)
        dh_from_zr = np.dot(Wr_dec.T, dzr)[:hidden_size, :] #(H,1)
        dh_from_zh = np.dot(Wh_h_dec.T, dzh * dec_rt[t]) #(H,H)@(H,1):(H,1)
        dh_from_direct = dh * dec_ut[t] #(H,1)*(H,1)=(H,1)
        dh_next = dh_from_zu + dh_from_zr + dh_from_zh + dh_from_direct #(H,1)

    # The final `dh_next` here is the gradient w.r.t the Decoder's Initial State!
    d_h_dec_init = dh_next #(H,1)

    # BRIDGE BACKWARD PASS
    d_bridge_z = d_h_dec_init * (1 - h_dec_init**2) # (H,1)*(H,1)=(H,1)
    dW_bridge = np.dot(d_bridge_z, h_enc_final.T) # (H,1)@(1,H)=(H,H)
    db_bridge = d_bridge_z #(H,1)
    d_h_enc_final = np.dot(W_bridge.T, d_bridge_z) #(H,H)@(H,1)=(H,1)

    # ENCODER BACKWARD PASS
    dh_next = d_h_enc_final # Initialize with gradient from bridge

    for t in reversed(range(len(src_inputs))):
        dh = dh_next #(H,1)
        # Compute all gate gradients (GRU)
        du = dh * (enc_ht[t-1] - enc_cht[t]) #(H,1)
        dzu = du * enc_ut[t] * (1 - enc_ut[t]) #(H,1)

        dcht = dh * (1 - enc_ut[t]) #(H,1)
        dzh = dcht * (1 - enc_cht[t]**2) #(H,1)

        Wh_h_enc = Wh_enc[:, :hidden_size] #(H,H)

        dr = dzh * np.dot(Wh_h_enc, enc_ht[t-1]) #(H,H)@(H,1):(H,1)*(H,1)=(H,1)
        dzr = dr * enc_rt[t] * (1 - enc_rt[t]) #(H,1)

        # Accumulate Weight Gradients
        z_cat = np.concatenate((enc_ht[t-1], enc_et[t]), axis=0) #(H+E,1)
        dWu_enc += np.dot(dzu, z_cat.T) #(H,1)@(1,H+E)=(H,H+E)
        dbu_enc += dzu #(H,1)

        dWr_enc += np.dot(dzr, z_cat.T) #(H,H+E)
        dbr_enc += dzr #(H,1)

        dWh_enc[:, :hidden_size] += np.dot(dzh * enc_rt[t], enc_ht[t-1].T) #(H,1)*(H,1):(H,1)@(1,H)=(H,H)
        dWh_enc[:, hidden_size:] += np.dot(dzh, enc_et[t].T) #(H,1)@(1,E)=(H,E)
        dbh_enc += dzh #(H,1)

        # Embeddings Gradient
        de = (np.dot(Wu_enc.T, dzu) + np.dot(Wr_enc.T, dzr) + np.dot(Wh_enc.T, dzh))[hidden_size:, :] # (E,1)
        dWemb[src_inputs[t]] += de.ravel() #(E,)

        # Compute dh_next for previous timestep
        dh_from_zu = np.dot(Wu_enc.T, dzu)[:hidden_size, :] #(H,1)
        dh_from_zr = np.dot(Wr_enc.T, dzr)[:hidden_size, :] #(H,1)
        dh_from_zh = np.dot(Wh_h_enc.T, dzh * enc_rt[t]) #(H,1)
        dh_from_direct = dh * enc_ut[t] #(H,1)

        dh_next = dh_from_zu + dh_from_zr + dh_from_zh + dh_from_direct #(H,1)

    # Gradient Clipping
    grads = [dWemb, dWu_enc, dWr_enc, dWh_enc, dWu_dec, dWr_dec, dWh_dec, dWy, dW_bridge, dbu_enc, dbr_enc, dbh_enc, dbu_dec, dbr_dec, dbh_dec, dby, db_bridge]

    for grad in grads:
        np.clip(grad, -clip_value, clip_value, out=grad)

    return grads

### __UPDATE PARAMS WITH ADAM__

In [None]:
def update_parameters(grads, learning_rate):
    """
    Update all Encoder, Bridge, and Decoder parameters using Adam.

    Inputs:
        - grads: List of gradients returned by backward()
        - learning_rate: Scalar float
    """
    # Model weights
    global Wemb
    global Wu_enc, Wr_enc, Wh_enc, bu_enc, br_enc, bh_enc
    global Wu_dec, Wr_dec, Wh_dec, bu_dec, br_dec, bh_dec, Wy, by
    global W_bridge, b_bridge

    # Adam memory variables (m and v)
    global mWemb, vWemb
    global mWu_enc, vWu_enc, mWr_enc, vWr_enc, mWh_enc, vWh_enc, mbu_enc, vbu_enc, mbr_enc, vbr_enc, mbh_enc, vbh_enc
    global mWu_dec, vWu_dec, mWr_dec, vWr_dec, mWh_dec, vWh_dec, mbu_dec, vbu_dec, mbr_dec, vbr_dec, mbh_dec, vbh_dec, mWy, vWy, mby, vby
    global mW_bridge, vW_bridge, mb_bridge, vb_bridge
    global t_adam

    t_adam += 1
    (dWemb,
     dWu_enc, dWr_enc, dWh_enc,
     dWu_dec, dWr_dec, dWh_dec, dWy,
     dW_bridge,
     dbu_enc, dbr_enc, dbh_enc,
     dbu_dec, dbr_dec, dbh_dec, dby,
     db_bridge) = grads

    params = [
        (Wemb, dWemb, mWemb, vWemb),

        (Wu_enc, dWu_enc, mWu_enc, vWu_enc),
        (Wr_enc, dWr_enc, mWr_enc, vWr_enc),
        (Wh_enc, dWh_enc, mWh_enc, vWh_enc),

        (Wu_dec, dWu_dec, mWu_dec, vWu_dec),
        (Wr_dec, dWr_dec, mWr_dec, vWr_dec),
        (Wh_dec, dWh_dec, mWh_dec, vWh_dec),
        (Wy, dWy, mWy, vWy),

        (W_bridge, dW_bridge, mW_bridge, vW_bridge),

        (bu_enc, dbu_enc, mbu_enc, vbu_enc),
        (br_enc, dbr_enc, mbr_enc, vbr_enc),
        (bh_enc, dbh_enc, mbh_enc, vbh_enc),

        (bu_dec, dbu_dec, mbu_dec, vbu_dec),
        (br_dec, dbr_dec, mbr_dec, vbr_dec),
        (bh_dec, dbh_dec, mbh_dec, vbh_dec),
        (by, dby, mby, vby),

        (b_bridge, db_bridge, mb_bridge, vb_bridge)
    ]

    updated_params = []

    for param, grad, m, v in params:
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * (grad ** 2)

        # Bias correction
        m_corrected = m / (1 - beta1 ** t_adam)
        v_corrected = v / (1 - beta2 ** t_adam)

        param = param - learning_rate * m_corrected / (np.sqrt(v_corrected) + epsilon)
        updated_params.append((param, m, v))

    (Wemb, mWemb, vWemb) = updated_params[0]

    (Wu_enc, mWu_enc, vWu_enc) = updated_params[1]
    (Wr_enc, mWr_enc, vWr_enc) = updated_params[2]
    (Wh_enc, mWh_enc, vWh_enc) = updated_params[3]

    (Wu_dec, mWu_dec, vWu_dec) = updated_params[4]
    (Wr_dec, mWr_dec, vWr_dec) = updated_params[5]
    (Wh_dec, mWh_dec, vWh_dec) = updated_params[6]
    (Wy, mWy, vWy) = updated_params[7]

    (W_bridge, mW_bridge, vW_bridge) = updated_params[8]

    (bu_enc, mbu_enc, vbu_enc) = updated_params[9]
    (br_enc, mbr_enc, vbr_enc) = updated_params[10]
    (bh_enc, mbh_enc, vbh_enc) = updated_params[11]

    (bu_dec, mbu_dec, vbu_dec) = updated_params[12]
    (br_dec, mbr_dec, vbr_dec) = updated_params[13]
    (bh_dec, mbh_dec, vbh_dec) = updated_params[14]
    (by, mby, vby) = updated_params[15]

    (b_bridge, mb_bridge, vb_bridge) = updated_params[16]
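
The function above applies the same Adam rule to each of the 17 tensors. The rule can be tried in isolation on a toy problem, as sketched below. This sketch assumes parameters are kept in dictionaries, which is not how this notebook stores them; `adam_step`, `params`, and `state` are hypothetical names.

```python
import numpy as np

def adam_step(params, grads, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over dictionaries of arrays (hypothetical layout)."""
    state["t"] += 1
    t = state["t"]
    for name in params:
        g = grads[name]
        # Exponential moving averages of the gradient and its square
        state["m"][name] = beta1 * state["m"][name] + (1 - beta1) * g
        state["v"][name] = beta2 * state["v"][name] + (1 - beta2) * g ** 2
        # Bias correction compensates for the zero initialisation of m and v
        m_hat = state["m"][name] / (1 - beta1 ** t)
        v_hat = state["v"][name] / (1 - beta2 ** t)
        params[name] = params[name] - lr * m_hat / (np.sqrt(v_hat) + eps)

params = {"w": np.array([[5.0], [-3.0]])}
state = {
    "t": 0,
    "m": {k: np.zeros_like(v) for k, v in params.items()},
    "v": {k: np.zeros_like(v) for k, v in params.items()},
}

# Minimise f(w) = 0.5 * ||w||^2, whose gradient is simply w
adam_step(params, {"w": params["w"].copy()}, state, lr=0.01)
first_step = params["w"].copy()  # each entry moved by about lr toward zero

for _ in range(1999):
    adam_step(params, {"w": params["w"].copy()}, state, lr=0.01)

print("after 1 step:", first_step.ravel())
print("after 2000 steps:", params["w"].ravel())
```

On the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` equals the sign of the gradient, so every entry moves by roughly `lr` regardless of the gradient's magnitude.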

### __TRAIN MODEL__


In [None]:
def train(pairs, num_iterations=1000, print_every=100):
    """
    Train the Encoder-Decoder model on one randomly chosen pair per iteration.

    Inputs:
        - pairs: List of dicts with keys 'src', 'dec_input', 'dec_target'
        - num_iterations: Number of single-pair updates to run
        - print_every: Report the average loss over this many iterations
    """
    print_loss_total = 0

    for iteration in range(1, num_iterations + 1):
        # 1. Pick a random pair from the dataset
        training_pair = random.choice(pairs)
        src_inputs = training_pair['src']
        dec_inputs = training_pair['dec_input']
        dec_targets = training_pair['dec_target']

        # 2. Forward Pass
        loss, enc_cache, bridge_cache, dec_cache = forward(src_inputs, dec_inputs, dec_targets)

        # 3. Backward Pass
        grads = backward(src_inputs, dec_inputs, dec_targets, enc_cache, bridge_cache, dec_cache)

        # 4. Update Parameters (lr is read from the notebook's global scope)
        update_parameters(grads, lr)
        print_loss_total += loss

        # 5. Report the average loss since the last report
        if iteration % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print(f"Iteration {iteration} | Loss: {print_loss_avg:.4f}")


### __GENERATE TEXT__

In [None]:
def predict(source_sentence, max_len=25):
    """
    Generate a response/next sentence given a source sentence.
    Pipeline: Source -> Encoder -> Bridge -> Decoder -> Generation
    """
    # PREPROCESS
    words = re.findall(r"\w+|[.,!?'\";:]", source_sentence.lower())
    src_indices = [word_to_ix.get(w, word_to_ix[UNK_TOKEN]) for w in words]

    # ENCODER PASS
    h_enc = np.zeros((hidden_size, 1))
    for word_idx in src_indices:
        _, _, _, _, h_enc = encoder(h_enc, word_idx)

    # BRIDGE
    # Transform Encoder Final State -> Decoder Initial State
    bridge_z = np.dot(W_bridge, h_enc) + b_bridge
    h_dec = np.tanh(bridge_z)

    # DECODER GENERATION
    # Start with <SOS>
    curr_word_idx = word_to_ix[SOS_TOKEN]
    generated_words = []

    for _ in range(max_len):
        _, _, _, _, h_dec, yt = decoder(h_dec, curr_word_idx)
        prob = softmax(yt)
        next_word_idx = np.argmax(prob)
        # next_word_idx = np.random.choice(range(vocab_size), p=prob.ravel())
        next_word = ix_to_word[next_word_idx]
        if next_word == EOS_TOKEN:
            break
        generated_words.append(next_word)
        curr_word_idx = next_word_idx

    return ' '.join(generated_words)
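
`predict` decodes greedily with `np.argmax`, and the commented-out line hints at sampling from the distribution instead. The sketch below shows one way to combine both behind a temperature parameter. It is an illustration under assumptions: `sample_next`, `softmax_stable`, `temperature`, and `rng` are hypothetical names, and the softmax is redefined here only so the block runs on its own.

```python
import numpy as np

def softmax_stable(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

def sample_next(logits, temperature=1.0, rng=None):
    """Pick the next token index; temperature <= 0 falls back to greedy argmax."""
    if temperature <= 0:
        return int(np.argmax(logits))
    rng = np.random.default_rng() if rng is None else rng
    # Dividing logits by the temperature sharpens (<1) or flattens (>1) the distribution
    probs = softmax_stable(logits / temperature).ravel()
    return int(rng.choice(len(probs), p=probs))

# Column vector of logits shaped (V, 1), like the decoder output yt
logits = np.array([[2.0], [0.5], [-1.0]])
rng = np.random.default_rng(0)

greedy = sample_next(logits, temperature=0)
samples = [sample_next(logits, temperature=1.0, rng=rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=3)

print("greedy index:", greedy)
print("sample counts per index:", counts)
```

Greedy decoding always returns the same sentence for a given input. Sampling trades that determinism for variety, with the temperature controlling how far the samples stray from the most likely token.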

### __RUN TRAINING__

In [None]:
# Run Training
train(training_pairs, num_iterations=5000, print_every=500)

Iteration 500 | Loss: 39.0605
Iteration 1000 | Loss: 12.1057
Iteration 1500 | Loss: 3.9874
Iteration 2000 | Loss: 2.4961
Iteration 2500 | Loss: 2.1274
Iteration 3000 | Loss: 1.8678
Iteration 3500 | Loss: 1.8496
Iteration 4000 | Loss: 1.1618
Iteration 4500 | Loss: 0.7839
Iteration 5000 | Loss: 0.4819


### __TEST__

In [None]:
test_sentences = [
    "The crow was thirsty.",
    "He found a pitcher.",
]
for sent in test_sentences:
    response = predict(sent)
    print(f"Input:  {sent}")
    print(f"Output: {response}")
    print("-" * 30)

Input:  The crow was thirsty.
Output: there was a little water at the bottom , but his beak could not reach it .
------------------------------
Input:  He found a pitcher.
Output: then he got an idea !
------------------------------


## **SUMMARY**

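This notebook builds a GRU Encoder-Decoder from scratch in NumPy. The Encoder compresses the source sentence into a hidden state, a `tanh` Bridge maps that state to the Decoder's initial state, and the Decoder generates the target sentence one token at a time. Training uses teacher forcing, a hand-written backward pass with element-wise gradient clipping, and Adam updates. Inference decodes greedily until `<EOS>` is produced or `max_len` is reached. Working through these pieces by hand lays the groundwork for more complex architectures such as Attention mechanisms and Transformers.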


### **Further Reading:**
1. [Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078), Cho et al. (2014): *The specific architecture implemented in this notebook.*
2. [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215), Sutskever et al. (2014): *The LSTM-based parallel to this work.*
3. [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), Olah (2015): *Visual explanations of gating mechanisms.*
4. [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling](https://arxiv.org/abs/1412.3555), Chung et al. (2014): *Rigorous comparison of GRU vs. LSTM.*

**Next Steps (Modern Architectures):**

5. [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473), Bahdanau et al. (2014): *The introduction of the **Attention Mechanism** (the logical next step for this model).*
6. [Attention Is All You Need](https://arxiv.org/abs/1706.03762), Vaswani et al. (2017): *Transformers, which replaced RNNs.*