# Coding: Machine Translation with an RNN

 - Dataset: WMT-17 EN-ZH, 5M selected high-quality sentence pairs
 - Model: Seq2Seq with an encoder-decoder framework
 - GPU: GTX 1660 Ti

# Seq2Seq Encoder-Decoder Architecture

## Overview

A Sequence-to-Sequence (Seq2Seq) model with an Encoder-Decoder architecture is a neural network framework for tasks where both input and output are variable-length sequences whose lengths need not match, such as machine translation (English → Chinese in your case).

```
Input Sequence (English):  "Hello world"
                              ↓
                          [ENCODER]
                              ↓
                        Context Vector
                              ↓
                          [DECODER]
                              ↓
Output Sequence (Chinese): "你好世界"
```

## Architecture Components

### 1. Encoder

The encoder processes the input sequence and compresses its information into a fixed-size context vector (also called a thought vector).

```
Input: [w1, w2, w3, ..., wn]
       ↓    ↓    ↓       ↓
    [RNN][RNN][RNN]...[RNN]
       ↓    ↓    ↓       ↓
    [h1] [h2] [h3] ... [hn] → Context Vector (hn)
```

#### Key Components:
- **Embedding Layer**: Converts input tokens to dense vectors
- **RNN Layers**: LSTM/GRU cells process the sequence sequentially
- **Hidden States**: Capture information at each time step
- **Final Context**: Last hidden state becomes the context vector

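```python
# Minimal PyTorch sketch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Default nn.GRU layout: (seq_len, batch, features)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers)

    def forward(self, input_seq):
        embedded = self.embedding(input_seq)   # (src_len, batch, embed_size)
        outputs, hidden = self.rnn(embedded)   # hidden: (num_layers, batch, hidden_size)
        return hidden  # final hidden state of each layer, used as the context
```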
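To make the shapes concrete, the following sketch runs an embedding layer and a GRU on random token IDs. The sizes are toy values assumed for illustration, not the project's settings, and the tensors use PyTorch's default `(seq_len, batch, features)` layout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, assumed for illustration only.
vocab_size, embed_size, hidden_size, num_layers = 50, 8, 16, 2
src_len, batch = 5, 3

embedding = nn.Embedding(vocab_size, embed_size)
rnn = nn.GRU(embed_size, hidden_size, num_layers)  # default layout: (seq_len, batch, features)

tokens = torch.randint(0, vocab_size, (src_len, batch))  # random token IDs
outputs, hidden = rnn(embedding(tokens))

print(outputs.shape)  # per-step hidden states of the top layer
print(hidden.shape)   # final hidden state of every layer

# For a unidirectional GRU, the top layer's last output is its final hidden state.
same = torch.allclose(outputs[-1], hidden[-1])
print(same)
```

With more than one layer, the "context vector" is really one final hidden state per layer, which is why `hidden` has a leading `num_layers` dimension.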

### 2. Decoder

The decoder generates the output sequence one token at a time, using the context vector from the encoder.

```
Context Vector (C) → [RNN] → [RNN] → [RNN] → ... → [RNN]
                      ↓       ↓       ↓             ↓
                    [y1]    [y2]    [y3]   ...   [yn]
```

#### Key Components:
- **Initial State**: Initialized with encoder's context vector
- **RNN Layers**: Generate hidden states for each output position
- **Output Projection**: Maps hidden states to vocabulary probabilities
- **Softmax**: Converts logits to probability distribution

```python
# Minimal PyTorch sketch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers)
        self.output_projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, target_seq, encoder_hidden):
        # A GRU carries only a hidden state; an LSTM would also need a cell state.
        embedded = self.embedding(target_seq)                  # (tgt_len, batch, embed_size)
        outputs, hidden = self.rnn(embedded, encoder_hidden)
        predictions = self.output_projection(outputs)          # logits: (tgt_len, batch, vocab_size)
        return predictions, hidden  # hidden is returned so inference can feed it back
```

## Complete Architecture Flow

### Training Phase

```
1. Input Processing:
   English: "Hello world" → [101, 7592, 2088, 102] (tokenized)
   Chinese: "[BOS] 你好世界 [EOS]" → [101, 872, 1962, 686, 4518, 102] (tokenized, one ID per character)
   Note: with BERT tokenizers, [CLS]=101 and [SEP]=102 serve as [BOS] and [EOS]

2. Encoder Forward Pass:
   Input: [101, 7592, 2088, 102]
   ↓
   Embedding: [[0.1, 0.2, ...], [0.3, 0.4, ...], ...]
   ↓
   LSTM/GRU: h1, h2, h3, h4 → Context Vector (h4)

3. Decoder Forward Pass:
   Initial State: (h4) from encoder
   Input: [101, 872, 1962, 686, 4518]  (target without the final [EOS])
   ↓
   LSTM/GRU: generates hidden states for each position
   ↓
   Output Projection: [vocab_size] logits for each position
   ↓
   Loss Calculation: CrossEntropy with targets [872, 1962, 686, 4518, 102]  (target shifted left by one)
```
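The training flow above can be sketched as a single teacher-forced training step. This is a minimal illustration, not the project's training script: the vocabulary sizes, token IDs, learning rate, and single-layer GRUs are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes and token IDs, assumed for illustration only.
src_vocab, tgt_vocab, embed_size, hidden_size = 20, 30, 8, 16
BOS, EOS = 1, 2

src = torch.tensor([[1], [7], [9], [2]])               # (src_len=4, batch=1)
tgt = torch.tensor([[BOS], [11], [12], [13], [EOS]])   # (tgt_len=5, batch=1)

src_embed = nn.Embedding(src_vocab, embed_size)
tgt_embed = nn.Embedding(tgt_vocab, embed_size)
encoder = nn.GRU(embed_size, hidden_size)
decoder = nn.GRU(embed_size, hidden_size)
project = nn.Linear(hidden_size, tgt_vocab)

modules = [src_embed, tgt_embed, encoder, decoder, project]
params = [p for m in modules for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-2)
criterion = nn.CrossEntropyLoss()

def train_step():
    optimizer.zero_grad()
    _, context = encoder(src_embed(src))      # context: (1, batch, hidden_size)
    decoder_input = tgt[:-1]                  # [BOS], y1, y2, y3 (teacher forcing)
    targets = tgt[1:]                         # y1, y2, y3, [EOS]
    outputs, _ = decoder(tgt_embed(decoder_input), context)
    logits = project(outputs)                 # (tgt_len - 1, batch, tgt_vocab)
    loss = criterion(logits.reshape(-1, tgt_vocab), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()

losses = [train_step() for _ in range(50)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The decoder input and the targets are the same sequence offset by one position, so every position is trained to predict the token that follows it.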

### Inference Phase

```
1. Encode input sequence: "Hello world"
2. Initialize decoder with encoder's context vector
3. Start with [BOS] token
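4. Generate tokens one by one (each step feeds only the previous token plus the carried hidden state; the full prefix is shown for readability):
   - Input: [BOS] → Output: 你 (most probable token of the output distribution)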
   - Input: [BOS] 你 → Output: 好
   - Input: [BOS] 你 好 → Output: 世
   - Input: [BOS] 你 好 世 → Output: 界
   - Input: [BOS] 你 好 世 界 → Output: [EOS] (stop)
```
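The inference steps above amount to a greedy decoding loop. The sketch below uses a hypothetical `step` function and a hand-written `TOY_MODEL` table in place of a trained decoder, so that the loop itself is easy to follow.

```python
BOS, EOS = "[BOS]", "[EOS]"

# Hypothetical stand-in for a trained decoder: maps the previous token to a
# distribution over the next token. A real decoder would also use `state`.
TOY_MODEL = {
    "[BOS]": {"你": 0.9, "好": 0.1},
    "你": {"好": 0.8, "[EOS]": 0.2},
    "好": {"世": 0.7, "[EOS]": 0.3},
    "世": {"界": 0.95, "[EOS]": 0.05},
    "界": {"[EOS]": 1.0},
}

def step(prev_token, state):
    """One decoder step: returns (next-token probabilities, new state)."""
    return TOY_MODEL[prev_token], state  # the toy passes the state through unchanged

def greedy_decode(context, max_len=10):
    tokens, prev, state = [], BOS, context
    for _ in range(max_len):
        probs, state = step(prev, state)
        prev = max(probs, key=probs.get)  # pick the most probable token
        if prev == EOS:
            break
        tokens.append(prev)
    return "".join(tokens)

print(greedy_decode(context=None))  # 你好世界
```

The `max_len` cap matters in practice: a model that never emits `[EOS]` would otherwise loop forever.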

## Mathematical Formulation

### Encoder
```
h_t = LSTM/GRU(embedding(x_t), h_{t-1})
context = h_n  # Final hidden state
```

### Decoder
```
s_t = LSTM/GRU(embedding(y_{t-1}), s_{t-1})  # s_0 = context
P(y_t | y_1...y_{t-1}, x) = softmax(W_s * s_t + b_s)
```
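As a small worked example of the output distribution, here is a numerically stable softmax in plain Python. The logits stand in for `W_s * s_t + b_s` and are made-up values over a four-token vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits W_s * s_t + b_s for a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

Subtracting the maximum does not change the result but keeps `math.exp` from overflowing when logits are large.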

### Loss Function
```
Loss = -∑_i ∑_t log P(y_t^i | y_1^i...y_{t-1}^i, x^i)  # i: sentence pairs, t: target positions
```
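The double sum can be checked by hand on a tiny batch. In this sketch the per-step probabilities are invented numbers standing for the probability the model assigned to the correct token at each position.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one target sequence.

    step_probs[t] is the probability the model gave the correct token at step t.
    """
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities for two sentence pairs in a batch.
batch = [
    [0.9, 0.8, 0.7, 0.95, 1.0],
    [0.5, 0.5],
]
loss = sum(sequence_nll(seq) for seq in batch)   # double sum over i and t
num_tokens = sum(len(seq) for seq in batch)
print(f"total loss {loss:.4f}, per-token loss {loss / num_tokens:.4f}")
```

Frameworks usually report the per-token average instead of the raw sum; PyTorch's `nn.CrossEntropyLoss`, for instance, averages by default.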

## Architecture Advantages

1. **Variable Length Handling**: Can process sequences of different lengths
2. **End-to-End Learning**: Jointly optimizes encoder and decoder
3. **Context Preservation**: Encoder captures semantic meaning in context vector
4. **Language Agnostic**: The same architecture applies to any language pair, given parallel data and suitable tokenizers

## Architecture Limitations

1. **Information Bottleneck**: Fixed-size context vector may lose information
2. **Long Sequence Problem**: Difficulty with very long input sequences
3. **Sequential Processing**: The recurrence runs one time step after another, so computation cannot be parallelized across positions, in training or in inference
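
The information bottleneck exists because the decoder sees only the final encoder state. For contrast, the sketch below shows the idea behind dot-product attention (listed under Improvements & Variants), where the context is recomputed at each decoder step as a weighted mix of all encoder states. The vectors are made-up 2-dimensional values, and this is one simple scoring variant, not necessarily the one a given implementation uses.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: weight each encoder state by its similarity
    to the current decoder state, then take the weighted average."""
    weights = softmax([dot(decoder_state, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context

# Made-up 2-dimensional states for a 3-token input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 2.0]

weights, context = attention_context(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Here the second encoder state receives the largest weight because it is most aligned with the decoder state.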

## Improvements & Variants

1. **Attention Mechanism**: Allows decoder to focus on relevant encoder states
2. **Bidirectional Encoder**: Processes sequence in both directions
3. **Beam Search**: Better decoding strategy than greedy search
4. **Teacher Forcing**: Training technique using ground truth as decoder input
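
To illustrate why beam search can beat greedy search, here is a minimal sketch over a hand-written toy model. `TOY_MODEL` and its probabilities are hypothetical, and the tokens are placeholders. With a beam width of 1 the search is greedy and commits to the locally best first token; a width of 2 finds a sequence with higher overall probability.

```python
import math

EOS = "[EOS]"

# Hypothetical next-token distributions keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
    ("A", "x"): {EOS: 1.0},
    ("A", "y"): {EOS: 1.0},
    ("B", "z"): {EOS: 1.0},
    ("B", "w"): {EOS: 1.0},
}

def beam_search(beam_width, max_len=5):
    beams = [(0.0, ())]  # (log-probability, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, p in TOY_MODEL[prefix].items():
                entry = (score + math.log(p), prefix + (token,))
                if token == EOS:
                    finished.append(entry)
                else:
                    candidates.append(entry)
        if not candidates:
            break
        # Keep only the `beam_width` highest-scoring unfinished prefixes.
        beams = sorted(candidates, reverse=True)[:beam_width]
    best_score, best = max(finished or beams)
    tokens = tuple(t for t in best if t != EOS)
    return tokens, math.exp(best_score)

print(beam_search(beam_width=1))  # greedy: starts with "A"
print(beam_search(beam_width=2))  # wider beam also explores "B"
```

Real implementations typically add length normalization so that longer hypotheses are not penalized for accumulating more log-probability terms; that is omitted here.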

## Implementation Architecture for Your Project

Based on your dataset (WMT-17 EN-ZH) and tokenizers (BERT-based), here's the recommended architecture:

```
Input: English sentence (max_length=100)
↓
BERT Tokenizer (vocab_size=30522) → Token IDs
↓
Embedding Layer (30522 → 512)
↓
Encoder LSTM/GRU (512 → 1024, num_layers=2)
↓
Context Vector (1024-dim per layer)
↓
Decoder LSTM/GRU (512 → 1024, num_layers=2), fed by a target Embedding Layer (21128 → 512)
↓
Output Projection (1024 → 21128)
↓
Chinese Token IDs → BERT Tokenizer (decode) → Chinese sentence
```
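As a rough size check for this configuration, the sketch below counts parameters by hand. It assumes the GRU variant with PyTorch's `nn.GRU` parameter layout, a separate decoder embedding of size 21128 → 512, and a bias on the output projection; an LSTM has four gates instead of three and would be larger.

```python
def gru_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional GRU laid out like PyTorch's nn.GRU:
    3 gates, input and recurrent weights, and two bias vectors per layer."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        total += 3 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

src_vocab, tgt_vocab = 30522, 21128
embed_size, hidden_size, num_layers = 512, 1024, 2

counts = {
    "encoder embedding": src_vocab * embed_size,
    "decoder embedding (assumed 21128 -> 512)": tgt_vocab * embed_size,
    "encoder GRU": gru_params(embed_size, hidden_size, num_layers),
    "decoder GRU": gru_params(embed_size, hidden_size, num_layers),
    "output projection": hidden_size * tgt_vocab + tgt_vocab,
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:45s} {n:>12,d}")
print(f"{'total':45s} {total:>12,d}  (~{total * 4 / 2**20:.0f} MiB in fp32)")
```

Under these assumptions the model has roughly 70M parameters, with the two embeddings and the output projection accounting for about two thirds of them. Gradients and optimizer state multiply the weight memory during training, which is worth keeping in mind on a 6 GB card.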