# Coding: Machine Translation by RNN

 - Dataset: WMT17 en-zh, filtered down to at most 5M high-quality sentence pairs
 - Model: Seq2Seq with an encoder-decoder framework
 - GPU: GTX 1660 Ti

# Seq2Seq Encoder-Decoder Architecture

## Overview

The Sequence-to-Sequence (Seq2Seq) model with Encoder-Decoder architecture is a neural network framework designed for tasks where both input and output are sequences of variable length, such as machine translation (English → Chinese in your case).

```
Input Sequence (English):  "Hello world"
                              ↓
                          [ENCODER]
                              ↓
                        Context Vector
                              ↓
                          [DECODER]
                              ↓
Output Sequence (Chinese): "你好世界"
```

## Architecture Components

### 1. Encoder

The encoder processes the input sequence and compresses its information into a fixed-size context vector (also called a thought vector).

```
Input: [w1, w2, w3, ..., wn]
       ↓    ↓    ↓       ↓
    [RNN][RNN][RNN]...[RNN]
       ↓    ↓    ↓       ↓
    [h1] [h2] [h3] ... [hn] → Context Vector (hn)
```

#### Key Components:
- **Embedding Layer**: Converts input tokens to dense vectors
- **RNN Layers**: LSTM/GRU cells process the sequence sequentially
- **Hidden States**: Capture information at each time step
- **Final Context**: Last hidden state becomes the context vector

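```python
# Minimal PyTorch sketch (sequence-first tensors, GRU variant)
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers)

    def forward(self, input_seq):
        embedded = self.embedding(input_seq)   # (src_len, batch, embed_size)
        outputs, hidden = self.rnn(embedded)   # hidden: (num_layers, batch, hidden_size)
        return hidden  # final hidden state of every layer = context vector
```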

### 2. Decoder

The decoder generates the output sequence one token at a time, using the context vector from the encoder.

```
Context Vector (C) → [RNN] → [RNN] → [RNN] → ... → [RNN]
                      ↓       ↓       ↓             ↓
                    [y1]    [y2]    [y3]   ...   [yn]
```

#### Key Components:
- **Initial State**: Initialized with encoder's context vector
- **RNN Layers**: Generate hidden states for each output position
- **Output Projection**: Maps hidden states to vocabulary probabilities
- **Softmax**: Converts logits to probability distribution

```python
# Minimal PyTorch sketch (sequence-first tensors, GRU variant)
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers)
        self.output_projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, target_seq, encoder_hidden):
        # A GRU has no cell state, so only the encoder's hidden state is passed in
        embedded = self.embedding(target_seq)
        outputs, _ = self.rnn(embedded, encoder_hidden)
        predictions = self.output_projection(outputs)  # logits over the target vocabulary
        return predictions
```
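
To check how tensor shapes flow from the encoder into the decoder, here is a minimal, self-contained sketch. It assumes PyTorch's default sequence-first layout (`batch_first=False`) and uses tiny, made-up sizes and random token IDs; the variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
SRC_VOCAB, TGT_VOCAB, EMBED, HIDDEN, LAYERS = 50, 40, 8, 16, 2  # toy sizes

enc_embedding = nn.Embedding(SRC_VOCAB, EMBED)
enc_rnn = nn.GRU(EMBED, HIDDEN, LAYERS)
dec_embedding = nn.Embedding(TGT_VOCAB, EMBED)
dec_rnn = nn.GRU(EMBED, HIDDEN, LAYERS)
output_projection = nn.Linear(HIDDEN, TGT_VOCAB)

src = torch.randint(0, SRC_VOCAB, (5, 3))  # 5 source tokens, batch of 3
tgt = torch.randint(0, TGT_VOCAB, (4, 3))  # 4 target tokens, batch of 3

# Encoder: keep only the final hidden state of each layer as the context
_, context = enc_rnn(enc_embedding(src))
# Decoder: start from the context and produce one output per target position
outputs, _ = dec_rnn(dec_embedding(tgt), context)
logits = output_projection(outputs)

print("context:", tuple(context.shape))  # (num_layers, batch, hidden)
print("logits: ", tuple(logits.shape))   # (tgt_len, batch, tgt_vocab)
```

Because both GRUs use the same `hidden_size` and `num_layers`, the encoder's hidden state can be passed to the decoder without any reshaping.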

## Complete Architecture Flow

### Training Phase

```
1. Input Processing:
   English: "Hello world" → [101, 7592, 2088, 102] (tokenized)
   Chinese: "[BOS] 你好世界 [EOS]" → [101, 872, 1962, 686, 4518, 102] (tokenized)
   (The BERT tokenizers' [CLS] = 101 and [SEP] = 102 act as [BOS] and [EOS].)

2. Encoder Forward Pass:
   Input: [101, 7592, 2088, 102]
   ↓
   Embedding: [[0.1, 0.2, ...], [0.3, 0.4, ...], ...]
   ↓
   LSTM/GRU: h1, h2, h3, h4 → Context Vector (h4)

3. Decoder Forward Pass:
   Initial State: (h4) from encoder
   Input: [101, 872, 1962, 686, 4518]
   ↓
   LSTM/GRU: generates hidden states for each position
   ↓
   Output Projection: [vocab_size] logits for each position
   ↓
   Loss Calculation: CrossEntropy with targets [872, 1962, 686, 4518, 102]
```
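
The decoder input and the loss targets in step 3 are the same sequence shifted by one position. A minimal sketch of that shift, using hypothetical token IDs (`BOS = 1`, `EOS = 2`, and arbitrary word IDs) rather than real tokenizer output:

```python
BOS, EOS = 1, 2  # hypothetical special-token IDs

def make_decoder_pair(target_ids):
    """Split one target sequence into (decoder_input, labels)."""
    decoder_input = target_ids[:-1]  # everything except the final EOS
    labels = target_ids[1:]          # everything except the leading BOS
    return decoder_input, labels

target = [BOS, 11, 12, 13, EOS]
decoder_input, labels = make_decoder_pair(target)

# At every step the decoder sees the gold previous token and must predict the next one
for step, (seen, expected) in enumerate(zip(decoder_input, labels), start=1):
    print(f"step {step}: input {seen} -> predict {expected}")
```

Feeding the ground-truth previous token instead of the model's own prediction is what teacher forcing refers to.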

### Inference Phase

```
1. Encode input sequence: "Hello world"
2. Initialize decoder with encoder's context vector
3. Start with [BOS] token
4. Generate tokens one by one:
   - Input: [BOS] → Output: 你 (probability distribution)
   - Input: [BOS] 你 → Output: 好
   - Input: [BOS] 你 好 → Output: 世
   - Input: [BOS] 你 好 世 → Output: 界
   - Input: [BOS] 你 好 世 界 → Output: [EOS] (stop)
```
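
The loop above can be sketched as greedy decoding. The lookup table below is a hypothetical stand-in for the trained decoder: it maps the prefix generated so far to a made-up probability distribution over the next token. A real RNN decoder would instead feed only the last token together with its hidden state.

```python
BOS, EOS = "[BOS]", "[EOS]"

# Hypothetical stand-in for the decoder (probabilities are invented)
TOY_MODEL = {
    (BOS,): {"你": 0.9, "好": 0.1},
    (BOS, "你"): {"好": 0.8, EOS: 0.2},
    (BOS, "你", "好"): {"世": 0.7, EOS: 0.3},
    (BOS, "你", "好", "世"): {"界": 0.95, EOS: 0.05},
    (BOS, "你", "好", "世", "界"): {EOS: 1.0},
}

def next_token_probs(prefix):
    """Return the next-token distribution for a prefix (EOS if unknown)."""
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(max_steps=10):
    """Pick the most probable token at each step until EOS or max_steps."""
    prefix = [BOS]
    for _ in range(max_steps):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        prefix.append(token)
    return prefix[1:]  # strip BOS

translation = "".join(greedy_decode())
print(translation)
```

The `max_steps` cap matters in practice: an undertrained model may never emit `[EOS]`.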

## Mathematical Formulation

### Encoder
```
h_t = LSTM/GRU(embedding(x_t), h_{t-1})
context = h_n  # Final hidden state
```
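
To make the recurrence concrete, here is a deliberately simplified sketch: a scalar tanh RNN stands in for the LSTM/GRU cell, and the weights and inputs are made-up numbers. It only illustrates how each `h_t` depends on `h_{t-1}` and why the last state serves as the context.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar tanh RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def encode(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run the recurrence over all inputs and return every hidden state."""
    h = 0.0  # h_0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

states = encode([1.0, -0.5, 2.0])  # stand-ins for embedding(x_t)
context = states[-1]               # context = h_n
print([round(h, 4) for h in states])
```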

### Decoder
```
s_t = LSTM/GRU(embedding(y_{t-1}), s_{t-1})  # s_0 = context
P(y_t | y_1...y_{t-1}, x) = softmax(W_s * s_t + b_s)
```

### Loss Function
```
Loss = -∑_i ∑_t log P(y_t^i | y_1^i...y_{t-1}^i, x^i)  # i: sentence pair, t: target position
```
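
As a small worked example of the softmax and the loss for one sentence, the sketch below uses plain Python and made-up logits over a three-token vocabulary. In the actual model this job would normally be left to the framework's cross-entropy loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(step_logits, target_ids):
    """Sum of -log P(y_t) over the time steps of one sentence."""
    loss = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        loss -= math.log(probs[target])
    return loss

# Two decoding steps over a 3-token vocabulary (made-up logits)
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = sequence_nll(step_logits, targets)
print(round(loss, 4))
```

The first step contributes little because the correct token already has the highest logit; the second step, with uniform logits, contributes `log(3)`.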

## Architecture Advantages

1. **Variable Length Handling**: Can process sequences of different lengths
2. **End-to-End Learning**: Jointly optimizes encoder and decoder
3. **Context Preservation**: Encoder captures semantic meaning in context vector
4. **Language Agnostic**: Uses no language-specific rules, so it applies to any language pair with enough parallel data

## Architecture Limitations

1. **Information Bottleneck**: Fixed-size context vector may lose information
2. **Long Sequence Problem**: Difficulty with very long input sequences
3. **Sequential Processing**: Each RNN step depends on the previous hidden state, so time steps cannot be parallelized, in training or in inference

## Improvements & Variants

1. **Attention Mechanism**: Allows decoder to focus on relevant encoder states
2. **Bidirectional Encoder**: Processes sequence in both directions
3. **Beam Search**: Better decoding strategy than greedy search
4. **Teacher Forcing**: Training technique using ground truth as decoder input
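
As an illustration of the first item, here is a minimal sketch of attention in plain Python. It assumes dot-product scoring, which is only one of several possible scoring functions, and the encoder and decoder states are made-up 2-dimensional vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """Return (weights, context) for dot-product attention."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context = weighted average of the encoder states
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1..h_3 (made up)
decoder_state = [2.0, 0.0]                               # s_t (made up)
weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

Unlike the single fixed context vector, this context is recomputed at every decoder step, which is how attention relieves the information bottleneck.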

## Implementation Architecture for Your Project

Based on your dataset (WMT-17 EN-ZH) and tokenizers (BERT-based), here's the recommended architecture:

```
Input: English sentence (max_length=100)
↓
BERT Tokenizer (vocab_size=30522) → Token IDs
↓
Embedding Layer (30522 → 512)
↓
Encoder LSTM/GRU (512 → 1024, num_layers=2)
↓
Context Vector (1024-dim)
↓
Decoder Embedding (21128 → 512) → Decoder LSTM/GRU (512 → 1024, num_layers=2, initialized with the context vector)
↓
Output Projection (1024 → 21128)
↓
Chinese Token IDs → BERT Tokenizer (decode) → Chinese sentence
```
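
A rough parameter count helps judge whether this configuration fits the GPU. The sketch below assumes the GRU variant with PyTorch's parameter layout (three gates, two bias vectors per layer) and a separate 512-dimensional decoder embedding; both are assumptions, since the diagram leaves the cell type open.

```python
def gru_layer_params(input_size, hidden_size):
    """Parameters of one PyTorch-style GRU layer (3 gates, two bias vectors)."""
    return 3 * hidden_size * (input_size + hidden_size + 2)

def gru_params(input_size, hidden_size, num_layers):
    """Total parameters of a stacked GRU; layers after the first take hidden_size inputs."""
    total = gru_layer_params(input_size, hidden_size)
    total += (num_layers - 1) * gru_layer_params(hidden_size, hidden_size)
    return total

EN_VOCAB, ZH_VOCAB = 30522, 21128
EMBED, HIDDEN, LAYERS = 512, 1024, 2

counts = {
    "encoder_embedding": EN_VOCAB * EMBED,
    "encoder_gru": gru_params(EMBED, HIDDEN, LAYERS),
    "decoder_embedding": ZH_VOCAB * EMBED,
    "decoder_gru": gru_params(EMBED, HIDDEN, LAYERS),
    "output_projection": HIDDEN * ZH_VOCAB + ZH_VOCAB,  # weights + bias
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>18}: {n:,}")
print(f"{'total':>18}: {total:,} (~{total * 4 / 2**20:.0f} MiB in fp32)")
```

This covers the weights only. Gradients, optimizer state, and activations add to the memory footprint during training, and the output projection over the 21128-token vocabulary is the largest single component.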

In [None]:
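# Optionally test the network connection first
import requests
try:
    response = requests.get("https://huggingface.co", timeout=10)
    print("Network connection OK")
except requests.RequestException:
    print("Network connection may have a problem")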

In [None]:
# Download the data & select 5m high-quality pairs

from datasets import load_dataset
import re

# load full wmt-17 en-zh dataset
full_dataset = load_dataset("wmt/wmt17", "zh-en", split="train", cache_dir=r"D:\Developer\LLM\FuggingFace-cache-model")

# Length & Ratio filter
def is_high_quality(x):
    import re  # added: keeps `re` available inside the filter worker processes (num_proc)
    en = x["translation"]["en"]
    zh = x["translation"]["zh"]
    if not en or not zh:
        return False
    if len(en) < 3 or len(zh) < 3:
        return False
    if len(en) > 100 or len(zh) > 100:
        return False
    ratio = len(en) / len(zh)
    if ratio < 0.5 or ratio > 2:
        return False
    if not re.search(r'[\u4e00-\u9fff]', zh):
        return False
    return True

filtered_dataset = full_dataset.filter(is_high_quality, num_proc=10)
dataset = filtered_dataset.select(range(min(5_000_000, len(filtered_dataset))))

print("Full Dataset Size: ", len(full_dataset))
print("Filtered Dataset Size: ", len(filtered_dataset))
print("Dataset Size: ", len(dataset))

# print 10 samples
sample = dataset.shuffle(seed=42).select(range(10))
print("-"*100)
for i in sample:
    print(i["translation"]["en"])
    print(i["translation"]["zh"])
    print("-"*100)
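
To see how these checks behave on concrete pairs, the sketch below re-applies the same rules to plain strings. `passes_filter` is a hypothetical helper and the sample sentences are made up; it does not touch the real dataset.

```python
import re

def passes_filter(en, zh):
    """Same checks as is_high_quality, applied to two plain strings."""
    if not en or not zh:
        return False
    if len(en) < 3 or len(zh) < 3:
        return False
    if len(en) > 100 or len(zh) > 100:
        return False
    ratio = len(en) / len(zh)  # character counts, not token counts
    if ratio < 0.5 or ratio > 2:
        return False
    return re.search(r"[\u4e00-\u9fff]", zh) is not None

pairs = [
    ("Hello world", "你好世界"),
    ("Thanks", "非常感谢你"),
    ("OK", "好的"),
    ("Good morning", "Good morning"),
]
for en, zh in pairs:
    ratio = len(en) / len(zh)
    print(f"{en!r} / {zh!r}: ratio={ratio:.2f}, keep={passes_filter(en, zh)}")
```

Because `len` counts characters and English usually needs more characters than Chinese for the same content, the 0.5 to 2 band can reject valid pairs such as the first one. Comparing the filtered and full dataset sizes printed above is a quick way to check whether the band is too strict.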


In [None]:
# Create PyTorch Dataset and DataLoader for training

import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

class TranslationDataset(Dataset):
    def __init__(self, hf_dataset, tokenizer_en, tokenizer_zh, max_length=100):
        """
        PyTorch Dataset wrapper for HuggingFace translation dataset

        Args:
            hf_dataset: HuggingFace dataset with translation pairs
            tokenizer_en: English tokenizer (e.g. loaded with AutoTokenizer.from_pretrained)
            tokenizer_zh: Chinese tokenizer (e.g. loaded with AutoTokenizer.from_pretrained)
            max_length: Maximum sequence length
        """
        self.dataset = hf_dataset
        self.tokenizer_en = tokenizer_en
        self.tokenizer_zh = tokenizer_zh
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        en_text = item["translation"]["en"]
        zh_text = item["translation"]["zh"]

        en_tokens = self.tokenizer_en(en_text,
                                        max_length=self.max_length,
                                        padding='max_length',
                                        truncation=True,
                                        return_tensors='pt')

        zh_tokens = self.tokenizer_zh(zh_text,
                                        max_length=self.max_length,
                                        padding='max_length',
                                        truncation=True,
                                        return_tensors='pt')

        return {
                'source_ids': en_tokens['input_ids'].squeeze(0),     # drop the extra batch dimension, leaving a 1-D tensor of token IDs
                'target_ids': zh_tokens['input_ids'].squeeze(0),
                'source_text': en_text,
                'target_text': zh_text
        }

def create_dataloaders(dataset, batch_size=64, num_workers=12, train_split=0.95):
    """
    Create train and validation DataLoaders from HuggingFace dataset

    Args:
        dataset: HuggingFace dataset with translation pairs
        batch_size: Batch size for DataLoaders
        num_workers: Number of worker processes for data loading
        train_split: Fraction of data to use for training

    Returns:
        train_dataloader, val_dataloader, vocab_size_en, vocab_size_zh
    """

    # Split dataset into train and validation
    train_size = int(train_split * len(dataset))

    # Create indices for splitting
    indices = list(range(len(dataset)))
    train_indices, val_indices = train_test_split(indices,
                                                train_size=train_size,
                                                random_state=42)

    # Create train and validation datasets
    train_dataset_hf = dataset.select(train_indices)
    val_dataset_hf = dataset.select(val_indices)

    # tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    tokenizer_en = AutoTokenizer.from_pretrained("bert-base-uncased")
    tokenizer_zh = AutoTokenizer.from_pretrained("bert-base-chinese")

    # get vocab sizes
    vocab_size_en = tokenizer_en.vocab_size
    vocab_size_zh = tokenizer_zh.vocab_size

    print(f"Vocab size for en: {vocab_size_en}")
    print(f"Vocab size for zh: {vocab_size_zh}")

    # Create PyTorch datasets
    train_dataset = TranslationDataset(train_dataset_hf, tokenizer_en, tokenizer_zh)
    val_dataset = TranslationDataset(val_dataset_hf, tokenizer_en, tokenizer_zh)

    print(f"Train dataset size: {len(train_dataset)}")
    print(f"Validation dataset size: {len(val_dataset)}")

    # Create DataLoaders
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available()
    )

    val_dataloader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available()
    )

    print(f"Train DataLoader: {len(train_dataloader)} batches")
    print(f"Validation DataLoader: {len(val_dataloader)} batches")

    return train_dataloader, val_dataloader, vocab_size_en, vocab_size_zh

def test_dataloader(dataloader):
    """Test the DataLoader by printing a sample batch"""
    print("\n" + "="*50)
    print("Sample batch from DataLoader:")
    print("="*50)

    # for batch in dataloader:
    #     print(f"Batch size: {len(batch['source_text'])}")
    #     print(f"Source example: {batch['source_text'][0]}")
    #     print(f"Source tokens: {batch['source_ids'][0]}")
    #     print(f"Target example: {batch['target_text'][0]}")
    #     print(f"Target tokens: {batch['target_ids'][0]}")
    #     break

    # Use next(iter(...)) to fetch a single batch directly
    try:
        batch = next(iter(dataloader))
        print(f"Batch size: {len(batch['source_text'])}")
        print(f"Source example: {batch['source_text'][0]}")
        print(f"Source tokens shape: {batch['source_ids'][0].shape}")
        print(f"Target example: {batch['target_text'][0]}")
        print(f"Target tokens shape: {batch['target_ids'][0].shape}")
    except Exception as e:
        print(f"Error getting batch: {e}")

train_dataloader, val_dataloader, encoder_vocab_size, decoder_vocab_size = create_dataloaders(dataset)
test_dataloader(train_dataloader)
test_dataloader(val_dataloader)
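
With `padding='max_length'`, every sentence is padded to 100 tokens, so short sentences are mostly padding. The sketch below compares that with padding each batch only to its longest sentence. It uses made-up token ID lists and hypothetical helpers (`pad_batch`, `pad_fraction`), and assumes pad ID 0, the `[PAD]` ID of the BERT tokenizers.

```python
PAD_ID = 0  # assumed [PAD] ID

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

def pad_fraction(batch, pad_id=PAD_ID):
    """Fraction of positions in the batch that are padding."""
    total = sum(len(seq) for seq in batch)
    pads = sum(seq.count(pad_id) for seq in batch)
    return pads / total

batch = [[101, 7592, 2088, 102], [101, 2204, 102], [101, 102]]  # made-up IDs
fixed = [seq + [PAD_ID] * (100 - len(seq)) for seq in batch]     # max_length=100
dynamic = pad_batch(batch)

print(f"fixed length 100: {pad_fraction(fixed):.0%} padding")
print(f"dynamic padding:  {pad_fraction(dynamic):.0%} padding")
```

Whichever scheme is used, pad positions should not contribute to the loss; in PyTorch one common way is the `ignore_index` argument of `nn.CrossEntropyLoss`.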