# Coding: Machine Translation by RNN

 - Dataset: WMT17 en-zh; the target is 5M high-quality pairs, although the filter used below keeps about 1.14M
 - Model: Seq2Seq with an encoder-decoder framework (GRU)
 - GPU: GTX 1660 Ti

# Seq2Seq Encoder-Decoder Architecture

## Overview

The Sequence-to-Sequence (Seq2Seq) model with Encoder-Decoder architecture is a neural network framework designed for tasks where both input and output are sequences of variable length, such as machine translation (English → Chinese in your case).

```
Input Sequence (English):  "Hello world"
                              ↓
                          [ENCODER]
                              ↓
                        Context Vector
                              ↓
                          [DECODER]
                              ↓
Output Sequence (Chinese): "你好世界"
```

## Architecture Components

### 1. Encoder

The encoder processes the input sequence and compresses the information into a fixed-size context vector (also called thought vector).

```
Input: [w1, w2, w3, ..., wn]
       ↓    ↓    ↓       ↓
    [RNN][RNN][RNN]...[RNN]
       ↓    ↓    ↓       ↓
    [h1] [h2] [h3] ... [hn] → Context Vector (hn)
```

#### Key Components:
- **Embedding Layer**: Converts input tokens to dense vectors
- **RNN Layers**: LSTM/GRU cells process the sequence sequentially
- **Hidden States**: Capture information at each time step
- **Final Context**: Last hidden state becomes the context vector

```python
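# Minimal runnable sketch (PyTorch)
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)

    def forward(self, input_seq):
        embedded = self.embedding(input_seq)   # [batch, seq_len, embed_size]
        outputs, hidden = self.rnn(embedded)   # hidden: [num_layers, batch, hidden_size]
        return hidden  # final hidden state of every layer = context vector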
```

### 2. Decoder

The decoder generates the output sequence one token at a time, using the context vector from the encoder.

```
Context Vector (C) → [RNN] → [RNN] → [RNN] → ... → [RNN]
                      ↓       ↓       ↓             ↓
                    [y1]    [y2]    [y3]   ...   [yn]
```

#### Key Components:
- **Initial State**: Initialized with encoder's context vector
- **RNN Layers**: Generate hidden states for each output position
- **Output Projection**: Maps hidden states to vocabulary probabilities
- **Softmax**: Converts logits to probability distribution

```python
# Minimal runnable sketch (PyTorch)
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)
        self.output_projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, target_seq, encoder_hidden):
        # A GRU has no separate cell state, so only the hidden state is passed in
        embedded = self.embedding(target_seq)
        outputs, _ = self.rnn(embedded, encoder_hidden)
        predictions = self.output_projection(outputs)  # logits; softmax is applied inside the loss
        return predictions
```

## Complete Architecture Flow

### Training Phase

```
1. Input Processing:
   English: "Hello world" → [101, 7592, 2088, 102] (tokenized)
   Chinese: "[BOS] 你好世界 [EOS]" → [101, 872, 1962, 686, 4518, 102] (tokenized; BERT's [CLS]=101 and [SEP]=102 act as BOS/EOS)

2. Encoder Forward Pass:
   Input: [101, 7592, 2088, 102]
   ↓
   Embedding: [[0.1, 0.2, ...], [0.3, 0.4, ...], ...]
   ↓
   LSTM/GRU: h1, h2, h3, h4 → Context Vector (h4)

3. Decoder Forward Pass:
   Initial State: (h4) from encoder
   Input: [101, 872, 1962, 686, 4518]
   ↓
   LSTM/GRU: generates hidden states for each position
   ↓
   Output Projection: [vocab_size] logits for each position
   ↓
   Loss Calculation: CrossEntropy with targets [872, 1962, 686, 4518, 102]
```
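
The shift between the decoder input and the loss targets in step 3 is easy to get wrong, so here is a minimal sketch of it. It uses symbolic tokens rather than real vocabulary IDs; the variable names are illustrative only.

```python
# Teacher forcing: the decoder reads the target shifted right by one position
# and is trained to predict the token that follows each input position.
target = ["[BOS]", "你", "好", "世", "界", "[EOS]"]

decoder_input = target[:-1]   # everything except the final [EOS]
labels = target[1:]           # everything except the leading [BOS]

for step, (seen, expected) in enumerate(zip(decoder_input, labels), start=1):
    print(f"step {step}: input={seen!r} -> predict {expected!r}")
```

Both lists have the same length, so every decoder position contributes exactly one term to the loss.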

### Inference Phase

```
1. Encode input sequence: "Hello world"
2. Initialize decoder with encoder's context vector
3. Start with [BOS] token
4. Generate tokens one by one:
   - Input: [BOS] → Output: 你 (probability distribution)
   - Input: [BOS] 你 → Output: 好
   - Input: [BOS] 你 好 → Output: 世
   - Input: [BOS] 你 好 世 → Output: 界
   - Input: [BOS] 你 好 世 界 → Output: [EOS] (stop)
```

## Mathematical Formulation

### Encoder
```
h_t = LSTM/GRU(embedding(x_t), h_{t-1})
context = h_n  # Final hidden state
```

### Decoder
```
s_t = LSTM/GRU(embedding(y_{t-1}), s_{t-1})  # s_0 = context
P(y_t | y_1...y_{t-1}, x) = softmax(W_s * s_t + b_s)
```

### Loss Function
```
Loss = -∑∑ log P(y_t^i | y_1^i...y_{t-1}^i, x^i)
```
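
As a rough sketch of how this loss could be computed in PyTorch, the snippet below sums the negative log-likelihood over all non-padding positions and checks it against the double sum written out by hand. The helper name `sequence_nll` and the toy tensors are assumptions for illustration; padding ID 0 matches the `[PAD]` token of the BERT tokenizers used in this notebook.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumption: padding uses ID 0, as in the BERT vocabularies

def sequence_nll(logits, labels, pad_id=PAD_ID):
    """Summed negative log-likelihood over non-padding positions.

    logits: [batch, seq_len, vocab_size], labels: [batch, seq_len]
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,   # padded positions contribute nothing
        reduction="sum",
    )

torch.manual_seed(0)
logits = torch.randn(2, 4, 10)                        # random stand-in for decoder outputs
labels = torch.tensor([[5, 3, 7, 0], [2, 9, 0, 0]])   # zeros are padding
loss = sequence_nll(logits, labels)

# The same quantity as an explicit double sum over sentences and time steps.
log_probs = F.log_softmax(logits, dim=-1)
picked = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
manual = -(picked * (labels != PAD_ID)).sum()
```

In practice the sum is usually divided by the number of non-padding tokens (the default `reduction="mean"`), which keeps the loss scale independent of batch size.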

## Architecture Advantages

1. **Variable Length Handling**: Can process sequences of different lengths
2. **End-to-End Learning**: Jointly optimizes encoder and decoder
3. **Context Preservation**: Encoder captures semantic meaning in context vector
4. **Language Agnostic**: The same architecture applies to any language pair for which parallel training data exists

## Architecture Limitations

1. **Information Bottleneck**: Fixed-size context vector may lose information
2. **Long Sequence Problem**: Difficulty with very long input sequences
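3. **Sequential Processing**: The recurrence computes time steps one after another, so it cannot be parallelized across the sequence; at inference the decoder additionally generates one token per step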

## Improvements & Variants

1. **Attention Mechanism**: Allows decoder to focus on relevant encoder states
2. **Bidirectional Encoder**: Processes sequence in both directions
3. **Beam Search**: Better decoding strategy than greedy search
4. **Teacher Forcing**: Training technique using ground truth as decoder input
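
To make the first item more concrete, here is a minimal sketch of one common attention variant (dot-product scoring, in the style of Luong attention). It is not part of the model built in this notebook; the function name, argument names, and toy tensors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_outputs, source_mask=None):
    """
    decoder_state:   [batch, hidden]           current decoder hidden state
    encoder_outputs: [batch, src_len, hidden]  all encoder hidden states
    source_mask:     [batch, src_len] bool, True for real (non-padding) tokens
    Returns context [batch, hidden] and weights [batch, src_len].
    """
    # Score every encoder position against the decoder state
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)
    if source_mask is not None:
        scores = scores.masked_fill(~source_mask, float("-inf"))
    weights = F.softmax(scores, dim=1)
    # Weighted sum of encoder states replaces the single fixed context vector
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

torch.manual_seed(0)
enc = torch.randn(2, 5, 8)
dec = torch.randn(2, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
context, weights = dot_product_attention(dec, enc, mask)
print(context.shape, weights.shape)
```

The context is recomputed at every decoding step, which is how attention relieves the information bottleneck listed under the limitations.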

## Implementation Architecture for Your Project

Based on your dataset (WMT-17 EN-ZH) and tokenizers (BERT-based), here's the recommended architecture. The model instantiated later in this notebook uses smaller sizes (embedding 128, hidden 256).

```
Input: English sentence (max_length=100)
↓
BERT Tokenizer (vocab_size=30522) → Token IDs
↓
Embedding Layer (30522 → 512)
↓
Encoder LSTM/GRU (512 → 1024, num_layers=2)
↓
Context Vector (1024-dim)
↓
Decoder LSTM/GRU (512 → 1024, num_layers=2)
↓
Output Projection (1024 → 21128)
↓
Chinese Token IDs → BERT Tokenizer → Chinese sentence
```
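
To get a feel for how large this architecture is, the following sketch counts parameters by hand. It assumes unidirectional GRUs with biases and separate encoder and decoder embeddings, as in the model defined later in this notebook; the helper names are made up for this example.

```python
def gru_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional torch.nn.GRU with biases."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 3 gates, each with input weights, recurrent weights and two bias vectors
        total += 3 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

def seq2seq_params(src_vocab, tgt_vocab, embed, hidden, layers):
    encoder = src_vocab * embed + gru_params(embed, hidden, layers)
    decoder = (tgt_vocab * embed + gru_params(embed, hidden, layers)
               + hidden * tgt_vocab + tgt_vocab)  # output projection weights + bias
    return encoder + decoder

large = seq2seq_params(30522, 21128, 512, 1024, 2)
small = seq2seq_params(30522, 21128, 128, 256, 2)
print(f"512/1024: {large:,} parameters, about {large * 4 / 1024**2:.0f} MB in float32")
print(f"128/256:  {small:,} parameters, about {small * 4 / 1024**2:.0f} MB in float32")
```

The 128/256 count agrees with the 13,423,496 parameters reported in the model summary further down; the 512/1024 configuration is roughly five times larger.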

In [1]:
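# Optional: check network connectivity first
import requests
try:
    response = requests.get("https://huggingface.co", timeout=10)
    print("Network connection OK")
except requests.RequestException:
    print("There may be a problem with the network connection")

Network connection OK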


In [2]:
# Download the data & select 5m high-quality pairs

from datasets import load_dataset
import re

# load full wmt-17 en-zh dataset
full_dataset = load_dataset("wmt/wmt17", "zh-en", split="train", cache_dir=r"D:\Developer\LLM\FuggingFace-cache-model")

# Length & Ratio filter
def is_high_quality(x):
    import re  # added line: import inside the function as well (likely needed by the num_proc worker processes)
    en = x["translation"]["en"]
    zh = x["translation"]["zh"]
    if not en or not zh:
        return False
    if len(en) < 3 or len(zh) < 3:
        return False
    if len(en) > 100 or len(zh) > 100:
        return False
    ratio = len(en) / len(zh)
    if ratio < 0.5 or ratio > 2:
        return False
    if not re.search(r'[\u4e00-\u9fff]', zh):
        return False
    return True

filtered_dataset = full_dataset.filter(is_high_quality, num_proc=10)
dataset = filtered_dataset.select(range(min(5_000_000, len(filtered_dataset))))

print("Full Dataset Size: ", len(full_dataset))
print("Filtered Dataset Size: ", len(filtered_dataset))
print("Dataset Size: ", len(dataset))

# print 10 samples
sample = dataset.shuffle(seed=42).select(range(10))
print("-"*100)
for i in sample:
    print(i["translation"]["en"])
    print(i["translation"]["zh"])
    print("-"*100)


Full Dataset Size:  25134743
Filtered Dataset Size:  1141860
Dataset Size:  1141860
----------------------------------------------------------------------------------------------------
Zambia (7)
赞比亚(7)
----------------------------------------------------------------------------------------------------
15:00 to 18:00 Informal consultations (closed) Conference Room 5 (NLB)
下午3:00－6:00 非正式磋商(闭门会议) 第5会议室(北草坪会议大楼)
----------------------------------------------------------------------------------------------------
Spain
西班牙
----------------------------------------------------------------------------------------------------
Mr. Robert Morrison
Robert Morrison先生 加拿大自然资源部
----------------------------------------------------------------------------------------------------
This satisfied the kids, but not the husband.
"孩子们得到了满意的答案, 但她的丈夫却没有。
----------------------------------------------------------------------------------------------------
Shutaro Omura (Japan)
Shutaro Omura（日本）
---------------
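
The filter keeps 1,141,860 of 25,134,743 pairs (about 4.5%), and the samples are mostly short names and titles. One likely contributor, not verified here, is that the ratio check compares character counts: English text usually needs several times more characters than its Chinese translation, so ordinary sentences tend to exceed the upper bound of 2. The sketch below applies the same checks to plain strings; the second pair is a hypothetical example written for this illustration.

```python
import re

def is_high_quality(en, zh):
    """Same checks as the notebook's filter, applied to plain strings."""
    if not en or not zh:
        return False
    if len(en) < 3 or len(zh) < 3:
        return False
    if len(en) > 100 or len(zh) > 100:
        return False
    ratio = len(en) / len(zh)  # character counts, not token counts
    if ratio < 0.5 or ratio > 2:
        return False
    return bool(re.search(r"[\u4e00-\u9fff]", zh))

pairs = [
    ("Spain", "西班牙"),  # taken from the sample output
    # hypothetical sentence pair, not from the dataset
    ("The committee adopted the draft resolution without a vote.",
     "委员会未经表决通过了决议草案。"),
]
for en, zh in pairs:
    kept = is_high_quality(en, zh)
    print(f"{len(en):3d} / {len(zh):2d} chars, ratio {len(en) / len(zh):.2f}, kept: {kept}")
```

If more full sentences are wanted, widening the ratio bounds or measuring length in tokens instead of characters would be worth trying.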

In [3]:
# Create PyTorch Dataset and DataLoader for training

import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Custom dataset: wraps the raw HuggingFace dataset and tokenizes each pair on access
class TranslationDataset(Dataset):
    def __init__(self, hf_dataset, tokenizer_en, tokenizer_zh, max_length=100):
        """
        PyTorch Dataset wrapper for HuggingFace translation dataset

        Args:
            hf_dataset: HuggingFace dataset with translation pairs
            tokenizer_en: English tokenizer (applied to the source text)
            tokenizer_zh: Chinese tokenizer (applied to the target text)
            max_length: Maximum sequence length
        """
        self.dataset = hf_dataset
        self.tokenizer_en = tokenizer_en
        self.tokenizer_zh = tokenizer_zh
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        en_text = item["translation"]["en"]
        zh_text = item["translation"]["zh"]

        en_tokens = self.tokenizer_en(en_text,
                                        max_length=self.max_length,
                                        padding='max_length',
                                        truncation=True,
                                        return_tensors='pt')

        zh_tokens = self.tokenizer_zh(zh_text,
                                        max_length=self.max_length,
                                        padding='max_length',
                                        truncation=True,
                                        return_tensors='pt')

        return {
                'source_ids': en_tokens['input_ids'].squeeze(0),    # drop the batch dimension -> 1-D tensor of token IDs
                'target_ids': zh_tokens['input_ids'].squeeze(0),
                'source_text': en_text,
                'target_text': zh_text
        }

# Build the training and validation DataLoaders
def create_dataloaders(dataset, batch_size=64, num_workers=0, train_split=0.95):
    """
    Create train and validation DataLoaders from HuggingFace dataset

    Args:
        dataset: HuggingFace dataset with translation pairs
        batch_size: Batch size for DataLoaders
        num_workers: Number of worker processes for data loading
        train_split: Fraction of data to use for training

    Returns:
        train_dataloader, val_dataloader, vocab_size_en, vocab_size_zh
    """

    # Split dataset into train and validation
    train_size = int(train_split * len(dataset))

    # Create indices for splitting
    indices = list(range(len(dataset)))
    train_indices, val_indices = train_test_split(indices,
                                                train_size=train_size,
                                                random_state=42)

    # Create train and validation datasets
    train_dataset_hf = dataset.select(train_indices)
    val_dataset_hf = dataset.select(val_indices)

    # tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    tokenizer_en = AutoTokenizer.from_pretrained("bert-base-uncased")
    tokenizer_zh = AutoTokenizer.from_pretrained("bert-base-chinese")

    # get vocab sizes
    vocab_size_en = tokenizer_en.vocab_size
    vocab_size_zh = tokenizer_zh.vocab_size

    print(f"Vocab size for en: {vocab_size_en}")
    print(f"Vocab size for zh: {vocab_size_zh}")

    # Create PyTorch datasets
    train_dataset = TranslationDataset(train_dataset_hf, tokenizer_en, tokenizer_zh)
    val_dataset = TranslationDataset(val_dataset_hf, tokenizer_en, tokenizer_zh)

    print(f"Train dataset size: {len(train_dataset)}")
    print(f"Validation dataset size: {len(val_dataset)}")

    # Create DataLoaders
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # pin_memory=True if torch.cuda.is_available() else False
    )

    val_dataloader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        # pin_memory=True if torch.cuda.is_available() else False
    )

    print(f"Train DataLoader: {len(train_dataloader)} batches")
    print(f"Validation DataLoader: {len(val_dataloader)} batches")

    return train_dataloader, val_dataloader, vocab_size_en, vocab_size_zh

def test_dataloader(dataloader):
    """Test the DataLoader by printing a sample batch"""
    print("\n" + "="*50)
    print("Sample batch from DataLoader:")
    print("="*50)

    # for batch in dataloader:
    #     print(f"Batch size: {len(batch['source_text'])}")
    #     print(f"Source example: {batch['source_text'][0]}")
    #     print(f"Source tokens: {batch['source_ids'][0]}")
    #     print(f"Target example: {batch['target_text'][0]}")
    #     print(f"Target tokens: {batch['target_ids'][0]}")
    #     break

    # Fetch a single batch directly with next(iter(...))
    try:
        batch = next(iter(dataloader))
        print(f"Batch size: {len(batch['source_text'])}")
        print(f"Source example: {batch['source_text'][0]}")
        print(f"Source tokens shape: {batch['source_ids'][0].shape}")
        print(f"Target example: {batch['target_text'][0]}")
        print(f"Target tokens shape: {batch['target_ids'][0].shape}")
    except Exception as e:
        print(f"Error getting batch: {e}")

train_dataloader, val_dataloader, encoder_vocab_size, decoder_vocab_size = create_dataloaders(dataset)
test_dataloader(train_dataloader)
test_dataloader(val_dataloader)

Vocab size for en: 30522
Vocab size for zh: 21128
Train dataset size: 1084767
Validation dataset size: 57093
Train DataLoader: 16950 batches
Validation DataLoader: 893 batches

Sample batch from DataLoader:
Batch size: 64
Source example: Miss Graziella Dubra
Source tokens shape: torch.Size([100])
Target example: Graziella Dubra小姐
Target tokens shape: torch.Size([100])

Sample batch from DataLoader:
Batch size: 64
Source example: 74 and A/60/640/Add.1, para. 6
Source tokens shape: torch.Size([100])
Target example: (A/60/640/Add.1，第6段)
Target tokens shape: torch.Size([100])
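
Every example above is padded to 100 tokens even though most sentences are short. A possible alternative, sketched here under the assumption that the tokenizer is called without `padding='max_length'` so each item holds unpadded 1-D tensors, is to pad per batch with a custom `collate_fn`. The function name `collate_dynamic` and the toy token IDs are made up for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # [PAD] in the BERT vocabularies used here

def collate_dynamic(batch):
    """Pad each batch only to the length of its own longest sequence."""
    source = pad_sequence([item["source_ids"] for item in batch],
                          batch_first=True, padding_value=PAD_ID)
    target = pad_sequence([item["target_ids"] for item in batch],
                          batch_first=True, padding_value=PAD_ID)
    return {"source_ids": source, "target_ids": target}

# Toy batch with made-up token IDs
batch = [
    {"source_ids": torch.tensor([101, 7, 8, 102]), "target_ids": torch.tensor([101, 5, 102])},
    {"source_ids": torch.tensor([101, 9, 102]), "target_ids": torch.tensor([101, 4, 6, 3, 102])},
]
out = collate_dynamic(batch)
print(out["source_ids"].shape, out["target_ids"].shape)
```

It would be passed to the loader as `DataLoader(train_dataset, batch_size=64, collate_fn=collate_dynamic)`, which shortens the sequences the GRU has to step through.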


In [4]:
# Define the Seq2Seq model with GRU

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """
    Encoder component of the Seq2Seq model using GRU
    Processes the input sequence and generates context vectors
    """
    def __init__(self, vocab_size, embed_size=512, hidden_size=1024, num_layers=2, dropout=0.1):
        super(Encoder, self).__init__()

        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer to convert token IDs to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)

        # GRU layer for processing sequences
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers,
                         batch_first=True, dropout=dropout, bidirectional=False)

    # Input: token ID sequences (already padded/truncated to a fixed length by the dataset); output: all hidden states plus the final hidden state
    def forward(self, input_seq):
        """
        Forward pass of the encoder

        Args:
            input_seq: Input token sequences [batch_size, seq_len]

        Returns:
            outputs: All hidden states [batch_size, seq_len, hidden_size]
            hidden: Final hidden state [num_layers, batch_size, hidden_size]
        """
        # Convert token IDs to embeddings
        embedded = self.embedding(input_seq)  # [batch_size, seq_len, embed_size]

        # Pass through GRU
        outputs, hidden = self.rnn(embedded)

        # outputs: [batch_size, seq_len, hidden_size]
        # hidden: [num_layers, batch_size, hidden_size]

        return outputs, hidden

class Decoder(nn.Module):
    """
    Decoder component of the Seq2Seq model using GRU
    Generates output sequence one token at a time
    """
    def __init__(self, vocab_size, embed_size=512, hidden_size=1024, num_layers=2, dropout=0.1):
        super(Decoder, self).__init__()

        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer for target tokens
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)

        # GRU layer for generating sequences
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers,
                         batch_first=True, dropout=dropout, bidirectional=False)

        # Output projection layer to vocabulary
        self.output_projection = nn.Linear(hidden_size, vocab_size)

    # Takes one token per sequence and returns logits over the vocabulary (used to pick the next token)
    def forward(self, input_token, hidden):
        """
        Forward pass of the decoder (single step)

        Args:
            input_token: Current input token [batch_size, 1] (one token in, one prediction out)
            hidden: Hidden state from encoder/previous step [num_layers, batch_size, hidden_size]

        Returns:
            output: Vocabulary predictions [batch_size, vocab_size]
            hidden: Updated hidden state [num_layers, batch_size, hidden_size]
        """
        # input_token: [batch_size, 1]
        embedded = self.embedding(input_token)  # [batch_size, 1, embed_size]

        # Pass through GRU
        gru_out, hidden = self.rnn(embedded, hidden)
        # gru_out: [batch_size, 1, hidden_size]
        # hidden: [num_layers, batch_size, hidden_size]

        # Project to vocabulary
        output = self.output_projection(gru_out.squeeze(1))  # [batch_size, hidden_size] -> [batch_size, vocab_size]

        # output: [batch_size, vocab_size]
        # hidden: [num_layers, batch_size, hidden_size]
        return output, hidden

    def forward_parallel(self, input_seq, hidden):
        """
        Parallel forward pass for training (teacher forcing).
        Feeds the entire target sequence to the GRU in one call instead of
        looping over time steps in Python, which is considerably faster.
        """
        # input_seq: [batch_size, seq_len]
        embedded = self.embedding(input_seq)  # [batch_size, seq_len, embed_size]

        # One GRU call covers the whole sequence; the time loop runs inside the optimized backend.
        # Inference (generate) still has to decode one token at a time.
        gru_out, final_hidden = self.rnn(embedded, hidden)
        # gru_out: [batch_size, seq_len, hidden_size]

        # Project to vocabulary for all timesteps
        outputs = self.output_projection(gru_out)  # [batch_size, seq_len, vocab_size]

        # outputs : [batch_size, seq_len, vocab_size]
        # final_hidden: [num_layers, batch_size, hidden_size]
        return outputs, final_hidden

class Seq2Seq(nn.Module):
    """
    Complete Sequence-to-Sequence model using GRU
    Combines Encoder and Decoder for translation
    """
    def __init__(self, encoder_vocab_size, decoder_vocab_size, embed_size=512,
                 hidden_size=1024, num_layers=2, dropout=0.1):
        super(Seq2Seq, self).__init__()

        self.encoder_vocab_size = encoder_vocab_size
        self.decoder_vocab_size = decoder_vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Initialize encoder and decoder
        self.encoder = Encoder(encoder_vocab_size, embed_size, hidden_size, num_layers, dropout)
        self.decoder = Decoder(decoder_vocab_size, embed_size, hidden_size, num_layers, dropout)

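    # Produce next-token logits for every target position from the source and target sequences
    def forward(self, source_seq, target_seq):
        """
        Training forward pass: teacher forcing with the decoder run over the
        whole target sequence in one call.

        Note: outputs[:, t] is the prediction for the token after target_seq[:, t],
        so the loss should compare outputs[:, :-1] with target_seq[:, 1:].
        """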

        # hidden: [num_layers, batch_size, hidden_size]
        _, hidden = self.encoder(source_seq)

        # output: [batch_size, seq_len, vocab_size]
        outputs, _ = self.decoder.forward_parallel(target_seq, hidden)

        # output: [batch_size, seq_len, vocab_size]
        return outputs

    # Generate the target sequence from the source sequence alone
    def generate(self, source_seq, max_length=100, start_token=101, end_token=102):
        """
        Generate translation for given source sequence (inference mode)

        Args:
            source_seq: Source sequence [batch_size, source_len]
            max_length: Maximum length of generated sequence
            start_token: BOS token ID (101 for BERT)
            end_token: EOS token ID (102 for BERT)

        Returns:
            generated_seq: Generated sequence [batch_size, generated_len]
        """
        self.eval()
        batch_size = source_seq.size(0)

        with torch.no_grad():
            # Encode source sequence
            _, hidden = self.encoder(source_seq)

            # Initialize with start token
            # decoder_input: [batch_size, 1]
            decoder_input = torch.full((batch_size, 1), start_token, dtype=torch.long).to(source_seq.device)

            # Store generated tokens and track which sequences have already emitted EOS
            generated_tokens = []
            finished = torch.zeros(batch_size, dtype=torch.bool, device=source_seq.device)

            for _ in range(max_length):
                # Get next token prediction
                # output: [batch_size, vocab_size]
                output, hidden = self.decoder(decoder_input, hidden)

                # Greedy choice: the token with the highest score
                # next_token: [batch_size, 1]
                next_token = output.argmax(dim=1, keepdim=True)
                # Sequences that already finished keep emitting padding (ID 0)
                next_token = next_token.masked_fill(finished.unsqueeze(1), 0)
                generated_tokens.append(next_token)

                # Use predicted token as next input
                decoder_input = next_token

                # Stop once every sequence has produced EOS at some step
                finished |= next_token.squeeze(1) == end_token
                if finished.all():
                    break

            # Concatenate all generated tokens
            # generated_seq: [batch_size, generated_len]
            generated_seq = torch.cat(generated_tokens, dim=1)

        return generated_seq

# Model configuration based on your dataset
model_config = {
    'encoder_vocab_size': encoder_vocab_size,  # 30522 (English BERT)
    'decoder_vocab_size': decoder_vocab_size,  # 21128 (Chinese BERT)
    'embed_size': 128,
    'hidden_size': 256,
    'num_layers': 2,
    'dropout': 0.1
}

# Initialize the model
model = Seq2Seq(**model_config)

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print("=== Seq2Seq Model with GRU Architecture ===")
print(f"Device: {device}")
print(f"Encoder Vocabulary Size: {model_config['encoder_vocab_size']:,}")
print(f"Decoder Vocabulary Size: {model_config['decoder_vocab_size']:,}")
print(f"Embedding Size: {model_config['embed_size']}")
print(f"Hidden Size: {model_config['hidden_size']}")
print(f"Number of Layers: {model_config['num_layers']}")
print(f"Dropout Rate: {model_config['dropout']}")
print(f"RNN Type: GRU (Gated Recurrent Unit)")

# Calculate total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nTotal Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")
print(f"Model Size: {total_params * 4 / 1024 / 1024:.2f} MB")

# Print model architecture
print(f"\n=== Model Architecture ===")
print(model)

# Test the model with a sample batch
print(f"\n=== Testing Model Forward Pass ===")
sample_batch = next(iter(train_dataloader))
source_ids = sample_batch['source_ids'].to(device)
target_ids = sample_batch['target_ids'].to(device)

print(f"Source shape: {source_ids.shape}")
print(f"Target shape: {target_ids.shape}")

# Forward pass
model.train()
outputs = model(source_ids, target_ids)
print(f"Output shape: {outputs.shape}")
print(f"Expected shape: [batch_size, target_len, decoder_vocab_size]")
print(f"Actual shape: [{outputs.shape[0]}, {outputs.shape[1]}, {outputs.shape[2]}]")        # [batch_size, seq_len, vocab_size]

# Test generation
print(f"\n=== Testing Model Generation ===")
model.eval()
with torch.no_grad():
    generated = model.generate(source_ids[:2], max_length=100)  # Generate for first 2 samples
    print(f"Generated sequence shape: {generated.shape}")
    print(f"Generated tokens (first sample): {generated[0].tolist()}")

=== Seq2Seq Model with GRU Architecture ===
Device: cuda
Encoder Vocabulary Size: 30,522
Decoder Vocabulary Size: 21,128
Embedding Size: 128
Hidden Size: 256
Number of Layers: 2
Dropout Rate: 0.1
RNN Type: GRU (Gated Recurrent Unit)

Total Parameters: 13,423,496
Trainable Parameters: 13,423,496
Model Size: 51.21 MB

=== Model Architecture ===
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(30522, 128, padding_idx=0)
    (rnn): GRU(128, 256, num_layers=2, batch_first=True, dropout=0.1)
  )
  (decoder): Decoder(
    (embedding): Embedding(21128, 128, padding_idx=0)
    (rnn): GRU(128, 256, num_layers=2, batch_first=True, dropout=0.1)
    (output_projection): Linear(in_features=256, out_features=21128, bias=True)
  )
)

=== Testing Model Forward Pass ===
Source shape: torch.Size([64, 100])
Target shape: torch.Size([64, 100])
Output shape: torch.Size([64, 100, 21128])
Expected shape: [batch_size, target_len, decoder_vocab_size]
Actual shape: [64, 100, 21128]

=== Testing Model Ge