<img src="../images/cover.jpg" width="1920"/>

In [1]:
# !pip install datasets -q

# Encoder-Decoder Models

Encoder-decoder models are a foundational architecture in deep learning, primarily designed for sequence-to-sequence (seq2seq) tasks like machine translation, text summarization, and question answering. These models consist of two distinct components:

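1. **Encoder**: Encodes the input sequence into a latent representation. In classic RNN-based seq2seq models this is a single fixed-size vector; in Transformer-based models it is a sequence of contextual hidden states, one per input token. This representation captures the semantic meaning of the input.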

2. **Decoder**: Decodes the latent representation into the output sequence. It generates tokens step-by-step, conditioning on both the latent representation and the tokens generated so far.

<img src="../images/encoder-decoder-model.png" width="1920"/>

Encoder-decoder models closely resemble the architecture proposed in the seminal paper *"Attention Is All You Need"*, which introduced the Transformer. The Transformer uses self-attention mechanisms to capture relationships between tokens in a sequence, regardless of their positional distance.
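To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The helper names (`softmax`, `attention`) and the toy vectors are illustrative assumptions, not part of this notebook or of any library; real Transformers add learned projections, multiple heads, and batched tensor operations.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy "token" vectors; in self-attention, queries, keys, and values
# all come from the same sequence
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and covers every position in the sequence, which is why a token can attend to any other token in a single step, however far apart they are.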



#### Workflow to Create a Transformer Model:
1. **Define the Encoder and Decoder**:
   - The encoder encodes the input sequence.
   - The decoder uses cross-attention to decode and generate the output sequence.

2. **Combine Components**:
   - Use `EncoderDecoderModel` to combine the encoder and decoder into a single architecture.

3. **Configure Special Tokens**:
   - Define special tokens like `pad_token_id`, `eos_token_id`, and `decoder_start_token_id` to control input and output processing.

4. **Train or Fine-tune**:
   - Fine-tune the model on seq2seq datasets using a training utility such as `Seq2SeqTrainer`.

The `EncoderDecoderModel` abstracts much of the complexity, enabling researchers and developers to focus on the task at hand while leveraging state-of-the-art Transformer capabilities.

In [2]:
from transformers import (
    EncoderDecoderModel,
    BertTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)


from datasets import load_dataset


import torch

In [None]:
# Check for GPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# Data Preparation
This section prepares the data for training an encoder-decoder model on a machine translation task: English ("en") to French ("fr"). The steps are:

1. **Loading the Tokenizer**:
   - The `BertTokenizer` is loaded from the pre-trained `prajjwal1/bert-small` checkpoint.
   - Special tokens for the beginning-of-sequence (`bos_token`) and end-of-sequence (`eos_token`) are mapped to BERT's `[CLS]` and `[SEP]` tokens, respectively, for compatibility with seq2seq tasks.

In [None]:
# Load tokenizer (BERT tokenizer in this example)
tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-small")

In [31]:
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

2. **Loading the Dataset**:
   - The `opus_books` dataset, a machine translation dataset for English-to-French, is loaded using the Hugging Face `datasets` library.
   - The dataset is split into training (90%) and validation (10%) subsets.

In [32]:
# Load dataset
train_data = load_dataset("opus_books", "en-fr", split="train[10%:]")
val_data = load_dataset("opus_books", "en-fr", split="train[:10%]")
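
The two `split` strings above carve the single `train` split into two non-overlapping parts. The idea can be pictured with plain Python lists. This is only an analogy, assuming the percent boundary falls on a whole index; the `datasets` library applies its own rounding rules when it does not. The name `split_by_percent` is hypothetical.

```python
def split_by_percent(items, percent):
    """Return (first `percent`% of items, the remainder) as two disjoint lists."""
    cut = len(items) * percent // 100
    return items[:cut], items[cut:]


examples = list(range(200))  # stand-in for 200 sentence pairs
val_part, train_part = split_by_percent(examples, 10)

print(len(val_part), len(train_part))  # 20 180
```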

In [None]:
train_data, val_data

In [34]:
# Use only the first 1,000 training and validation examples to keep the notebook fast - DELETE THESE LINES FOR FULL TRAINING
train_data = train_data.select(range(1000))
val_data = val_data.select(range(1000))

3. **Preprocessing Function**:
   - The `preprocess_function` tokenizes the input and target sentences:
     - **Input**: English sentences are tokenized into `input_ids` and padded/truncated to a maximum length of 128.
     - **Target**: French sentences are tokenized similarly, with their tokenized IDs stored as `labels`.
   - Both `input_ids` (tokenized input) and `attention_mask` (mask for padding tokens) are extracted for the model's encoder, while the target `input_ids` serve as labels for the decoder.

In [35]:
# Preprocessing function
def preprocess_function(examples):
    # Each item in examples["translation"] is a dict with "en" and "fr" keys
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, max_length=128, truncation=True, padding="max_length"
    )
    labels = tokenizer(targets, max_length=128, truncation=True, padding="max_length")

    examples["input_ids"] = model_inputs["input_ids"]
    examples["attention_mask"] = model_inputs["attention_mask"]
    # Replace padding token IDs in the labels with -100 so the loss ignores them
    examples["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return examples
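
To see what `padding="max_length"` and `truncation=True` do to a single sequence, here is a simplified pure-Python sketch. The function `pad_or_truncate` and the token IDs are made up for illustration; the real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]        # truncate sequences that are too long
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks a padding position the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([101, 2026, 2171, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)  # [101, 2026, 2171, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```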

4. **Dataset Preprocessing**:
   - The `train_data` and `val_data` subsets are preprocessed using the `preprocess_function`:
     - The tokenized inputs and labels are added to the dataset.
     - Original dataset columns are removed to keep only model-relevant fields (`input_ids`, `attention_mask`, and `labels`).

In [None]:
# Preprocess the dataset
encoded_train_dataset = train_data.map(
    preprocess_function, batched=True, remove_columns=train_data.column_names
)
encoded_val_dataset = val_data.map(
    preprocess_function, batched=True, remove_columns=val_data.column_names
)

This results in tokenized and structured datasets ready for training and evaluation in a seq2seq task using an encoder-decoder model.

# Hugging Face's `EncoderDecoderModel`

The Hugging Face `EncoderDecoderModel` is a flexible framework for building and fine-tuning encoder-decoder architectures. It allows users to combine different pre-trained models for the encoder and decoder or train a model from scratch.

#### Features:
1. **Modular Design**: You can pair any supported Transformer-based encoder (e.g., BERT, RoBERTa) with a compatible decoder (e.g., GPT-2, BERT).
2. **Ease of Customization**: The library supports defining custom configurations for the encoder and decoder, including the number of layers, attention heads, and hidden dimensions.
3. **Pre-trained Models**: You can start with pre-trained weights to leverage existing knowledge or train a model from scratch for specialized tasks.
4. **Training Support**: Includes tools like `Seq2SeqTrainer` for efficient training, evaluation, and inference.
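
As an illustration of the customization and from-scratch options listed above, the following sketch builds a small, randomly initialized encoder-decoder from two `BertConfig` objects. The layer counts and sizes are arbitrary values chosen to keep the example light, and the names `scratch_config` and `scratch_model` are hypothetical; this notebook itself uses pre-trained weights instead.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Small, arbitrary sizes chosen only to keep the example light
encoder_config = BertConfig(
    hidden_size=128, num_hidden_layers=2, num_attention_heads=2, intermediate_size=256
)
decoder_config = BertConfig(
    hidden_size=128, num_hidden_layers=2, num_attention_heads=2, intermediate_size=256
)

# Combines both configs; the decoder config is marked as a decoder
# and gets cross-attention enabled
scratch_config = EncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

# Randomly initialized weights: no checkpoint is downloaded
scratch_model = EncoderDecoderModel(config=scratch_config)
print(f"{scratch_model.num_parameters():,} parameters")
```

A model built this way has no prior knowledge, so it generally needs far more data and training time than a model warm-started from pre-trained checkpoints.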

In [None]:
# Load EncoderDecoderModel
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "prajjwal1/bert-small", "prajjwal1/bert-small"
)

The method `model.num_parameters()` returns the total number of parameters in a model; passing `only_trainable=True` restricts the count to parameters that are updated during training. Parameters are the weights and biases that the model learns during training, and their count gives an indication of the model's complexity and size.

In [None]:
# Count only parameters with requires_grad=True so the number matches the label
print(f"{model.num_parameters(only_trainable=True):,} trainable parameters")
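
Counting parameters amounts to summing the element counts of the model's tensors. A minimal sketch with a tiny PyTorch layer makes the arithmetic visible; the layer, its sizes, and the frozen bias are arbitrary choices for illustration.

```python
import torch.nn as nn

layer = nn.Linear(4, 3)            # weight: 3 x 4 = 12 values, bias: 3 values
layer.bias.requires_grad = False   # freeze the bias so the two counts differ

total = sum(p.numel() for p in layer.parameters())
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)

print(total, trainable)  # 15 12
```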

### Configuration
This step configures the encoder-decoder model with the special tokens it needs:

1. **`decoder_start_token_id`**:
   - Specifies the token ID that signals the start of the decoder's generation process.
   - Here, it is set to the ID of the beginning-of-sequence (`bos_token`).

2. **`eos_token_id`**:
   - Specifies the token ID that indicates the end of a sequence during decoding.
   - Here, it is set to the ID of the end-of-sequence (`eos_token`).

3. **`pad_token_id`**:
   - Specifies the token ID used for padding sequences to ensure uniform length during training or inference.
   - Here, it is set to the ID of the padding token (`pad_token`).

These configurations are critical for properly managing sequence boundaries and padding in encoder-decoder models during training and generation.

In [39]:
# Set special tokens
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
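
One place these IDs matter is training: when only `labels` are supplied, the decoder inputs are derived by shifting the labels one position to the right and placing `decoder_start_token_id` at the front. The pure-Python sketch below mimics that idea on lists. The token IDs are made up, and the library's own implementation works on tensors rather than lists.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels: prepend the start token, drop the last label."""
    shifted = []
    for seq in labels:
        row = [decoder_start_token_id] + seq[:-1]
        # Positions masked out of the loss (-100) become ordinary padding
        row = [pad_token_id if tok == -100 else tok for tok in row]
        shifted.append(row)
    return shifted


# Made-up IDs: 101 = start token, 102 = end token, 0 = padding
labels = [[7, 8, 102, -100], [9, 102, -100, -100]]
decoder_input_ids = shift_tokens_right(
    labels, pad_token_id=0, decoder_start_token_id=101
)

print(decoder_input_ids)  # [[101, 7, 8, 102], [101, 9, 102, 0]]
```

At each position the decoder sees the tokens up to that point and is trained to predict the label at the same index, which is the next token in the target sentence.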

### Training Arguments

`Seq2SeqTrainingArguments` specifies the configuration and hyperparameters needed for training and evaluating a sequence-to-sequence (seq2seq) model. It provides an easy way to define settings like batch size, learning rate, number of epochs, and evaluation strategy, which the trainer uses during the training process.

In [40]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=30,
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)
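
As a sanity check on these settings, the number of optimizer steps can be estimated by hand. The sketch assumes a single device, no gradient accumulation, and the 1,000-example training subset selected earlier; the variable names are illustrative.

```python
import math

num_examples = 1000   # size of the training subset used in this notebook
batch_size = 16       # per_device_train_batch_size
num_epochs = 30       # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * num_epochs
logged_entries = total_steps // 10                       # logging_steps=10

print(steps_per_epoch, total_steps, logged_entries)  # 63 1890 189
```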

### Trainer
`Seq2SeqTrainer` manages the entire training and evaluation process for seq2seq models. It handles:
- Optimizing the model.
- Computing gradients.
- Running evaluations during training.
- Generating predictions for validation.
- Saving and loading model checkpoints.

In [41]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_val_dataset,
    processing_class=tokenizer,
)

In [None]:
# Train the model
trainer.train()

Save the trained model and tokenizer to the specified directory, `"translation_model"`, so they can be reloaded later for inference or further fine-tuning. The `save_pretrained` method ensures all necessary configurations and weights are stored for both the model and tokenizer.

In [None]:
# Save the model
model.save_pretrained("translation_model")
tokenizer.save_pretrained("translation_model")

print("Model training completed and saved!")

Next, we use the fine-tuned encoder-decoder model for translation: we load the saved model and its tokenizer from disk and move the model to a GPU (if available) for efficient computation.

In [None]:
from transformers import EncoderDecoderModel, BertTokenizer
import torch

# Load the saved model and tokenizer
model = EncoderDecoderModel.from_pretrained("translation_model")
tokenizer = BertTokenizer.from_pretrained("translation_model")

# Move model to device (GPU or CPU)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

Then we tokenize an example input text ("my name") and prepare it for inference. The input is passed through the model to generate a translation, using beam search for better quality.

In [47]:
# Example input for inference
input_text = "my name"
inputs = tokenizer(
    input_text, return_tensors="pt", padding=True, truncation=True, max_length=128
)

# Move inputs to device
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate translation
output_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # tells the encoder which positions are padding
    max_length=128,
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=tokenizer.cls_token_id,
)
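
To show why `num_beams=4` can beat greedy decoding, here is a toy beam search in plain Python. The probability table `NEXT` is entirely hypothetical and stands in for the model's next-token distribution; the real `generate` method additionally handles batching, length penalties, and stopping criteria.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":  # finished hypotheses are carried over
                candidates.append((tokens, score))
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0]


greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'a', 'x', '</s>'] 0.33
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'b', 'x', '</s>'] 0.36
```

Greedy decoding commits to `a` because it is the most likely first token, while the wider beam keeps `b` alive long enough to discover the more probable full sequence.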

The translated output is decoded back into text and printed along with the original input.

In [None]:
translation = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("Input:", input_text)
print("Translation:", translation)