# Fine-tune T5 on the English-French subset of the OPUS Books dataset to translate English text to French



## Introduction

Translation converts a sequence of text from one language to another. It is one of several tasks that can be framed as a sequence-to-sequence problem, a general framework for mapping an input sequence to an output sequence, as in translation or summarization. Translation systems are most commonly used to translate text between languages, but they can also handle speech, or a combination of the two, such as text-to-speech or speech-to-text.

## Setup

In [1]:
import torch
torch.cuda.is_available()

True

In [None]:
!pip install datasets evaluate transformers sacrebleu

In [3]:
# Log in to Hugging Face
from huggingface_hub import notebook_login

notebook_login()

## Load OPUS Books dataset

In [4]:
# Load the first 10,000 examples of the English-French subset
from datasets import load_dataset

books = load_dataset("opus_books", "en-fr", split="train[:10000]")
books

Dataset({
    features: ['id', 'translation'],
    num_rows: 10000
})

In [5]:
# Split the dataset into a train and test set
books = books.train_test_split(test_size=0.20)
books

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 2000
    })
})

In [6]:
# Let's print an example
books["train"][0]

{'id': '8759',
 'translation': {'en': '"Child! what do you mean?',
  'fr': '-- Enfant, que voulez-vous dire?'}}

## Data Preprocessing

In [7]:
# Load T5 tokenizer
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [8]:
# Define the preprocessing function
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    # Prefix the input with a prompt so T5 knows this is a translation task
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    # Tokenize the input (English) and target (French) separately
    targets = [example[target_lang] for example in examples["translation"]]
    # Truncate sequences to be no longer than the maximum length
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs
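
To make the shape of the data concrete, the sketch below mimics what `preprocess_function` does before the tokenizer is called, using a hand-written batch in place of a real `datasets` batch. `build_pairs` is a hypothetical helper, and the second sentence pair is made up for illustration. Only the first pair comes from the example printed earlier.

```python
# Minimal sketch: how a batched `translation` column becomes prefixed inputs and targets.
# `build_pairs` is a hypothetical helper; the tokenization step is deliberately left out.
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def build_pairs(examples):
    # With batched=True, examples["translation"] is a list of {"en": ..., "fr": ...} dicts
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    return inputs, targets

batch = {
    "id": ["8759", "0"],
    "translation": [
        {"en": '"Child! what do you mean?', "fr": "-- Enfant, que voulez-vous dire?"},
        {"en": "Good morning.", "fr": "Bonjour."},  # made-up pair for illustration
    ],
}

inputs, targets = build_pairs(batch)
print(inputs[0])
print(targets[0])
```

In the notebook, both lists are then passed to the tokenizer, which returns `input_ids` and `attention_mask` for the inputs and `labels` for the targets.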

In [9]:
# Apply the preprocessing function over entire dataset
tokenized_books = books.map(preprocess_function, batched=True)

In [10]:
# Create batches of examples with DataCollatorForSeq2Seq
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=checkpoint
)
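
The collator pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are invented, and the pad id of 0 is assumed to match the T5 tokenizer. Label positions are padded with -100, the value the loss function ignores.

```python
# Minimal sketch of dynamic padding; not the library implementation.
PAD_TOKEN_ID = 0      # assumption: pad token id of the T5 tokenizer
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss

def pad_batch(features):
    # Pad to the longest sequence in this batch only
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Invented token ids, for illustration only
features = [
    {"input_ids": [13, 7, 1], "labels": [42, 1]},
    {"input_ids": [13, 7, 9, 21, 1], "labels": [42, 8, 5, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```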

## Define Evaluation function

In [11]:
# Load the SacreBLEU metric
import evaluate

metric = evaluate.load("sacrebleu")

In [12]:
# Function to calculate SacreBLEU score
import numpy as np

def preprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = preprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
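
Two steps in `compute_metrics` are easy to miss: the -100 label padding has to be swapped back to the pad token id before decoding, and `gen_len` counts only the non-pad tokens in each prediction. The sketch below reproduces both with plain lists as a stand-in for the NumPy calls. The helper names and token ids are hypothetical, and the pad id of 0 is assumed to match the T5 tokenizer.

```python
# Minimal list-based sketch of the masking and length logic in compute_metrics.
PAD_TOKEN_ID = 0      # assumption: pad token id of the T5 tokenizer
IGNORE_INDEX = -100   # value used to pad labels

def restore_padding(label_rows):
    # Equivalent in spirit to np.where(labels != -100, labels, pad_token_id)
    return [[tok if tok != IGNORE_INDEX else PAD_TOKEN_ID for tok in row]
            for row in label_rows]

def generated_length(pred_row):
    # Equivalent in spirit to np.count_nonzero(pred != pad_token_id)
    return sum(1 for tok in pred_row if tok != PAD_TOKEN_ID)

# Invented token ids, for illustration only
labels = [[42, 1, -100, -100], [42, 8, 5, 1]]
preds = [[0, 42, 7, 1, 0], [0, 42, 8, 5, 1]]

clean_labels = restore_padding(labels)
lengths = [generated_length(p) for p in preds]
mean_len = sum(lengths) / len(lengths)
print(clean_labels)
print(lengths, mean_len)
```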


## Fine-tuning the model

In [13]:
# Load T5 pretrained model checkpoint
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [14]:
# Define training hyperparameters
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-finetuned-opus-books",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)



In [15]:
# Define the Trainer
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [16]:
# Finetune the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|-------|---------------|-----------------|------|---------|
| 1 | 2.2831 | 1.98875 | 3.0971 | 17.7335 |
| 2 | 2.2008 | 1.96529 | 3.1445 | 17.716 |

TrainOutput(global_step=1000, training_loss=2.2419578857421874, metrics={'train_runtime': 252.0876, 'train_samples_per_second': 63.47, 'train_steps_per_second': 3.967, 'total_flos': 392922626260992.0, 'train_loss': 2.2419578857421874, 'epoch': 2.0})
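
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which is consistent with the settings above but not stated explicitly in the notebook. It also converts the final validation loss to a perplexity, assuming the loss is a mean per-token cross-entropy in nats.

```python
import math

# Values taken from the notebook: 8,000 training rows, batch size 16, 2 epochs
train_examples = 8000
batch_size = 16
epochs = 2

# Assumption: one device and no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 1000, matching global_step in TrainOutput

# Assumption: validation loss is mean token-level cross-entropy in nats
val_loss = 1.96529
perplexity = math.exp(val_loss)
print(round(perplexity, 2))
```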

In [17]:
# Share the model on the Hub
trainer.push_to_hub(
    tags="translation",
    commit_message="Training complete",
)

CommitInfo(commit_url='https://huggingface.co/ashaduzzaman/t5-small-finetuned-opus-books/commit/51b7c37d99977cad96cacdb0e9465c52166b27a1', commit_message='Training complete', commit_description='', oid='51b7c37d99977cad96cacdb0e9465c52166b27a1', pr_url=None, pr_revision=None, pr_num=None)

## Inference

In [18]:
from transformers import pipeline

translator = pipeline(
    "translation_en_to_fr",
    model="ashaduzzaman/t5-small-finetuned-opus-books"
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [20]:
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

print(translator(text))

[{'translation_text': 'Legumes partagent des ressources avec des bactéries fixatrices d’azote.'}]
