# English to French Translation with Hugging Face

This notebook demonstrates how to fine-tune a pre-trained `Helsinki-NLP/opus-mt-en-fr` model for English-to-French translation using the Hugging Face `transformers`, `datasets`, and `accelerate` libraries.

We will cover the following steps:
1.  **Setup**: Install and import the necessary libraries.
2.  **Load Data**: Load a sample dataset for translation.
3.  **Preprocessing**: Tokenize the source (English) and target (French) texts.
4.  **Fine-Tuning**: Set up the trainer and fine-tune the model on our dataset.
5.  **Inference**: Use the fine-tuned model with the `pipeline` API to translate new sentences.

## 1. Setup

First, let's install the required libraries. We need `transformers` for the models, `datasets` to handle the data, `accelerate` to optimize training, and `sacrebleu` for evaluation metrics.

In [None]:
!pip install transformers[torch] datasets sacrebleu accelerate -q

Now, let's import all the necessary components.

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    pipeline
)

## 2. Load Data

We'll use the `opus_books` dataset, which contains aligned sentences from translated books, and select its English-French (`en-fr`) configuration. To keep training fast for this demonstration, we load only the first 1% of the training split and then hold out 20% of that subset for validation.

In [None]:
# Load a smaller subset for demonstration purposes
raw_dataset = load_dataset("opus_books", "en-fr", split='train[:1%]')

# Split the dataset into training and validation sets
split_dataset = raw_dataset.train_test_split(test_size=0.2, seed=42)

print("Training set size:", len(split_dataset['train']))
print("Validation set size:", len(split_dataset['test']))
print("\nSample:", split_dataset['train'][0])
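Each row stores both languages in a nested `translation` dictionary, and that layout determines how the preprocessing function in the next section reads a batch. The sketch below uses two hypothetical records (not taken from the dataset) to show how rows are regrouped into columns when `map` is called with `batched=True`.

```python
# Hypothetical records that mimic the row layout of the opus_books "en-fr" configuration
records = [
    {"id": "0", "translation": {"en": "Good morning.", "fr": "Bonjour."}},
    {"id": "1", "translation": {"en": "Thank you.", "fr": "Merci."}},
]

# With batched=True, map() hands the function a dict of columns (lists), not a list of rows
batch = {key: [row[key] for row in records] for key in records[0]}

# Pull the source and target sentences out of the nested dictionaries
english = [pair["en"] for pair in batch["translation"]]
french = [pair["fr"] for pair in batch["translation"]]

print(english)  # ['Good morning.', 'Thank you.']
print(french)   # ['Bonjour.', 'Merci.']
```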

## 3. Preprocessing

Next, we need to convert our text data into a format the model can understand. We'll use a tokenizer that corresponds to our pre-trained model.

We use `Helsinki-NLP/opus-mt-en-fr`, a MarianMT checkpoint that is already trained for English-to-French translation, so fine-tuning only has to adapt it to our data.

In [None]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

We'll create a preprocessing function to tokenize the English text as input and the French text as the target label.

In [None]:
source_lang = "en"
target_lang = "fr"
max_input_length = 128
max_target_length = 128

def preprocess_function(examples):
    # examples['translation'] is a list of dicts {'en': ..., 'fr': ...}
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    
    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    
    # Tokenize targets
    # Passing the French text as 'text_target' makes the tokenizer process it as the target language
    # (this replaces the deprecated 'tokenizer.as_target_tokenizer()' context manager)
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
        
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# Apply the preprocessing function to the entire dataset
tokenized_datasets = split_dataset.map(preprocess_function, batched=True)

# Let's check the structure of our tokenized data
print(tokenized_datasets['train'][0].keys())

## 4. Fine-Tuning the Model

Now we are ready to set up the training process. We start by loading the pre-trained model.

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Next, we define the training arguments. These arguments control various hyperparameters like learning rate, batch size, number of epochs, and evaluation strategy.

In [None]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
output_dir = f"{model_name}-finetuned-en-to-fr"

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",  # newer transformers releases name this argument 'eval_strategy'
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3, # Increased epochs for better learning on small dataset
    predict_with_generate=True,
    fp16=torch.cuda.is_available(), # Use mixed precision if a GPU is available
    push_to_hub=False # Set to True if you want to upload the model to the Hub
)
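To get a feel for how the batch size and epoch count translate into training length, the following rough sketch estimates the number of optimizer steps. It assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper, and the 1,000-example figure is only illustrative, not the actual size of our training split.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps, assuming one device and no gradient accumulation."""
    # The last, smaller batch of each epoch still counts as a step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Illustrative numbers only: 1,000 examples, batch size 16, 3 epochs
print(total_optimizer_steps(1000, 16, 3))  # 189
```

An estimate like this is handy when choosing logging or checkpointing intervals, since an interval larger than the total step count would never trigger.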

We also need a data collator, which assembles individual examples into batches and pads each batch dynamically to the length of its longest sequence. This is more efficient than padding every example to a global maximum length.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
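To make dynamic padding concrete, here is a simplified pure-Python sketch of the idea. It is not the library's implementation: `pad_batch` is a hypothetical helper and the token ids are made up. The real collator additionally returns tensors and, when given the model, prepares decoder inputs. Labels are padded with `-100`, the collator's default label padding value, which the loss function ignores.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def pad_batch(features, pad_token_id):
    """Pad every sequence to the longest one in this batch only."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Label padding uses IGNORE_INDEX so padded positions do not contribute to the loss
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * (max_labels - len(f["labels"])))
    return batch

# Made-up token ids for two examples of different lengths
features = [
    {"input_ids": [5, 6, 7, 8], "labels": [9, 10]},
    {"input_ids": [5, 6], "labels": [9, 10, 11]},
]
batch = pad_batch(features, pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(batch["labels"])     # [[9, 10, -100], [9, 10, 11]]
```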

With all the components ready, we can instantiate the `Seq2SeqTrainer` and start the training.

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    # We don't add a compute_metrics function here for simplicity,
    # but for a real project, you would add one to calculate BLEU scores.
)

In [None]:
# Start training!
trainer.train()

## 5. Inference with Pipeline

On its own, `trainer.train()` only writes intermediate `checkpoint-*` subfolders, so we first save the final model and tokenizer to the output directory explicitly. We can then load them for inference, and the easiest way to do this is with the `pipeline` API.

In [None]:
# Write the final model and tokenizer to 'output_dir' so they can be reloaded from that path
trainer.save_model(output_dir)
fine_tuned_model_path = f"./{output_dir}/"

# Load the fine-tuned model and tokenizer
translator = pipeline("translation_en_to_fr", model=fine_tuned_model_path)

# Let's test it with a sentence
english_text = "Hugging Face is a company based in New York City."
french_translation = translator(english_text)

print(f"English: {english_text}")
print(f"French: {french_translation[0]['translation_text']}")

In [None]:
# Another example
english_text_2 = "The quick brown fox jumps over the lazy dog."
french_translation_2 = translator(english_text_2)

print(f"English: {english_text_2}")
print(f"French: {french_translation_2[0]['translation_text']}")