# NLP Machine Translation

In this TP (practical lab session) we are going to build a machine translation model that translates sentences from English to French.

OK, it is nothing compared to ChatGPT, but it is a good start for a first Transformer-based machine translation model!

The TP can be divided into multiple parts:

- Loading: Load the data / Load the tokenizer / Load the model

- Preprocessing of the dataset

- Train the model on the translation task

- [OPT] Train an LSTM model on this task

Models that we will use in this task:

https://huggingface.co/t5-small

https://huggingface.co/facebook/bart-large

Start with t5-small. Once all your code is ready and T5 has been trained for translation, you can also train BART and compare the performance of the two models.

In this TP we are going to use several libraries:

- transformers: library providing implementations of the Transformer models (used here with PyTorch)

- datasets: library with many datasets and many functions for working with your data (very useful when you work on NLP tasks)

- evaluate: library that we will use to compute evaluation metrics (here, BLEU through sacrebleu)

In [None]:
!pip install transformers datasets evaluate sacrebleu accelerate

In [None]:
from datasets import load_dataset

In [None]:
books = load_dataset("opus_books", "en-fr")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/127085 [00:00<?, ? examples/s]

Split the dataset into a training set and a test set.

You can use the `train_test_split` method:

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.train_test_split

In [None]:
books = books["train"].train_test_split(test_size=0.2)

Have a look at some samples in the dataset

In [None]:
books["train"][0]

{'id': '54361',
 'translation': {'en': '"Oh, it isn\'t his blood that\'s too thick," said the physician.',
  'fr': '-- Oh! ce n’est pas le sens qui le gêne.'}}

Compute some statistics on the dataset

Build a histogram of the words

-> Check which words are used the most

In [None]:
# FIXME
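
One possible way to fill in this step is sketched below. It is only a minimal sketch: the helper name `word_counts` and the toy sentences are assumptions made for illustration, and the regular expression splits on punctuation, so a French form such as "n'est" is counted as two words.

```python
from collections import Counter
import re

def word_counts(sentences):
    """Count lowercase word occurrences across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # \w+ keeps letters, digits and underscores; punctuation is dropped.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts

# Toy sentences standing in for the English side of the dataset (hypothetical data).
sample = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]
counts = word_counts(sample)

# The most frequent words come first.
print(counts.most_common(3))
```

On the real dataset you could pass a generator such as `(pair["en"] for pair in books["train"]["translation"])` and plot the most common words as a bar chart.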

To see how to preprocess the dataset for translation, you can have a look at this video:

https://www.youtube.com/watch?v=XAR8jnZZuUs

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

Load the tokenizer of the "t5-small" model

You can use `AutoTokenizer` and its `from_pretrained` method.

Also load the model; you can use `AutoModelForSeq2SeqLM`.

In [None]:
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

Tokenize the dataset. The tokenizer can be called like a function:

`tokenizer(inputs, text_target=targets, max_length=128, truncation=True)`

Example:
```
inputs=["translate English to French: Hello I like food"]
targets = ["j'aime la nourriture"]
```

Also prepend the prefix string to each input: `"translate English to French: "`

Write a function that performs these steps.

In [1]:
def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
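
To see what the function receives when it is applied in batched mode, here is a small sketch with a toy batch. The two sentence pairs are made up for illustration; the layout (a dict mapping each column name to a list of values) is what a batched `map` passes to the function.

```python
prefix = "translate English to French: "

# Toy batch in the batched layout: each column name maps to a list of values.
# The sentence pairs are hypothetical examples, not taken from the dataset.
examples = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Hello", "fr": "Bonjour"},
        {"en": "Thank you", "fr": "Merci"},
    ],
}

# Same list comprehensions as in preprocess_function, without the tokenizer call.
inputs = [prefix + pair["en"] for pair in examples["translation"]]
targets = [pair["fr"] for pair in examples["translation"]]

print(inputs)
print(targets)
```

This is why the function iterates over `examples["translation"]`: it is a list with one `{"en": ..., "fr": ...}` dict per example in the batch.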

You can use the `map` method to apply the function to the whole dataset

In [None]:
tokenized_books = books.map(preprocess_function, batched=True)

Map:   0%|          | 0/101668 [00:00<?, ? examples/s]

Map:   0%|          | 0/25417 [00:00<?, ? examples/s]

Compute some metrics on the dataset:

- Min sentence length
- Max sentence length
- Average sentence length

In [None]:
import math

In [None]:
# Token count of every tokenized source sentence (prefix included).
# The column is read once instead of being accessed again for the average.
lengths = [len(input_ids) for input_ids in tokenized_books['train']['input_ids']]

min_len, max_len = min(lengths), max(lengths)
avg_len = sum(lengths) / len(lengths)

In [None]:
min_len, max_len, avg_len

(7, 128, 37.52498327890782)
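
The maximum of 128 equals the `max_length` passed to the tokenizer, so at least one sentence reached the truncation limit. A quick way to check how many sequences are affected is sketched below. The helper name `truncation_rate` and the toy lengths are assumptions for illustration. Note that a sequence of exactly `max_length` tokens may not actually have been cut, so the result is an upper bound on the truncated fraction.

```python
def truncation_rate(lengths, max_length):
    """Return the fraction of sequences whose length reached max_length."""
    if not lengths:
        return 0.0
    at_limit = sum(1 for n in lengths if n >= max_length)
    return at_limit / len(lengths)

# Toy token counts standing in for len(input_ids) of each training example.
toy_lengths = [7, 20, 37, 64, 128, 128, 90, 45]
print(truncation_rate(toy_lengths, max_length=128))  # 0.25
```

If this fraction is large on the real data, a bigger `max_length` may be worth the extra memory.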

Test the tokenizer: write a dummy sentence and apply the tokenizer to it to extract the `input_ids` and the `attention_mask`.

In [None]:
books['train'][0]['translation']

{'en': '"Oh, it isn\'t his blood that\'s too thick," said the physician.',
 'fr': '-- Oh! ce n’est pas le sens qui le gêne.'}

In [None]:
inputs = [prefix + books['train'][0]['translation']['en']]
targets = [books['train'][0]['translation']['fr']]
model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
model_inputs

{'input_ids': [[13959, 1566, 12, 2379, 10, 96, 15046, 6, 34, 19, 29, 31, 17, 112, 1717, 24, 31, 7, 396, 4126, 976, 243, 8, 10027, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[1636, 3359, 55, 197, 3, 29, 22, 222, 330, 90, 3952, 285, 90, 3, 122, 8202, 29, 15, 5, 1]]}

Now use the encoded input (output of the tokenizer) to make a prediction (inference) with the model.

https://huggingface.co/docs/transformers/v4.36.1/en/model_doc/t5#transformers.T5ForConditionalGeneration

In [None]:
import torch

In [None]:
model

In [None]:
input_ids = torch.tensor(model_inputs['input_ids'])
attention_mask = torch.tensor(model_inputs['attention_mask'])
labels = torch.tensor(model_inputs['labels'])

output = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
output.logits.shape

torch.Size([1, 20, 32128])

In [None]:
inputs

['translate English to French: "Oh, it isn\'t his blood that\'s too thick," said the physician.']

In [None]:
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



« Oh, ce n'est pas son sang trop épais, a déclaré le médecin


Build the `DataCollator`

`from transformers import DataCollatorForSeq2Seq`

This will be used during training to assemble the batches: it dynamically pads the inputs and the labels of each batch to a common length.

https://huggingface.co/docs/transformers/main_classes/data_collator


In [None]:
from transformers import DataCollatorForSeq2Seq

# Pass the model object (not its name) so the collator can build decoder_input_ids from the labels
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
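
To get an intuition for what the collator does with a batch, here is a simplified pure-Python sketch of dynamic padding. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the real collator also returns tensors and can prepare `decoder_input_ids`. The values `0` (the T5 pad token id) and `-100` (the default label padding value of the collator) are the only parts taken from the library.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_input - len(f["input_ids"])
        n_lab = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # -100 is the default ignore_index of PyTorch's cross-entropy loss,
        # so padded label positions do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch

# Two toy examples of different lengths (token ids chosen for illustration).
features = [
    {"input_ids": [13959, 1566, 1], "labels": [1636, 1]},
    {"input_ids": [13959, 1566, 12, 2379, 1], "labels": [3359, 55, 197, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])  # [13959, 1566, 1, 0, 0]
print(batch["labels"][0])     # [1636, 1, -100, -100]
```

Padding per batch rather than to a fixed length keeps batches of short sentences small.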

Using the `evaluate` library, load the "sacrebleu" metric.

You can read this article to learn more about metrics for the machine translation task: https://machinetranslate.org/metrics

In this TP we are going to use the BLEU metric, but you can also read about and try other metrics.
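
To make the idea behind BLEU more concrete, here is a simplified sentence-level sketch: clipped n-gram precisions combined by a geometric mean, multiplied by a brevity penalty. This is an illustration only and will not reproduce sacrebleu's numbers, since it assumes whitespace tokenization, a single reference, and no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_n=4):
    """Simplified BLEU on a 0-100 scale for one prediction and one reference."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_ngrams, ref_ngrams = ngrams(pred, n), ngrams(ref, n)
        total = sum(pred_ngrams.values())
        # Clip each n-gram count by its count in the reference.
        matched = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: one empty precision gives a score of 0
        log_precisions.append(math.log(matched / total))
    # The brevity penalty punishes predictions shorter than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("le chat est sur le tapis", "le chat est sur le tapis"))  # 100.0
print(simple_bleu("le chat est sur", "le chat est sur le tapis"))
```

The second call has perfect n-gram precision but is shorter than the reference, so only the brevity penalty lowers its score.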

In [None]:
import evaluate

metric = evaluate.load("sacrebleu")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

In [None]:
import numpy as np

# These helper functions are provided to help you
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

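    # -100 marks ignored label positions; replace it with the pad token id so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)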
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

Build the training arguments object

https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments

You can experiment with the learning rate. It is recommended to train the model in float16 (`fp16=True`).

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="tmp",
    evaluation_strategy="steps",
    eval_steps=1000,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
)

Build the trainer

https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.Seq2SeqTrainer

You need to pass:

- your model
- the training_args
- train dataset
- eval dataset
- tokenizer
- data collator
- `compute_metrics` function

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1000 | 1.9111 | 1.702239 | 4.929715 |
| 2000 | 1.9061 | 1.670622 | 5.143051 |
| 3000 | 1.8732 | 1.650075 | 5.312324 |
| 4000 | 1.8669 | 1.637578 | 5.404644 |
| 5000 | 1.8434 | 1.625449 | 5.461418 |
| 6000 | 1.8392 | 1.615809 | 5.556101 |
| 7000 | 1.8186 | 1.610614 | 5.609566 |
| 8000 | 1.8292 | 1.603364 | 5.633493 |
| 9000 | 1.8114 | 1.599651 | 5.658308 |
| 10000 | 1.8086 | 1.59613 | 5.681675 |




TrainOutput(global_step=12710, training_loss=1.8471555216119198, metrics={'train_runtime': 7482.2135, 'train_samples_per_second': 27.176, 'train_steps_per_second': 1.699, 'total_flos': 4992724163493888.0, 'train_loss': 1.8471555216119198, 'epoch': 2.0})
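
The `global_step` of 12710 reported above can be checked with a little arithmetic from the training set size (101668 examples), the batch size (16) and the number of epochs (2). The sketch below assumes a single device and no gradient accumulation. `total_training_steps` is a hypothetical helper, not a library function.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

print(total_training_steps(101668, 16, 2))  # 12710
```

The same calculation helps to choose `eval_steps`: with about 6355 steps per epoch, evaluating every 1000 steps gives roughly six evaluations per epoch.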

In [None]:
tokenized_books["test"]

Dataset({
    features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 25417
})

In [None]:
device = torch.device("cuda")

In [None]:
inputs = ["translate English to French: I am 50"]

model_inputs = tokenizer(inputs, max_length=128, truncation=True, return_tensors="pt")

outputs = model.generate(model_inputs["input_ids"].to(device))
print(outputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

tensor([[   0, 1022, 3448,  944,    1]], device='cuda:0')
Je suis 25
