# Language Translation with Transformers

![](https://i.imgur.com/7SXKckD.png)

Transfer learning is the practice of taking already trained models and tuning or adapting them to our own downstream tasks.

# Language Translation

Let’s now dive into translation. This is another [sequence-to-sequence task](https://huggingface.co/course/chapter1/7), which means it’s a problem that can be formulated as going from one sequence to another. In that sense the problem is pretty close to [summarization](https://huggingface.co/course/chapter7/6), and you could adapt what we will see here to other sequence-to-sequence problems.



# Language Translation by Fine-tuning Transformers

If you have a big enough corpus of texts in two (or more) languages, you can train a new translation model from scratch. 

It will be faster, however, to fine-tune an existing translation model, be it a multilingual one like mT5 or mBART that you want to fine-tune to a specific language pair, or even a model specialized for translation from one language to another that you want to fine-tune to your specific corpus.

In this notebook, we will fine-tune a Marian model pretrained to translate from English to French on the [KDE4 dataset](https://huggingface.co/datasets/kde4), which is a dataset of localized files for the [KDE apps](https://apps.kde.org/).

![](https://i.imgur.com/eG0JLVk.png)


The model we will use has been pretrained on a large corpus of French and English texts taken from the [Opus dataset](https://opus.nlpl.eu/). The idea is to see whether fine-tuning gives us a better version of it.

Once we’re finished, we will have a model able to make predictions like this one:

![](https://i.imgur.com/ZB1G8dh.png)

## Install Relevant Libraries



In [1]:
!pip install datasets


In [2]:
!pip install transformers



In [3]:
!pip install sentencepiece



You will be leveraging 🤗 Transformers and 🤗 Datasets, as well as a few other dependencies.

## Load Dataset

To fine-tune or train a translation model from scratch, we will need a dataset suitable for the task. As mentioned previously, we’ll use the [KDE4 dataset](https://huggingface.co/datasets/kde4) in this section, but you can adapt the code to use your own data quite easily, as long as you have pairs of sentences in the two languages you want to translate from and into.

### The KDE4 Dataset

We download our dataset using the `load_dataset()` function:



In [1]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")


In [2]:
raw_datasets.keys()

dict_keys(['train'])

Let’s have a look at the dataset:

In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

We have 210,173 pairs of sentences, but in one single split, so we will need to create our own validation set. 

A Dataset has a `train_test_split()` method that can help us. We’ll provide a seed for reproducibility:

In [4]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})
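The split sizes printed above follow from `train_size=0.9`. As a rough check, and assuming the training size is rounded down with the remainder going to the test split (an assumption that matches the counts shown here, not a statement about the library's internals), the arithmetic can be sketched as:

```python
import math

num_rows = 210_173      # rows in the original "train" split (from the output above)
train_fraction = 0.9    # the train_size passed to train_test_split()

# Assumption: the train size is floored and the test split takes the remainder.
num_train = math.floor(num_rows * train_fraction)
num_test = num_rows - num_train

print(num_train, num_test)  # 189155 21018
```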

We can rename the "test" key to "validation" like this:

In [5]:
split_datasets["validation"] = split_datasets.pop("test")

Now let’s take a look at one element of the dataset:



In [6]:
split_datasets["train"][1]

{'id': '152754',
 'translation': {'en': 'Default to expanded threads',
  'fr': 'Par défaut, développer les fils de discussion'}}

We get a dictionary with two sentences in the pair of languages we requested. 

One particularity of this dataset is that its technical computer science terms are all fully translated into French.

However, French engineers are often lazy and leave most computer science-specific words in English when they talk. Here, for instance, the word “threads” might well appear in a French sentence, especially in a technical conversation.

But in this dataset it has been translated into the more correct “fils de discussion.” 

The pretrained model we use, which has been pretrained on a larger corpus of French and English sentences, takes the easier option of leaving the word as is:

In [7]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
translator = pipeline("translation", model=model_checkpoint)




In [8]:
translator("Default to expanded threads")[0]

{'translation_text': 'Par défaut pour les threads élargis'}

Another example of this behavior can be seen with the word “plugin,” which isn’t officially a French word but which most native speakers will understand and not bother to translate. 

In the KDE4 dataset this word has been translated into French as the more official “module d’extension”:

In [9]:
split_datasets["train"][172]["translation"]

{'en': 'Unable to import %1 using the OFX importer plugin. This file is not the correct format.',
 'fr': "Impossible d'importer %1 en utilisant le module d'extension d'importation OFX. Ce fichier n'a pas un format correct."}

Our pretrained model, however, sticks with the compact and familiar English word:



In [10]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)[0]

{'translation_text': "Impossible d'importer %1 en utilisant le plugin d'importateur OFX. Ce fichier n'est pas le bon format."}

It will be interesting to see if our fine-tuned model picks up on those particularities of the dataset (spoiler alert: it will).



## Preprocessing the data

As usual, the texts all need to be converted into sets of token IDs so the model can make sense of them. 

For this task, we’ll need to tokenize both the inputs and the targets. Our first task is to create our tokenizer object. 

As noted earlier, we’ll be using a Marian English to French pretrained model. 

If you are trying this code with another pair of languages, make sure to adapt the model checkpoint.

The [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) organization provides more than a thousand models in multiple languages.

The modeling code is the same as [`BartForConditionalGeneration`](https://huggingface.co/docs/transformers/v4.19.4/en/model_doc/bart#transformers.BartForConditionalGeneration), with a few minor modifications.


In [11]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



You can also replace the `model_checkpoint` with any other model you prefer from the [Hub](https://huggingface.co/models), or a local folder where you’ve saved a pretrained model and a tokenizer.

If you are using a multilingual tokenizer such as mBART, mBART-50, or M2M100, you will need to set the language codes of your inputs and targets in the tokenizer by setting `tokenizer.src_lang` and `tokenizer.tgt_lang` to the right values.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [12]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [10537, 2, 67, 32, 15, 5776, 145, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

The preparation of our data is pretty straightforward. There’s just one thing to remember: you process the inputs as usual, but for the targets, you need to wrap the tokenizer inside the context manager `as_target_tokenizer()`.

In the case at hand, the context manager `as_target_tokenizer()` switches the tokenizer to the output language (here, French) before the indented block is executed, then switches it back to the input language (here, English).

So, preprocessing one sample looks like this:

In [13]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

In [14]:
inputs = tokenizer(en_sentence)
with tokenizer.as_target_tokenizer():
    targets = tokenizer(fr_sentence)

In [15]:
inputs, targets

({'input_ids': [47591, 12, 9842, 19634, 9, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]},
 {'input_ids': [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

In [16]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'])

['▁Default', '▁to', '▁expanded', '▁thread', 's', '</s>']

In [17]:
tokenizer.convert_ids_to_tokens(targets['input_ids'])

['▁Par',
 '▁défaut',
 ',',
 '▁développer',
 '▁les',
 '▁fils',
 '▁de',
 '▁discussion',
 '</s>']

In [18]:
en_sentence, fr_sentence

('Default to expanded threads',
 'Par défaut, développer les fils de discussion')

If we forget to tokenize the targets inside the context manager, they will be tokenized by the input tokenizer, which in the case of a Marian model is not going to go well at all.

Both inputs and targets are dictionaries with our usual keys (input IDs, attention mask, etc.), so the last step is to set a "labels" key inside the inputs. We do this in the preprocessing function we will apply on the datasets:

In [19]:
max_input_length = 128
max_target_length = 128

In [20]:
def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Note that we set the same maximum length for our inputs and outputs. Since the texts we’re dealing with seem pretty short, we use 128.


If you are using a T5 model (more specifically, one of the `t5-xxx` checkpoints), the model will expect the text inputs to have a prefix indicating the task at hand, such as `translate English to French:`.
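For a T5 checkpoint, the prefix would simply be prepended to every source sentence before tokenization. Here is a minimal sketch; `add_prefix` is a hypothetical helper written for illustration, not part of 🤗 Transformers, and the prefix string follows the form used in the T5 paper's examples:

```python
# Task prefix in the form used by the original T5 checkpoints (note the trailing space).
prefix = "translate English to French: "


def add_prefix(examples):
    """Return the English source sentences with the task prefix prepended."""
    return [prefix + ex["en"] for ex in examples["translation"]]


# A one-example batch shaped like the batches passed to preprocess_function.
batch = {
    "translation": [
        {
            "en": "Default to expanded threads",
            "fr": "Par défaut, développer les fils de discussion",
        }
    ]
}
print(add_prefix(batch))
# ['translate English to French: Default to expanded threads']
```

The Marian checkpoint used in this notebook needs no such prefix, so `preprocess_function` can stay as it is.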


We don’t pay attention to the attention mask of the targets, as the model won’t expect it. Instead, the labels corresponding to a padding token should be set to -100 so they are ignored in the loss computation. This will be done by our data collator later on since we are applying dynamic padding, but if you use padding here, you should adapt the preprocessing function to set all labels that correspond to the padding token to -100.
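If you did pad inside the preprocessing function, the label masking described above could look roughly like the sketch below. `mask_padding` is a hypothetical helper, and the pad token ID is hard-coded here only to keep the example self-contained (in the notebook you would use `tokenizer.pad_token_id`):

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding token IDs in each label sequence with ignore_index."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Pad token ID of this Marian checkpoint (it shows up in the collated batches later on).
pad_id = 59513

# Labels of one example, padded with two pad tokens for illustration.
padded_labels = [[577, 5891, 2, 3184, 16, 2542, 5, 1710, 0, pad_id, pad_id]]
print(mask_padding(padded_labels, pad_id))
# [[577, 5891, 2, 3184, 16, 2542, 5, 1710, 0, -100, -100]]
```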



We can now apply that preprocessing in one go on all the splits of our dataset:

In [21]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)




Now that the data has been preprocessed, we are ready to fine-tune our pretrained model!

## Fine-tuning the Transformer Model 

Now that our data is ready, we can download the pretrained model and fine-tune it. 

Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. 

Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [23]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that this time we are using a model that was trained on a translation task and can actually be used already, so there is no warning about missing weights or newly initialized ones.

## Data Collation

We’ll need a data collator to deal with the padding for dynamic batching. We can’t just use a `DataCollatorWithPadding` like in classification, because it only pads the inputs (input IDs and attention mask), while here the labels are sequences that need padding too.

Our labels should also be padded to the maximum length encountered in the labels. And, as mentioned previously, the padding value used to pad the labels should be -100 and not the padding token of the tokenizer, to make sure those padded values are ignored in the loss computation.

This is all done by a [`DataCollatorForSeq2Seq`](https://huggingface.co/transformers/main_classes/data_collator.html#datacollatorforseq2seq). Like the `DataCollatorWithPadding`, it takes the tokenizer used to preprocess the inputs, but it also takes the model. 

This is because this data collator will also be responsible for preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning. 

Since this shift is done slightly differently for different architectures, the `DataCollatorForSeq2Seq` needs to know the model object:

In [24]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

To test this on a few samples, we just call it on a list of examples from our tokenized training set:

In [25]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [27]:
batch['input_ids']

tensor([[47591,    12,  9842, 19634,     9,     0, 59513, 59513, 59513, 59513,
         59513, 59513, 59513, 59513, 59513],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
         28149,   139, 33712, 25218,     0]])

In [28]:
tokenizer.convert_ids_to_tokens(batch['input_ids'][0])

['▁Default',
 '▁to',
 '▁expanded',
 '▁thread',
 's',
 '</s>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>']

We can check our labels have been padded to the maximum length of the batch, using -100:

In [26]:
batch["labels"]

tensor([[  577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,  -100,
          -100,  -100,  -100,  -100,  -100,  -100],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
           550,  7032,  5821,  7907, 12649,     0]])

In [30]:
tokenizer.convert_ids_to_tokens(batch['labels'][0])

['▁Par',
 '▁défaut',
 ',',
 '▁développer',
 '▁les',
 '▁fils',
 '▁de',
 '▁discussion',
 '</s>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>']
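The label padding seen above can be reproduced by hand. The following is only a sketch of the idea with a hypothetical `pad_to_longest` helper, not the collator's actual implementation. Note that the `<unk>` entries in the decoded output correspond to the -100 positions, which are not vocabulary IDs.

```python
def pad_to_longest(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Unpadded labels of the two examples in the batch.
labels = [
    [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0],
    [1211, 3, 49, 9409, 1211, 3, 29140, 817, 3124, 817, 550, 7032, 5821, 7907, 12649, 0],
]

padded = pad_to_longest(labels, pad_value=-100)
print([len(seq) for seq in padded])  # [16, 16]
print(padded[0][9:])                 # [-100, -100, -100, -100, -100, -100, -100]
```

The same helper with `pad_value` set to the tokenizer's pad token ID would mirror what happens to `input_ids`.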

And we can also have a look at the decoder input IDs, to see that they are shifted versions of the labels:

In [31]:
batch["decoder_input_ids"]

tensor([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,
         59513, 59513, 59513, 59513, 59513, 59513],
        [59513,  1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,
           817,   550,  7032,  5821,  7907, 12649]])

Here are the labels for the elements at indices 1 and 2 of our training set, which make up the batch above:



In [32]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

[577, 5891, 2, 3184, 16, 2542, 5, 1710, 0]
[1211, 3, 49, 9409, 1211, 3, 29140, 817, 3124, 817, 550, 7032, 5821, 7907, 12649, 0]
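To make the relationship between `labels` and `decoder_input_ids` concrete, here is a small sketch of the shift in plain Python. It assumes the Marian behaviour visible in the outputs above (start token and pad token both equal to 59513); `shift_right` is a hypothetical helper, not the library function.

```python
def shift_right(labels, decoder_start_token_id, pad_token_id):
    """Prepend the start token, drop the last label, and map -100 back to the pad ID."""
    shifted = []
    for seq in labels:
        row = [decoder_start_token_id] + seq[:-1]
        shifted.append([pad_token_id if tok == -100 else tok for tok in row])
    return shifted


# First row of batch["labels"]: 9 real tokens followed by 7 ignored positions.
labels = [[577, 5891, 2, 3184, 16, 2542, 5, 1710, 0] + [-100] * 7]

decoder_input_ids = shift_right(labels, decoder_start_token_id=59513, pad_token_id=59513)
print(decoder_input_ids[0])
# [59513, 577, 5891, 2, 3184, 16, 2542, 5, 1710, 0, 59513, 59513, 59513, 59513, 59513, 59513]
```

The printed row matches the first row of `batch["decoder_input_ids"]` shown earlier.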


We will pass this `data_collator` along to the `Seq2SeqTrainer`. Next, let’s have a look at the metric.

The decoder performs inference by predicting tokens one by one — something that’s implemented behind the scenes in 🤗 Transformers by the `generate()` method. The `Seq2SeqTrainer` will let us use that method for evaluation if we set `predict_with_generate=True`.

## Choosing the Evaluation Metric

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the function `load_metric`.  

The traditional metric used for translation is the [BLEU score](https://en.wikipedia.org/wiki/BLEU), introduced in a [2002 article](https://aclanthology.org/P02-1040.pdf) by Kishore Papineni et al. The BLEU score evaluates how close the translations are to their labels. It does not measure the intelligibility or grammatical correctness of the model’s generated outputs, but uses statistical rules to ensure that all the words in the generated outputs also appear in the targets. 

In addition, there are rules that penalize repetitions of the same words if they are not also repeated in the targets (to avoid the model outputting sentences like "the the the the the") and output sentences that are shorter than those in the targets (to avoid the model outputting sentences like "the").

One weakness with BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So instead, the most commonly used metric for benchmarking translation models today is [SacreBLEU](https://github.com/mjpost/sacrebleu), which addresses this weakness (and others) by standardizing the tokenization step. To use this metric, we first need to install the SacreBLEU library:

In [33]:
!pip install sacrebleu



We can then load it via `load_metric()`:

In [34]:
metric = load_metric("sacrebleu")
metric


Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
        - `'none'`: no smoothing
        - `'floor'`: increment zero counts
        - `'add-k'`: increment num/deno

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric). You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings.

This metric takes texts as inputs and targets. It is designed to accept several acceptable targets, as there are often multiple acceptable translations of the same sentence. The dataset we’re using only provides one, but it’s not uncommon in NLP to find datasets that give several sentences as labels.

So, the predictions should be a list of sentences, but the references should be a list of lists of sentences.

Let’s try an example:

In [35]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]

references = [
    ["This plugin allows you to automatically translate web pages between several languages."]
]

metric.compute(predictions=predictions, references=references)

{'bp': 0.9200444146293233,
 'counts': [11, 6, 4, 3],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'ref_len': 13,
 'score': 46.750469682990165,
 'sys_len': 12,
 'totals': [12, 11, 10, 9]}

This gets a BLEU score of 46.75, which is rather good — for reference, the original Transformer model in the [“Attention Is All You Need”](https://arxiv.org/pdf/1706.03762.pdf) paper achieved a BLEU score of 41.8 on a similar translation task between English and French! 

(For more information about the individual metrics, like counts and bp, see the [SacreBLEU repository](https://github.com/mjpost/sacrebleu/blob/078c440168c6adc89ba75fe6d63f0d922d42bcfe/sacrebleu/metrics/bleu.py#L74).)

On the other hand, if we try with the two bad types of predictions (lots of repetitions or too short) that often come out of translation models, we will get rather bad BLEU scores:

In [36]:
predictions = ["This This This This"]

references = [
    ["This plugin allows you to automatically translate web pages between several languages."]
]

metric.compute(predictions=predictions, references=references)

{'bp': 0.10539922456186433,
 'counts': [1, 0, 0, 0],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'ref_len': 13,
 'score': 1.683602693167689,
 'sys_len': 4,
 'totals': [4, 3, 2, 1]}

In [37]:
predictions = ["This plugin"]

references = [
    ["This plugin allows you to automatically translate web pages between several languages."]
]

metric.compute(predictions=predictions, references=references)

{'bp': 0.004086771438464067,
 'counts': [2, 1, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'ref_len': 13,
 'score': 0.0,
 'sys_len': 2,
 'totals': [2, 1, 0, 0]}

The score can go from 0 to 100, and higher is better.
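Two of the ingredients behind these numbers can be checked by hand: the brevity penalty (`bp`) and the clipped word counts that penalize repetitions. The sketch below is a simplified illustration, not SacreBLEU's implementation. It splits on whitespace, whereas SacreBLEU applies its own tokenization (which is why the reference length above is 13 rather than 12), so the lengths are taken directly from the outputs above.

```python
import math
from collections import Counter


def brevity_penalty(sys_len, ref_len):
    """BLEU brevity penalty: 1 when the output is at least as long as the reference."""
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / sys_len)


def clipped_unigram_precision(prediction, reference):
    """Count each predicted word at most as often as it occurs in the reference."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    matched = sum(min(count, ref_counts[word]) for word, count in pred_counts.items())
    return matched / sum(pred_counts.values())


reference = "This plugin allows you to automatically translate web pages between several languages."

# Repetition case: only one of the four "This" tokens is counted as a match.
print(clipped_unigram_precision("This This This This", reference))  # 0.25

# Brevity penalties for the three examples (sys_len and ref_len from the outputs above).
for sys_len in (12, 4, 2):
    print(sys_len, brevity_penalty(sys_len, ref_len=13))
```

The three penalties agree with the `bp` values reported by the metric, and the 0.25 corresponds to the first entry of `precisions` (25.0) in the repetition example.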

To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the -100s in the labels (the tokenizer will automatically do the same for the padding token):

In [38]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

To instantiate a `Seq2SeqTrainer`, we still need to define the training arguments.

These are held in [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), a class that contains all the attributes to customize the training.

It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [39]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "opus-mt-finetuned-kde4-en-to-fr",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False
)

Apart from the usual hyperparameters (learning rate, number of epochs, batch size, and weight decay), a few settings are worth noting:

- We don’t set any regular evaluation, as evaluation takes a while; we just evaluate our model once before training and once after.
- We set `fp16=True` to activate mixed precision training, which speeds up training on modern GPUs.
- We set `predict_with_generate=True`, as discussed above, so that evaluation uses `generate()` to produce the translations.

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [40]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Using amp half precision backend


Before training, we’ll first look at the score our model gets, to double-check that we’re not making things worse with our fine-tuning. This command will take a bit of time:

In [41]:
trainer.evaluate(max_length=max_target_length)

***** Running Evaluation *****
  Num examples = 21018
  Batch size = 64


{'eval_bleu': 39.27124165416069,
 'eval_loss': 1.6964424848556519,
 'eval_runtime': 937.8507,
 'eval_samples_per_second': 22.411,
 'eval_steps_per_second': 0.351}

A BLEU score of 39 is not too bad, which reflects the fact that our model is already good at translating English sentences to French ones.

We can now fine-tune our model by just calling the `train` method:

In [42]:
trainer.train()

***** Running training *****
  Num examples = 189155
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 17736


| Step | Training Loss |
|------|---------------|
| 500  | 1.4163 |
| 1000 | 1.2196 |
| 1500 | 1.1699 |
| 2000 | 1.131  |
| 2500 | 1.1177 |
| 3000 | 1.0659 |
| 3500 | 1.0667 |
| 4000 | 1.0292 |
| 4500 | 1.0246 |
| 5000 | 1.0255 |


Saving model checkpoint to opus-mt-finetuned-kde4-en-to-fr/checkpoint-5912
Configuration saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-5912/config.json
Model weights saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-5912/pytorch_model.bin
tokenizer config file saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-5912/tokenizer_config.json
Special tokens file saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-5912/special_tokens_map.json
Saving model checkpoint to opus-mt-finetuned-kde4-en-to-fr/checkpoint-11824
Configuration saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-11824/config.json
Model weights saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-11824/pytorch_model.bin
tokenizer config file saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-11824/tokenizer_config.json
Special tokens file saved in opus-mt-finetuned-kde4-en-to-fr/checkpoint-11824/special_tokens_map.json
Saving model checkpoint to opus-mt-finetuned-kde4-en-to-fr/checkpoint-17736
Configuration saved i

TrainOutput(global_step=17736, training_loss=0.9371663198096827, metrics={'train_runtime': 4745.0217, 'train_samples_per_second': 119.592, 'train_steps_per_second': 3.738, 'total_flos': 1.1322351026307072e+16, 'train_loss': 0.9371663198096827, 'epoch': 3.0})
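The step counts in the log can be cross-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, as the log above reports:

```python
import math

num_examples = 189_155  # training examples, from the log above
batch_size = 32         # per_device_train_batch_size on one device
num_epochs = 3

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 5912 17736
```

This is why the checkpoints saved at the end of each epoch are named `checkpoint-5912`, `checkpoint-11824`, and `checkpoint-17736`.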

Once training is done, we evaluate our model again — hopefully we will see some improvement in the BLEU score!

In [43]:
trainer.evaluate(max_length=max_target_length)

***** Running Evaluation *****
  Num examples = 21018
  Batch size = 64


{'epoch': 3.0,
 'eval_bleu': 52.932594546181996,
 'eval_loss': 0.8559045791625977,
 'eval_runtime': 990.5987,
 'eval_samples_per_second': 21.217,
 'eval_steps_per_second': 0.332}

That’s a nearly 14-point improvement, which is great!

# Using your fine-tuned model for Translation

Once you’ve fine-tuned the model, you can use it for inference with a pipeline object as follows:

In [44]:
from transformers import pipeline

In [45]:
translate = pipeline(task='translation', model=model, tokenizer=tokenizer, device=0)
default_translate = pipeline(task='translation', model='Helsinki-NLP/opus-mt-en-fr', device=0)

loading configuration file https://huggingface.co/Helsinki-NLP/opus-mt-en-fr/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/5ad88432037ab18b1eb95761258d2b1b3a32e1e401d5f610f86eb3f479e59e8c.2b4f07b3f8de3922d42e6312c55d0597e44d2273507e7c5d0b6daf75fb2cc673
Model config MarianConfig {
  "_name_or_path": "Helsinki-NLP/opus-mt-en-fr",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      59513
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 59513,
  "decoder_vocab_size": 59514,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0

In [46]:
text = "Default to expanded threads"

In [47]:
translate(text)[0]

{'translation_text': 'Par défaut, développer les fils de discussion'}

In [48]:
default_translate(text)[0]

{'translation_text': 'Par défaut pour les threads élargis'}

In [49]:
text = "Unable to import %1 using the OFX importer plugin. This file is not the correct format."

In [50]:
translate(text)[0]

{'translation_text': "Impossible d'importer %1 en utilisant le module externe d'importation OFX. Ce fichier n'est pas le bon format."}

In [51]:
default_translate(text)[0]

{'translation_text': "Impossible d'importer %1 en utilisant le plugin d'importateur OFX. Ce fichier n'est pas le bon format."}

We can feed some examples from the validation set (which the model was not trained on) to our pipeline to get a feel for the quality of the translations.

In [55]:
for item in split_datasets['validation'].select(range(10)):
  en = item['translation']['en']
  fr = item['translation']['fr']
  print('English Sentence:', en)
  print('Actual French Translation:', fr)

  tr = translate(en, truncation=True)[0]['translation_text']
  print('Fine-tuned Transformer French Translation:', tr)
  df_tr = default_translate(en, truncation=True)[0]['translation_text']
  print('Pre-trained Transformer French Translation:', df_tr)
  print('\n')

English Sentence: User and Group Permissions
Actual French Translation: Droits d'accès de l'utilisateur et du groupe
Fine-tuned Transformer French Translation: Droits d'accès de l'utilisateur et du groupe
Pre-trained Transformer French Translation: Autorisations de l'utilisateur et du groupe


English Sentence: Customize Formatting
Actual French Translation: Personnaliser le formatage
Fine-tuned Transformer French Translation: Personnaliser le formatage
Pre-trained Transformer French Translation: Personnaliser le formatage


English Sentence: This filter will apply a grayish look to the icon. Click Setup... to configure the intensity of this filter. Note that it is customary for most user interfaces to use this effect for disabled icons only.
Actual French Translation: Ce filtre appliquera un ton gris à l'icône. Cliquez Configurer... pour configurer l'intensité de ce filtre. Remarquez qu'il est courant pour la plupart des interfaces utilisateurs d'utiliser cet effet pour désactiver seulement les icônes.
Fine-tuned Transformer French Translation: Ce filtre appliquera un aspect grisâtre à l'icône. Cliquez sur Configuration... pour configurer l'intensité de ce filtre. Notez qu'il est habituel pour la plupart des interfaces utilisateur d'utiliser cet effet uniquement pour les icônes désactivées.
Pre-trained Transformer French Translation: Ce filtre appliquera un look gr