# **Translation**
If you have a big enough corpus of texts in two (or more) languages, you can **train a new translation model from scratch**. It will be faster, however, to **fine-tune an existing translation model**, be it a multilingual one like mT5 or mBART that you want to fine-tune to a specific language pair, or even a model specialized for translation from one language to another that you want to fine-tune to your specific corpus.

In [1]:
!pip install transformers[sentencepiece] datasets evaluate accelerate


In this section, we will **fine-tune a *Marian model* pretrained to translate from English to Indonesian** on the `KDE4 dataset`, which is *a dataset of localized files for the KDE apps*.

The model we will use has been pretrained on a large corpus of Indonesian and English texts taken from the **Opus dataset**, which actually contains the KDE4 dataset. But even if the pretrained model we use has seen that data during its pretraining, we will see that we can **get a better version of it after fine-tuning**.

## **Preparing the Data**
To fine-tune or train a translation model from scratch, we will **need a dataset suitable for the task**. As mentioned previously, we’ll use the `KDE4 dataset`, but you can **adapt the code to use your own data quite easily, as long as you have pairs of sentences in the two languages you want to translate from and into**.

### **The KDE4 dataset**

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="id")


In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 14782
    })
})

We have 14,782 pairs of sentences, but in one single split, so we will need to create our own **validation set**. A `Dataset` has a `train_test_split()` method that can help us. We’ll provide a seed for reproducibility:

In [4]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 13303
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 1479
    })
})

Next, we rename the `"test"` key to `"validation"`:

In [5]:
split_datasets["validation"] = split_datasets.pop("test")
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 13303
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 1479
    })
})

Let's take a look at one element of the dataset

In [13]:
split_datasets["train"][100]['translation']

{'en': 'Delete window', 'id': 'Hapus jendelaName'}

We get a dictionary with **two sentences in the pair of languages we requested**. One particularity of this dataset full of technical computer science terms is that they are all fully translated into Indonesian. The pretrained model we use, which has been pretrained on a larger corpus of Indonesian and English sentences, often takes the easier option of **leaving the word as is**:

In [14]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-id"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

[{'translation_text': 'Baku ke thread yang diperluas'}]

When translating into Indonesian, the pretrained model leaves the word `'thread'` untranslated. This mirrors the English-to-French example in the course, where the same word was also left in English.

Another example of this behavior can be seen with the word “window,” which isn’t officially an Indonesian word but which most native speakers will **understand and not bother to translate**. In the KDE4 dataset this word has been translated into the more official Indonesian term “jendela”:

In [15]:
split_datasets["train"][102]["translation"]

{'en': 'Close the current window.', 'id': 'Tutup jendela saat ini.'}

Our pretrained model, however, tends to **stick with the compact and familiar English word**, as it does with “plugin” in the following sentence:

In [16]:
translator("Unable to import %1 using the OFX importer plugin. This file is not the correct format.")

[{'translation_text': 'Tak bisa mengimpor% 1 memakai plugin importir OFX. Berkas ini bukan format yang benar.'}]

In [17]:
split_datasets["train"][20]['translation']['en']

'Klipper version'

In [19]:
split_datasets['train'][100]

{'id': '2123',
 'translation': {'en': 'Delete window', 'id': 'Hapus jendelaName'}}

In [22]:
for example in split_datasets['train']:
    if 'mouse' in example['translation']['en'].lower():
        print("Ids: ", example['id'])
        print("English:", example['translation']['en'])
        print("Indonesia:", example['translation']['id'])
        break

Ids:  4053
English: The threshold is the smallest distance that the mouse pointer must move on the screen before acceleration has any effect. If the movement is smaller than the threshold, the mouse pointer moves as if the acceleration was set to 1X; thus, when you make small movements with the physical device, there is no acceleration at all, giving you a greater degree of control over the mouse pointer. With larger movements of the physical device, you can move the mouse pointer rapidly to different areas on the screen.
Indonesia: Batas adalah jarak terkecil yang penunjuk tetikus harus pindah di layar sebelum akselerasi mempunyai efek. Jika pergerakan lebih kecil dari batas, penunjuk tetikus bergerak seperti akselerasi diatur ke 1X; sehingga, ketika anda membuat pergerakan kecil dengan divais fisik, tidak ada akselerasi sama sekali, memberikan anda kendali yang lebih besar terhadap penunjuk tetikus. Dengan pergerakan yang lebih besar dari divais fisik, anda dapat memindahkan penunjuk

### **Processing the Data**
The texts all need to be **converted into sets of token IDs** so the model can make sense of them. For this task, we’ll need to **tokenize both the inputs and the targets**. Our first task is to create our *tokenizer object*. As noted earlier, we’ll be using a **Marian English-to-Indonesian pretrained model**. If you are trying this code with another pair of languages, make sure to adapt the model checkpoint.

In [23]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-id"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")

We need to **ensure that the tokenizer processes the targets in the output language** (here, Indonesian). You can do this by **passing the targets to the `text_target` argument of the tokenizer’s `__call__` method**:

In [28]:
en_sentence = split_datasets["train"][8]["translation"]["en"]
id_sentence = split_datasets["train"][8]["translation"]["id"]

inputs = tokenizer(en_sentence, text_target=id_sentence)
inputs

{'input_ids': [1477, 42871, 31856, 0], 'attention_mask': [1, 1, 1, 1], 'labels': [8201, 8712, 17151, 0]}

**The output contains the input `IDs` associated with the English sentence**, while the `IDs` associated with the Indonesian one are stored in the `labels` field. If you forget to indicate that you are tokenizing labels, they will be tokenized by the input tokenizer, which in the case of a Marian model is not going to go well at all:

In [29]:
wrong_targets = tokenizer(id_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))

['▁Re', 'nice', '▁Pro', 's', 'es', '</s>']
['▁Ren', 'ice', '▁Proses', '</s>']


As we can see, using the English tokenizer to preprocess an Indonesian sentence **results in more tokens (six instead of four here), since that tokenizer doesn’t know Indonesian words** (except those that also appear in English).

Since `inputs` is a **dictionary with our usual keys** (input IDs, attention mask, etc.), the last step is to **define the `preprocessing` function we will apply on the datasets**.

In [30]:
max_length = 128

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["id"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )

    return model_inputs

Note that we set the same maximum length for our inputs and outputs. Since the texts we’re dealing with seem pretty short, we use 128.

We can now apply that **preprocessing** in one go on all the splits of our dataset:

In [31]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

In [32]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 13303
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1479
    })
})

Now that the data has been preprocessed, we are ready to fine-tune our pretrained model!

## **Fine-Tuning the Model with the Trainer API**
The actual code using the `Trainer` will be the same as before, with just one little change: we use a `Seq2SeqTrainer` here, which is a subclass of `Trainer` that will allow us to properly deal with the evaluation, using the `generate()` method to predict outputs from the inputs

First things first, we need an **actual model to fine-tune**. We’ll use the usual `AutoModel` API

In [33]:
from transformers import AutoModelForSeq2SeqLM, TFAutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
# model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)

### **Data Collation**
We’ll need a data collator to deal with the padding for dynamic batching. We can’t just use a `DataCollatorWithPadding` like in Chapter 3 in this case, because that only pads the inputs (input IDs, attention mask, and token type IDs).

Our labels should also be **padded to the maximum length encountered in the labels**. And, as mentioned previously, the padding value used to pad the labels should be `-100` and not the padding token of the tokenizer, to make sure **those padded values are ignored in the loss computation**.
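To make the role of `-100` concrete, here is a minimal pure-Python sketch of a cross-entropy that skips ignored positions. The function name `masked_cross_entropy` and the toy logits are illustrative assumptions, not part of the notebook; PyTorch's built-in cross-entropy loss applies the same idea through its `ignore_index` argument, whose default value is `-100`.

```python
import math

IGNORE_INDEX = -100  # label value that marks padded positions


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target indices, with ignore_index at padded positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log of the softmax normalizer
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy 3-word vocabulary (hypothetical values); the last two positions are padding
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [9.0, 9.0, 9.0], [0.0, 5.0, 0.0]]
labels = [0, 1, IGNORE_INDEX, IGNORE_INDEX]

loss_padded = masked_cross_entropy(logits, labels)
loss_unpadded = masked_cross_entropy(logits[:2], labels[:2])
print(loss_padded, loss_unpadded)
```

Both calls return the same value: whatever the model predicts at padded positions has no effect on the loss.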

This is all done by a `DataCollatorForSeq2Seq`. Like the `DataCollatorWithPadding`, it takes the **tokenizer used to preprocess the inputs, but it also takes the model**. This is because this data collator will also be responsible for **preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning**. Since this shift is done slightly differently for different architectures, the `DataCollatorForSeq2Seq` needs to know the model object.

In [34]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

To test this on a few samples, we just call it on a list of examples from our tokenized training set

In [35]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 5)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

We can check our labels have been padded to the maximum length of the batch, using `-100`:

In [36]:
batch["labels"]

tensor([[  279,   454,  9834, 10642,  8849,  5709,   925,  3586,  1636,     0,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [26228,  2045,    22,   288, 25865, 47041,     0,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [17151,   968,     0,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [  770, 43985,   911,   933,   311,  1006,    37,   311,   825, 20228,
          1241, 20634,     2,   490,   801,   536,    36,     0]])

And we can also have a look at the decoder input IDs, to see that they are shifted versions of the labels

In [37]:
batch["decoder_input_ids"]

tensor([[54795,   279,   454,  9834, 10642,  8849,  5709,   925,  3586,  1636,
             0, 54795, 54795, 54795, 54795, 54795, 54795, 54795],
        [54795, 26228,  2045,    22,   288, 25865, 47041,     0, 54795, 54795,
         54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795],
        [54795, 17151,   968,     0, 54795, 54795, 54795, 54795, 54795, 54795,
         54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795],
        [54795,   770, 43985,   911,   933,   311,  1006,    37,   311,   825,
         20228,  1241, 20634,     2,   490,   801,   536,    36]])
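The shift can be reproduced by hand. The sketch below is a simplified, list-based stand-in for what the collator asks the model to do, not the library's implementation. It assumes, based on the output above, that ID `54795` serves both as the decoder start token and as the padding token for this checkpoint; `shift_tokens_right` here is a local helper written for illustration.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels: prepend the start token, drop the
    last position, and replace any remaining -100 with the padding token."""
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if token == -100 else token for token in shifted]


PAD_ID = 54795  # assumption read off the output above: pad id == decoder start id

# Third row of batch["labels"] shown earlier
labels = [17151, 968, 0] + [-100] * 15
decoder_input_ids = shift_tokens_right(labels, PAD_ID, PAD_ID)
print(decoder_input_ids)
```

The result matches the third row of `batch["decoder_input_ids"]`: the start token, the three label tokens, then padding.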

For comparison, here are the unpadded labels for the same four elements (indices 1 to 4) of our dataset:

In [38]:
for i in range(1, 5):
    print(tokenized_datasets["train"][i]["labels"])

[279, 454, 9834, 10642, 8849, 5709, 925, 3586, 1636, 0]
[26228, 2045, 22, 288, 25865, 47041, 0]
[17151, 968, 0]
[770, 43985, 911, 933, 311, 1006, 37, 311, 825, 20228, 1241, 20634, 2, 490, 801, 536, 36, 0]
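Padding these four lists to the longest one with `-100` gives back the `batch["labels"]` tensor shown earlier. The following is a small sketch of that step only; `pad_labels` is a hypothetical helper, and it assumes padding is added on the right.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


# The four label lists printed above
raw_labels = [
    [279, 454, 9834, 10642, 8849, 5709, 925, 3586, 1636, 0],
    [26228, 2045, 22, 288, 25865, 47041, 0],
    [17151, 968, 0],
    [770, 43985, 911, 933, 311, 1006, 37, 311, 825, 20228, 1241, 20634, 2, 490, 801, 536, 36, 0],
]

padded = pad_labels(raw_labels)
for row in padded:
    print(len(row), row)
```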


We will pass this `data_collator` along to the `Seq2SeqTrainer`

### **Metrics**
The feature that `Seq2SeqTrainer` adds to its superclass `Trainer` is the ability to use the `generate()` method during evaluation or prediction. During training, the model will use the `decoder_input_ids` with an `attention mask` ensuring it **does not use the tokens after the token it’s trying to predict**, to speed up training. During inference we won’t be able to use those since we won’t have labels, so it’s a good idea to evaluate our model with the same setup.
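The attention mask mentioned here is often called a causal mask. As a rough illustration (a hand-written sketch, not the model's internal code), position `i` may attend to positions `0..i` only, which yields a lower-triangular matrix:

```python
def causal_mask(size):
    """Return a size x size matrix where entry [i][j] is 1 if position i
    may attend to position j (that is, j <= i) and 0 otherwise."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

With such a mask, all target positions can be processed in a single forward pass during training, whereas `generate()` has to produce tokens one at a time.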

The traditional metric used for *translation* is the **BLEU score**, introduced in a 2002 article by Kishore Papineni et al. The BLEU score **evaluates how close the translations are to their labels**. It does not measure the intelligibility or grammatical correctness of the model’s generated outputs, but **uses statistical rules to ensure that all the words in the generated outputs also appear in the targets**.

**One weakness** with `BLEU` is that **it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers**. So instead, the most commonly used metric for **benchmarking translation** models today is **`SacreBLEU`**, which addresses this weakness (and others) by **standardizing the tokenization step**.

In [39]:
!pip install sacrebleu


In [40]:
import evaluate

metric = evaluate.load("sacrebleu")

This metric will **take texts as inputs and targets**. It is designed to **accept several acceptable targets, as there are often multiple acceptable translations of the same sentence** — the dataset we’re using only provides one, but it’s not uncommon in NLP to find datasets that give several sentences as labels. So, **the predictions should be a list of sentences, *but* the references should be a list of lists of sentences.**

In [41]:
predictions = ["This plugin lets you translate web pages between several languages automatically."]

references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]

metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

This gets a BLEU score of 46.75, which is rather good — for reference, the original Transformer model in the “Attention Is All You Need” paper achieved a BLEU score of 41.8 on a similar translation task between English and French! On the other hand, if we try with the two bad types of predictions (lots of repetitions or too short) that often come out of translation models, we will get rather bad BLEU scores

In [42]:
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]

metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}

In [43]:
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 0.0,
 'counts': [2, 1, 0, 0],
 'totals': [2, 1, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 0.004086771438464067,
 'sys_len': 2,
 'ref_len': 13}

**The score can go from 0 to 100, and higher is better.**
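To see where the `counts`, `totals`, and `bp` fields in the outputs above come from, here is a simplified single-reference BLEU written from scratch. It is a sketch under stated assumptions: the regex tokenizer is a stand-in for SacreBLEU's standardized tokenization, and no smoothing is applied, so it will not match SacreBLEU on every input (the repeated “This This This This” case, for instance, would score 0 here).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred:
        return {"score": 0.0, "counts": [0] * max_n, "totals": [0] * max_n, "bp": 0.0}
    counts, totals = [], []
    for n in range(1, max_n + 1):
        pred_ngrams, ref_ngrams = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        counts.append(sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items()))
        totals.append(sum(pred_ngrams.values()))
    # Brevity penalty: predictions shorter than the reference are penalized
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    if min(counts) == 0:
        score = 0.0  # no smoothing: any empty n-gram order zeroes the score
    else:
        log_precision = sum(math.log(c / t) for c, t in zip(counts, totals)) / max_n
        score = 100 * bp * math.exp(log_precision)
    return {"score": score, "counts": counts, "totals": totals, "bp": bp}


reference = "This plugin allows you to automatically translate web pages between several languages."
good = simple_bleu(
    "This plugin lets you translate web pages between several languages automatically.",
    reference,
)
short = simple_bleu("This plugin", reference)
print(good)
print(short)
```

On these two inputs the sketch gives the same counts, totals, and brevity penalty as the metric outputs above, which shows that the score is the brevity penalty times the geometric mean of the four n-gram precisions.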

To get from the **model outputs to texts** the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to **clean up all the `-100`s in the labels** (the tokenizer will automatically do the same for the padding token)

In [44]:
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    # Model outputs to texts
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

### **Fine-Tuning the Model**

In [45]:
from huggingface_hub import notebook_login
notebook_login()

Define our `Seq2SeqTrainingArguments`. Like for the `Trainer`, we use a subclass of `TrainingArguments` that contains a few more fields

In [46]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-id",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

# fp16=True --> speeds up training on modern GPUs
# Don’t set any regular evaluation, as evaluation takes a while; we will just evaluate our model once before training and after



Finally, we just pass everything to the `Seq2SeqTrainer`:

In [47]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


Before training, we’ll first look at the score our model gets, to double-check that we’re not making things worse with our fine-tuning.

*This command will take a bit of time, so you can grab a coffee while it executes*

In [48]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 1.2910255193710327,
 'eval_model_preparation_time': 0.0067,
 'eval_bleu': 48.845870185372405,
 'eval_runtime': 77.9119,
 'eval_samples_per_second': 18.983,
 'eval_steps_per_second': 0.308}

A BLEU score of about 48.8 is not too bad, which reflects the fact that our model is already good at translating English sentences into Indonesian ones.

Next is the training, which will also take a bit of time:

In [49]:
trainer.train()

Step,Training Loss
500,1.0463
1000,0.7885



TrainOutput(global_step=1248, training_loss=0.881784377953945, metrics={'train_runtime': 215.4036, 'train_samples_per_second': 185.275, 'train_steps_per_second': 5.794, 'total_flos': 545872256040960.0, 'train_loss': 0.881784377953945, 'epoch': 3.0})

Note that **while the training happens, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background**. This way, you will be able to resume your training on another machine if necessary.

In [50]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 0.7495644092559814,
 'eval_model_preparation_time': 0.0067,
 'eval_bleu': 64.00608523431251,
 'eval_runtime': 79.088,
 'eval_samples_per_second': 18.701,
 'eval_steps_per_second': 0.303,
 'epoch': 3.0}

Finally, we use the `push_to_hub()` method to make sure we upload the latest version of the model. The `Trainer` also drafts a model card with all the evaluation results and uploads it. This model card contains **metadata that helps the Model Hub pick the widget for the inference demo**.

Usually, there is no need to say anything as it can infer the right widget from the model class, but in this case, the same model class can be used for all kinds of sequence-to-sequence problems, so we specify it’s a translation model

In [51]:
trainer.push_to_hub(tags="translation", commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/ditherr/marian-finetuned-kde4-en-to-fr/commit/e2ffad1bc976a6e4b0d19fde7777a0b3cb22521f', commit_message='Training complete', commit_description='', oid='e2ffad1bc976a6e4b0d19fde7777a0b3cb22521f', pr_url=None, pr_revision=None, pr_num=None)

In [52]:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("translation", model="ditherr/marian-finetuned-kde4-en-to-id")
pipe("Default to expanded threads")


[{'translation_text': 'Standar untuk thread yang diperluas'}]

In [55]:
pipe("Trying to understand what is behind the Translation process")

[{'translation_text': 'Mencoba untuk memahami apa yang ada di belakang proses Terjemahan'}]

## **A Custom Training Loop**
Let’s now take a look at the full training loop, so you can easily customize the parts you need

### **Preparing Everything for Training**

First we’ll build the `DataLoaders` from our datasets, after setting the datasets to the `"torch"` format so we get PyTorch tensors

In [None]:
from torch.utils.data import DataLoader

# Torch
tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

Next we **reinstantiate our model**, to make sure we’re not continuing the fine-tuning from before but starting from the pretrained model again

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

We will also need an **optimizer**:

In [None]:
# The AdamW class exported by transformers is deprecated in favor of PyTorch's
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

Once we have all those objects, we can **send them to the `accelerator.prepare()` method.**

Remember that if you want to train on TPUs in a Colab notebook, you will need to move all of this code into a training function, and you shouldn’t execute any cell that instantiates an `Accelerator`.

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that we have sent our `train_dataloader` to `accelerator.prepare()`, we can use its length (`len(train_dataloader)`) to compute the number of training steps.

Remember we **should always do this after preparing the dataloader, as that method will change the length of the `DataLoader`**. We use a classic linear schedule that decays the learning rate from its initial value to 0:

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
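As a rough illustration of what a linear schedule with no warmup does, the sketch below computes the learning rate by hand. It assumes a single process and no dropped last batch, so that the step count follows directly from the 13,303 training examples and the batch size of 8 used above; `linear_lr` is a hypothetical helper, not the scheduler object returned by `get_scheduler`.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup (if any),
    then linear decay from base_lr down to 0 at the final step."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Assumed setup: one process, batch size 8, incomplete last batch kept
num_examples, batch_size, num_epochs = 13303, 8, 3
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 2e-5, total_steps))
```

With several processes, each one sees only a share of the batches, so `len(train_dataloader)` shrinks accordingly; this is why the length is read after `accelerator.prepare()`.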

Lastly, to **push our model to the Hub**, we will need to create a `Repository` object in a working folder

In [None]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-id-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

Then we can clone that repository in a local folder

In [None]:
output_dir = "marian-finetuned-kde4-en-to-id-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

We can now upload anything we save in `output_dir` by calling the `repo.push_to_hub()` method

### **Training Loop**
To simplify its evaluation part, we define this `postprocess()` function that takes predictions and labels and converts them to the lists of strings our metric object will expect

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

The first thing to note is that we use the `generate()` method to compute predictions, but this is a method on our base model, not the wrapped model 🤗 Accelerate created in the `prepare()` method. That’s why we unwrap the model first, then call this method.

The second thing is that, like with token classification, **two processes may have padded the inputs and labels to different shapes, so we use `accelerator.pad_across_processes()` to make the predictions and labels the same shape before calling the `gather()` method**. If we don’t do this, the evaluation will either error out or hang forever

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Pad predictions and labels to a common length so they can be gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
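To make the padding step in the evaluation loop more concrete, here is a small sketch that imitates it with NumPy arrays. It is not how 🤗 Accelerate implements `pad_across_processes()`; the helper `pad_to_length()`, the two "process" arrays, and the pad id `0` are hypothetical and only show why sequences of different lengths must be padded before they can be combined.

```python
import numpy as np


def pad_to_length(batch, target_len, pad_index):
    """Right-pad a 2-D array of token ids along dim 1 up to target_len."""
    missing = target_len - batch.shape[1]
    return np.pad(batch, ((0, 0), (0, missing)), constant_values=pad_index)


# Hypothetical generated tokens from two processes, with different lengths.
proc0 = np.array([[5, 6, 7]])
proc1 = np.array([[8, 9, 10, 11, 12]])

# Every process pads to the longest sequence length found on any process.
target_len = max(proc0.shape[1], proc1.shape[1])
padded = [pad_to_length(p, target_len, pad_index=0) for p in (proc0, proc1)]

# Only now do the shapes match, so the arrays can be stacked into one batch.
gathered = np.concatenate(padded, axis=0)
print(gathered.shape)
```

Concatenating `proc0` and `proc1` directly would fail because their second dimensions differ, which mirrors the failure mode described above for `gather()`.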

### **TensorFlow**

In [None]:
# ...
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

In [None]:
import numpy as np
import tensorflow as tf
from tqdm import tqdm

generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
)

tf_generate_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    collate_fn=generation_data_collator,
    shuffle=False,
    batch_size=8,
)


@tf.function(jit_compile=True)
def generate_with_xla(batch):
    return model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=128,
    )


def compute_metrics():
    all_preds = []
    all_labels = []

    for batch, labels in tqdm(tf_generate_dataset):
        predictions = generate_with_xla(batch)
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        labels = labels.numpy()
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [[label.strip()] for label in decoded_labels]
        all_preds.extend(decoded_preds)
        all_labels.extend(decoded_labels)

    result = metric.compute(predictions=all_preds, references=all_labels)
    return {"bleu": result["score"]}

In [None]:
from transformers import create_optimizer
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
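The step count in the cell above can be checked by hand. The sketch below uses a hypothetical dataset of 1,000 samples (not the real size of the training split) and assumes the incomplete last batch is dropped, as the comment in the cell describes. The `linear_decay_lr()` helper is also hypothetical: it assumes a schedule that decays linearly to zero with no warmup, which is worth checking against the `create_optimizer` documentation.

```python
# Hypothetical numbers, for illustration only.
num_samples = 1_000
batch_size = 32
num_epochs = 3

# Length of the batched dataset when the incomplete last batch is dropped.
steps_per_epoch = num_samples // batch_size

# Total number of optimizer steps over the whole run.
num_train_steps = steps_per_epoch * num_epochs


def linear_decay_lr(step, init_lr=5e-5, total_steps=num_train_steps):
    """Learning rate under an assumed linear decay to zero, without warmup."""
    return init_lr * max(0.0, 1 - step / total_steps)


print(steps_per_epoch, num_train_steps)
print(linear_decay_lr(0), linear_decay_lr(num_train_steps))
```

Under these assumptions the learning rate starts at `init_lr` and reaches zero exactly at the last step, which is why `num_train_steps` has to match the real number of batches seen during training.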

In [None]:
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir="marian-finetuned-kde4-en-to-fr", tokenizer=tokenizer
)

model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=[callback],
    epochs=num_epochs,
)

## **Using the Fine-Tuned model**
To use it locally in a `pipeline`, we just have to specify the proper model identifier:

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

In [None]:
translator("Unable to import %1 using the OFX importer plugin. This file is not the correct format.")