# Fine-tuning of the Marian model (Estonian to English)

In [None]:
!pip install evaluate openai-whisper datasets peft transformers==4.45
!pip install sacrebleu unbabel-comet
!pip install -U bitsandbytes


In [None]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("csv", data_files="/content/drive/My Drive/dataset_asr.csv")

# raw_datasets = raw_datasets.select_columns(["transcription","translation"])

raw_datasets = raw_datasets["train"].train_test_split(test_size=0.83, shuffle=False)

print(raw_datasets)


DatasetDict({
    train: Dataset({
        features: ['transcription', 'sentence', 'translation', 'hypothesis_clean', 'sentence_clean', 'translation_clean'],
        num_rows: 170
    })
    test: Dataset({
        features: ['transcription', 'sentence', 'translation', 'hypothesis_clean', 'sentence_clean', 'translation_clean'],
        num_rows: 830
    })
})


Show the training pairs (transcription-translation)

In [None]:
raw_datasets["train"][:12]["transcription"]

[' Ey, ey, ég vel hvort ey.',
 ' Kõik kirjeldatud probleimid on Sirlele pähe toonud ka elust loobumise mõtteid.',
 ' Lõpliku otsuse peaks viss sportlasa suhtes tegema järgmisel nädalal.',
 ' Puhkust on vaja taastamiseks, et tööd saaks teha selge peaga.',
 ' Rakett või battari on iga inimese enda valik.',
 ' Ma ei kohanud õpinguta jooksul ühtegi stereotüübsed nohikmatemaatikud.',
 ' Õlletootjate sõnul tuleneb alkoholivaba Õllahind kallimast tehnoloogiast.',
 ' Nætendid mängiti páril mæku ljubbu päeval.',
 ' Tagasiside on Ariadne sõnul olnud väga pozitiivne.',
 ' Senúa behiteld súr tepsust.',
 ' Paastamine ei seis näinult hoidus loobumises.',
 ' Kui sa mängid nüksa tugevama vastu, siis see on väga oluline omadus.']

In [None]:
raw_datasets["train"][:12]["translation"]

['No, no, and again, no.',
 'All these problems had made Sirli consider ending her life.',
 'The FIS should make their final decision regarding the athletes next week.',
 'They needed the vacation to recover so that they could work with a clear head.',
 'A rocket or a battery is every person’s personal choice.',
 'I never met a stereotypical nerdy mathematician during my studies.',
 'According to beer manufacturers, non-alcoholic beer’s price is formed by the more expensive technology used.',
 'The play was performed on a couple of days at the end of May.',
 'According to Ariadne, feedback has been very positive.',
 'This requires great accuracy from the constructor.',
 'Fasting is not only about giving up food.',
 'If you play against someone slightly stronger, this is a very important quality.']


Now we load the pre-trained tokenizer of the Marian checkpoint (Helsinki-NLP/opus-mt-et-en) for the Estonian-English pair:

In [None]:
max_tok_length = 275

from transformers import AutoTokenizer

checkpoint = "Helsinki-NLP/opus-mt-et-en"
# from flores200_codes import flores_codes
src_code = "et"
tgt_code = "en"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    padding=True,
    pad_to_multiple_of=8,
    src_lang=src_code,
    tgt_lang=tgt_code,
    truncation=False,
    max_length=max_tok_length,
)




We can apply the tokenizer to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also gives us some extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset, or to batches of elements when batched=True.
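To make the batched calling convention concrete, here is a minimal pure-Python sketch. It is a simplified stand-in, not the Datasets implementation: `batched_map`, `add_lengths` and the toy rows are hypothetical names and data introduced only for illustration. The point it shows is that, with batching, the function receives a dict of lists (one list per column) and returns new columns of the same length.

```python
def batched_map(rows, fn, batch_size=1000):
    """Apply fn to column-wise batches of rows and merge the new columns back."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out


def add_lengths(batch):
    # Hypothetical preprocessing step: character length of each source sentence.
    return {"n_chars": [len(text) for text in batch["transcription"]]}


rows = [
    {"transcription": "Tere", "translation": "Hello"},
    {"transcription": "Jah", "translation": "Yes"},
    {"transcription": "Ei", "translation": "No"},
]
mapped = batched_map(rows, add_lengths, batch_size=2)
print(mapped[1])
```

The original columns are kept and the returned columns are added next to them, which is also how the tokenizer outputs end up alongside `transcription` and `translation` in the tokenized dataset.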

In our case, each sample pair is going to be preprocessed according to the training needs of the model that is to be finetuned:

In [None]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["transcription"],
        text_target=sample["translation"],
    )
    return model_inputs


Now, we can apply the preprocess_function to the raw datasets (training and test):

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


We can take a quick look at the length histogram in the source language:

In [None]:
dic = {}
for sample in tokenized_datasets["train"]:
    sample_length = len(sample["input_ids"])
    if sample_length not in dic:
        dic[sample_length] = 1
    else:
        dic[sample_length] += 1

for i in range(1, max_tok_length + 1):
    if i in dic:
        print(f"{i:>2} {dic[i]:>3}")

 7   1
 9   1
10   4
11   4
12   2
13   6
14   7
15   6
16   8
17   6
18  10
19   7
20   4
21   9
22   8
23   6
24   7
25   6
26   5
27  10
28   8
29   3
30   6
31   6
32   4
33   4
34   3
35   4
36   6
37   1
38   1
39   1
40   2
43   1
44   1
47   1
57   1
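The same histogram can be built more compactly with `collections.Counter`. The sketch below uses toy token-id lists in place of `tokenized_datasets["train"]["input_ids"]`; `length_histogram` is a hypothetical helper name.

```python
from collections import Counter


def length_histogram(token_id_lists):
    """Count how many sequences have each length, sorted by length."""
    counts = Counter(len(ids) for ids in token_id_lists)
    return sorted(counts.items())


# Toy token-id lists standing in for the tokenized source sentences.
toy_input_ids = [[5, 9, 0], [7, 0], [3, 4, 0], [8, 1, 2, 6, 0]]
for length, count in length_histogram(toy_input_ids):
    print(f"{length:>2} {count:>3}")
```

Sorting the counted items avoids looping over every possible length up to `max_tok_length`.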


No filtering by maximum number of tokens is needed here: the longest source sequence in the training split has 57 tokens, far below max_tok_length = 275.

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4-bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>


In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
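To give an intuition for what storing weights in 4 bits means, the following sketch quantizes one block of weights with a single shared scale (absmax quantization) and restores it. This is a deliberately simplified, uniform scheme written for illustration: NF4 uses a non-uniform 16-level codebook instead of evenly spaced integers, and bitsandbytes performs the real work in its own kernels. The function names and the example block are hypothetical.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-7, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 for all-zero blocks
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


block = [0.42, -0.07, 0.0, -0.35, 0.20]
codes, scale = quantize_absmax_4bit(block)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Each weight is stored as a small integer plus one scale per block, and the rounding error per weight stays within half a quantization step.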

Pass the quantization_config to the from_pretrained method.

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [None]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=False,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those. This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs), which are easier to store and share.


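Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. The cell below uses AdaLoraConfig, the configuration class for AdaLoRA, a LoRA variant that reallocates the rank budget during training. On top of its own scheduling parameters (init_r, tinit, tfinal, deltaT), it accepts the usual LoRA parameters: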

<ul>
<li>task_type: the task to train for (sequence-to-sequence language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: determine what set of parameters are adapted</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>

In [None]:
from peft import AdaLoraConfig, get_peft_model

config = AdaLoraConfig(
    r=8,
    init_r=12,
    tinit=200,
    tfinal=1000,
    deltaT=10,
    inference_mode=False,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    target_modules=["q_proj", "v_proj"],
)

Once the adapter configuration and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes the quantized model and the AdaLoraConfig that describes how to configure the model for adapter training.

In [None]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 442,800 || all params: 75,245,012 || trainable%: 0.5885
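The reported number of trainable parameters can be reproduced by hand. The sketch below rests on assumptions that the notebook does not state: a Marian architecture with d_model = 512, 6 encoder layers (self-attention only) and 6 decoder layers (self-attention plus cross-attention), and an AdaLoRA layer made of an `init_r x d_in` matrix, a `d_out x init_r` matrix and `init_r` trainable singular values. `adalora_trainable_params` is a hypothetical helper.

```python
def adalora_trainable_params(d_in, d_out, init_r):
    """Trainable parameters added to one linear layer under the stated assumptions.

    A: init_r x d_in, B: d_out x init_r, plus init_r singular values.
    """
    return init_r * d_in + d_out * init_r + init_r


# Assumed architecture (not stated in the notebook).
d_model = 512
attention_blocks = 6 * 1 + 6 * 2        # encoder self-attn + decoder self/cross-attn
adapted_modules = attention_blocks * 2  # q_proj and v_proj in each block

total = adapted_modules * adalora_trainable_params(d_model, d_model, init_r=12)
print(f"{total:,}")  # 442,800
```

Under these assumptions the total matches the printed value, which suggests that the initial rank `init_r=12`, rather than `r=8`, determines the adapter size here.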


The function that is responsible for putting together samples inside a batch is called a collate function. It is an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them. This is not possible in our case since the inputs we have are not all of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding.

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides us with such a function via DataCollatorForSeq2Seq that takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs), so we will also need to instantiate the model first to provide it to the collate function:

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer, model=lora_model, pad_to_multiple_of=8
)
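As a rough picture of what the collator does for each batch, here is a pure-Python sketch of dynamic padding. It is not the Transformers implementation: `pad_batch` and the toy features are hypothetical, padding is applied on the right, the pad id 58865 is taken from the checkpoint's generation config, and labels are padded with -100 so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100, multiple=8):
    """Right-pad input_ids and labels to the batch maximum, rounded up to a multiple."""

    def target_len(key):
        longest = max(len(f[key]) for f in features)
        return -(-longest // multiple) * multiple  # ceiling to the next multiple

    in_len, lab_len = target_len("input_ids"), target_len("labels")
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = in_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * (lab_len - len(f["labels"])))
    return batch


features = [
    {"input_ids": [11, 12, 0], "labels": [21, 0]},
    {"input_ids": [13, 14, 15, 16, 17, 18, 19, 20, 0], "labels": [22, 23, 24, 0]},
]
batch = pad_batch(features, pad_token_id=58865)
print([len(x) for x in batch["input_ids"]], [len(x) for x in batch["labels"]])
```

Because the target length is computed per batch, a batch of short sentences is never padded up to the longest sentence in the whole dataset.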

## Evaluation

The last thing to define for our Seq2SeqTrainer is how to compute the metrics to evaluate the predictions of our model with respect to references. To this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate) which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu). You can see a simple example of usage below:

In [None]:
from evaluate import load

metric = load("sacrebleu")

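BLEU is built from clipped n-gram precisions, which are combined through a geometric mean and multiplied by a brevity penalty. The sketch below computes only that building block on whitespace tokens; it is an illustration, not the sacreBLEU implementation, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return matched / total if total else 0.0


hyp = "the final decision next week"
ref = "the final decision is due next week"
p1 = clipped_precision(hyp, ref, 1)
p2 = clipped_precision(hyp, ref, 2)
print(p1, p2)  # 1.0 0.75
```

Every word of the hypothesis appears in the reference, so the unigram precision is perfect, while the bigram "decision next" has no match and lowers the bigram precision.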


We need to define a function compute_metrics to compute BLEU scores at each epoch. The example below performs a basic post-processing to decode the predictions into texts:

In [None]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Labels use negative ids (-100) for ignored positions, which cannot be
    # decoded, so replace them with the pad token id first.
    labels = np.where(labels < 0, tokenizer.pad_token_id, labels)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result


## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) is to define a [Seq2SeqTrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments) that will contain all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can set them depending on the recommendations from the model developers:

In [None]:
from transformers import Seq2SeqTrainingArguments

batch_size = 32
model_name = checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-est-to-eng",
    eval_strategy="epoch",  # newer name of the deprecated evaluation_strategy
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=15,
    predict_with_generate=True,
)



Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training arguments (args), the training and evaluation datasets, the tokenizer, the data collator and the compute_metrics function:

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    lora_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer:

In [None]:
trainer.train()





| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | No log | 3.285895 | 26.1116 | 23.3072 |
| 2 | No log | 3.269742 | 26.1116 | 23.3072 |
| 3 | No log | 3.254211 | 26.1116 | 23.3072 |
| 4 | No log | 3.239388 | 26.1116 | 23.3072 |
| 5 | No log | 3.225369 | 26.1116 | 23.3072 |
| 6 | No log | 3.212259 | 26.1116 | 23.3072 |
| 7 | No log | 3.200169 | 26.1116 | 23.3072 |
| 8 | No log | 3.189204 | 26.1116 | 23.3072 |
| 9 | No log | 3.179469 | 26.1116 | 23.3072 |
| 10 | No log | 3.171054 | 26.1116 | 23.3072 |


TrainOutput(global_step=90, training_loss=3.593234592013889, metrics={'train_runtime': 923.1899, 'train_samples_per_second': 2.762, 'train_steps_per_second': 0.097, 'total_flos': 32342508088704.0, 'train_loss': 3.593234592013889, 'epoch': 15.0})
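The `global_step=90` in the output above follows from the dataset and batch sizes. The short check below assumes a single device and no gradient accumulation, and it assumes the Trainer's default logging interval of 500 steps, which the notebook does not override.

```python
import math

train_rows = 170
batch_size = 32
epochs = 15
default_logging_steps = 500  # assumed Trainer default, not set in the notebook

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)          # 6 90
print(total_steps < default_logging_steps)   # True
```

Since the run ends before the first logging step is reached, this would explain why the Training Loss column only shows "No log"; a smaller `logging_steps` value would make the training loss visible.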

## Evaluation of the fine-tuned Marian model on the Europarl-ST test set

Let us first load the default inference parameters of the Marian checkpoint:

In [None]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
)

print(generation_config)

GenerationConfig {
  "bad_words_ids": [
    [
      58865
    ]
  ],
  "bos_token_id": 0,
  "decoder_start_token_id": 58865,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "max_length": 512,
  "num_beams": 4,
  "pad_token_id": 58865,
  "renormalize_logits": true
}



We load the speech transcriptions (and reference translations) of Europarl-ST test set from the csv file generated in the experiment L4.1. Then, we prepare the test set in batches to be translated:

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("csv", data_files="/content/drive/My Drive/dataset_asr.csv")

print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['transcription', 'sentence', 'translation', 'hypothesis_clean', 'sentence_clean', 'translation_clean'],
        num_rows: 1000
    })
})


Each sample pair is preprocessed according to the training needs of the model that has been finetuned:

In [None]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["transcription"],
        text_target=sample["translation"],
    )
    return model_inputs

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


In [None]:
test_batch_size = 32
batch_tokenized_test = tokenized_datasets["train"].batch(test_batch_size)


We process the test set in batches, padding each batch and converting it to tensors, and then perform inference with num_beams = 1 and do_sample = False, that is, greedy search.

In [None]:
number_of_batches = len(batch_tokenized_test["transcription"])
output_sequences = []
for i in range(number_of_batches):
    inputs = tokenizer(
        batch_tokenized_test["transcription"][i],
        max_length=max_tok_length,
        truncation=False,
        return_tensors="pt",
        padding=True,
    )
    output_batch = lora_model.generate(
        generation_config=generation_config,
        input_ids=inputs["input_ids"].cuda(),
        attention_mask=inputs["attention_mask"].cuda(),
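        # NOTE: forced_bos_token_id is an NLLB-style setting. With this Marian
        # tokenizer it appears to force the literal token "en" as the first
        # output token (see the leading "en" in the predictions below), which
        # Marian checkpoints do not need; consider removing it.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),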
        max_length=max_tok_length,
        num_beams=1,
        do_sample=False,
    )
    output_sequences.extend(output_batch.cpu())
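Greedy search simply keeps the single highest-scoring token at every step. The toy sketch below illustrates the idea with a lookup table in place of a neural model; `greedy_decode`, `TABLE` and the tokens are hypothetical and have nothing to do with the Marian vocabulary.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_length):
    """Repeatedly append the highest-scoring next token until EOS or max_length."""
    sequence = [start_token]
    while len(sequence) < max_length:
        scores = next_token_scores(sequence)
        best = max(scores, key=scores.get)  # argmax: no sampling, a single beam
        sequence.append(best)
        if best == eos_token:
            break
    return sequence


# Hypothetical toy "model": scores depend only on the previous token.
TABLE = {
    "<s>": {"no": 0.6, "yes": 0.4},
    "no": {",": 0.7, "</s>": 0.3},
    ",": {"no": 0.2, "never": 0.5, "</s>": 0.3},
    "never": {"</s>": 0.9, ",": 0.1},
}
result = greedy_decode(lambda seq: TABLE[seq[-1]], "<s>", "</s>", max_length=10)
print(result)
```

Beam search with `num_beams > 1` would instead keep several partial hypotheses alive at each step, at a higher computational cost.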



We can recover the decoded predictions and references by applying the [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) method of the tokenizer:

In [None]:
decoded_preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)

In [None]:
references = tokenizer.batch_decode(
    tokenized_datasets["train"]["labels"], skip_special_tokens=True
)

Let's take a closer look at some decoded predictions and references:

In [None]:
references[:5]

['No, no, and again, no.',
 'All these problems had made Sirli consider ending her life.',
 'The FIS should make their final decision regarding the athletes next week.',
 'They needed the vacation to recover so that they could work with a clear head.',
 'A rocket or a battery is every persons personal choice.']

In [None]:
decoded_preds[:5]

['en, ey, ég vel hvort ey.',
 'en as well as the problems described, Sirle has also brought to her mind thoughts about giving up life.',
 'en the final decision on the sports gym next week.',
 'en there is a need for a resting holiday so that work can be done with a clear head.',
 'en a rocket or a trampari is a personal choice.']

Predictions and references are normalized using the basic text normalization module of Whisper:

In [None]:
from whisper.normalizers.basic import BasicTextNormalizer

normalizer = BasicTextNormalizer()

decoded_preds_clean = [normalizer(text) for text in decoded_preds]
references_clean = [normalizer(text) for text in references]
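To see roughly what this normalization does to a sentence, here is a pure-Python approximation. It is only a sketch under assumptions: it lowercases, drops bracketed spans, replaces marks, symbols and punctuation with spaces, and collapses whitespace. The real module may differ in details (the final `strip()` here, for instance, is an addition for readability), and `basic_normalize` is a hypothetical name.

```python
import re
import unicodedata


def basic_normalize(text):
    """Lowercase, drop bracketed spans, turn marks/symbols/punctuation into spaces."""
    text = text.lower()
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)  # remove <...> and [...]
    text = re.sub(r"\(([^)]+?)\)", "", text)       # remove (...)
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    return re.sub(r"\s+", " ", text).strip()


print(basic_normalize("No, no, and again, no."))
```

Normalizing both sides in the same way keeps differences in casing and punctuation from being counted as translation errors by the metrics.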

In [None]:
from evaluate import load

metric = load("sacrebleu")

We score the normalized predictions against the normalized references with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) from the [Evaluate library](https://huggingface.co/docs/evaluate):

In [None]:
result = metric.compute(predictions=decoded_preds_clean, references=references_clean)
print(f'BLEU score: {result["score"]:.1f}')

BLEU score: 19.9


In [None]:
from evaluate import load

comet_metric = load("comet")



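We then compute the COMET score with the [Evaluate library](https://huggingface.co/docs/evaluate), whose comet metric loads the Unbabel wmt22-comet-da checkpoint here. Unlike BLEU, COMET also takes the source transcriptions as input, alongside the predictions and references: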

In [None]:
comet_score = comet_metric.compute(
    predictions=decoded_preds_clean,
    references=references_clean,
    sources=raw_datasets["train"]["transcription"],
)



In [None]:
print(f"COMET: {comet_score['mean_score'] * 100:.2f} %")

COMET: 68.10 %
