# Fine-tuning

Fine-tuning refers to the process in transfer learning in which the parameter values of a model trained on a large dataset are further updated by continuing training on a smaller dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a model pre-trained on a large amount of data to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [1]:
!pip install datasets evaluate peft bitsandbytes transformers==4.45 #accelerate
!pip install sacrebleu unbabel-comet


In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. More generally, the [Datasets library](https://huggingface.co/docs/datasets) makes it easy to access and load datasets; for example, you can load your own dataset by following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).


In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("NilanE/ParallelFiction-Ja_En-100k", split="train")

print(raw_datasets)

Dataset({
    features: ['src', 'trg', 'meta'],
    num_rows: 106048
})


In [3]:
raw_datasets.features

{'src': Value(dtype='string', id=None),
 'trg': Value(dtype='string', id=None),
 'meta': {'general': {'series_title_eng': Value(dtype='string', id=None),
   'series_title_jap': Value(dtype='string', id=None),
   'sentence_alignment_score': Value(dtype='float64', id=None)},
  'novelupdates': {'link': Value(dtype='string', id=None),
   'genres': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
   'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
   'rating': Value(dtype='float64', id=None),
   'rating_votes': Value(dtype='int64', id=None)},
  'syosetu': {'link': Value(dtype='string', id=None),
   'series_active': Value(dtype='bool', id=None),
   'writer': Value(dtype='string', id=None),
   'fav_novel_cnt': Value(dtype='int64', id=None),
   'global_points': Value(dtype='int64', id=None)}}}

In [4]:
raw_datasets[0]

{'src': '77.素人の気づき\n「いい、アーニャ。今から行ったとしても、陛下が実際に選抜を通じて選ぶ妃は数人から数十人でしょう」\n「うん、そうだね」\n「ですが、それに既存の騎士選抜を重ね合わせることで、世の中の女達は陛下に選ばれるために、陛下が定めた基準――方向性に向かって成長していく流れになります」\n「......わあ」\n「騎士はどうしても男が中心、女が騎士になろうと考えるのは一部の物好き。ですが、玉の輿を望まない女なんてよほどの事でもなければいません。世の中の女は、陛下に気に入られる為に奮起するのです」\n「そこまで考えて......すごい!」\nアーニャは俺を尊敬しきった眼差しで見つめてきた。\n「余の考えを一瞬で読み切ったお前が凄いよ」\nそういい、微笑みながらオードリーを見た。\n「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」\n「なんだ藪から棒に。この流れだと、女として、と言う意味なんだな?」\n聞きかえすと、オードリーは静かに頷いた。\n「なんでそんな事を聞く」\n「陛下は重要な事を忘れていらっしゃるように見受けられましたので」\n「重要な事?」\nなんか忘れてるか?\n「上皇陛下には多くの妃がおります。そして、\n「ふむ」\n様々な、という所で少しだけ吹き出しそうになった。\n中には臣下の妻だった女や、かつて自分の父親の妃――義理の母親だった女も妃にした。\n有名な話だ。\n時の皇帝が崩御した時は、政略的に妃にはしたが、まだ六歳という幼さ故に手付かずの女の子が一人いた。\nつまり、六歳の未亡人と言うことだ。\nそれが成長し、適齢期になった時、その美しさを見初めた父上が無理矢理自分の妻にした。\n武勇伝には事欠かないのが上皇、父上なのである。\n「臣下の妻をものにしたとき、自分の義理の母にあたる少女を手籠めにしたとき、上皇陛下は誰かに咎められまして?」\n「いいや?」\n皇帝がなぜ、その程度の事で咎められるものか。\nもっとあり得ない、非人道的な事をやっても咎められもしないのが皇帝という物だ。\n「ええ、陛下の反応そのままです」\n「何が言いたい」\n「陛下は貴族の義務と良くおっしゃってますが、貴族の権利を忘れているように思います」\n「......ふむ」\nなるほど、もっと地位と権力を享受しろって言いたいのか。\

In [5]:
raw_datasets = raw_datasets.remove_columns(["meta"])

## Preprocess

In [6]:
# Flatten and reduce the dataset
max_tok_length = 16


def flatten_examples(batch):
    flat_jp = []
    flat_en = []
    for jp, en in zip(batch["src"], batch["trg"]):
        # The full dataset is too large for the available resources, so we
        # prefilter by English word count to reduce tokenization time
        # (a filter on token length is applied later).
        i = 0
        # Skip the first 10 lines to avoid opening sentences and the chapter title
        for e, j in zip(en.split("\n")[10:], jp.split("\n")[10:]):
            if len(e.split()) <= max_tok_length:
                flat_jp += [j]
                flat_en += [e]
                i += 1
                if i == 1:
                    # Keep at most one sentence pair per chapter
                    break
    flat_data = {"src": flat_jp, "trg": flat_en}
    return flat_data


# Apply flattening
flat_dataset = raw_datasets.map(
    flatten_examples,
    batched=True,
    remove_columns=raw_datasets.column_names,
)


In [7]:
flat_dataset[0]

{'src': '「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」',
 'trg': '"That\'s all well and good, but is there no woman you like?"'}

In [8]:
# Split the data into train (80%), validation (10%) and test (10%) sets.
# train_test_split shuffles by default, so no separate shuffle call is needed
# (Dataset.shuffle() returns a new dataset rather than shuffling in place).
raw_datasets = flat_dataset.train_test_split(test_size=0.2)
test_valid = raw_datasets["test"].train_test_split(test_size=0.5)
raw_datasets["test"] = test_valid["test"]
raw_datasets["valid"] = test_valid["train"]

Now we load the pre-trained tokenizer for the NLLB model and apply it to the Japanese-English pair:

In [9]:
from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
# from flores200_codes import flores_codes
src_code = "jpn_Jpan"
tgt_code = "eng_Latn"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    padding=True,
    pad_to_multiple_of=8,
    src_lang=src_code,
    tgt_lang=tgt_code,
    truncation=True,
    max_length=max_tok_length,
)


We can apply the tokenizer function to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset.
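As a rough mental model (a toy illustration in plain Python, not the library's actual implementation), mapping with a function that returns a dictionary can be pictured as merging the returned keys into each row. The helper `toy_map` and the stand-in tokenizer `fake_tokenize` are hypothetical names introduced only for this sketch:

```python
def toy_map(rows, function):
    """Apply `function` to every row and merge the returned keys into it."""
    return [{**row, **function(row)} for row in rows]


def fake_tokenize(row):
    # Hypothetical stand-in for a tokenizer: one "id" per whitespace-separated word.
    words = row["trg"].split()
    return {"input_ids": list(range(len(words))), "attention_mask": [1] * len(words)}


rows = [{"src": "はい", "trg": "Yes"}, {"src": "ありがとう", "trg": "Thank you"}]
mapped = toy_map(rows, fake_tokenize)
print(mapped[1])
```

The original columns are kept and the new ones are added next to them, which is why the tokenized datasets in this notebook still carry the `src` and `trg` fields.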

In our case, each sample pair is preprocessed according to the training needs of the model that is to be fine-tuned:

In [10]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["src"],
        text_target=sample["trg"],
    )
    return model_inputs

The way the Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function, that is, *input_ids*, *attention_mask* and *labels*. We can check what preprocess_function does on a small sample:

In [11]:
sample = raw_datasets["train"].select(range(2))
model_input = preprocess_function(sample)
print(model_input)

{'input_ids': [[256079, 151003, 30345, 785, 55835, 18090, 968, 248203, 252773, 2], [256079, 151003, 37692, 253131, 250049, 248871, 249672, 249462, 248449, 249305, 248203, 30350, 137718, 250049, 248871, 249672, 249462, 248449, 249305, 248718, 248203, 252773, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[256047, 3555, 2125, 248116, 248065, 359, 1482, 248203, 2], [256047, 69, 39798, 248116, 248066, 248079, 13136, 12, 147, 905, 248203, 7177, 17606, 248, 13136, 12, 147, 905, 2097, 2]]}


In [12]:
for sample in model_input["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(sample))

['jpn_Jpan', '▁「', 'そんな', 'こと', '出来る', 'わけ', 'ない', '!', '」', '</s>']
['jpn_Jpan', '▁「', 'だから', '、', 'ケ', 'リ', 'ナ', 'グ', 'ー', 'レ', '!', '▁私の', '名前は', 'ケ', 'リ', 'ナ', 'グ', 'ー', 'レ', 'よ', '!', '」', '</s>']


We can recover the source text by applying [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) of the tokenizer:

In [13]:
tokenizer.batch_decode(model_input["input_ids"])

['jpn_Jpan 「そんなこと出来るわけない!」</s>', 'jpn_Jpan 「だから、ケリナグーレ! 私の名前はケリナグーレよ!」</s>']

Now, we can apply the preprocess_function to the raw datasets (training, validation and test):

In [14]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


We are going to filter the tokenized datasets by maximum number of tokens in source and target language:

In [15]:
tokenized_datasets = tokenized_datasets.filter(
    lambda x: len(x["input_ids"]) <= max_tok_length
    and len(x["labels"]) <= max_tok_length,
    desc=f"Discarding source and target sentences with more than {max_tok_length} tokens",
)


We can take a quick look at the length histogram in the source language:

In [16]:
from collections import Counter

# Count how many training samples have each source length (in tokens)
length_counts = Counter(
    len(sample["input_ids"]) for sample in tokenized_datasets["train"]
)

for i in range(1, max_tok_length + 1):
    if i in length_counts:
        print(f"{i:>2} {length_counts[i]:>3}")

 3 116
 4 213
 5 887
 6 2036
 7 3144
 8 3618
 9 3956
10 4278
11 4304
12 4243
13 3801
14 3306
15 2598
16 2073


Checking a sample after filtering by maximum number of tokens:

In [17]:
for sample in tokenized_datasets["train"].select(range(5)):
    print(sample["input_ids"])
    print(sample["attention_mask"])
    print(sample["labels"])

[256079, 151003, 30345, 785, 55835, 18090, 968, 248203, 252773, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256047, 3555, 2125, 248116, 248065, 359, 1482, 248203, 2]
[256079, 151003, 248979, 248243, 1851, 253935, 249431, 249305, 250254, 248854, 252773, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256047, 709, 198972, 64246, 248079, 221447, 33349, 248059, 2]
[256079, 26723, 253131, 33315, 44486, 20380, 17156, 248183, 95252, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256047, 11582, 248116, 248066, 117510, 248116, 248066, 52101, 40197, 30435, 248075, 2]
[256079, 151003, 248300, 253131, 249191, 2771, 248203, 252773, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1]
[256047, 69, 46520, 248079, 6370, 796, 248, 2097, 2]
[256079, 34569, 250438, 5823, 253935, 2]
[1, 1, 1, 1, 1, 1]
[256047, 11873, 155664, 1929, 10527, 248075, 2]


bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4-bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4-bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>
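To give an intuition for what quantizing to 4 bits means, here is a deliberately simplified sketch in plain Python. It uses uniform absmax quantization with a single scale, which is an assumption made for illustration: the NF4 data type places its 16 levels non-uniformly and scales weights block by block, so this is not what bitsandbytes computes. The function names are hypothetical.

```python
def quantize_absmax_4bit(weights):
    """Quantize floats to integer codes in [-7, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.30]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as a small integer plus a shared scale, and the reconstruction error is bounded by half a quantization step. This is the trade-off that makes the quantized model fit in less memory.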


In [18]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Pass the quantization_config to the from_pretrained method.

In [19]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint, quantization_config=quantization_config
)


Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [20]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=False,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those. This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs), which are easier to store and share.

Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. For example, to train with LoRA, create a LoraConfig instance and specify the following parameters:

<ul>
<li>task_type: the task to train for (sequence-to-sequence language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: the modules (here the attention query and value projections) to which the LoRA matrices are applied</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>
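The roles of r and lora_alpha can be illustrated with a minimal sketch of a LoRA-adapted linear layer in plain Python. It assumes the standard LoRA formulation, in which the frozen weight W is complemented by a low-rank product B·A scaled by lora_alpha / r, with B starting at zero. The helper names and the toy sizes are hypothetical, and dropout is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: a 3x3 frozen weight with a rank-2 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # shape r x d_in
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # shape d_out x r, zero at start
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, lora_alpha=32, r=2))
```

Because B is zero at the start, the adapted layer initially behaves exactly like the pre-trained one, and training moves it away from that point only through the small matrices A and B.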

In [21]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

Once LoRA and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes the quantized model and the LoraConfig that specifies how to configure the model for training with LoRA.

In [22]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 2,359,296 || all params: 617,433,088 || trainable%: 0.3821
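The trainable parameter count reported above can be reproduced with a short calculation. The architecture values below (hidden size 1024, 12 encoder layers and 12 decoder layers) are assumptions about the checkpoint's configuration, since they are not printed in this notebook. With them, the arithmetic agrees with the reported figure.

```python
# Assumed architecture values for the checkpoint (not printed in this notebook).
d_model = 1024
encoder_layers = 12
decoder_layers = 12
r = 16  # rank set in LoraConfig

# Encoder layers have self-attention; decoder layers have self- and cross-attention.
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_matrices = attention_blocks * 2  # q_proj and v_proj in each block

# Each adapted d_model x d_model matrix gets A (r x d_model) and B (d_model x r).
params_per_matrix = r * d_model + d_model * r
trainable_params = adapted_matrices * params_per_matrix
print(f"{trainable_params:,}")  # 2,359,296
```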


The function that is responsible for putting together samples inside a batch is called a collate function. It is an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them. This is not possible in our case since the inputs we have are not all of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding.

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides such a function via DataCollatorForSeq2Seq. It takes the tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding on the left or on the right of the inputs) and, optionally, the model, which it uses to build the decoder input ids from the labels; this is why the model had to be instantiated first:

In [23]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer, model=lora_model, pad_to_multiple_of=8
)
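What dynamic padding does to a batch can be sketched in plain Python. This is an illustration under stated assumptions rather than the collator's code: padding is applied on the right, the pad token id is 1 (the value that appears in the model's generation config), labels are padded with -100 so that those positions are ignored by the loss, and the label sequences are shortened, made-up examples. The helper `pad_batch` is a hypothetical name.

```python
def pad_batch(sequences, pad_value, pad_to_multiple_of=8):
    """Right-pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of  # ceiling
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]


# Two source sentences of different lengths (ids taken from samples shown earlier).
input_ids = [
    [256079, 34569, 250438, 5823, 253935, 2],
    [256079, 151003, 248300, 253131, 249191, 2771, 248203, 252773, 2],
]
padded_inputs = pad_batch(input_ids, pad_value=1)  # assumed pad token id
attention_mask = [[int(tok != 1) for tok in seq] for seq in padded_inputs]

# Shortened, illustrative label sequences padded with -100.
padded_labels = pad_batch([[256047, 11873, 2], [256047, 69, 46520, 2]], pad_value=-100)

print([len(seq) for seq in padded_inputs])  # [16, 16]
```

The longest input has 9 tokens, so both are padded to 16, the next multiple of 8. A batch containing only short sentences would be padded to 8 instead, which is the saving that padding per batch provides.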

## Evaluation

The last thing to define for our Seq2SeqTrainer is how to compute the metrics that evaluate the model's predictions against the references. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu). The cell below loads it, together with the helpers needed for the COMET metric:

In [24]:
from evaluate import load

metric = load("sacrebleu")
# Import the COMET helpers used to download and load the pretrained model
from comet import download_model, load_from_checkpoint
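For intuition about what BLEU measures, the following plain-Python sketch computes the clipped n-gram precision at the core of the metric. It is a simplification: sacreBLEU also applies its own tokenization, combines the precisions for n = 1 to 4 with a geometric mean, and multiplies by a brevity penalty. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped)."""
    hyp = ngram_counts(hypothesis.split(), n)
    ref = ngram_counts(reference.split(), n)
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return matched / total if total else 0.0


hypothesis = "the cat sat on the mat"
reference = "the cat is on the mat"
p1 = clipped_precision(hypothesis, reference, 1)
p2 = clipped_precision(hypothesis, reference, 2)
print(p1, p2)
```

Here five of the six hypothesis words and three of its five bigrams appear in the reference, so the precision drops as n grows. This is how BLEU rewards matching word order and not only matching vocabulary.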


We need to define a function compute_metrics to compute BLEU scores at each epoch. The example below performs a basic post-processing to decode the predictions into texts:

In [25]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace negative ids in the labels as we can't decode them.
    # labels = np.where(labels < 0, labels, tokenizer.pad_token_id)
    for i in range(len(labels)):
        labels[i] = [tokenizer.pad_token_id if j < 0 else j for j in labels[i]]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result


def compute_comet(eval_preds):
    # Download and load the COMET model
    comet_model_path = download_model("Unbabel/wmt22-comet-da")
    comet_model = load_from_checkpoint(comet_model_path)
    preds, labels, source = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace negative ids in the labels as we can't decode them.
    # labels = np.where(labels < 0, labels, tokenizer.pad_token_id)
    for i in range(len(labels)):
        labels[i] = [tokenizer.pad_token_id if j < 0 else j for j in labels[i]]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    for i in range(len(source)):
        source[i] = [tokenizer.pad_token_id if j < 0 else j for j in source[i]]
    source = tokenizer.batch_decode(source, skip_special_tokens=True)
    source = [[s.strip()] for s in source]

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # evaluate comet
    data = [
        {"src": s, "mt": hyp, "ref": ref}
        for hyp, ref, s in zip(decoded_preds, decoded_labels, source)
    ]
    comet_score = comet_model.predict(data, batch_size=64, gpus=1)
    return comet_score.system_score

## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) is to define a [Seq2SeqTrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments) that will contain all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can set them depending on the recommendations from the model developers:

In [26]:
from transformers import Seq2SeqTrainingArguments

batch_size = 32
model_name = checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-jp-to-en",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=2,
    predict_with_generate=True,
)



Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training arguments, the training and validation datasets, the tokenizer, the data collator and the compute_metrics function:

In [27]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    lora_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer:

In [28]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 1.5278 | 1.435621 | 26.6186 | 11.9035 |
| 2 | 1.4928 | 1.419115 | 26.9024 | 11.9454 |


TrainOutput(global_step=2412, training_loss=1.5493338665558924, metrics={'train_runtime': 1597.065, 'train_samples_per_second': 48.305, 'train_steps_per_second': 1.51, 'total_flos': 2629714415714304.0, 'train_loss': 1.5493338665558924, 'epoch': 2.0})
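As a sanity check, the reported global_step can be derived from the number of training samples left after filtering. The sketch below sums the source-length histogram printed earlier and assumes a single device with no gradient accumulation.

```python
import math

# Source-length histogram printed earlier: token length -> number of training samples.
length_histogram = {
    3: 116, 4: 213, 5: 887, 6: 2036, 7: 3144, 8: 3618, 9: 3956,
    10: 4278, 11: 4304, 12: 4243, 13: 3801, 14: 3306, 15: 2598, 16: 2073,
}
train_samples = sum(length_histogram.values())

batch_size = 32
num_train_epochs = 2
# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(train_samples, steps_per_epoch, total_steps)  # 38573 1206 2412
```

The result matches global_step=2412, which confirms that training used only the 38,573 sentence pairs that passed the length filter, out of the 84,606 pairs in the original training split.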

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/text_generation#transformers.GenerationMixin.generate). This method takes care of encoding the input, feeding the encoded hidden states to the decoder via the cross-attention layers, and auto-regressively generating the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) for the details of generating text with Transformers. There is also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder), which explains how generation works in general in encoder-decoder models.

Let us first load the default inference parameters of NLLB:

In [29]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
)

print(generation_config)

GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "eos_token_id": 2,
  "max_length": 200,
  "pad_token_id": 1
}



We prepare the test set in batches to be translated:

In [30]:
test_batch_size = 32
batch_tokenized_test = tokenized_datasets["test"].batch(test_batch_size)


We process the test set in batches, adding padding and converting to tensors, and then perform inference with num_beams = 1 and do_sample = False, that is, greedy search.

In [31]:
number_of_batches = len(batch_tokenized_test["src"])
output_sequences = []
for i in range(number_of_batches):
    inputs = tokenizer(
        batch_tokenized_test["src"][i],
        max_length=max_tok_length,
        truncation=True,
        return_tensors="pt",
        padding=True,
    )
    output_batch = lora_model.generate(
        generation_config=generation_config,
        input_ids=inputs["input_ids"].cuda(),
        attention_mask=inputs["attention_mask"].cuda(),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        max_length=max_tok_length,
        num_beams=1,
        do_sample=False,
    )
    output_sequences.extend(output_batch.cpu())
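Greedy search itself is simple enough to sketch in plain Python. The toy "model" below is a hypothetical lookup table that scores the next token from the last generated one. A real encoder-decoder model would instead condition on the encoded source sentence and on the whole generated prefix, so this only illustrates the decoding loop.

```python
def greedy_decode(next_token_scores, start_id, eos_id, max_length):
    """Repeatedly append the highest-scoring next token until EOS or max_length."""
    sequence = [start_id]
    while len(sequence) < max_length:
        scores = next_token_scores(sequence)
        best = max(scores, key=scores.get)
        sequence.append(best)
        if best == eos_id:
            break
    return sequence


# Hypothetical scoring table keyed by the last generated token.
TOY_SCORES = {
    0: {5: 0.7, 6: 0.2, 2: 0.1},
    5: {6: 0.6, 2: 0.4},
    6: {2: 0.9, 5: 0.1},
}


def toy_model(sequence):
    return TOY_SCORES[sequence[-1]]


print(greedy_decode(toy_model, start_id=0, eos_id=2, max_length=16))  # [0, 5, 6, 2]
```

At every step only the single best token is kept. This is what num_beams = 1 with do_sample = False selects, whereas beam search would keep several candidate prefixes alive.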

In [32]:
result = compute_metrics((output_sequences, tokenized_datasets["test"]["labels"]))
print(f'BLEU score: {result["bleu"]}')

BLEU score: 26.2851


In [33]:
# Store the score under its own name so the sacreBLEU `metric` object is not overwritten
comet_score = compute_comet(
    (
        output_sequences,
        tokenized_datasets["test"]["labels"],
        tokenized_datasets["test"]["input_ids"],
    )
)
print(comet_score)


0.731359436820999
