# Fine-tuning, 1-shot inference, DoRA and BERTScore

# Fine-tuning

Fine-tuning is the transfer-learning step in which the parameters of a model trained on a large dataset are updated by continuing training on a smaller dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a model pre-trained on a large amount of data to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [None]:
!pip install datasets evaluate peft bitsandbytes transformers==4.45 #accelerate
!pip install sacrebleu unbabel-comet
!pip install huggingface_hub



In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) makes it easy to access and load datasets, including your own: you can load a local or remote dataset by following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("NilanE/ParallelFiction-Ja_En-100k", split="train")

print(raw_datasets)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['src', 'trg', 'meta'],
    num_rows: 106048
})


In [None]:
raw_datasets[0]

{'src': '77.素人の気づき\n「いい、アーニャ。今から行ったとしても、陛下が実際に選抜を通じて選ぶ妃は数人から数十人でしょう」\n「うん、そうだね」\n「ですが、それに既存の騎士選抜を重ね合わせることで、世の中の女達は陛下に選ばれるために、陛下が定めた基準――方向性に向かって成長していく流れになります」\n「......わあ」\n「騎士はどうしても男が中心、女が騎士になろうと考えるのは一部の物好き。ですが、玉の輿を望まない女なんてよほどの事でもなければいません。世の中の女は、陛下に気に入られる為に奮起するのです」\n「そこまで考えて......すごい!」\nアーニャは俺を尊敬しきった眼差しで見つめてきた。\n「余の考えを一瞬で読み切ったお前が凄いよ」\nそういい、微笑みながらオードリーを見た。\n「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」\n「なんだ藪から棒に。この流れだと、女として、と言う意味なんだな?」\n聞きかえすと、オードリーは静かに頷いた。\n「なんでそんな事を聞く」\n「陛下は重要な事を忘れていらっしゃるように見受けられましたので」\n「重要な事?」\nなんか忘れてるか?\n「上皇陛下には多くの妃がおります。そして、\n「ふむ」\n様々な、という所で少しだけ吹き出しそうになった。\n中には臣下の妻だった女や、かつて自分の父親の妃――義理の母親だった女も妃にした。\n有名な話だ。\n時の皇帝が崩御した時は、政略的に妃にはしたが、まだ六歳という幼さ故に手付かずの女の子が一人いた。\nつまり、六歳の未亡人と言うことだ。\nそれが成長し、適齢期になった時、その美しさを見初めた父上が無理矢理自分の妻にした。\n武勇伝には事欠かないのが上皇、父上なのである。\n「臣下の妻をものにしたとき、自分の義理の母にあたる少女を手籠めにしたとき、上皇陛下は誰かに咎められまして?」\n「いいや?」\n皇帝がなぜ、その程度の事で咎められるものか。\nもっとあり得ない、非人道的な事をやっても咎められもしないのが皇帝という物だ。\n「ええ、陛下の反応そのままです」\n「何が言いたい」\n「陛下は貴族の義務と良くおっしゃってますが、貴族の権利を忘れているように思います」\n「......ふむ」\nなるほど、もっと地位と権力を享受しろって言いたいのか。\

In [None]:
raw_datasets = raw_datasets.remove_columns(["meta"])

## Preprocess

In [None]:
# Flatten and reduce the dataset
max_tok_length = 16


def flatten_examples(batch):
    flat_jp = []
    flat_en = []
    for jp, en in zip(batch["src"], batch["trg"]):
        # The full chapters are too large for the available resources, so we
        # pre-filter by English word count; this cuts tokenization time before
        # the later filter on tokenized length.
        # Skip the first 10 lines to avoid opening sentences and chapter titles.
        for e, j in zip(en.split("\n")[10:], jp.split("\n")[10:]):
            if len(e.split()) <= max_tok_length:
                flat_jp.append(j)
                flat_en.append(e)
                # Keep at most one sentence of suitable length per chapter
                break
        if len(flat_jp) == 1000:
            # Stop at 1,000 sentence pairs per batch to limit training time
            break

    flat_data = {"src": flat_jp, "trg": flat_en}
    return flat_data


# Apply flattening
flat_dataset = raw_datasets.map(
    flatten_examples,
    batched=True,
    remove_columns=raw_datasets.column_names,
)

In [None]:
flat_dataset[0]

{'src': '「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」',
 'trg': '"That\'s all well and good, but is there no woman you like?"'}

Log in to Hugging Face to be granted access to Llama 2 with 7B parameters:

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineG

We can apply the tokenizer function to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also gives us extra flexibility if we need more preprocessing than just tokenization. The map() method applies a function to each element of the dataset, or to a whole batch of elements at a time when batched=True.
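To make the batched behaviour concrete, here is a minimal pure-Python sketch of what a batched map does conceptually. The helpers `toy_batched_map` and `keep_short_lines` are hypothetical and written only for illustration; they are not part of the Datasets library, which additionally handles Arrow storage, caching and column removal.

```python
def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of a column-oriented table and concatenate the results."""
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        # A batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out


def keep_short_lines(batch, max_words=3):
    """Return one row per short line, so the output can have a different number of rows."""
    short = []
    for text in batch["text"]:
        short += [line for line in text.split("\n") if len(line.split()) <= max_words]
    return {"line": short}


table = {"text": ["a b\nc d e f g", "h", "i j k\nl m"]}
flat = toy_batched_map(table, keep_short_lines)
print(flat)
```

Because the function receives and returns whole columns, it may emit more or fewer rows than it was given, which is what the flattening step in this notebook relies on.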

In our case, each sample pair is going to be preprocessed according to the needs of the model that is to be fine-tuned. In the case of Llama2, it is recommended to explicitly state a task prompt for each source sentence:

In [None]:
from transformers import AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-hf"
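tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    token=True,  # replaces the deprecated use_auth_token argument
    padding=True,
    pad_to_multiple_of=8,
    truncation=True,
    max_length=max_tok_length,
    padding_side="left",  # pad on the left, as needed for batched generation
)
# Llama 2 ships without a padding token. "[PAD]" is not in its vocabulary,
# so it resolves to id 0 (<unk>), as the padded outputs further down show.
tokenizer.pad_token = "[PAD]"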



In [None]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["src"],
        text_target=sample["trg"],
    )
    return model_inputs

The Datasets library applies this processing by adding new fields to the dataset, one for each key in the dictionary returned by the preprocessing function, that is, *input_ids*, *attention_mask* and *labels*. We can check what the preprocess_function is doing with a small sample:

In [None]:
sample = raw_datasets.select(range(2))
model_input = preprocess_function(sample)
print(model_input)

{'input_ids': [[1, 29871, 29955, 29955, 29889, 31605, 30313, 30199, 31648, 230, 132, 168, 30538, 13, 30481, 30298, 30298, 30330, 30310, 30185, 30635, 30561, 30267, 31482, 30412, 30513, 30448, 30665, 30366, 30364, 30326, 30466, 30723, 30330, 236, 156, 158, 30557, 30458, 31525, 236, 157, 158, 30353, 31562, 233, 141, 159, 30396, 30768, 31115, 30466, 31562, 31782, 232, 169, 134, 30449, 30354, 30313, 30412, 30513, 30354, 30802, 30313, 30499, 30326, 31414, 30465, 30482, 13, 30481, 30465, 30389, 30330, 31110, 30465, 30955, 31684, 30482, 13, 30481, 30499, 30427, 30458, 30330, 31110, 30553, 30353, 233, 154, 165, 30946, 30199, 236, 171, 145, 30927, 31562, 233, 141, 159, 30396, 30908, 31684, 30733, 31068, 31095, 30332, 30589, 30364, 30499, 30330, 30793, 30199, 30275, 30199, 30647, 31883, 30449, 236, 156, 158, 30557, 30353, 31562, 31254, 30553, 30332, 30366, 30954, 30353, 30330, 236, 156, 158, 30557, 30458, 30495, 30954, 30366, 31359, 233, 189, 153, 30217, 30217, 30525, 31331, 30952, 30353, 31331,

In [None]:
for sample in model_input["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(sample))

['<s>', '▁', '7', '7', '.', '素', '人', 'の', '気', '<0xE3>', '<0x81>', '<0xA5>', 'き', '<0x0A>', '「', 'い', 'い', '、', 'ア', 'ー', 'ニ', 'ャ', '。', '今', 'か', 'ら', '行', 'っ', 'た', 'と', 'し', 'て', 'も', '、', '<0xE9>', '<0x99>', '<0x9B>', '下', 'が', '実', '<0xE9>', '<0x9A>', '<0x9B>', 'に', '選', '<0xE6>', '<0x8A>', '<0x9C>', 'を', '通', 'じ', 'て', '選', 'ぶ', '<0xE5>', '<0xA6>', '<0x83>', 'は', '数', '人', 'か', 'ら', '数', '十', '人', 'で', 'し', 'ょ', 'う', '」', '<0x0A>', '「', 'う', 'ん', '、', 'そ', 'う', 'だ', 'ね', '」', '<0x0A>', '「', 'で', 'す', 'が', '、', 'そ', 'れ', 'に', '<0xE6>', '<0x97>', '<0xA2>', '存', 'の', '<0xE9>', '<0xA8>', '<0x8E>', '士', '選', '<0xE6>', '<0x8A>', '<0x9C>', 'を', '重', 'ね', '合', 'わ', 'せ', 'る', 'こ', 'と', 'で', '、', '世', 'の', '中', 'の', '女', '達', 'は', '<0xE9>', '<0x99>', '<0x9B>', '下', 'に', '選', 'ば', 'れ', 'る', 'た', 'め', 'に', '、', '<0xE9>', '<0x99>', '<0x9B>', '下', 'が', '定', 'め', 'た', '基', '<0xE6>', '<0xBA>', '<0x96>', '―', '―', '方', '向', '性', 'に', '向', 'か', 'っ', 'て', '成', '長', 'し', 'て', 'い', 'く', '流', 'れ', 'に

We can recover the source text by applying the tokenizer's [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) method:

In [None]:
tokenizer.batch_decode(model_input["input_ids"])

['<s> 77.素人の気づき\n「いい、アーニャ。今から行ったとしても、陛下が実際に選抜を通じて選ぶ妃は数人から数十人でしょう」\n「うん、そうだね」\n「ですが、それに既存の騎士選抜を重ね合わせることで、世の中の女達は陛下に選ばれるために、陛下が定めた基準――方向性に向かって成長していく流れになります」\n「......わあ」\n「騎士はどうしても男が中心、女が騎士になろうと考えるのは一部の物好き。ですが、玉の輿を望まない女なんてよほどの事でもなければいません。世の中の女は、陛下に気に入られる為に奮起するのです」\n「そこまで考えて......すごい!」\nアーニャは俺を尊敬しきった眼差しで見つめてきた。\n「余の考えを一瞬で読み切ったお前が凄いよ」\nそういい、微笑みながらオードリーを見た。\n「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」\n「なんだ藪から棒に。この流れだと、女として、と言う意味なんだな?」\n聞きかえすと、オードリーは静かに頷いた。\n「なんでそんな事を聞く」\n「陛下は重要な事を忘れていらっしゃるように見受けられましたので」\n「重要な事?」\nなんか忘れてるか?\n「上皇陛下には多くの妃がおります。そして、\n「ふむ」\n様々な、という所で少しだけ吹き出しそうになった。\n中には臣下の妻だった女や、かつて自分の父親の妃――義理の母親だった女も妃にした。\n有名な話だ。\n時の皇帝が崩御した時は、政略的に妃にはしたが、まだ六歳という幼さ故に手付かずの女の子が一人いた。\nつまり、六歳の未亡人と言うことだ。\nそれが成長し、適齢期になった時、その美しさを見初めた父上が無理矢理自分の妻にした。\n武勇伝には事欠かないのが上皇、父上なのである。\n「臣下の妻をものにしたとき、自分の義理の母にあたる少女を手籠めにしたとき、上皇陛下は誰かに咎められまして?」\n「いいや?」\n皇帝がなぜ、その程度の事で咎められるものか。\nもっとあり得ない、非人道的な事をやっても咎められもしないのが皇帝という物だ。\n「ええ、陛下の反応そのままです」\n「何が言いたい」\n「陛下は貴族の義務と良くおっしゃってますが、貴族の権利を忘れているように思います」\n「......ふむ」\nなるほど、もっと地位と権力を享受しろって言いたいのか。\n「話

Now, we can apply the preprocess_function to the flattened dataset:

In [None]:
tokenized_datasets = flat_dataset.map(preprocess_function, batched=True)

We are going to filter the tokenized datasets by maximum number of tokens in source and target language:

In [None]:
tokenized_datasets = tokenized_datasets.filter(
    lambda x: len(x["input_ids"]) <= max_tok_length
    and len(x["labels"]) <= max_tok_length,
    desc=f"Discarding source and target sentences with more than {max_tok_length} tokens",
)

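We then subsample 2,000 sentence pairs, split them into training, validation and test sets (80/10/10), and take a quick look at the source-length histogram of the training set: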

In [None]:
# split the data into train and test and validation
from datasets import DatasetDict

small_tokenized_datasets = tokenized_datasets.shuffle(seed=42).select(range(2000))

tokenized_datasets = small_tokenized_datasets.train_test_split(test_size=0.2)
test_valid = tokenized_datasets["test"].train_test_split(test_size=0.5)
tokenized_datasets["test"] = test_valid["test"]
tokenized_datasets["valid"] = test_valid["train"]
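The split sizes follow directly from the two `train_test_split` calls. A small sketch of the arithmetic, assuming the 2,000 selected pairs from the cell above; the helper `split_sizes` is hypothetical and only mirrors how a fractional test size becomes a row count:

```python
def split_sizes(n_rows, test_fraction):
    """Return (train_rows, test_rows) for a fractional test size."""
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test


n_train, n_heldout = split_sizes(2000, 0.2)    # first split: 80% / 20%
n_valid, n_test = split_sizes(n_heldout, 0.5)  # second split: halve the held-out part
print(n_train, n_valid, n_test)
```

Neither `train_test_split` call above fixes a seed, so the membership of each split can differ between runs; passing `seed=` to both calls would make the partition reproducible.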

In [None]:
dic = {}
for sample in tokenized_datasets["train"]:
    sample_length = len(sample["input_ids"])
    if sample_length not in dic:
        dic[sample_length] = 1
    else:
        dic[sample_length] += 1

for i in range(1, max_tok_length + 1):
    if i in dic:
        print(f"{i:>2} {dic[i]:>3}")

 2   3
 3   1
 4   5
 5  21
 6  78
 7  84
 8 108
 9 139
10 133
11 153
12 151
13 192
14 162
15 178
16 192


Checking a sample after filtering by maximum number of tokens:

In [None]:
for sample in tokenized_datasets["train"].select(range(5)):
    print(sample["input_ids"])
    print(sample["attention_mask"])
    print(sample["labels"])

[1, 29871, 30481, 30290, 30290, 30290, 29991, 29973, 30449, 30330, 30449, 30298, 30290, 30290, 30290, 30482]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 338, 29892, 4874, 856, 376]
[1, 29871, 30481, 30441, 30366, 236, 132, 141, 31298, 30353, 30538, 30466, 31684, 30482]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 376, 27796, 1250, 304, 1708, 29892, 1449, 29892, 1047, 5410, 1699, 1497, 1790, 29889]
[1, 29871, 30987, 30255, 30203, 29991]
[1, 1, 1, 1, 1, 1]
[1, 3446, 15789, 29991]
[1, 29871, 30481, 30641, 30330, 30641, 30453, 30458, 30364, 3045, 636, 30482]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 518, 29911, 29899, 25271, 366, 17361, 29871]
[1, 29871, 30481, 31244, 30486, 30723, 31855, 31992, 30332, 29973, 30482]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1938, 366, 864, 697, 29973]


In [None]:
import torch

src = "Japanese"
tgt = "English"
task_prefix = f"Translate from {src} to {tgt}:\n"
s = ""

prefix_tok_len = len(tokenizer.encode(f"{task_prefix}{src}: {s} = {tgt}: "))
max_tok_len = prefix_tok_len
# Adding 2 for new line in target sentence and eos_token_id token
max_tok_len += 2 * max_tok_length + 2


def preprocess4training_function(sample):

    sample_size = len(sample["src"])

    # Creating the prompt with the task description for each source sentence
    inputs = [f"{task_prefix}{src}: {s} = {tgt}: " for s in sample["src"]]

    # Appending new line after each sample in the batch
    targets = [f"{s}\n" for s in sample["trg"]]

    # Applying the Llama2 tokenizer to the inputs and targets
    # to obtain "input_ids" (token_ids) and "attention mask"
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)

    # Each input is appended with its target
    # Each target is prepended with as many special token id (-100) as the original input length
    # Both input and target (label) has the same max_tok_len
    # Attention mask is all 1s
    for i in range(sample_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id]
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    # Each input is applied left padding up to max_tok_len
    # Attention mask is 0 for padding
    # Each target (label) is left filled with special token id (-100)
    # Finally inputs, attention_mask and targets (labels) are truncated to max_tok_len
    for i in range(sample_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_tok_len - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (
            max_tok_len - len(sample_input_ids)
        ) + model_inputs["attention_mask"][i]
        labels["input_ids"][i] = [-100] * (
            max_tok_len - len(sample_input_ids)
        ) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(
            model_inputs["input_ids"][i][:max_tok_len]
        )
        model_inputs["attention_mask"][i] = torch.tensor(
            model_inputs["attention_mask"][i][:max_tok_len]
        )
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_tok_len])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


s = ""
num_shots = 1
shots = ""

prefix_tok_len = len(tokenizer.encode(f"{task_prefix}{shots}{src}: {s} = {tgt}: "))
shot_tok_len = len(tokenizer.encode(f"{src}: {s} = {tgt}: {s}\n"))
max_tok_len_test = prefix_tok_len
max_tok_len_test += num_shots * (shot_tok_len + 2 * max_tok_length)
max_tok_len_test += max_tok_length

random_seed = 13
sample = tokenized_datasets["train"].shuffle(seed=random_seed).select(range(num_shots))
for s in sample:
    shots += f"{src}: {s['src']} = {tgt}: {s['trg']}\n"


def preprocess4test_function(sample):
    inputs = [f"{task_prefix}{shots}{src}: {s} = {tgt}: " for s in sample["src"]]
    model_inputs = tokenizer(
        inputs,
        max_length=max_tok_len_test,
        truncation=True,
        return_tensors="pt",
        padding=True,
    )
    return model_inputs
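The 1-shot prompt layout can be illustrated without the tokenizer. The sketch below rebuilds the same string format; `build_prompt` is a hypothetical helper, and the demonstration pair is simply the flattened sample printed earlier, whereas the notebook draws its shot at random from the training split.

```python
src = "Japanese"
tgt = "English"
task_prefix = f"Translate from {src} to {tgt}:\n"

# Illustrative demonstration pair (an assumption for this sketch)
shot_pairs = [
    (
        "「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」",
        '"That\'s all well and good, but is there no woman you like?"',
    )
]


def build_prompt(source_sentence, shot_pairs):
    """Task description, then the solved examples, then the unsolved source sentence."""
    shots = "".join(f"{src}: {s} = {tgt}: {t}\n" for s, t in shot_pairs)
    return f"{task_prefix}{shots}{src}: {source_sentence} = {tgt}: "


prompt = build_prompt("「重要な事?」", shot_pairs)
print(prompt)
```

The prompt ends right after `English: `, so the model's continuation up to the next newline can be read as the translation of the last source sentence.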

We can check what the preprocess4training_function is doing:

In [None]:
sample = tokenized_datasets["train"].select(range(2))
model_input = preprocess4training_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))

{'input_ids': [tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     1,  4103,  9632,   515, 10369,   304,  4223, 29901,
           13, 29967, 21419,   968, 29901, 29871, 30481, 30290, 30290, 30290,
        29991, 29973, 30449, 30330, 30449, 30298, 30290, 30290, 30290, 30482,
          353,  4223, 29901, 29871,     1,   338, 29892,  4874,   856,   376,
           13,     2]), tensor([    0,     0,     0,     0,     0,     1,  4103,  9632,   515, 10369,
          304,  4223, 29901,    13, 29967, 21419,   968, 29901, 29871, 30481,
        30441, 30366,   236,   132,   141, 31298, 30353, 30538, 30466, 31684,
        30482,   353,  4223, 29901, 29871,     1,   376, 27796,  1250,   304,
         1708, 29892,  1449, 29892,  1047,  5410,  1699,  1497,  1790, 29889,
           13,     2])], 'attention_mask': [tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

We need to replace -100 with the padding token id (0 here) before applying batch_decode; the padding positions are then decoded as `<unk>`:

In [None]:
import numpy as np

for i in range(len(model_input["labels"])):
    print(
        tokenizer.batch_decode(
            [
                np.where(
                    model_input["labels"][i] < 0,
                    tokenizer.pad_token_id,
                    model_input["labels"][i],
                )
            ]
        )
    )

['<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><s> is, yes... "\n</s>']
['<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><s> "Come back to play, again, sometime," said another.\n</s>']
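The layout produced by preprocess4training_function can be reproduced on toy token ids. This is a minimal sketch, assuming padding id 0 and end-of-sequence id 2 as in the printed tensors; `build_training_example` is a hypothetical helper, not the notebook's function.

```python
IGNORE_INDEX = -100  # label value that the PyTorch cross-entropy loss skips by default
PAD_ID = 0           # assumption: padding id 0, as in the printed tensors
EOS_ID = 2           # assumption: end-of-sequence id 2, as in the printed tensors


def build_training_example(prompt_ids, target_ids, max_len):
    """Concatenate prompt and target, mask the prompt in the labels, then left-pad."""
    target_ids = target_ids + [EOS_ID]
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    attention_mask = [1] * len(input_ids)

    # Left padding: pad ids in the inputs, 0 in the mask, -100 in the labels
    n_pad = max(0, max_len - len(input_ids))
    input_ids = ([PAD_ID] * n_pad + input_ids)[:max_len]
    attention_mask = ([0] * n_pad + attention_mask)[:max_len]
    labels = ([IGNORE_INDEX] * n_pad + labels)[:max_len]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_training_example(prompt_ids=[1, 11, 12, 13], target_ids=[21, 22], max_len=9)
for name, values in example.items():
    print(f"{name:>14}: {values}")
```

Only the target positions keep real token ids in the labels, so the loss is meant to be computed on the translation and not on the prompt or the padding.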


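In the case of the test set, we just preprocess the inputs (source sentences). The 1-shot version of preprocess4test_function used in this notebook was already defined above; the zero-shot variant is kept below, commented out, for reference: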

In [None]:
# def preprocess4test_function(sample):
#     inputs = [f"{task_prefix}{src}: {s} = {tgt}: " for s in sample["src"]]
#     model_inputs = tokenizer(inputs,padding=True,)
#     return model_inputs

We can check what the preprocess4test_function is doing:

In [None]:
sample = tokenized_datasets["train"].select(range(2))
model_input = preprocess4test_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))

{'input_ids': tensor([[    1,  4103,  9632,   515, 10369,   304,  4223, 29901,    13, 29967,
         21419,   968, 29901, 29871, 30617, 30203, 30391, 30907, 30203, 30199,
           353,  4223, 29901, 11810,   793,    13, 29967, 21419,   968, 29901,
         29871, 30481, 30290, 30290, 30290, 29991, 29973, 30449, 30330, 30449,
         30298, 30290, 30290, 30290, 30482,   353,  4223, 29901, 29871],
        [    0,     0,     1,  4103,  9632,   515, 10369,   304,  4223, 29901,
            13, 29967, 21419,   968, 29901, 29871, 30617, 30203, 30391, 30907,
         30203, 30199,   353,  4223, 29901, 11810,   793,    13, 29967, 21419,
           968, 29901, 29871, 30481, 30441, 30366,   236,   132,   141, 31298,
         30353, 30538, 30466, 31684, 30482,   353,  4223, 29901, 29871]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1],
        [0, 0

Preprocessing train and dev sets:

In [None]:
preprocessed_train_dataset = tokenized_datasets["train"].map(
    preprocess4training_function, batched=True
)
preprocessed_dev_dataset = tokenized_datasets["valid"].map(
    preprocess4training_function, batched=True
)

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
for sample in preprocessed_train_dataset.select(range(5)):
    print(sample["input_ids"])
    print(sample["attention_mask"])
    print(sample["labels"])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 10369, 304, 4223, 29901, 13, 29967, 21419, 968, 29901, 29871, 30481, 30290, 30290, 30290, 29991, 29973, 30449, 30330, 30449, 30298, 30290, 30290, 30290, 30482, 353, 4223, 29901, 29871, 1, 338, 29892, 4874, 856, 376, 13, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1, 338, 29892, 4874, 856, 376, 13, 2]
[0, 0, 0, 0, 0, 1, 4103, 9632, 515, 10369, 304, 4223, 29901, 13, 29967, 21419, 968, 29901, 29871, 30481, 30441, 30366, 236, 132, 141, 31298, 30353, 30538, 30466, 31684, 30482, 353, 4223, 29901, 29871, 1, 376, 27796, 1250, 304, 1708, 29892, 1449, 29892, 1047, 5

Preprocessing test set:

In [None]:
preprocessed_test_dataset = tokenized_datasets["test"].map(
    preprocess4test_function, batched=True
)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
for sample in preprocessed_test_dataset.select(range(5)):
    print(sample["input_ids"])
    print(sample["attention_mask"])
    print(sample["labels"])

[0, 0, 0, 0, 0, 1, 4103, 9632, 515, 10369, 304, 4223, 29901, 13, 29967, 21419, 968, 29901, 29871, 30617, 30203, 30391, 30907, 30203, 30199, 353, 4223, 29901, 11810, 793, 13, 29967, 21419, 968, 29901, 29871, 30793, 30967, 30458, 234, 184, 133, 31068, 30332, 30267, 353, 4223, 29901, 29871]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 450, 3186, 10614, 29889]
[0, 0, 1, 4103, 9632, 515, 10369, 304, 4223, 29901, 13, 29967, 21419, 968, 29901, 29871, 30617, 30203, 30391, 30907, 30203, 30199, 353, 4223, 29901, 11810, 793, 13, 29967, 21419, 968, 29901, 29871, 30481, 234, 170, 132, 30458, 235, 132, 161, 30538, 30366, 30298, 30482, 353, 4223, 29901, 29871]
[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 376, 29902, 864, 304, 8293, 372, 29908]
[0, 0, 1, 4103, 9632, 515, 10369, 304, 4223, 29

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index) is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4-bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>
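To give an intuition for what quantizing weights to 4 bits means, the sketch below applies plain uniform absmax quantization to one block of weights. This is a deliberate simplification: bitsandbytes' NF4 data type uses non-uniform levels and additionally quantizes the per-block scales, and the function names here are hypothetical.

```python
def quantize_block(weights, n_bits=4):
    """Map floats to signed integer codes in [-(2**(n_bits-1) - 1), 2**(n_bits-1) - 1]."""
    q_max = 2 ** (n_bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / q_max if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c * scale for c in codes]


weights = [0.70, -0.35, 0.10, 0.00, -0.70, 0.21]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.3f} max_error={max_error:.3f}")
```

Each weight is stored as a small integer plus one shared scale per block, and the rounding error is at most half a quantization step; computation still happens in the higher-precision compute dtype after dequantization.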


In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Pass the quantization_config to the from_pretrained method.

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [None]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=False,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share.

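Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. For example, to train with LoRA, you create a LoraConfig; this notebook uses AdaLoraConfig, the configuration class of AdaLoRA (a LoRA variant that adapts the rank during training), which accepts rank-schedule parameters such as init_r, tinit, tfinal and deltaT in addition to the following LoRA parameters: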

<ul>
<li>task_type: the task to train for (causal language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: determines which set of parameters is adapted</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>
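The roles of r and lora_alpha can be sketched with tiny matrices. The example below assumes the standard LoRA formulation, in which a frozen weight W is complemented by a trainable low-rank product scaled by lora_alpha / r; all values are hypothetical and the helpers are not PEFT functions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scaling = lora_alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


d_in, d_out, r, lora_alpha = 4, 3, 2, 4
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # frozen, d_out x d_in
A = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]                        # r x d_in
B_zero = [[0.0, 0.0] for _ in range(d_out)]       # B starts at zero: output unchanged
B_trained = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]  # hypothetical values after training
x = [2.0, 3.0, 4.0, 5.0]

y_start = lora_forward(x, W, A, B_zero, lora_alpha, r)
y_trained = lora_forward(x, W, A, B_trained, lora_alpha, r)
print(y_start, y_trained)

# Parameter counts for a 4096 x 4096 projection with rank 8
full_params = 4096 * 4096
lora_params = 8 * (4096 + 4096)
print(full_params, lora_params)
```

At realistic sizes the adapter is a small fraction of the frozen matrix, which is where the memory and storage savings come from.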

In [None]:
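from peft import AdaLoraConfig, get_peft_model

# AdaLoRA: a LoRA variant that starts every adapter at rank init_r and
# adjusts the rank budget during training according to the schedule below.
config = AdaLoraConfig(
    task_type="CAUSAL_LM",
    r=8,  # plain-LoRA rank; the parameter count printed below appears to follow init_r instead
    init_r=12,  # initial rank of each adapted matrix
    tinit=200,  # initial warm-up steps before rank allocation starts
    tfinal=1000,  # final fine-tuning steps of the rank schedule
    deltaT=10,  # steps between two budget allocations
    inference_mode=False,
    bias="none",
)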

Once the PEFT configuration and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes the quantized model and the configuration object (here an AdaLoraConfig) that specifies how to prepare the model for training.

In [None]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 6,292,224 || all params: 6,744,707,904 || trainable%: 0.0933
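The reported number of trainable parameters can be reproduced with a short calculation. The sketch assumes the Llama-2-7B shape (32 decoder layers, hidden size 4096), adapters on the q_proj and v_proj projections only, and the AdaLoRA parameterization of each update as two rank-12 matrices plus a vector of 12 singular values. These are assumptions, not values read from the model; the exact match makes them plausible but does not prove them.

```python
hidden_size = 4096     # assumption: Llama-2-7B hidden dimension
num_layers = 32        # assumption: Llama-2-7B decoder layers
modules_per_layer = 2  # assumption: adapters on q_proj and v_proj only
init_r = 12            # from the AdaLoraConfig above

# Two low-rank factors (init_r x hidden and hidden x init_r) plus init_r singular values
per_module = init_r * hidden_size + hidden_size * init_r + init_r
trainable = num_layers * modules_per_layer * per_module

all_params = 6_744_707_904  # "all params" as printed above
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```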


The function that is responsible for putting together samples inside a batch is called a collate function.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8
)
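As a rough picture of what a collate function does, the sketch below pads a list of variable-length examples to a common length that is a multiple of 8. `toy_collate` is a hypothetical helper that assumes left padding and padding id 0; it is not a reimplementation of DataCollatorForLanguageModeling, which also converts the batch to tensors and applies its own label handling.

```python
def toy_collate(examples, pad_id=0, ignore_index=-100, pad_to_multiple_of=8):
    """Left-pad every field to a shared length rounded up to pad_to_multiple_of."""
    longest = max(len(ex["input_ids"]) for ex in examples)
    target_len = -(-longest // pad_to_multiple_of) * pad_to_multiple_of  # ceiling division
    fill = {"input_ids": pad_id, "attention_mask": 0, "labels": ignore_index}
    batch = {name: [] for name in fill}
    for ex in examples:
        n_pad = target_len - len(ex["input_ids"])
        for name, value in fill.items():
            batch[name].append([value] * n_pad + ex[name])
    return batch


examples = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [-100, 6, 7]},
    {"input_ids": [1, 2, 3, 4, 5], "attention_mask": [1, 1, 1, 1, 1], "labels": [-100, -100, 3, 4, 5]},
]
batch = toy_collate(examples)
for name, rows in batch.items():
    print(name, rows)
```

Rounding the length up to a multiple of 8 is a common choice because it tends to suit GPU tensor cores better than arbitrary lengths.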

## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) is to define a [TrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) that will contain all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can set them depending on the recommendations from the model developers:

In [None]:
from transformers import TrainingArguments

batch_size = 8
gradient_accumulation_steps = 8
model_name = checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-jpn-to-en",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    warmup_steps=100,
    optim="adamw_bnb_8bit",
    prediction_loss_only=True,
    gradient_accumulation_steps=gradient_accumulation_steps,
    bf16=True,
    bf16_full_eval=True,
    group_by_length=True,
)



Once we have our model, we can define a Trainer by passing it all the objects constructed so far: the model, the training arguments (args), the training and validation datasets, the tokenizer and the data collator:

In [None]:
from transformers import Trainer

trainer = Trainer(
    lora_model,
    args,
    train_dataset=preprocessed_train_dataset,
    eval_dataset=preprocessed_dev_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer. Note that the [wandb library](https://docs.wandb.ai/guides) is used for logging, which requires a [wandb account and login](https://docs.wandb.ai/guides/integrations/huggingface/).

In [None]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mdiegorospagan[0m ([33mdiegorospagan-universitat-polit-cnica-de-val-ncia[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Epoch,Training Loss,Validation Loss
1,No log,5.456931
2,No log,5.265535
3,No log,4.968456


TrainOutput(global_step=75, training_loss=5.349221598307292, metrics={'train_runtime': 4214.458, 'train_samples_per_second': 1.139, 'train_steps_per_second': 0.018, 'total_flos': 1.06664719859712e+16, 'train_loss': 5.349221598307292, 'epoch': 3.0})
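The numbers in this output can be cross-checked against the training arguments. The sketch assumes a single GPU, 1,600 training examples (as in the preprocessing progress bar) and the default linear warm-up schedule; variable names are chosen for illustration.

```python
train_examples = 1600
per_device_batch = 8
grad_accum = 8
epochs = 3
warmup_steps = 100
train_runtime = 4214.458  # seconds, from the TrainOutput above

batches_per_epoch = train_examples // per_device_batch  # forward/backward passes per epoch
updates_per_epoch = batches_per_epoch // grad_accum     # optimizer steps per epoch
total_updates = updates_per_epoch * epochs
effective_batch = per_device_batch * grad_accum

# Fraction of the peak learning rate reached if warm-up is linear in the step count
warmup_fraction = min(1.0, total_updates / warmup_steps)
samples_per_second = train_examples * epochs / train_runtime

print(effective_batch, total_updates, warmup_fraction, round(samples_per_second, 3))
```

Two consequences are worth noting under these assumptions. With only 75 optimizer steps and warmup_steps=100, the learning rate would still be warming up when training ends and would never reach 1e-4. The "No log" entries in the training-loss column are also consistent with a logging interval longer than the whole run; lowering `logging_steps` would make the training loss visible.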

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate). This method takes care of encoding the input and auto-regressively generates the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers.

Let us first load the default inference parameters of Llama-2:

In [None]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
)

print(generation_config)

GenerationConfig {
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.6,
  "top_p": 0.9
}



As observed, the default search strategy for Llama-2 is Top-p with probability 0.9 and temperature 0.6 ($0<T<1$ amplifies output probability differences and makes output more deterministic). [The search strategy can be selected](https://huggingface.co/docs/transformers/en/generation_strategies) at inference time.
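The effect of temperature and top-p can be seen on a toy distribution. This is a minimal pure-Python sketch with made-up logits; the function names are hypothetical, and the library's own implementation operates on tensors and may differ in details such as tie handling.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a stable softmax."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.6)
nucleus = top_p_filter(p_sharp, 0.9)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
print({i: round(p, 3) for i, p in nucleus.items()})
```

With T = 0.6 the most likely token gains probability mass relative to T = 1, and the top-p step then samples only among the few tokens that together cover 90% of the mass.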

First, the test set is divided into small batches to reduce GPU memory consumption:

In [None]:
test_batch_size = 16
batch_tokenized_test = preprocessed_test_dataset.batch(test_batch_size)

Batching examples:   0%|          | 0/200 [00:00<?, ? examples/s]

Batches are passed to [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate) together with inference parameters that define the search strategy. In this case, num_beams = 1 and do_sample = False select greedy search, overriding the sampling defaults loaded above.

In [None]:
number_of_batches = len(batch_tokenized_test["input_ids"])
print(number_of_batches)
output_sequences = []
for i in range(number_of_batches):
    output_batch = lora_model.generate(
        generation_config=generation_config,
        input_ids=torch.tensor(batch_tokenized_test["input_ids"][i]).cuda(),
        attention_mask=torch.tensor(batch_tokenized_test["attention_mask"][i]).cuda(),
        max_length=max_tok_len_test,
        num_beams=1,
        do_sample=False,
    )
    output_sequences.extend(output_batch)

13


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
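Greedy search simply appends the highest-scoring next token at every step until the end-of-sequence token or the length limit is reached. The sketch below shows that loop on a hypothetical score table that depends only on the previous token, which is far simpler than a real language model, and also checks the batch count printed above.

```python
import math

# Hypothetical next-token scores: previous token -> {candidate: score}
NEXT_TOKEN_SCORES = {
    "<s>": {"The": 2.0, "A": 1.5},
    "The": {"world": 1.8, "end": 1.2},
    "world": {"ends": 2.5, "</s>": 0.3},
    "ends": {".": 3.0, "</s>": 1.0},
    ".": {"</s>": 4.0, "The": 0.1},
}


def greedy_decode(start="<s>", eos="</s>", max_new_tokens=10):
    """Repeatedly pick the argmax continuation until eos or the token budget."""
    tokens = [start]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens


decoded = greedy_decode()
print(decoded)

# 200 test sentences in batches of 16 give 13 batches (the last one is partial)
n_batches = math.ceil(200 / 16)
print(n_batches)
```

Greedy search is deterministic, which makes evaluation runs repeatable, at the cost of sometimes missing higher-probability sequences that beam search would find.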


## Evaluation

The output of the model is automatically evaluated against the reference translations. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu), together with BERTScore (also loaded through Evaluate) and the COMET model Unbabel/wmt22-comet-da.
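The core of BLEU is a clipped ("modified") n-gram precision. The sketch below computes it for a single hypothesis and reference using whitespace tokenization, which is an assumption made for brevity; sacreBLEU applies its own standardized tokenization, combines the precisions for n = 1 to 4 with a brevity penalty, and works at corpus level.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


hyp = "the the the cat"
ref = "the cat sat on the mat"
p1 = modified_precision(hyp, ref, 1)  # "the" is credited at most twice, "cat" once
p2 = modified_precision(hyp, ref, 2)  # only the bigram "the cat" matches
print(p1, p2)
```

Clipping prevents a hypothesis from being rewarded for repeating a correct word more often than it occurs in the reference.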

In [None]:
!pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [None]:
from evaluate import load
from comet import download_model, load_from_checkpoint

metric = load("sacrebleu")
comet_model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_model_path)

metric_bert = load("bertscore")


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/f49d328952c3470eff6bb6f545d62bfdb6e66304/checkpoints/model.ckpt`


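The function below performs basic post-processing: it decodes the predictions, strips the prompt, and keeps the first generated line as the translation. It then computes BLEU, COMET and BERTScore against the references: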

In [None]:
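import re


def compute_metrics(sample, output_sequences):
    """Decode the predictions, extract the translations and score them."""
    # Rebuild the prompts so they can be stripped from the decoded outputs.
    prompts = [f"{task_prefix}{shots}{src}: {s} = {tgt}: " for s in sample["src"]]
    preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    print(prompts)
    print(preds)
    for i, (prompt, pred) in enumerate(zip(prompts, preds)):
        # Keep only the first generated line. The pattern requires a trailing
        # newline, so an output without one yields an empty translation.
        match = re.search(r"^.*\n", pred.removeprefix(prompt).lstrip())
        preds[i] = match.group()[:-1] if match is not None else ""
    print(sample["src"])
    print(sample["trg"])
    print(preds)

    # Corpus-level BLEU computed with sacreBLEU.
    result = metric.compute(predictions=preds, references=sample["trg"])
    result = {"bleu": result["score"]}

    # COMET expects one dictionary per segment with source, hypothesis and reference.
    data = [
        {"src": s, "mt": hyp, "ref": ref}
        for hyp, ref, s in zip(preds, sample["trg"], sample["src"])
    ]
    comet_score = comet_model.predict(data, batch_size=64, gpus=1)
    result["comet"] = comet_score.system_score

    # BERTScore returns per-segment precision, recall and F1 lists.
    result["bert"] = metric_bert.compute(
        predictions=preds, references=sample["trg"], lang="en"
    )
    return result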
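The extraction step inside `compute_metrics` can be tried in isolation. The sketch below applies the same regular expression to two made-up decoded strings; the helper name `extract_translation` and the English text are hypothetical and do not come from the model.

```python
import re


def extract_translation(decoded, prompt):
    """Strip the prompt and return the first generated line, or "" if none ends in a newline."""
    match = re.search(r"^.*\n", decoded.removeprefix(prompt).lstrip())
    return match.group()[:-1] if match is not None else ""


toy_prompt = "Japanese: 「つまり?」 = English: "
# Hypothetical output that continues past the first line.
with_newline = toy_prompt + '"In other words?"\nJapanese: ...'
# Hypothetical output that stops right after the translation.
without_newline = toy_prompt + '"In other words?"'

first = extract_translation(with_newline, toy_prompt)
second = extract_translation(without_newline, toy_prompt)
print(repr(first))
print(repr(second))
```

The second case shows a limitation of this post-processing: when the generated text contains no newline, the pattern does not match and the translation is replaced by an empty string. Note that `str.removeprefix` requires Python 3.9 or later.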

In [None]:
result = compute_metrics(preprocessed_test_dataset, output_sequences)
print(f'BLEU score: {result["bleu"]}')
print(f'COMET score: {result["comet"]}')
print(f'BERT score: {result["bert"]}')

['Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 世界が終わる。 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「私が聞きたい」 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「この人何言ってるの...」 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「ええ、もちろん」 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「およそ三十です!」 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「......ロザリア、いけるか?」 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「つまり?」 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: は前方に目をやる。 = English: ', 'Translate from Japanese to English:\nJapanese: ダンジョンの = English: Titles\nJapanese: 「ん、その話はあとでな」 = English: ', 'Translate from

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|██████████| 4/4 [00:02<00:00,  2.00it/s]



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BLEU score: 12.505173514818223
COMET score: 0.7239703592658043
BERT score: {'precision': [0.7288625240325928, 0.9517673254013062, 0.7264583706855774, 0.7178710699081421, 0.8412075042724609, 0.8709268569946289, 0.721587061882019, 0.8860756158828735, 0.7155346274375916, 0.7747470140457153, 0.6951544880867004, 0.72942054271698, 0.8947609663009644, 0.8288899064064026, 0.7724850177764893, 0.7046142816543579, 0.7671598196029663, 0.707313597202301, 0.7554371356964111, 0.9529846906661987, 0.7895702123641968, 0.7593603134155273, 0.8250941038131714, 0.7295812964439392, 0.7600378394126892, 0.8317292332649231, 0.805440366268158, 0.7343340516090393, 0.8336200714111328, 0.744515597820282, 0.7495718002319336, 0.7120800018310547, 0.9210644960403442, 0.7191663384437561, 0.7060800790786743, 0.7791399955749512, 0.908947229385376, 0.696564793586731, 0.7266285419464111, 0.848137617111206, 0.910399317741394, 0.7505790591239929, 0.8199219703674316, 0.0, 0.8561563491821289, 0.8824658393859863, 0.7996928691864



In [None]:
bert = result["bert"]
print(f'BLEU score: {result["bleu"]}')
print(f'COMET score: {result["comet"]}')
print(
    f'Average BERTScore precision: {sum(bert["precision"]) / len(bert["precision"])}'
)
print(f'Average BERTScore recall: {sum(bert["recall"]) / len(bert["recall"])}')
print(f'Average BERTScore F1: {sum(bert["f1"]) / len(bert["f1"])}')

BLEU score: 12.505173514818223
COMET score: 0.7239703592658043
Average BERTScore precision: 0.7491818386316299
Average BERTScore recall: 0.8016983643174171
Average BERTScore F1: 0.773612901866436
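The averaging above can be wrapped in a small helper. The sketch below is one possible way to do it, run on a toy dictionary that only mimics the structure of the BERTScore result (per-segment `precision`, `recall` and `f1` lists plus a `hashcode` string); the numbers are invented and the helper name is hypothetical.

```python
from statistics import fmean


def average_bertscore(bert_result):
    """Average the per-segment precision, recall and F1 lists."""
    return {key: fmean(bert_result[key]) for key in ("precision", "recall", "f1")}


# Toy values for two segments, not taken from the experiment.
toy_result = {
    "precision": [0.5, 1.0],
    "recall": [0.25, 0.75],
    "f1": [0.4, 0.8],
    "hashcode": "toy",
}

averages = average_bertscore(toy_result)
print(averages)
```

Segments whose extracted translation is empty still contribute to these averages, so they pull the system-level values down.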
