# Prompting

Prompting in LLMs is the design of a structured input that provides a task description, demonstrations and the actual input, so that the model generates the desired output.
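As a minimal sketch of that structure, the snippet below assembles a prompt from its three parts. The helper name `build_prompt` is hypothetical; the template (`Japanese: ... = English: ...`) mirrors the one used later in this notebook, and the demonstration pair is only an illustration.

```python
def build_prompt(task, demonstrations, query, src="Japanese", tgt="English"):
    """Concatenate task description, demonstrations (shots) and the actual input."""
    # Each demonstration is a (source, target) pair rendered on its own line
    shots = "".join(f"{src}: {s} = {tgt}: {t}\n" for s, t in demonstrations)
    # The prompt ends right where the model is expected to write the translation
    return f"{task}{shots}{src}: {query} = {tgt}: "


prompt = build_prompt(
    task="Translate from Japanese to English:\n",
    demonstrations=[("「よう来たな」", "[You finally came.]")],
    query="我はゴーレムなり。",
)
print(prompt)
```

With an empty list of demonstrations the same helper yields a zero-shot prompt.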

In [1]:
!pip install datasets evaluate peft bitsandbytes transformers==4.45 #accelerate
!pip install sacrebleu unbabel-comet
!pip install huggingface_hub


In this notebook, we use a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) makes it easy to access and load datasets; for example, you can load your own dataset following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we explain how to perform In-Context Learning with the [Llama2 model](https://huggingface.co/docs/transformers/model_doc/llama2) on the [ParallelFiction-Ja_En-100k dataset](https://huggingface.co/datasets/NilanE/ParallelFiction-Ja_En-100k), a Japanese-English parallel corpus that we use for MT from Japanese into English.

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("NilanE/ParallelFiction-Ja_En-100k", split="train")

print(raw_datasets)


Dataset({
    features: ['src', 'trg', 'meta'],
    num_rows: 106048
})


In [3]:
raw_datasets[0]

{'src': '77.素人の気づき\n「いい、アーニャ。今から行ったとしても、陛下が実際に選抜を通じて選ぶ妃は数人から数十人でしょう」\n「うん、そうだね」\n「ですが、それに既存の騎士選抜を重ね合わせることで、世の中の女達は陛下に選ばれるために、陛下が定めた基準――方向性に向かって成長していく流れになります」\n「......わあ」\n「騎士はどうしても男が中心、女が騎士になろうと考えるのは一部の物好き。ですが、玉の輿を望まない女なんてよほどの事でもなければいません。世の中の女は、陛下に気に入られる為に奮起するのです」\n「そこまで考えて......すごい!」\nアーニャは俺を尊敬しきった眼差しで見つめてきた。\n「余の考えを一瞬で読み切ったお前が凄いよ」\nそういい、微笑みながらオードリーを見た。\n「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」\n「なんだ藪から棒に。この流れだと、女として、と言う意味なんだな?」\n聞きかえすと、オードリーは静かに頷いた。\n「なんでそんな事を聞く」\n「陛下は重要な事を忘れていらっしゃるように見受けられましたので」\n「重要な事?」\nなんか忘れてるか?\n「上皇陛下には多くの妃がおります。そして、\n「ふむ」\n様々な、という所で少しだけ吹き出しそうになった。\n中には臣下の妻だった女や、かつて自分の父親の妃――義理の母親だった女も妃にした。\n有名な話だ。\n時の皇帝が崩御した時は、政略的に妃にはしたが、まだ六歳という幼さ故に手付かずの女の子が一人いた。\nつまり、六歳の未亡人と言うことだ。\nそれが成長し、適齢期になった時、その美しさを見初めた父上が無理矢理自分の妻にした。\n武勇伝には事欠かないのが上皇、父上なのである。\n「臣下の妻をものにしたとき、自分の義理の母にあたる少女を手籠めにしたとき、上皇陛下は誰かに咎められまして?」\n「いいや?」\n皇帝がなぜ、その程度の事で咎められるものか。\nもっとあり得ない、非人道的な事をやっても咎められもしないのが皇帝という物だ。\n「ええ、陛下の反応そのままです」\n「何が言いたい」\n「陛下は貴族の義務と良くおっしゃってますが、貴族の権利を忘れているように思います」\n「......ふむ」\nなるほど、もっと地位と権力を享受しろって言いたいのか。\

In [4]:
raw_datasets = raw_datasets.remove_columns(["meta"])

## Preprocess

In [5]:
# Flatten each document into sentence pairs and reduce the dataset
max_tok_length = 16


def flatten_examples(batch):
    flat_jp = []
    flat_en = []
    for jp, en in zip(batch["src"], batch["trg"]):
        # Tokenizing everything is too expensive for the available resources, so we
        # pre-filter by the number of English words; an exact filter by token
        # length is applied later.
        kept = 0
        # Skip the first 10 lines to avoid opening sentences or chapter titles
        for e, j in zip(en.split("\n")[10:], jp.split("\n")[10:]):
            if len(e.split()) <= max_tok_length:
                flat_jp.append(j)
                flat_en.append(e)
                kept += 1
                if kept == 1:
                    # Keep at most 1 sentence pair per document
                    break
    flat_data = {"src": flat_jp, "trg": flat_en}
    return flat_data


# Apply flattening
flat_dataset = raw_datasets.map(
    flatten_examples,
    batched=True,
    remove_columns=raw_datasets.column_names,
)
flat_dataset[0]

Map:   0%|          | 0/106048 [00:00<?, ? examples/s]

{'src': '「それは良いのですが、陛下は気に入った女はおられないのでしょうか?」',
 'trg': '"That\'s all well and good, but is there no woman you like?"'}

In [6]:
# Split the data into train, validation and test sets (80/10/10).
# train_test_split shuffles by default, so no separate shuffle is needed.
raw_datasets = flat_dataset.train_test_split(test_size=0.2)
test_valid = raw_datasets["test"].train_test_split(test_size=0.5)
raw_datasets["test"] = test_valid["test"]
raw_datasets["valid"] = test_valid["train"]

Log in to Hugging Face to be granted access to Llama2 with 7B parameters:

In [7]:
!huggingface-cli login




We can apply the tokenizer function to any dataset. Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset.
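To illustrate the batched contract that functions such as flatten_examples rely on, here is a plain-Python stand-in (not the Datasets API): the function receives a dictionary of lists and may return a different number of rows than it received. The names `batched_map` and `split_lines` are hypothetical.

```python
def batched_map(columns, fn, batch_size=2):
    """Plain-Python stand-in for a batched map over a dict of equal-length lists."""
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        # Collect whatever columns the function returns
        for key, values in fn(batch).items():
            out.setdefault(key, []).extend(values)
    return out


def split_lines(batch):
    # One input row may yield several output rows
    return {"line": [ln for text in batch["text"] for ln in text.split("\n")]}


toy = {"text": ["a\nb", "c", "d\ne\nf"]}
result = batched_map(toy, split_lines)
print(result)
```

With the real library, a function that changes the number of rows also needs the original columns dropped, which is why the flattening step above passes remove_columns.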

In our case, each sample pair is preprocessed according to the needs of the model to be prompted. In the case of Llama2, it is recommended to explicitly state a task prompt for each source sentence; the prompt is built further below, after loading the tokenizer:

In [8]:
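from transformers import AutoTokenizer


checkpoint = "meta-llama/Llama-2-7b-hf"
# Left padding keeps the end of every prompt next to the generated tokens.
# padding, truncation and max_length are call-time arguments, so they are
# passed when the tokenizer is called rather than here.
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    token=True,  # replaces the deprecated use_auth_token
    padding_side="left",
)
# "[PAD]" is not a Llama2 vocabulary entry; in the outputs below padding shows up as id 0
tokenizer.pad_token = "[PAD]"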


In [9]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["src"],
        text_target=sample["trg"],
    )
    return model_inputs

The Datasets library applies this processing by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function, that is, *input_ids*, *attention_mask* and *labels*. We can check what preprocess_function is doing with a small sample:

In [10]:
sample = raw_datasets["train"].select(range(2))
model_input = preprocess_function(sample)
print(model_input)

{'input_ids': [[1, 29871, 30481, 30566, 30427, 30458, 30353, 30990, 30880, 30458, 233, 133, 173, 30412, 30665, 30366, 30499, 30427, 31684, 30267, 234, 170, 132, 30199, 30525, 30449, 234, 184, 135, 30735, 30733, 31068, 31095, 30199, 236, 132, 142, 30723, 30641, 30453, 30441, 30326, 30366, 30326, 30482], [1, 29871, 30481, 30978, 30203, 30279, 30330, 31250, 30465, 31111, 30665, 30466, 30672, 30458, 30533, 234, 144, 135, 30412, 30513, 233, 141, 159, 30807, 30544, 30326, 30366, 30199, 30412, 30330, 30748, 30412, 30332, 30412, 230, 132, 138, 29973, 30482]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[1, 518, 2887, 3806, 29892, 366, 925, 750, 4319, 9885, 411, 23995, 1237, 29889, 306, 750, 1781, 9885, 411, 590, 1993, 14340, 5586], [1, 376, 29968, 296, 29

In [11]:
for sample in model_input["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(sample))

['<s>', '▁', '「', 'さ', 'す', 'が', 'に', '相', '手', 'が', '<0xE6>', '<0x82>', '<0xAA>', 'か', 'っ', 'た', 'で', 'す', 'ね', '。', '<0xE7>', '<0xA7>', '<0x81>', 'の', '方', 'は', '<0xE7>', '<0xB5>', '<0x84>', 'み', '合', 'わ', 'せ', 'の', '<0xE9>', '<0x81>', '<0x8B>', 'も', 'あ', 'り', 'ま', 'し', 'た', 'し', '」']
['<s>', '▁', '「', 'ケ', 'ン', 'ト', '、', 'ど', 'う', 'や', 'っ', 'て', '我', 'が', '地', '<0xE7>', '<0x8D>', '<0x84>', 'か', 'ら', '<0xE6>', '<0x8A>', '<0x9C>', 'け', '出', 'し', 'た', 'の', 'か', '、', '分', 'か', 'る', 'か', '<0xE3>', '<0x81>', '<0x87>', '?', '」']
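Pieces such as `<0xE6>`, `<0x82>`, `<0xAA>` in the lists above are byte-fallback tokens: a character missing from the vocabulary is emitted as its UTF-8 bytes. The sketch below shows how such runs map back to characters; `merge_byte_fallback` is a hypothetical helper for illustration, not the tokenizer's own decoding routine.

```python
def merge_byte_fallback(tokens):
    """Join token pieces, decoding runs of <0xNN> byte tokens as UTF-8."""
    text, pending = [], bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            # Accumulate the raw byte value of a fallback token
            pending.append(int(tok[3:5], 16))
        else:
            if pending:
                # A regular piece ends the current run of bytes
                text.append(pending.decode("utf-8"))
                pending = bytearray()
            text.append(tok)
    if pending:
        text.append(pending.decode("utf-8"))
    return "".join(text)


pieces = ["相", "手", "が", "<0xE6>", "<0x82>", "<0xAA>", "か", "っ", "た"]
merged = merge_byte_fallback(pieces)
print(merged)
```

Each such character costs three tokens instead of one, which is one reason Japanese sentences use up the token budget faster than their English translations.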


We can recover the source text by applying [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) of the tokenizer:

In [12]:
tokenizer.batch_decode(model_input["input_ids"])

['<s> 「さすがに相手が悪かったですね。私の方は組み合わせの運もありましたし」',
 '<s> 「ケント、どうやって我が地獄から抜け出したのか、分かるかぇ?」']

Now, we can apply the preprocess_function to the raw datasets (training, validation and test):

In [13]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/84606 [00:00<?, ? examples/s]

Map:   0%|          | 0/10576 [00:00<?, ? examples/s]

Map:   0%|          | 0/10576 [00:00<?, ? examples/s]
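The three progress bars above report 84606, 10576 and 10576 rows. As a quick sanity check, these counts are consistent with an 80/10/10 split, assuming the test share is rounded up at each split (the convention scikit-learn uses; treat it as an assumption here). The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (train_rows, test_rows), rounding the test share up (assumed convention)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


total = 84606 + 10576 + 10576            # rows after flattening, from the counts above
n_train, n_holdout = split_sizes(total, 0.2)
n_valid, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_valid, n_test)
```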

We are going to filter the tokenized datasets by maximum number of tokens in source and target language:

In [14]:
tokenized_datasets = tokenized_datasets.filter(
    lambda x: len(x["input_ids"]) <= max_tok_length
    and len(x["labels"]) <= max_tok_length,
    desc=f"Discarding source and target sentences with more than {max_tok_length} tokens",
)

Discarding source and target sentences with more than 16 tokens:   0%|          | 0/84606 [00:00<?, ? examples…

Discarding source and target sentences with more than 16 tokens:   0%|          | 0/10576 [00:00<?, ? examples…

Discarding source and target sentences with more than 16 tokens:   0%|          | 0/10576 [00:00<?, ? examples…

We can take a quick look at the length histogram in the source language:

In [15]:
dic = {}
for sample in tokenized_datasets["train"]:
    sample_length = len(sample["input_ids"])
    if sample_length not in dic:
        dic[sample_length] = 1
    else:
        dic[sample_length] += 1

for i in range(1, max_tok_length + 1):
    if i in dic:
        print(f"{i:>2} {dic[i]:>3}")

 2  33
 3  37
 4  67
 5 253
 6 724
 7 799
 8 1328
 9 1398
10 1468
11 1645
12 1763
13 1789
14 1874
15 1881
16 2001
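The same histogram can be computed more compactly with `collections.Counter`. This is an equivalent sketch on toy data; `length_histogram` and `toy_ids` are names introduced here for illustration.

```python
from collections import Counter


def length_histogram(sequences, max_len):
    """Return (length, count) pairs for the lengths that occur, in increasing order."""
    counts = Counter(len(seq) for seq in sequences)
    return [(n, counts[n]) for n in range(1, max_len + 1) if n in counts]


toy_ids = [[1, 5], [1, 5, 7], [1, 9], [1, 2, 3, 4]]
for length, count in length_histogram(toy_ids, max_len=16):
    print(f"{length:>2} {count:>4}")
```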


Checking a sample after filtering by maximum number of tokens:

In [16]:
for sample in tokenized_datasets["train"].select(range(5)):
    print(sample["input_ids"])
    print(sample["attention_mask"])
    print(sample["labels"])

[1, 29871, 234, 170, 132, 30449, 31110, 30465, 235, 171, 183, 30914, 30441, 30427, 30267]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 2193, 338, 825, 306, 29892, 15313, 1312, 1048, 372, 29889]
[1, 29871, 30481, 31110, 31206, 31110, 31206, 30364, 31579, 30665, 30466, 30298, 30441, 30326, 30366, 30482]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 376, 29902, 2714, 372, 471, 931, 29908]
[1, 29871, 30988, 31438, 365, 29894, 29889, 29896, 29945, 29974]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 24928, 1920, 6387, 802, 365, 29894, 29889, 29896, 29945, 29974]
[1, 29871, 30371, 30332, 31938, 31250, 30330, 234, 165, 189, 30412, 30353, 30267]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 10428, 29892, 8959, 29889]
[1, 29871, 30481, 30847, 31502, 30371, 30566, 30553, 30441, 30427, 30412, 29973, 30482]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 376, 5328, 508, 366, 437, 393, 29973]


In [17]:
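src = "Japanese"
tgt = "English"
task_prefix = f"Translate from {src} to {tgt}:\n"
s = ""  # empty placeholder, used only to measure the length of the templates
num_shots = 1
shots = ""

# Upper bound on the prompt length, in tokens
prefix_tok_len = len(tokenizer.encode(f"{task_prefix}{shots}{src}: {s} = {tgt}: "))
shot_tok_len = len(tokenizer.encode(f"{src}: {s} = {tgt}: {s}\n"))
max_tok_len = prefix_tok_len
# Each shot adds its template plus a source and a target sentence
max_tok_len += num_shots * (shot_tok_len + 2 * max_tok_length)
# One more sentence of at most max_tok_length tokens
max_tok_len += max_tok_length

# Draw the demonstrations (shots) from the training set
random_seed = 13
sample = tokenized_datasets["train"].shuffle(seed=random_seed).select(range(num_shots))
for s in sample:
    shots += f"{src}: {s['src']} = {tgt}: {s['trg']}\n"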

def preprocess4test_function(sample):
    inputs = [f"{task_prefix}{shots}{src}: {s} = {tgt}: " for s in sample["src"]]
    model_inputs = tokenizer(
        inputs,
        max_length=max_tok_len,
        truncation=True,
        return_tensors="pt",
        padding=True,
    )
    return model_inputs
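The length bound computed in the cell above is simple arithmetic; the sketch below reproduces it as a standalone function. `prompt_budget` is a hypothetical name, and the template lengths (20 and 12 tokens) are made-up values, since the real ones depend on the tokenizer.

```python
def prompt_budget(prefix_tok_len, shot_tok_len, num_shots, max_tok_length):
    """Token bound: prefix template + shots + one more sentence."""
    # Every shot carries its template plus a source and a target sentence
    shots = num_shots * (shot_tok_len + 2 * max_tok_length)
    return prefix_tok_len + shots + max_tok_length


# Made-up template lengths; max_tok_length=16 and num_shots=1 as in the notebook
budget = prompt_budget(prefix_tok_len=20, shot_tok_len=12, num_shots=1, max_tok_length=16)
print(budget)
```

The bound grows linearly with the number of shots, so adding demonstrations directly increases the sequence length the model has to process.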

preprocess4test_function returns *input_ids* and *attention_mask* for the full prompt (task prefix, shots and source sentence); when it is applied with map(), the *labels* field from the previous tokenization is kept. We can check the prompts it builds on a small sample:

In [18]:
sample = tokenized_datasets["test"].select(range(5))
model_input = preprocess4test_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input["input_ids"]))

{'input_ids': tensor([[    0,     1,  4103,  9632,   515, 10369,   304,  4223, 29901,    13,
         29967, 21419,   968, 29901, 29871, 30481, 30787, 30465, 30805, 30366,
         30371, 30482,   353,  4223, 29901,   518,  3492,  7146,  2996,  5586,
         29871,    13, 29967, 21419,   968, 29901, 29871, 30670,   232,   163,
           184, 30326, 30366, 30364, 30980, 30974, 30955, 30665, 30366, 30267,
           353,  4223, 29901, 29871],
        [    1,  4103,  9632,   515, 10369,   304,  4223, 29901,    13, 29967,
         21419,   968, 29901, 29871, 30481, 30787, 30465, 30805, 30366, 30371,
         30482,   353,  4223, 29901,   518,  3492,  7146,  2996,  5586, 29871,
            13, 29967, 21419,   968, 29901, 29871, 31110, 30465, 30310, 30260,
         30303, 30255, 30458,   232,   179,   142, 31684, 30332, 30364, 30330,
           353,  4223, 29901, 29871],
        [    0,     0,     0,     0,     1,  4103,  9632,   515, 10369,   304,
          4223, 29901,    13, 29967, 2141

In [19]:
preprocessed_test_dataset = tokenized_datasets["test"].map(
    preprocess4test_function, batched=True
)

Map:   0%|          | 0/2066 [00:00<?, ? examples/s]

In [20]:
for sample in preprocessed_test_dataset.select(range(5)):
    print(sample["input_ids"])
    print(sample["attention_mask"])
    print(sample["labels"])

[0, 1, 4103, 9632, 515, 10369, 304, 4223, 29901, 13, 29967, 21419, 968, 29901, 29871, 30481, 30787, 30465, 30805, 30366, 30371, 30482, 353, 4223, 29901, 518, 3492, 7146, 2996, 5586, 29871, 13, 29967, 21419, 968, 29901, 29871, 30670, 232, 163, 184, 30326, 30366, 30364, 30980, 30974, 30955, 30665, 30366, 30267, 353, 4223, 29901, 29871]
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 739, 471, 472, 278, 1021, 931, 5929, 1463, 29889]
[1, 4103, 9632, 515, 10369, 304, 4223, 29901, 13, 29967, 21419, 968, 29901, 29871, 30481, 30787, 30465, 30805, 30366, 30371, 30482, 353, 4223, 29901, 518, 3492, 7146, 2996, 5586, 29871, 13, 29967, 21419, 968, 29901, 29871, 31110, 30465, 30310, 30260, 30303, 30255, 30458, 232, 179, 142, 31684, 30332, 30364, 30330, 353, 4223, 29901, 29871]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
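The leading 0 in the first input_ids and attention_mask above comes from left padding: shorter prompts are padded at the start so that every prompt ends at the last position, where a decoder-only model continues writing. A minimal sketch of that behaviour, assuming pad id 0 as seen in the output (`left_pad` is a hypothetical helper, not the tokenizer's implementation):

```python
def left_pad(batch, pad_id=0):
    """Pad token-id lists on the left to a common width and build attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    # 0 marks padding positions the model should ignore, 1 marks real tokens
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask


ids, mask = left_pad([[1, 4103, 9632], [1, 4103]])
print(ids)
print(mask)
```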

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4-bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>
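To see why 4-bit loading matters here, a back-of-the-envelope estimate of the memory taken by the weights alone is sketched below. It assumes roughly 7 billion parameters, as the 7B in the model name suggests, and ignores quantization constants, activations and the KV cache; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: about 7 billion parameters
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions, the 4-bit weights take about a quarter of the 16-bit footprint.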


In [21]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Pass the quantization_config to the from_pretrained method.

In [22]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

## Inference

We load the default inference parameters of the model, so that additional parameters can be added and passed to the [generate function](https://huggingface.co/docs/transformers/main_classes/text_generation):

In [23]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
)

print(generation_config)

GenerationConfig {
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.6,
  "top_p": 0.9
}



As observed, the default search strategy for Llama-2 is Top-p sampling with probability 0.9 and temperature 0.6 ($0<T<1$ amplifies the differences between output probabilities and makes the output more deterministic). [The search strategy can be selected](https://huggingface.co/docs/transformers/en/generation_strategies) at inference time; the generation cell below overrides these defaults with do_sample=False and num_beams=1, that is, greedy decoding.
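The effect of temperature and Top-p can be sketched on a toy distribution. This is a simplified illustration of the idea, not the Transformers implementation; the logits and the helper names `softmax` and `top_p_filter` are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]
p_default = softmax(logits)                    # T = 1
p_sharp = softmax(logits, temperature=0.6)     # T = 0.6 sharpens the distribution
nucleus = top_p_filter(p_sharp, top_p=0.9)     # candidate tokens left for sampling
print(p_default, p_sharp, nucleus)
```

Sampling then draws the next token only from the entries kept in `nucleus`.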

First, the test set is divided into small batches to reduce GPU memory consumption:

In [24]:
test_batch_size = 32
batch_tokenized_test = preprocessed_test_dataset.batch(test_batch_size)

Batching examples:   0%|          | 0/2066 [00:00<?, ? examples/s]

In [25]:
number_of_batches = len(batch_tokenized_test["input_ids"])
output_sequences = []
for i in range(number_of_batches):
    # do_sample=False with num_beams=1 selects greedy decoding, overriding the
    # sampling defaults of generation_config. max_length bounds the prompt plus
    # the generated tokens.
    output_batch = model.generate(
        generation_config=generation_config,
        input_ids=torch.tensor(batch_tokenized_test["input_ids"][i]).cuda(),
        attention_mask=torch.tensor(batch_tokenized_test["attention_mask"][i]).cuda(),
        max_length=max_tok_len,
        num_beams=1,
        do_sample=False,
    )
    output_sequences.extend(output_batch)



## Evaluation

The output of the model is automatically evaluated against the reference translations. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu), together with COMET (the Unbabel/wmt22-comet-da checkpoint, loaded through the unbabel-comet package).
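BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below computes only the clipped precision for a single sentence pair, with whitespace tokenization as a simplifying assumption; it is not sacreBLEU and its numbers are not comparable to the scores reported in this notebook. The sentences and helper names are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams found in the reference, each clipped to its reference count."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


hyp = "the cat sat on the mat"
ref = "the cat is on the mat"
p1 = clipped_precision(hyp, ref, 1)
p2 = clipped_precision(hyp, ref, 2)
print(p1, p2)
```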

In [26]:
from evaluate import load
from comet import download_model, load_from_checkpoint

metric = load("sacrebleu")
comet_model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_model_path)



The example below performs basic post-processing to decode the predictions and extract the translation:

In [29]:
import re


def compute_metrics(sample, output_sequences):
    # Rebuild the prompts so that they can be stripped from the decoded outputs
    inputs = [f"{task_prefix}{shots}{src}: {s} = {tgt}: " for s in sample["src"]]
    preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    print(inputs)
    print(preds)
    for i, (prompt, pred) in enumerate(zip(inputs, preds)):
        # Keep only the first generated line; an output without a newline becomes empty
        pred = re.search(r"^.*\n", pred.removeprefix(prompt).lstrip())
        if pred is not None:
            preds[i] = pred.group()[:-1]
        else:
            preds[i] = ""
    print(sample["src"])
    print(sample["trg"])
    print(preds)
    result = metric.compute(predictions=preds, references=sample["trg"])
    result = {"bleu": result["score"]}
    # Compute COMET score
    data = [
        {"src": s, "mt": hyp, "ref": ref}
        for hyp, ref, s in zip(preds, sample["trg"], sample["src"])
    ]
    comet_score = comet_model.predict(data, batch_size=64, gpus=1)
    result["comet"] = comet_score.system_score
    return result
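The extraction step inside compute_metrics can be exercised in isolation. The sketch below follows the same rule (strip the prompt, keep the first line, return an empty string when no newline was generated) using `str.partition` instead of a regular expression; `extract_translation` and the toy strings are hypothetical.

```python
def extract_translation(prompt, decoded):
    """Return the first generated line after the prompt, or "" if no newline follows it."""
    continuation = decoded.removeprefix(prompt).lstrip()
    first_line, newline, _ = continuation.partition("\n")
    # Same rule as the regular expression above: the line must end with a newline
    return first_line if newline else ""


toy_prompt = "Japanese: X = English: "
complete = extract_translation(toy_prompt, toy_prompt + "Hello.\nJapanese: Y = English: ")
cut_off = extract_translation(toy_prompt, toy_prompt + "Hello.")
print(repr(complete), repr(cut_off))
```

Outputs that are cut off before a newline count as empty translations under this rule, which is worth keeping in mind when reading the scores.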

In [30]:
result = compute_metrics(preprocessed_test_dataset, output_sequences)
print(f'BLEU score: {result["bleu"]}')
print(f'COMET score: {result["comet"]}')

['Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: 安堵したと同時だった。 = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: そうアイリスが尋ねると、 = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: (あ、......私......) = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: この戦がなければ。 = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: 「特殊というのは?」 = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: 我はゴーレムなり。 = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: 「どうされますか?」 = English: ', 'Translate from Japanese to English:\nJapanese: 「よう来たな」 = English: [You finally came.] \nJapanese: 「ラナ様?」 = English: ', 'Tr



BLEU score: 14.469212361326514
COMET score: 0.6831625701872215
