<a href="https://colab.research.google.com/github/VincentCCL/MTAT/blob/main/notebooks/MTAT26_Translation%26Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this hands-on session we will see how we can send sentences to online translation engines, and how we can automatically evaluate the results, provided that we have human translated reference translations at our disposal.


# 3.4.1 Accessing Translation engines in Python
We can connect python to existing online translation engines by using their so-called *API*s: Application Programming Interface.

We first need to install the python module *translators*, which allows easy access to several online engines. For more info on this module check https://pypi.org/project/translators/.

For python, we install modules with the `pip install` command. We tell Google Colab that the code cell does not concern python code, but contains a linux command by prepending it with a `!`.

In [2]:
!pip -q install translators --upgrade

To show how it works, we will use a set of Dutch sentences created as the development set of the Tatoeba English-to-Dutch parallel data.

We download the testset using `!wget`.

In [2]:
!wget https://raw.githubusercontent.com/VincentCCL/MTAT/refs/heads/main/data/tatoeba-en-nl/dev.en

--2026-02-18 14:51:26--  https://raw.githubusercontent.com/VincentCCL/MTAT/refs/heads/main/data/tatoeba-en-nl/dev.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32441 (32K) [text/plain]
Saving to: ‘dev.en’


2026-02-18 14:51:26 (10.2 MB/s) - ‘dev.en’ saved [32441/32441]



We then read in the files in python, in the list `sourcelines`.

In [3]:
sourcefile = "dev.en"
sourcelines=open(sourcefile,'r').readlines()
sourcelines = [l.strip() for l in sourcelines]
sourcelines

['Nobody reads about my country .',
 'I sleep in my car .',
 "There 're clean sheets under the bed .",
 'I have a donkey .',
 'Betty drives fast .',
 'Come to help me .',
 "She 's not penniless .",
 'This mountain is covered in snow all-year-round .',
 'Tom thought Mary was cute .',
 "I don 't suppose it 's possible to read a book by moonlight .",
 'You just have to find her .',
 "It 's very cold today .",
 'Can you ship it to New York City ?',
 'What else does Tom want ?',
 "I 'd like to buy a Picasso .",
 "I don 't mind your groping in the dark for a solution , but I wish you 'd come to a decision .",
 'Tom has his own business .',
 'Horses run fast .',
 'I feel depressed .',
 'How did your test go ?',
 'What would you do if you were in my place ?',
 'Take a short cut .',
 'What is the name of this river ?',
 'Lisa speaks not only English but also French .',
 'Tom said that he needed to go to bed .',
 "I don 't drink wine .",
 "We 're a little busy now .",
 'Tom offended Mary .',
 'L

The next code block shows how we loop over the sentences of the article and use the online translation engines. We print out the results per sentence, so we can easily compare the different engines.

Note that the engines are real engines, and that it is possible that they stop offering the service of accepting sentences through an API. As far as I've tested, it seems to work for google and bing, and only sometimes for deepl. These APIs are only intended for research and teaching, so don't overuse them!

The engines apply, by default, automatic source language identification. The target language defaults to English. As we want to translate to Dutch we have to set the target language to `nl`.

In [4]:
import translators as ts


We first define a function that takes a sentence, an engine and a target language as input arguments and that strips the translated sentence of potential newline characters at the end of the line. The function also catches the case where the engine does not return anything, so the code will not crash.

In [5]:
def translate_sentence(sentence, engine, targetlang):
    try:
        out = ts.translate_text(
            sentence,
            translator=engine,
            to_language=targetlang
        )
        return str(out).strip()
    except Exception:
        return "-"

Then we call the function on the MT engines `bing` and `google`, and keep the results in a python dictionary `results`, which contains one list per engine.

We limit the loop to the translation of the first 30 source lines `[:30]`.

In [6]:
engines=['bing','google']
targetlang = 'nl'
results = {eng: [] for eng in engines}

for sentence in sourcelines[:30]:
    sentence = sentence.strip()
    if not sentence:
        continue
    print("SRC:",sentence)
    for eng in engines:
      result = translate_sentence(sentence, eng,targetlang)
      print(eng,':',result)
      results[eng].append(result)
    print("\n")


SRC: Nobody reads about my country .
bing : Niemand leest over mijn land.
google : Niemand leest over mijn land.


SRC: I sleep in my car .
bing : Ik slaap in mijn auto.
google : Ik slaap in mijn auto.


SRC: There 're clean sheets under the bed .
bing : Er liggen schone lakens onder het bed.
google : Er liggen schone lakens onder het bed.


SRC: I have a donkey .
bing : Ik heb een ezel.
google : Ik heb een ezel.


SRC: Betty drives fast .
bing : Betty rijdt snel.
google : Betty rijdt snel.


SRC: Come to help me .
bing : Kom me helpen.
google : Kom mij helpen.


SRC: She 's not penniless .
bing : Ze is niet kansarm.
google : Ze heeft geen cent .


SRC: This mountain is covered in snow all-year-round .
bing : Deze berg is het hele jaar door met sneeuw bedekt.
google : Deze berg is het hele jaar door bedekt met sneeuw.


SRC: Tom thought Mary was cute .
bing : Tom vond Mary schattig.
google : Tom vond Mary schattig.


SRC: I don 't suppose it 's possible to read a book by moonlight .
bi

##3.4.1.1 DeepL

In order to automatically translate with DeepL, you first need to create a DeepL account at https://www.deepl.com/en/pro-api, and then go to API keys and limits and click Create key.

Then we click on the key symbol in the left pane and add a secret key with the name `deepl`. These keys are not visible to people you share the colab session with.

We get the values of these keys using `userdata.get('deepl')` and can use that value to get access to the DeepL web service.

In [7]:
from google.colab import userdata
auth_key = userdata.get('deepl')

Then we install the deepl python library


In [8]:
!pip install deepl



Now we translate the first 30 lines and add them to the `results` dictionary and print them to screen.

In [9]:
import deepl


translator = deepl.Translator(auth_key)

translations = translator.translate_text(sourcelines[:30], target_lang="NL")
results['deepl']=[]

for t in translations:
    print(t.text)
    results['deepl'].append(t.text)

Niemand leest over mijn land.
Ik slaap in mijn auto.
Er liggen schone lakens onder het bed.
Ik heb een ezel.
Betty rijdt hard.
Kom me helpen.
Ze is niet blut.
Deze berg is het hele jaar door bedekt met sneeuw.
Tom vond Mary schattig.
Ik denk niet dat het mogelijk is om bij maanlicht een boek te lezen.
Je moet haar gewoon vinden.
Het is vandaag erg koud.
Kunt u het naar New York City verzenden?
Wat wil Tom nog meer?
Ik zou graag een Picasso willen kopen.
Ik vind het niet erg dat je in het donker naar een oplossing zoekt, maar ik zou willen dat je een beslissing neemt.
Tom heeft zijn eigen bedrijf.
Paarden rennen snel.
Ik voel me depressief.
Hoe ging je test?
Wat zou jij doen als je in mijn plaats was?
Neem een kortere weg.
Hoe heet deze rivier?
Lisa spreekt niet alleen Engels, maar ook Frans.
Tom zei dat hij naar bed moest.
Ik drink geen wijn.
We hebben het nu een beetje druk.
Tom heeft Mary beledigd.
Leer deze zinnen.
Het is allemaal heerlijk!


We can now loop over our dictionary and source sentences again and compare the three engines

In [10]:
engines=['google','bing','deepl']
for (index,source) in enumerate(sourcelines[:30]):
  print(f'SRC : {source}')
  for engine in engines:
    print(f'{engine} : {results[engine][index]}')
  print("")

SRC : Nobody reads about my country .
google : Niemand leest over mijn land.
bing : Niemand leest over mijn land.
deepl : Niemand leest over mijn land.

SRC : I sleep in my car .
google : Ik slaap in mijn auto.
bing : Ik slaap in mijn auto.
deepl : Ik slaap in mijn auto.

SRC : There 're clean sheets under the bed .
google : Er liggen schone lakens onder het bed.
bing : Er liggen schone lakens onder het bed.
deepl : Er liggen schone lakens onder het bed.

SRC : I have a donkey .
google : Ik heb een ezel.
bing : Ik heb een ezel.
deepl : Ik heb een ezel.

SRC : Betty drives fast .
google : Betty rijdt snel.
bing : Betty rijdt snel.
deepl : Betty rijdt hard.

SRC : Come to help me .
google : Kom mij helpen.
bing : Kom me helpen.
deepl : Kom me helpen.

SRC : She 's not penniless .
google : Ze heeft geen cent .
bing : Ze is niet kansarm.
deepl : Ze is niet blut.

SRC : This mountain is covered in snow all-year-round .
google : Deze berg is het hele jaar door bedekt met sneeuw.
bing : Deze 

In [11]:
results['deepl']

['Niemand leest over mijn land.',
 'Ik slaap in mijn auto.',
 'Er liggen schone lakens onder het bed.',
 'Ik heb een ezel.',
 'Betty rijdt hard.',
 'Kom me helpen.',
 'Ze is niet blut.',
 'Deze berg is het hele jaar door bedekt met sneeuw.',
 'Tom vond Mary schattig.',
 'Ik denk niet dat het mogelijk is om bij maanlicht een boek te lezen.',
 'Je moet haar gewoon vinden.',
 'Het is vandaag erg koud.',
 'Kunt u het naar New York City verzenden?',
 'Wat wil Tom nog meer?',
 'Ik zou graag een Picasso willen kopen.',
 'Ik vind het niet erg dat je in het donker naar een oplossing zoekt, maar ik zou willen dat je een beslissing neemt.',
 'Tom heeft zijn eigen bedrijf.',
 'Paarden rennen snel.',
 'Ik voel me depressief.',
 'Hoe ging je test?',
 'Wat zou jij doen als je in mijn plaats was?',
 'Neem een kortere weg.',
 'Hoe heet deze rivier?',
 'Lisa spreekt niet alleen Engels, maar ook Frans.',
 'Tom zei dat hij naar bed moest.',
 'Ik drink geen wijn.',
 'We hebben het nu een beetje druk.',
 'T

##3.4.1.2 Save the outputs to files

Now that we've translated 30 sentences with 3 different MT engines, we store the results, so we can reuse and evaluate them later.

In [12]:
for engine, outputs in results.items():
    with open(f"tatoeba-{engine}.txt", "w") as f:
        for sent in outputs:
            f.write(sent + "\n")


The above made local copies. We need to copy them to a path in our Google Drive, using `!cp`.

In [None]:
!cp tatoeba-*.txt drive/MyDrive/MTAT/.../tatoeba/

# 3.4.2 Automatic metrics

In this section we’ll demonstrate how to calculate several evaluation metrics.

##3.4.2.1 MaTEO: Machine Translation Evaluation Online

In this hands-on section, we focus on automatic evaluation: computing quantitative scores that summarise how close a system output is to one or more reference translations.

A practical way to do this is MATEO (MAchine Translation Evaluation Online) available at https://mateo.ivdnt.org/, a web-based interface that lets you upload MT outputs and obtain scores from a battery of established metrics in a single, consistent evaluation pipeline. Tools like MATEO make it easier to compare systems, settings, or datasets without re-implementing evaluation code each time—while also reminding us that metric scores are proxies that must be
interpreted with care.

The online version will only work for relatively small files, not for test files with thousands of sentences.

### Exercise: Evaluate the three engines with MATEO
Use the tatoeba-engine.txt files that you just created to evaluate the engines with MATEO. Remember that you need to create a version of only 30 sentences for the source and the reference files. (You can use the `!head` command for that).


The reference file is found here: https://raw.githubusercontent.com/VincentCCL/MTAT/refs/heads/main/data/tatoeba-en-nl/dev.nl

Interpret the results.

##3.4.2.2 SacreBLEU: BLEU scores / chrF / TER

There is a python library that allows us to calculate BLEU scores easily. We first need to install it.

In [17]:
!pip -q install -U sacrebleu

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/100.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m100.8/100.8 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h

 Now we can use SacreBleu from within python, but we can also use it as a command on the commandline.

### In python

First we need to import the metrics

In [18]:
from sacrebleu.metrics import BLEU, CHRF, TER

Then we need a list of references. Be aware that this should be a list of lists, as we can have multiple references per source sentence.

And of course we also need system output.


In [19]:
refs = [ # First set of references
           ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
         # Second set of references
           ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
       ]
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']



Then we can calculate the scores like this:

In [20]:
bleu = BLEU()
bleu.corpus_score(sys, refs)

BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)

In [21]:
chrf = CHRF()
chrf.corpus_score(sys, refs)

chrF2 = 59.73

In [22]:
ter = TER()
ter.corpus_score(sys,refs)

TER = 40.00

### Command line

To test this, we can download some automatically translated sentences from Romanian into Portuguese, automatically transcribed from a speech file. These sentences come from an experiment at the European Parliament where three anonymous commercial companies performed speech translation on a speech from Romanian to Portuguese. It is only a very short fragment.

In [23]:
!wget https://github.com/VincentCCL/MTAT/raw/refs/heads/main/data/ep/RO_PT_AV.zip
!unzip RO_PT_AV.zip

--2026-02-18 15:30:11--  https://github.com/VincentCCL/MTAT/raw/refs/heads/main/data/ep/RO_PT_AV.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/VincentCCL/MTAT/refs/heads/main/data/ep/RO_PT_AV.zip [following]
--2026-02-18 15:30:11--  https://raw.githubusercontent.com/VincentCCL/MTAT/refs/heads/main/data/ep/RO_PT_AV.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2704 (2.6K) [application/zip]
Saving to: ‘RO_PT_AV.zip’


2026-02-18 15:30:11 (45.0 MB/s) - ‘RO_PT_AV.zip’ saved [2704/2704]

Archive:  RO_PT_AV.zip
  inflating: RO_PT_AdinaValean_A.txt.mt.tok.align  
  inflating: RO_PT_AdinaVal

This gives us the results of the automatic translation by three MT engines of the same snippet of speech in the European Parliament. It also includes a reference translation.

In [24]:
!sacrebleu RO_PT_AdinaValean_REF.txt.sent -i RO_PT_AdinaValean_A.txt.mt.tok.align -m bleu ter chrf

[
{
 "name": "BLEU",
 "score": 23.5,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.6.0",
 "verbose_score": "49.4/26.4/18.2/12.8 (BP = 1.000 ratio = 1.108 hyp_len = 164 ref_len = 148)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.6.0"
},
{
 "name": "chrF2",
 "score": 50.6,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.6.0",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "0",
 "space": "no",
 "version": "2.6.0"
},
{
 "name": "TER",
 "score": 71.6,
 "signature": "nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.6.0",
 "nrefs": "1",
 "case": "lc",
 "tok": "tercom",
 "norm": "no",
 "punct": "yes",
 "asian": "no",
 "version": "2.6.0"
}
]
[0m

### Exercise
Take the bing, google and deepl translations we made for the first 30 tatoeba sentences and calculate their BLEU, TER and chrF scores, using the command line or python approaches.


In [None]:
# Put your code here

##3.4.2.3 Significance levels for BLEU / chrF / TER
Sacrebleu also allows you to calculate whether the difference in scores is significant. In order to calculate this we need to give sacrebleu the output of (at least) two MT systems, and add the `--paired-bs` option.

In [None]:
!sacrebleu reference_file -i mt-output-file_baseline mt-output-file_system2 mt-output-file_systenm3 ... -m metrics --paired-bs

usage: sacrebleu [-h] [--citation] [--list] [--test-set TEST_SET]
                 [--language-pair LANGPAIR] [--origlang ORIGLANG]
                 [--subset SUBSET] [--download DOWNLOAD]
                 [--echo ECHO [ECHO ...]] [--input [INPUT ...]]
                 [--num-refs NUM_REFS] [--encoding ENCODING]
                 [--metrics {bleu,chrf,ter} [{bleu,chrf,ter} ...]]
                 [--sentence-level] [--smooth-method {none,floor,add-k,exp}]
                 [--smooth-value BLEU_SMOOTH_VALUE]
                 [--tokenize {none,zh,13a,intl,char,ja-mecab,ko-mecab,spm,flores101,flores200,spBLEU-1K}]
                 [--lowercase] [--force] [--chrf-char-order CHRF_CHAR_ORDER]
                 [--chrf-word-order CHRF_WORD_ORDER] [--chrf-beta CHRF_BETA]
                 [--chrf-whitespace] [--chrf-lowercase] [--chrf-eps-smoothing]
                 [--ter-case-sensitive] [--ter-asian-support] [--ter-no-punct]
                 [--ter-normalized] [--confidence]
                 [--c

In [25]:
!sacrebleu RO_PT_AdinaValean_REF.txt.sent -i RO_PT_AdinaValean_A.txt.mt.tok.align RO_PT_AdinaValean_B.txt.mt.tok.align RO_PT_AdinaValean_C.txt.mt.tok.align -m bleu ter chrf --paired-bs --format text

sacreBLEU: Found 3 systems.
sacreBLEU: Pre-computing BLEU statistics for 'Baseline: RO_PT_AdinaValean_A.txt.mt.tok.align'
sacreBLEU: Pre-computing TER statistics for 'Baseline: RO_PT_AdinaValean_A.txt.mt.tok.align'
sacreBLEU: Pre-computing CHRF statistics for 'Baseline: RO_PT_AdinaValean_A.txt.mt.tok.align'
sacreBLEU: Computing BLEU for 'RO_PT_AdinaValean_B.txt.mt.tok.align' and extracting sufficient statistics
sacreBLEU:  > Performing paired bootstrap resampling test (# resamples: 1000)
sacreBLEU: Computing TER for 'RO_PT_AdinaValean_B.txt.mt.tok.align' and extracting sufficient statistics
sacreBLEU:  > Performing paired bootstrap resampling test (# resamples: 1000)
sacreBLEU: Computing chrF2 for 'RO_PT_AdinaValean_B.txt.mt.tok.align' and extracting sufficient statistics
sacreBLEU:  > Performing paired bootstrap resampling test (# resamples: 1000)
sacreBLEU: Computing BLEU for 'RO_PT_AdinaValean_C.txt.mt.tok.align' and extracting sufficient statistics
sacreBLEU:  > Performing paired b

#3.4.3 BERTScore

##3.4.3.1 Hugging Face access token (BERTScore)

BERTScore relies on pretrained Transformer models that are downloaded from
the **Hugging Face Hub**. To access these models, you need a **Hugging Face
account** and a **read-only access token**.

If you do not yet have a Hugging Face account:

1. Go to https://huggingface.co
2. Create a free account (or log in if you already have one).
3. Navigate to **Access Tokens**.
4. Create a new token with role **Read** and copy the token (it starts with
   `hf_...`).

In Google Colab, this token must be stored as a **secret**:

1. Click the 🔑 **Secrets** icon in the left sidebar.
2. Add a new secret with:
   - **Name:** `HF_TOKEN`
   - **Value:** your Hugging Face access token
3. Save the secret.

Once the token is stored as a secret, it is automatically made available to
Python libraries in the notebook. No additional authentication code is
required when running BERTScore from Python.

If the token is missing or invalid, the model download will fail. After the
model has been downloaded once, it is cached and reused for the remainder
of the Colab session.


To make the token available for command-line commands, execute this code:

In [1]:
from google.colab import userdata
from huggingface_hub import login

hf_token = userdata.get("HF_TOKEN")
login(hf_token)


## 3.4.3.2 Running BERTScore

We first need to instal the appropriate module.

In [2]:
!pip -q install bert-score

And we have a bit of data in this format:

In [27]:
refs_bertscore = [
    'The dog bit the man. ||| The dog had bit the man.',
    'It was not unexpected. ||| No one was surprised.',
    'The man bit him first. ||| The man had bitten the dog.',
]

sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']


Then we can calculate BERTScore Precision, Recall and F-score with the following command. Note that you need to set the target language explicitly.

In [31]:
from bert_score import score
(P, R, F), hashname = score(sys, refs_bertscore, lang="en", return_hash=True)
print(
    f"{hashname}: P={P.mean().item():.6f} R={R.mean().item():.6f} F={F.mean().item():.6f}"
)

Loading weights:   0%|          | 0/389 [00:00<?, ?it/s]

RobertaModel LOAD REPORT from: roberta-large
Key                             | Status     | 
--------------------------------+------------+-
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
pooler.dense.weight             | MISSING    | 
pooler.dense.bias               | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


roberta-large_L17_no-idf_version=0.3.12(hug_trans=5.0.0): P=0.957206 R=0.915080 F=0.935619


With the command-line on the previously downloaded files from the European Parliament

In [3]:
!bert-score \
  -r RO_PT_AdinaValean_REF.txt.sent \
  -c RO_PT_AdinaValean_A.txt.mt.tok.align \
  --lang pt \
  --m xlm-roberta-large

Loading weights: 100% 391/391 [00:00<00:00, 916.56it/s, Materializing param=pooler.dense.weight]
[1mXLMRobertaModel LOAD REPORT[0m from: xlm-roberta-large
Key                       | Status     |  | 
--------------------------+------------+--+-
lm_head.layer_norm.bias   | [38;5;208mUNEXPECTED[0m |  | 
lm_head.dense.weight      | [38;5;208mUNEXPECTED[0m |  | 
lm_head.dense.bias        | [38;5;208mUNEXPECTED[0m |  | 
lm_head.bias              | [38;5;208mUNEXPECTED[0m |  | 
lm_head.layer_norm.weight | [38;5;208mUNEXPECTED[0m |  | 

[3mNotes:
- [38;5;208mUNEXPECTED[0m[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m
xlm-roberta-large_L17_no-idf_version=0.3.12(hug_trans=5.0.0)_fast-tokenizer P: 0.928678 R: 0.927544 F1: 0.928097


### Exercise
* Calculate BERT-score on the 30 sentences from the corpus we've translated for bing, google and deepl.

#3.4.4 BLEURT

First, we install:

In [4]:
!pip install -q git+https://github.com/google-research/bleurt.git
!pip install -q huggingface_hub


  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for BLEURT (setup.py) ... [?25l[?25hdone


Then, we download a TF BLEURT-20 checkpoint mirror from Hugging Face.

In [5]:
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="BramVanroy/BLEURT-20",
    local_dir="bleurt-20",
    local_dir_use_symlinks=False
)
print("Checkpoint in:", ckpt_dir)




Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Checkpoint in: /content/bleurt-20


And this is how we run Bleurt.

In [6]:
# 3) Score
from bleurt import score
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer("bleurt-20")
scores = scorer.score(references=references, candidates=candidates)
print("BLEURT mean:", sum(scores)/len(scores))


BLEURT mean: 0.7928391098976135


#3.4.5 COMET

To get COMET to work requires us to start a new colab session, as it is not compatible with the things we've previously installed. You can find it [HERE](https://colab.research.google.com/drive/1m65-TU26XJYaXBNRkjlxDd1_oYtAhgLI?usp=sharing)