# Machine Translation Automatic Metrics

- BLEU
- chrF
- COMET



In [1]:
# install the metrics
# we only have to do it once per session
# install BLEU and chrF
!pip install sacrebleu #==2.5.1



In [2]:
#install COMET
!pip install unbabel-comet #==2.2.6



## Upload Text Files

We need to upload the text files with the Source, Reference translation, and MT systems (engines) output.

You can upload files via the **directory** icon.

--> Alternative: clone the repository that contains the files.

In [3]:
#clone repo from Github and navigate to correct working directory
!git clone https://github.com/fubotz/BMT_2025S
%cd /content/BMT_2025S/week11_files

Cloning into 'BMT_2025S'...
remote: Enumerating objects: 567, done.
remote: Counting objects: 100% (199/199), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 567 (delta 128), reused 8 (delta 8), pack-reused 368 (from 1)
Receiving objects: 100% (567/567), 172.54 MiB | 28.52 MiB/s, done.
Resolving deltas: 100% (303/303), done.
Updating files: 100% (78/78), done.
/content/BMT_2025S/week11_files


## Compute BLEU and chrF Scores

This is the general form of the command to compute BLEU and chrF (the uppercase words are placeholders):

sacrebleu REFERENCE_FILE -l LANGUAGE_PAIR -i MT_OUTPUT -m METRICS

!sacrebleu en-de.vaccine.reference.de -l en-de -i en-de.opuscat.original.de -m bleu chrf

**Important:**

Passing two system outputs to `-i` together with `--paired-bs` tells sacrebleu to evaluate both systems on the same set of references and to run a statistical significance test using paired bootstrap resampling. The first system listed is treated as the baseline, and the second is compared against it. The test indicates whether the difference in BLEU/chrF between the two systems is statistically significant or could plausibly be due to chance.

In [5]:
!sacrebleu Vienna_Environmental.en-de.test.de -l en-de -i Vienna_Environmental.en-de.test.marian.de Vienna_Environmental.en-de.test.nllb.de -m bleu chrf --paired-bs
#gold standard translation - system1 - system2

sacreBLEU: Found 2 systems.
sacreBLEU: Pre-computing BLEU statistics for 'Baseline: Vienna_Environmental.en-de.test.marian.de'
sacreBLEU: Pre-computing CHRF statistics for 'Baseline: Vienna_Environmental.en-de.test.marian.de'
sacreBLEU: Computing BLEU for 'Vienna_Environmental.en-de.test.nllb.de' and extracting sufficient statistics
sacreBLEU:  > Performing paired bootstrap resampling test (# resamples: 1000)
sacreBLEU: Computing chrF2 for 'Vienna_Environmental.en-de.test.nllb.de' and extracting sufficient statistics
sacreBLEU:  > Performing paired bootstrap resampling test (# resamples: 1000)
[
    {
        "system": "Baseline: Vienna_Environmental.en-de.test.marian.de",
        "BLEU": {
            "score": 18.767901172132877,
            "p_value": null,
            "mean": 19.002123586430987,
            "ci": 4.49199024497201
        },
        "chrF2": {
            "score": 57.09016511907269,
            "p_value": null,
            "mean": 57.11701761337507,
            "ci":

## Evaluation Results: BLEU and chrF2 with Paired Bootstrap Resampling

We compared two MT systems (Marian and NLLB) on the `Vienna_Environmental.en-de.test` dataset using **BLEU** and **chrF2** metrics via `sacreBLEU` with **paired bootstrap resampling** (1000 samples). This allows us to assess not only the raw scores but also whether the differences are **statistically significant**.


--> A difference from the baseline is considered statistically significant if p < 0.05.
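
The resampling idea behind `--paired-bs` can be sketched in a few lines of plain Python. This is a simplified illustration, not sacreBLEU's implementation: it assumes each sentence has its own score that can simply be averaged (sacreBLEU instead recomputes corpus-level scores from per-sentence statistics), and it uses a simple one-sided estimate of the p-value. The function name `paired_bootstrap` and the toy scores are made up for the example.

```python
import random

def paired_bootstrap(base_scores, sys_scores, n_resamples=1000, seed=12345):
    """Return the share of resamples in which the system does NOT beat the baseline.

    Assumption: sentence-level scores that can be averaged. A small return
    value means the system beats the baseline in almost every resample.
    """
    if len(base_scores) != len(sys_scores):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(base_scores)
    wins = 0
    for _ in range(n_resamples):
        # "Paired": the same resampled sentence indices are used for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        base_mean = sum(base_scores[i] for i in idx) / n
        sys_mean = sum(sys_scores[i] for i in idx) / n
        if sys_mean > base_mean:
            wins += 1
    return 1.0 - wins / n_resamples

# Toy sentence-level scores (hypothetical values, not taken from the test set).
base = [0.42, 0.55, 0.31, 0.60, 0.48]
system = [0.40, 0.58, 0.35, 0.59, 0.50]
print(paired_bootstrap(base, system))
```

Drawing the same indices for both systems is what makes the test paired: each resample compares the two systems on exactly the same sentences, so sentence difficulty cancels out.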



### BLEU Scores

| System       | BLEU Score | Bootstrap Mean | 95% Confidence Interval | p-value |
|--------------|------------|----------------|--------------------------|---------|
| Marian       | 18.77      | 19.00          | ±4.49                    | —       |
| NLLB         | 18.17      | 18.19          | ±2.99                    | 0.332   |

- Marian achieves a slightly higher BLEU score than NLLB.
- However, the **p-value of 0.332** indicates that this difference is **not statistically significant**.

### chrF2 Scores

| System       | chrF2 Score | Bootstrap Mean | 95% Confidence Interval | p-value |
|--------------|-------------|----------------|--------------------------|---------|
| Marian       | 57.09       | 57.12          | ±2.82                    | —       |
| NLLB         | 52.13       | 52.16          | ±2.54                    | **0.001** |

- Marian outperforms NLLB by ~5 points in chrF2.
- The **p-value of 0.001** shows this difference is **statistically significant**.

### Summary

- **BLEU**: No significant difference between Marian and NLLB.
- **chrF2**: Marian performs **significantly better** than NLLB.

### Interpretation

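BLEU is a precision-based metric over word-level n-grams, while chrF2 is an F-score over character n-grams (with β = 2, so recall counts more than precision). Because it rewards partial word matches, chrF2 tends to better capture **morphological** variation such as inflected forms and compounds. The results suggest that, although the two systems cannot be separated on BLEU, **Marian's output is closer to the reference at the character level**, as shown by its higher and statistically significant chrF2 score. One plausible reading is that Marian produces more morphologically accurate output, but confirming this would require inspecting the translations themselves.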
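
To illustrate why a character-level metric gives partial credit where word matching gives none, here is a simplified character n-gram F-score in plain Python. It is a sketch under assumptions, not sacreBLEU's chrF: it removes spaces, averages precision and recall over n-gram orders 1 to 6, and skips orders for which either string is too short. The names `char_ngrams` and `simple_chrf` are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# An inflected form shares no whole word with the reference,
# but still shares most of its character n-grams.
print(simple_chrf("des Umweltschutzes", "der Umweltschutz"))
```

In this toy pair no word matches exactly, so a word-level unigram comparison finds nothing, while the character-level score stays well above zero.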


**Important:**

The code below (using one reference and one system output) simply evaluates Marian’s output against the reference using BLEU and chrF, but it does not perform any significance testing.

In [6]:
!sacrebleu Vienna_Environmental.en-de.test.de -l en-de -i Vienna_Environmental.en-de.test.marian.de -m bleu chrf
#gold standard translation - system1

[
{
 "name": "BLEU",
 "score": 18.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.5.1",
 "verbose_score": "47.5/24.5/13.5/7.9 (BP = 1.000 ratio = 1.163 hyp_len = 2425 ref_len = 2086)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.5.1"
},
{
 "name": "chrF2",
 "score": 57.1,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.5.1",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "0",
 "space": "no",
 "version": "2.5.1"
}
]
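
The `verbose_score` field contains everything needed to reconstruct the BLEU score by hand: the four n-gram precisions, the brevity penalty (BP), and the hypothesis and reference lengths. The sketch below applies the standard BLEU formula (brevity penalty times the geometric mean of the n-gram precisions) to the numbers printed above. The function name `bleu_from_components` is hypothetical, and because the printed precisions are rounded, the result only approximates the reported score.

```python
import math

def bleu_from_components(precisions, hyp_len, ref_len):
    """BLEU = brevity penalty * geometric mean of the n-gram precisions."""
    # The brevity penalty only applies when the hypothesis is shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)

# Values copied from the verbose_score of the Marian output above.
score = bleu_from_components([47.5, 24.5, 13.5, 7.9], hyp_len=2425, ref_len=2086)
print(round(score, 1))
```

Since the Marian output is longer than the reference (ratio = 1.163), the brevity penalty is 1.000 and the score is determined by the n-gram precisions alone.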

## Compute COMET Scores

This is the general form of the command to compute COMET scores (the uppercase words are placeholders):

comet-score -s SOURCE_FILE -t MT_OUTPUT -r REFERENCE_FILE

!comet-score -s en-de.vaccine.source.en -t en-de.opuscat.original.de -r en-de.vaccine.reference.de
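
The same scoring can also be done from Python. The sketch below follows the usage shown in the COMET README (`download_model`, `load_from_checkpoint`, `predict`). The helper names `build_comet_inputs` and `score_with_comet` are hypothetical, and the model name `Unbabel/wmt22-comet-da` is an assumption about which checkpoint you want. Scoring downloads the checkpoint and is slow without a GPU.

```python
def build_comet_inputs(sources, translations, references=None):
    """Pair up parallel lines into the list of dicts that COMET's predict() expects."""
    if len(sources) != len(translations):
        raise ValueError("source and MT output must have the same number of lines")
    if references is None:
        # Reference-free (QE) models only need the source and the MT output.
        return [{"src": s, "mt": t} for s, t in zip(sources, translations)]
    if len(references) != len(sources):
        raise ValueError("reference must have the same number of lines as the source")
    return [
        {"src": s, "mt": t, "ref": r}
        for s, t, r in zip(sources, translations, references)
    ]

def score_with_comet(samples, model_name="Unbabel/wmt22-comet-da", gpus=0):
    """Return (system_score, segment_scores); requires `pip install unbabel-comet`."""
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))
    output = model.predict(samples, batch_size=8, gpus=gpus)
    return output.system_score, output.scores

# Toy example (hypothetical sentences); pass the result to score_with_comet().
samples = build_comet_inputs(
    ["The vaccine is safe."],
    ["Der Impfstoff ist sicher."],
    ["Der Impfstoff ist sicher."],
)
print(samples)
```

Leaving out `references` produces the input shape for reference-free models, which need only the source and the MT output.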




In [7]:
# compute COMET for system 1
# COMET downloads its model checkpoint once per session
!comet-score -s Vienna_Environmental.en-de.test.en -r Vienna_Environmental.en-de.test.de -t Vienna_Environmental.en-de.test.nllb.de

# English source sentences - gold standard translation - system1

2025-06-29 13:25:58.479882: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751203558.508168    2150 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751203558.514060    2150 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-29 13:25:58.537038: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Seed set to 1
Fetching 5 files:   0% 0/5 [00:00<?, ?it/s]
hparams.yaml: 100% 567/567 [00:00<00:00, 4.02MB/s]

README.md: 3.40

In [8]:
#p-values for BLEU/chrF: --paired-bs
#sacrebleu ref.en-de.de -l en-de -i sys1.de sys2.de -m bleu chrf ter --paired-bs

#p-values for COMET: comet-compare
#comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en

## TODO

1. Compare BLEU and chrF for the original and the fine-tuned NMT models (beam size 6).
2. Compare using the p-values and confidence intervals.
3. Compare with COMET.
4. Extra: compare with quality estimation (QE); see https://github.com/Unbabel/COMET/blob/master/MODELS.md

   Reference-free model: `Unbabel/wmt22-cometkiwi-da`. This reference-free model employs a regression approach and is built on top of InfoXLM. It has been trained using direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Like the other models, it generates scores ranging from 0 to 1. Larger versions are also available: `Unbabel/wmt23-cometkiwi-da-xl` with 3.5 billion parameters and `Unbabel/wmt23-cometkiwi-da-xxl` with 10.7 billion parameters.
5. Extra: switch to cometinho (see https://github.com/Unbabel/COMET and https://github.com/Unbabel/COMET/blob/master/MODELS.md).




## Reference file:

- Vienna_Environmental.en-de.test.de

---

## Output files:

- Vienna_Environmental.en-de.test.marian.de (translated using the original model)
- Vienna_Environmental.en-de.test.marian.vienna.beam6.de (translated using the model fine-tuned on the Vienna data)
- Vienna_Environmental.en-de.test.marian.europat.beam6.de (translated using the model fine-tuned on the EuroPat data)


## 1:



In [4]:
%cd /content/BMT_2025S/week10_files/inference

/content/BMT_2025S/week10_files/inference


In [5]:
#1: Compare bleu and chrF for orig and finetuned NMT (beam6)

!sacrebleu Vienna_Environmental.en-de.test.de -l en-de -i Vienna_Environmental.en-de.test.marian.de -m bleu chrf
#gold standard translation - output of system0: (original model)

[
{
 "name": "BLEU",
 "score": 18.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.5.1",
 "verbose_score": "47.5/24.5/13.5/7.9 (BP = 1.000 ratio = 1.163 hyp_len = 2425 ref_len = 2086)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.5.1"
},
{
 "name": "chrF2",
 "score": 57.1,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.5.1",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "0",
 "space": "no",
 "version": "2.5.1"
}
]

In [6]:
#1: Compare bleu and chrF for orig and finetuned NMT (beam6)

!sacrebleu Vienna_Environmental.en-de.test.de -l en-de -i Vienna_Environmental.en-de.test.marian.vienna.beam6.de -m bleu chrf
#gold standard translation - output of system1: (model fine-tuned on Vienna_Environmental.en-de.train.json)

[
{
 "name": "BLEU",
 "score": 23.2,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.5.1",
 "verbose_score": "53.6/29.4/17.6/10.5 (BP = 1.000 ratio = 1.039 hyp_len = 2168 ref_len = 2086)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.5.1"
},
{
 "name": "chrF2",
 "score": 58.2,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.5.1",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "0",
 "space": "no",
 "version": "2.5.1"
}
]

In [7]:
#1: Compare bleu and chrF for orig and finetuned NMT (beam6)

!sacrebleu Vienna_Environmental.en-de.test.de -l en-de -i Vienna_Environmental.en-de.test.marian.europat.beam6.de -m bleu chrf
#gold standard translation - output of system2: (model fine-tuned on EuroPat.de-en.20k.train.json)

[
{
 "name": "BLEU",
 "score": 15.0,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.5.1",
 "verbose_score": "47.2/21.1/10.0/5.0 (BP = 1.000 ratio = 1.050 hyp_len = 2190 ref_len = 2086)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.5.1"
},
{
 "name": "chrF2",
 "score": 51.8,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.5.1",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "0",
 "space": "no",
 "version": "2.5.1"
}
]
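
The three runs above are easier to compare side by side. The sketch below collects the reported BLEU and chrF2 scores and prints each fine-tuned model's difference from the original Marian model. The dictionary keys and the function name `deltas_vs_baseline` are hypothetical labels for this example. The differences alone say nothing about statistical significance, which still requires `--paired-bs`.

```python
# Scores copied from the three sacrebleu outputs above.
scores = {
    "marian (original)": {"BLEU": 18.8, "chrF2": 57.1},
    "marian.vienna.beam6": {"BLEU": 23.2, "chrF2": 58.2},
    "marian.europat.beam6": {"BLEU": 15.0, "chrF2": 51.8},
}

def deltas_vs_baseline(scores, baseline):
    """Return each system's score difference from the baseline, per metric."""
    base = scores[baseline]
    return {
        name: {metric: round(value - base[metric], 1) for metric, value in metrics.items()}
        for name, metrics in scores.items()
        if name != baseline
    }

for name, delta in deltas_vs_baseline(scores, "marian (original)").items():
    print(f"{name}: BLEU {delta['BLEU']:+.1f}, chrF2 {delta['chrF2']:+.1f}")
```

On these numbers, fine-tuning on the in-domain Vienna data raises both metrics, while fine-tuning on EuroPat lowers both relative to the original model.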

## 2:

In [None]:
#Compare with the p-values and CI

!sacrebleu Vienna_Environmental.en-de.test.de -l en-de -i Vienna_Environmental.en-de.test.marian.de Vienna_Environmental.en-de.test.nllb.de -m bleu chrf --paired-bs
#gold standard translation - system0 - system1