# This tutorial introduces several evaluation scores for machine translation and generation.



In [1]:
import evaluate


# BLEU SCORE

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments, generally sentences, by comparing them with a set of good-quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality. Neither intelligibility nor grammatical correctness is taken into account.

Paper: https://aclanthology.org/P02-1040/

For more information about this function, see https://huggingface.co/spaces/evaluate-metric/bleu

In [36]:
bleu = evaluate.load('bleu')
predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
results = bleu.compute(predictions=predictions, references=references)

In [37]:
results

{'bleu': 0.41180376356915777,
 'precisions': [0.6111111111111112,
  0.47058823529411764,
  0.375,
  0.26666666666666666],
 'brevity_penalty': 1.0,
 'length_ratio': 1.125,
 'translation_length': 18,
 'reference_length': 16}

max_order is the maximum n-gram order used when computing the score. The default value is 4, which is why the default score is called BLEU-4 in many papers.

In [41]:
bleu.compute(predictions=predictions, references=references, max_order=1)

{'bleu': 0.6111111111111112,
 'precisions': [0.6111111111111112],
 'brevity_penalty': 1.0,
 'length_ratio': 1.125,
 'translation_length': 18,
 'reference_length': 16}

In [43]:
bleu.compute(predictions=predictions, references=references, max_order=2)

{'bleu': 0.5362664443598958,
 'precisions': [0.6111111111111112, 0.47058823529411764],
 'brevity_penalty': 1.0,
 'length_ratio': 1.125,
 'translation_length': 18,
 'reference_length': 16}
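
The numbers above can be reproduced by hand. The following is a minimal sketch, not the evaluate implementation: it handles one prediction and one reference, and it assumes plain whitespace tokenization, which happens to give the same tokens as the library for this punctuation-free sentence pair. The names `ngram_counts` and `simple_bleu` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=4):
    """Single-sentence, single-reference BLEU (illustrative sketch only)."""
    pred = prediction.split()  # assumption: whitespace tokenization
    ref = reference.split()
    if not pred:
        return 0.0, [], 0.0

    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped ("modified") precision: an n-gram is credited at most
        # as many times as it occurs in the reference.
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        precisions.append(overlap / total if total else 0.0)

    # The brevity penalty only punishes predictions shorter than the reference.
    if len(pred) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))

    # Geometric mean of the n-gram precisions (zero if any precision is zero).
    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        geo_mean = 0.0
    return brevity_penalty * geo_mean, precisions, brevity_penalty


prediction = "It is a guide to action which ensures that the military always obeys the commands of the party"
reference = "It is a guide to action that ensures that the military will forever heed Party commands"

for order in (1, 2, 4):
    score, precisions, bp = simple_bleu(prediction, reference, max_order=order)
    print(order, round(score, 4), bp)
```

With this sentence pair the sketch gives roughly 0.6111, 0.5363 and 0.4118 for max_order 1, 2 and 4, in line with the outputs shown above. The matching is case-sensitive, so "party" and "Party" do not count as a match, which is why the unigram precision is 11/18.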

# METEOR SCORE

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a machine translation evaluation metric, which is calculated based on the harmonic mean of precision and recall, with recall weighted more than precision.

METEOR is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.

While the correlation between METEOR and human judgments was measured for Chinese and Arabic and found to be significant, further experimentation is needed to check its correlation for other languages.

Paper: https://aclanthology.org/W05-0909.pdf

For more information about this function, see https://huggingface.co/spaces/evaluate-metric/meteor

In [2]:
meteor = evaluate.load('meteor')
predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
results = meteor.compute(predictions=predictions, references=references)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\qishe\AppData\Roaming\nltk_data...
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\qishe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\qishe\AppData\Roaming\nltk_data...


In [3]:
results

{'meteor': 0.6944444444444445}
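
To see where this value comes from, the sketch below applies the METEOR formula to pre-computed counts. It assumes the parameterization that NLTK uses by default (alpha=0.9, beta=3, gamma=0.5). The function name `meteor_from_counts` is hypothetical, and the counts of 12 matched unigrams in 6 contiguous chunks were worked out by hand for this sentence pair under the assumption of case-insensitive matching and a simple greedy alignment, so treat them as illustrative inputs rather than library output.

```python
def meteor_from_counts(matches, chunks, hyp_len, ref_len,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR score from alignment counts (illustrative sketch only).

    matches: number of aligned unigrams between prediction and reference
    chunks:  number of contiguous, identically ordered runs of matches
    hyp_len: number of tokens in the prediction
    ref_len: number of tokens in the reference
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Weighted harmonic mean; alpha close to 1 weights recall more heavily.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks per match means worse word order.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# Assumed counts for the example pair: 12 matches in 6 chunks,
# 18 prediction tokens, 16 reference tokens.
score = meteor_from_counts(matches=12, chunks=6, hyp_len=18, ref_len=16)
print(round(score, 4))
```

Here precision is 12/18 and recall is 12/16, the weighted harmonic mean is about 0.7407, and the fragmentation penalty is 0.5 * (6/12)^3 = 0.0625, which gives about 0.6944.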

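In [18]:
# Import under a distinct name so the evaluate metric object `meteor` is not overwritten
from nltk.translate.meteor_score import meteor_score
from nltk import word_tokenize

In [25]:
# NLTK expects (references, hypothesis): a list of tokenized references first, then the tokenized prediction
meteor_score([word_tokenize(references[0])], word_tokenize(predictions[0]))

0.6944444444444445

With the arguments in this order, calling NLTK directly gives the same value as evaluate, which wraps the same NLTK function. Passing the prediction as the reference (and vice versa) swaps precision and recall and yields a different score.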

# BERT SCORE

BERTScore is an automatic evaluation metric for text generation that computes a similarity score for each token in the candidate sentence with each token in the reference sentence. It leverages the pre-trained contextual embeddings from BERT models and matches words in candidate and reference sentences by cosine similarity.

Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
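
The greedy matching described above can be sketched with toy vectors. This is a simplified illustration under stated assumptions, not the bert_score implementation: the two-dimensional "embeddings" are made up in place of real contextual embeddings, the function names `cosine` and `greedy_bertscore` are hypothetical, and optional features such as IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy token matching (sketch only)."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up token embeddings: three candidate tokens, two reference tokens.
candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]

precision, recall, f1 = greedy_bertscore(candidate, reference)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

In this toy case every reference token has an exact counterpart in the candidate, so recall is 1, while the extra candidate token only partially matches and pulls precision below 1.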

The original BERTScore paper showed that BERTScore correlates well with human judgment on sentence-level and system-level evaluation, but this depends on the model and language pair selected.

Paper: https://openreview.net/pdf?id=SkeHuCVFDr

For more information about this function, see https://huggingface.co/spaces/harshhpareek/bertscore

In [33]:
bertscore = evaluate.load("bertscore")
results = bertscore.compute(predictions=predictions, references=references, lang="en")

In [34]:
results

{'precision': [0.9492406249046326],
 'recall': [0.9543976783752441],
 'f1': [0.9518121480941772],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.24.0)'}

We can change the underlying model with the model_type parameter.

In [35]:
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")
print(results)

{'precision': [0.9270912408828735], 'recall': [0.9384735822677612], 'f1': [0.9327476620674133], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.24.0)'}


Reference: Hugging Face

https://huggingface.co/docs/evaluate/index