# How Reliable is MT Evaluation?

## Note: This is an assignment for the course 'Multilinguality in NLP' at University of Paris Cité, taught by Professor Guillaume Wisniewski.


---



# Introduction
BLEU is the de facto evaluation metric used in MT. It is, for instance, used by Meta AI in their “No Language Left Behind” (NLLB) initiative to claim that they achieve “an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system” [3].

# The goal of this lab
The goal of this lab is to understand the inner workings of BLEU and to show how relying on this value can easily lead to wrong analyses: a thorough analysis of the NLLB results [2] concludes that “many of Meta AI claims made in NLLB are : unfounded, misleading, and the result of a deeply flawed evaluation.”

# Using the WMT’15 test sets, evaluate the performance of mBart and MarianMT.

## Installation

In [None]:
!pip install beautifulsoup4
!pip install -U spacy
!python -m spacy download en_core_web_sm
!pip install transformers
!pip install sentencepiece

In [None]:
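!git clone https://github.com/google/sentencepiece.git
# Each `!` command runs in its own subshell, so a bare `!cd` does not persist.
# Chain the build steps so they all run inside sentencepiece/build.
!cd sentencepiece && mkdir -p build && cd build && cmake .. && make -j $(nproc) && sudo make install && sudo ldconfig -v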

# Data load

In [None]:
from bs4 import BeautifulSoup

data_ls = [ "enfr-ref.fr","enfr-src.en","fren-ref.en","fren-src.fr"]

text={}
for data in data_ls:
  with open("/content/newsdiscusstest2015-"+data+".sgm", 'r', encoding='utf-8') as file:
    data_content = file.read()
    soup = BeautifulSoup(data_content, 'html.parser')
    text[data]=[segment.get_text() for segment in soup.find_all('seg')]
  print(len(text[data]))

1500
1500
1500
1500


# Translation

## Marian

In [None]:
from transformers import MarianTokenizer, MarianMTModel

model_name = 'Helsinki-NLP/opus-mt-romance-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_marian(sentence):
  tokenized_text = tokenizer(sentence, return_tensors="pt", padding=True)
  translated_tokens = model.generate(**tokenized_text)
  translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
  return translated_text

In [None]:
# translated_fren = [translate_marian(sent) for sent in text["fren-src.fr"]]

In [None]:
# translated_fren[:10]

In [None]:
# with open("translated_fren_marian.txt", "w", encoding="utf-8") as file:
#   for sent in translated_fren:
#     file.write(sent+"\n")

In [None]:
# from transformers import MarianTokenizer, MarianMTModel

# model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
# tokenizer = MarianTokenizer.from_pretrained(model_name)
# model = MarianMTModel.from_pretrained(model_name)

# def translate_marian(sentence):
#   tokenized_text = tokenizer(">>fr<< "+sentence, return_tensors="pt", padding=True)
#   translated_tokens = model.generate(**tokenized_text)
#   translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
#   return translated_text

In [None]:
# translated_enfr = [translate_marian(sent) for sent in text["enfr-src.en"]] # 41

In [None]:
# with open("translated_enfr_marian.txt", "w", encoding="utf-8") as file:
#   for sent in translated_enfr:
#     file.write(sent+"\n")

## MBart

In [None]:
# import torch
# from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model_name = "facebook/mbart-large-50-many-to-many-mmt"
# # model_name = "facebook/mbart-large-cc25"
# model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)
# tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
# tokenizer.src_lang = "en_XX"

# def translate_mbart(sentence):
#     inputs = tokenizer(sentence, return_tensors="pt").to(device)
#     translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
#     translated_sentence = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

#     return translated_sentence


In [None]:
# translated_enfr = [translate_mbart(sent) for sent in text["enfr-src.en"]]

In [None]:
# with open("translated_enfr_mbart.txt", "w", encoding="utf-8") as file:
#   for sent in translated_enfr:
#     file.write(sent+"\n")

Load the saved translation files back into lists:

In [None]:
file_ls = ["translated_fren_mbart","translated_enfr_mbart","translated_fren_marian","translated_enfr_marian"]
translated_text={}

for filename in file_ls:
  with open("/content/"+filename+".txt", encoding="utf-8") as file:
    content = file.read()
    # Drop the empty string left after the final newline
    translated_text[filename] = content.split("\n")[:-1]


In [None]:
# for filename in file_ls:print(len(translated_text[filename]))

# BLEU evaluation

In [None]:
!pip install nltk


In [None]:
import nltk
nltk.download('punkt')


In [None]:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

def bleu(machine_translations, human_translations, language):
  tokenized_machine_translations = [nltk.word_tokenize(sentence, language) for sentence in machine_translations]
  tokenized_human_translations = [nltk.word_tokenize(sentence, language) for sentence in human_translations]
  # Each reference translation should be in a list of lists
  references = [[ref] for ref in tokenized_human_translations]
  hypotheses = tokenized_machine_translations
  bleu_score = corpus_bleu(references, hypotheses)
  return round(bleu_score,2)


In [None]:
bleu(translated_text["translated_fren_mbart"], text['fren-ref.en'], "english")

In [None]:
# for i in range(30):
#   print(translated_text["translated_fren_mbart"][i])
#   print(text['fren-ref.en'][i])

In [None]:
bleu(translated_text["translated_enfr_mbart"], text['enfr-ref.fr'], "french")

In [None]:
bleu(translated_text["translated_fren_marian"], text['fren-ref.en'], "english")

In [None]:
bleu(translated_text["translated_enfr_marian"], text['enfr-ref.fr'], "french")

In [None]:
bleu(text['enfr-ref.fr'], text['fren-src.fr'], "french")

In [None]:
bleu(text['fren-ref.en'], text['enfr-src.en'], "english")

Results (NLTK corpus BLEU, on a 0 to 1 scale):

| Model | English to French | French to English |
|--------|-------------------|---------------------|
| **mBart** | 0.32 | 0.36 |
| **MarianMT** | 0.37 | 0.39 |


Analysis:

The two models are trained differently. MarianMT is a neural machine translation system trained in a supervised way on parallel data (pairs of a source sentence and its translation in the target language). mBART, in contrast, is pretrained on monolingual data from many languages, learning to reconstruct masked (or noised) sentences. In doing so it internalizes linguistic features shared across languages, which is what makes translation between language pairs not seen during training (zero-shot translation) possible. The checkpoint used here, "facebook/mbart-large-50-many-to-many-mmt", is additionally fine-tuned for translation between many language pairs.

I believe MarianMT performs better because its strength lies in models optimized for specific language pairs, in our case Romance languages and English ('Helsinki-NLP/opus-mt-romance-en'). mBART covers many languages at once, so it likely shows its strengths when requests involve a wide variety of languages, especially low-resource ones. French-English translation, however, is a high-resource task with a fixed language pair, so it is not surprising that mBART does not surpass MarianMT here.

## Bigram mismatches

As noticed by [1], Bleu places no explicit constraints on the order that matching n-grams occur in. It is therefore possible, given a sentence, to generate many new sentences with at least as many n-gram matches by permuting words around bigram mismatches.

### Explain on an example why such permutations will never decrease the Bleu score.

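The BLEU score is obtained by
$\text{BLEU} = BP \times \exp\left(\sum_{n=1}^{N} w_n \log(P_n)\right)$
where $P_n$ is the number of n-grams in the hypothesis that also occur in the reference (each counted at most as often as it occurs in the reference) divided by the total number of n-grams in the hypothesis, $w_n$ is the weight of order $n$ (usually $1/N$ with $N = 4$), and $BP$ is the brevity penalty applied to hypotheses shorter than the reference.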
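
The formula can be sketched in a few lines of Python. This is a minimal illustration, not the NLTK or sacreBLEU implementation: it assumes a single reference, uniform weights $w_n = 1/N$, no smoothing, and already tokenized input. The function names (`ngrams`, `modified_precision`, `sentence_bleu_sketch`) are made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """P_n: clipped n-gram matches divided by the number of hypothesis n-grams."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

def sentence_bleu_sketch(hyp, ref, max_n=4):
    """BLEU with uniform weights, one reference and no smoothing (assumptions)."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    # Brevity penalty: 1 if the hypothesis is longer than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(sentence_bleu_sketch(ref, ref))
print(modified_precision(["the", "the", "the"], ["the", "cat"], 1))
```

Nothing in `modified_precision` looks at where an n-gram occurs in the sentence, which is the property the next question relies on.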

Let's see this with an example of French-English translation.
- source sentence : *Si vous avez d'autres questions ou besoin de plus d'exemples, n'hésitez pas à demander !*

- reference translation : *If you have any other questions or need more examples, don't be afraid to ask!*

- hypothesis translation A: *If you have any further questions or need more examples, please don't hesitate to ask!*

With NLTK's word tokenizer, hypothesis A has 18 tokens and therefore 17 bigrams, 11 of which also occur in the reference translation. The matching bigrams lie inside the chunks marked by parentheses:
	*(If you have any) further (questions or need more examples,) please (don't) hesitate (to ask!)*

However, the translation below reaches at least the same score:
- hypothesis translation B:
  *(don't) (questions or need more examples,) (If you have any) please hesitate (to ask!) further*
  
This is possible because every matching n-gram lies inside one of the chunks: reordering the chunks leaves all of these matches intact, and a new junction between two chunks can only add a match, never remove one. The sentence length is unchanged as well, so the brevity penalty stays the same. Therefore, even though translations A and B are totally different in grammatical correctness and meaning (B almost sounds like the opposite suggestion of A), B gets a BLEU score at least as high as A's.
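
The claim can be checked by counting clipped n-gram matches for both hypotheses. The sketch below is only an illustration: the token lists are written by hand in the style of NLTK's word tokenizer (an assumption, so no tokenizer download is needed), hypothesis B is a chunk permutation of A that keeps the comma attached to "examples", and `clipped_matches` is a helper name invented here.

```python
from collections import Counter

def clipped_matches(hyp, ref, n):
    """Number of hypothesis n-grams also found in the reference (clipped counts)."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp_counts, ref_counts = counts(hyp), counts(ref)
    return sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())

# Hand-written tokens, assumed to follow NLTK's word tokenizer conventions
ref = "If you have any other questions or need more examples , do n't be afraid to ask !".split()
hyp_a = "If you have any further questions or need more examples , please do n't hesitate to ask !".split()
hyp_b = "do n't questions or need more examples , If you have any please hesitate to ask ! further".split()

for n in range(1, 5):
    print(n, clipped_matches(hyp_a, ref, n), clipped_matches(hyp_b, ref, n))
```

Under this tokenization both hypotheses have the same number of matches for every order from 1 to 4 and the same length, so any BLEU computed from these counts is identical for A and B.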


### Given a sentence with *n* words and *b* bigram mismatches, how many sentences can you generate with this principle. Compute the number of sentences you will obtain on the WMT’15 test set.

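A sentence with n words has n-1 bigrams. The b mismatching bigrams cut it into b+1 chunks inside which all bigrams match, and these chunks can be ordered in (b+1)! ways. Equivalently, writing m = n-1-b for the number of matching bigrams, this is (n-m)!.

The French-English data of the WMT’15 test set has 1500 sentences. Comparing the English reference translations of WMT’15 with the MarianMT output, the reference translation of the 1500 sentences contains 27097 tokens according to the NLTK word tokenizer, and there are 14111 unmatching bigrams. Since chunks are permuted within each sentence, the number of corpus variants with at least the same BLEU score is the product of (b_i+1)! over all sentences i, where b_i is the number of unmatching bigrams in sentence i. A single factorial over the corpus totals, such as (27097 - 14111)!, treats the whole corpus as one sentence and greatly overestimates this number.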
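
One way to make the count concrete is to work per sentence: b mismatching bigrams cut a hypothesis into b+1 chunks, giving (b+1)! orderings, and the per-sentence counts multiply across the corpus. The sketch below computes this in log space, because the product is far too large to print. It assumes pre-tokenized sentences and clipped bigram matching; `count_mismatches` and `log10_permutations` are hypothetical helper names.

```python
import math
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def count_mismatches(hyp, ref):
    """Number of hypothesis bigrams that do not occur in the reference (clipped)."""
    hyp_counts, ref_counts = Counter(bigrams(hyp)), Counter(bigrams(ref))
    matches = sum(min(c, ref_counts[bg]) for bg, c in hyp_counts.items())
    return len(bigrams(hyp)) - matches

def log10_permutations(pairs):
    """log10 of the product over sentences of (b_i + 1)!."""
    total = 0.0
    for hyp, ref in pairs:
        b = count_mismatches(hyp, ref)
        total += math.lgamma(b + 2) / math.log(10)  # lgamma(b + 2) = ln((b + 1)!)
    return total

# Hand-written tokens in the style of NLTK's word tokenizer (assumption)
ref = "If you have any other questions or need more examples , do n't be afraid to ask !".split()
hyp = "If you have any further questions or need more examples , please do n't hesitate to ask !".split()

b = count_mismatches(hyp, ref)
print(b, math.factorial(b + 1), log10_permutations([(hyp, ref)]))
```

On the test set, `pairs` would be the 1500 (hypothesis, reference) token lists; the result is the number of decimal digits, roughly, of the number of equally scored corpus variants.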

In [None]:
def bigram(text):
  # Return one list of bigrams per sentence, using NLTK's word tokenizer
  return [list(nltk.bigrams(nltk.word_tokenize(sent))) for sent in text]

# Count matching bigrams, then subtract them from the total to get the number of unmatching bigrams
ref_bi = bigram(text['fren-ref.en'])
hypo_bi = bigram(translated_text["translated_fren_marian"])

count = 0
for h, r in zip(hypo_bi, ref_bi):
  for ele in h:
    if ele in r: count+=1

total = sum(len(h) for h in hypo_bi)
unmatching = total - count


In [None]:
# nb_bigram = 0
# for bigrams_for_sent in hypo_bi:
#   nb_bigram += len(bigrams_for_sent)

# nb_bigram

In [None]:
# token_nb = 0
# for sent in translated_text["translated_fren_marian"] :
#   token_nb += len(nltk.word_tokenize(sent))

# token_nb

# SacreBleu

In [None]:
!pip install sacrebleu

In [None]:
# ! sacrebleu -t wmt15 -l en-fr -i translated_enfr_marian.txt -b
# ! sacrebleu -t wmt15 -l fr-en -i translated_fren_marian.txt -b
# ! sacrebleu -t wmt15 -l en-fr -i translated_enfr_mbart.txt -b
# ! sacrebleu -t wmt15 -l fr-en -i translated_fren_mbart.txt -b

|        | English to French | French to English |
|--------|-------------------|---------------------|
| **mBart**  |    32.6              |    36.1                |
| **MarianMT** |        37.8           |      38.7               |

There is only a small difference between the NLTK BLEU and sacreBLEU results once the NLTK scores (on a 0 to 1 scale) are multiplied by 100: all four results differ by less than 1 BLEU point. I assume that the tokenization method is not that influential for the evaluation of our translation results.
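
The comparison can be made explicit by putting both sets of scores on the same 0 to 100 scale. The numbers below are copied from the two result tables of this notebook; the dictionary keys are labels chosen for this example.

```python
# Scores copied from the two result tables above
nltk_bleu = {"mBart en-fr": 0.32, "mBart fr-en": 0.36,
             "MarianMT en-fr": 0.37, "MarianMT fr-en": 0.39}
sacre_bleu = {"mBart en-fr": 32.6, "mBart fr-en": 36.1,
              "MarianMT en-fr": 37.8, "MarianMT fr-en": 38.7}

# NLTK reports BLEU in [0, 1]; sacreBLEU reports it in [0, 100]
differences = {k: round(sacre_bleu[k] - 100 * nltk_bleu[k], 1) for k in nltk_bleu}
for system, diff in differences.items():
    print(f"{system}: {diff:+.1f} BLEU points")
```

Because the NLTK scores were rounded to two decimals, each of them is only known to within about half a BLEU point, so differences of this size should be read with that in mind.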

- When considering the “raw” translation hypotheses and references:

In [None]:
data_ls = [ "enfr-ref.fr","enfr-src.en","fren-ref.en","fren-src.fr"]

for data in data_ls:
  with open(str(data)+".txt", "w", encoding="utf_8") as file:
    for sent in text[data]:
      file.write(sent+ "\n")



In [None]:
! sacrebleu -t wmt15 -l en-fr -i fren-src.fr.txt -b


- When the translation hypotheses and references have been tokenized in subword units:

In [None]:
from transformers import BertTokenizer

# Load the tokenizer once; reloading it for every sentence is very slow
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def subword_tokenization(sentence):
  # Split the sentence into WordPiece subwords and join them with spaces
  tokens = bert_tokenizer.tokenize(sentence)
  return " ".join(tokens)


In [None]:
tokenized_sentences = {}
for key in translated_text.keys():
# for key in list(translated_text.keys()):
  tokenized_sentences[key] = [subword_tokenization(sent) for sent in translated_text[key]] #30m


In [None]:
# tokenized_sentences['translated_enfr_marian'][:10]

In [None]:
# for key in tokenized_sentences.keys():
#   with open("tokenized_"+str(key)+".txt", "w", encoding = "utf-8") as file:
#     for sent in tokenized_sentences[key]:
#       file.write(sent+"\n")

In [None]:
tokenized_ref = {}
keys = ["enfr-ref.fr", "fren-ref.en"]
for key in keys:
# for key in list(translated_text.keys()):
  tokenized_ref[key] = [subword_tokenization(sent) for sent in text[key]]


In [None]:
# for key in tokenized_ref.keys():
#   with open("tokenized_" +str(key)+ ".txt", "w", encoding = "utf-8") as file:
#     for sent in tokenized_ref[key]:
#       file.write(sent+"\n")

In [None]:
!cat tokenized_translated_enfr_marian.txt | sacrebleu --force tokenized_enfr-ref.fr.txt
!cat tokenized_translated_fren_marian.txt | sacrebleu --force tokenized_fren-ref.en.txt
!cat tokenized_translated_enfr_mbart.txt | sacrebleu --force tokenized_enfr-ref.fr.txt
!cat tokenized_translated_fren_mbart.txt | sacrebleu --force tokenized_fren-ref.en.txt

|        | English to French | French to English |
|--------|-------------------|---------------------|
| **mBart** | 55.5 | 41.1 |
| **MarianMT** | 60.5 | 44.4 |

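Subword tokenization gives a higher score for all models and both language directions (the largest gain, 22.9 points, is for mBart on English to French). This is expected: subword tokenization produces more and smaller tokens, so a word that does not match as a whole can still contribute matching pieces. Note also that the "bert-base-uncased" tokenizer lowercases the text and strips accents, which likely removes further mismatches, especially for French.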
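
A toy example shows the effect. The subword segmentation below is written by hand in a WordPiece-like style and is purely hypothetical (it is not the output of "bert-base-uncased"); `unigram_precision` is a helper defined only for this illustration.

```python
from collections import Counter

def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram matches divided by the number of hypothesis tokens."""
    hyp_counts, ref_counts = Counter(hyp_tokens), Counter(ref_tokens)
    matched = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return matched / len(hyp_tokens)

ref_words = "the translations were evaluated".split()
hyp_words = "the translation was evaluated".split()

# Hypothetical, hand-written WordPiece-style segmentation (not real tokenizer output)
ref_subwords = "the translation ##s were evaluat ##ed".split()
hyp_subwords = "the translation was evaluat ##ed".split()

word_p = unigram_precision(hyp_words, ref_words)
subword_p = unigram_precision(hyp_subwords, ref_subwords)
print(word_p, subword_p)
```

At word level "translation" and "translations" do not match at all, while at subword level the shared piece "translation" counts as a match, so the precision rises without the translation being any better.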

- When the translation hypotheses and references have been tokenized in characters (this amounts to adding a space between each character of the references and of the translation hypotheses):

In [None]:
def char_tokenize(sent):
  chars = [ch for ch in sent]
  return " ".join(chars)

In [None]:
ch_tokenized = {}
for key in translated_text.keys():
# for key in list(translated_text.keys()):
  ch_tokenized[key] = [char_tokenize(sent) for sent in translated_text[key]]


In [None]:
# ch_tokenized = {}
keys = ["enfr-ref.fr", "fren-ref.en"]
for key in keys:
# for key in list(translated_text.keys()):
  ch_tokenized[key] = [char_tokenize(sent) for sent in text[key]]


In [None]:
# ch_tokenized["translated_fren_mbart"][:10]

In [None]:
for key in ch_tokenized.keys():
  with open("ch_" +str(key)+ ".txt", "w", encoding = "utf-8") as file:
    for sent in ch_tokenized[key]:
      file.write(sent+"\n")

In [None]:
# !cat ch_translated_enfr_marian.txt | sacrebleu --force ch_enfr-ref.fr.txt
# !cat ch_translated_fren_marian.txt | sacrebleu --force ch_fren-ref.en.txt
# !cat ch_translated_enfr_mbart.txt | sacrebleu --force ch_enfr-ref.fr.txt
# !cat ch_translated_fren_mbart.txt | sacrebleu --force ch_fren-ref.en.txt

|        | English to French | French to English |
|--------|-------------------|---------------------|
| **mBart** | 64.1 | 65.1 |
| **MarianMT** | 68.6 | 67.1 |

Similarly, when the tokenization unit shrinks to single characters, the scores become much higher for all four model and language-direction combinations.
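
The same toy sentence pair illustrates why character-level scores are so high. The sketch reuses the notebook's `char_tokenize` idea and compares 4-gram precision at word and character level; it assumes plain whitespace splitting of the tokenized strings, and `ngram_precision` is a helper name made up for this example.

```python
from collections import Counter

def char_tokenize(sent):
    # Put a space between every character, as in the notebook
    return " ".join(sent)

def ngram_precision(hyp, ref, n=4):
    """Clipped n-gram precision of whitespace-tokenized strings."""
    def counts(text):
        tokens = text.split()  # split() also discards the original spaces
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp_counts, ref_counts = counts(hyp), counts(ref)
    total = sum(hyp_counts.values())
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total if total else 0.0

ref = "the translations were evaluated"
hyp = "the translation was evaluated"

word_p = ngram_precision(hyp, ref)
char_p = ngram_precision(char_tokenize(hyp), char_tokenize(ref))
print(word_p, char_p)
```

A character 4-gram covers less than one word, so it behaves more like a partial word match than like a 4-word phrase: the word-level 4-gram precision of this pair is zero, while most of its character 4-grams match.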