ELEC-E5550 - Statistical Natural Language Processing
# SET 6: Machine Translation Evaluation

# Released: 26.03.2020
# Deadline: 9.04.2020 at midnight

After completing this assignment, you will know how to evaluate an MT system automatically using the BLEU score.

KEYWORDS:
* BLEU

With Machine Translation, it is difficult not only to find good data to train the system on, but also to evaluate the system's output. The ideal way would be to ask a human to score the system's output. This approach is rarely feasible, especially at the stage when you are only tuning the system. Instead of human assessments, we can use a quick automatic approximation of them.

## TASK 1
## BLEU score

**BLEU**, short for bilingual evaluation understudy (https://www.aclweb.org/anthology/P02-1040.pdf), is an algorithm for automatically evaluating translations produced by an MT system. Given a reference translation (or several) produced by a human, BLEU evaluates how close the machine-produced output is to the human translation on a scale from 0 to 1 (the larger the better).

BLEU counts N-gram overlaps between the machine translation output and its reference translation. These matches are position-independent: a match can happen anywhere in the reference sentence (or sentences). The more matches, the better the candidate translation is. We also want to punish the system for giving out a translation that only has a couple of words (think about why; it will be a question in the last task).
The score is calculated in the following way:

$BLEU = \min\left(1, \frac{\text{output-length}}{\text{reference-length}}\right)\left(\prod_{i=1}^{N} \text{precision}_{i}\right)^{\frac{1}{N}}$

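As you probably remember from the IR assignment, precision measures how many of the given answers were right. It is calculated in the following way:
Precision = $\frac{tp}{tp + fp}$, where $tp$ is the number of n-grams in the translation that match a reference and $fp$ is the number of n-grams in the translation that do not.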
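
To make the position-independent matching and the precision concrete, here is a minimal sketch. The helper name `extract_ngrams` and the example sentences are made up for illustration and are not part of the assignment code; the sketch also ignores repeated n-grams, which need extra care.

```python
def extract_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# hypothetical example sentences, already tokenized
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
reference = ['on', 'the', 'mat', 'sat', 'the', 'cat']

candidate_bigrams = extract_ngrams(candidate, 2)
reference_bigrams = set(extract_ngrams(reference, 2))

# a bigram counts as matched wherever it occurs in the reference
matched = [bigram for bigram in candidate_bigrams if bigram in reference_bigrams]

# precision = tp / (tp + fp) = matched bigrams / all candidate bigrams
precision = len(matched) / len(candidate_bigrams)
print(matched)
print(precision)
```

Three of the five candidate bigrams occur somewhere in the reference, so the bigram precision in this example is 3/5, even though the word order of the two sentences differs.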

So for the N-gram case, where N is 4, the computation of the BLEU score will be:
* STEP 1: compute the unigram precision: how many words of the translation were seen in the references.
* STEP 2: compute the 2-gram precision: how many bi-grams out of all bi-grams in the translation were seen among the bi-grams from the references.
* STEP 3: compute the 3-gram precision
* STEP 4: compute the 4-gram precision
* STEP 5: combine the four precisions by taking their geometric mean (the 4th root of their product)
* STEP 6: apply the brevity penalty
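
Steps 5 and 6 can be sketched as follows, assuming the four precisions from steps 1 to 4 are already known. The numbers below are hypothetical and only serve to show the arithmetic.

```python
import math

# hypothetical n-gram precisions for n = 1..4 and sentence lengths
precisions = [3 / 4, 2 / 3, 1 / 2, 1 / 4]
output_length = 4
reference_length = 4

# STEP 5: geometric mean of the precisions (the N-th root of their product)
geometric_mean = math.prod(precisions) ** (1 / len(precisions))

# STEP 6: brevity penalty, capped at 1
penalty = min(1, output_length / reference_length)

bleu = penalty * geometric_mean
print(round(bleu, 3))
```

Here the product of the precisions is 1/16, its 4th root is 1/2, and the penalty is 1 because the output is as long as the reference.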

### Count n-gram matches
## 1.1
Write a function that will estimate the n-gram precision. Once a reference n-gram has been matched to an n-gram of the translation, it should be considered exhausted (you can't match anything new to it).
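
The exhaustion rule can be illustrated with a small sketch for unigrams and a single reference. The sentences are made up, and this is only one possible way to express the rule, not the required solution.

```python
from collections import Counter

# hypothetical example: the translation repeats a word more often than the reference
translation = ['the', 'the', 'the', 'cat']
reference = ['the', 'cat', 'sat']

translation_counts = Counter(translation)
reference_counts = Counter(reference)

# each reference unigram can be matched only once, so the three occurrences
# of 'the' in the translation earn a single match
matches = {word: min(count, reference_counts[word])
           for word, count in translation_counts.items()}
precision = sum(matches.values()) / len(translation)
print(matches)
print(precision)
```

Without the exhaustion rule all four words would count as matched and the precision would be 1; with it, only two of the four words match.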

In [1]:
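from collections import Counter


def count_ngram_matches(translated_sentence, reference_sentences, n):
    """
    this function takes in a translated sentence, a list of reference sentences
    and a length of ngrams we want to look for,
    and outputs the proportion of ngrams from the translated sentence that were found in the references.
    
    INPUT:
    translated_sentence - a sentence that is being compared to the references (list of strings)
    reference_sentences - a list of reference sentences (list of lists of strings)
    n - the length of ngrams that are compared (integer)
    OUTPUT:
    precision - the fraction of matched ngrams out of all ngrams in the translated sentence (float)
    """
    # YOUR CODE HERE
    def ngram_counts(sentence):
        # count every n-gram (as a tuple of tokens) in one sentence
        return Counter(tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1))

    translation_counts = ngram_counts(translated_sentence)
    total = sum(translation_counts.values())
    if total == 0:
        # the translation is shorter than n, so there is nothing to match
        return 0.0

    # for every n-gram keep its highest count in any single reference, so that
    # a reference n-gram is exhausted once it has been matched
    reference_counts = Counter()
    for reference in reference_sentences:
        reference_counts |= ngram_counts(reference)

    # Counter intersection takes the element-wise minimum, i.e. the clipped matches
    tp = sum((translation_counts & reference_counts).values())
    precision = tp / total
    return precision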

In [2]:
from numpy.testing import assert_almost_equal
from nose.tools import assert_equal


dummy_translated_sentence = ['a','b','c','d','3']
dummy_ref_sentence = [['a','b','c','4']]
# CHECKING THE GENERAL PROPERTIES OF THE OUTPUT
# check that the output is a float number
assert_equal(type(count_ngram_matches(dummy_translated_sentence, dummy_ref_sentence, 2)), float)

# CHECKING THAT THE FUNCTION IS WORKING AS IT SHOULD
assert_almost_equal(count_ngram_matches(dummy_translated_sentence, dummy_ref_sentence, 1), 0.6, 2)

assert_almost_equal(count_ngram_matches(dummy_translated_sentence, dummy_ref_sentence, 2), 0.5, 2)

assert_almost_equal(count_ngram_matches(dummy_translated_sentence, dummy_ref_sentence, 3), 0.33, 2)


dummy_translated_sentence2 = ['1','2','1']
dummy_ref_sentence2 = [['1','1','3'],['4','3'],['3']]

assert_almost_equal(count_ngram_matches(dummy_translated_sentence2, dummy_ref_sentence2, 1), 0.67 ,2)

### Estimate a brevity penalty
## 1.2
Now let's punish our translation for being too short. Compare the length of the translated sentence with the length of the longest of the references. If the translation is at least as long as the longest reference, your function should output 1.
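
As a worked example with illustrative lengths: for a translation of 11 words and references of 12, 9 and 11 words, the longest reference has 12 words, so

$$\text{penalty} = \min\left(1, \frac{11}{12}\right) \approx 0.92$$

A translation of 15 words compared with the same references is not penalized:

$$\text{penalty} = \min\left(1, \frac{15}{12}\right) = 1$$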

In [3]:
def brevity_penalty(len_translation, len_references):
    """
    this function takes in the length of the translated sentence and a list of lengths of the reference sentences,
    and outputs a brevity penalty compared to the longest out of reference sentences
    
    INPUT:
    len_translation - the number of words in the translated sentence (integer)
    len_references - a list of reference sentence lengths, e.g. [1,3,5] (a list of integers)
    OUTPUT:
    score - brevity penalty score
    """
    # YOUR CODE HERE
    # the smallest ratio is the one against the longest reference; cap it at 1
    score = min(1, len_translation / max(len_references))
    return score

In [4]:
from numpy.testing import assert_almost_equal
from nose.tools import assert_equal


dummy_len_translation = 11
dummy_len_references = [12, 9, 11]
# CHECKING THAT THE FUNCTION IS WORKING AS IT SHOULD

assert_almost_equal(brevity_penalty(dummy_len_translation, dummy_len_references), 0.92, 2)

dummy_len_translation2 = 15
dummy_len_references2 = [12, 9, 11]

assert_equal(brevity_penalty(dummy_len_translation2, dummy_len_references2), 1)


### Combine all elements
## 1.3
Now let's combine everything into one function that outputs the BLEU score. It should take a translated sentence, a list of references and the maximum length of n-grams, and output a BLEU score.

It can happen that none of the n-grams of some order is seen in the reference sentences. If that happens for unigrams, the score should be 0. In the other cases, instead of giving 0 for the precision, smooth it to be $\frac{1}{N}$.
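
The zero-precision rule can be sketched as below. The helper name `smooth_precisions` and the input values are hypothetical; returning `None` to signal a zero score is just one possible convention.

```python
def smooth_precisions(precisions):
    """Hypothetical helper: apply the zero-precision rule to a list of
    n-gram precisions ordered from unigrams up to N-grams."""
    n = len(precisions)
    if precisions[0] == 0:
        # no unigram matched: signal that the whole score is 0
        return None
    # replace every remaining zero precision by 1/N
    return [p if p > 0 else 1 / n for p in precisions]


# every unigram matched, but no bigram did (N = 2)
smoothed = smooth_precisions([1.0, 0.0])
print(smoothed)

# no unigram matched at all
print(smooth_precisions([0.0, 0.0]))
```

With N = 2 the zero bigram precision becomes 1/2, so the geometric mean is the square root of 1/2 instead of 0.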

In [5]:
def BLEU(translation, references, n):
    """
    this function takes in a translated sentence, a list of reference sentences
    and the maximum length of ngrams, and outputs the BLEU score of the translation
    
    INPUT:
    translation - a translated sentence (a list of strings)
    references - a list of reference sentences (a list of lists of strings)
    n - the maximum length of ngrams that are compared (integer)
    OUTPUT:
    score - a BLEU score for the translation (float)
    """
    # YOUR CODE HERE
    ref_len = [len(r) for r in references]
    penalty = brevity_penalty(len(translation), ref_len)

    product = 1.0
    for i in range(1, n + 1):
        p = count_ngram_matches(translation, references, i)
        if p == 0:
            if i == 1:
                # no unigram matched at all, so the whole score is 0
                return 0.0
            # smooth a zero precision of a higher order to 1/N
            p = 1 / n
        product *= p

    # geometric mean of the n precisions, scaled by the brevity penalty
    score = product ** (1 / n) * penalty
    return score

In [6]:
from numpy.testing import assert_almost_equal
from nose.tools import assert_equal


dummy_translated_sentence = ['1','2','3','4']
dummy_ref_sentence = [['1','2','3','5']]
# CHECKING THE GENERAL PROPERTIES OF THE OUTPUT
# check that the output is a float number
assert_equal(type(BLEU(dummy_translated_sentence, dummy_ref_sentence, 2)), float)

# CHECKING THAT THE FUNCTION IS WORKING AS IT SHOULD
assert_almost_equal(BLEU(dummy_translated_sentence, dummy_ref_sentence, 4), 0.5, 3)

dummy_translated_sentence2 = ['1']
dummy_ref_sentence2 = [['1','2','3','5']]

assert_almost_equal(BLEU(dummy_translated_sentence2, dummy_ref_sentence2, 1), 0.25, 3)

dummy_translated_sentence3 = ['1','2']
dummy_ref_sentence3 = [['1','3','1'],['2']]

assert_almost_equal(BLEU(dummy_translated_sentence3, dummy_ref_sentence3, 2), 0.471, 3)


### Your reflection on BLEU score
## 1.4
Briefly answer the following questions:

1. Will providing more references increase the BLEU score? Why?

2. What is the benefit of the brevity penalty? Why do we need it?

3. Will a human translator always get a BLEU score of 1? Why?

4. What are the problems with the way BLEU calculates its score? 

YOUR ANSWER HERE