#### **BLEU** (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text that has been machine-translated (MT) from one language to another (Papineni et al. 2002).

BLEU scores range from 0 to 1, with values closer to 1 indicating that the machine translation is more similar to the reference translations. Invented at IBM in 2001, it was one of the first metrics to claim a high correlation with human judgments of quality, and it remains one of the most popular automated and inexpensive metrics for gauging translation quality.

# BLEU Formula
$$
\textrm{BLEU} = BP \times \exp\left({1\over N} \sum_{n=1}^N \log p_n\right)
$$

Where $N$ is the maximum n-gram order (the implementation in this notebook uses $N = 4$) and:
$$
p_n = {\textrm{Number of n-gram tokens shared by the system and reference translations} \over \textrm{Number of n-gram tokens in the system translation}}
$$

And:

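The brevity penalty is $BP = \min\left(1, \exp(1-r/c)\right)$, where $c$ is the length of the hypothesis translation (in tokens), and $r$ is the length of the *closest* reference translation (also in tokens). A hypothesis that is at least as long as the closest reference is therefore not penalized.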
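As a rough illustration of how these pieces combine, the sketch below evaluates the formula for a set of made-up precisions. The function names `brevity_penalty` and `combine_bleu` and the example numbers are assumptions chosen for illustration only. The penalty is capped at 1 for hypotheses longer than the reference, following Papineni et al. (2002), and a zero precision is treated as a score of zero because $\log 0$ is undefined.

```python
from math import exp, log

def brevity_penalty(c, r):
    """Return 1 when the hypothesis (c tokens) is longer than the
    reference (r tokens), otherwise exp(1 - r / c)."""
    if c > r:
        return 1.0
    return exp(1 - r / c)

def combine_bleu(precisions, c, r):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; treat the score as zero
    n = len(precisions)
    return brevity_penalty(c, r) * exp(sum(log(p) for p in precisions) / n)

# Made-up precisions for n = 1..4, a 9-token hypothesis and a 10-token reference
score = combine_bleu([0.8, 0.6, 0.4, 0.2], c=9, r=10)
print("BP: %.3f" % brevity_penalty(9, 10))
print("BLEU: %.3f" % score)
```

Because the hypothesis here is one token shorter than the reference, the penalty falls slightly below 1 and scales the geometric mean down accordingly.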

##### We need to implement BLEU in Python so that it can be automated in other studies. We start with the hypothesis and reference sentences.



In [1]:
# Import the libraries needed
from math import log, exp
from collections import Counter


In [2]:
hypothesis = "Abandon all hope , ye who enter here"
references = [
    "All hope abandon , ye who enter here",
    "All hope abandon , ye who enter in !",
    "Leave every hope, ye that enter",
    "Leave all hope , ye that enter",
]

In [3]:
# Get the n-grams from the given text
def get_ngrams(text, order):
    """
    Given a string `text` and an integer `order`, returns a Counter object containing
    the frequency counts of all ngrams of size `order` in the string.
    """
    ngrams = Counter()

    # Tokens are whitespace-separated, so the text must already be tokenized
    words = text.split()
    for i in range(len(words) - order + 1):
        ngram = " ".join(words[i:i + order])
        ngrams[ngram] += 1

    return ngrams

In [4]:
print(dict(get_ngrams(hypothesis, 2))) # sanity check: expected output should be
# {'Abandon all': 1, 'all hope': 1, 'hope ,': 1, ', ye': 1, 'ye who': 1, 'who enter': 1, 'enter here': 1}

{'Abandon all': 1, 'all hope': 1, 'hope ,': 1, ', ye': 1, 'ye who': 1, 'who enter': 1, 'enter here': 1}
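Since the n-gram counts are stored in `Counter` objects, the shared n-grams can be found with the `&` operator, which keeps the minimum count of each key. The sketch below shows this on a toy sentence pair (the sentences are invented for illustration and use unigrams only): a hypothesis cannot earn more credit for a word than the number of times the reference contains it.

```python
from collections import Counter

# Toy unigram counts; these sentences are illustrative, not from the notebook
hyp_counts = Counter("the the the cat".split())
ref_counts = Counter("the cat sat on the mat".split())

# Counter intersection keeps min(hyp_count, ref_count) for every shared key,
# so "the" is credited twice even though the hypothesis repeats it three times
clipped = hyp_counts & ref_counts

matches = sum(clipped.values())   # clipped matches: 2 + 1
total = sum(hyp_counts.values())  # all unigram tokens in the hypothesis: 4
precision = matches / total
print(dict(clipped), precision)
```

Note that the denominator is `sum(hyp_counts.values())`, the number of n-gram tokens, rather than `len(hyp_counts)`, which would count only the distinct n-grams.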


In [5]:
def calculate_bleu(hypothesis, references):
    """
    Compute sentence-level BLEU for one hypothesis against several references.
    Returns the tuple (bleu, p1, p2, p3, p4, bp).
    """
    # Lengths are measured in tokens, as the formula requires, not in characters
    c = len(hypothesis.split())

    # 1. Find the reference whose token length is closest to the hypothesis
    #    (the first such reference wins a tie)
    closest_ref = min(references, key=lambda ref: abs(len(ref.split()) - c))
    r = len(closest_ref.split())

    # 2. Calculate p_n for n = 1..4 against the closest reference
    pns = []
    for order in range(1, 5):
        hyp_ngrams = get_ngrams(hypothesis, order)
        ref_ngrams = get_ngrams(closest_ref, order)
        # Counter intersection keeps the minimum count of each shared n-gram
        intersection_size = sum((hyp_ngrams & ref_ngrams).values())
        # Total number of n-gram tokens in the hypothesis (at least 1 to avoid dividing by zero)
        hyp_size = max(sum(hyp_ngrams.values()), 1)
        pns.append(intersection_size / hyp_size)

    # 3. Calculate the brevity penalty: no penalty when the hypothesis is longer than r
    bp = 1.0 if c > r else exp(1 - r / c)

    # 4. Calculate the BLEU score; log(0) is undefined, so any zero precision gives BLEU = 0
    weights = [0.25] * 4
    if min(pns) > 0:
        bleu = bp * exp(sum(w * log(p_n) for w, p_n in zip(weights, pns)))
    else:
        bleu = 0.0

    p1, p2, p3, p4 = pns

    return bleu, p1, p2, p3, p4, bp


In [6]:
bleu, p1, p2, p3, p4, bp=calculate_bleu(hypothesis, references)
print("BLEU: %.3f" % bleu) # sanity check: 0.5 < BLEU < 1

BLEU: 0.541
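
As a check on the printed score, the arithmetic can be reproduced by hand. This assumes that the first reference is selected as the closest one (it has the same length as the hypothesis) and that matching is case-sensitive, so `Abandon` and `all` in the hypothesis do not match `abandon` and `All` in the reference:

$$
p_1 = {6 \over 8}, \quad p_2 = {4 \over 7}, \quad p_3 = {3 \over 6}, \quad p_4 = {2 \over 5}, \quad BP = 1
$$

$$
\textrm{BLEU} = 1 \times \left({6 \over 8} \cdot {4 \over 7} \cdot {3 \over 6} \cdot {2 \over 5}\right)^{1/4} = \left({3 \over 35}\right)^{1/4} \approx 0.541
$$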
