<a href="https://colab.research.google.com/github/gcunhase/NLPMetrics/blob/master/notebooks/bleu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BLEU: BiLingual Evaluation Understudy

*NLP evaluation metric used in Machine Translation tasks*

*Compares the $N$-grams (sequences of $N$ consecutive words) of a candidate sentence with those of one or more reference sentences*
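
To make the $N$-gram comparison concrete, here is a minimal sketch of the clipped ("modified") precision that BLEU builds on. It uses only the standard library. The helper names `ngrams` and `clipped_precision` and the toy sentences are assumptions for illustration, not part of NLTK.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Clipped n-gram precision against one reference.

    Assumes the hypothesis has at least n tokens.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return Fraction(matched, total)


toy_hyp = 'the the the the the the the'.split()
toy_ref = 'the cat is on the mat'.split()

print(ngrams(toy_ref, 2)[:2])                    # [('the', 'cat'), ('cat', 'is')]
print(clipped_precision(toy_hyp, toy_ref, n=1))  # 2/7
```

Without clipping, the degenerate hypothesis would score 7/7, because every one of its words appears in the reference. With clipping, only two occurrences of "the" are credited.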

### 1. Libraries
*Install and import necessary libraries*


In [0]:
import nltk
import nltk.translate.bleu_score as bleu

import numpy
import os

try:
  nltk.data.find('tokenizers/punkt')
except LookupError:
  nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### 2. Dataset
*Array of words: candidate and reference sentences split into words*

In [0]:
hyp = str('she read the book because she was interested in world history').split()
ref_a = str('she read the book because she was interested in world history').split()
ref_b = str('she was interested in world history because she read the book').split()

### 3. *Sentence* score calculation
*Compares 1 hypothesis (the candidate sentence, i.e. the system output) with 1 or more reference sentences.*

* *hyp and ref_a are the same so the BLEU score should be 1*

In [0]:
score_ref_a = bleu.sentence_bleu([ref_a], hyp)
print(score_ref_a)

1.0


* *hyp and ref_b are different, so the BLEU score should be lower than 1*

In [0]:
score_ref_b = bleu.sentence_bleu([ref_b], hyp)
print(score_ref_b)

0.7400828044922853


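* *When a candidate sentence is compared with multiple reference sentences, the $N$-gram matches are pooled across all references (each count is clipped by its maximum count in any single reference) rather than scored separately per reference. Here `ref_a` matches `hyp` exactly, so the score is 1*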

In [0]:
score_ref_ab = bleu.sentence_bleu([ref_a, ref_b], hyp)
print(score_ref_ab)

1.0
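
With several references, the $N$-gram counts are pooled, so the combined precision can exceed the precision against any single reference. The sketch below illustrates this for unigrams only. The function `unigram_precision` and the toy sentences are hypothetical and independent of NLTK.

```python
from collections import Counter


def unigram_precision(hypothesis, references):
    """Clipped unigram precision against one or more references."""
    hyp_counts = Counter(hypothesis)
    # For each word, the clip limit is its largest count in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for word, count in Counter(reference).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(min(count, max_ref_counts[word]) for word, count in hyp_counts.items())
    return matched / len(hypothesis)


toy_hyp = 'she read history'.split()
toy_ref_1 = 'she read books'.split()
toy_ref_2 = 'he studied history'.split()

only_1 = unigram_precision(toy_hyp, [toy_ref_1])           # 2 of 3 words matched
only_2 = unigram_precision(toy_hyp, [toy_ref_2])           # 1 of 3 words matched
both = unigram_precision(toy_hyp, [toy_ref_1, toy_ref_2])  # all 3 words matched
print(only_1, only_2, both)
```

Here the pooled precision is 1.0, although neither reference alone covers all three words.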


### 4. *Corpus* score calculation
*Compares 1 candidate document containing multiple sentences with 1 or more reference documents, each also containing multiple sentences.*

* Rather than averaging the BLEU scores of the individual sentences, it calculates the score by *"summing the numerators and denominators for each hypothesis-reference(s) pairs before the division"*

In [0]:
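# One hypothesis sentence with a single reference
score_1hyp_1ref = bleu.corpus_bleu([[ref_a]], [hyp])
print("1 hypothesis sentence with 1 reference: {}".format(score_1hyp_1ref))
# One hypothesis sentence with two alternative references
score_1hyp_2ref = bleu.corpus_bleu([[ref_a, ref_b]], [hyp])
print("1 hypothesis sentence with 2 references: {}".format(score_1hyp_2ref))
# Two hypothesis sentences, each paired with its own single reference
score_2hyp_1ref = bleu.corpus_bleu([[ref_a], [ref_b]], [hyp, hyp])
print("2 hypothesis sentences with 1 reference each: {}".format(score_2hyp_1ref))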
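
The difference between averaging per-sentence scores and summing numerators and denominators first can be seen in a small arithmetic sketch. The per-sentence counts below are made up for illustration and are not derived from the sentences in this notebook.

```python
from fractions import Fraction

# Hypothetical per-sentence statistics: (matched n-grams, total n-grams).
sentence_stats = [(1, 2), (9, 10)]

# Macro-average: compute each sentence's precision, then average the ratios.
macro = sum(Fraction(m, t) for m, t in sentence_stats) / len(sentence_stats)

# Micro-average (corpus style): sum numerators and denominators, then divide once.
micro = Fraction(sum(m for m, _ in sentence_stats),
                 sum(t for _, t in sentence_stats))

print(macro, micro)  # 7/10 5/6
```

The micro-average gives longer sentences more influence, because every $N$-gram counts equally regardless of which sentence it comes from.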

### 5. N-gram
*Scores for individual $N$-gram orders can be obtained in both **sentence** and **corpus** calculations by setting the **weights** parameter.*

* *weights*: a tuple (length 4 by default) in which each position holds the weight of its respective N-gram order.
* N-gram with $N \in \{1, 2, 3, 4\}$
* $\textit{weights}=(W_{N=1}, W_{N=2}, W_{N=3}, W_{N=4})$
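
For reference, the weights enter the standard BLEU definition as exponents of a weighted geometric mean. The symbols $p_n$ (clipped $n$-gram precision), $c$ (candidate length), $r$ (effective reference length) and $\text{BP}$ (brevity penalty) are introduced here and do not appear elsewhere in this notebook:

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} W_{N=n} \log p_n\right), \qquad \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

With $\textit{weights}=(1, 0, 0, 0)$ the sum reduces to $\log p_1$, so the score is the unigram precision multiplied by the brevity penalty.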



In [0]:
score_1gram = bleu.sentence_bleu([ref_a], hyp, weights=(1,0,0,0))
score_2gram = bleu.sentence_bleu([ref_a], hyp, weights=(0,1,0,0))
score_3gram = bleu.sentence_bleu([ref_a], hyp, weights=(0,0,1,0))
score_4gram = bleu.sentence_bleu([ref_a], hyp, weights=(0,0,0,1))
print("N-grams: 1-{}, 2-{}, 3-{}, 4-{}".format(score_1gram, score_2gram, score_3gram, score_4gram))

* Cumulative N-grams: *by default, the score is calculated by weighting the 1- to 4-gram precisions equally (0.25 each)*

In [0]:
score_ngram1 = bleu.sentence_bleu([ref_b], hyp)
score_ngram = bleu.sentence_bleu([ref_b], hyp, weights=(0.25,0.25,0.25,0.25))
print("N-grams: {}, {}".format(score_ngram1, score_ngram))
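
To show where the cumulative score comes from, the sketch below recomputes it from scratch for a single reference, without smoothing. `simple_bleu` and `ngram_counts` are illustrative names, and the code is an assumption-laden approximation of the standard definition, not NLTK's implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, hypothesis, weights=(0.25, 0.25, 0.25, 0.25)):
    """Single-reference BLEU without smoothing (illustrative only)."""
    log_sum = 0.0
    for n, weight in enumerate(weights, start=1):
        if weight == 0:
            continue  # this n-gram order does not contribute
        hyp_counts = ngram_counts(hypothesis, n)
        ref_counts = ngram_counts(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # a zero precision drives the geometric mean to zero
        log_sum += weight * math.log(matched / total)
    # Brevity penalty: penalise hypotheses that are not longer than the reference.
    c, r = len(hypothesis), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_sum)


hyp = 'she read the book because she was interested in world history'.split()
ref_b = 'she was interested in world history because she read the book'.split()

print(simple_bleu(ref_b, hyp))                        # approximately 0.74
print(simple_bleu(ref_b, hyp, weights=(0, 1, 0, 0)))  # approximately 0.9
```

For this pair the clipped precisions are 11/11, 9/10, 6/9 and 4/8, and both sentences have 11 words, so the brevity penalty is 1. The result is $(1 \cdot 0.9 \cdot 0.667 \cdot 0.5)^{1/4} = 0.3^{1/4} \approx 0.74$, which agrees with the score printed for `ref_b` in section 3.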