In [1]:
# Install the latest nltk and rouge_score packages
! pip install nltk rouge_score --upgrade --user



**BLEU (Bilingual Evaluation Understudy)**

BLEU measures the n-gram overlap between the generated text and one or more reference texts, and it penalizes generated text that is shorter than the reference. It is a widely used metric, but it has been criticized for being insensitive to fluency and coherence.
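To make "n-gram overlap" concrete, here is a minimal pure-Python sketch of the idea behind BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function names (`ngrams`, `modified_precision`, `simple_bleu`) are illustrative and not part of any library. The sketch assumes a single reference, whitespace tokenization, and no smoothing, so it is not a substitute for a tested implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def simple_bleu(reference, candidate, max_n=4):
    """Geometric mean of the 1..max_n precisions times a brevity penalty (no smoothing)."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)


# Whitespace tokenization is an assumption of this sketch.
reference = "The quick brown fox jumps over the lazy dog.".split()
candidate = "A quick brown fox jumps over a lazy dog.".split()

for n in range(1, 5):
    print(f"{n}-gram precision: {modified_precision(reference, candidate, n):.3f}")
print(f"BLEU (sketch): {simple_bleu(reference, candidate):.4f}")
```

Because the score is a geometric mean, a single n-gram order with zero matches drives the whole score to zero, which is why library implementations offer smoothing for short sentences.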


In [2]:
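from nltk.translate.bleu_score import sentence_bleu

# Reference text and generated text
reference = "The quick brown fox jumps over the lazy dog."
generated = "A quick brown fox jumps over a lazy dog."

# sentence_bleu expects a list of tokenized references and a tokenized hypothesis.
# Passing raw strings makes NLTK iterate over characters instead of words.
references = [reference.split()]
hypothesis = generated.split()

# Calculate BLEU score (default: uniform weights over 1- to 4-grams)
bleu_score = sentence_bleu(references, hypothesis)
print(f"BLEU Score: {bleu_score:.4f}")


BLEU Score: 0.5133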


**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: ROUGE also measures the overlap between the generated text and a reference text, but it is recall-oriented and is most often used to evaluate summaries. Like BLEU, it relies on surface overlap and does not directly measure fluency or coherence. Common variants are:
- **ROUGE-N**: Overlap of n-grams between the system and reference summaries.
  - **ROUGE-1** refers to overlap of unigrams between the generated text and reference text.
  - **ROUGE-2** refers to the overlap of bigrams between the generated text and reference text.
- **ROUGE-L**: Longest Common Subsequence (LCS) based statistics. The LCS naturally takes sentence-level structure into account and automatically identifies the longest sequence of words that occur in the same order in both texts, without requiring them to be contiguous.
- **ROUGE-W**: Weighted LCS-based statistics that favor consecutive matches.
- **ROUGE-S**: Skip-bigram based co-occurrence statistics. A skip-bigram is any pair of words in their sentence order, allowing for gaps between them.
- **ROUGE-SU**: Skip-bigram plus unigram-based co-occurrence statistics.
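The ROUGE-N and ROUGE-L definitions above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the `rouge_score` implementation: the helper names (`tokenize`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical, the tokenizer simply lowercases and keeps alphanumeric runs, and no stemming is applied.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs only (an assumed, simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_n(reference, candidate, n):
    """Return (precision, recall) of the clipped n-gram overlap."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    overlap = sum((ref & cand).values())  # multiset intersection
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall) based on the LCS length."""
    lcs = lcs_length(reference, candidate)
    return lcs / max(len(candidate), 1), lcs / max(len(reference), 1)


reference = tokenize("The quick brown fox jumps over the lazy dog.")
candidate = tokenize("A brown fox jumps over a lazy dog.")

print("ROUGE-1 (P, R):", rouge_n(reference, candidate, 1))
print("ROUGE-2 (P, R):", rouge_n(reference, candidate, 2))
print("ROUGE-L (P, R):", rouge_l(reference, candidate))
```

For this sentence pair the longest common subsequence is "brown fox jumps over lazy dog", so ROUGE-L and ROUGE-1 coincide; they diverge once the matching words appear in a different order in the two texts.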



**ROUGE vs BLEU**

- **BLEU** focuses on precision: how much the words (and/or n-grams) in the candidate model outputs appear in the human reference.
- **ROUGE** focuses on recall: how much the words (and/or n-grams) in the human references appear in the candidate model outputs.

The two metrics are complementary, as precision and recall usually are: a high score on one does not guarantee a high score on the other.
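A small sketch can show why precision and recall pull in different directions. The function `unigram_precision_recall` and the two candidate sentences below are made up for illustration. The sketch uses lowercase whitespace tokenization and unigrams only, which is a simplification of both BLEU and ROUGE.

```python
from collections import Counter


def unigram_precision_recall(reference, candidate):
    """Return (precision, recall) of the clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # multiset intersection
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


reference = "the quick brown fox jumps over the lazy dog"

# A very short candidate: every word is correct, but most of the reference is missing.
short = "the fox"
# A padded candidate: it covers the whole reference but adds unrelated words.
padded = "the quick brown fox jumps over the lazy dog and then runs far away today"

for name, cand in [("short", short), ("padded", padded)]:
    p, r = unigram_precision_recall(reference, cand)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

The short candidate gets perfect precision with low recall, and the padded candidate gets perfect recall with lower precision. This is the gap that BLEU's brevity penalty and ROUGE's F-measure each try to close from their own side.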

In [6]:
from rouge_score import rouge_scorer

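scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# score(target, prediction): the reference comes first, the generated text second
reference = "The quick brown fox jumps over the lazy dog."
generated = "A brown fox jumps over a lazy dog."
scores = scorer.score(reference, generated)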

print("ROUGE Scores:")
print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-2: {scores['rouge2']}")
print(f"ROUGE-L: {scores['rougeL']}")

ROUGE Scores:
ROUGE-1: Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765)
ROUGE-2: Score(precision=0.5714285714285714, recall=0.5, fmeasure=0.5333333333333333)
ROUGE-L: Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765)


Libraries:
- https://www.nltk.org/
- https://keras.io/api/keras_nlp/metrics/
- https://pypi.org/project/rouge-score/