# "Statistical" metrics

In [7]:
from IPython.display import display, Markdown, Latex

When an application generates text and we want more flexibility in evaluating its quality than exact checks allow, we can use "statistical" metrics.

BLEU and ROUGE are two of the most popular statistical metrics for evaluating the quality of generated text.

These metrics work by comparing the generated text to a "golden" reference. Their flexibility comes from the fact that they check not only for specific words (as we did with the rule-based evaluations) but also for sequences of words, so word order and text length affect the score as well.

At their core, BLEU and ROUGE decompose the generated text into **n-grams** (contiguous chunks of n words) and compare them to the n-grams of the "golden" reference.
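To make the idea of n-grams concrete, here is a minimal sketch in plain Python. The helper name `ngrams` is our own and is not part of any library used in this notebook; tokenization is a simple whitespace split.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Lowercase and split on whitespace, as we do later before scoring.
tokens = "policy lab has explored the use of ai".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of adjacent words

print(bigrams[:3])
# [('policy', 'lab'), ('lab', 'has'), ('has', 'explored')]
```

A text with 8 tokens yields 8 unigrams and 7 bigrams; asking for n-grams longer than the text returns an empty list.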


For example, imagine we have the following text:

> Policy Lab has experimented with Artificial Intelligence (AI) in policy development with teams across government, and beyond, for a number of years. In 2019 we worked with the Department for Transport’s data science team to consider the role that AI could play in improving the efficiency and effectiveness of the policy consultation process. In 2022 we used AI to create a vision for the future of Hounslow with the local authority. In 2023, we commissioned the creation of the Ecological Intelligence Agency, a speculative artefact to help experience the role AI might have in future decision-making in environmental policy. 

We asked the original writer to provide a summary of the text. **This is our golden reference**:

> Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making.

### What we want to evaluate

We then asked different language models to summarize the same text using a variety of prompts.


In [28]:
reference_summary = "Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making."
summary_1 = "Policy Lab has explored the application of Artificial Intelligence in diverse government policy initiatives, including enhancing policy consultation, envisioning local community futures, and examining AI's potential role in environmental policy decisions."
summary_2 = "Policy Lab has explored AI applications in governmental policy creation, collaborating with various agencies to enhance consultation procedures, generate urban forecasts, and envision futuristic environmental decision-making tools."
summary_3 = "Policy Lab has been exploring the application of Artificial Intelligence in various government policy initiatives for several years, including improving policy consultations, envisioning the future of local communities, and speculating on the role of AI in environmental decision-making."

display(Markdown(f"""
## Reference summary\n{reference_summary}\n
## Model 1\n{summary_1}\n
## Model 2\n{summary_2}\n
## Model 3\n{summary_3}\n
"""))


## Reference summary
Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making.

## Model 1
Policy Lab has explored the application of Artificial Intelligence in diverse government policy initiatives, including enhancing policy consultation, envisioning local community futures, and examining AI's potential role in environmental policy decisions.

## Model 2
Policy Lab has explored AI applications in governmental policy creation, collaborating with various agencies to enhance consultation procedures, generate urban forecasts, and envision futuristic environmental decision-making tools.

## Model 3
Policy Lab has been exploring the application of Artificial Intelligence in various government policy initiatives for several years, including improving policy consultations, envisioning the future of local communities, and speculating on the role of AI in environmental decision-making.



## BLEU

BLEU stands for Bilingual Evaluation Understudy. It was originally designed to evaluate machine translation output by comparing it to a set of reference translations.

However, it can be used for other applications, such as evaluating the quality of a generated answer from a RAG pipeline, a summarization task, or even the expected response of a chatbot.
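As a rough illustration of what BLEU computes, the sketch below combines clipped ("modified") n-gram precision for n = 1 to 4 with a brevity penalty that punishes candidates shorter than the reference. This is a simplified single-reference version written for this notebook, not NLTK's implementation, and all function names are our own.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(reference, candidate):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(reference, candidate, max_n=4):
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # a single zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(reference, candidate) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
repeated = "the the the the".split()

# "the" appears 4 times in the candidate but only twice in the reference.
print(modified_precision(reference, repeated, 1))  # 0.5
print(simple_bleu(reference, reference))           # 1.0
```

Clipping is what stops a candidate from scoring well by repeating a common word, and the geometric mean is why a single n-gram order with no matches drives the whole score to zero.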


We can then evaluate the quality of the generated summaries using BLEU:

In [29]:
from nltk.translate.bleu_score import sentence_bleu

In [30]:
# Perfect match
sentence_bleu(
    [reference_summary.lower().split()],
    reference_summary.lower().split()
)


1.0

In [31]:
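# Model 1
# Note: sentence_bleu(references, hypothesis) takes the list of references first.
# Here the generated summary is passed as the reference and the golden summary
# as the hypothesis; the score below was produced with this argument order.
sentence_bleu(
    [summary_1.lower().split()],
    reference_summary.lower().split()
)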

0.27940187870698063

In [32]:
sentence_bleu(
    [summary_2.lower().split()],
    reference_summary.lower().split()
)

0.0925329498915617

In [33]:
sentence_bleu(
    [summary_3.lower().split()],
    reference_summary.lower().split()
)

The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


3.6524797765262606e-78
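The warning above points at a general property of BLEU: the score is a geometric mean of the n-gram precisions, so one order with zero matches collapses the result no matter how good the other orders are. The sketch below illustrates this with made-up counts, together with a simple add-epsilon smoothing in the spirit of what the warning suggests. The counts are hypothetical (they are not taken from the summaries above), the helper names are our own, and this is not NLTK's `SmoothingFunction` code.

```python
import math


def geometric_mean(precisions):
    """Geometric mean; any zero precision makes the result zero."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


def add_epsilon(matches, totals, epsilon=0.1):
    """Replace zero match counts with a small epsilon before dividing."""
    return [(m if m > 0 else epsilon) / t for m, t in zip(matches, totals)]


# Hypothetical matched and total n-gram counts for n = 1, 2, 3, 4.
matches = [20, 9, 3, 0]
totals = [35, 34, 33, 32]

unsmoothed = geometric_mean([m / t for m, t in zip(matches, totals)])
smoothed = geometric_mean(add_epsilon(matches, totals))

print(unsmoothed)  # 0.0, because there are no 4-gram matches
print(smoothed)    # a small positive value instead of zero
```

With smoothing, the lower-order overlaps still contribute to the score, which usually makes short texts easier to compare. The alternative mentioned in the warning is to use a lower n-gram order, for example by scoring only unigrams and bigrams.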

## ROUGE

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. Like BLEU, it evaluates generated text by comparing it to one or more references; it was originally designed for automatic summarization.

In practice, ROUGE is a family of metrics, each capturing a different aspect of the overlap between the generated text and the reference. For example:

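- ROUGE-1: overlap of unigrams (single words)
- ROUGE-2: overlap of bigrams (pairs of adjacent words)
- ROUGE-L: length of the Longest Common Subsequence (words in the same order, not necessarily adjacent)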
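The three variants in the list above can be sketched in a few lines of plain Python. This is a simplified illustration with whitespace tokens and no stemming, so its numbers will not match the `rouge_score` package exactly; all function names are our own.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, ref_total, cand_total):
    """Turn an overlap count into (precision, recall, F1)."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(reference, candidate, n):
    """ROUGE-N: n-gram overlap between reference and candidate."""
    ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
    overlap = sum((ref & cand).values())  # multiset intersection
    return prf(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L: overlap measured by the longest common subsequence."""
    return prf(lcs_length(reference, candidate), len(reference), len(candidate))


reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()

print(rouge_n(reference, candidate, 1))  # 4 shared unigrams out of 6
print(rouge_n(reference, candidate, 2))  # 1 shared bigram out of 5
print(rouge_l(reference, candidate))     # LCS is "the cat on mat" (length 4)
```

Recall divides the overlap by the size of the reference and precision divides it by the size of the candidate, which is why ROUGE reports both along with their harmonic mean.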


In [38]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [40]:
scores = scorer.score(reference_summary, summary_1)
scores

{'rouge1': Score(precision=0.71875, recall=0.6571428571428571, fmeasure=0.6865671641791045),
 'rouge2': Score(precision=0.3870967741935484, recall=0.35294117647058826, fmeasure=0.36923076923076925),
 'rougeL': Score(precision=0.6875, recall=0.6285714285714286, fmeasure=0.6567164179104478)}

In [41]:
scores = scorer.score(reference_summary, summary_2)
scores

{'rouge1': Score(precision=0.5357142857142857, recall=0.42857142857142855, fmeasure=0.47619047619047616),
 'rouge2': Score(precision=0.14814814814814814, recall=0.11764705882352941, fmeasure=0.13114754098360654),
 'rougeL': Score(precision=0.5, recall=0.4, fmeasure=0.4444444444444445)}

In [42]:
scores = scorer.score(reference_summary, summary_3)
scores

{'rouge1': Score(precision=0.5897435897435898, recall=0.6571428571428571, fmeasure=0.6216216216216216),
 'rouge2': Score(precision=0.2894736842105263, recall=0.3235294117647059, fmeasure=0.30555555555555564),
 'rougeL': Score(precision=0.5384615384615384, recall=0.6, fmeasure=0.5675675675675675)}