# Evaluating Large Language Models (LLMs)

This notebook demonstrates methods for evaluating LLMs.  We focus on the task of summarization and cover accuracy, ROUGE-N, and perplexity.

### Learning Objectives
1. Know how to compute ROUGE-N and other metrics.
2. Gain an intuitive understanding of ROUGE-N.
3. Test various models and model sizes on the same data, and compare their results.


In [2]:
!pip install rouge_score==0.1.2 huggingface_hub langchain openai transformers datasets


Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting huggingface_hub
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting langchain
  Downloading langchain-0.0.305-py3-none-any.whl (1.8 MB)
Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting transformers
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting datasets
  Downloading datasets-

## How can we evaluate summarization?

Suppose you are developing a smartphone news app and need to display automatically generated summaries of breaking news articles.  How can you evaluate whether or not the summaries you are generating are good?

![](https://drive.google.com/uc?export=view&id=1V6cMD1LgivCb850JDhva1DO9EWVH8rJ7)


## Dataset

We will use a subset of the `cnn_dailymail` dataset from See et al., 2017, downloadable from the [Hugging Face `datasets` hub](https://huggingface.co/datasets/cnn_dailymail).

This dataset provides news articles paired with reference summaries (in the "highlights" column).  Let's load the data and take a look at some examples.


In [33]:
import torch
from datasets import load_dataset

full_dataset = load_dataset("cnn_dailymail", version="3.0.0", cache_dir="sample_data/")
sample_size = 10

# Keep articles whose first 25 characters mention CNN, shuffle reproducibly, and take a small sample.
sample = (
    full_dataset["train"]
    .filter(lambda r: "CNN" in r["article"][:25])
    .shuffle(seed=42)
    .select(range(sample_size))
)
sample



Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 10
})

In [4]:
display(sample.to_pandas())



Unnamed: 0,article,highlights,id
0,(CNN) -- A magnitude 6.7 earthquake rattled Pa...,Papua New Guinea is on the so-called Ring of F...,8093dba7bc2260c26f18939826909ef27549c758
1,(CNN) -- Pakistan took big steps towards level...,Australia collapse to 88 all out on opening da...,67d626156f971d0bf55e5f2a48e1ed965eb622a6
2,(CNN) -- Federal prosecutors are pushing to fo...,Jared Loughner is refusing the government's re...,0d02fb8f0d406db956b128a5c1cc7bf3f13860a6
3,"Centennial, Colorado (CNN) -- McKayla Hicks sa...",Shooting victim McKayla Hicks went to hearing ...,39aee887c6d34bd311c826142b14037e6f2639ee
4,(CNN) -- Double-amputee sprinter Oscar Pistori...,Oscar Pistorius to become first double-amputee...,cc83ecdf08f0b598c3b97b3e2819c7e0ae7ca4f2
5,(CNN) -- A grand jury has indicted Texas Gov. ...,"NEW: Perry lawyer calls indictments ""political...",51fb6465303595cb201b427ca04b594b182a9722
6,(CNN)An Argentine prosecutor said Friday there...,Prosecutor to judge: Enough evidence for inves...,f4d3394791035a0571f1841d5d21661fdb39d74f
7,"Warsaw, Poland (CNN) -- European football's go...",NEW: UEFA president Michel Platini urges fans ...,76ba8e9110a66a1b1293abe34ef4fab254371af8
8,(CNN) -- Two issues -- security and immigratio...,A new high-level group to discuss economic coo...,fbca9bf96c440bbfab59de6bd5f6d06ed609ed99
9,(CNN) -- More than 100 police officers and ot...,Four inmates escape from jail in St. Tammany P...,a91b42eb3bfaa9dd1d6fe5e07d595f0acdbf29bc


In [5]:
example_article = sample["article"][9]
example_highlights = sample["highlights"][9]

print("Article: \n"+example_article)
print("\nHighlights: \n"+example_highlights)



Article: 
(CNN)  -- More than 100 police officers and others were searching Friday in a southeastern Louisiana parish for a murder suspect who escaped from jail with three other inmates, a law enforcement official said. Timothy Murray, 29, who is charged with murder, remains at large, authorities in Louisiana say. Searchers are still focusing inside St. Tammany Parish, on the northern shore of Lake Pontchartrain, 30 miles north of New Orleans, said Capt. George Bonnett of the St. Tammany Parish Sheriff's Office. At large is Timothy Murray, 29, who is charged with murder, Bonnett said. Authorities believe Murray may have been injured during the escape, but Bonnett wouldn't elaborate. The inmates escaped about 9 p.m. Thursday from the St. Tammany Parish Jail in Covington, Bonnett said. As many as 250 sheriff's deputies, Covington police officers, Louisiana State police and corrections officials were involved in the search overnight, using dogs, two helicopters and thermal-imaging equipme

In [6]:
import transformers as tr
model_checkpoint = "t5-small"
tokenizer = tr.AutoTokenizer.from_pretrained(model_checkpoint)
model = tr.AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [7]:
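from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
# Truncation is not enabled here, so articles longer than the model's 512-token limit trigger the warning shown below.
t5_small_summaries = summarizer(sample["article"])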

Your max_length is set to 200, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)
Token indices sequence length is longer than the specified maximum sequence length for this model (697 > 512). Running this sequence through the model will result in indexing errors


In [32]:
def summarize(model_checkpoint: str) -> list:
    """Load a seq2seq checkpoint and summarize every article in `sample`."""
    tokenizer = tr.AutoTokenizer.from_pretrained(model_checkpoint)
    model = tr.AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
    # truncation=True clips inputs that exceed the model's maximum input length.
    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, truncation=True)
    return summarizer(sample["article"])


In [42]:
import pandas as pd
reference_summaries = sample['highlights']
t5_small__final_summaries =[x['summary_text'] for x in t5_small_summaries]
final_df = pd.DataFrame.from_dict({"generated":t5_small__final_summaries, "reference":reference_summaries})
final_df


Unnamed: 0,generated,reference
0,magnitude 6.7 quake rattles Papua new Guinea e...,Papua New Guinea is on the so-called Ring of F...
1,australia bowled out their opponents for just ...,Australia collapse to 88 all out on opening da...
2,federal prosecutors are pushing to force jared...,Jared Loughner is refusing the government's re...
3,"""he tried to kill people,"" a 17-year-old high ...",Shooting victim McKayla Hicks went to hearing ...
4,double amputee sprinter Oscar Pistorius named ...,Oscar Pistorius to become first double-amputee...
5,a grand jury indicted the governor on charges ...,"NEW: Perry lawyer calls indictments ""political..."
6,a prosecutor says there is enough evidence to ...,Prosecutor to judge: Enough evidence for inves...
7,"UEFA says it is acting over ""the setting-off a...",NEW: UEFA president Michel Platini urges fans ...
8,new: presidents agree to create a new high-lev...,A new high-level group to discuss economic coo...
9,more than 100 police officers and others are s...,Four inmates escape from jail in St. Tammany P...


### Accuracy

In [10]:
# Exact-match accuracy: fraction of generated summaries identical to their reference.
matches = 0
for i in range(len(reference_summaries)):
    if reference_summaries[i] == t5_small__final_summaries[i]:
        matches += 1
accuracy = matches / len(reference_summaries)

accuracy


0.0
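
The zero above is expected: exact-match accuracy only rewards a summary that is character-for-character identical to its reference. The following is a minimal sketch of that behavior. The `exact_match_accuracy` helper and the two strings are illustrative assumptions, not taken from the dataset.

```python
def exact_match_accuracy(generated, reference):
    """Fraction of generated summaries that equal their reference exactly."""
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / len(reference)


# Illustrative strings (hypothetical, not from cnn_dailymail).
generated = ["Four inmates escaped from a Louisiana jail."]
reference = ["Four inmates escape from jail in Louisiana."]

print(exact_match_accuracy(generated, reference))  # 0.0
print(exact_match_accuracy(reference, reference))  # 1.0
```

Two summaries that a reader would judge nearly equivalent still score 0, which is why an overlap-based metric is a better fit for free-form text.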

Now that we can generate summaries, and we know that exact-match (0/1) accuracy is useless here, let's look at a metric designed to evaluate summarization: ROUGE.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of evaluation metrics for comparing summaries, introduced by Lin, 2004.  See [Wikipedia](https://en.wikipedia.org/wiki/ROUGE_(metric)) for more info.  Here, we use the Hugging Face `evaluate` library as a wrapper around the `rouge_score` package.  This package provides 4 scores:

* `rouge1`: ROUGE computed over unigrams (single words or tokens)
* `rouge2`: ROUGE computed over bigrams (pairs of consecutive words or tokens)
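* `rougeL`: ROUGE based on the longest common subsequence shared by the summaries being compared, treating each summary as a single sequence (newlines are ignored)
* `rougeLsum`: like `rougeL`, but at "summary level," i.e., newlines are interpreted as sentence boundaries and the longest common subsequences are computed sentence by sentence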
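
To build intuition for `rouge1` and `rouge2`, here is a minimal from-scratch sketch of n-gram overlap scoring. It is not the `rouge_score` implementation: it assumes simple lowercasing and whitespace tokenization with no stemming, and the helper names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Lowercase, split on whitespace, and count the n-grams."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n):
    """Return (precision, recall, f1) of n-gram overlap."""
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


pred = "Large language models beat world record"
ref = "Large language models beating world records"
print(round(rouge_n(pred, ref, 1)[2], 4))  # 0.6667
print(round(rouge_n(pred, ref, 2)[2], 4))  # 0.4
```

For this pair, four of six unigrams and two of five bigrams match, which gives the same F-measures that `rouge.compute` reports for these strings with `use_stemmer=False`.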


In [11]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.0 responses-0.18.0


In [12]:
import evaluate
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

rouge = evaluate.load("rouge")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



You can call the `rouge_score` evaluator directly, but we provide a convenience function below to handle the expected input format: the ROUGE function expects sentences to be separated by newlines (`\n`), so we reformat each summary first.


In [43]:
def compute_rouge_score(generated: list, reference: list) -> dict:
    """
    Compute ROUGE scores on a batch of articles.

    This is a convenience function wrapping Hugging Face `rouge_score`,
    which expects sentences to be separated by newlines.

    :param generated: Summaries (list of strings) produced by the model
    :param reference: Ground-truth summaries (list of strings) for comparison
    """
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    return rouge.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )

In [44]:
result = compute_rouge_score(final_df['generated'], final_df['reference'])

print(result)

{'rouge1': 0.4447858287088081, 'rouge2': 0.22608456497804547, 'rougeL': 0.31858071565885293, 'rougeLsum': 0.43089785199320496}


In [15]:
rouge.compute(
        predictions=["Large language models beat world record"],
        references=["Large language models beating world records"],
        use_stemmer=False,
    )

{'rouge1': 0.6666666666666666,
 'rouge2': 0.4000000000000001,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}

In [16]:
rouge.compute(
        predictions=["Large language models beat world record"],
        references=["Large language models beating world records"],
        use_stemmer=True,
    )

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

In [17]:
# What if we predict exactly 1 word correctly?
rouge.compute(
    predictions=["Large language models beat world record"],
    references=["Large"],
    use_stemmer=True,
)

{'rouge1': 0.2857142857142857,
 'rouge2': 0.0,
 'rougeL': 0.2857142857142857,
 'rougeLsum': 0.2857142857142857}

In [18]:
# The ROUGE F-measure reported here is symmetric: swapping predictions and references gives the same scores.
rouge.compute(
    predictions=["Large"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

{'rouge1': 0.2857142857142857,
 'rouge2': 0.0,
 'rougeL': 0.2857142857142857,
 'rougeLsum': 0.2857142857142857}
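
The symmetry holds because the reported score is an F-measure, the harmonic mean of precision and recall. Swapping the two texts swaps precision and recall but leaves their harmonic mean unchanged. A minimal sketch, assuming whitespace tokenization, non-empty inputs, and a hypothetical helper `unigram_prf`:

```python
from collections import Counter


def unigram_prf(prediction, reference):
    """Unigram precision, recall, and F1 (assumes non-empty inputs)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


short_text = "Large"
long_text = "Large language models beat world record"

p1, r1, f1_a = unigram_prf(short_text, long_text)
p2, r2, f1_b = unigram_prf(long_text, short_text)
print(round(p1, 4), round(r1, 4), round(f1_a, 4))  # 1.0 0.1667 0.2857
print(round(p2, 4), round(r2, 4), round(f1_b, 4))  # 0.1667 1.0 0.2857
```

A one-word prediction is perfectly precise but recalls only one sixth of the reference. The reverse case has the opposite profile, and both yield F1 = 2/7.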

In [19]:
# What about 2 words?  Note how 'rouge1' and 'rouge2' compare with the case when we predict exactly 1 word correctly.
rouge.compute(
    predictions=["Large language"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

{'rouge1': 0.5, 'rouge2': 0.33333333333333337, 'rougeL': 0.5, 'rougeLsum': 0.5}

In [20]:
# Note how rouge1 differs from the rougeN (N>1) scores when we predict word subsequences correctly.
rouge.compute(
    predictions=["Models beat large language world record"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

{'rouge1': 1.0,
 'rouge2': 0.6,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}
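
The `rougeL` value in the reordered example can be reproduced with a short longest-common-subsequence (LCS) sketch. As before, this assumes whitespace tokenization and no stemming, and the names `lcs_length` and `rouge_l_f1` are hypothetical rather than part of `rouge_score`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(prediction, reference):
    """F-measure based on LCS length over whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "Models beat large language world record",
    "Large language models beat world record",
)
print(round(score, 4))  # 0.6667
```

All six words match, so `rouge1` is 1.0, but the longest in-order subsequence has only four tokens (for example "large language world record"), so the LCS-based score drops to 4/6.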

## Compare small and large models

We've been working with the `t5-small` model so far.  Let's compare several models with different architectures in terms of their ROUGE scores and some example generated summaries.


In [21]:
import pandas as pd
def compute_rouge_per_row(generated_summaries: list, reference_summaries: list) -> pd.DataFrame:
    """
    Build a DataFrame of per-example ROUGE scores alongside the generated and reference summaries.
    """
    generated_with_newlines = [
        "\n".join(sent_tokenize(s.strip())) for s in generated_summaries
    ]
    reference_with_newlines = [
        "\n".join(sent_tokenize(s.strip())) for s in reference_summaries
    ]
    scores = rouge.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
        use_aggregator=False,
    )
    scores["generated"] = generated_summaries
    scores["reference"] = reference_summaries
    return pd.DataFrame.from_dict(scores)

### T5-small

The [T5](https://huggingface.co/docs/transformers/model_doc/t5) [[paper]](https://arxiv.org/pdf/1910.10683.pdf) family of models are text-to-text transformers that have been trained on a multi-task mixture of unsupervised and supervised tasks. They are well suited for tasks such as summarization, translation, text classification, question answering, and more.

The t5-small version of the T5 models has 60 million parameters.


In [22]:
compute_rouge_per_row(final_df['generated'], final_df['reference'])


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,generated,reference
0,0.539683,0.459016,0.349206,0.539683,magnitude 6.7 quake rattles Papua new Guinea e...,Papua New Guinea is on the so-called Ring of F...
1,0.538462,0.294118,0.326923,0.519231,australia bowled out their opponents for just ...,Australia collapse to 88 all out on opening da...
2,0.4,0.128205,0.35,0.4,federal prosecutors are pushing to force jared...,Jared Loughner is refusing the government's re...
3,0.365591,0.175824,0.27957,0.344086,"""he tried to kill people,"" a 17-year-old high ...",Shooting victim McKayla Hicks went to hearing ...
4,0.477876,0.234234,0.300885,0.424779,double amputee sprinter Oscar Pistorius named ...,Oscar Pistorius to become first double-amputee...
5,0.36,0.163265,0.28,0.36,a grand jury indicted the governor on charges ...,"NEW: Perry lawyer calls indictments ""political..."
6,0.494118,0.192771,0.423529,0.470588,a prosecutor says there is enough evidence to ...,Prosecutor to judge: Enough evidence for inves...
7,0.408602,0.241758,0.301075,0.408602,"UEFA says it is acting over ""the setting-off a...",NEW: UEFA president Michel Platini urges fans ...
8,0.504348,0.247788,0.365217,0.486957,new: presidents agree to create a new high-lev...,A new high-level group to discuss economic coo...
9,0.361111,0.142857,0.194444,0.361111,more than 100 police officers and others are s...,Four inmates escape from jail in St. Tammany P...


### T5-base

The [T5-base](https://huggingface.co/t5-base) model has 220 million parameters.


In [34]:
t5_base_summaries = summarize("t5-base")


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Your max_length is set to 200, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)


In [37]:
t5_base_summaries_final = [x['summary_text'] for x in t5_base_summaries]


In [40]:
compute_rouge_per_row(t5_base_summaries_final, reference_summaries)

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,generated,reference
0,0.5625,0.451613,0.34375,0.5625,the quake was centered about 200 miles north-n...,Papua New Guinea is on the so-called Ring of F...
1,0.523364,0.190476,0.299065,0.46729,Pakistan reach 148-3 on opening day of two-Tes...,Australia collapse to 88 all out on opening da...
2,0.333333,0.028571,0.194444,0.277778,federal prosecutors want a sample of jared Lee...,Jared Loughner is refusing the government's re...
3,0.176471,0.02,0.098039,0.156863,"""i think it's cool that I have a bullet in my ...",Shooting victim McKayla Hicks went to hearing ...
4,0.424242,0.103093,0.222222,0.363636,double-amputee sprinter Oscar Pistorius will c...,Oscar Pistorius to become first double-amputee...
5,0.444444,0.272727,0.4,0.444444,"new: governor's attorney calls indictment a ""p...","NEW: Perry lawyer calls indictments ""political..."
6,0.494118,0.096386,0.282353,0.447059,prosecutor says there is enough evidence to co...,Prosecutor to judge: Enough evidence for inves...
7,0.387097,0.087912,0.193548,0.387097,UEFA opens disciplinary proceedings against Cr...,NEW: UEFA president Michel Platini urges fans ...
8,0.26087,0.066667,0.173913,0.23913,"""that's the focus of my visit,"" he says after ...",A new high-level group to discuss economic coo...
9,0.324324,0.166667,0.297297,0.324324,"four inmates escaped from jail in covington, s...",Four inmates escape from jail in St. Tammany P...


In [41]:
result = compute_rouge_score(t5_base_summaries_final, reference_summaries)

print(result)

{'rouge1': 0.39531465679601363, 'rouge2': 0.1470885842887798, 'rougeL': 0.25091263428472615, 'rougeLsum': 0.36849785041564803}



Aggregate scores so far:

* `t5-small`: 'rouge1': 0.4447858287088081, 'rouge2': 0.22608456497804547, 'rougeL': 0.31858071565885293, 'rougeLsum': 0.43089785199320496
* `t5-base`: 'rouge1': 0.39531465679601363, 'rouge2': 0.1470885842887798, 'rougeL': 0.25091263428472615, 'rougeLsum': 0.36849785041564803

In [51]:
generator = pipeline('summarization', model='facebook/bart-large-cnn')
bart_large_cnn_outputs = generator(sample['article'])

Your max_length is set to 142, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


In [54]:
bart_large_cnn_summaries = [x['summary_text'] for x in bart_large_cnn_outputs]

In [55]:
result = compute_rouge_score(bart_large_cnn_summaries, reference_summaries)

In [56]:
print(result)

{'rouge1': 0.4557204805225591, 'rouge2': 0.23169686574591314, 'rougeL': 0.3064289165389464, 'rougeLsum': 0.4301153097115594}


In [59]:
s = """
t5-small -              'rouge1': 0.4447858287088081, 'rouge2': 0.22608456497804547, 'rougeL': 0.31858071565885293, 'rougeLsum': 0.43089785199320496
t5-base  -              'rouge1': 0.39531465679601363, 'rouge2': 0.1470885842887798, 'rougeL': 0.25091263428472615, 'rougeLsum': 0.36849785041564803
facebook/bart-large-cnn 'rouge1': 0.4557204805225591, 'rouge2': 0.23169686574591314, 'rougeL': 0.3064289165389464, 'rougeLsum': 0.4301153097115594
"""
print(s)


t5-small -              'rouge1': 0.4447858287088081, 'rouge2': 0.22608456497804547, 'rougeL': 0.31858071565885293, 'rougeLsum': 0.43089785199320496
t5-base  -              'rouge1': 0.39531465679601363, 'rouge2': 0.1470885842887798, 'rougeL': 0.25091263428472615, 'rougeLsum': 0.36849785041564803
facebook/bart-large-cnn 'rouge1': 0.4557204805225591, 'rouge2': 0.23169686574591314, 'rougeL': 0.3064289165389464, 'rougeLsum': 0.4301153097115594
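
To make the comparison easier to read, the aggregate scores can be kept in a dictionary and printed as an aligned table. This is a sketch: the values are the scores reported above, rounded to four decimals, and the `scores`, `metrics`, and `best` names are illustrative. With only 10 sampled articles, small differences between models should not be over-interpreted.

```python
# Aggregate ROUGE scores reported above, rounded to 4 decimals for display.
scores = {
    "t5-small": {"rouge1": 0.4448, "rouge2": 0.2261, "rougeL": 0.3186, "rougeLsum": 0.4309},
    "t5-base": {"rouge1": 0.3953, "rouge2": 0.1471, "rougeL": 0.2509, "rougeLsum": 0.3685},
    "facebook/bart-large-cnn": {"rouge1": 0.4557, "rouge2": 0.2317, "rougeL": 0.3064, "rougeLsum": 0.4301},
}
metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

# Print a fixed-width table: one row per model, one column per metric.
print(f"{'model':<26}" + "".join(f"{m:>10}" for m in metrics))
for model_name, row in scores.items():
    print(f"{model_name:<26}" + "".join(f"{row[m]:>10.4f}" for m in metrics))

# For each metric, find the model with the highest score.
best = {m: max(scores, key=lambda name: scores[name][m]) for m in metrics}
print(best)
```

On this sample, `facebook/bart-large-cnn` leads on `rouge1` and `rouge2`, while `t5-small` is marginally ahead on `rougeL` and `rougeLsum`.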

