[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cjaniake/xsum_llm_eval/blob/main/docs/getting_started.ipynb)

# Evaluating LLM applications

In [11]:
import litellm
import pandas as pd

from asyncio import Semaphore
from datasets import load_dataset
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
from litellm.types.completion import ChatCompletionMessageParam
from typing import List

from IPython.display import display, Markdown

### Load the XSum dataset from HuggingFace Datasets

The `datasets` library caches the dataset locally, so it is not downloaded again on later runs.

In [2]:
xsum_dataset = load_dataset("EdinburghNLP/xsum")

In [3]:
xsum_dataset["validation"].to_pandas().head(3)

| | document | summary | id |
|---|---|---|---|
| 0 | The ex-Reading defender denied fraudulent trad... | Former Premier League footballer Sam Sodje has... | 38295789 |
| 1 | Voges was forced to retire hurt on 86 after su... | Middlesex batsman Adam Voges will be out until... | 40202028 |
| 2 | Seven photographs taken in the Norfolk country... | The Duchess of Cambridge will feature on the c... | 36177725 |


### Draw a sample from the dataset

Let's take 4 random examples from the validation split and analyse the summaries generated by GPT-4o-mini.

In [4]:
RANDOM_STATE = 123
sample_df = xsum_dataset["validation"].to_pandas().sample(4, random_state=RANDOM_STATE)
sample_df

| | document | summary | id |
|---|---|---|---|
| 10804 | The six-week mission comes in the face of comp... | Canada has launched a mission to map the Arcti... | 28718806 |
| 2045 | In a report going to board members next week, ... | NHS Highland has forecast a potential £15.6m o... | 28688328 |
| 10086 | The B4270 and other local roads are affected, ... | Drivers have been warned to take care as smoke... | 39175440 |
| 3484 | The proposal, dubbed Hornsea Three, would crea... | A public consultation into plans for a third o... | 36206631 |


### Explore the summarization metric

In [17]:
system_prompt = {
    "content": "Your task is to summarize the provided text. "
    + "Explain in one sentence what the news article is about. "
    + "Your response should be direct, don't mention the article, just give me the summary. "
    + "You should cover the main facts present in the text and prevent from adding facts that are not explicit in the text. "
    + "Be brief and straight to the point",
    "role": "system",
}


def summarization_prompt(document: str) -> List[ChatCompletionMessageParam]:
    return [system_prompt, {"content": document, "role": "user"}]


def run_summarization(summarization_model, input_document):
    llm_response = litellm.completion(
        model=summarization_model,
        messages=summarization_prompt(input_document),
        max_tokens=200,
    )
    response = llm_response["choices"][0]["message"]["content"]
    return response
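
The notebook imports `Semaphore` but issues its summarization calls one at a time. As a rough sketch only, the pattern below shows how a semaphore could cap the number of requests in flight if the calls were made asynchronously. `fake_summarize`, `summarize_all` and `max_concurrency` are hypothetical names introduced for illustration; the stub stands in for a real async client call (for example LiteLLM's `acompletion`) so that the example runs without an API key.

```python
import asyncio
from asyncio import Semaphore


async def fake_summarize(document: str) -> str:
    """Hypothetical stand-in for an async LLM call: returns the first sentence."""
    await asyncio.sleep(0.01)  # simulate network latency
    return document.split(". ")[0]


async def summarize_all(documents, max_concurrency=2):
    """Summarize all documents with at most `max_concurrency` calls in flight."""
    semaphore = Semaphore(max_concurrency)

    async def bounded(document):
        async with semaphore:  # blocks while max_concurrency calls are running
            return await fake_summarize(document)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(d) for d in documents))


docs = ["First fact. More detail.", "Second fact. More detail.", "Third fact."]
# Inside a notebook, where an event loop is already running,
# use `await summarize_all(docs)` instead of asyncio.run.
summaries = asyncio.run(summarize_all(docs))
print(summaries)
```

Because `gather` preserves input order, the resulting list can still be zipped with the rows of `sample_df`, as `evaluate_samples` expects.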

In [18]:
def summarize_samples(sample_df, summarization_model):
    return [
        run_summarization(summarization_model, r["document"])
        for _, r in sample_df.iterrows()
    ]


def evaluate_samples(sample_df, summaries, evaluation_model):
    """Score each summary against its source document and display the details."""
    eval_scores = []
    for (_, sample_data), generated_summary in zip(sample_df.iterrows(), summaries):
        input_document = sample_data["document"]

        test_case = LLMTestCase(input=input_document, actual_output=generated_summary)
        metric = SummarizationMetric(
            threshold=0.5, model=evaluation_model, verbose_mode=False, async_mode=False
        )

        metric.measure(test_case, _show_indicator=False)
        eval_scores.append(metric.score_breakdown | {"Score": metric.score})

        display(Markdown(f"#### Input Text\n> {input_document}"))
        display(Markdown(f"#### Summary\n{generated_summary}"))
        display(Markdown(f"#### Score: **{metric.score}**"))
        display(Markdown(metric.reason))
        display(
            Markdown(
                "\n".join(
                    f"- {k.lower()} score: {v:.2f}"
                    for k, v in metric.score_breakdown.items()
                )
            )
        )
        display(Markdown("---"))
    return pd.DataFrame(eval_scores)

In [20]:
gpt_4o_mini_summaries = summarize_samples(sample_df, "gpt-4o-mini")

In [21]:
gpt4o_eval_scores = evaluate_samples(sample_df, gpt_4o_mini_summaries, "gpt-4o")

#### Input Text
> The six-week mission comes in the face of competing claims from other countries, including Russia.
Two ice-breakers are setting out from Newfoundland to collect data from an undersea ridge that starts near Ellesmere Island and runs to the Pole.
The region is believed to include large oil and gas reserves.
Last December Canada filed an application with the UN seeking to vastly expand its Atlantic sea boundary.
Russia and Denmark have also made competing claims on a vast area of Arctic seabed around the Lomonosov Ridge.
All three countries are seeking scientific proof that the ridge is an underwater extension of their continental shelf.
The area is estimated to hold 13% of the world's undiscovered oil and up to 30% of its hidden natural gas reserves.
A Canadian government statement said the first icebreaker had left St John's, Newfoundland, on Friday and the second would depart on Saturday.
"Our government is securing our sovereignty while expanding our economic and scientific opportunities by defining Canada's last frontier," said Environment Minister Leona Aglukkaq.
"This is important to Canadians, especially those in the north, as this is their future and prosperity at stake."
Under the UN Convention on the Law of the Sea, a coastal nation can claim exclusive economic rights to natural resources on or beneath the sea floor up to 200 nautical miles (370km) beyond their land territory.
But if the continental shelf extends beyond that distance, the country must provide evidence to a UN commission which will then make recommendations about establishing an outer limit.

#### Summary
Canada is conducting a six-week mission with two icebreakers to gather data in a contested Arctic region while competing with Russia and Denmark for claims to potential oil and gas reserves beneath the seabed.

#### Score: **0.2**

The score is 0.20 because the summary includes contradictory details about the nature of the territorial claims in the Arctic, focusing incorrectly on oil and gas reserves instead of the Arctic seabed around Lomonosov Ridge. Additionally, the summary omits critical information about Canada's action in filing a UN application, the specifics of the icebreakers, and the mission's data collection focus, which were present in the original text.

- alignment score: 0.80
- coverage score: 0.20

---

#### Input Text
> In a report going to board members next week, finance director Nick Kenton said the forecast gives "significant cause for concern".
However, he said there was time to breakeven by the end of financial year.
Raigmore Hospital in Inverness accounts for £10.5m of the potential projected overspend.
Mr Kenton said "significant effort" was needed to achieve "financial break even" and avoid the potential £15.6m overspend by the end of March next year.
Waiting list initiatives, nursing reviews and the use of medical locums have been identified as causes behind the overspend at Raigmore, the largest hospital in the Highlands.
Board members are due to discuss the matter on Tuesday.
In a statement, the health board said: "The £15.6m overspend is a potential projected figure based on the first three months of the financial year and while the board appreciates that it will take a significant effort to achieve financial breakeven, NHS Highland remains confident that there is sufficient time remaining in the financial year to allow the breakeven target to be met."
At the end of the last financial year in April the health board had to ask the Scottish government for a £2m loan to balance its books.

#### Summary
Raigmore Hospital is facing a potential £15.6 million overspend, which finance director Nick Kenton believes presents significant concern, but there is confidence that financial breakeven can still be achieved by the end of the financial year.

#### Score: **0.5**

The score is 0.50 because the summary contains factual inaccuracies, such as misstating the overspend amount at Raigmore Hospital and incorrectly identifying Nick Kenton's position. It also fails to address certain queries that the original text answers, indicating crucial content omissions.

- alignment score: 0.50
- coverage score: 0.60

---

#### Input Text
> The B4270 and other local roads are affected, Vale of Glamorgan council has said.
The blaze at Llandow Industrial Estate started on Thursday, and South Wales Fire and Rescue Service crews are still on site.
Residents are also being advised to keep windows and doors closed.
A fire service spokesperson said: "Our crews are monitoring the fire and will do so over the coming days to ensure that the incident is contained."
Officials said no asbestos is believed to be at the site.
However, Public Health Wales (PHW) said substances in the smoke could cause coughing, wheezing, breathlessness and chest pain.
People with asthma and other respiratory conditions may be "particularly susceptible" to the smoke and should carry and use their medication.
Andrew Kibble of the PHW environmental public health team recommended people stay out of the smoke and if at home, close doors and windows.
The council's emergency planning team is working to limit disruption.
Local schools have been advised to open as usual, and there is no impact on the local water supply.

#### Summary
A fire at the Llandow Industrial Estate is ongoing, affecting local roads and prompting health advisories for residents to keep windows closed due to potentially harmful smoke, while emergency services monitor the situation.

#### Score: **0.2**

The score is 0.20 because the summary omits critical questions that the original text addresses, indicating a significant lack of coverage and completeness.

- alignment score: 1.00
- coverage score: 0.20

---

#### Input Text
> The proposal, dubbed Hornsea Three, would create a complex that would be the "world's largest offshore wind farm", Dong Energy said.
Hornsea Project One, which is being built 75 miles (120 km) out at sea, is expected to be operational in 2020.
The RSPB has previously voiced concerns about the "high collision risk for seabirds using the area".
Brent Cheshire, Dong Energy UK's chairman, said moving forward with Hornsea Project Three "underlines our commitment to the UK offshore wind industry."
An application for consent to build Hornsea Project Two was submitted to the Planning Inspectorate in January last year. A decision is expected later in the year.
The energy firm said it expected to submit a planning application for Hornsea Project Three in 2018. If approved, construction was anticipated for sometime between 2022 and 2025.
Dong Energy has already built the Westermost Rough wind farm which consists of 35 turbines located 16 miles (25km) off the Holderness coast.

#### Summary
Dong Energy is proposing Hornsea Three, which aims to become the world's largest offshore wind farm, following the ongoing developments of Hornsea Project One and Two amid environmental concerns regarding seabird collisions.

#### Score: **0.2**

The score is 0.20 because the summary includes extra information not supported by the original text, such as the relationship of Hornsea Three to other projects, and fails to answer several key questions addressed in the original text. These shortcomings significantly reduced the summarization score.

- alignment score: 0.75
- coverage score: 0.20

---
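
In all four results above, the overall score equals the lower of the alignment and coverage sub-scores, which matches how DeepEval documents `SummarizationMetric`, to the best of my understanding. The sketch below re-checks that relationship using the numbers copied from the output. The lowercase dictionary keys and the `overall_score` helper are assumptions made for illustration, not part of the DeepEval API.

```python
# Sub-scores and overall scores copied from the four results printed above.
breakdowns = [
    {"alignment": 0.80, "coverage": 0.20},
    {"alignment": 0.50, "coverage": 0.60},
    {"alignment": 1.00, "coverage": 0.20},
    {"alignment": 0.75, "coverage": 0.20},
]
reported_scores = [0.2, 0.5, 0.2, 0.2]


def overall_score(breakdown):
    """Hypothetical helper: the overall score is the weakest sub-score."""
    return min(breakdown.values())


recomputed = [overall_score(b) for b in breakdowns]

# Name of the sub-score that limits each result.
weakest = [min(b, key=b.get) for b in breakdowns]

for reported, value, name in zip(reported_scores, recomputed, weakest):
    print(f"reported={reported} recomputed={value} limited by {name}")
```

Coverage is the limiting factor in three of the four samples, which is plausible for a prompt that asks for a one-sentence summary: a single sentence can stay faithful to the text and still leave most of its content out.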