# LLM as judge

In [2]:
from IPython.display import display, Markdown

Finally, we can use an LLM to evaluate the quality of the generated text. This approach is more flexible and nuanced than rule-based or statistical metrics, and it sometimes works better than cosine similarity. It is, however, more expensive and slower to run.

Note that this also introduces a new source of error: the judge is itself an LLM, and it is not always easy to tell whether its evaluation is fair.


For example, imagine we have the following text:

> Policy Lab has experimented with Artificial Intelligence (AI) in policy development with teams across government, and beyond, for a number of years. In 2019 we worked with the Department for Transport’s data science team to consider the role that AI could play in improving the efficiency and effectiveness of the policy consultation process. In 2022 we used AI to create a vision for the future of Hounslow with the local authority. In 2023, we commissioned the creation of the Ecological Intelligence Agency, a speculative artefact to help experience the role AI might have in future decision-making in environmental policy. 

We asked the original writer to provide a summary of the text. **This is our golden reference**:

> Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making.

### What we want to evaluate

We then asked different language models to produce a summary of the text using a variety of prompts.


In [3]:
reference_summary = "Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making."
summary_1 = "Policy Lab has explored the application of Artificial Intelligence in diverse government policy initiatives, including enhancing policy consultation, envisioning local community futures, and examining AI's potential role in environmental policy decisions."
summary_2 = "Policy Lab has explored AI applications in governmental policy creation, collaborating with various agencies to enhance consultation procedures, generate urban forecasts, and envision futuristic environmental decision-making tools."
summary_3 = "Policy Lab has been using Artificial Intelligence to replace human decision-making in government policy development, creating automated systems to make key decisions without input from policymakers or the public."

display(Markdown(f"""
## Reference summary\n{reference_summary}\n
## Model 1\n{summary_1}\n
## Model 2\n{summary_2}\n
## Model 3 (worst)\n{summary_3}\n
"""))


## Reference summary
Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making.

## Model 1
Policy Lab has explored the application of Artificial Intelligence in diverse government policy initiatives, including enhancing policy consultation, envisioning local community futures, and examining AI's potential role in environmental policy decisions.

## Model 2
Policy Lab has explored AI applications in governmental policy creation, collaborating with various agencies to enhance consultation procedures, generate urban forecasts, and envision futuristic environmental decision-making tools.

## Model 3 (worst)
Policy Lab has been using Artificial Intelligence to replace human decision-making in government policy development, creating automated systems to make key decisions without input from policymakers or the public.



## Preparing our judge

As we will be using an LLM, we need to craft a prompt that instructs the language model to evaluate the quality of the generated text.

Typically, the prompt asks the judge to evaluate the generated text against a given reference and to output a score on a fixed scale, such as 1 to 5 or 1 to 10. You can also ask the LLM to output a ranking, or a more verbose evaluation.

For this example, we will ask the LLM to evaluate the generated text against a given reference and output a score between 1 and 10. As an extra, we will also ask it to explain its score.

To automate the evaluation, we will also require the LLM to output its score in a specific format, so that we can parse it easily.
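As an illustration of the ranking alternative, a prompt template might look like the sketch below. The wording, the `ranking_prompt` name, and the `build_ranking_prompt` helper are all hypothetical and are not used in the rest of this notebook.

```python
# Hypothetical ranking-style judge prompt; the wording and output format are assumptions.
ranking_prompt = """
You are an expert at evaluating the quality of summaries.
You will be given a reference summary and several candidate summaries.
Rank the candidates from best to worst based on how faithfully they match the reference.
Output only a JSON list of candidate labels, best first, for example ["B", "A"].

The reference summary is:
{reference_summary}

The candidate summaries are:
{candidates}
"""


def build_ranking_prompt(reference_summary: str, candidates: dict[str, str]) -> str:
    """Fill the template, listing each candidate as '<label>: <text>' on its own line."""
    candidate_block = "\n".join(f"{label}: {text}" for label, text in candidates.items())
    return ranking_prompt.format(
        reference_summary=reference_summary, candidates=candidate_block
    )
```

A ranking avoids asking the judge for an absolute number, which can be useful when the candidates are close in quality, at the cost of not knowing how far apart they are.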



In [7]:
judge_prompt = """
You are an expert at evaluating the quality of summaries.
You will be given a reference summary and a generated summary.
You need to evaluate the quality of the generated summary based on the reference summary, and output a score between 1 and 10.
You also need to provide an explanation for your score.
You need to output your score and explanation in the following JSON format:

{{"score": <score>, "explanation": <explanation>}}

The reference summary is:
```
{reference_summary}
```

The generated summary is:
```
{summary}
```

Your output should only be the JSON, nothing else.
"""

In [9]:


summary_1_prompt = judge_prompt.format(reference_summary=reference_summary, summary=summary_1)

print(summary_1_prompt)


You are an expert at evaluating the quality of summaries.
You will be given a reference summary and a generated summary.
You need to evaluate the quality of the generated summary based on the reference summary, and output a score between 1 and 10.
You also need to provide an explanation for your score.
You need to output your score and explanation in the following JSON format:

{"score": <score>, "explanation": <explanation>}


The reference summary is:
Policy Lab has explored the use of AI in policy development across various government projects, including improving policy consultation processes, envisioning future urban planning, and investigating AI's potential role in environmental policy decision-making.

The generated summary is:
Policy Lab has explored the application of Artificial Intelligence in diverse government policy initiatives, including enhancing policy consultation, envisioning local community futures, and examining AI's potential role in environmental policy decision

In [11]:
from common import get_response

summary_1_judge_response = get_response(summary_1_prompt)

print(summary_1_judge_response)

{
  "score": 9,
  "explanation": "The generated summary accurately captures the key points of the reference summary with only minor differences. It correctly mentions Policy Lab's exploration of AI in government projects, including policy consultation improvements and environmental policy decision-making. The main difference is in the phrasing of 'future urban planning' in the reference, which is described as 'envisioning local community futures' in the generated summary. This is a slight shift in focus but still conveys a similar concept. The generated summary maintains the essence and scope of the original while using slightly different wording, demonstrating a high level of accuracy and comprehension."
}
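Because the judge replies in JSON, the score can be pulled out programmatically. The following is a minimal sketch, assuming the reply contains exactly one JSON object with `score` and `explanation` keys; `parse_judge_response` is a hypothetical helper, not part of the `common` module.

```python
import json
import re


def parse_judge_response(raw: str) -> tuple[int, str]:
    """Extract (score, explanation) from a judge reply that should contain one JSON object."""
    # Models sometimes surround the JSON with extra text, so grab the outermost {...} span.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge response")
    data = json.loads(match.group(0))
    score = int(data["score"])
    # Reject scores outside the scale requested in the prompt.
    if not 1 <= score <= 10:
        raise ValueError(f"Score {score} is outside the 1-10 scale")
    return score, str(data["explanation"])


# Example with a shortened reply in the same shape as the output above.
sample_reply = '{"score": 9, "explanation": "Captures the key points with minor differences."}'
score, explanation = parse_judge_response(sample_reply)
print(score)  # 9
```

Validating the range catches cases where the judge ignores the requested scale, which would otherwise skew any aggregate computed later.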


In [12]:
from common import get_response

summary_3_prompt = judge_prompt.format(reference_summary=reference_summary, summary=summary_3)

summary_3_judge_response = get_response(summary_3_prompt)

print(summary_3_judge_response)

{
  "score": 2,
  "explanation": "The generated summary severely misrepresents the content of the reference summary. While the reference indicates that Policy Lab has explored the use of AI in various aspects of policy development, the generated summary incorrectly states that AI is being used to replace human decision-making entirely. The reference mentions AI being used to improve processes and investigate potential roles, not to automate key decisions without human input. The generated summary also omits the specific areas mentioned in the reference (consultation processes, urban planning, environmental policy) and introduces unfounded claims about excluding policymakers and the public. This summary is highly inaccurate and misleading compared to the reference."
}
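Since the judge is itself a model, its scores may vary between calls. One rough way to gauge this is to ask the same question several times and look at the spread. The sketch below assumes the judge function takes a prompt string and returns a bare JSON string, as `get_response` appears to do above; `score_with_repeats` and `fake_judge` are hypothetical names used only for illustration.

```python
import json
from statistics import mean, pstdev
from typing import Callable


def score_with_repeats(
    prompt: str,
    judge_fn: Callable[[str], str],
    n_runs: int = 3,
) -> dict[str, float]:
    """Ask the judge the same question n_runs times and summarize the scores."""
    scores = []
    for _ in range(n_runs):
        reply = judge_fn(prompt)
        # Assumes the reply is bare JSON, as the judge prompt requests.
        scores.append(float(json.loads(reply)["score"]))
    # Population standard deviation: 0.0 means the judge gave the same score every time.
    return {"mean": mean(scores), "spread": pstdev(scores)}


# Stand-in judge so the sketch runs offline; in the notebook you would pass get_response.
def fake_judge(prompt: str) -> str:
    return '{"score": 9, "explanation": "stub reply"}'


print(score_with_repeats("any prompt", fake_judge))
```

A large spread for the same prompt suggests the individual scores should not be trusted on their own, and that averaging over several runs (or tightening the prompt) may be worthwhile. Each extra run adds cost and latency.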
