# Using LLM-as-a-judge for an automated evaluation

Evaluating large language models (LLMs) is often difficult: given their broad capabilities, the tasks they are given often have to be judged on requirements that are broad and loosely defined. For instance, an assistant's answer to a question can be:
- not grounded in context
- repetitive, repetitive, repetitive
- grammatically incorrects
- Excessively lengthy and characterized by an overabundance of words, leading to a situation where the discourse or written content becomes overly detailed and protracted
- incoherent
- ...

The list of criteria goes on and on. And even if we had a limited list, each of these would be hard to measure: "devising a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these questions."

A powerful solution to assess outputs in a human way, without requiring costly human time, is LLM-as-a-judge.
This method was introduced in [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://huggingface.co/papers/2306.05685) - which I encourage you to read.

💡 The idea is simple: ask an LLM to do the grading for you. 🤖✓

But we'll see that it will not work well out-of-the-box: you need to set it up carefully for good results.

In [None]:
!pip install huggingface_hub datasets pandas tqdm -q

In [None]:
import re
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import InferenceClient, notebook_login

tqdm.pandas()  # load tqdm's pandas support
pd.set_option("display.max_colwidth", None)

notebook_login()

In [None]:
repo_id = "meta-llama/Llama-3.1-8B"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)

In [None]:
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)

## 1. Prepare the creation and evaluation of our LLM judge

Let's say you want to give an LLM a specific task, like answering open-ended questions.

The difficulty is that, as we discussed above, measuring the answer's quality is hard. For instance, an exact string match will flag too many correct but differently worded answers as false.

You could get human labellers to judge the outputs, but this is very time-consuming for them, and if you want to update the model or the questions, you have to do it all over again.

✅ In this case you can set up an LLM-as-a-judge.

**But to use an LLM-as-a-judge, you will first need to evaluate how reliably it rates your model outputs.**

➡️ So the first step will be... to create a human evaluation dataset. You only need human annotations for a few examples: something like 30 should be enough to get a good idea of the performance.
And you will be able to reuse this dataset every time you want to test your LLM-as-a-judge.

In our case, we will use [`feedbackQA`](https://huggingface.co/datasets/McGill-NLP/feedbackQA), which contains 2 human evaluations and scores for each question/answer pair. Using a sample of 30 examples will be representative of what your small evaluation dataset could be.

In [None]:
!wget https://github.com/McGill-NLP/feedbackqa/raw/main/data/feedback_train.json


In [None]:
import json

# Load the training split of FeedbackQA into a DataFrame
with open("feedback_train.json") as f:
    ratings = pd.DataFrame(json.load(f))

In [None]:
ratings.head(1)

In [None]:
ratings["rating"].apply(len).value_counts()

In [None]:
# Split the two reviewers' ratings and explanations into separate columns
ratings['review_1'] = ratings['rating'].apply(lambda x: x[0])
ratings['review_2'] = ratings['rating'].apply(lambda x: x[1])
ratings['explanation_1'] = ratings['feedback'].apply(lambda x: x[0])
ratings['explanation_2'] = ratings['feedback'].apply(lambda x: x[1])
ratings = ratings.drop(columns=["feedback"])

In [None]:
# Map scores to numeric values
conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}
ratings["score_1"] = ratings["review_1"].map(conversion_dict)
ratings["score_2"] = ratings["review_2"].map(conversion_dict)

In [None]:
ratings.head(1)

It's always a good idea to compute a baseline for performance: here it can be for instance the agreement between the two human raters, as measured by the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the scores they give.

In [None]:
print("Correlation between 2 human raters:")
print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")

The correlation between the 2 human raters is not that good. If agreement between your human raters is really poor, it probably means the rating criteria are not clear enough.

This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it.
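Pearson correlation is only one view of rater agreement. As a complementary sketch, the snippet below computes how often two raters give exactly the same score, and how often they differ by at most one point. The function name `agreement_rates` and the scores are made up for illustration; they are not part of the dataset.

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact, within_one) agreement rates for two equally long score lists."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    # Share of examples where both raters gave the same score
    exact = sum(a == b for a, b in pairs) / len(pairs)
    # Share of examples where the scores differ by at most one point
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one


# Hypothetical scores on the 1 to 4 scale
rater_1 = [4, 3, 2, 1, 4, 2]
rater_2 = [4, 2, 2, 3, 4, 1]
exact, within_one = agreement_rates(rater_1, rater_2)
print(f"Exact agreement: {exact:.2f}, within one point: {within_one:.2f}")
```

On the real data, you could pass `ratings["score_1"].tolist()` and `ratings["score_2"].tolist()` instead of the toy lists.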

However, we could reduce this noise:
- by taking the average score as our ground truth instead of any single score, we should even out some of the irregularities.
- by only selecting the samples where the human reviewers are in agreement.

Here, we will choose the last option and **only keep examples where the 2 human reviewers are in agreement**.
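The two noise-reduction options above can be sketched side by side on a toy table. The `toy` DataFrame and its values are invented for illustration; only the column names `score_1` and `score_2` come from this notebook.

```python
import pandas as pd

# Hypothetical ratings from two reviewers on the 1 to 4 scale
toy = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3", "q4"],
        "score_1": [4, 3, 1, 2],
        "score_2": [4, 1, 1, 3],
    }
)

# Option 1: keep every example and use the mean of the two scores as ground truth
toy_avg = toy.assign(human_score=toy[["score_1", "score_2"]].mean(axis=1))

# Option 2: keep only the examples where both reviewers agree
toy_agree = toy[toy["score_1"] == toy["score_2"]].copy()
toy_agree["human_score"] = toy_agree["score_1"]

print(toy_avg)
print(toy_agree)
```

Averaging keeps all the data but produces fractional targets, while filtering on agreement keeps integer targets at the cost of dropping the contested examples.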

In [None]:
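# Keep only the examples where both reviewers gave the same score
same_ratings = ratings[ratings["score_1"] == ratings["score_2"]]
# The `passage` column holds the answer to be judged
same_ratings = same_ratings.rename(columns={"passage": "answer"})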

In [None]:
ratings.shape, same_ratings.shape

In [None]:
same_ratings.head(1)

## 2. Create our LLM judge
We build our LLM judge with a basic prompt, containing these elements:
- task description
- rating / confidence score description
- explanation of the output format

In [None]:
examples = same_ratings.sample(5, random_state=1214)
examples["human_score"] = examples["score_1"]

In [None]:
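from typing import Optional


def extract_judge_score(answer: str, split_str: str = "Total rating:") -> Optional[float]:
    """Return the first number found after `split_str`, or None if parsing fails."""
    try:
        # Only look at the text after the rating marker when it is present
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer
        digit_groups = re.findall(r"\d+(?:\.\d+)?", rating)
        return float(digit_groups[0])
    except Exception as e:
        print(e)
        return None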
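The parser above accepts any number it finds, including values outside the rating scale. As a hedged variant, the sketch below also rejects scores that fall outside an expected range. The function name `extract_score_in_range` and the `low`/`high` parameters are hypothetical, chosen here to match the 1 to 4 scale used in this notebook.

```python
import re
from typing import Optional


def extract_score_in_range(
    answer: str,
    split_str: str = "Total rating:",
    low: float = 1.0,
    high: float = 4.0,
) -> Optional[float]:
    """Return the first number after `split_str` if it lies in [low, high], else None."""
    # Look only at the text after the marker when the marker is present
    tail = answer.split(split_str, 1)[1] if split_str in answer else answer
    match = re.search(r"\d+(?:\.\d+)?", tail)
    if match is None:
        return None
    score = float(match.group())
    # Treat out-of-scale values as parsing failures rather than valid ratings
    return score if low <= score <= high else None


print(extract_score_in_range("Evaluation: covers 2 of 3 points. Total rating: 4"))
print(extract_score_in_range("Total rating: 7"))
```

Returning `None` for out-of-scale values makes it easy to count how often the judge ignored the requested format.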

### 2.1. LLM judge based on Scale

In [None]:
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback.
Feedback:::
Evaluation: """

In [None]:
examples.shape

In [None]:
examples["llm_judge_improved"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=IMPROVED_JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=500,
    ),
    axis=1,
)

In [None]:
examples["llm_judge_improved_score"] = examples["llm_judge_improved"].apply(
    extract_judge_score
)

In [None]:
print("Correlation between LLM-as-a-judge and the human raters:")
print(
    f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}"
)
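With only a handful of examples, a single unparsed judge output or a constant score column can distort the correlation or make it undefined. The sketch below guards against both cases; the helper name `judge_human_correlation` and the toy values are assumptions made for illustration.

```python
import pandas as pd


def judge_human_correlation(df, judge_col, human_col):
    """Return (pearson_or_None, n_failed), ignoring rows where the judge score is missing."""
    parsed = df.dropna(subset=[judge_col])
    n_failed = len(df) - len(parsed)
    # Pearson correlation is undefined when either column has no variance
    if parsed[judge_col].nunique() < 2 or parsed[human_col].nunique() < 2:
        return None, n_failed
    return parsed[judge_col].corr(parsed[human_col], method="pearson"), n_failed


# Hypothetical scores: one judge output could not be parsed (None)
toy_scores = pd.DataFrame(
    {
        "judge_score": [4, 3, None, 1, 2],
        "human_score": [4, 3, 2, 1, 1],
    }
)
corr, n_failed = judge_human_correlation(toy_scores, "judge_score", "human_score")
print(f"Correlation: {corr:.3f} ({n_failed} unparsed output(s))")
```

Reporting the number of unparsed outputs alongside the correlation shows whether a low score comes from poor judgments or from formatting failures.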

In [None]:
examples.columns

In [None]:
hidden_cols = ["answer", "explanation_1", "explanation_2", "llm_judge_improved"]
examples.drop(columns=hidden_cols)

### 2.2. LLM judge based on Structured Output Generation

Using **structured generation**, you can configure the LLM judge to directly provide its output as a JSON with fields `total_rating` and `confidence_score`, which makes parsing easier. In the prompt below, the JSON format is requested through the instructions.

In [None]:
JUDGE_PROMPT_JSON = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total_rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

You should provide your answer as a JSON blob containing your total rating and a confidence score as a float between 0 and 1.
Your answer should be built as follows, it must contain the "Answer:" and "End of answer." sequences.

Answer:
{{
  "total_rating": your_total_rating,
  "confidence_score": your_confidence_score
}}
End of answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}
"""

In [None]:
examples["llm_judge_JSON"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=JUDGE_PROMPT_JSON.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=500,
    ),
    axis=1,
)
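Once the judge answers in this format, the JSON blob between the `Answer:` and `End of answer.` markers can be parsed with the standard library. This is a minimal sketch: the helper name `parse_judge_json` is hypothetical, and the sample output is written by hand rather than produced by the model.

```python
import json
from typing import Optional


def parse_judge_json(output: str) -> Optional[dict]:
    """Extract the JSON blob between 'Answer:' and 'End of answer.', or return None."""
    start = output.find("Answer:")
    end = output.find("End of answer.")
    if start == -1 or end == -1 or end < start:
        return None
    blob = output[start + len("Answer:"):end].strip()
    try:
        parsed = json.loads(blob)
    except json.JSONDecodeError:
        return None
    # Require the field that the rest of the analysis depends on
    if not isinstance(parsed, dict) or "total_rating" not in parsed:
        return None
    return parsed


# Hand-written example of a well-formed judge output
sample_output = """Answer:
{
  "total_rating": 3,
  "confidence_score": 0.8
}
End of answer."""
print(parse_judge_json(sample_output))
```

It could then be applied to the generated column with `examples["llm_judge_JSON"].apply(parse_judge_json)`, keeping `None` for outputs that do not follow the format.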

## Conclusion

**You will never reach 100%:** Let's first note that our human ground truth certainly has some noise, so agreement/correlation will never go up to 100% even with a perfect LLM judge.

**Provide few-shot examples:** adding some few-shot examples of questions and ground truth evaluations in the prompt can improve the results.
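As a rough sketch of the few-shot idea, the snippet below renders a few rated examples into a text block that could be placed in the judge prompt before the question to be rated. The template, the helper name `build_few_shot_block` and the example questions are all invented for illustration. In practice the examples should come from your human-annotated set and be excluded from the rows used to evaluate the judge.

```python
FEW_SHOT_EXAMPLE_TEMPLATE = """Question: {question}
Answer: {answer}
Total rating: {score}"""


def build_few_shot_block(shots):
    """Render a list of dicts with 'question', 'answer' and 'score' keys as prompt text."""
    rendered = [FEW_SHOT_EXAMPLE_TEMPLATE.format(**shot) for shot in shots]
    return "Here are some rated examples:\n\n" + "\n\n".join(rendered)


# Hypothetical rated examples on the 1 to 4 scale
shots = [
    {
        "question": "How do I reset my password?",
        "answer": "Click 'Forgot password' on the login page and follow the emailed link.",
        "score": 4,
    },
    {
        "question": "How do I reset my password?",
        "answer": "Passwords are important for security.",
        "score": 1,
    },
]
few_shot_block = build_few_shot_block(shots)
print(few_shot_block)
```

Covering both ends of the scale in the examples gives the judge a reference point for what a 1 and a 4 look like.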