In [1]:
from google.colab import userdata


##### Copyright 2025 Google LLC.

In this notebook, we will learn some techniques for evaluating the output of a language model. As part of the evaluation, we will also use Gemini's structured data capability to produce evaluation results as instances of Python types.

In [2]:
from google import genai
from google.genai import types
from IPython.display import Markdown, display


In [3]:
genai.__version__


'1.8.0'

In [4]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

# Ensures that generate_content only gets wrapped once with retry logic.
# If generate_content fails due to a retriable error, it automatically retries instead of failing immediately.

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable
  )(genai.models.Models.generate_content)
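To make the wrapping above more concrete, here is a minimal, dependency-free sketch of what a predicate-based retry does. `FakeAPIError`, `retry_on` and `flaky_call` are hypothetical names used only for illustration, and the delays are arbitrary. This is not how `google.api_core.retry` is implemented, only the general pattern: re-run the call while the predicate matches, waiting longer each time.

```python
import functools
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"status {code}")
        self.code = code


def is_retriable(exc):
    # Same idea as the notebook's predicate: retry only on 429 and 503.
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def retry_on(predicate, max_attempts=5, initial_delay=0.01, multiplier=2.0):
    """Decorator sketch: re-run the function while predicate(exc) is true."""

    def decorator(fn):
        @functools.wraps(fn)  # sets wrapper.__wrapped__, which a "wrapped once" check can test
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on a non-retriable error.
                    if attempt == max_attempts or not predicate(exc):
                        raise
                    time.sleep(delay)
                    delay *= multiplier

        return wrapper

    return decorator


attempts = []


@retry_on(is_retriable)
def flaky_call():
    # Fails twice with a retriable error, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeAPIError(503)
    return "ok"


result = flaky_call()
print(result, len(attempts))
```

The `hasattr(..., '__wrapped__')` guard in the cell above relies on the same property shown here: a wrapper created with `functools.wraps` exposes the original function as `__wrapped__`, so re-running the cell does not stack a second retry layer.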


## Evaluation

When using LLMs in real-world cases, it's important to understand how well they are performing. The open-ended generation capabilities of LLMs can make many cases difficult to measure. In this notebook, we will walk through some simple techniques for evaluating LLM outputs and understanding their performance.

In the following example, we will evaluate a summarisation task using the Gemini 1.5 Pro technical report. We will start by downloading the PDF to the notebook environment, and uploading that copy for use with the Gemini API.

In [5]:
client = genai.Client(api_key=userdata.get('GOOGLE_API_KEY'))


In [6]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')


2025-03-31 19:46:03 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


### Summarising a document
The summarisation request used here is fairly basic. It targets the training content specifically but provides no guidance otherwise.

In [7]:
request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
  # Setting the temperature low to stabilize the output
  config = types.GenerateContentConfig(
      temperature = 0.0
  )

  response = client.models.generate_content(
      model = "gemini-2.0-flash",
      config = config,
      contents = [request, document_file],
  )
  return response.text


In [8]:
summary = summarise_doc(request)
Markdown(summary)


Based on the document you provided, here's a breakdown of the training process used for Gemini 1.5 Pro:

**1. Model Architecture:**

*   Gemini 1.5 Pro is a **sparse mixture-of-experts (MoE) Transformer-based model.** This means it builds upon the Transformer architecture (Vaswani et al., 2017) but incorporates a MoE structure.
*   MoE models use a **learned routing function** to direct inputs to a subset of the model's parameters for processing. This allows the model to have a large total parameter count while only activating a portion of those parameters for any given input.

**2. Training Data:**

*   The model is trained on a **variety of multimodal and multilingual data.**
*   The pre-training dataset includes data sourced from many different domains, including **web documents, code, images, audio, and video content.**

**3. Training Infrastructure:**

*   Gemini 1.5 Pro is trained on **multiple 4096-chip pods of Google's TPUv4 accelerators**, distributed across multiple datacenters.

**4. Training Process:**

*   **Pre-training:** The model is initially pre-trained on the large, diverse dataset mentioned above.
*   **Instruction Tuning:** After pre-training, Gemini 1.5 Pro is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses.
*   **Human Preference Tuning:** Further tuning is performed based on human preference data.

**Key Aspects and Innovations:**

*   **Long Context Understanding:** A series of significant architecture changes enable long-context understanding of inputs up to 10 million tokens without degrading performance.
*   **Efficiency:** Improvements across the entire model stack (architecture, data, optimization, and systems) allow Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra while using significantly less training compute and being significantly more efficient to serve.
*   **Multimodality:** The model is natively multimodal and supports interleaving of data from different modalities (audio, visual, text, code) in the same input sequence.

In summary, the training process involves a combination of large-scale pre-training on diverse multimodal data, followed by instruction tuning and human preference tuning, all leveraging a MoE architecture and Google's TPU infrastructure. A key focus is on enabling the model to handle extremely long contexts effectively.


### Defining an evaluator
For a task like this, we may wish to evaluate several aspects, such as how well the model followed the prompt ("instruction following"), whether it drew only on information supplied in the prompt ("groundedness"), how easy the text is to read ("fluency"), or other factors like "verbosity" or "quality". We can instruct an LLM to perform these tasks in a similar manner to how we would instruct a human rater: with a clear definition and assessment rubric. In this step, we define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.


In [9]:
import enum


In [10]:
# Defining the evaluation prompt

SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""


In [11]:
# Defining a structured enum class to capture the result

class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


In [12]:
def evaluate_summary(prompt, ai_response):
  # Evaluating the generated summary against the prompt used
  chat = client.chats.create(model = "gemini-2.0-flash")

  # Generating the full text response
  response = chat.send_message(
      message = SUMMARY_PROMPT.format(prompt = prompt, response = ai_response)
  )
  verbose_eval = response.text

  # Coercing into the desired structure
  structured_output_config = types.GenerateContentConfig(
      response_mime_type = "application/json",
      response_schema = SummaryRating,
  )

  response = chat.send_message(
      message = "Convert the final score",
      config = structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


In [13]:
text_eval, struct_eval = evaluate_summary(prompt = [request, document_file], ai_response = summary)
Markdown(text_eval)


## Evaluation

STEP 1: The response accurately summarized the training process, including the model architecture, training data, infrastructure, and key innovations. It is grounded in the provided document and offers a comprehensive overview.

STEP 2: The response followed the instructions, is grounded, is concise, and fluent.

Rating: 5
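One detail of the call above is worth checking when adapting this code: `str.format` converts each argument to text, so passing a list as `prompt` places the list's textual representation into the template. The sketch below illustrates this with a hypothetical `FakeFile` class standing in for an uploaded file handle; the real object's representation will differ, but it is likely to be metadata rather than the document's contents.

```python
TEMPLATE = "### Prompt\n{prompt}\n\n## AI-generated Response\n{response}\n"


class FakeFile:
    """Hypothetical stand-in for an uploaded file handle."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"FakeFile(name={self.name!r})"


filled = TEMPLATE.format(
    prompt=["Tell me about the training process used here.", FakeFile("files/abc123")],
    response="(summary text)",
)
print(filled)
```

If the evaluator is expected to judge groundedness against the document itself, it may be worth confirming that the document content actually reaches it, for example by inspecting the formatted prompt as done here.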


In this example, the model generated a textual justification within a chat context. This full text response is useful both for human interpretation and for giving the model a place to "collect notes" while it assesses the text and produces a final score. This "note taking" or "thinking" strategy typically works well with auto-regressive models, where the generated text is passed back into the model at each generation step. This means the working "notes" are used when generating the final result.

Next, the model converts the text output into a structured response. If we want to aggregate scores or use them programmatically, we want to avoid parsing the unstructured text output. Here, the `SummaryRating` schema is passed, so the model converts the chat history into an instance of the `SummaryRating` enum.
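As a rough illustration of why the structured step helps, the sketch below maps a JSON response onto the enum without any text parsing. It assumes the raw response for an enum schema is a JSON-encoded string such as `"5"`; `parse_rating` is a hypothetical helper, not part of the SDK, whose own parsing may work differently.

```python
import enum
import json


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def parse_rating(raw_json: str) -> SummaryRating:
    """Decode a JSON string literal and look up the enum member by value."""
    value = json.loads(raw_json)
    return SummaryRating(value)  # raises ValueError for values outside the rubric


# Assumed raw model output for an enum schema: a JSON-encoded string.
rating = parse_rating('"5"')
print(rating, int(rating.value))
```

Because the lookup raises `ValueError` for anything outside the rubric, an unexpected score fails loudly instead of being silently aggregated.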

In [14]:
struct_eval


<SummaryRating.VERY_GOOD: '5'>

### Making the summary prompt better or worse
Gemini models tend to be quite good at tasks like direct summarisation without much prompting, so you should expect to see a result like `GOOD` or `VERY_GOOD` on the previous task, even with a rudimentary prompt.

In [15]:
new_prompt = "ELI5 the training process."

if not new_prompt:
  raise ValueError("Try setting a new summarization prompt.")


In [16]:
def run_and_evaluate_summary(prompt):
  # Generating and evaluating the summary using the new prompt
  summary = summarise_doc(prompt)
  display(Markdown(summary + '\n-------'))

  text, struct = evaluate_summary([prompt, document_file], summary)
  display(Markdown(text + '\n-------'))
  print(struct)

run_and_evaluate_summary(new_prompt)


Okay, I'll explain the training process of a large language model (LLM) like Gemini 1.5 in a way that's easy to understand.

**Imagine you're teaching a dog a new trick.**

1.  **Gathering the Data (The "Treats"):**

    *   First, you need lots and lots of examples of what you want the dog to learn. For an LLM, this means collecting a massive amount of text and code. Think of it as the dog's "training treats." This data comes from all over the internet: websites, books, code repositories, etc.
    *   For a multimodal model like Gemini, you also need images, audio, and video data.

2.  **Building the Model (The "Dog"):**

    *   The LLM is like the dog's brain. It's a complex mathematical structure (a neural network) with millions or even billions of "knobs" (parameters) that can be adjusted.
    *   The architecture of the model is important. Gemini 1.5 uses a "mixture-of-experts" architecture, which is like having a team of specialized dogs, each good at a different part of the trick.

3.  **Training (The "Teaching"):**

    *   This is where the magic happens. You feed the model the training data, piece by piece.
    *   **Prediction:** The model tries to predict the next word in a sentence, or the answer to a question, or the next line of code.
    *   **Comparison:** You compare the model's prediction to the correct answer (the "ground truth").
    *   **Adjustment:** If the model is wrong, you adjust the "knobs" in its brain to make it more likely to get the answer right next time. This adjustment is done using a mathematical process called "gradient descent."
    *   **Repeat:** You repeat this process millions or billions of times, gradually making the model better and better at predicting the right answers.

4.  **Instruction Tuning (Polishing the Trick):**

    *   After the initial training, you fine-tune the model to make it better at following instructions. This is like giving the dog specific commands like "sit," "stay," or "fetch."
    *   You feed the model examples of instructions and the desired responses, and again adjust the "knobs" to make it better at following instructions.

5.  **Evaluation (Testing the Trick):**

    *   Once the model is trained, you need to test how well it performs. You feed it new data that it hasn't seen before and see how accurately it can predict the right answers.
    *   This helps you identify any weaknesses in the model and make further adjustments.

6.  **Safety Mitigations (Making Sure the Trick is Safe):**

    *   It's important to make sure the model doesn't generate harmful or biased content. This is like making sure the dog doesn't bite anyone while performing the trick.
    *   You use techniques like supervised fine-tuning and reinforcement learning to make the model safer.

**Key Concepts:**

*   **Tokens:** LLMs don't understand words like humans do. They break down text into smaller units called "tokens."
*   **Parameters:** The "knobs" in the model's brain that are adjusted during training. More parameters generally mean a more powerful model.
*   **Context Window:** The amount of text the model can "remember" when making a prediction. Gemini 1.5 has a very large context window, allowing it to process long documents and videos.
*   **Multimodal:** The ability to process different types of data, like text, images, audio, and video.

In essence, training an LLM is like teaching a dog a very complex trick by showing it lots of examples and gradually adjusting its behavior until it gets it right. The key is to have a massive amount of training data, a powerful model architecture, and effective training techniques.

-------

STEP 1: The response does a pretty good job of following instructions. The response does not include any information that is not present in the context. The response is very conversational and easy to read.
STEP 2: I would rate this as a 4. The response is good because it is grounded, concise, and fluent.
-------

SummaryRating.GOOD


## Evaluating in practice
Evaluation has many practical uses, for example:

*   We can quickly iterate on a prompt with a small set of test documents.
*   We can compare different models to find what works best for our needs, such as finding the trade-off between price and performance, or finding the best performance for a specific task.
*   When pushing changes to a model or prompt in a production system, we can verify that the system does not regress in quality.
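
The regression check in the last bullet can be sketched as a simple gate over evaluator scores. The score lists and the `tolerance` value below are hypothetical; in practice they would come from running the same evaluator over the same test set before and after a change.

```python
from statistics import mean


def has_regressed(baseline_scores, candidate_scores, tolerance=0.25):
    """Return True if the candidate's mean score drops by more than `tolerance`."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance


# Hypothetical 1-5 ratings from an evaluator over the same test set.
baseline = [5, 4, 5, 4]
candidate_ok = [5, 5, 4, 4]
candidate_bad = [3, 4, 3, 4]

print(has_regressed(baseline, candidate_ok))
print(has_regressed(baseline, candidate_bad))
```

A tolerance is used because evaluator scores are noisy; the right threshold depends on how many questions and repeated trials back each mean.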



### Pointwise evaluation
The technique used above, where we evaluated a single input/output pair against some criteria, is known as 'Pointwise Evaluation'. This is useful for evaluating singular outputs in an absolute sense, such as "Was it good or bad?"

In the following exercise, we will try different guidance prompts with a set of questions.


In [17]:
import functools


In [18]:
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."


In [19]:
guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}


In [20]:
questions = [
    # Evaluating more questions will take more time, but produces results
    # with higher confidence. In a production system, we may have hundreds
    # of questions to evaluate a complex system.

    # "What metric(s) are used to evaluate long context performance?",
    "How does the model perform on code tasks?",
    "How many layers does it have?",
    # "Why is it called Gemini?",
]


In [21]:
if not questions:
  raise NotImplementedError("Add some questions to evaluate.")


In [22]:
@functools.cache
def answer_question(question: str, guidance: str = '') -> str:
  # Generating an answer to the question using the uploaded document and guidance
  config = types.GenerateContentConfig(
      temperature = 0.0,
      system_instruction = guidance,
  )

  response = client.models.generate_content(
      model = 'gemini-2.0-flash',
      config = config,
      contents = [question, document_file],
  )

  return response.text


In [23]:
answer = answer_question(questions[0], terse_guidance)
Markdown(answer)


Gemini 1.5 Pro performs well on code tasks, surpassing Gemini 1.0 Ultra on Natural2Code and showing improvements in coding capabilities compared to previous Gemini models.


Now, let's set up a question-answering evaluator, much like before, but using the pointwise QA evaluation prompt.



In [24]:
import enum


In [25]:
QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated response.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the responses based on the criteria and rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""


In [26]:
class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


In [27]:
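@functools.cache
def evaluate_answer(prompt, ai_response, n = 1):
  # Evaluating the generated answer against the prompt/question used.
  # `n` is not used in the body; it only varies the cache key so that
  # repeated trials are not served from the cache.

  chat = client.chats.create(model = "gemini-2.0-flash")

  # Generating the full text response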
  response = chat.send_message(
      message = QA_PROMPT.format(prompt = [prompt, document_file], response = ai_response)
  )
  verbose_eval = response.text

  # Coercing into the desired structure
  structured_output_config = types.GenerateContentConfig(
      response_mime_type = "application/json",
      response_schema = AnswerRating,
  )

  response = chat.send_message(
      message = "Convert the final score",
      config = structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


In [28]:
text_eval, struct_eval = evaluate_answer(prompt = questions[0], ai_response = answer)
display(Markdown(text_eval))
print(struct_eval)


STEP 1: The response directly answers the question, "How does the model perform on code tasks?" It also references the context document provided.
STEP 2:
Instruction following: The response directly answers the question.
Groundedness: The response is grounded in the document provided.
Completeness: The response is complete.
Fluent: The response is well-organized and easy to read.
The response is very good.

Therefore, I choose a score of 5.

AnswerRating.VERY_GOOD


In [29]:
import collections
import itertools


In [30]:
# Number of times to repeat each task in order to reduce error and calculate an average. Increasing it will take longer but give better results.
NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)


In [31]:
for question in questions:
  display(Markdown(f' ## {question}'))
  for guidance, guide_prompt in guidance_options.items():
    for n in range(NUM_ITERATIONS):
      # Generating a response
      answer = answer_question(question, guide_prompt)

      # Evaluating the response. Note that the guidance prompt is not passed
      written_eval, struct_eval = evaluate_answer(question, answer, n)
      print(f' {guidance}: {struct_eval}')

      # Saving the numeric score
      scores[guidance] += int(struct_eval.value)

      # Saving the responses, in case we wish to inspect them
      responses[(guidance, question)].append((answer, written_eval))


 ## How does the model perform on code tasks?

 Terse: AnswerRating.VERY_GOOD
 Moderate: AnswerRating.VERY_GOOD
 Cited: AnswerRating.VERY_GOOD


 ## How many layers does it have?

 Terse: AnswerRating.VERY_GOOD
 Moderate: AnswerRating.VERY_BAD
 Cited: AnswerRating.VERY_GOOD


In [32]:
for guidance, score in scores.items():
  avg_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(avg_score)))
  print(f'{guidance}:{avg_score:.2f} - {nearest.name}')


Terse:5.00 - VERY_GOOD
Moderate:3.00 - OK
Cited:5.00 - VERY_GOOD
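
The averages above can hide disagreement: the "Moderate" guidance scored one `VERY_GOOD` and one `VERY_BAD`, which averages to `OK` even though no single answer was rated that way. A small sketch that reports the spread alongside the mean, using the per-question scores from the run shown above (the variable names are illustrative):

```python
from statistics import mean, pstdev

# Per-question scores taken from the run shown above (one iteration each).
scores_by_guidance = {
    'Terse': [5, 5],
    'Moderate': [5, 1],
    'Cited': [5, 5],
}

for guidance, values in scores_by_guidance.items():
    print(f'{guidance}: mean={mean(values):.2f}, spread={pstdev(values):.2f}')
```

A large spread is a hint to inspect the saved `responses` for that guidance before trusting the average.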


### Pairwise evaluation
The pointwise evaluation prompt used in the previous step has 5 levels of grading in the output. This may be too coarse for our system, or perhaps we may wish to improve on a prompt that is already "very good".
Another approach to evaluation is to compare two outputs against each other. This is known as 'Pairwise Evaluation'. Pairwise comparison is the key step in ranking and sorting algorithms, so we can use it to rank our prompts either instead of, or in addition to, the pointwise approach.

This step implements pairwise evaluation using the pairwise QA quality prompt from the Google Cloud docs.

In [33]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answers the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""


In [34]:
class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'


In [35]:
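@functools.cache
def evaluate_pairwise(prompt, response_a, response_b, n=1):
  # Determining the better of two answers to the same prompt.
  # `n` is not used in the body; it only varies the cache key so that
  # repeated trials are not served from the cache.

  chat = client.chats.create(model = "gemini-2.0-flash")

  # Generating the full text response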
  response = chat.send_message(
      message = QA_PAIRWISE_PROMPT.format(prompt = [prompt, document_file],
                                          baseline_model_response = response_a,
                                          response = response_b,
                                          )
  )
  verbose_eval = response.text

  # Coercing into the desired structure
  structured_output_config = types.GenerateContentConfig(
      response_mime_type = "application/json",
      response_schema = AnswerComparison,
  )

  response = chat.send_message(
      message = "Convert the final score.",
      config = structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


In [36]:
question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = evaluate_pairwise(
    prompt = question,
    response_a = answer_a,
    response_b = answer_b,
)

display(Markdown(text_eval))
print(struct_eval)


STEP 1: Analyze Response A based on the question answering quality criteria:
Response A is very short and does not give me all the information I need. It fulfills the requirements but could have been more complete. It is also fluent.

STEP 2: Analyze Response B based on the question answering quality criteria:
Response B is much more detailed and shows many aspects of how the model is performing. It fulfills the requirements and is fluent.

STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
Response B gives much more information than response A.

STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
B

STEP 5: Output your assessment reasoning in the explanation field.
Response B gives much more detailed information that would be helpful to a user. Response A is lacking in details.

AnswerComparison.B


With a pairwise evaluator in place, the only thing required to rank prompts against each other is a comparator.

This example implements the minimal comparators required for total ordering (`==` and `<`) and performs the comparison using `n_comparisons` evaluations over the set of `questions`.

In [38]:
@functools.total_ordering
class QAGuidancePrompt:
  # A question-answering guidance prompt or system instruction

  def __init__(self, prompt, questions, n_comparisons = NUM_ITERATIONS):
    # Create the prompt. Provide questions to evaluate against, and number of evals to perform
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    # Compare two prompts on all questions over n trials
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    # Compare two prompts on a question over n trials
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n = 1):
    # Compare two prompts on a single question
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = evaluate_pairwise(
        prompt = question,
        response_a = answer_a,
        response_b = answer_b,
        n = n, # Cache buster
    )

    # Convert the enum to the standard Python numeric comparison values
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    # Equality check that performs pairwise evaluation
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented
    return self._compare_all(other) == 0

  def __lt__(self, other):
    # Ordering check that performs pairwise evaluation
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented
    return self._compare_all(other) < 0


Now Python's sorting functions will "just work" on any `QAGuidancePrompt` instances. The `answer_question` and `evaluate_pairwise` functions are <a href="https://en.wikipedia.org/wiki/Memoization">memoized</a> to avoid unnecessarily regenerating the same answers or evaluations, so we should see this complete quickly unless we have changed the questions, prompts or number of iterations from the earlier steps.
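
The memoization and the "cache buster" argument can be seen in isolation with a small stand-in function. `expensive_eval` below is hypothetical and simply records how often its body runs. The last call relies on CPython's behaviour of keying positional and keyword arguments differently, which the `functools` documentation describes only as something that "may" happen.

```python
import functools

calls = []


@functools.cache
def expensive_eval(prompt, answer, n=1):
    # `n` does not affect the result; it only changes the cache key so that
    # repeated trials are not served from the cache.
    calls.append((prompt, answer, n))
    return f'evaluated {prompt!r}'


expensive_eval('q', 'a', 0)
expensive_eval('q', 'a', 0)    # same arguments: cache hit, body does not run
expensive_eval('q', 'a', 1)    # different n: body runs again
expensive_eval('q', 'a', n=0)  # keyword form: a separate cache entry in CPython
print(len(calls))
```

In this notebook that means a result cached from a keyword-style call may not be reused by a positional call with the same values, so calling the cached functions in one consistent style gives the most cache hits.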

In [39]:
terse_prompt = QAGuidancePrompt(terse_guidance, questions)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions)
cited_prompt = QAGuidancePrompt(cited_guidance, questions)

# Sorting in reverse order so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse = True)

for i, p in enumerate(sorted_results):
  if i:
    print('-----')

  print(f'#{i+1}: {p}')


#1: Answer the following question in a single sentence, or as close to that as possible.
-----
#2: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.
-----
#3: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
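
Sorting assumes the comparator is consistent, which a noisy LLM judge may not be. As an alternative to the sort-based approach used above (not what this notebook does), a round-robin tally compares every pair once and ranks by wins. The sketch below uses a hypothetical `stub_judge` with fixed preferences in place of a real pairwise evaluator.

```python
import itertools


def rank_round_robin(candidates, judge):
    """Rank candidates by pairwise wins; judge(a, b) returns 'A', 'B' or 'SAME'."""
    wins = {c: 0 for c in candidates}
    for a, b in itertools.combinations(candidates, 2):
        verdict = judge(a, b)
        if verdict == 'A':
            wins[a] += 1
        elif verdict == 'B':
            wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


# Hypothetical stand-in for an LLM judge, driven by a fixed quality table.
QUALITY = {'terse': 3, 'cited': 2, 'moderate': 1}


def stub_judge(a, b):
    if QUALITY[a] == QUALITY[b]:
        return 'SAME'
    return 'A' if QUALITY[a] > QUALITY[b] else 'B'


ranking = rank_round_robin(['moderate', 'terse', 'cited'], stub_judge)
print(ranking)
```

This costs one evaluation per pair, which is manageable for a handful of prompts but grows quadratically with the number of candidates.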


## Challenges

### LLM limitations
LLMs are known to have problems on certain tasks, and these challenges still persist when using LLMs as evaluators. For example, LLMs can struggle to count the number of characters in a word (this is a numerical problem, not a language problem), so an LLM evaluator will not be able to accurately evaluate this type of task. There are solutions available in some cases, such as connecting tools to handle problems unsuitable to a language model, but it's important that we understand possible limitations and include human evaluators to calibrate our evaluation system and determine a baseline.

One reason that LLM evaluators work well is that all of the information they need is available in the input context, so the model only needs to attend to that information to produce the result. When customising evaluation prompts, or building our own systems, we should keep this in mind and ensure that we are not relying on "internal knowledge" from the model, or behaviour that might be better provided from a tool.
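
Where a criterion is mechanical, such as the length constraints mentioned in the evaluation prompts or the character-counting example above, ordinary code can act as the "tool". The helpers below are a hypothetical sketch with a deliberately rough sentence splitter; the `answer` string is an illustrative example, not model output.

```python
import re


def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def count_sentences(text):
    """Rough count: split on ., ! or ? when followed by whitespace or end of text."""
    parts = re.split(r'[.!?]+(?:\s+|$)', text.strip())
    return sum(1 for part in parts if part)


def within_limits(text, max_words=None, max_sentences=None):
    """Check hard length constraints deterministically instead of asking a model."""
    if max_words is not None and count_words(text) > max_words:
        return False
    if max_sentences is not None and count_sentences(text) > max_sentences:
        return False
    return True


answer = ("Gemini 1.5 Pro performs well on code tasks. "
          "It surpasses Gemini 1.0 Ultra on Natural2Code.")
print(count_words(answer), count_sentences(answer))
print(within_limits(answer, max_sentences=1))
print('strawberry'.count('r'))
```

Checks like these can run alongside the LLM evaluator, leaving the model to judge the aspects that need language understanding.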

### Improving confidence
One way to improve the confidence of our evaluations is to include a diverse set of evaluators. That is, we should use the same prompts and outputs, but execute them on different models, like Gemini Flash and Pro, or even across different providers, like Gemini, Claude, ChatGPT and local models like Gemma or Qwen. This follows the same idea used earlier, where repeating trials to gather multiple "opinions" helps to reduce error, except by using different models the "opinions" will be more diverse.
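
A minimal sketch of combining such "opinions", assuming each evaluator model has already returned an `AnswerRating` for the same answer. The ratings listed are hypothetical and `aggregate` is an illustrative helper, not part of any SDK.

```python
import enum
from collections import Counter
from statistics import median


class AnswerRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def aggregate(ratings):
    """Combine ratings from several evaluators into a median, a majority and an agreement rate."""
    numeric = [int(r.value) for r in ratings]
    top_value, votes = Counter(numeric).most_common(1)[0]
    return {
        'median': median(numeric),
        'majority': AnswerRating(str(top_value)),
        'agreement': votes / len(numeric),
    }


# Hypothetical ratings from three different evaluator models for one answer.
ratings = [AnswerRating.VERY_GOOD, AnswerRating.GOOD, AnswerRating.VERY_GOOD]
summary_stats = aggregate(ratings)
print(summary_stats)
```

A low agreement rate is a useful signal in its own right: it marks answers where a human rater is most likely to be needed.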

