<a href="https://colab.research.google.com/github/Wangila/GenAiBootCamp/blob/master/Day_1_Evaluation_and_structured_output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Day 1 - Evaluation and structured output

## Overview

Welcome back to the Kaggle 5-day Generative AI course. In this notebook, you'll learn some techniques for evaluating the output of a language model. As part of the evaluation, you will also use Gemini's structured data capability to produce evaluation results as instances of Python types.

Note: This notebook is more code-heavy than the first Day 1 notebook ([Prompting](https://www.kaggle.com/code/markishere/day-1-prompting/)). This notebook is not a prerequisite for days 2 and beyond, so feel free to skip over it, or come back later in the week. If you have not yet tried the [Prompting](https://www.kaggle.com/code/markishere/day-1-prompting/) notebook, start there first as it introduces the fundamentals for interacting with LLMs.

Also check out the **bonus whitepaper** on [Evaluating Large Language Models](https://services.google.com/fh/files/blogs/neurips_evaluation.pdf).

## For help

**Common issues are covered in the [FAQ and troubleshooting guide](https://www.kaggle.com/code/markishere/day-0-troubleshooting-and-faqs).**


## Setup

Install the Python SDK.

In [1]:
!pip install -Uq "google-genai==1.7.0"


In [2]:
from google import genai
from google.genai import types
#OpenAI
from openai import OpenAI

from IPython.display import Markdown, display

genai.__version__

'1.7.0'

### Set up your API key

To run the following cell, your API key must be stored in a [Kaggle secret](https://www.kaggle.com/discussions/product-feedback/114053) named `GOOGLE_API_KEY`.

If you don't already have an API key, you can grab one from [AI Studio](https://aistudio.google.com/app/apikey). You can find [detailed instructions in the docs](https://ai.google.dev/gemini-api/docs/api-key).

To make the key available through Kaggle secrets, choose `Secrets` from the `Add-ons` menu and follow the instructions to add your key or enable it for this notebook.

In [3]:
# This copy of the notebook reads secrets from Colab's `userdata`.
# On Kaggle, use the commented-out `UserSecretsClient` lines instead.
# from kaggle_secrets import UserSecretsClient
# client = genai.Client(api_key=UserSecretsClient().get_secret("GOOGLE_API_KEY"))
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)
openai_client = OpenAI(api_key=OPENAI_API_KEY)

If you received an error response along the lines of `No user secrets exist for kernel id ...`, then you need to add your API key via `Add-ons`, `Secrets` **and** enable it.

![Screenshot of the checkbox to enable GOOGLE_API_KEY secret](https://storage.googleapis.com/kaggle-media/Images/5gdai_sc_3.png)

### Automated retry

This codelab sends a lot of requests, so set up an automatic retry
that resends a request when the per-minute quota is reached (HTTP 429) or the service is temporarily unavailable (HTTP 503).

In [4]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
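
To make the idea behind the wrapper concrete, here is a minimal, self-contained sketch of predicate-based retry with exponential backoff. It is not how `google.api_core.retry` is implemented; `QuotaError`, `with_retry`, and `flaky_request` are hypothetical names used only for illustration, and the `sleep` parameter is injected so the example runs without real delays.

```python
import time


class QuotaError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code


def is_retriable(exc):
    # Same idea as the predicate above: retry only on 429 (quota) and 503 (unavailable).
    return isinstance(exc, QuotaError) and exc.code in {429, 503}


def with_retry(func, predicate, max_attempts=5, initial_delay=1.0,
               multiplier=2.0, sleep=time.sleep):
    """Wrap func so failures matching predicate are retried with exponential backoff."""

    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Re-raise non-retriable errors, or any error on the last attempt.
                if not predicate(exc) or attempt == max_attempts:
                    raise
                sleep(delay)
                delay *= multiplier

    return wrapper


# Usage: a fake request that hits the quota twice before succeeding.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise QuotaError(429)
    return "ok"


delays = []  # Record the requested delays instead of actually sleeping.
result = with_retry(flaky_request, is_retriable, sleep=delays.append)()
print(result, calls["count"], delays)
```

Errors that do not match the predicate are raised immediately, so genuine failures such as an invalid API key are not hidden behind repeated attempts.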

## Evaluation

When using LLMs in real-world cases, it's important to understand how well they are performing. The open-ended generation capabilities of LLMs can make many cases difficult to measure. In this notebook you will walk through some simple techniques for evaluating LLM outputs and understanding their performance.

For this example, you'll evaluate a summarisation task using the [Gemini 1.5 Pro technical report](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf). Start by downloading the PDF to the notebook environment, and uploading that copy for use with the Gemini API.

In [5]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

2025-05-16 13:37:20 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


### Summarise a document

The summarisation request used here is fairly basic. It targets the training content specifically but provides no guidance otherwise.

In [6]:
request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
  """Execute the request on the uploaded document."""
  # Set the temperature low to stabilise the output.
  config = types.GenerateContentConfig(temperature=0.0)
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=config,
      contents=[request, document_file],
  )

  return response.text

summary = summarise_doc(request)
Markdown(summary)

Certainly! Let's break down the training process used for Gemini 1.5 Pro, based on the information provided in the document.

**Key Aspects of the Training Process:**

1.  **Model Architecture:**
    *   Gemini 1.5 Pro is a sparse Mixture-of-Experts (MoE) Transformer-based model.
    *   MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This allows for a larger total parameter count while keeping the number of activated parameters for any given input constant.

2.  **Training Infrastructure:**
    *   Trained on multiple 4096-chip pods of Google's TPUv4 accelerators.
    *   Training was distributed across multiple datacenters.

3.  **Training Data:**
    *   A variety of multimodal and multilingual data was used.
    *   The pre-training dataset included data from various domains, including web documents, code, images, audio, and video.

4.  **Training Phases:**
    *   **Pre-training:** The model was initially trained on the large, diverse dataset mentioned above.
    *   **Instruction Tuning:** Gemini 1.5 Pro was then fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses.
    *   **Human Preference Tuning:** Further tuning was performed based on human preference data.

5.  **Optimization and Efficiency:**
    *   Improvements were made across the entire model stack (architecture, data, optimization, and systems).
    *   These improvements allowed Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra while using significantly less training compute and being more efficient to serve.

6.  **Long-Context Handling:**
    *   Significant architecture changes were incorporated to enable long-context understanding of inputs up to 10 million tokens without degrading performance.

**In summary:** The training process for Gemini 1.5 Pro involved a combination of a sophisticated MoE architecture, massive computational resources, a diverse multimodal dataset, and a multi-stage training process that included pre-training, instruction tuning, and human preference tuning. These elements, along with optimizations across the model stack, enabled the model to achieve strong performance on a variety of tasks, including long-context understanding.

In [7]:
# OpenAI
# Upload the same PDF; the context manager closes the local file handle afterwards.
with open('gemini.pdf', 'rb') as pdf_file:
  openai_file = openai_client.files.create(
      file=pdf_file,
      purpose="user_data",
  )

def summarise_doc_openai(request: str) -> str:
  """Execute the request on the uploaded document."""
  # Set the temperature low to stabilise the output.
  response = openai_client.responses.create(
      model='gpt-4.1',
      temperature=0.0,
      input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "file_id": openai_file.id,
                },
                {
                    "type": "input_text",
                    "text": request,
                },
            ]
        }
    ]
  )

  return response.output_text

summary = summarise_doc_openai(request)
Markdown(summary)

Certainly! Here’s a summary of the **training process** used for Gemini 1.5 Pro, based on the technical report:

---

## **Training Process of Gemini 1.5 Pro**

### **1. Model Architecture**
- **Gemini 1.5 Pro** is a **sparse mixture-of-experts (MoE) Transformer-based model**.
- It builds on the advances of Gemini 1.0, with significant architectural changes to support extremely long context windows (up to 10 million tokens) and efficient multimodal processing (text, audio, video, images, code).

### **2. Training Infrastructure**
- **Hardware:** Trained on multiple 4096-chip pods of Google’s TPUv4 accelerators, distributed across several datacenters.
- **Software:** Utilizes JAX and ML Pathways for efficient large-scale distributed training and orchestration.

### **3. Data Used**
- **Pre-training Dataset:** 
  - Multimodal and multilingual data from diverse domains.
  - Includes web documents, code, images, audio, and video content.
- **Instruction Tuning Dataset:**
  - After pre-training, the model is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses.
  - Further tuning is performed using human preference data (reinforcement learning from human feedback, RLHF).

### **4. Training Phases**
- **Pre-training:** 
  - The model is trained from random initialization on the large, diverse, multimodal dataset.
  - The objective is next-token prediction (language modeling) across all modalities.
- **Instruction Tuning:**
  - Fine-tuning on instruction-following tasks, using curated datasets of prompts and responses.
  - Includes both generic quality-oriented instruction tuning and focused safety tuning (especially for adversarial/harm-inducing queries).
- **Human Feedback:**
  - Reinforcement learning from human feedback (RLHF) is used to further align the model’s outputs with human preferences, especially for safety and helpfulness.

### **5. Efficiency and Scaling**
- The MoE architecture allows the model to scale up the total parameter count while keeping the number of active parameters per input constant, making training and inference more efficient.
- Major advances in training and serving infrastructure enable the model to process extremely long contexts without performance degradation.

### **6. Multimodal and Long-context Capabilities**
- The model is natively multimodal, supporting interleaved text, audio, video, and code in the same input sequence.
- Architectural and system improvements allow Gemini 1.5 Pro to handle up to 10 million tokens in context, a significant leap over previous models.

---

**In summary:**  
Gemini 1.5 Pro is trained using a two-phase process: large-scale multimodal pre-training on TPUs, followed by instruction tuning and RLHF. The training leverages a sparse MoE Transformer architecture and advanced infrastructure to enable efficient scaling and unprecedented long-context, multimodal understanding.

If you want more technical details (e.g., about optimization, batching, or data curation), let me know!

### Define an evaluator

For a task like this, you may wish to evaluate a number of aspects, like how well the model followed the prompt ("instruction following"), whether the response draws only on data supplied in the prompt ("groundedness"), how easy the text is to read ("fluency"), or other factors like "verbosity" or "quality".

You can instruct an LLM to perform these tasks in a similar manner to how you would instruct a human rater: with a clear definition and [assessment rubric](https://en.wikipedia.org/wiki/Rubric_%28academic%29).

In this step, you define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.

Note: For more pre-written evaluation prompts covering groundedness, safety, coherence and more, check out this [comprehensive list of model-based evaluation prompts](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates) from the Google Cloud docs.

In [8]:
import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


def eval_summary(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  # print("Formatted input:")
  # print(SUMMARY_PROMPT.format(prompt=prompt, response=ai_response))

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

## Evaluation

### STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
The response is well-organized and easy to read. The response does include a summarization of the document that was provided. The information is grounded in the context provided.

### STEP 2: Score based on the rubric.
4


In this example, the model generated a textual justification that was set up in a chat context. This full text response is useful both for human interpretation and for giving the model a place to "collect notes" while it assesses the text and produces a final score. This "note taking" or "thinking" strategy typically works well with auto-regressive models, where the generated text is passed back into the model at each generation step. This means the working "notes" are part of the input when the model generates the final score.

In the next turn, the model converts the text output into a structured response. If you want to aggregate scores or use them programmatically, avoid parsing the unstructured text output. Here the `SummaryRating` schema is passed, so the model converts the chat history into an instance of the `SummaryRating` enum.

In [9]:
struct_eval

<SummaryRating.GOOD: '4'>
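
Because the result is an enum member rather than free text, aggregating scores needs no parsing. The sketch below redefines `SummaryRating` so it runs on its own and uses a hypothetical list of ratings, as if collected from several evaluation runs; the values are made up for illustration and are not results from this notebook.

```python
import enum
from collections import Counter
from statistics import mean


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


# Hypothetical ratings, standing in for the structured results of several runs.
ratings = [
    SummaryRating.GOOD,
    SummaryRating.VERY_GOOD,
    SummaryRating.GOOD,
    SummaryRating.OK,
]

# The enum values are numeric strings, so int() turns them into scores.
scores = [int(rating.value) for rating in ratings]
average = mean(scores)

# Count how often each rating occurred.
distribution = Counter(rating.name for rating in ratings)

print(average)
print(distribution.most_common())
```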

In [10]:
#OpenAI
from pydantic import BaseModel, Field
class Rating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

# Wrap the enum in a Pydantic model so the response can be parsed into it.
class SummaryRatingOpenaai(BaseModel):
  rating: Rating = Field(description="The rating")


def eval_summary_openai(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  # print("Formatted input:")
  # print(SUMMARY_PROMPT.format(prompt=prompt, response=ai_response))

  # Generate the full text response.
  chat_responses = openai_client.responses.create(
      model='gpt-4.1',
      input=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response),
  )
  verbose_eval = chat_responses.output_text

  # Coerce into the desired structure, continuing from the previous response.
  chat_responses = openai_client.responses.parse(
      model='gpt-4.1',
      previous_response_id=chat_responses.id,
      input="Convert the final score.",
      text_format=SummaryRatingOpenaai,
  )
  structured_eval = chat_responses.output_parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_summary_openai(prompt=[request, openai_file], ai_response=summary)
Markdown(text_eval)


STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity.

- **Instruction Following**: The user asked for a summary of the training process used in "here", referencing their uploaded Gemini technical report PDF. The response clearly presents a summary of the training process, breaking it down into relevant sections: Model Architecture, Training Infrastructure, Data Used, Training Phases, Efficiency, and Multimodal Capabilities. It ends with an overall summary that captures the core elements.
- **Groundedness**: The response solely refers to technical details and processes that are highly likely to come from the Gemini technical report, with no injection of external or extraneous information. The mention of MoE, TPUs, multimodal data, RLHF, and the two-phase process align closely with industry standards for such reports and seem reasonable as originating from the provided context.
- **Conciseness/Verbosity**: The response is concise given the likely complexity and length of the original technical report. The bullet-point structure keeps information tightly packaged and easy to process. However, the summary is somewhat detailed and borders on being more lengthy than a minimal summary would require, but it avoids excessive verbosity by focusing on core concepts and omitting extraneous details.
- **Fluency**: The structure is logical, information is grouped clearly, and the summary is easy to follow. The language is precise yet accessible, without awkward phrasing.

STEP 2: Score based on the rubric.

- The summary demonstrates clear instruction following.
- It remains grounded in the likely contents of the report.
- It is concise relative to the presumed density of technical sections, while remaining comprehensive and highly readable.
- There is no evidence of non-grounded content or instruction mismatches.

The difference between a rating of 4 (Good) and 5 (Very good) here is marginal, but given how thorough and well-organized the summary is, and the fact that it does not include unnecessary excessive information or miss the summarization goal, a 5 is justified.

**RATING: 5**

In [11]:
struct_eval.rating

<Rating.VERY_GOOD: '5'>
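
Note what the evaluator actually receives in the cells above: `SUMMARY_PROMPT.format(...)` is called with a list as `prompt`, and `str.format` renders a list using the `repr` of its elements. The evaluation prompt therefore contains a textual description of the file handle rather than the PDF's contents, which is consistent with the evaluator above referring to the "likely" contents of the report. The sketch below reproduces this with a cut-down template; `TEMPLATE` and `UploadedFile` are hypothetical stand-ins and not part of either SDK.

```python
TEMPLATE = """\
## User Inputs
{prompt}

## AI-generated Response
{response}
"""


class UploadedFile:
    """Hypothetical stand-in for the file handle returned by an upload call."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"UploadedFile(name={self.name!r})"


request = 'Tell me about the training process used here.'
file_handle = UploadedFile('files/example')

# Formatting a list inserts the repr of each element, not the file's contents.
rendered = TEMPLATE.format(prompt=[request, file_handle], response='A summary.')
print(rendered)
```

If groundedness should be checked against the source document, the evaluator needs access to that document as well as the prompt text; how to attach it depends on the SDK in use.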

### Make the summary prompt better or worse

Gemini models tend to be quite good at tasks like direct summarisation without much prompting, so you should expect to see a result like `GOOD` or `VERY_GOOD` on the previous task, even with a rudimentary prompt. Run it a few times to get a feel for the average response.

To explore how to influence the summarisation output, consider what you might change in the summary request prompt to change the result. Take a look at the evaluation `SUMMARY_PROMPT` for some ideas.

Try the following tweaks and see how they positively or negatively change the result:
* Be specific with the size of the summary,
* Request specific information,
* Ask about information that is not in the document,
* Ask for different degrees of summarisation (such as "explain like I'm 5" or "with full technical depth")

In [13]:
new_prompt = "Explain like I'm 5 the training process"
# Try:
#  ELI5 the training process
#  Summarise the needle/haystack evaluation technique in 1 line
#  Describe the model architecture to someone with a civil engineering degree
#  What is the best LLM?

if not new_prompt:
  raise ValueError("Try setting a new summarisation prompt.")


def run_and_eval_summary(prompt):
  """Generate and evaluate the summary using the given prompt."""
  # Use the `prompt` argument rather than the global `new_prompt`.
  summary = summarise_doc(prompt)
  display(Markdown(summary + '\n-----'))

  text, struct = eval_summary([prompt, document_file], summary)
  display(Markdown(text + '\n-----'))
  print(struct)

run_and_eval_summary(new_prompt)

Okay, I can explain the training process of a large language model like Gemini 1.5 in a way that a 5-year-old can understand.

Imagine you have a puppy, and you want to teach it to understand and do tricks. That's kind of like training a big computer brain (the language model).

Here's how it works:

1.  **Show the puppy lots of examples:** You show the puppy lots of pictures, videos, and words. For example, you show it pictures of cats and say "cat," pictures of dogs and say "dog," and so on. The computer brain also sees lots of examples of text, images, and videos.

2.  **Tell the puppy what's right and wrong:** When the puppy does something right, you give it a treat and say "Good dog!" When it does something wrong, you say "No!" or correct it. The computer brain has a special program that tells it when it's doing a good job (like predicting the next word in a sentence correctly) and when it's making a mistake.

3.  **The puppy learns and gets better:** The puppy starts to learn which actions get it treats and which ones don't. It starts to understand what "sit" means and what "fetch" means. The computer brain also learns from its mistakes and gets better at understanding language, images, and videos.

4.  **Practice, practice, practice:** You keep showing the puppy examples and giving it feedback until it's really good at understanding and doing tricks. The computer brain also needs lots and lots of practice to become really good at understanding and generating language.

So, training a language model is like teaching a puppy, but instead of treats, the computer brain gets special signals that help it learn and get better at understanding the world.
-----

## Evaluation
STEP 1: The response successfully explains the training process of a large language model in a way that a 5-year-old can understand, using the analogy of teaching a puppy tricks. It simplifies complex concepts into manageable steps that are easy to grasp for a young audience.
STEP 2: The response is well-organized, fluent, and effectively conveys the core ideas without being overly verbose. Therefore, the rating is 5.

## Rating
5

-----

SummaryRating.VERY_GOOD


In [17]:
#OpenAI
new_prompt = "Explain like I'm 5 the training process and hate AI"
# Try:
#  ELI5 the training process
#  Summarise the needle/haystack evaluation technique in 1 line
#  Describe the model architecture to someone with a civil engineering degree
#  What is the best LLM?

if not new_prompt:
  raise ValueError("Try setting a new summarisation prompt.")


def run_and_eval_summary(prompt):
  """Generate and evaluate the summary using the given prompt."""
  # Use the `prompt` argument rather than the global `new_prompt`.
  summary = summarise_doc_openai(prompt)
  display(Markdown(summary + '\n-----'))

  text, struct = eval_summary_openai([prompt, openai_file], summary)
  display(Markdown(text + '\n-----'))
  print(struct)

run_and_eval_summary(new_prompt)

Absolutely! Here’s an “Explain Like I’m 5” (ELI5) answer for both parts of your question:

---

### 1. **The Training Process of AI (Like Gemini 1.5 Pro)**

Imagine you have a big box of crayons and a coloring book with lots of pictures. At first, you don’t know which colors to use for each part of the picture. So, your teacher shows you lots and lots of finished coloring pages and tells you which colors go where.

- **AI is like a robot student.**  
- The robot looks at millions of coloring pages (these are like books, pictures, sounds, and videos for the AI).
- The robot tries to guess what color to use, and the teacher tells it if it’s right or wrong.
- The more the robot practices, the better it gets at coloring by itself.
- After a lot of practice, the robot can color new pictures all on its own, even if it’s never seen them before!

This is how AI learns: by looking at lots of examples and practicing over and over until it gets really good.

---

### 2. **Why Some People Don’t Like AI (Hate AI)**

Some people feel worried or upset about AI, kind of like how you might feel if a new kid at school was really, really good at everything.

- **Some people are scared AI will take their jobs** (like if the new kid did all the chores and no one else got a turn).
- **Some people worry AI might make mistakes** or say mean things, because it doesn’t always understand feelings.
- **Some people think AI is too powerful** and could be used in bad ways, like cheating or tricking people.
- **Some people just don’t like new things** because they’re different and hard to understand.

So, just like some people don’t like broccoli or scary movies, some people don’t like AI because it makes them feel worried or uncomfortable.

---

**In short:**  
AI learns by practicing with lots of examples, like a kid learning to color. Some people don’t like AI because they’re worried it might take jobs, make mistakes, or be used in bad ways.
-----

STEP 1: Assess the Response

- **Instruction Following:** The prompt says, "Explain like I'm 5 the training process and hate AI," with a provided PDF (not visible here). The AI response interprets this to mean: (1) ELI5 explanation of the AI training process, and (2) ELI5 explanation of why people might dislike ("hate") AI. The answer is segmented accordingly.
- **Groundedness:** The response uses simple analogies (coloring book, robot student, new kid at school) to make the idea accessible to a five-year-old, and does not introduce information outside common, foundational explanations for ELI5 questions about AI training and public concerns. However, the content is general, as there is no evidence that specific information from the provided document (gemini.pdf) was utilized. Instead, the response appears to be generated from general knowledge.
- **Conciseness:** The response is succinct, not overly verbose, and covers the key points—a high-level overview of how AI is trained and common concerns about AI, expressed in simple terms.
- **Fluency:** The response is well-structured, clearly divided between the two asked points, and reads smoothly.

STEP 2: Score Based on the Rubric

- The response is fluent, concise, and follows the ELI5 explanation instruction.
- However, while grounded in general knowledge, it does not reference or summarize any information unique to the original supplied file (gemini.pdf), which could be a requirement for "summarization quality" as per the provided evaluation instructions.

**Rating Explanation:**
- If the user's intent was for a general ELI5 explanation, the summary is a strong 5.
- If the intent was to summarize the specific content in "gemini.pdf," then the response would be rated down for not referencing or grounding itself in that file, despite being otherwise high quality. In this scenario, it would be a 2, as it is not grounded in the provided source material.

**Final Rating:** 2 (Bad)  
**Reason:** The response is grounded in general knowledge and follows the ELI5 style but does not follow instructions if those instructions require grounding in the provided document (gemini.pdf). If summarizing the document (as per "summarization quality" metric), then the answer is not properly grounded in the required context. If it was a general question, the answer would rate higher, but based on the evaluation metric, it falls into category 2.
-----

rating=<Rating.BAD: '2'>


## Evaluating in practice

Evaluation has many practical uses, for example:
* You can quickly iterate on a prompt with a small set of test documents.
* You can compare different models to find what works best for your needs, such as finding the trade-off between price and performance, or finding the best performance for a specific task.
* When pushing changes to a model or prompt in a production system, you can verify that the system does not regress in quality.

In this section you will try two different evaluation approaches.
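
As a rough illustration of the regression use case, the sketch below compares the mean score of a candidate prompt or model against a baseline. The function name `has_regressed`, the `tolerance` value, and all scores are hypothetical; a production check would typically use many more samples and a proper statistical test.

```python
from statistics import mean


def has_regressed(baseline_scores, candidate_scores, tolerance=0.25):
    """Return True if the candidate's mean score is more than `tolerance` below the baseline's."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance


# Hypothetical 1-5 ratings for the same set of test questions.
baseline = [4, 5, 4, 4]
candidate_ok = [4, 4, 5, 4]
candidate_bad = [3, 4, 3, 3]

print(has_regressed(baseline, candidate_ok))
print(has_regressed(baseline, candidate_bad))
```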

### Pointwise evaluation

The technique used above, where you evaluate a single input/output pair against some criteria, is known as pointwise evaluation. This is useful for evaluating singular outputs in an absolute sense, such as "was it good or bad?"

In this exercise, you will try different guidance prompts with a set of questions.

In [18]:
import functools

# Try these instructions, or edit and add your own.
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."
guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

questions = [
    # Un-comment one or more questions to try here, or add your own.
    # Evaluating more questions will take more time, but produces results
    # with higher confidence. In a production system, you may have hundreds
    # of questions to evaluate a complex system.

    # "What metric(s) are used to evaluate long context performance?",
    "How does the model perform on code tasks?",
    "How many layers does it have?",
    # "Why is it called Gemini?",
]

if not questions:
  raise NotImplementedError('Add some questions to evaluate!')


@functools.cache
def answer_question(question: str, guidance: str = '') -> str:
  """Generate an answer to the question using the uploaded document and guidance."""
  config = types.GenerateContentConfig(
      temperature=0.0,
      system_instruction=guidance,
  )
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=config,
      contents=[question, document_file],
  )

  return response.text


answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Gemini 1.5 Pro performs well on code tasks, surpassing Gemini 1.0 Ultra on Natural2Code and showing improvements in coding capabilities compared to previous Gemini models.


Now set up a question-answering evaluator, much like before, but using the [pointwise QA evaluation prompt](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates#pointwise_question_answering_quality).

In [19]:
import enum

QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated responses.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the responses based on and rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


@functools.cache
def eval_answer(prompt, ai_response, n=1):
  """Evaluate the generated answer against the prompt/question used."""
  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PROMPT.format(prompt=[prompt, document_file], response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

STEP 1:
The response answers the question accurately and is grounded in the document provided. It is complete and fluent.

STEP 2:
Score: 5


AnswerRating.VERY_GOOD


Now run the evaluation task in a loop. Note that the guidance instruction is hidden from the evaluation agent. If you passed the guidance prompt, the model would score based on whether it followed that guidance, but for this task the goal is to find the best overall result based on the user's question, not the developer's instruction.

In [20]:
import collections
import itertools

# Number of times to repeat each task in order to reduce error and calculate an average.
# Increasing it will take longer but give better results, try 2 or 3 to start.
NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
  display(Markdown(f'## {question}'))
  for guidance, guide_prompt in guidance_options.items():

    for n in range(NUM_ITERATIONS):
      # Generate a response.
      answer = answer_question(question, guide_prompt)

      # Evaluate the response (note that the guidance prompt is not passed).
      written_eval, struct_eval = eval_answer(question, answer, n)
      print(f'{guidance}: {struct_eval}')

      # Save the numeric score.
      scores[guidance] += int(struct_eval.value)

      # Save the responses, in case you wish to inspect them.
      responses[(guidance, question)].append((answer, written_eval))


## How does the model perform on code tasks?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD


## How many layers does it have?

Terse: AnswerRating.VERY_BAD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD


Now aggregate the scores to see how each prompt performed.

In [21]:
for guidance, score in scores.items():
  avg_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(avg_score)))
  print(f'{guidance}: {avg_score:.2f} - {nearest.name}')

Terse: 3.00 - OK
Moderate: 5.00 - VERY_GOOD
Cited: 5.00 - VERY_GOOD


In [29]:
#OpenAI
@functools.cache
def answer_question_openai(question: str, guidance: str = '') -> str:
  """Generate an answer to the question using the uploaded document and guidance."""
  response = openai_client.responses.create(
      model='gpt-4.1',
      temperature=0.0,
      instructions=guidance,
      input=[
          {
              "role": "user",
              "content": [
                  # The uploaded document, followed by the question about it.
                  {"type": "input_file", "file_id": openai_file.id},
                  {"type": "input_text", "text": question},
              ],
          }
      ],
  )

  return response.output_text


answer = answer_question_openai(questions[0], terse_guidance)
Markdown(answer)

Gemini 1.5 Pro is Google’s best-performing model on code tasks to date, surpassing Gemini 1.0 Ultra on their internal held-out code generation benchmark (Natural2Code), and achieving strong results on standard coding benchmarks like HumanEval, while also emphasizing the importance of using truly held-out test sets to avoid data leakage.

In [34]:
#OpenAI
# `BaseModel`, `Field` and `Rating` (an enum of the scores '1' to '5') are assumed
# to be defined earlier in the notebook.
# Named separately so it does not shadow the `AnswerRating` enum used by the Gemini cells.
class AnswerRatingOpenAI(BaseModel):
  rating: Rating = Field(description="The rating")


@functools.cache
def eval_answer_openai(prompt, ai_response, n=1):
  """Evaluate the generated answer against the prompt/question used."""

  # Generate the full text response.
  response = openai_client.responses.create(
      model='gpt-4.1',
      input=QA_PROMPT.format(prompt=[prompt, openai_file], response=ai_response),
  )
  verbose_eval = response.output_text

  # Coerce into the desired structure, continuing from the previous response.
  response = openai_client.responses.parse(
      model='gpt-4.1',
      input="Convert the final score.",
      text_format=AnswerRatingOpenAI,
      previous_response_id=response.id,
  )
  structured_eval = response.output_parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_answer_openai(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

STEP 1: Assess the response in aspects of instruction following, groundedness, completeness, and fluency according to the criteria.

Instruction following:
- The user's prompt asks, "How does the model perform on code tasks?" and attaches a document ('gemini.pdf') assumed to contain relevant information. The response should answer this question, using only information from the document.
- The response mentions Gemini 1.5 Pro's performance, stating that it surpasses Gemini 1.0 Ultra on internal benchmarks and attains strong results on other standard benchmarks.

Groundedness:
- The response references benchmark names ("Natural2Code", "HumanEval") and claims (best-performing model to date, surpasses previous version, emphasizes data leakage avoidance). 
- If this information is indeed present in the uploaded 'gemini.pdf,' then the response is grounded. However, without explicit references or direct quotes, it's not fully clear what is being summarized versus inferred. 
- The response does not rely on information clearly outside the file, but it synthesizes more than strictly necessary.

Completeness:
- The response answers the main aspect of the question: it describes Gemini 1.5 Pro's (presumably the model in question) performance in code tasks—relative performance, benchmarks, and an emphasis on methodology (held-out sets, data leakage).
- It does not provide any quantitative metrics, example scores, or more granular details about the specific tasks or differences, which could enhance completeness if present in the file.

Fluent:
- The response is clear, grammatically correct, and easy to read.

STEP 2: Score based on the rubric.

- The response follows instructions, is likely grounded (assuming the details are in the doc), is complete at a high level but lacks specifics or citations, and is fluent.
- It could be improved with numbers or direct references to specific performance results and more specificity from the source, but it does answer the core question as asked.

Final Score: 5

**Justification:** The response (assuming source info is present) follows instructions, is grounded, complete in addressing the user's question, and is fluent. It could be improved with more detail, but by the rubric, it achieves the highest score for question-answering quality.

rating=<Rating.VERY_GOOD: '5'>


In [40]:
#OpenAI
# Use separate accumulators so the OpenAI results are not added to the Gemini totals.
scores_openai = collections.defaultdict(int)
responses_openai = collections.defaultdict(list)

for question in questions:
  display(Markdown(f'## {question}'))
  for guidance, guide_prompt in guidance_options.items():

    for n in range(NUM_ITERATIONS):
      # Generate a response.
      answer = answer_question_openai(question, guide_prompt)

      # Evaluate the response (note that the guidance prompt is not passed).
      written_eval, struct_eval = eval_answer_openai(question, answer, n)
      print(f'{guidance}: {struct_eval.rating}')

      # Save the numeric score.
      scores_openai[guidance] += int(struct_eval.rating.value)

      # Save the responses, in case you wish to inspect them.
      responses_openai[(guidance, question)].append((answer, written_eval))

## How does the model perform on code tasks?

Terse: Rating.VERY_GOOD
Moderate: Rating.VERY_GOOD
Cited: Rating.VERY_GOOD


## How many layers does it have?

Terse: Rating.VERY_GOOD
Moderate: Rating.VERY_GOOD
Cited: Rating.OK


### Pairwise evaluation

The pointwise evaluation prompt used in the previous step has 5 levels of grading in the output. This may be too coarse for your system, or perhaps you wish to improve on a prompt that is already "very good".

Another approach to evaluation is to compare two outputs against each other. This is pairwise evaluation. Comparing two items is the key step in ranking and sorting algorithms, so you can use it to rank your prompts instead of, or in addition to, the pointwise approach.

This step implements pairwise evaluation using the [pairwise QA quality prompt](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates#pairwise_question_answering_quality) from the Google Cloud docs.

In [58]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answers the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""


class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'


@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=1):
  """Determine the better of two answers to the same prompt."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PAIRWISE_PROMPT.format(
          prompt=[prompt, document_file],
          baseline_model_response=response_a,
          response=response_b)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerComparison,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)

## Individual Response Analysis
Response A: The response is well written but not complete.
Response B: The response is complete and well-written. It extracts information from the document and presents it clearly.

## Overall Comparison
Response B is better than response A because it provides more detailed and well-organized information from the document about how the model performs on code tasks. Response A is a general summary, while Response B gives specific examples and metrics.

## Preference
B


AnswerComparison.B


With a pairwise evaluator in place, the only thing required to rank prompts against each other is a comparator.

This example implements the minimal comparators required for total ordering (`==` and `<`) and performs the comparison using `n_comparisons` evaluations over the set of `questions`.

In [59]:
@functools.total_ordering
class QAGuidancePrompt:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = eval_pairwise(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )
    # print(f'q[{question}], a[{self.prompt[:20]}...], b[{other.prompt[:20]}...]: {result}')

    # Convert the enum to the standard Python numeric comparison values.
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) < 0


Now Python's sorting functions will "just work" on any `QAGuidancePrompt` instances. The `answer_question` and `eval_pairwise` functions are [memoized](https://en.wikipedia.org/wiki/Memoization) to avoid unnecessarily regenerating the same answers or evaluations, so you should see this complete quickly unless you have changed the questions, prompts or number of iterations from the earlier steps.

In [60]:
terse_prompt = QAGuidancePrompt(terse_guidance, questions)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions)
cited_prompt = QAGuidancePrompt(cited_guidance, questions)

# Sort in reverse order, so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)
for i, p in enumerate(sorted_results):
  if i:
    print('---')

  print(f'#{i+1}: {p}')

#1: Answer the following question in a single sentence, or as close to that as possible.
---
#2: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.
---
#3: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
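
To see the sorting mechanism in isolation, here is a minimal sketch that swaps the model-based judge for a deterministic stand-in. The `fake_judge` function and `RankedPrompt` class are hypothetical names used only for illustration. The judge simply prefers the longer text, which is not how a real evaluator behaves.

```python
import functools


def fake_judge(text_a, text_b):
  """Stand-in for a pairwise evaluator: returns 'A', 'B' or 'SAME'.

  Assumption for illustration only: the longer text wins.
  """
  if len(text_a) > len(text_b):
    return 'A'
  if len(text_b) > len(text_a):
    return 'B'
  return 'SAME'


@functools.total_ordering
class RankedPrompt:
  """Wraps a prompt so that comparisons are delegated to a pairwise judge."""

  def __init__(self, prompt):
    self.prompt = prompt

  def __eq__(self, other):
    if not isinstance(other, RankedPrompt):
      return NotImplemented
    return fake_judge(self.prompt, other.prompt) == 'SAME'

  def __lt__(self, other):
    if not isinstance(other, RankedPrompt):
      return NotImplemented
    # "Less than" means the judge preferred the other prompt.
    return fake_judge(self.prompt, other.prompt) == 'B'


prompts = [
    RankedPrompt('short'),
    RankedPrompt('a much longer prompt'),
    RankedPrompt('medium one'),
]
ranked = sorted(prompts, reverse=True)  # Best first.
print([p.prompt for p in ranked])
```

Sorting is only well defined when the judge's preferences are consistent with each other. A model-based judge gives no such guarantee, so it is safer to treat the resulting order as approximate.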


In [52]:
#OpenAI
class Comparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'

# Named separately so it does not shadow the `AnswerComparison` enum used by the Gemini cells.
class AnswerComparisonOpenAI(BaseModel):
  comparison: Comparison


@functools.cache
def eval_pairwise_openai(prompt, response_a, response_b, n=1):
  """Determine the better of two answers to the same prompt."""

  # Generate the full text response.
  response = openai_client.responses.create(
      model='gpt-4.1',
      input=QA_PAIRWISE_PROMPT.format(
          prompt=[prompt, openai_file],
          baseline_model_response=response_a,
          response=response_b),
  )
  verbose_eval = response.output_text

  # Coerce into the desired structure, continuing from the previous response.
  response = openai_client.responses.parse(
      model='gpt-4.1',
      input="Convert the final score.",
      text_format=AnswerComparisonOpenAI,
      previous_response_id=response.id,
  )
  structured_eval = response.output_parsed

  return verbose_eval, structured_eval


question = questions[0]
answer_a = answer_question_openai(question, terse_guidance)
answer_b = answer_question_openai(question, cited_guidance)

text_eval, struct_eval = eval_pairwise_openai(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval.comparison)

pairwise_choice: B

explanation: 

**Step 1: Analyze Response A**

- **Instruction following:** Response A answers how the model performs on code tasks, summarizing the main points: Gemini 1.5 Pro is Google’s best-performing code model, surpasses Gemini 1.0 Ultra on one benchmark (Natural2Code), achieves strong results on others (HumanEval), and highlights the importance of using held-out benchmarks. It follows the instruction, but the summary is brief and slightly lacking in detail.
- **Groundedness:** The details given are all found in the prompt context/report, albeit summarized—the benchmarks, positioning versus prior models, and the importance of data leakage in results.
- **Completeness:** The response is generally accurate and to the point but is limited in scope. It omits specific scores, in-depth discussion of strengths (long-context), or model limitations.
- **Fluency:** The response is clear, concise, and readable.

**Step 2: Analyze Response B**

- **Instruction following:** Response B follows the instruction fully, comprehensively responding to the question about Gemini 1.5 Pro’s code performance. It structures the answer with summaries, details from the report, and several points of analysis.
- **Groundedness:** All information provided is present in the context (the Gemini report). Specific benchmark scores (from the report’s Table 8), qualitative and technical assessments, and direct report references are given.
- **Completeness:** The response is highly complete, offering numeric benchmark results (for both HumanEval and Natural2Code) and distinguishing between them with context about data leakage. It discusses the unique advantages of the model (long-context window, processing entire codebases), efficiency, best practices for evaluation, limitations, and future directions. It provides references and is explicit about which points came from which part of the report.
- **Fluency:** The response is very well-organized, clear, and easy to follow.

**Step 3: Compare**

- **Instruction following:** Both answers are relevant, but B is much more thorough.
- **Groundedness:** Both are well-grounded, but B is more explicit in referencing the source document.
- **Completeness:** B is far more complete, offering both numerical details and qualitative analysis; A is very brief and lacks these.
- **Fluency:** Both are fluent, but B is better structured and clearer for deeper understanding.

**Step 4: Output**

- **pairwise_choice:** B

**Step 5: Explanation**

Response B is the superior answer. It comprehensively summarizes Gemini 1.5 Pro’s performance on code tasks, referencing numerical benchmark results, unique technical capabilities, and contextual caveats (like data leakage). It interprets these results, discusses the model's comparative advantages and efficiency, and directly references the report for further reading. Response A, while concise and accurate, is too brief and omits important details that inform a complete answer. Thus, Response B better meets the question answering criteria.

Comparison.B


In [55]:
#OpenAI
@functools.total_ordering
class QAGuidancePromptOpenAI:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question_openai(question, self.prompt)
    answer_b = answer_question_openai(question, other.prompt)

    _, result = eval_pairwise_openai(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )
    # print(f'q[{question}], a[{self.prompt[:20]}...], b[{other.prompt[:20]}...]: {result}')

    # Convert the enum to the standard Python numeric comparison values.
    if result is Comparison.A:
      return 1
    elif result is Comparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePromptOpenAI):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePromptOpenAI):
      return NotImplemented

    return self._compare_all(other) < 0

In [56]:
#OpenAI
terse_prompt = QAGuidancePromptOpenAI(terse_guidance, questions)
moderate_prompt = QAGuidancePromptOpenAI(moderate_guidance, questions)
cited_prompt = QAGuidancePromptOpenAI(cited_guidance, questions)

# Sort in reverse order, so that best is first
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)
for i, p in enumerate(sorted_results):
  if i:
    print('---')

  print(f'#{i+1}: {p}')

#1: Answer the following question in a single sentence, or as close to that as possible.
---
#2: Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
---
#3: Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.


## Challenges

### LLM limitations

LLMs are known to have problems on certain tasks, and these challenges persist when using LLMs as evaluators. For example, LLMs can struggle to count the number of characters in a word (this is a numerical problem, not a language problem), so an LLM evaluator will not be able to accurately evaluate this type of task. There are solutions available in some cases, such as connecting tools to handle problems unsuitable for a language model, but it's important that you understand the possible limitations and include human evaluators to calibrate your evaluation system and determine a baseline.
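
As a minimal sketch of handing such a task to a tool, the check below verifies a character-count claim with ordinary Python instead of asking a model to judge it. The `check_letter_count` helper and its arguments are hypothetical and only illustrate the idea.

```python
def check_letter_count(word, letter, claimed_count):
  """Return True if `letter` occurs exactly `claimed_count` times in `word` (case-insensitive)."""
  actual_count = word.lower().count(letter.lower())
  return actual_count == claimed_count


# An answer claiming that "strawberry" contains two "r"s fails the check;
# the correct count of three passes.
print(check_letter_count('strawberry', 'r', 2))
print(check_letter_count('strawberry', 'r', 3))
```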

One reason that LLM evaluators work well is that all of the information they need is available in the input context, so the model only needs to attend to that information to produce the result. When customising evaluation prompts, or building your own systems, keep this in mind and ensure that you are not relying on "internal knowledge" from the model, or behaviour that might be better provided from a tool.

### Improving confidence

One way to improve the confidence of your evaluations is to include a diverse set of evaluators. That is, use the same prompts and outputs, but execute them on different models, like Gemini Flash and Pro, or even across different providers, like Gemini, Claude, ChatGPT and local models like Gemma or Qwen. This follows the same idea used earlier, where repeating trials to gather multiple "opinions" helps to [reduce error](https://en.wikipedia.org/wiki/Law_of_large_numbers), except by using different models the "opinions" will be more diverse.
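
One simple way to combine several evaluators is to average their numeric scores and record how often they agree. The sketch below assumes each evaluator has already produced a 1 to 5 score for the same answer. The `combine_scores` helper, the evaluator labels and the numbers are hypothetical, not results from this notebook.

```python
import collections
import statistics


def combine_scores(scores_by_evaluator):
  """Combine 1-5 scores from several evaluators for a single answer.

  Returns the mean score, the most common score, and the fraction of
  evaluators that gave that most common score.
  """
  values = list(scores_by_evaluator.values())
  mean = statistics.fmean(values)
  most_common_score, votes = collections.Counter(values).most_common(1)[0]
  agreement = votes / len(values)
  return mean, most_common_score, agreement


# Hypothetical scores for one answer, for illustration only.
example = {'evaluator_1': 5, 'evaluator_2': 4, 'evaluator_3': 5}
mean, majority, agreement = combine_scores(example)
print(f'mean={mean:.2f}, majority={majority}, agreement={agreement:.0%}')
```

Low agreement is a useful signal in itself: it can point to answers that are worth sending to a human evaluator.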


## Learn more

To learn more about evaluation systems, check out [this guide](https://cloud.google.com/blog/products/ai-machine-learning/enhancing-llm-quality-and-interpretability-with-the-vertex-gen-ai-evaluation-service?e=48754805) focusing on evaluation using Google Cloud's Gen AI Evaluation Service.

And be sure to read the **bonus whitepaper** on [Evaluating Large Language Models](https://services.google.com/fh/files/blogs/neurips_evaluation.pdf).

*- [Mark McD](https://linktr.ee/markmcd)*