# Day 1b - Evaluation and Structured Output

This tutorial builds on the prompting techniques from `day-1a-prompting.py` and focuses on evaluating LLM outputs and using structured output formats.

## Learning Objectives

By the end of this tutorial, you will be able to:
- Evaluate LLM outputs using pointwise and pairwise evaluation methods
- Use structured output formats for programmatic evaluation
- Understand best practices and limitations of LLM evaluation
- Build evaluation systems for real-world applications

## Prerequisites

Before starting, make sure you have:
- Completed `day-1a-prompting.py` (recommended but not required)
- Obtained a Gemini API key from [AI Studio](https://aistudio.google.com/app/api-keys)
- Installed the required dependencies listed in `pyproject.toml` via `uv sync`

## Setup

### Import the SDK and Helpers

In [None]:
from google import genai
from google.genai import types
from IPython.display import HTML, Markdown, display
from google.api_core import retry

### Set Up Retry Helper

The retry helper lets you run all cells without worrying about per-minute quota limits.
It automatically retries requests that fail due to rate limiting (429) or service unavailability (503).

In [None]:
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

### Initialize the Client

The Gemini API uses a `Client` object to make requests.
The client handles authentication and lets you control which backend to use (Gemini API or Vertex AI).

In [None]:
import google.colab.userdata

api_key = google.colab.userdata.get('GEMINI_API_KEY')
client = genai.Client(api_key=api_key)

**Note:** Use the code below if you decide to run this tutorial locally. We highly recommend using Google Colab.

In [None]:
# Initialize the client with your API key
# Replace with your actual API key or use environment variable
# import os

# api_key = os.getenv("GOOGLE_API_KEY")
# if not api_key:
#     raise ValueError("Please set GOOGLE_API_KEY environment variable. See SETUP.md for instructions.")

# client = genai.Client(api_key=api_key)

## Part 1: Evaluation and Structured Output

When using LLMs in real-world applications, it's important to understand how well they are performing.
The open-ended generation capabilities of LLMs can make evaluation challenging.
This section covers techniques for evaluating LLM outputs.

### Document Summarization Example

For this evaluation example, we'll use a document summarization task. First, let's download a sample document and upload it to the Gemini API.

In [None]:
import urllib.request

# Download a sample PDF (you can replace this with your own document)
pdf_url = "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf"
pdf_path = "gemini_technical_report.pdf"

urllib.request.urlretrieve(pdf_url, pdf_path)

document_file = client.files.upload(file=pdf_path)

### Summarize a Document

In [None]:
request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
    """Execute the request on the uploaded document."""
    # Set the temperature low to stabilize the output
    config = types.GenerateContentConfig(temperature=0.0)
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        config=config,
        contents=[request, document_file],
    )
    
    return response.text

summary = summarise_doc(request)
Markdown(summary)

### Define an Evaluator

For evaluation tasks, you may wish to assess various aspects:
- **Instruction following**: How well the model followed the prompt
- **Groundedness**: Whether the response contains only information from the provided context
- **Fluency**: How easy the text is to read
- **Conciseness**: Whether the response is appropriately brief
- **Quality**: Overall quality of the response

You can instruct an LLM to perform these evaluations similar to how you would instruct a human rater: with a clear definition and assessment rubric.

In [None]:
import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses 
generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate 
the quality of the response based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. 
Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize 
text. Pay special attention to length constraints, such as in X words or in Y sentences. 
The instruction for performing a summarization task and the context to be summarized are 
provided in the user prompt. The response should be shorter than the text in the context. 
The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization 
task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. 
The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without 
a significant loss of key information, and without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise 
and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, 
conciseness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

# Define a structured enum class to capture the result
class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

def eval_summary(prompt, ai_response):
    """Evaluate the generated summary against the prompt used."""
    
    chat = client.chats.create(model='gemini-2.5-flash')
    
    # Generate the full text response
    response = chat.send_message(
        # Attach the document itself so the evaluator can check groundedness;
        # formatting the File object into the prompt would only embed its repr.
        message=[
            SUMMARY_PROMPT.format(prompt=prompt, response=ai_response),
            document_file,
        ]
    )
    verbose_eval = response.text
    
    # Coerce into the desired structure
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=SummaryRating,
    )
    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )
    structured_eval = response.parsed
    
    return verbose_eval, structured_eval

text_eval, struct_eval = eval_summary(prompt=request, ai_response=summary)
Markdown(text_eval)

In [None]:
struct_eval

In this example, the model first generates a textual justification in a chat context. This full-text response is useful for human interpretation and gives the model a place to "collect notes" while assessing the text. Because those notes remain in the chat history, the model can draw on them when it produces the final result.

In the next turn, the model converts the text output into a structured response. If you want to aggregate scores or use them programmatically, you want to avoid parsing unstructured text. Here the `SummaryRating` schema is passed, so the model converts the chat history into an instance of the `SummaryRating` enum.
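
To illustrate why a structured result is easier to work with than free text, here is a minimal offline sketch. `DemoRating` and `demo_ratings` are hypothetical stand-ins (a local copy of the rating enum and hand-written values, not model output), so the block runs without calling the API.

```python
import collections
import enum


class DemoRating(enum.Enum):
    """Hypothetical stand-in that mirrors the rating enum defined above."""
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


# Enum members can be looked up by value, so a raw string maps to a member.
assert DemoRating('4') is DemoRating.GOOD

# Hypothetical ratings standing in for several structured results.
demo_ratings = [DemoRating.GOOD, DemoRating.VERY_GOOD, DemoRating.GOOD, DemoRating.BAD]

# Tally ratings and flag low scores without parsing any free text.
tally = collections.Counter(rating.name for rating in demo_ratings)
low_scores = [rating for rating in demo_ratings if int(rating.value) <= 2]

print(dict(tally))
print(f'{len(low_scores)} rating(s) need review')
```

The same pattern should apply to any enum returned in `response.parsed`, since the values are plain strings that convert with `int()`.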

### Pointwise Evaluation

The technique used above, where you evaluate a single input/output pair against some criteria, is known as **pointwise evaluation**. This is useful for evaluating individual outputs in an absolute sense, such as "was it good or bad?"

In this exercise, you'll try different guidance prompts with a set of questions:

In [None]:
import functools

# Try these instructions, or edit and add your own
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."

guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

questions = [
    "How does the model perform on code tasks?",
    "How many layers does it have?",
]

@functools.cache
def answer_question(question: str, guidance: str = '') -> str:
    """Generate an answer to the question using the uploaded document and guidance."""
    config = types.GenerateContentConfig(
        temperature=0.0,
        system_instruction=guidance,
    )
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        config=config,
        contents=[question, document_file],
    )
    
    return response.text

answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Now set up a question-answering evaluator:

In [None]:
QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality 
of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated response.
You should first read the user prompt carefully for analyzing the task, 
and then evaluate the quality of the response based on the criteria and rules 
provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall 
quality of the answer to the question in the user prompt. 
Pay special attention to length constraints, such as in X words or in Y sentences. 
The instruction for performing a question-answering task is provided in the user prompt. 
The response should not contain information that is not present in the context 
(if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the 
Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question 
answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context 
if the context is present in the user prompt. 
The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question 
partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, 
is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, 
groundedness, completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

class AnswerRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

@functools.cache
def eval_answer(prompt, ai_response, n=1):
    """Evaluate the generated answer against the prompt/question used."""
    chat = client.chats.create(model='gemini-2.5-flash')
    
    # Generate the full text response
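    response = chat.send_message(
        # Attach the document itself; formatting the File object into the
        # prompt text would only embed its repr, not its contents.
        message=[
            QA_PROMPT.format(prompt=prompt, response=ai_response),
            document_file,
        ]
    )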
    verbose_eval = response.text
    
    # Coerce into the desired structure
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=AnswerRating,
    )
    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )
    structured_eval = response.parsed
    
    return verbose_eval, structured_eval

text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

Now run the evaluation task in a loop. Note that the guidance instruction is hidden from the evaluation agent. If you passed the guidance prompt, the model would score based on whether it followed that guidance, but for this task the goal is to find the best overall result based on the user's question.

In [None]:
import collections

# Number of times to repeat each task to reduce error and calculate an average
NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
    display(Markdown(f'## {question}'))
    for guidance, guide_prompt in guidance_options.items():
        
        for n in range(NUM_ITERATIONS):
            # Generate a response
            answer = answer_question(question, guide_prompt)
            
            # Evaluate the response (note that the guidance prompt is not passed)
            written_eval, struct_eval = eval_answer(question, answer, n)
            print(f'{guidance}: {struct_eval}')
            
            # Save the numeric score
            scores[guidance] += int(struct_eval.value)
            
            # Save the responses, in case you wish to inspect them
            responses[(guidance, question)].append((answer, written_eval))

In [None]:
# Aggregate the scores
for guidance, score in scores.items():
    avg_score = score / (NUM_ITERATIONS * len(questions))
    nearest = AnswerRating(str(round(avg_score)))
    print(f'{guidance}: {avg_score:.2f} - {nearest.name}')
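
When `NUM_ITERATIONS` is greater than 1, it can help to keep every individual score rather than only a running total, so you can see how much the ratings vary between runs. The sketch below uses hypothetical scores (`example_scores` is made up for illustration, not produced by the model) and reports the mean and population standard deviation for each guidance option.

```python
import collections
import statistics

# Hypothetical per-run scores. In the loop above you could append
# int(struct_eval.value) to a list instead of adding to a running total.
example_scores = collections.defaultdict(list)
for name, value in [
    ('Terse', 4), ('Terse', 5), ('Terse', 4),
    ('Cited', 3), ('Cited', 5), ('Cited', 4),
]:
    example_scores[name].append(value)

# Mean and spread for each guidance option
summary_stats = {
    name: (statistics.mean(values), statistics.pstdev(values))
    for name, values in example_scores.items()
}

for name, (avg, spread) in summary_stats.items():
    print(f'{name}: mean={avg:.2f}, stdev={spread:.2f}')
```

A large spread for one option suggests that its average is less trustworthy and that more iterations may be needed.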

### Pairwise Evaluation

The pointwise evaluation prompt used in the previous step has 5 levels of grading. This may be too coarse for your system, or perhaps you wish to improve on a prompt that is already "very good."

Another approach is to compare two outputs against each other. This is **pairwise evaluation**. Because comparison is the key step in ranking and sorting algorithms, you can use pairwise evaluation to rank your prompts, either instead of or in addition to the pointwise approach.

In [None]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses 
generated by two AI models. We will provide you with the user input and a pair of 
AI-generated responses (Response A and Response B). You should first read the user 
input carefully for analyzing the task, and then evaluate the quality of the responses 
based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. 
Then you will give step-by-step explanations for your judgment, compare results 
to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality 
of the answer to the question in the user prompt. Pay special attention to length 
constraints, such as in X words or in Y sentences. The instruction for performing 
a question-answering task is provided in the user prompt. The response should not 
contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question 
answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if 
the context is present in the user prompt. The response does not reference any 
outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answer the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: 
Determine how well Response A fulfills the user requirements, is grounded 
in the context, and is complete and fluent, and provide an assessment according to the criteria.
STEP 2: Analyze Response B based on the question answering quality criteria: 
Determine how well Response B fulfills the user requirements, is grounded in 
the context, and is complete and fluent, and provide an assessment according to the criteria.
STEP 3: Compare the overall performance of Response A and Response B based 
on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice 
field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

## AI-generated Responses

### Response A
{baseline_model_response}

### Response B
{response}
"""

class AnswerComparison(enum.Enum):
    A = 'A'
    SAME = 'SAME'
    B = 'B'

@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=1):
    """Determine the better of two answers to the same prompt."""
    
    chat = client.chats.create(model='gemini-2.5-flash')
    
    # Generate the full text response
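    response = chat.send_message(
        # Attach the document itself; formatting the File object into the
        # prompt text would only embed its repr, not its contents.
        message=[
            QA_PAIRWISE_PROMPT.format(
                prompt=prompt,
                baseline_model_response=response_a,
                response=response_b),
            document_file,
        ]
    )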
    verbose_eval = response.text
    
    # Coerce into the desired structure
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=AnswerComparison,
    )
    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )
    structured_eval = response.parsed
    
    return verbose_eval, structured_eval

question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)
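
A pairwise verdict such as the one above can drive a sort, which is one way to rank more than two candidates. The sketch below is a minimal offline illustration: `stub_judge` and its hidden quality table are hypothetical placeholders for a real call such as `eval_pairwise`, and `compare` adapts the "A"/"SAME"/"B" verdict to the form that Python's sorting expects.

```python
import functools

# Hypothetical quality scores that stand in for an LLM judge. In practice the
# verdict would come from a call such as eval_pairwise(...).
_HIDDEN_QUALITY = {'terse answer': 1, 'moderate answer': 3, 'cited answer': 2}


def stub_judge(response_a: str, response_b: str) -> str:
    """Return 'A', 'SAME' or 'B', mimicking a pairwise verdict."""
    quality_a = _HIDDEN_QUALITY[response_a]
    quality_b = _HIDDEN_QUALITY[response_b]
    if quality_a == quality_b:
        return 'SAME'
    return 'A' if quality_a > quality_b else 'B'


def compare(response_a: str, response_b: str) -> int:
    """Map a verdict to the positive/zero/negative value used by sorting."""
    verdict = stub_judge(response_a, response_b)
    return {'A': 1, 'SAME': 0, 'B': -1}[verdict]


candidates = ['terse answer', 'moderate answer', 'cited answer']

# reverse=True puts the preferred candidate first
ranked = sorted(candidates, key=functools.cmp_to_key(compare), reverse=True)
print(ranked)
```

A real LLM judge is not guaranteed to be consistent (it may prefer A over B, B over C, and C over A), so a ranking produced this way is best treated as approximate.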

## Part 2: Best Practices and Limitations

### LLM Limitations

LLMs are known to have problems on certain tasks, and these challenges still persist when using LLMs as evaluators. For example:
- LLMs can struggle with numerical problems (like counting characters in a word)
- They may not accurately evaluate tasks that require precise measurements
- They can be biased based on their training data

There are solutions available in some cases, such as connecting tools to handle problems unsuitable to a language model, but it's important that you understand possible limitations and include human evaluators to calibrate your evaluation system and determine a baseline.
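
As one illustration of moving a task from the model to a tool: the evaluation prompts in this tutorial ask the judge to watch for length constraints, which involves counting. The sketch below checks such constraints deterministically in code. The function names are hypothetical, and the word and sentence splitting is a deliberate simplification rather than a full tokenizer.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated words (a simplification)."""
    return len(text.split())


def count_sentences(text: str) -> int:
    """Roughly count sentences by splitting on ., ! or ? at a word boundary."""
    parts = re.split(r'[.!?]+(?:\s+|$)', text.strip())
    return len([part for part in parts if part])


def meets_length_limit(text, max_words=None, max_sentences=None) -> bool:
    """Return True if the text respects every limit that was supplied."""
    if max_words is not None and count_words(text) > max_words:
        return False
    if max_sentences is not None and count_sentences(text) > max_sentences:
        return False
    return True


example_answer = 'The model does well. It scores highly on code!'
print(count_words(example_answer), count_sentences(example_answer))
print(meets_length_limit(example_answer, max_sentences=1))
```

A check like this could run alongside the LLM evaluator, leaving the model to judge qualities such as fluency that are harder to measure in code.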

One reason that LLM evaluators work well is that all of the information they need is available in the input context, so the model only needs to attend to that information to produce the result. When customizing evaluation prompts, or building your own systems, keep this in mind and ensure that you are not relying on "internal knowledge" from the model, or on behavior that would be better provided by a tool.

### Improving Confidence

One way to improve the confidence of your evaluations is to include a diverse set of evaluators. That is, use the same prompts and outputs, but run them on different models, such as Gemini Flash and Pro, or even across different providers. This follows the same idea as repeating trials earlier: gathering multiple "opinions" helps to reduce error, and using different models makes those "opinions" more diverse.
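
Once several evaluators have weighed in, their results need to be combined. The sketch below shows one possible approach under stated assumptions: a majority vote for pairwise verdicts and a median for pointwise scores. The judge names and values are hypothetical, and treating a tied vote as "SAME" is a design choice made for this example, not a rule from the API.

```python
import collections
import statistics

# Hypothetical results from three different evaluator models
judge_verdicts = {'judge-1': 'A', 'judge-2': 'B', 'judge-3': 'A'}
judge_scores = {'judge-1': 4, 'judge-2': 5, 'judge-3': 4}


def majority_verdict(verdicts: dict) -> str:
    """Return the most common verdict, or 'SAME' when the top verdicts tie."""
    counts = collections.Counter(verdicts.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 'SAME'
    return counts[0][0]


combined_verdict = majority_verdict(judge_verdicts)
combined_score = statistics.median(judge_scores.values())

print(f'Pairwise winner: {combined_verdict}')
print(f'Median pointwise score: {combined_score}')
```

The median is used here because it is less sensitive than the mean to a single evaluator that disagrees strongly with the others.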

## Summary

In this tutorial, you've learned:

1. **Evaluation Methods**: How to evaluate LLM outputs using pointwise and pairwise evaluation
2. **Structured Output**: How to use structured output formats (enums) for programmatic evaluation
3. **Evaluation Prompts**: How to design effective evaluation prompts with clear criteria and rubrics
4. **Best Practices**: How to account for evaluator limitations and improve evaluation confidence

## Next Steps

- Review `day-1a-prompting.py` if you haven't already, to understand the prompting techniques that generate the outputs you're evaluating
- Explore the [Gemini API documentation](https://ai.google.dev/gemini-api/docs) for more advanced features
- Check out the [Gemini API cookbook](https://github.com/google-gemini/cookbook) for more examples
- Try building your own evaluation system for a specific use case

## References

- [Gemini API Documentation](https://ai.google.dev/gemini-api/docs)
- [Gemini API Prompting Strategies](https://ai.google.dev/gemini-api/docs/prompting-strategies)
- [Gemini API Models Overview](https://ai.google.dev/gemini-api/docs/models/gemini)
- [Gemini API Cookbook](https://github.com/google-gemini/cookbook)