Setup

In [2]:
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display

In [3]:
from dotenv import load_dotenv
import os

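load_dotenv()
# genai.Client() looks for the upper-case GOOGLE_API_KEY; accept either spelling from .env.
api_key = os.getenv('GOOGLE_API_KEY') or os.getenv('google_api_key')
if not api_key:
    raise RuntimeError('No API key found; set GOOGLE_API_KEY in your .env file.')
os.environ['GOOGLE_API_KEY'] = api_key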

Automated retry

In [4]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
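
The cell above hands a predicate to `google.api_core.retry.Retry`, which re-invokes the wrapped method only when the predicate accepts the raised error. To make that pattern concrete, here is a minimal standard-library sketch of predicate-driven retry with exponential backoff. It is not the library's implementation: `FakeAPIError`, `call_with_retry`, `flaky_call` and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for genai.errors.APIError, carrying an HTTP-style code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(e):
    # Same idea as the notebook's predicate: retry only on 429 and 503.
    return isinstance(e, FakeAPIError) and e.code in {429, 503}


def call_with_retry(fn, predicate, max_attempts=5, initial_delay=0.01, multiplier=2.0):
    """Call fn(), retrying with exponential backoff while predicate(error) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            # Give up on non-retriable errors or once attempts are exhausted.
            if not predicate(e) or attempt == max_attempts:
                raise
            time.sleep(delay)
            delay *= multiplier


# Usage: a hypothetical call that fails twice with 429, then succeeds.
attempts = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeAPIError(429)
    return "ok"


result = call_with_retry(flaky_call, is_retriable)
```

Errors the predicate rejects (for example a 400) are re-raised immediately, so genuine request bugs still surface on the first attempt.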


Evaluation

When using LLMs in real-world cases, it's important to understand how well they are performing. Because LLM output is open-ended, its quality is often hard to measure directly. In this notebook you will walk through some simple techniques for evaluating LLM outputs and understanding their performance.



In [12]:
!curl -o gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  1 7059k    1 83606    0     0   297k      0  0:00:23 --:--:--  0:00:23  299k
 53 7059k   53 3802k    0     0  2985k      0  0:00:02  0:00:01  0:00:01 2987k
100 7059k  100 7059k    0     0  3309k      0  0:00:02  0:00:02 --:--:-- 3312k


In [16]:
import PyPDF2

client = genai.Client()

pdf_path = 'gemini.pdf'
# Open the PDF file
pdf_reader = PyPDF2.PdfReader(pdf_path)
# Get the number of pages
num_pages = len(pdf_reader.pages)
print(f"The PDF has {num_pages} pages")

# Extract text from first page as an example
first_page = pdf_reader.pages[0]
text = first_page.extract_text()

# Extract text from multiple pages (adjust range as needed)
full_text = ""
for i in range(min(5, num_pages)):  # First 5 pages or all if less than 5
    page = pdf_reader.pages[i]
    full_text += page.extract_text() + "\n\n"
    
# Limit text length to avoid token limits
max_length = 10000
if len(full_text) > max_length:
    full_text = full_text[:max_length] + "..."

print(f"Extracted {len(full_text)} characters of text")

The PDF has 77 pages
Extracted 10003 characters of text
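
The reported length is 10003 because the text is cut at 10000 characters and then `"..."` is appended. A hard character cut can also split a word in half. As one possible refinement, the sketch below backs up to the last whitespace before appending the suffix. `truncate_text` and the sample string are hypothetical and not part of the notebook.

```python
def truncate_text(text, max_length=10000, suffix="..."):
    """Cut text to at most max_length characters (plus suffix), preferring a whitespace boundary."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    # Back up to the last whitespace so a word is not split in half.
    last_space = max(cut.rfind(" "), cut.rfind("\n"))
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + suffix


sample = "alpha beta gamma delta"
short = truncate_text(sample, max_length=12)
print(short)  # alpha beta...
```

Note that this limits characters, not tokens; the character budget is only a rough proxy for the model's token limit.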


Summarise a document

In [17]:
request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
  """Execute the request on the text extracted from the PDF."""
  # Set the temperature low to stabilise the output.
  config = types.GenerateContentConfig(temperature=0.0)
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=config,
      contents=[request, full_text],
  )

  return response.text

summary = summarise_doc(request)
Markdown(summary)


Based on the provided text, here's a breakdown of the Gemini 1.5 Pro training process, focusing on what's explicitly mentioned:

**Key Aspects of the Training Process:**

*   **Model Architecture:**
    *   **Mixture-of-Experts (MoE):** Gemini 1.5 Pro is a sparse Mixture-of-Experts (MoE) model. This means it uses a learned routing function to direct inputs to a subset of the model's parameters for processing. This allows the model to have a large total parameter count while keeping the number of parameters activated for any given input constant, improving efficiency.
    *   **Transformer-based:** It builds upon the Transformer architecture, which is a standard in modern language models.
*   **Data:**
    *   **Multimodal:** The model is trained on multimodal data, including text, video, and audio.
    *   **Long Context:** The training data includes very long sequences, up to 10 million tokens. This is a key factor in enabling the model's long-context understanding capabilities.
*   **Optimization:**
    *   The text mentions "a host of improvements made across nearly the entire model stack (architecture, data, optimization and systems)". This suggests that various optimization techniques were employed to improve the model's performance and efficiency.
*   **Compute Efficiency:**
    *   A major goal was to achieve comparable quality to Gemini 1.0 Ultra while using significantly less training compute and being more efficient to serve.
*   **Long-Context Understanding:**
    *   The model incorporates architectural changes specifically designed to enable long-context understanding of inputs up to 10 million tokens without degrading performance.
*   **Comparison to Previous Models:**
    *   The model builds on the research advances and multimodal capabilities of Gemini 1.0.
    *   It also draws upon a longer history of MoE research at Google and language model research in the broader literature.

**In summary, the training process for Gemini 1.5 Pro involved:**

1.  **Leveraging a Mixture-of-Experts (MoE) Transformer architecture.**
2.  **Training on a diverse multimodal dataset with extremely long contexts (up to 10 million tokens).**
3.  **Employing various optimization techniques to improve performance and efficiency.**
4.  **Incorporating architectural changes to specifically enhance long-context understanding.**
5.  **Aiming for comparable quality to Gemini 1.0 Ultra with significantly less training compute.**

The text emphasizes the importance of the MoE architecture and the long-context training data in achieving the model's impressive capabilities.


Define an evaluator

For a task like this, you may wish to evaluate a number of aspects, like how well the model followed the prompt ("instruction following"), whether it relied only on relevant data from the prompt ("groundedness"), how easy the text is to read ("fluency"), or other factors like "verbosity" or "quality".



In [18]:
import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the response based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information, and is neither too verbose nor too terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [19]:
# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


def eval_summary(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_summary(prompt=[request, full_text], ai_response=summary)
Markdown(text_eval)

## Evaluation
STEP 1:
The response successfully follows the instruction by providing information regarding the training process of Gemini 1.5 Pro, based on the given document. It effectively summarizes the key aspects of the training process. The response is grounded and only uses information present in the context. The response is quite verbose and could be more concise by avoiding repetition and unnecessary introductory phrases.

STEP 2:
Rating: 4 (Good). The summary follows instructions, is grounded, concise, and fluent.
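
The call above passes `prompt=[request, full_text]`, a Python list. When `str.format` substitutes a list it inserts the list's printed form, so the evaluator sees brackets, quotes and escaped newlines rather than clean text. One way to avoid that, sketched below, is to join the parts into a single string before filling the template. `build_eval_prompt` and the shortened `TEMPLATE` are hypothetical; in the notebook the template would be `SUMMARY_PROMPT`.

```python
# Shortened, hypothetical stand-in for the tail of SUMMARY_PROMPT.
TEMPLATE = """\
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""


def build_eval_prompt(template, prompt_parts, response):
    """Fill the template, joining multi-part prompts into a single block of text."""
    if isinstance(prompt_parts, str):
        prompt_text = prompt_parts
    else:
        prompt_text = "\n\n".join(str(part) for part in prompt_parts)
    return template.format(prompt=prompt_text, response=response)


filled = build_eval_prompt(
    TEMPLATE,
    ["Tell me about the training process used here.", "Document text {with braces}."],
    "A short summary.",
)
print(filled)
```

Braces inside the substituted values are left alone, because `str.format` only parses the template string itself.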

In [20]:
struct_eval

<SummaryRating.GOOD: '4'>
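
A single rating is only one sample. If you evaluate several summaries, or repeat the evaluation, it can help to turn the enum values into numbers and aggregate them. The sketch below assumes the same `SummaryRating` enum as above; `rating_to_score`, `parse_rating` and the list of ratings are hypothetical, and the ratings are made-up placeholders, not results from this notebook. `parse_rating` is a simple fallback that reads a `Rating: N` line out of the verbose evaluation text when no structured output is available.

```python
import enum
import re
import statistics


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def rating_to_score(rating):
    """Convert an enum member such as SummaryRating.GOOD to its integer score."""
    return int(rating.value)


def parse_rating(verbose_eval):
    """Fallback: extract 'Rating: N' from free-text evaluation; None if absent."""
    match = re.search(r"Rating:\s*([1-5])", verbose_eval)
    return SummaryRating(match.group(1)) if match else None


# Placeholder ratings for illustration only.
ratings = [SummaryRating.GOOD, SummaryRating.VERY_GOOD, SummaryRating.GOOD]
mean_score = statistics.mean(rating_to_score(r) for r in ratings)

parsed = parse_rating("STEP 2:\nRating: 4 (Good). The summary follows instructions.")
print(parsed, round(mean_score, 2))
```

Averaging treats the 1 to 5 scale as evenly spaced, which is a simplification; reporting the distribution of ratings alongside the mean is often more informative.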