In [None]:
!pip show pydantic-ai

In [None]:
!pip install -U pydantic-ai

Collecting pydantic-ai
  Downloading pydantic_ai-1.18.0-py3-none-any.whl.metadata (14 kB)
Collecting pydantic-ai-slim==1.18.0 (from pydantic-ai-slim[ag-ui,anthropic,bedrock,cli,cohere,evals,fastmcp,google,groq,huggingface,logfire,mcp,mistral,openai,retries,temporal,ui,vertexai]==1.18.0->pydantic-ai)
  Downloading pydantic_ai_slim-1.18.0-py3-none-any.whl.metadata (6.1 kB)
Collecting genai-prices>=0.0.35 (from pydantic-ai-slim==1.18.0->pydantic-ai-slim[ag-ui,anthropic,bedrock,cli,cohere,evals,fastmcp,google,groq,huggingface,logfire,mcp,mistral,openai,retries,temporal,ui,vertexai]==1.18.0->pydantic-ai)
  Downloading genai_prices-0.0.39-py3-none-any.whl.metadata (6.5 kB)
Collecting griffe>=1.3.2 (from pydantic-ai-slim==1.18.0->pydantic-ai-slim[ag-ui,anthropic,bedrock,cli,cohere,evals,fastmcp,google,groq,huggingface,logfire,mcp,mistral,openai,retries,temporal,ui,vertexai]==1.18.0->pydantic-ai)
  Downloading griffe-1.15.0-py3-none-any.whl.metadata (5.2 kB)
Collecting pydantic-graph==1.18.0 (

In [None]:
from google.colab import userdata
import os
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')
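
The cell above only works inside Colab, since `google.colab` is not importable elsewhere. A minimal sketch of a fallback, assuming the key may instead be exported as an environment variable; `resolve_api_key` is a hypothetical helper, not part of Pydantic-AI:

```python
import os


def resolve_api_key(name: str, env=None) -> str:
    """Return the secret `name` from the environment, else from Colab secrets."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        raise RuntimeError(
            f"{name} is not set; export it or add it as a Colab secret"
        ) from None
    return userdata.get(name)


# An explicit mapping stands in for os.environ in this demo.
print(resolve_api_key("GROQ_API_KEY", env={"GROQ_API_KEY": "dummy-key"}))
```

In the notebook this would be used as `os.environ['GROQ_API_KEY'] = resolve_api_key('GROQ_API_KEY')`.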

- Use Groq as LLM provider in Pydantic-AI

https://ai.pydantic.dev/models/groq/


- Agents Introduction in Pydantic-AI

https://ai.pydantic.dev/agents/#introduction

# Simple Agent

In [None]:
from pydantic_ai import Agent

agent = Agent('groq:openai/gpt-oss-120b', system_prompt="You answer questions only related to sports")

In [None]:
import nest_asyncio

nest_asyncio.apply()

result = agent.run_sync('Where does "hello world" come from?')

In [None]:
result

AgentRunResult(output='I’m sorry, but I can only help with questions about sports. If you have a sports‑related query, feel free to ask!')

In [None]:
result.output

'I’m sorry, but I can only help with questions about sports. If you have a sports‑related query, feel free to ask!'

In [None]:
result.all_messages()

[ModelRequest(parts=[SystemPromptPart(content='You answer questions only related to sports', timestamp=datetime.datetime(2025, 11, 15, 9, 54, 32, 716642, tzinfo=datetime.timezone.utc)), UserPromptPart(content='Where does "hello world" come from?', timestamp=datetime.datetime(2025, 11, 15, 9, 54, 32, 716662, tzinfo=datetime.timezone.utc))], run_id='38f8a026-f2b5-46af-b7d4-648ae628954c'),
 ModelResponse(parts=[ThinkingPart(content='We have a system instruction: "You are ChatGPT...". Then developer instruction: "You answer questions only related to sports". The user asks "Where does \'hello world\' come from?" That\'s not a sports question. According to instruction hierarchy, developer instruction overrides system instruction. So we must obey developer: answer only sports-related questions. The user question is not about sports. We need to respond accordingly: we cannot answer; we should politely say we can only answer sports questions.'), TextPart(content='I’m sorry, but I can only help 

Providing message history to the agent

In [None]:
result = agent.run_sync("When did India won the world cup in cricket?")

In [None]:
result.new_messages()

[ModelRequest(parts=[SystemPromptPart(content='You answer questions only related to sports', timestamp=datetime.datetime(2025, 11, 15, 9, 57, 49, 127776, tzinfo=datetime.timezone.utc)), UserPromptPart(content='When did India won the world cup in cricket?', timestamp=datetime.datetime(2025, 11, 15, 9, 57, 49, 127791, tzinfo=datetime.timezone.utc))], run_id='3eda3789-4297-4932-ad9e-558a7e222092'),
 ModelResponse(parts=[ThinkingPart(content='The user asks: "When did India won the (sic) the world cup in cricket?" It\'s a sports question, specifically about cricket World Cup. We can answer: India won the ICC Cricket World Cup in 1983 and 2011. Also they won T20 World Cup in 2007 and 2021. The question likely refers to the ODI World Cup. Provide dates. Answer.'), TextPart(content='India has lifted the ICC\u202fCricket World Cup (the 50‑over\u202fODI tournament) **twice**:\n\n| Year | Host(s) | Final Opponent | Result |\n|------|----------|----------------|--------|\n| **1983** | England | We

In [None]:
agent.run_sync("What was my last question?", message_history=result.new_messages())

AgentRunResult(output='Your most recent question was: **“When did India won the world cup in cricket?”**')

# Agent with Structured Response


This example shows how to get structured, type-safe responses from the agent.

Key concepts:
- Using Pydantic models to define response structure
- Type validation and safety


In [None]:
from pydantic import BaseModel, Field

model = 'groq:openai/gpt-oss-120b'

class ResponseModel(BaseModel):
    """Structured response with metadata."""

    response: str
    needs_escalation: bool
    follow_up_required: bool
    sentiment: str = Field(description="Customer sentiment analysis")


agent2 = Agent(
    model=model,
    output_type=ResponseModel,
    system_prompt=(
        "You are an intelligent customer support agent. "
        "Analyze queries carefully and provide structured responses."
    ),
)

response = agent2.run_sync("How can I track my order #12345?")
response.output.model_dump_json(indent=2)

'{\n  "response": "To track your order #12345, please follow these steps:\\n1. Visit our website and log into your account.\\n2. Go to the \\"My Orders\\" section.\\n3. Locate order #12345 in the list and click the \\"Track\\" button.\\n4. You’ll see the latest shipping status and an estimated delivery date.\\n\\nIf you prefer, you can also track your order directly using this link: https://www.example.com/track?order=12345 (replace with the actual tracking URL if available).\\n\\nIf you encounter any issues or the status isn’t updating, feel free to reply to this message or contact our support team at support@example.com or call 1‑800‑123‑4567.",\n  "needs_escalation": false,\n  "follow_up_required": false,\n  "sentiment": "neutral"\n}'

In [None]:
print(response.output.response)

To track your order #12345, please follow these steps:
1. Visit our website and log into your account.
2. Go to the "My Orders" section.
3. Locate order #12345 in the list and click the "Track" button.
4. You’ll see the latest shipping status and an estimated delivery date.

If you prefer, you can also track your order directly using this link: https://www.example.com/track?order=12345 (replace with the actual tracking URL if available).

If you encounter any issues or the status isn’t updating, feel free to reply to this message or contact our support team at support@example.com or call 1‑800‑123‑4567.


# RAG

In [None]:
import os
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')

In [None]:
from pydantic_ai import Agent
# from pydantic_ai.models.groq import GroqModel
from typing import List

# 1. Choose your model. The agent below uses the "groq:openai/gpt-oss-120b" string shorthand;
#    the explicit form would be GroqModel("openai/gpt-oss-120b").

# 2. Define a retriever tool for RAG

def retrieve_docs(query: str) -> List[str]:
    """
    Retrieve relevant documents for a query.
    In real life, call your vector DB / search index here.
    """
    # TODO: replace this with your real retrieval
    fake_corpus = {
        "pydantic": "Pydantic is a library for data validation using Python type hints.",
        "rag": "RAG stands for Retrieval Augmented Generation.",
        "agent": "Agents can call tools to fetch external information."
    }
    return [text for key, text in fake_corpus.items() if key in query.lower()]

# 3. Create the agent and attach the tool
rag_agent = Agent(
    "groq:openai/gpt-oss-120b",
    system_prompt=(
        "You are a RAG assistant.\n"
        "- Use the `retrieve_docs` tool whenever user questions may require external info.\n"
        "- When you call it, read the returned documents and answer using ONLY that info plus the question.\n"
        "- If the tool returns nothing, say you couldn't find anything relevant."
    ),
    tools=[retrieve_docs],
)

# 4. Run a RAG-style query
async def ask(question: str):
    result = await rag_agent.run(
        question,
        # (optional) pass deps=... or message_history=... here
    )
    print(result)
    # print(result.output)          # final answer text
    # print(result.all_messages())  # includes tool calls and tool returns, useful for debugging
    return result


In [None]:
import asyncio

result = asyncio.run(ask("Explain what RAG is and how it relates to agents."))

AgentRunResult(output='RAG stands for **Retrieval‑Augmented Generation**.\u202fIt is a technique where a language model doesn’t rely solely on its internal knowledge; instead, it first pulls in relevant external information and then uses that retrieved content to generate its response.\n\nAgents—software entities that can execute actions on behalf of a user—often have the ability to call tools that fetch external data (e.g., search APIs, databases, or document stores). When an agent uses such a tool to retrieve information before producing an answer, it is effectively performing the “retrieval” step of RAG. The subsequent generation step then incorporates the fetched data, completing the Retrieval‑Augmented Generation cycle. In short, agents provide the mechanism (tool calls) that enables the retrieval part of RAG, allowing the model to produce more up‑to‑date and accurate responses.')
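
To see what the tool hands to the model, the retriever can be called on its own. The sketch below repeats `retrieve_docs` from the cell above so that it runs standalone (the corpus is moved into a module-level `FAKE_CORPUS` for convenience). The second query is a made-up example showing that plain substring matching can also fire by accident.

```python
from typing import List

FAKE_CORPUS = {
    "pydantic": "Pydantic is a library for data validation using Python type hints.",
    "rag": "RAG stands for Retrieval Augmented Generation.",
    "agent": "Agents can call tools to fetch external information.",
}


def retrieve_docs(query: str) -> List[str]:
    """Return every corpus entry whose key occurs as a substring of the query."""
    return [text for key, text in FAKE_CORPUS.items() if key in query.lower()]


# Matches the "rag" and "agent" keys, returned in corpus order.
hits = retrieve_docs("Explain what RAG is and how it relates to agents.")
print(hits)

# "storage" contains the substring "rag", so the RAG entry is returned
# for a question that has nothing to do with RAG.
accidental = retrieve_docs("How much cloud storage do I get?")
print(accidental)
```

A real retriever would replace the substring test with a vector or keyword index, as the docstring in the original cell suggests.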


# Example: Evaluating a RAG application (retrieval + answer quality)

Here we combine:

- A Pydantic-AI RAG agent using a `search_docs` tool (very similar to the official Pydantic RAG example).
- Pydantic-Evals to:
  1. Check that the agent used the expected documents
  2. Check that the answer is supported by a reference answer

- RAG agent skeleton

Assume you already have an in-memory document store and a simple retrieval function.

In [None]:
from dataclasses import dataclass
from typing import List
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

# --- Domain models ----------------------------------------------------

class DocChunk(BaseModel):
    id: str
    text: str

@dataclass
class RAGDeps:
    # could be a vector store or just a list of docs in a PoC
    documents: List[DocChunk]

# --- Retrieval tool ---------------------------------------------------

def retrieve_relevant_chunks(query: str, docs: List[DocChunk], k: int = 3) -> List[DocChunk]:
    # toy implementation: top-k by simple keyword overlap
    scores = []
    query_terms = set(query.lower().split())
    for d in docs:
        overlap = len(query_terms & set(d.text.lower().split()))
        scores.append((overlap, d))
    scores.sort(reverse=True, key=lambda t: t[0])
    return [d for score, d in scores[:k] if score > 0]

# --- RAG agent --------------------------------------------------------

class RAGAnswer(BaseModel):
    answer: str
    used_doc_ids: List[str]

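# Note: "openai:gpt-4o-mini" needs OPENAI_API_KEY. To reuse the Groq key set earlier,
# swap in "groq:openai/gpt-oss-120b".
rag_agent = Agent[RAGDeps, RAGAnswer](
    "openai:gpt-4o-mini",
    deps_type=RAGDeps,
    output_type=RAGAnswer,
    instructions="""
    You are a documentation assistant.
    Call the `search_docs` tool to fetch context chunks for the question.
    Use ONLY the returned chunks to answer the question.
    If the answer is not in the context, say you don't know.
    Return the IDs of the chunks you used.
    """,
)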

@rag_agent.tool  # function tool exposes retrieval to the model
def search_docs(ctx: RunContext[RAGDeps], query: str) -> List[DocChunk]:
    return retrieve_relevant_chunks(query, ctx.deps.documents)
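
The keyword-overlap scoring is easy to check by hand without calling a model. A minimal sketch, assuming a plain dataclass `Chunk` as a stand-in for `DocChunk` so that it runs without Pydantic; the two documents are the ones used later in this notebook. Because terms are split on whitespace only, `policy?` does not match `policy`, while a common word such as `is` counts as overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:  # stand-in for DocChunk, assumed for this sketch only
    id: str
    text: str


def retrieve_relevant_chunks(query: str, docs: List[Chunk], k: int = 3) -> List[Chunk]:
    # Same toy scoring as above: count shared whitespace-separated terms.
    query_terms = set(query.lower().split())
    scores = [(len(query_terms & set(d.text.lower().split())), d) for d in docs]
    scores.sort(reverse=True, key=lambda t: t[0])
    return [d for score, d in scores[:k] if score > 0]


docs = [
    Chunk("d1", "Our refund policy allows returns within 30 days."),
    Chunk("d2", "Tech support is available 24/7 via chat."),
]

# d1 matches on "refund"; d2 matches only on the stop word "is".
refund_ids = [d.id for d in retrieve_relevant_chunks("What is your refund policy?", docs)]
print(refund_ids)  # ['d1', 'd2']

# d2 matches on "is", "tech", "support"; d1 shares no terms and is dropped.
support_ids = [d.id for d in retrieve_relevant_chunks("When is tech support available?", docs)]
print(support_ids)  # ['d2']
```

Stripping punctuation and filtering stop words before scoring would remove the spurious `d2` hit in the first query.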


- Task function for evals

We want the task function to return both answer text and which docs were used:

In [None]:
from typing import TypedDict, List

class RAGOutput(TypedDict):
    answer: str
    used_doc_ids: List[str]

def rag_task(question: str, deps: RAGDeps) -> RAGOutput:
    result = rag_agent.run_sync(question, deps=deps)
    out = result.output
    return {"answer": out.answer, "used_doc_ids": out.used_doc_ids}


For simple evals we can partially “fix” dependencies (e.g., same KB for all cases):

In [None]:
# Partial application to match the expected signature inputs -> output
my_docs = [
    DocChunk(id="d1", text="Our refund policy allows returns within 30 days."),
    DocChunk(id="d2", text="Tech support is available 24/7 via chat."),
]

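def rag_task_fixed(inputs: "RAGInputs") -> RAGOutput:
    # The dataset passes each case's `inputs` object (a RAGInputs, defined in a later cell)
    # to the task, so pull the question out of it before calling the agent.
    return rag_task(inputs.question, deps=RAGDeps(documents=my_docs))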


- Define RAG-specific eval dataset

Key idea: in each case we specify, as part of the inputs:

1. The question

2. The expected supporting doc IDs

3. Optionally, a reference answer

In [None]:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

class RAGInputs(BaseModel):
    question: str
    expected_doc_ids: List[str]
    reference_answer: str

dataset = Dataset[RAGInputs, RAGOutput](
    cases=[
        Case(
            name="refund_policy",
            inputs=RAGInputs(
                question="What is your refund policy?",
                expected_doc_ids=["d1"],
                reference_answer="We allow refunds within 30 days of purchase.",
            ),
        ),
        Case(
            name="support_hours",
            inputs=RAGInputs(
                question="When is tech support available?",
                expected_doc_ids=["d2"],
                reference_answer="Tech support is available 24/7 via chat.",
            ),
        ),
    ],
)


- Add custom evaluators: retrieval + groundedness

You can layer:

- Retrieval match: Did the agent use all of the expected doc IDs?
- Groundedness: Is the answer consistent with the reference answer? (Good use case for LLMJudge.)


In [None]:
### Retrieval evaluator (deterministic)


from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, EvaluationReason

@dataclass
class RetrievalMatch(Evaluator[RAGInputs, RAGOutput, None]):
    """Check that the agent used all expected docs."""

    def evaluate(self, ctx: EvaluatorContext[RAGInputs, RAGOutput, None]):
        expected = set(ctx.inputs.expected_doc_ids)
        used = set(ctx.output["used_doc_ids"])
        missing = expected - used
        extra = used - expected

        ok = not missing  # require at least all expected docs
        reason = f"missing={missing}, extra={extra}"
        return EvaluationReason(value=ok, reason=reason)


Attach this evaluator to the dataset:

In [None]:
dataset.add_evaluator(RetrievalMatch())

(You can also attach an evaluator to a single case by passing `specific_case="refund_policy"` to `add_evaluator`.)
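
The set arithmetic behind `RetrievalMatch` can be tried without running the agent. A minimal sketch in plain Python; `retrieval_scores` is a hypothetical helper that also reports precision and recall, which the evaluator above does not compute (it only checks that no expected document is missing).

```python
from typing import Dict, Iterable


def retrieval_scores(expected: Iterable[str], used: Iterable[str]) -> Dict[str, object]:
    expected_set, used_set = set(expected), set(used)
    hits = expected_set & used_set
    return {
        # Same pass condition as RetrievalMatch: every expected doc was used.
        "ok": expected_set <= used_set,
        "missing": sorted(expected_set - used_set),
        "extra": sorted(used_set - expected_set),
        "precision": len(hits) / len(used_set) if used_set else 0.0,
        "recall": len(hits) / len(expected_set) if expected_set else 1.0,
    }


# Passes (d1 was used) even though an extra document lowers precision.
with_extra = retrieval_scores(["d1"], ["d1", "d2"])
print(with_extra)

# Fails: the expected document was never used.
wrong_doc = retrieval_scores(["d1"], ["d2"])
print(wrong_doc)
```

Whether extra documents should fail a case is a design choice; the evaluator above tolerates them and only records them in the reason string.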

In [None]:
### Grounded answer evaluator with LLMJudge


from pydantic_evals.evaluators import LLMJudge

groundedness_rubric = """
You are evaluating an answer produced by a RAG system.

Criteria (True vs False):
- TRUE if the answer is fully supported by the provided reference_answer text
  and does not introduce any contradictions or extra facts.
- FALSE if the answer contradicts, fabricates, or goes beyond the reference.
"""

dataset.add_evaluator(
    LLMJudge(
        rubric=groundedness_rubric,
        include_input=True,  # the judge sees the RAGInputs, including reference_answer
        # include_expected_output is left out: these cases set no expected_output
        # model optional – uses default judge model, typically `openai:gpt-4o`
    )
)


Under the hood, LLMJudge will call a judge model and return a boolean/score with an explanation.

- Run RAG evaluation

In [None]:
report = dataset.evaluate_sync(rag_task_fixed)
report.print()

# Example: Evaluating an Agent with Pydantic-Evals

- Agent we want to evaluate

A minimal Pydantic-AI agent that classifies user queries into support intents:

In [None]:
from typing import Literal
from pydantic import BaseModel
from pydantic_ai import Agent

class IntentOutput(BaseModel):
    intent: Literal["refund", "technical_support", "sales", "other"]
    reasoning: str

# Pydantic AI agent: takes a user query and returns typed output
intent_agent = Agent[None, IntentOutput](
    "openai:gpt-4o-mini",  # or any configured model
    instructions="""
    You are a support triage bot.
    Classify the user's message into one of: refund, technical_support, sales, other.
    Explain your reasoning in one or two sentences.
    """,
    output_type=IntentOutput,
)


This is typical Pydantic-AI usage: agents are parameterized by dependency type and output type, and `output_type` is a Pydantic model that the model's response is validated against.


We’ll now evaluate whether the classification is correct across a dataset.

- Wrap the agent as a task function

Pydantic-Evals expects a function to call for each test case. We wrap the agent:

In [None]:
def classify_intent_task(user_message: str) -> str:
    """Task function for evals – returns only the intent label."""
    result = intent_agent.run_sync(user_message)
    return result.output.intent


- Define dataset + evaluators (deterministic + LLM-judge)

In [None]:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, LLMJudge

# 1) Dataset with labeled cases
dataset = Dataset[str, str](
    cases=[
        Case(
            name="refund_request",
            inputs="I want a refund for the headphones I bought last week.",
            expected_output="refund",
        ),
        Case(
            name="login_issue",
            inputs="I can't log into my account, it keeps saying password invalid.",
            expected_output="technical_support",
        ),
        Case(
            name="pricing_question",
            inputs="Do you offer any discount if we buy 100 licenses?",
            expected_output="sales",
        ),
    ],
    evaluators=[
        # Exact correctness: output == expected_output
        EqualsExpected(),
        # Subjective check: was the classification reasonable?
        LLMJudge(
            rubric=(
                "Judge if the predicted intent label is a reasonable "
                "classification for the given user message. "
                "Return true only if it clearly fits."
            ),
            include_input=True,
            include_expected_output=True,
            # model optional – uses default judge model if omitted
        ),
    ],
)


- EqualsExpected is a built-in comparison evaluator.


- LLMJudge is an LLM-as-a-judge evaluator for subjective criteria (correctness of label given context).

In [None]:
### Run the evaluation


report = dataset.evaluate_sync(classify_intent_task)

# Pretty-print a high-level summary of the run:
report.print()


Pydantic-Evals’ data model is:

- Dataset: list of Cases + Evaluators
- Experiment: running dataset.evaluate(task)
- EvaluationReport: structured result with scores, assertions, durations, etc.
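
How these pieces fit together can be sketched in plain Python. This is only a conceptual illustration, not how Pydantic-Evals is implemented: `ToyCase`, `run_experiment` and the keyword-based `stub_classifier` (standing in for the LLM agent) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToyCase:  # hypothetical, mirrors name / inputs / expected_output
    name: str
    inputs: str
    expected_output: str


def stub_classifier(message: str) -> str:
    """Keyword rules standing in for the LLM agent, so the sketch runs offline."""
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "log into" in text or "password" in text:
        return "technical_support"
    if "discount" in text or "licenses" in text:
        return "sales"
    return "other"


def run_experiment(cases: List[ToyCase], task: Callable[[str], str]) -> Dict[str, bool]:
    # One exact-match check per case, similar in spirit to EqualsExpected.
    return {case.name: task(case.inputs) == case.expected_output for case in cases}


cases = [
    ToyCase("refund_request", "I want a refund for the headphones I bought last week.", "refund"),
    ToyCase("login_issue", "I can't log into my account, it keeps saying password invalid.", "technical_support"),
    ToyCase("pricing_question", "Do you offer any discount if we buy 100 licenses?", "sales"),
]

report = run_experiment(cases, stub_classifier)
accuracy = sum(report.values()) / len(report)
print(report)
print(f"accuracy={accuracy:.2f}")
```

The real library adds what this loop leaves out, such as LLM-based evaluators, timing, and formatted reports.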