# AG-UI Integration
Ragas can evaluate agents that stream events via the [AG-UI protocol](https://docs.ag-ui.com/). This notebook shows how to build evaluation datasets, configure metrics, and score AG-UI endpoints.


## Prerequisites
- Install optional dependencies with `pip install "ragas[ag-ui]" langchain-openai python-dotenv nest_asyncio`
- Start an AG-UI compatible agent locally (Google ADK, PydanticAI, CrewAI, etc.)
- Create a `.env` file with your evaluator LLM credentials (e.g. `OPENAI_API_KEY` or `GOOGLE_API_KEY`)
- If you run this notebook, call `nest_asyncio.apply()` (shown below) so you can `await` coroutines in-place.
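
Before running the evaluator, it can help to confirm that the credentials from `.env` actually reached the process. The sketch below uses only the standard library; `missing_credentials` is a hypothetical helper, not part of Ragas, and the key value is a placeholder.

```python
import os


def missing_credentials(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Placeholder mapping for illustration; omit `env` to check the real os.environ
# after load_dotenv() has run.
missing = missing_credentials(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": "sk-example"})
print(missing)  # an empty list means every required key is present
```

Passing an explicit `env` mapping keeps the check easy to test; in the notebook you would call it without that argument.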


In [None]:
# !pip install "ragas[ag-ui]" langchain-openai python-dotenv nest_asyncio


## Imports and environment setup
Load environment variables and import the classes used throughout the walkthrough.


In [2]:
import asyncio

from dotenv import load_dotenv
import nest_asyncio
from IPython.display import display
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import EvaluationDataset, SingleTurnSample, MultiTurnSample
from ragas.integrations.ag_ui import (
    evaluate_ag_ui_agent,
    convert_to_ragas_messages,
    convert_messages_snapshot,
)
from ragas.messages import HumanMessage, ToolCall
from ragas.metrics import FactualCorrectness, ToolCallF1
from ragas.llms import LangchainLLMWrapper
from ag_ui.core import (
    MessagesSnapshotEvent,
    TextMessageChunkEvent,
    UserMessage,
    AssistantMessage,
)

load_dotenv()
# Patch the existing notebook loop so we can await coroutines safely
nest_asyncio.apply()

## Build single-turn evaluation data
Create `SingleTurnSample` entries when you only need to grade the final answer text.


In [2]:
scientist_questions = EvaluationDataset(
    samples=[
        SingleTurnSample(
            user_input="Who originated the theory of relativity?",
            reference="Albert Einstein originated the theory of relativity.",
        ),
        SingleTurnSample(
            user_input="Who discovered penicillin and when?",
            reference="Alexander Fleming discovered penicillin in 1928.",
        ),
    ]
)

scientist_questions

EvaluationDataset(features=['user_input', 'reference'], len=2)

## Build multi-turn conversations
For tool-usage metrics, extend the dataset with `MultiTurnSample` and expected tool calls.


In [3]:
weather_queries = EvaluationDataset(
    samples=[
        MultiTurnSample(
            user_input=[HumanMessage(content="What's the weather in Paris?")],
            reference_tool_calls=[
                ToolCall(name="weatherTool", args={"location": "Paris"})
            ],
        )
    ]
)

weather_queries

EvaluationDataset(features=['user_input', 'reference_tool_calls'], len=1)

## Configure metrics and the evaluator LLM
Wrap your grading model with the appropriate adapter and instantiate the metrics you plan to use.


In [4]:
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

qa_metrics = [FactualCorrectness(llm=evaluator_llm)]
tool_metrics = [ToolCallF1()]  # rule-based, no LLM required



## Evaluate a live AG-UI endpoint
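Set `AG_UI_ENDPOINT` to the URL exposed by your agent. Both `RUN_*` flags default to `False` so the notebook can be executed without a running agent; set them to `True` when you are ready to run the evaluations. The outputs shown under each cell come from runs with the flags enabled.
In Jupyter/IPython you can `await` the helpers directly once `nest_asyncio.apply()` has been called.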
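
Outside a notebook there is no running event loop, so top-level `await` is unavailable and the coroutine has to be driven with `asyncio.run`. The following is a minimal sketch of that pattern: `fake_evaluation` is a hypothetical stand-in for `evaluate_ag_ui_agent`, and `RUN_EVAL` mirrors the `RUN_*` toggles used in this section.

```python
import asyncio


async def fake_evaluation(dataset_size):
    """Stand-in coroutine; a real run would call the AG-UI endpoint here."""
    await asyncio.sleep(0)  # yield to the event loop once
    return {"samples": dataset_size}


RUN_EVAL = True  # set to False to skip the (stubbed) evaluation


async def main():
    if not RUN_EVAL:
        return None
    return await fake_evaluation(dataset_size=2)


# In a plain script, asyncio.run creates the loop and runs main() to completion.
result = asyncio.run(main())
print(result)
```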


In [32]:
AG_UI_ENDPOINT = "http://localhost:8000/agentic_chat"  # Update to match your agent

RUN_FACTUAL_EVAL = False
RUN_TOOL_EVAL = False

In [34]:
async def evaluate_factual():
    return await evaluate_ag_ui_agent(
        endpoint_url=AG_UI_ENDPOINT,
        dataset=scientist_questions,
        metrics=qa_metrics,
        evaluator_llm=evaluator_llm,
        metadata=True,
    )


if RUN_FACTUAL_EVAL:
    factual_result = await evaluate_factual()
    factual_df = factual_result.to_pandas()
    display(factual_df)

| | user_input | retrieved_contexts | response | reference | factual_correctness(mode=f1) |
|---|---|---|---|---|---|
| 0 | Who originated the theory of relativity? | [] | The theory of relativity was originated by Alb... | Albert Einstein originated the theory of relat... | 0.33 |
| 1 | Who discovered penicillin and when? | [] | Penicillin was discovered by Alexander Fleming... | Alexander Fleming discovered penicillin in 1928. | 1.0 |
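
To read the `factual_correctness(mode=f1)` column, it may help to see how an F1 value falls out of claim counts. The sketch below assumes the score is the harmonic mean of claim-level precision and recall; the counts are invented to show how a value like 0.33 could arise, since the notebook does not report the actual claim breakdown.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative claim counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 1 response claim supported by the reference,
# 3 response claims not supported, 1 reference claim missing from the response.
score = round(f1_from_counts(tp=1, fp=3, fn=1), 2)
print(score)  # 0.33
```

Under this reading, a long answer that adds detail absent from a one-sentence reference can score low even when its central claim is correct.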


In [35]:
async def evaluate_tool_usage():
    return await evaluate_ag_ui_agent(
        endpoint_url=AG_UI_ENDPOINT,
        dataset=weather_queries,
        metrics=tool_metrics,
        evaluator_llm=evaluator_llm,
    )


if RUN_TOOL_EVAL:
    tool_result = await evaluate_tool_usage()
    tool_df = tool_result.to_pandas()
    display(tool_df)

| | user_input | reference_tool_calls | tool_call_f1 |
|---|---|---|---|
| 0 | [{'content': 'What's the weather in Paris?', '... | [{'name': 'weatherTool', 'args': {'location': ... | 0.0 |
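
A `tool_call_f1` of 0.0 means no predicted call matched a reference call. The sketch below is a simplified, rule-based illustration of that kind of comparison, not the Ragas implementation: it assumes a call matches only when both name and arguments are identical. The agent-side name `get_weather` is hypothetical, since the notebook does not show what the agent actually called.

```python
from collections import Counter


def call_key(name, args):
    """Hashable identity for a tool call (argument values must be hashable)."""
    return (name, tuple(sorted(args.items())))


def simple_tool_call_f1(predicted, reference):
    """F1 over exact (name, args) matches between two lists of tool calls."""
    pred = Counter(call_key(n, a) for n, a in predicted)
    ref = Counter(call_key(n, a) for n, a in reference)
    tp = sum((pred & ref).values())  # calls present in both lists
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = [("weatherTool", {"location": "Paris"})]
mismatch = simple_tool_call_f1([("get_weather", {"location": "Paris"})], reference)
match = simple_tool_call_f1([("weatherTool", {"location": "Paris"})], reference)
print(mismatch, match)  # 0.0 1.0
```

If your own run scores 0.0, comparing the agent's emitted tool names and argument keys against `reference_tool_calls` is a reasonable first check.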


## Convert recorded AG-UI events
Use the conversion helpers when you already have an event log to grade offline.


In [3]:
events = [
    TextMessageChunkEvent(
        message_id="assistant-1",
        role="assistant",
        delta="Hello from AG-UI!",
    )
]

messages_from_stream = convert_to_ragas_messages(events, metadata=True)

snapshot = MessagesSnapshotEvent(
    messages=[
        UserMessage(id="msg-1", content="Hello?"),
        AssistantMessage(id="msg-2", content="Hi! How can I help you today?"),
    ]
)

messages_from_snapshot = convert_messages_snapshot(snapshot)

messages_from_stream, messages_from_snapshot

([AIMessage(content='Hello from AG-UI!', metadata={'timestamp': None, 'message_id': 'assistant-1'}, type='ai', tool_calls=None)],
 [HumanMessage(content='Hello?', metadata=None, type='human'),
  AIMessage(content='Hi! How can I help you today?', metadata=None, type='ai', tool_calls=None)])
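
The stream above contains a single chunk, but a real agent usually emits many. The sketch below illustrates the general idea of folding chunks into messages, assuming chunks that share a `message_id` belong to one message. `Chunk` and `merge_chunks` are hypothetical stand-ins written with the standard library; they are not the AG-UI or Ragas classes.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    message_id: str
    role: str
    delta: str


def merge_chunks(chunks):
    """Concatenate deltas per message_id, keeping first-seen message order."""
    merged = {}  # dicts preserve insertion order
    for chunk in chunks:
        entry = merged.setdefault(chunk.message_id, {"role": chunk.role, "content": ""})
        entry["content"] += chunk.delta
    return [{"message_id": mid, **entry} for mid, entry in merged.items()]


stream = [
    Chunk("assistant-1", "assistant", "Hello "),
    Chunk("assistant-1", "assistant", "from AG-UI!"),
]
messages = merge_chunks(stream)
print(messages)
```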