# Tracing & Evaluation for OpenAI SDK Agent 

## Part I. Tracing 

#### Loading environment variables

To start, let's load our environment variables from our `.env` file. Make sure it includes every key listed in `.env.example`.

In your `.env` file, set the following fields:

LANGSMITH_TRACING="true" \
LANGSMITH_ENDPOINT="https://api.smith.langchain.com" \
LANGSMITH_API_KEY="your-api-key" \
LANGSMITH_PROJECT="your-project-name" \
OPENAI_API_KEY="your-openai-api-key"

In [2]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env", override=True)

True
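
A `True` return value from `load_dotenv` does not guarantee that every key is present, so a quick check can catch a missing key before the first traced run. This is a minimal sketch: the helper name `missing_env_vars` is hypothetical, and the key list simply mirrors the fields listed above.

```python
import os

# Keys listed in the .env fields above
REQUIRED_KEYS = [
    "LANGSMITH_TRACING",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "OPENAI_API_KEY",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Hypothetical usage: report anything that load_dotenv did not provide
missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```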

#### Loading agent and setting up sample instrumentation

First, we load the `MultiAgentQAEvaluator` and its parameters class from `qa_agent`.

In [3]:
%cd ..
%load_ext autoreload
%autoreload 2
from qa_agent.evaluators import MultiAgentQAEvaluator, MultiAgentQAEvaluatorParameters

/Users/qiaocatherine/git/snorkel-ai-evaluation


Next, we install the LangSmith integration for the OpenAI Agents SDK and register it as a trace processor, so that agent runs are traced to LangSmith.

In [4]:
%pip install -q "langsmith[openai-agents]"

Note: you may need to restart the kernel to use updated packages.


In [5]:
from agents import Agent, Runner, set_trace_processors
from langsmith.wrappers import OpenAIAgentsTracingProcessor

# Set up tracing processor
set_trace_processors([OpenAIAgentsTracingProcessor()])

#### Invoke agent and trace! 

Example [trace](https://smith.langchain.com/public/6e2bb491-8357-4799-abb8-2619b2cfddbf/r)

In [5]:
# Configure the evaluator
parameters = MultiAgentQAEvaluatorParameters(
    planner_model="gpt-4o-mini",
    executor_model="gpt-4o",
    agent_timeout=90,
    low_confidence_threshold=0.5
)

# Create evaluator instance
evaluator = MultiAgentQAEvaluator(parameters)

# Example: Cat image validation
question = "What type of cat is shown in this image?"
provided_answer = "This is a tuxedo cat with black and white fur pattern."
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b7/George%2C_a_perfect_example_of_a_tuxedo_cat.jpg/1250px-George%2C_a_perfect_example_of_a_tuxedo_cat.jpg"

# Run validation
result = await evaluator.evaluate(
    question=question,
    provided_answer=provided_answer,
    image_url=image_url
)

# Display results
print(f"Validation Passed: {result.passed}")
print(f"Notes: {result.notes}")

if result.metadata.get("confidence"):
    print(f"Confidence: {result.metadata['confidence']}")

Validation Passed: True
Notes: The provided answer is correct with confidence 0.95. The rationale follows: The provided answer correctly identifies the cat as a tuxedo cat based on the visual characteristics observed in the image. The image shows a cat with a predominantly black fur pattern and distinct white markings on the chest, paws, and face, which are characteristic of a tuxedo cat. This matches the definition of a tuxedo cat as a bicolor cat with a specific black-and-white pattern. The validation tasks consistently support this conclusion with high confidence, and there is no contradictory evidence. Therefore, the answer is correct..
Confidence: 0.95


## Part II. Complex Evaluations (full agent)

In [6]:
import sqlite3
import requests
from langchain_community.utilities.sql_database import SQLDatabase
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

def get_engine_for_chinook_db():
    """Pull sql file, populate in-memory database, and create engine."""
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early if the download did not succeed
    sql_script = response.text

    connection = sqlite3.connect(":memory:", check_same_thread=False)
    connection.executescript(sql_script)
    return create_engine(
        "sqlite://",
        creator=lambda: connection,
        poolclass=StaticPool,
        connect_args={"check_same_thread": False},
    )

engine = get_engine_for_chinook_db()
db = SQLDatabase(engine)


**Evaluations** are made up of three components:

1. A **dataset** of test inputs and expected outputs.
2. An **application or target function** that defines what you are evaluating, taking in inputs and returning the application output.
3. **Evaluators** that score your target function's outputs.

![Evaluation](../images/evals-conceptual.png) 
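
To make the three components concrete, here is a toy sketch in plain Python, with no LangSmith calls. The data, the `target` function, and the `run_evaluation` helper are all hypothetical stand-ins; they only mirror the shape of the real dataset, agent, and evaluator used later in this notebook.

```python
# 1. Dataset: test inputs paired with expected (reference) outputs
dataset = [
    {"inputs": {"question": "2 + 2?", "answer": "4"}, "outputs": {"is_correct": True}},
    {"inputs": {"question": "2 + 2?", "answer": "5"}, "outputs": {"is_correct": False}},
]


# 2. Target function: a toy stand-in for the application under test
def target(inputs: dict) -> dict:
    return {"result": inputs["answer"] == "4"}


# 3. Evaluator: scores the target's output against the reference output
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["result"] == reference_outputs["is_correct"]


def run_evaluation(dataset, target, evaluator) -> float:
    """Run the target on every example and return the mean evaluator score."""
    scores = [evaluator(target(ex["inputs"]), ex["outputs"]) for ex in dataset]
    return sum(scores) / len(scores)


accuracy = run_evaluation(dataset, target, exact_match)
print(f"Accuracy: {accuracy:.2f}")
```

In the rest of this part, LangSmith plays the role of `run_evaluation`: it stores the dataset, calls the target function on each example, and records the evaluator scores.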

#### 1. Create a Dataset

Load the test cases from a CSV file.

In [6]:
import pandas as pd

df = pd.read_csv("agent_qa_eval_samples.csv")
df.head()

| Unnamed: 0 | index | question | answer | image_url | is_correct |
|---|---|---|---|---|---|
| 0 | 0 | Which variant is most similar to HPR1 in terms... | HPR1-T335A | https://d28jlkw5i7tp0z.cloudfront.net/upload/p... | True |
| 1 | 1 | Which solid at 10 WT% has the highest deviatio... | CRB | https://d28jlkw5i7tp0z.cloudfront.net/upload/p... | True |
| 2 | 2 | Examining the gene expression patterns across ... | CYP1A2 | https://d28jlkw5i7tp0z.cloudfront.net/upload/p... | True |
| 3 | 3 | Which treatment most increases both AP-1 and C... | H2O2 | https://d28jlkw5i7tp0z.cloudfront.net/upload/p... | True |
| 4 | 4 | After application of depolarizing current, at ... | 10' + 30' | https://d28jlkw5i7tp0z.cloudfront.net/upload/p... | True |


Create the dataset in LangSmith. Each example's inputs are `question`, `answer`, and `image_url`; its reference output is `is_correct`.

In [7]:
from langsmith import Client

client = Client()

dataset_name = "Snorkel AI: Complex Evaluation"

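if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name)

    # Upload every row in one batch; inputs and outputs are matched by position
    rows = df.to_dict("records")
    client.create_examples(
        inputs=[
            {"question": row["question"], "answer": row["answer"], "image_url": row["image_url"]}
            for row in rows
        ],
        # Cast to a plain bool so the label is stored as a JSON true/false
        outputs=[{"is_correct": bool(row["is_correct"])} for row in rows],
        dataset_id=dataset.id,
    )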

#### 2. Define Application Logic to be Evaluated 

Now, let's define how to run our agent, taking in the input fields as defined in our dataset. 

**You can easily configure the parameters to run a different experiment, which lets you compare how different models and configurations affect agent quality.** For other components, such as settings defined inside the agent itself, change them in the agent code, re-import, and rerun the experiment.

In [8]:
# Configure the evaluator
parameters = MultiAgentQAEvaluatorParameters(
    planner_model="gpt-4o-mini",
    executor_model="gpt-4o-mini",
    agent_timeout=100, 
    low_confidence_threshold=0.5
)

# Create evaluator instance
evaluator = MultiAgentQAEvaluator(parameters)
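
One way to compare several model combinations is to generate the parameter sets up front and give each its own experiment prefix. The sketch below only builds the keyword arguments; the parameter names are the ones used in this notebook, while the `build_configs` helper and the prefix naming scheme are assumptions made for illustration.

```python
from itertools import product

planner_models = ["gpt-4o-mini", "gpt-4o"]
executor_models = ["gpt-4o-mini", "gpt-4o"]


def build_configs(planners, executors, agent_timeout=100, low_confidence_threshold=0.5):
    """Return one (experiment_prefix, kwargs) pair per model combination."""
    configs = []
    for planner, executor in product(planners, executors):
        kwargs = {
            "planner_model": planner,
            "executor_model": executor,
            "agent_timeout": agent_timeout,
            "low_confidence_threshold": low_confidence_threshold,
        }
        # Hypothetical naming scheme so experiments are easy to tell apart
        prefix = f"agent-planner-{planner}-executor-{executor}"
        configs.append((prefix, kwargs))
    return configs


for prefix, kwargs in build_configs(planner_models, executor_models):
    # Each kwargs dict could be passed as MultiAgentQAEvaluatorParameters(**kwargs)
    print(prefix, kwargs)
```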

In [9]:
async def run_agent(inputs: dict):
    """Run agent and track the final response."""

    result = await evaluator.evaluate(
        question=inputs["question"],
        provided_answer=inputs["answer"],
        image_url=inputs["image_url"]
    )

    # Use .get() so a missing confidence value does not raise a KeyError
    return {"result": result.passed, "notes": result.notes, "confidence": result.metadata.get("confidence")}

#### 3. Define the Evaluator

**Using heuristic code evaluator**

Since the human reference output and the agent's result are both booleans, a simple equality check is enough.

In [10]:
async def alignment(outputs: dict, reference_outputs: dict) -> bool:
    """Check whether the agent's verdict matches the human label."""
    return outputs['result'] == reference_outputs["is_correct"]
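
A per-example match score does not show which way the agent errs, for instance whether it tends to accept wrong answers or reject correct ones. The sketch below aggregates predictions into confusion counts. The function name is hypothetical; its `(outputs, reference_outputs)` list signature is intended to match what LangSmith expects from a summary evaluator, but treat that as an assumption to check against the LangSmith documentation before passing it to `summary_evaluators`.

```python
def confusion_counts(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    """Count true/false positives and negatives across all examples."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for out, ref in zip(outputs, reference_outputs):
        predicted = bool(out["result"])
        expected = bool(ref["is_correct"])
        if predicted and expected:
            counts["tp"] += 1  # agent accepted a correct answer
        elif predicted and not expected:
            counts["fp"] += 1  # agent accepted an incorrect answer
        elif not predicted and expected:
            counts["fn"] += 1  # agent rejected a correct answer
        else:
            counts["tn"] += 1  # agent rejected an incorrect answer
    return counts


# Toy data, for illustration only
toy_outputs = [{"result": True}, {"result": True}, {"result": False}]
toy_references = [{"is_correct": True}, {"is_correct": False}, {"is_correct": False}]
print(confusion_counts(toy_outputs, toy_references))
```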

#### 4. Run the Evaluation

Example result [here](https://smith.langchain.com/public/a08a7fcf-4094-4888-b645-82755d9e0623/d)

In [None]:
# Evaluation job and results
experiment_results = await client.aevaluate(
    run_agent,
    data=dataset_name,
    evaluators=[alignment],
    experiment_prefix="agent-gpt-4o-mini",
    # LLMs are non-deterministic, so you can raise this to run each example several times
    num_repetitions=1,
    max_concurrency=2,
)
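
When `num_repetitions` is greater than 1, each example produces several scores, and it can help to look at how consistently the agent gets each one right. The following is a minimal sketch of that aggregation. The `(example_id, passed)` record shape and the helper name are hypothetical; this is not the structure LangSmith returns, so the records would first need to be extracted from the experiment results.

```python
from collections import defaultdict


def pass_rate_by_example(records):
    """Compute the fraction of passing runs for each example.

    `records` is an iterable of (example_id, passed) pairs, one per repetition.
    """
    totals = defaultdict(lambda: [0, 0])  # example_id -> [passes, runs]
    for example_id, passed in records:
        totals[example_id][0] += int(bool(passed))
        totals[example_id][1] += 1
    return {example_id: passes / runs for example_id, (passes, runs) in totals.items()}


# Toy records, for illustration only
toy_records = [("ex-1", True), ("ex-1", False), ("ex-2", True), ("ex-2", True)]
print(pass_rate_by_example(toy_records))
```

Examples with a pass rate strictly between 0 and 1 are the ones where the agent's verdict is unstable across runs.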