# 🔬 Lab: Evaluating the Cooking AI Agent

## Overview

This lab walks you through the complete process of evaluating an AI agent using **Azure AI Foundry's cloud-based evaluation framework**. You'll learn how to:

1. **Prepare test data** - Create queries and collect agent responses
2. **Format data for evaluation** - Convert responses to evaluation-ready format
3. **Configure evaluators** - Set up quality and agent-specific metrics
4. **Run cloud evaluation** - Submit to Azure AI Foundry for comprehensive analysis
5. **Analyze results** - Review metrics in the Azure AI Foundry portal

### What You'll Evaluate

The **Cooking AI Agent** is a conversational agent that helps users with:
- Recipe search
- Ingredient extraction
- Recipe suggestions based on preferences

### Evaluation Metrics

We'll measure both **general quality** and **agent-specific** performance:

**Quality Metrics:**
- **Relevance** (1-5): How well responses address the query
- **Coherence** (1-5): Logical structure and clarity
- **Fluency** (1-5): Grammatical correctness and readability

**Agent Metrics:**
- **Intent Resolution** (1-5): Understanding user intent
- **Tool Call Accuracy** (1-5): Correct tool selection and parameters
- **Task Adherence** (1-5): Following instructions and scope

---

## Prerequisites

Before starting, ensure you have:
- ✅ Azure AI Foundry project set up
- ✅ Azure OpenAI resource deployed
- ✅ Agent responses collected (from `run_agent.py`)
- ✅ Environment variables configured

Let's get started! 🚀

## Step 1: Import Required Libraries

First, let's import all the libraries needed for cloud-based evaluation.

In [15]:
import os
import json
from datetime import datetime
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    Evaluation,
    InputDataset,
    EvaluatorConfiguration,
    EvaluatorIds
)
from dotenv import load_dotenv

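# Load environment variables
# Forward slashes work on Windows, macOS, and Linux, and avoid the
# invalid "\." escape sequences in the original backslash path.
load_dotenv("../../.env")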

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!
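
With the `.env` file loaded, it can help to confirm that the variables this lab reads are actually set before going further. The following is a minimal sketch: the variable names are the ones passed to `os.getenv` in Step 4, while `find_missing_vars` is a hypothetical helper introduced here for illustration.

```python
import os

# Names taken from the os.getenv calls in Step 4 of this lab.
REQUIRED_VARS = ["AI_FOUNDRY_PROJECT_ENDPOINT", "AOAI_ENDPOINT"]


def find_missing_vars(env: dict, required: list) -> list:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


missing = find_missing_vars(dict(os.environ), REQUIRED_VARS)
if missing:
    print(f"❌ Missing environment variables: {', '.join(missing)}")
else:
    print("✅ All required environment variables are set")
```

`MODEL_DEPLOYMENT_NAME` is left out of the required list because Step 4 falls back to a default when it is not set.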


## Step 2: Define Tool Definitions

For the **Tool Call Accuracy** evaluator to work, we need to provide the tool definitions that were available to the cooking agent. This helps the evaluator understand which tools exist and their intended use.

In [16]:
def get_tool_definitions() -> list:
    """
    Get the tool definitions that were available to the cooking agent.
    These definitions help the Tool Call Accuracy evaluator understand
    which tools were available and their intended use.
    
    Format matches Azure AI Foundry evaluator requirements.
    """
    return [
        {
            "id": "search_recipes",
            "name": "search_recipes",
            "description": "Search for recipes based on a query. Returns matching recipes with their basic information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query for recipes (e.g., 'pasta', 'chicken', 'dessert')"
                    }
                },
                "required": ["query"]
            }
        },
        {
            "id": "extract_ingredients",
            "name": "extract_ingredients",
            "description": "Extract and return the full list of ingredients for a specific recipe.",
            "parameters": {
                "type": "object",
                "properties": {
                    "recipe_name": {
                        "type": "string",
                        "description": "The name of the recipe to extract ingredients from"
                    }
                },
                "required": ["recipe_name"]
            }
        },
        {
            "id": "get_recipe_suggestions",
            "name": "get_recipe_suggestions",
            "description": "Get recipe suggestions based on dietary preferences or meal type.",
            "parameters": {
                "type": "object",
                "properties": {
                    "dietary_preference": {
                        "type": "string",
                        "description": "Dietary preference (e.g., 'quick', 'vegetarian', 'meat', 'dessert')",
                        "default": "any"
                    }
                }
            }
        }
    ]

# Display tool definitions
tools = get_tool_definitions()
print(f"✅ Defined {len(tools)} tools for the cooking agent:")
for tool in tools:
    print(f"   - {tool['name']}: {tool['description'][:60]}...")

✅ Defined 3 tools for the cooking agent:
   - search_recipes: Search for recipes based on a query. Returns matching recipe...
   - extract_ingredients: Extract and return the full list of ingredients for a specif...
   - get_recipe_suggestions: Get recipe suggestions based on dietary preferences or meal ...
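
To see how a tool definition relates to a recorded tool call, here is a small local check that compares a call's arguments against the `required` and `properties` entries of its definition. This is only a structural sanity check under our own assumptions; it is not how the Tool Call Accuracy evaluator scores responses. `check_tool_call` is a hypothetical helper, and `sample_definitions` is a trimmed copy of one entry from `get_tool_definitions()`.

```python
def check_tool_call(tool_call: dict, tool_definitions: list) -> list:
    """Return a list of problems found in one tool call (empty if none)."""
    by_name = {tool["name"]: tool for tool in tool_definitions}
    name = tool_call.get("name")
    if name not in by_name:
        return [f"unknown tool: {name}"]

    schema = by_name[name]["parameters"]
    arguments = tool_call.get("arguments", {})
    problems = []
    # Every argument listed as required must be present
    for required in schema.get("required", []):
        if required not in arguments:
            problems.append(f"missing required argument: {required}")
    # Every supplied argument must be declared in the schema
    for arg in arguments:
        if arg not in schema.get("properties", {}):
            problems.append(f"unexpected argument: {arg}")
    return problems


# Trimmed copy of one definition from get_tool_definitions()
sample_definitions = [
    {
        "name": "search_recipes",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]

good_call = {"name": "search_recipes", "arguments": {"query": "pasta"}}
bad_call = {"name": "search_recipes", "arguments": {}}
print(check_tool_call(good_call, sample_definitions))  # []
print(check_tool_call(bad_call, sample_definitions))   # ['missing required argument: query']
```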


## Step 3: Prepare Evaluation Data

Now we'll convert the test responses (collected by running the agent) into the JSONL format required by Azure AI Foundry's cloud evaluation.

Each record will include:
- `query`: The user's question
- `response`: The agent's final answer
- `tool_calls`: Tools used by the agent
- `tool_definitions`: Available tools (for evaluation context)

In [17]:
def prepare_evaluation_data(responses_file: str, output_jsonl: str) -> int:
    """
    Convert test responses to the JSONL format required by cloud evaluation.
    
    Args:
        responses_file: Path to test_responses.json
        output_jsonl: Path to output JSONL file

    Returns:
        Number of records written to the JSONL file
    """
    print(f"📊 Preparing evaluation data from {responses_file}...")
    
    with open(responses_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        responses = data.get("responses", [])
    
    # Get tool definitions that were available to the agent
    tool_definitions = get_tool_definitions()
    
    # Convert to JSONL format with required fields
    with open(output_jsonl, 'w', encoding='utf-8') as f:
        for item in responses:
            # Create evaluation record
            eval_record = {
                "query": item.get("query", ""),
                "response": item.get("response", ""),
                "tool_calls": item.get("tool_calls", []),  # Include tool calls for agent evaluators
                "tool_definitions": tool_definitions  # Include tool definitions for Tool Call Accuracy evaluator
            }
            
            f.write(json.dumps(eval_record) + '\n')
    
    print(f"✅ Created evaluation dataset: {output_jsonl} ({len(responses)} records)")
    return len(responses)

# Prepare the data
responses_file = "test_responses.json"
eval_data_file = "evaluation_data.jsonl"

num_records = prepare_evaluation_data(responses_file, eval_data_file)

# Preview first record
print(f"\n📋 Preview of first evaluation record:")
with open(eval_data_file, 'r', encoding='utf-8') as f:
    first_record = json.loads(f.readline())
    print(f"   Query: {first_record['query']}")
    print(f"   Response: {first_record['response'][:100]}...")
    print(f"   Tool calls: {len(first_record['tool_calls'])} calls")

📊 Preparing evaluation data from test_responses.json...
✅ Created evaluation dataset: evaluation_data.jsonl (10 records)

📋 Preview of first evaluation record:
   Query: Find me some pasta recipes
   Response: I found a delicious pasta recipe for you: Pasta Carbonara! It takes just 10 minutes to prep and 15 m...
   Tool calls: 1 calls
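
Because `prepare_evaluation_data` falls back to empty values when a field is missing, a quick validation pass over the JSONL lines can catch incomplete records before upload. The sketch below is one possible check; `validate_jsonl_lines` is a hypothetical helper, and the field names are the four listed above.

```python
import json

EXPECTED_FIELDS = {"query", "response", "tool_calls", "tool_definitions"}


def validate_jsonl_lines(lines) -> list:
    """Return (line_number, message) pairs for records that look malformed."""
    problems = []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            problems.append((number, f"missing fields: {sorted(missing)}"))
        elif not record["response"]:
            problems.append((number, "empty response"))
    return problems


# In-memory demo: one complete record and one with an empty response
demo_lines = [
    json.dumps({"query": "q", "response": "r", "tool_calls": [], "tool_definitions": []}),
    json.dumps({"query": "q", "response": "", "tool_calls": [], "tool_definitions": []}),
]
for number, message in validate_jsonl_lines(demo_lines):
    print(f"Line {number}: {message}")
```

To check the real file, pass an open file handle instead of `demo_lines`, since iterating over a file yields its lines.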


## Step 4: Configure Azure AI Foundry Project

Let's connect to your Azure AI Foundry project where the evaluation will run. The next cell reads these environment variables:
- **AI_FOUNDRY_PROJECT_ENDPOINT**: Your Azure AI Foundry project endpoint
- **AOAI_ENDPOINT**: Your Azure OpenAI endpoint (used by the evaluator model)
- **MODEL_DEPLOYMENT_NAME**: Model deployment name (default: gpt-4o-mini)

**Authentication Options:**
- **Option 1 (Recommended)**: Use Azure credentials (no API key needed) via `az login`
- **Option 2**: Provide `AZURE_OPENAI_API_KEY` environment variable

In [None]:
print("🔧 Configuring Azure AI Foundry project...")

# Get configuration from environment
project_endpoint = os.getenv("AI_FOUNDRY_PROJECT_ENDPOINT")
model_endpoint = os.getenv("AOAI_ENDPOINT")
model_deployment_name = os.getenv("MODEL_DEPLOYMENT_NAME", "gpt-4o-mini")

# Validate configuration
if not project_endpoint:
    print("❌ AI_FOUNDRY_PROJECT_ENDPOINT not found. Please set it in environment variables.")
    print("   Format: https://<account>.services.ai.azure.com/api/projects/<project>")
else:
    print(f"✅ Project endpoint: {project_endpoint}")

if not model_endpoint:
    print("❌ AOAI_ENDPOINT not found. Please set it in environment variables.")
    print("   Format: https://<account>.services.ai.azure.com")
else:
    print(f"✅ Model endpoint: {model_endpoint}")


print(f"✅ Model deployment: {model_deployment_name}")

🔧 Configuring Azure AI Foundry project...
✅ Project endpoint: https://foundrytkkzfewrtdpby.services.ai.azure.com/api/projects/foundry-tkkzfewrtdpby-project
✅ Model endpoint: https://admin-mebeltoz-eastus2.openai.azure.com/
✅ Model deployment: gpt-4.1


## Step 5: Create AI Project Client

Connect to Azure AI Foundry using your credentials. Make sure you've run `az login` before this step.

In [19]:
print("🌐 Connecting to Azure AI Foundry project...")

try:
    project_client = AIProjectClient(
        endpoint=project_endpoint,
        credential=DefaultAzureCredential(),
    )
    print("✅ Connected to Azure AI Foundry project")
except Exception as e:
    print(f"❌ Failed to connect to project: {e}")
    print("Please run 'az login' and ensure you have access to the project.")
    raise

🌐 Connecting to Azure AI Foundry project...
✅ Connected to Azure AI Foundry project


## Step 6: Upload Evaluation Dataset

Upload the JSONL file to Azure AI Foundry as a dataset. This creates a versioned dataset that can be reused.

In [20]:
dataset_name = "cooking-agent-test-data"
# Use timestamp for unique version to avoid conflicts
dataset_version = datetime.now().strftime("%Y%m%d-%H%M%S")

print(f"📤 Uploading evaluation dataset...")

try:
    data_upload = project_client.datasets.upload_file(
        name=dataset_name,
        version=dataset_version,
        file_path=eval_data_file,
    )
    data_id = data_upload.id
    if not data_id:
        print("❌ Dataset upload succeeded but no ID returned")
        raise Exception("No dataset ID returned")
    
    print(f"✅ Dataset uploaded: {dataset_name} (v{dataset_version})")
    print(f"   Dataset ID: {data_id}")
except Exception as e:
    print(f"❌ Failed to upload dataset: {e}")
    raise

📤 Uploading evaluation dataset...
✅ Dataset uploaded: cooking-agent-test-data (v20251029-223733)
   Dataset ID: azureai://accounts/foundry-tkkzfewrtdpby/projects/foundry-tkkzfewrtdpby-project/data/cooking-agent-test-data/versions/20251029-223733
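
The dataset ID printed above embeds the dataset name and version, which is handy when you want to log which data a run used. The sketch below parses that string. The pattern is inferred only from the ID shown in this lab's output, so treat it as an assumption rather than a documented format; `parse_dataset_id` is a hypothetical helper.

```python
import re

# Pattern inferred from the Dataset ID printed above (an assumption,
# not a documented contract).
DATASET_ID_PATTERN = re.compile(
    r"^azureai://accounts/(?P<account>[^/]+)/projects/(?P<project>[^/]+)"
    r"/data/(?P<name>[^/]+)/versions/(?P<version>[^/]+)$"
)


def parse_dataset_id(dataset_id: str) -> dict:
    """Split a dataset ID into account, project, name, and version."""
    match = DATASET_ID_PATTERN.match(dataset_id)
    if match is None:
        raise ValueError(f"Unrecognized dataset ID: {dataset_id}")
    return match.groupdict()


example_id = (
    "azureai://accounts/foundry-tkkzfewrtdpby/projects/"
    "foundry-tkkzfewrtdpby-project/data/cooking-agent-test-data/"
    "versions/20251029-223733"
)
parts = parse_dataset_id(example_id)
print(f"{parts['name']} (v{parts['version']})")
```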


## Step 7: Configure Evaluators

Now we'll configure both **quality evaluators** and **agent-specific evaluators**:

### Quality Evaluators (General)
- **Relevance**: Does the response address the query?
- **Coherence**: Is the response logically structured?
- **Fluency**: Is the response grammatically correct?

### Agent Evaluators (Agent-Specific)
- **Intent Resolution**: Does the agent understand user intent?
- **Tool Call Accuracy**: Are tools used correctly?
- **Task Adherence**: Does the agent stay within task scope?

In [21]:
print("📋 Configuring evaluators...")

evaluators = {
    # Quality evaluators
    "relevance": EvaluatorConfiguration(
        id=EvaluatorIds.RELEVANCE.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
        },
    ),
    "coherence": EvaluatorConfiguration(
        id=EvaluatorIds.COHERENCE.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
        },
    ),
    "fluency": EvaluatorConfiguration(
        id=EvaluatorIds.FLUENCY.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
        },
    ),
    # Agent-specific evaluators
    "intent_resolution": EvaluatorConfiguration(
        id=EvaluatorIds.INTENT_RESOLUTION.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
        },
    ),
    "tool_call_accuracy": EvaluatorConfiguration(
        id=EvaluatorIds.TOOL_CALL_ACCURACY.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
            "tool_calls": "${data.tool_calls}",  # Map tool_calls from data
            "tool_definitions": "${data.tool_definitions}",  # Map tool definitions from data
        },
    ),
    "task_adherence": EvaluatorConfiguration(
        id=EvaluatorIds.TASK_ADHERENCE.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
        },
    ),
}

print(f"✅ Configured {len(evaluators)} evaluators:")
print("   Quality Evaluators:")
print("   - relevance")
print("   - coherence")
print("   - fluency")
print("   Agent Evaluators:")
print("   - intent_resolution")
print("   - tool_call_accuracy")
print("   - task_adherence")

📋 Configuring evaluators...
✅ Configured 6 evaluators:
   Quality Evaluators:
   - relevance
   - coherence
   - fluency
   Agent Evaluators:
   - intent_resolution
   - tool_call_accuracy
   - task_adherence
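
Each `data_mapping` entry pairs an evaluator input with a `${data.<field>}` placeholder that names a field in the JSONL records. To make that relationship concrete, here is a local illustration that resolves a mapping against one record. It only mimics the idea; the way the service actually resolves mappings is not shown in this lab, and `resolve_mapping` is a hypothetical helper.

```python
import re

# Matches placeholders of the form "${data.<field>}"
PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(data_mapping: dict, record: dict) -> dict:
    """Replace each "${data.<field>}" placeholder with record[<field>]."""
    resolved = {}
    for evaluator_input, template in data_mapping.items():
        match = PLACEHOLDER.match(template)
        if match is None:
            raise ValueError(f"Unsupported template: {template}")
        # A KeyError here means the mapping names a field the record lacks
        resolved[evaluator_input] = record[match.group(1)]
    return resolved


record = {
    "query": "Find me some pasta recipes",
    "response": "I found a delicious pasta recipe for you: Pasta Carbonara!",
    "tool_calls": [],
}
mapping = {"query": "${data.query}", "response": "${data.response}"}
print(resolve_mapping(mapping, record))
```

A mapping that points at a field missing from the dataset fails immediately in this sketch, which is the kind of mismatch worth catching before submitting a job.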


## Step 8: Create and Submit Evaluation

Now we'll create the evaluation job and submit it to Azure AI Foundry. The evaluation will run in the cloud using the configured evaluators.

In [22]:
print("🚀 Submitting cloud evaluation...")


try:
    evaluation = Evaluation(
        display_name="Cooking Agent Evaluation",
        description="Evaluation of cooking agent responses for quality (relevance, coherence, fluency) and agent-specific metrics (intent resolution, tool call accuracy, task adherence)",
        data=InputDataset(id=data_id),
        evaluators=evaluators,
    )
    
    # Prepare headers based on authentication method
    headers = {"model-endpoint": model_endpoint}

    # Option 2: API key authentication, used only if AZURE_OPENAI_API_KEY is set
    model_api_key = os.getenv("AZURE_OPENAI_API_KEY")
    if model_api_key:
        headers["api-key"] = model_api_key
        print("   Using API key authentication")
    else:
        # Option 1: Azure credential authentication (requires proper RBAC)
        from azure.identity import get_bearer_token_provider
        credential = DefaultAzureCredential()
        token_provider = get_bearer_token_provider(
            credential,
            "https://cognitiveservices.azure.com/.default"
        )
        # Get token and add to headers
        token = token_provider()
        headers["Authorization"] = f"Bearer {token}"
        print("   Using Azure credential authentication")

    # Submit the evaluation
    evaluation_response = project_client.evaluations.create(
        evaluation,
        headers=headers,
    )
    
    print("✅ Evaluation submitted successfully!")
    print("=" * 70)
    print(f"📊 Evaluation Details:")
    if hasattr(evaluation_response, 'name'):
        print(f"   Name: {evaluation_response.name}")
    if hasattr(evaluation_response, 'status'):
        print(f"   Status: {evaluation_response.status}")
    if hasattr(evaluation_response, 'id'):
        print(f"   ID: {evaluation_response.id}")
    print("=" * 70)
    
except Exception as e:
    print(f"\n❌ Failed to submit evaluation: {e}")
    print("\nTroubleshooting:")
    print("1. Ensure AI_FOUNDRY_PROJECT_ENDPOINT is correct")
    print("2. Ensure AOAI_ENDPOINT is correct")
    print("3. Verify model deployment exists in your Azure OpenAI resource")
    print("4. Check that storage account is connected to your project")
    print("5. Ensure you have appropriate RBAC permissions:")
    print("   - Cognitive Services OpenAI User (for Azure credential auth)")
    print("   - Or use AZURE_OPENAI_API_KEY environment variable")
    raise

🚀 Submitting cloud evaluation...
   Using Azure credential authentication
✅ Evaluation submitted successfully!
📊 Evaluation Details:
   Name: 8c01712b-5e44-4f87-82a4-d524c86d2ce9
   Status: NotStarted
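
The job starts in the `NotStarted` state, so you may want to wait for it to finish before opening the portal. Below is a generic polling sketch. It takes any callable that returns the current status string, which keeps it independent of a particular SDK call. The terminal status names are assumptions (only `NotStarted` appears in this lab's output), and `wait_for_status` is a hypothetical helper.

```python
import time
from typing import Callable

# Assumed terminal states; adjust to the values your SDK version reports.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_status(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
) -> str:
    """Poll get_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Demo with a fake status source instead of a real evaluation job
fake_statuses = iter(["NotStarted", "Running", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(f"Final status: {final_status}")
```

In the notebook, `get_status` would wrap whatever call re-fetches the evaluation by the name printed above (for example `project_client.evaluations.get(...)`, assuming your installed SDK version provides it) and return its `status`.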


## Step 9: View Results in Azure AI Foundry Portal

Your evaluation is now running in the cloud! 🎉

### Next Steps:

1. **Open Azure AI Foundry Portal**
   - Navigate to: https://ai.azure.com
   
2. **Find Your Project**
   - Go to your project dashboard
   - Click on the **Evaluation** tab
   
3. **View Results**
   - Look for evaluation: **"Cooking Agent Evaluation"**
   - Dataset version: the timestamp printed in Step 6 (for example, **20251029-223733**)
   - View metrics, charts, and detailed results
   
4. **Analyze Metrics**
   - Quality scores (1-5): Relevance, Coherence, Fluency
   - Agent scores (1-5): Intent Resolution, Tool Call Accuracy, Task Adherence
   - Per-query breakdowns and aggregated statistics

### Understanding Results

**Score Scale**: All evaluators use a 1-5 scale where:
- **5** = Excellent
- **4** = Good
- **3** = Acceptable
- **2** = Needs improvement
- **1** = Poor

**What to Look For**:
- High scores (4-5) across all metrics indicate strong performance
- Low tool call accuracy may indicate incorrect tool selection
- Low intent resolution suggests the agent misunderstands queries
- Low task adherence means the agent goes off-topic
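
If you copy per-query scores out of the portal, a few lines of Python can apply the guidance above by averaging each metric and flagging anything below 4. This is a sketch under our own assumptions: the scores are hypothetical, not results from this lab, and `summarize_scores` is a helper introduced for illustration.

```python
from statistics import mean


def summarize_scores(rows: list, threshold: float = 4.0) -> dict:
    """Return {metric: (mean_score, needs_attention)} for 1-5 scores."""
    metrics = sorted({metric for row in rows for metric in row})
    summary = {}
    for metric in metrics:
        values = [row[metric] for row in rows if metric in row]
        average = mean(values)
        summary[metric] = (round(average, 2), average < threshold)
    return summary


# Hypothetical per-query scores, not real results from this lab
example_rows = [
    {"relevance": 5, "tool_call_accuracy": 3},
    {"relevance": 4, "tool_call_accuracy": 4},
]
for metric, (average, flagged) in summarize_scores(example_rows).items():
    marker = "⚠️" if flagged else "✅"
    print(f"{marker} {metric}: {average}")
```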

---

## Summary

Congratulations! 🎊 You've completed the cloud evaluation lab and learned how to:

✅ Define tool definitions for agent evaluation  
✅ Prepare evaluation data in JSONL format  
✅ Configure Azure AI Foundry project  
✅ Upload datasets to the cloud  
✅ Set up quality and agent-specific evaluators  
✅ Submit cloud evaluations  
✅ View results in the portal  

### Key Takeaways

1. **Cloud evaluation** runs in Azure AI Foundry rather than on your machine, so larger test sets do not depend on local compute
2. **Agent evaluators** provide insights specific to tool-calling agents
3. **Versioned datasets** enable tracking improvements over time
4. **Azure AI Foundry** provides rich visualizations and historical tracking

### Next Steps

- **Iterate on agent prompts** based on evaluation results
- **Add more test queries** to cover edge cases
- **Compare evaluations** across different versions
- **Set up automated evaluation** in your CI/CD pipeline
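
As a starting point for comparing versions or gating a pipeline, the sketch below compares per-metric averages from two runs and reports any metric that dropped by more than a tolerance. The numbers are hypothetical, the 0.25 tolerance is an arbitrary assumption, and `find_regressions` is a helper introduced here for illustration.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.25) -> dict:
    """Return {metric: (baseline, candidate)} for scores that dropped by more than tolerance."""
    return {
        metric: (baseline[metric], candidate[metric])
        for metric in baseline
        if metric in candidate and baseline[metric] - candidate[metric] > tolerance
    }


# Hypothetical per-metric averages from two evaluation runs
baseline_means = {"relevance": 4.5, "task_adherence": 4.2}
candidate_means = {"relevance": 4.6, "task_adherence": 3.7}

regressions = find_regressions(baseline_means, candidate_means)
if regressions:
    for metric, (before, after) in regressions.items():
        print(f"❌ {metric} dropped from {before} to {after}")
else:
    print("✅ No regressions beyond tolerance")
```

A CI job could exit with a non-zero status when `regressions` is non-empty, which would block a change that lowers agent quality.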

Happy evaluating! 🚀

---

## 🔬 Optional: Explore the Data

Want to understand what data looks like at each step? Run the cells below to inspect the evaluation pipeline.

### View Test Queries

These are the queries used to test the cooking agent:

In [10]:
# Load and display test queries
with open('test_queries.json', 'r', encoding='utf-8') as f:
    queries_data = json.load(f)
    queries = queries_data.get("queries", [])

print(f"📝 Test Queries ({len(queries)} total):\n")
for i, q in enumerate(queries, 1):
    print(f"{i}. {q['query']}")

📝 Test Queries (10 total):

1. Find me some pasta recipes
2. What ingredients do I need for carbonara?
3. I want to make something with chicken
4. Suggest some quick recipes for dinner
5. Do you have any soup recipes?
6. What can I make for dessert?
7. Show me vegetarian options
8. I need ingredients for chicken stir fry
9. What recipes use tomatoes?
10. Give me some meat-based recipes


### View Agent Responses

Let's look at a sample response from the agent:

In [11]:
# Load and display a sample response
with open('test_responses.json', 'r', encoding='utf-8') as f:
    responses_data = json.load(f)
    responses = responses_data.get("responses", [])

# Show first response
if responses:
    sample = responses[0]
    print(f"📤 Sample Response:\n")
    print(f"Query: {sample['query']}")
    print(f"\nResponse: {sample['response']}")
    print(f"\nTool Calls: {len(sample.get('tool_calls', []))} calls")
    
    if sample.get('tool_calls'):
        print("\nTool Call Details:")
        for i, tc in enumerate(sample['tool_calls'], 1):
            print(f"  {i}. {tc.get('name', 'unknown')}({tc.get('arguments', {})})")
else:
    print("⚠️ No responses found. Run 'python run_agent.py' first to collect responses.")

📤 Sample Response:

Query: Find me some pasta recipes

Response: I found a delicious pasta recipe for you: Pasta Carbonara! It takes just 10 minutes to prep and 15 minutes to cook, making it a quick and tasty option. Would you like the full ingredient list or step-by-step cooking instructions for Pasta Carbonara? If you’re interested in more pasta recipes or have specific preferences (like vegetarian or gluten-free), let me know!

Tool Calls: 1 calls

Tool Call Details:
  1. search_recipes({'query': 'pasta'})


### View Evaluation Data Format

Here's what the formatted evaluation data looks like (JSONL format):

In [12]:
# Read and display first evaluation record
import json

print("📄 Evaluation Data Format (JSONL):\n")
with open('evaluation_data.jsonl', 'r', encoding='utf-8') as f:
    first_line = f.readline()
    eval_record = json.loads(first_line)
    
    # Pretty print the structure
    print(json.dumps({
        "query": eval_record["query"],
        "response": eval_record["response"][:100] + "..." if len(eval_record["response"]) > 100 else eval_record["response"],
        "tool_calls": eval_record["tool_calls"],
        "tool_definitions_count": len(eval_record["tool_definitions"])
    }, indent=2))

📄 Evaluation Data Format (JSONL):

{
  "query": "Find me some pasta recipes",
  "response": "I found a delicious pasta recipe for you: Pasta Carbonara! It takes just 10 minutes to prep and 15 m...",
  "tool_calls": [
    {
      "type": "tool_call",
      "tool_call_id": "call_uy2o6m70jy00xddGS0wkA7HG",
      "name": "search_recipes",
      "arguments": {
        "query": "pasta"
      }
    }
  ],
  "tool_definitions_count": 3
}
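
Beyond inspecting a single record, it can be useful to see which tools the agent called across all responses. The sketch below tallies tool names with `collections.Counter`. The first sample record mirrors the response shown above, the second is hypothetical, and `count_tool_usage` is a helper introduced here for illustration.

```python
from collections import Counter


def count_tool_usage(responses: list) -> Counter:
    """Count how often each tool name appears across all responses."""
    counts = Counter()
    for item in responses:
        for call in item.get("tool_calls", []):
            counts[call.get("name", "unknown")] += 1
    return counts


sample_responses = [
    # Mirrors the first response shown above
    {"query": "Find me some pasta recipes",
     "tool_calls": [{"name": "search_recipes", "arguments": {"query": "pasta"}}]},
    # Hypothetical second response for illustration
    {"query": "What ingredients do I need for carbonara?",
     "tool_calls": [{"name": "extract_ingredients", "arguments": {"recipe_name": "carbonara"}}]},
]

for name, count in count_tool_usage(sample_responses).most_common():
    print(f"{name}: {count}")
```

With the real data, pass the `responses` list loaded from `test_responses.json`. A tool that never appears in the tally may point to a gap in the test queries.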
