# 📊 Agent Evaluation with Function Tools - Account Balance Lookup

This notebook demonstrates how to use Microsoft Foundry to **evaluate AI agents that call function tools**. We'll create a **Banking Assistant Agent** with account balance and transaction lookup tools, then evaluate its responses.

## 🎯 Learning Objectives

1. **Create function tools** for agent capabilities
2. **Build an agent** with tool integration
3. **Handle function calls** and provide results
4. **Evaluate agent responses** including tool usage

## 💼 Industry Use Case: Banking Assistant with Account Lookup

In banking, agents often need to:
- **Look up account balances** via secure APIs
- **Provide transaction history** summaries
- **Answer questions** about account status

Evaluating these tool-enabled agents ensures:
- Tools are called correctly with proper parameters
- Responses accurately reflect tool outputs
- Security and compliance requirements are met
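
The first two checks in the list above can be approximated in plain Python before running a full evaluation. The sketch below is illustrative only: `ToolCall`, `check_tool_call`, and `response_mentions_balance` are hypothetical names, not part of any SDK, and the checks are much cruder than the built-in evaluators used later in this notebook.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool call made by the agent."""
    name: str
    arguments: str  # JSON-encoded arguments, as produced by the model


def check_tool_call(call: ToolCall, expected_name: str, expected_args: dict) -> bool:
    """Return True if the call targets the expected tool with the expected arguments."""
    try:
        args = json.loads(call.arguments)
    except json.JSONDecodeError:
        return False
    return call.name == expected_name and args == expected_args


def response_mentions_balance(response_text: str, balance: float) -> bool:
    """Return True if the text contains the balance, with or without thousands separators."""
    candidates = {f"{balance:,.2f}", f"{balance:.2f}"}
    return any(candidate in response_text for candidate in candidates)


call = ToolCall(name="get_account_balance", arguments='{"account_number": "CHK-12345"}')
print(check_tool_call(call, "get_account_balance", {"account_number": "CHK-12345"}))
print(response_mentions_balance("Your balance is $5,432.10.", 5432.10))
```

String matching like this only catches gross mismatches; it complements, rather than replaces, a model-based evaluator.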

### ⚠️ Disclaimer
> **This is a demonstration with simulated data.** In production, account lookups would connect to secure banking APIs with proper authentication.

## 🔐 Authentication Setup

Before running this notebook, authenticate with Azure CLI:

```bash
az login --use-device-code
```

## 1. Environment Setup

In [None]:
import json
import os
import time
from pathlib import Path
from typing import Union
from pprint import pprint
from dotenv import load_dotenv

# Load environment variables
notebook_path = Path().absolute()
env_path = notebook_path.parent / '.env'
load_dotenv(env_path)

# Verify required environment variables
project_endpoint = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
tenant_id = os.environ.get("TENANT_ID")
model_deployment = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "gpt-4o")

if not project_endpoint:
    raise ValueError("🚨 AI_FOUNDRY_PROJECT_ENDPOINT not set in .env")

print(f"🔑 Tenant ID: {tenant_id}")
print(f"📍 Project Endpoint: {project_endpoint[:50]}...")
print(f"🤖 Model Deployment: {model_deployment}")

## 2. Initialize AI Project Client

In [None]:
from azure.identity import AzureCliCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import PromptAgentDefinition, Tool, FunctionTool
from openai.types.responses.response_input_param import FunctionCallOutput, ResponseInputParam
from openai.types.evals.run_create_response import RunCreateResponse
from openai.types.evals.run_retrieve_response import RunRetrieveResponse

# Initialize credentials and clients
credential = AzureCliCredential(tenant_id=tenant_id)
project_client = AIProjectClient(endpoint=project_endpoint, credential=credential)
openai_client = project_client.get_openai_client()

print("✅ AIProjectClient initialized")
print("✅ OpenAI client retrieved for evaluations")

## 3. Define Function Tools

We'll create two banking function tools:
1. **get_account_balance** - Look up account balance by account number
2. **get_recent_transactions** - Get recent transaction summary
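
Both tools in this section are declared with `strict=True`. As the notebook's inline note says, that requires every property to appear in `required`, and the definitions also set `additionalProperties` to `False`. A small check like the following can catch a schema that breaks those two rules before the agent is created. `find_strict_schema_problems` is a hypothetical helper that works on plain dictionaries; it is not part of the SDK and only covers flat, top-level object schemas.

```python
def find_strict_schema_problems(parameters: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    properties = parameters.get("properties", {})
    required = set(parameters.get("required", []))

    # Every declared property must be listed as required
    missing = sorted(set(properties) - required)
    if missing:
        problems.append(f"properties not listed in 'required': {missing}")

    # 'required' must not name properties that are not declared
    unknown = sorted(required - set(properties))
    if unknown:
        problems.append(f"'required' names unknown properties: {unknown}")

    # Extra, undeclared arguments must be rejected
    if parameters.get("additionalProperties") is not False:
        problems.append("'additionalProperties' must be False")

    return problems


# A schema that forgets to mark num_transactions as required
loose_schema = {
    "type": "object",
    "properties": {
        "account_number": {"type": "string"},
        "num_transactions": {"type": "integer"},
    },
    "required": ["account_number"],
    "additionalProperties": False,
}
print(find_strict_schema_problems(loose_schema))
```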

In [None]:
# Define the account balance lookup tool
get_balance_tool = FunctionTool(
    name="get_account_balance",
    parameters={
        "type": "object",
        "properties": {
            "account_number": {
                "type": "string",
                "description": "The account number to look up (e.g., 'CHK-12345' or 'SAV-67890')",
            },
        },
        "required": ["account_number"],
        "additionalProperties": False,
    },
    description="Get the current balance for a bank account. Returns balance and account type.",
    strict=True,
)

# Define the recent transactions tool
# Note: With strict=True, ALL properties must be in the required array
get_transactions_tool = FunctionTool(
    name="get_recent_transactions",
    parameters={
        "type": "object",
        "properties": {
            "account_number": {
                "type": "string",
                "description": "The account number to look up transactions for",
            },
            "num_transactions": {
                "type": "integer",
                "description": "Number of recent transactions to retrieve (max 10)",
            },
        },
        "required": ["account_number", "num_transactions"],  # All properties required when strict=True
        "additionalProperties": False,
    },
    description="Get recent transactions for a bank account. Returns transaction list with dates and amounts.",
    strict=True,
)

# Combine tools
tools: list[Tool] = [get_balance_tool, get_transactions_tool]

print("✅ Function tools defined:")
for tool in tools:
    print(f"   • {tool.name}: {tool.description[:50]}...")

## 4. Implement Tool Functions

These simulate backend banking API calls. In production, these would connect to secure banking systems.
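
One way to keep the mock and a future production backend interchangeable is to write the tool function against a small interface. The sketch below is an assumption about how that could be structured, not the notebook's implementation: `AccountBackend`, `InMemoryBackend`, and `lookup_balance` are hypothetical names.

```python
from typing import Optional, Protocol


class AccountBackend(Protocol):
    """Hypothetical interface that a mock or a real banking client could satisfy."""

    def fetch_account(self, account_number: str) -> Optional[dict]:
        ...


class InMemoryBackend:
    """Mock backend holding account records in a dictionary."""

    def __init__(self, accounts: dict[str, dict]) -> None:
        self._accounts = accounts

    def fetch_account(self, account_number: str) -> Optional[dict]:
        return self._accounts.get(account_number)


def lookup_balance(backend: AccountBackend, account_number: str) -> dict:
    """Build the tool result from whatever backend is supplied."""
    account = backend.fetch_account(account_number)
    if account is None:
        return {"error": "Account not found", "account_number": account_number}
    return {
        "account_number": account_number,
        "account_type": account["type"],
        "balance": account["balance"],
        "currency": account["currency"],
    }


backend = InMemoryBackend({"CHK-12345": {"type": "Checking", "balance": 5432.10, "currency": "USD"}})
print(lookup_balance(backend, "CHK-12345"))
print(lookup_balance(backend, "CHK-00000"))
```

With this shape, swapping in a real client would mean providing another class with a `fetch_account` method; the tool function itself would not change.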

In [None]:
# Simulated account data (in production, this would be a secure API call)
MOCK_ACCOUNTS = {
    "CHK-12345": {"type": "Checking", "balance": 5432.10, "currency": "USD"},
    "SAV-67890": {"type": "Savings", "balance": 15750.00, "currency": "USD"},
    "CHK-11111": {"type": "Checking", "balance": 892.45, "currency": "USD"},
}

MOCK_TRANSACTIONS = {
    "CHK-12345": [
        {"date": "2026-01-15", "description": "Direct Deposit - Payroll", "amount": 3500.00},
        {"date": "2026-01-14", "description": "Electric Bill Payment", "amount": -145.50},
        {"date": "2026-01-12", "description": "Grocery Store", "amount": -87.23},
        {"date": "2026-01-10", "description": "ATM Withdrawal", "amount": -200.00},
    ],
    "SAV-67890": [
        {"date": "2026-01-01", "description": "Interest Credit", "amount": 12.50},
        {"date": "2025-12-15", "description": "Transfer from Checking", "amount": 500.00},
    ],
}


def get_account_balance(account_number: str) -> dict:
    """Simulate looking up account balance from banking system."""
    if account_number in MOCK_ACCOUNTS:
        account = MOCK_ACCOUNTS[account_number]
        return {
            "account_number": account_number,
            "account_type": account["type"],
            "balance": account["balance"],
            "currency": account["currency"],
            "status": "active"
        }
    else:
        return {
            "error": "Account not found",
            "account_number": account_number
        }


def get_recent_transactions(account_number: str, num_transactions: int = 5) -> dict:
    """Simulate looking up recent transactions from banking system."""
    num_transactions = max(0, min(num_transactions, 10))  # Clamp to 0-10 so a negative value cannot slice from the end
    
    if account_number in MOCK_TRANSACTIONS:
        transactions = MOCK_TRANSACTIONS[account_number][:num_transactions]
        return {
            "account_number": account_number,
            "transactions": transactions,
            "count": len(transactions)
        }
    else:
        return {
            "error": "Account not found or no transactions available",
            "account_number": account_number
        }


print("✅ Tool functions implemented")
print(f"   Mock accounts available: {list(MOCK_ACCOUNTS.keys())}")

## 5. Create Banking Assistant Agent with Tools

In [None]:
# Create the Banking Assistant Agent with function tools
agent = project_client.agents.create_version(
    agent_name="banking-assistant-with-tools",
    definition=PromptAgentDefinition(
        model=model_deployment,
        instructions="""
        You are a helpful Banking Assistant that can look up account information.
        
        You have access to the following tools:
        - get_account_balance: Look up the current balance for an account
        - get_recent_transactions: Get recent transactions for an account
        
        Guidelines:
        1. Always use the appropriate tool when a customer asks about their account
        2. Present balance information clearly with proper currency formatting
        3. Summarize transactions in a helpful way
        4. If an account is not found, politely inform the customer
        5. Never reveal sensitive implementation details about the banking system
        6. Always maintain a professional and helpful tone
        
        Security Notice: Only provide information for accounts the customer specifies.
        """,
        tools=tools,
    ),
)

print(f"🎉 Agent created (name: {agent.name}, version: {agent.version})")
print(f"   Tools attached: {len(tools)}")

## 6. Test the Agent with Tool Calls

Let's test the agent and handle the function calls. This interaction will be used for evaluation.
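
The function-call handling in this section follows one pattern: parse the JSON arguments, run the matching Python function, and serialize the result. A dispatch table is one way to express that without an if/elif chain. In this sketch, `TOOL_REGISTRY`, `run_tool`, and the stub functions are hypothetical and stand in for the notebook's tool functions; malformed JSON and unexpected arguments come back as error payloads rather than exceptions.

```python
import json
from typing import Callable


def get_account_balance(account_number: str) -> dict:
    """Stub standing in for the notebook's balance lookup."""
    return {"account_number": account_number, "balance": 5432.10}


def get_recent_transactions(account_number: str, num_transactions: int) -> dict:
    """Stub standing in for the notebook's transaction lookup."""
    return {"account_number": account_number, "requested": num_transactions}


# Map tool names (as the model emits them) to Python callables
TOOL_REGISTRY: dict[str, Callable[..., dict]] = {
    "get_account_balance": get_account_balance,
    "get_recent_transactions": get_recent_transactions,
}


def run_tool(name: str, arguments: str) -> str:
    """Execute a tool by name and return a JSON string suitable for a function call output."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown function: {name}"})
    try:
        args = json.loads(arguments)
        result = func(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Bad JSON, non-object arguments, or wrong parameter names end up here
        result = {"error": f"Invalid arguments for {name}: {exc}"}
    return json.dumps(result)


print(run_tool("get_account_balance", '{"account_number": "CHK-12345"}'))
print(run_tool("get_account_balance", '{"wrong_name": "CHK-12345"}'))
```

Returning an error payload keeps the conversation going, so the agent can explain the problem to the customer instead of the notebook cell crashing.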

In [None]:
# Test query that should trigger tool usage
test_query = "What is the balance in my checking account CHK-12345? Also show me my recent transactions."

print(f"👤 Customer: {test_query}")
print("\n🔄 Calling agent...")

# Initial response from agent (may include function calls)
response = openai_client.responses.create(
    input=test_query,
    extra_body={"agent": {"name": agent.name, "type": "agent_reference"}},
)

print(f"\n📥 Initial Response:")
print(f"   Response ID: {response.id}")
print(f"   Output text: {response.output_text}")
print(f"   Output items: {len(response.output)} items")

In [None]:
# Process function calls from the agent
input_list: ResponseInputParam = []

print("\n🔧 Processing function calls...")
print("-" * 40)

for item in response.output:
    if item.type == "function_call":
        print(f"\n📞 Function call: {item.name}")
        print(f"   Arguments: {item.arguments}")
        
        # Parse arguments and execute the appropriate function
        args = json.loads(item.arguments)
        
        if item.name == "get_account_balance":
            result = get_account_balance(**args)
        elif item.name == "get_recent_transactions":
            result = get_recent_transactions(**args)
        else:
            result = {"error": f"Unknown function: {item.name}"}
        
        print(f"   Result: {result}")
        
        # Add function call output to input list
        input_list.append(
            FunctionCallOutput(
                type="function_call_output",
                call_id=item.call_id,
                output=json.dumps(result),
            )
        )

print(f"\n✅ Processed {len(input_list)} function calls")

In [None]:
# If there were function calls, send results back to get final response
if input_list:
    print("\n🔄 Sending function results back to agent...")
    
    final_response = openai_client.responses.create(
        input=input_list,
        extra_body={"agent": {"name": agent.name, "type": "agent_reference"}},
        previous_response_id=response.id,
    )
    
    print(f"\n🤖 Agent Final Response:")
    print(f"   Response ID: {final_response.id}")
    print(f"\n   {final_response.output_text}")
    
    # Use final response for evaluation
    response_for_eval = final_response
else:
    print("\n🤖 Agent Response (no function calls):")
    print(f"   {response.output_text}")
    response_for_eval = response

print(f"\n📝 Response ID for evaluation: {response_for_eval.id}")

## 7. Configure Evaluation for Response with Tools

We'll evaluate the agent's response using the `azure_ai_responses` data source, which allows us to evaluate a specific response by ID.
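
Because the data source is keyed by response ID, the same evaluation could in principle cover several responses in one run. The helper below is a sketch that builds a payload with the same shape this notebook uses for a single response. `build_response_data_source` is a hypothetical name, the example IDs are made up, and evaluating multiple items per run is an assumption to confirm against the service documentation.

```python
def build_response_data_source(response_ids: list[str]) -> dict:
    """Build an 'azure_ai_responses' data source payload for one or more response IDs."""
    if not response_ids:
        raise ValueError("At least one response ID is required")
    return {
        "type": "azure_ai_responses",
        "item_generation_params": {
            "type": "response_retrieval",
            # Maps each item's resp_id field to the response to retrieve
            "data_mapping": {"response_id": "{{item.resp_id}}"},
            "source": {
                "type": "file_content",
                "content": [{"item": {"resp_id": rid}} for rid in response_ids],
            },
        },
    }


# Made-up IDs, for illustration only
payload = build_response_data_source(["resp_example_1", "resp_example_2"])
print(len(payload["item_generation_params"]["source"]["content"]))
```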

In [None]:
from openai.types.eval_create_params import DataSourceConfigCustom

# Define data source config for response evaluation
data_source_config = DataSourceConfigCustom(
    type="custom",
    item_schema={
        "type": "object",
        "properties": {
            "resp_id": {"type": "string"}
        },
        "required": ["resp_id"]
    },
    include_sample_schema=True,
)

# Testing criteria for evaluating tool-enabled responses
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "violence_detection",
        "evaluator_name": "builtin.violence",
        "data_mapping": {
            "query": "{{item.resp_id}}",  # Using resp_id as placeholder
            "response": "{{sample.output_text}}"
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "fluency",
        "evaluator_name": "builtin.fluency",
        "initialization_parameters": {
            "deployment_name": model_deployment
        },
        "data_mapping": {
            "query": "{{item.resp_id}}",
            "response": "{{sample.output_text}}"
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "task_adherence",
        "evaluator_name": "builtin.task_adherence",
        "initialization_parameters": {
            "deployment_name": model_deployment
        },
        "data_mapping": {
            "query": "{{item.resp_id}}",
            "response": "{{sample.output_items}}"  # Includes tool call info
        },
    },
]

print("✅ Evaluation criteria configured for tool-enabled responses")

In [None]:
# Create evaluation object
eval_object = openai_client.evals.create(
    name="Agent Response Evaluation with Tools",
    data_source_config=data_source_config,
    testing_criteria=testing_criteria,  # type: ignore
)

print(f"✅ Evaluation created (id: {eval_object.id}, name: {eval_object.name})")

## 8. Run Evaluation on the Response

In [None]:
# Configure data source to evaluate the specific response
data_source = {
    "type": "azure_ai_responses",
    "item_generation_params": {
        "type": "response_retrieval",
        "data_mapping": {
            "response_id": "{{item.resp_id}}"
        },
        "source": {
            "type": "file_content",
            "content": [
                {"item": {"resp_id": response_for_eval.id}}
            ]
        },
    },
}

# Create and run the evaluation
response_eval_run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
    eval_id=eval_object.id,
    name=f"Evaluation Run for Agent {agent.name} with Tools",
    data_source=data_source  # type: ignore
)

print(f"🚀 Evaluation run created (id: {response_eval_run.id})")
print(f"⏳ Status: {response_eval_run.status}")

In [None]:
# Poll for evaluation completion
print("⏳ Waiting for evaluation to complete...")
print("-" * 40)

# "canceled" is also a terminal status; without it the loop would never exit for a canceled run
while response_eval_run.status not in ["completed", "failed", "canceled"]:
    time.sleep(5)  # Wait before polling again
    response_eval_run = openai_client.evals.runs.retrieve(
        run_id=response_eval_run.id,
        eval_id=eval_object.id
    )
    print(f"   Status: {response_eval_run.status}")

if response_eval_run.status == "completed":
    print("\n✅ Evaluation run completed successfully!")
else:
    print(f"\n❌ Evaluation run ended with status: {response_eval_run.status}")
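
A polling loop like the one in this step waits indefinitely if the run never reaches a terminal status. One option is to bound it with a deadline. The sketch below is generic and does not call the evaluation service: `poll_until_done`, its parameters, and the fake status sequence are all hypothetical.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    terminal_statuses: frozenset[str] = frozenset({"completed", "failed", "canceled"}),
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Call fetch_status until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still '{status}' after {timeout_seconds} seconds")
        time.sleep(interval_seconds)


# Fake status source standing in for a real retrieve call
statuses = iter(["queued", "in_progress", "completed"])
print(poll_until_done(lambda: next(statuses), interval_seconds=0.01))
```

In the notebook, `fetch_status` could wrap the `openai_client.evals.runs.retrieve(...)` call and return its `status` attribute.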

## 9. Analyze Evaluation Results

In [None]:
if response_eval_run.status == "completed":
    print("\n" + "=" * 60)
    print("📊 EVALUATION RESULTS - Agent with Function Tools")
    print("=" * 60)
    
    # Display result counts
    print(f"\n📈 Result Counts: {response_eval_run.result_counts}")
    
    # Get output items
    output_items = list(
        openai_client.evals.runs.output_items.list(
            run_id=response_eval_run.id,
            eval_id=eval_object.id
        )
    )
    
    print(f"\n📝 OUTPUT ITEMS (Total: {len(output_items)})")
    
    # Display report URL
    if response_eval_run.report_url:
        print(f"\n🔗 Eval Run Report URL: {response_eval_run.report_url}")
    
    # Pretty print detailed results
    print("\n📋 Detailed Results:")
    print("-" * 60)
    pprint(output_items)
    print("-" * 60)
else:
    print("\n❌ Cannot display results - evaluation did not complete successfully.")
    if response_eval_run.report_url:
        print(f"🔗 Check report URL for details: {response_eval_run.report_url}")

## 10. Summary - Tool Evaluation Insights

In [None]:
print("\n" + "=" * 60)
print("📊 EVALUATION SUMMARY - Agent with Function Tools")
print("=" * 60)

print("\n🔧 Tools Evaluated:")
print("   • get_account_balance - Account balance lookup")
print("   • get_recent_transactions - Transaction history")

print("\n🎯 Evaluation Metrics:")
print("   • Violence Detection - Safety check on responses")
print("   • Fluency - Quality of natural language output")
print("   • Task Adherence - Correct tool usage and response")

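# These are review points, not automated findings: this cell does not inspect the results
print("\n💼 FSI Compliance Points to Verify:")
print("   • Tool calls are captured in output_items and can be audited")
print("   • Response reflects the account information returned by the tools")
print("   • No sensitive data is exposed beyond what was requested")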

print("\n📝 Key Differences from Basic Evaluation:")
print("   • Uses 'azure_ai_responses' data source type")
print("   • Evaluates specific response by ID")
print("   • Captures tool call information in output_items")

if response_eval_run.report_url:
    print(f"\n🔗 View detailed report: {response_eval_run.report_url}")

## 11. Cleanup

In [None]:
# Clean up resources (uncomment to run)
# try:
#     openai_client.evals.delete(eval_id=eval_object.id)
#     print("🗑️ Evaluation deleted")
# except Exception as e:
#     print(f"⚠️ Could not delete evaluation: {e}")

# try:
#     project_client.agents.delete(agent_name=agent.name)
#     print("🗑️ Agent deleted")
# except Exception as e:
#     print(f"⚠️ Could not delete agent: {e}")

# print("\n✅ Cleanup completed!")

## 🎯 Summary

In this notebook, you learned how to:

✅ **Define function tools** for banking operations (balance lookup, transactions)  
✅ **Create an agent with tools** integrated  
✅ **Handle function calls** and provide results back to the agent  
✅ **Evaluate tool-enabled responses** using `azure_ai_responses` data source  
✅ **Analyze results** including tool usage information  

### 🔧 Key APIs Used

| API | Purpose |
|-----|--------|
| `FunctionTool()` | Define a callable tool for the agent |
| `openai_client.responses.create()` | Get agent response with tool calls |
| `FunctionCallOutput()` | Provide function results back to agent |
| `azure_ai_responses` data source | Evaluate specific response by ID |

### 📊 Evaluation Data Sources

| Data Source Type | Use Case |
|------------------|----------|
| `azure_ai_target_completions` | Evaluate agent with test queries |
| `azure_ai_responses` | Evaluate specific response by ID |

### 📚 Next Steps

1. **Add more tools** for comprehensive banking functionality
2. **Test edge cases** like invalid accounts or errors
3. **Add custom evaluators** for domain-specific criteria
4. **Integrate into CI/CD** for continuous agent validation
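
For the CI/CD step, one simple approach is to turn a run's result counts into a pass/fail exit code. This is a sketch under assumptions: `evaluation_gate` is a hypothetical helper, and the `passed`, `failed`, and `errored` keys are assumed to mirror the fields of the run's `result_counts`, which you should confirm against the object your SDK version returns.

```python
def evaluation_gate(result_counts: dict, min_pass_rate: float = 1.0) -> int:
    """Return 0 if the pass rate meets the threshold, otherwise 1 (process exit code style)."""
    passed = result_counts.get("passed", 0)
    failed = result_counts.get("failed", 0)
    errored = result_counts.get("errored", 0)
    total = passed + failed + errored
    if total == 0:
        return 1  # Nothing was evaluated; treat that as a failure
    return 0 if passed / total >= min_pass_rate else 1


# Illustrative counts, not real results
print(evaluation_gate({"passed": 9, "failed": 1, "errored": 0}, min_pass_rate=0.9))
print(evaluation_gate({"passed": 2, "failed": 1, "errored": 0}, min_pass_rate=0.9))
```

A pipeline script could pass the returned value to `sys.exit()` so that a run below the threshold fails the build.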
