# 📊 Day 5: Evaluation Results

This notebook provides an interactive view of the agent's performance against synthetic test cases. The evaluations are generated by a **Llama 3.3 70B** model acting as a judge.

In [1]:
import json
import pandas as pd
from IPython.display import display, Markdown

# 1. Load the evaluation results generated by evaluate_batch.py
try:
    with open('evaluation_results.json', 'r', encoding='utf-8') as f:
        results = json.load(f)
    print(f"Successfully loaded {len(results)} evaluation records.")
except FileNotFoundError:
    print("Error: evaluation_results.json not found. Please run 'python evaluate_batch.py' first.")
    results = []

Successfully loaded 2 evaluation records.
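
The cells in this notebook read a small set of keys from each record with `.get()`, so a malformed record shows up only as blank cells rather than as an error. Below is a minimal sketch of a shape check, assuming the record layout implied by those `.get()` calls. The helper name `find_malformed_records` is hypothetical and is not part of `evaluate_batch.py`.

```python
def find_malformed_records(records):
    """Return (index, problem) pairs for records that do not match the
    layout this notebook relies on.

    Assumed layout (inferred from the notebook's .get() calls only):
    test_case_id, question, agent_response, and an evaluation object
    holding summary and checklist.
    """
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "record is not an object"))
            continue
        for key in ("test_case_id", "question", "agent_response"):
            if key not in rec:
                problems.append((i, f"missing key: {key}"))
        evaluation = rec.get("evaluation")
        if not isinstance(evaluation, dict):
            problems.append((i, "evaluation is missing or not an object"))
            continue
        checklist = evaluation.get("checklist")
        if not isinstance(checklist, list):
            problems.append((i, "evaluation.checklist is missing or not a list"))
            continue
        for j, check in enumerate(checklist):
            if not isinstance(check, dict) or "check_name" not in check:
                problems.append((i, f"checklist[{j}] has no check_name"))
                continue
            value = check.get("check_pass")
            # JSON true/false/null map to Python True/False/None.
            if not (value is None or isinstance(value, bool)):
                problems.append((i, f"checklist[{j}].check_pass is not true/false/null"))
    return problems
```

Calling `find_malformed_records(results)` right after loading returns an empty list when every record has the assumed shape.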


## 📈 Results Summary Board

In [2]:
# 2. Flatten data for a clean table view
display_data = []
for res in results:
    row = {
        "ID": res.get('test_case_id'),
        "Question": res.get('question'),
        "Verdict Summary": res.get('evaluation', {}).get('summary', '')
    }
    
    # Add individual checks with icons
    checklist = res.get('evaluation', {}).get('checklist', [])
    for check in checklist:
        name = check.get('check_name')
        val = check.get('check_pass')
        if val is True:
            icon = "✅"
        elif val is False:
            icon = "❌"
        else:
            icon = "➖"  # Null/Unknown
        row[name] = icon
        
    display_data.append(row)

if display_data:
    df = pd.DataFrame(display_data)
    # Reorder columns to put interesting metrics first
    cols = ["ID", "Question"] + [c for c in df.columns if c not in ["ID", "Question", "Verdict Summary"]] + ["Verdict Summary"]
    display(df[cols])
else:
    print("No data to display.")

| | ID | Question | instructions_follow | answer_relevant | answer_clear | answer_citations | completeness | tool_call_search | Verdict Summary |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | test_1 | What is the mechanism used by the API to preve... | ✅ | ✅ | ✅ | ➖ | ✅ | ✅ | The agent's response is of high quality, follo... |
| 1 | test_2 | What is the specific comparison method used in... | ✅ | ✅ | ✅ | ➖ | ✅ | ✅ | The agent's response is accurate, relevant, an... |
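
With more than a handful of test cases, the icon grid becomes hard to scan, and a per-check tally is easier to read. The following is a sketch under the same assumed record layout as the cell above. The function names `tally_checks` and `pass_rate` are illustrative, not part of the project.

```python
from collections import Counter


def tally_checks(records):
    """Count pass / fail / skip results per check name across all records."""
    tallies = {}
    for rec in records:
        for check in rec.get("evaluation", {}).get("checklist", []):
            counts = tallies.setdefault(check.get("check_name"), Counter())
            value = check.get("check_pass")
            if value is True:
                counts["pass"] += 1
            elif value is False:
                counts["fail"] += 1
            else:
                counts["skip"] += 1  # null or missing verdict
    return tallies


def pass_rate(counts):
    """Pass rate over decided checks only; None if every result was skipped."""
    decided = counts["pass"] + counts["fail"]
    return counts["pass"] / decided if decided else None
```

Skipped checks are left out of the denominator here so that a check the judge could not assess (such as `answer_citations` in the table above) does not count against the agent. That is a design choice, not something the judge output prescribes.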


## 🔍 Deep Dive: Justifications
The tables below show the judge's result and justification for each check, one table per test case.

In [3]:
for res in results:
    display(Markdown(f"### Test {res.get('test_case_id')}: {res.get('question')}"))
    display(Markdown(f"**Agent Response:**\n> {res.get('agent_response')}"))
    
    checks_md = "| Check | Result | Justification |\n| :--- | :--- | :--- |\n"
    for check in res.get('evaluation', {}).get('checklist', []):
        passed = check.get('check_pass')
        status = "✅ PASS" if passed is True else ("❌ FAIL" if passed is False else "➖ SKIP")
        checks_md += f"| {check.get('check_name')} | {status} | {check.get('justification')} |\n"
    
    display(Markdown(checks_md))
    display(Markdown("---"))

### Test test_1: What is the mechanism used by the API to prevent conflicting task updates?

**Agent Response:**
> The API uses optimistic concurrency control to prevent conflicting task updates. This is implemented through the ability to update tasks, where the API checks for any potential conflicts based on timestamps.

| Check | Result | Justification |
| :--- | :--- | :--- |
| instructions_follow | ✅ PASS | The agent followed its system instructions by using the 'text_search' tool to look up information before answering. |
| answer_relevant | ✅ PASS | The response directly answers the user's specific question about the mechanism used by the API to prevent conflicting task updates. |
| answer_clear | ✅ PASS | The answer is easy to understand and technically accurate, explaining the use of optimistic concurrency control and timestamp-based conflict checks. |
| answer_citations | ➖ SKIP | There are no source files to cite in this response. |
| completeness | ✅ PASS | The answer covers all parts of the user's inquiry, providing a clear explanation of the API's conflict prevention mechanism. |
| tool_call_search | ✅ PASS | The agent used the 'text_search' tool to find facts before answering the user's question. |


---

### Test test_2: What is the specific comparison method used in the optimistic locking mechanism of the API?

**Agent Response:**
> The comparison method used in the optimistic locking mechanism of the API is timestamp comparison.

| Check | Result | Justification |
| :--- | :--- | :--- |
| instructions_follow | ✅ PASS | The agent used the 'text_search' tool to look up information before answering, following rule 1. |
| answer_relevant | ✅ PASS | The response directly answers the user's specific question about the comparison method used in the optimistic locking mechanism of the API. |
| answer_clear | ✅ PASS | The answer is easy to understand and technically accurate, stating the comparison method as timestamp comparison. |
| answer_citations | ➖ SKIP | There are no source files cited in the response, but the answer is based on the search results provided by the 'text_search' tool. |
| completeness | ✅ PASS | The answer covers the user's inquiry about the comparison method used in the optimistic locking mechanism of the API. |
| tool_call_search | ✅ PASS | The agent used the 'text_search' tool to find facts before answering the user's question. |


---
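
When most checks pass, the justifications worth reading first are the ones attached to a FAIL or SKIP. A minimal sketch that collects only those rows, again assuming the record layout used throughout this notebook. The helper name `non_passing_checks` is hypothetical.

```python
def non_passing_checks(records):
    """Collect every check the judge failed or skipped, with its justification."""
    rows = []
    for rec in records:
        for check in rec.get("evaluation", {}).get("checklist", []):
            value = check.get("check_pass")
            if value is True:
                continue  # passed checks are not listed
            rows.append({
                "ID": rec.get("test_case_id"),
                "Check": check.get("check_name"),
                "Result": "FAIL" if value is False else "SKIP",
                "Justification": check.get("justification"),
            })
    return rows
```

The returned list of dicts can be passed to `pd.DataFrame(...)` and shown with `display`, as in the summary cell. For the two records shown above, this should leave only the two skipped `answer_citations` checks, provided the JSON matches the tables rendered here.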