# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent against a set of realistic scenarios and evaluate how well it helps customers optimize their energy usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [None]:
from datetime import datetime
from agent import Agent

In [None]:
ECOHOME_SYSTEM_PROMPT = """You are EcoHome Energy Advisor, an expert assistant for smart-home energy optimization.

Your goals:
1. Reduce user electricity cost while keeping comfort and practicality.
2. Reduce carbon impact by prioritizing solar generation and lower-demand periods.
3. Provide clear, actionable recommendations with concrete times and expected tradeoffs.

Rules:
- Use tools whenever recommendations depend on weather, prices, historical usage, or savings math.
- Prefer specific schedules (hours/time windows), not vague guidance.
- If data is missing, state assumptions explicitly.
- Include brief reasoning: why this schedule is better.
- Include estimated savings when possible.
- Keep answers concise and practical.

When relevant, optimize around:
- EV charging
- Thermostat/HVAC operation
- Appliance scheduling (dishwasher, laundry, water heater)
- Solar self-consumption and grid export/import balance

If the user asks for multiple options, provide a ranked list with pros/cons."""

In [None]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)

In [None]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [None]:
print(response["messages"][-1].content)

In [None]:
print("TOOLS:")
for msg in response["messages"]:
    obj = msg.model_dump()
    if obj.get("tool_call_id"):
        print("-", msg.name)

## 2. Define Test Cases

In [None]:
# Test cases covering EV, thermostat, appliances, solar, and savings scenarios.

In [None]:
test_cases = [
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "The response should contain time recommendation, cost analysis and solar consideration",
    },
    {
        "id": "ev_charging_2",
        "question": "I need 30 kWh for my EV by 7 AM. What is the cheapest charging schedule tonight?",
        "expected_tools": ["get_electricity_prices", "calculate_energy_savings"],
        "expected_response": "The response should include overnight off-peak timing and cost estimate",
    },
    {
        "id": "thermostat_1",
        "question": "What thermostat setting should I use Wednesday afternoon if prices spike?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "The response should include temperature setpoint strategy and peak pricing avoidance",
    },
    {
        "id": "thermostat_2",
        "question": "How can I pre-cool my home to reduce HVAC costs during evening peak hours?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "The response should include pre-cooling window, peak-hour behavior, and comfort tradeoff",
    },
    {
        "id": "appliance_1",
        "question": "When should I run my dishwasher and laundry tomorrow for the lowest bill?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "The response should recommend specific off-peak times for both appliances",
    },
    {
        "id": "appliance_2",
        "question": "I can run my water heater for only 3 hours today. Which hours are best?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "The response should include a 3-hour schedule based on low prices",
    },
    {
        "id": "solar_1",
        "question": "What is the best time tomorrow to run heavy loads to use maximum solar generation?",
        "expected_tools": ["get_weather_forecast"],
        "expected_response": "The response should identify midday solar window and suitable loads",
    },
    {
        "id": "solar_2",
        "question": "Should I delay EV charging to noon tomorrow if it is sunny?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "The response should compare solar availability vs electricity rates",
    },
    {
        "id": "history_1",
        "question": "Based on my last 7 days of usage, what are 3 ways to reduce consumption?",
        "expected_tools": ["query_energy_usage", "search_energy_tips"],
        "expected_response": "The response should reference usage patterns and provide 3 actionable recommendations",
    },
    {
        "id": "savings_1",
        "question": "How much can I save if I reduce HVAC use from 18 kWh/day to 14 kWh/day?",
        "expected_tools": ["calculate_energy_savings"],
        "expected_response": "The response should include daily and annual savings estimates",
    },
]

if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
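
Before running the suite, it can help to check that every test case is well-formed. The sketch below is one possible check, not part of the notebook's `agent` module: `validate_test_cases` and `REQUIRED_KEYS` are hypothetical names, and `sample_cases` is a made-up input used only to show the output shape.

```python
# Hypothetical helper: the keys mirror the dictionaries in `test_cases` above.
REQUIRED_KEYS = {"id", "question", "expected_tools", "expected_response"}


def validate_test_cases(cases):
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        # Report any required key that is absent from this case.
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
        # Ids should be unique so failures can be traced back to one case.
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
        # An empty tool list makes the tool-usage evaluation meaningless.
        if not case.get("expected_tools"):
            problems.append(f"case {index}: expected_tools is empty")
    return problems


# Illustrative input only (not taken from the real test suite).
sample_cases = [
    {"id": "a", "question": "q1", "expected_tools": ["get_electricity_prices"], "expected_response": "r1"},
    {"id": "a", "question": "q2", "expected_tools": [], "expected_response": "r2"},
]
sample_problems = validate_test_cases(sample_cases)
for problem in sample_problems:
    print(problem)
```

Calling `validate_test_cases(test_cases)` on the real list and failing fast on a non-empty result would complement the length check above.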

## 3. Run Agent Tests

In [None]:
CONTEXT = "Location: San Francisco, CA"

In [None]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases, start=1):
    print(f"\nTest {i}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")


In [None]:
test_results

## 4. Evaluate Responses

In [None]:
def _extract_tool_names(messages):
    tool_names = []
    for msg in messages:
        obj = msg.model_dump() if hasattr(msg, "model_dump") else {}
        name = getattr(msg, "name", None) or obj.get("name")
        if obj.get("tool_call_id") and name:
            tool_names.append(name)
    return tool_names
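
To see what `_extract_tool_names` picks up without calling the agent, the same logic can be exercised on stand-in messages. `FakeMessage` below is a hypothetical stub that mimics only the `name`, `content`, and `model_dump()` surface used above; real agent messages may carry more fields. The function is repeated under a different name so the cell runs on its own.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FakeMessage:
    """Hypothetical stand-in for an agent message (illustration only)."""
    content: str
    name: Optional[str] = None
    tool_call_id: Optional[str] = None

    def model_dump(self):
        return asdict(self)


def extract_tool_names(messages):
    # Same logic as _extract_tool_names above: a message counts as a tool
    # result when it has both a tool_call_id and a name.
    tool_names = []
    for msg in messages:
        obj = msg.model_dump() if hasattr(msg, "model_dump") else {}
        name = getattr(msg, "name", None) or obj.get("name")
        if obj.get("tool_call_id") and name:
            tool_names.append(name)
    return tool_names


conversation = [
    FakeMessage(content="When should I charge my EV?"),
    FakeMessage(content="", name=None),  # assistant turn requesting tools
    FakeMessage(content="sunny", name="get_weather_forecast", tool_call_id="call_1"),
    FakeMessage(content="off-peak overnight", name="get_electricity_prices", tool_call_id="call_2"),
    FakeMessage(content="Charge around midday."),
]
tool_names = extract_tool_names(conversation)
print(tool_names)
```

Only the two messages that carry a `tool_call_id` are reported, in call order.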

In [None]:
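import re

# Template words from "The response should include ..." that would match almost any answer.
_EXPECTATION_STOPWORDS = {
    "the", "response", "should", "contain", "include", "reference", "recommend",
    "identify", "compare", "provide", "and", "a", "an", "for", "of", "on", "vs", "both", "based",
}


def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response against expected response using simple heuristics."""
    text = (final_response or "").lower()
    exp = (expected_response or "").lower()

    score = 0
    notes = []

    if len(text) > 120:
        score += 1
        notes.append("Response is sufficiently detailed")
    else:
        notes.append("Response may be too brief")

    # "am"/"pm" are matched only as clock times (e.g. "7 am", "10:30pm") so they
    # do not fire inside words such as "same" or "example".
    action_keywords = ["recommend", "should", "schedule", "between", "hour"]
    has_clock_time = re.search(r"\b\d{1,2}(:\d{2})?\s*[ap]\.?m\b", text) is not None
    if has_clock_time or any(k in text for k in action_keywords):
        score += 1
        notes.append("Contains actionable guidance")
    else:
        notes.append("Missing actionable schedule guidance")

    value_keywords = ["$", "save", "cost", "kwh", "annual", "usd"]
    if any(k in text for k in value_keywords):
        score += 1
        notes.append("Includes cost/savings context")
    else:
        notes.append("Missing cost/savings context")

    # Compare on content words only; filler such as "the" or "should" matches any answer.
    exp_tokens = [t for t in re.findall(r"[a-z0-9-]+", exp) if t not in _EXPECTATION_STOPWORDS]
    if any(token in text for token in exp_tokens):
        score += 1
        notes.append("Aligned with expected intent")
    else:
        notes.append("Weak alignment with expected intent")

    return {
        "score": score,
        "max_score": 4,
        "passed": score >= 3,
        "notes": notes,
    }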
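
Keyword heuristics like the ones in `evaluate_response` are sensitive to how matching is done: a plain substring test for a short keyword such as "am" also fires inside words like "same". The sketch below contrasts substring and whole-word matching. `contains_keyword` is a hypothetical helper, not part of the notebook, and the sample sentence is made up.

```python
import re


def contains_keyword(text, keyword):
    """Return True if `keyword` occurs as a whole word in `text` (case-insensitive)."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample = "Keep the same thermostat setting as in the example."

substring_hit = "am" in sample.lower()           # matches inside "same" and "example"
whole_word_hit = contains_keyword(sample, "am")   # no standalone "am" in the sentence
clock_hit = contains_keyword("Charge at 2 am", "am")

print(substring_hit, whole_word_hit, clock_hit)
```

Whole-word matching trades some recall (it will not match "7am" written without a space) for far fewer false positives, so the right choice depends on how the agent formats times.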

In [None]:
def evaluate_tool_usage(messages, expected_tools):
    """Evaluate if the right tools were used."""
    used_tools = _extract_tool_names(messages)
    used_set = set(used_tools)
    expected_set = set(expected_tools or [])

    matched = sorted(expected_set & used_set)
    missing = sorted(expected_set - used_set)
    extra = sorted(used_set - expected_set)

    precision = len(matched) / max(1, len(used_set))
    recall = len(matched) / max(1, len(expected_set))
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0

    return {
        "used_tools": used_tools,
        "expected_tools": sorted(expected_set),  # sorted for a stable, reproducible report
        "matched": matched,
        "missing": missing,
        "extra": extra,
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
        "passed": recall >= 0.7,
    }
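
A worked example makes the precision, recall, and F1 arithmetic in `evaluate_tool_usage` concrete. The sketch below repeats only the set arithmetic under a hypothetical name, `tool_prf`; the tool lists are illustrative inputs, not recorded agent runs.

```python
def tool_prf(used, expected):
    """Return (precision, recall, f1) for used vs. expected tool names, rounded to 2 places."""
    used_set, expected_set = set(used), set(expected)
    matched = used_set & expected_set
    # max(1, ...) avoids division by zero when a set is empty.
    precision = len(matched) / max(1, len(used_set))
    recall = len(matched) / max(1, len(expected_set))
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return round(precision, 2), round(recall, 2), round(f1, 2)


# One of two expected tools used, plus one unexpected tool.
partial = tool_prf(
    ["get_weather_forecast", "search_energy_tips"],
    ["get_weather_forecast", "get_electricity_prices"],
)

# Only one of two expected tools used, nothing extra.
one_of_two = tool_prf(
    ["get_electricity_prices"],
    ["get_electricity_prices", "calculate_energy_savings"],
)

# No tools used at all.
none_used = tool_prf([], ["get_electricity_prices"])

print(partial, one_of_two, none_used)
```

With the `recall >= 0.7` threshold, a test case that expects two tools passes only if both are called: using one of the two gives a recall of 0.5, even when precision is 1.0.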

In [None]:
def generate_evaluation_report():
    """Generate evaluation metrics across all test_results."""
    report_rows = []
    for result in test_results:
        if isinstance(result.get("response"), dict):
            messages = result["response"].get("messages", [])
            final_response = messages[-1].content if messages else ""
        else:
            messages = []
            final_response = str(result.get("response", ""))

        response_eval = evaluate_response(
            question=result["question"],
            final_response=final_response,
            expected_response=result["expected_response"],
        )
        tool_eval = evaluate_tool_usage(messages, result["expected_tools"])

        report_rows.append({
            "test_id": result["test_id"],
            "response_score": response_eval["score"],
            "response_max": response_eval["max_score"],
            "response_passed": response_eval["passed"],
            "tool_f1": tool_eval["f1"],
            "tool_passed": tool_eval["passed"],
            "missing_tools": tool_eval["missing"],
        })

    total = len(report_rows)
    avg_response = round(sum(r["response_score"] / r["response_max"] for r in report_rows) / max(1, total), 2)
    avg_tool_f1 = round(sum(r["tool_f1"] for r in report_rows) / max(1, total), 2)
    pass_rate = round(sum(1 for r in report_rows if r["response_passed"] and r["tool_passed"]) / max(1, total), 2)

    summary = {
        "total_tests": total,
        "avg_response_score_ratio": avg_response,
        "avg_tool_f1": avg_tool_f1,
        "overall_pass_rate": pass_rate,
        "failed_tests": [r["test_id"] for r in report_rows if not (r["response_passed"] and r["tool_passed"])],
        "details": report_rows,
    }

    return summary
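
Once `generate_evaluation_report()` has produced a summary, a short formatter makes it easier to read than the raw dictionary. The sketch below is an assumption about how one might present it: `format_summary` is a hypothetical helper, and `placeholder_summary` holds made-up values with the same keys as the summary above, not results from an actual run.

```python
def format_summary(summary):
    """Render the summary dictionary as a short plain-text report."""
    failed = ", ".join(summary["failed_tests"]) or "none"
    lines = [
        "=== Evaluation Summary ===",
        f"Total tests:        {summary['total_tests']}",
        f"Avg response score: {summary['avg_response_score_ratio']:.0%}",
        f"Avg tool F1:        {summary['avg_tool_f1']:.2f}",
        f"Overall pass rate:  {summary['overall_pass_rate']:.0%}",
        f"Failed tests:       {failed}",
    ]
    return "\n".join(lines)


# Placeholder values for illustration only; not real evaluation results.
placeholder_summary = {
    "total_tests": 2,
    "avg_response_score_ratio": 0.75,
    "avg_tool_f1": 0.5,
    "overall_pass_rate": 0.5,
    "failed_tests": ["example_case"],
    "details": [],
}
report_text = format_summary(placeholder_summary)
print(report_text)
```

In the notebook itself, the equivalent call would be `print(format_summary(generate_evaluation_report()))`, run after the test loop has populated `test_results`.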