# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent against a set of real-world scenarios, then evaluate how well its answers and tool calls help customers optimize their energy usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [1]:
from datetime import datetime
from agent import Agent
import json

In [2]:
# TODO: Create the agent's instructions

ECOHOME_SYSTEM_PROMPT = """
You are EcoHome, an AI assistant specialized in helping users save energy and reduce their carbon footprint at home. Your expertise covers residential energy optimization, including electricity pricing, weather patterns, appliance usage, and sustainable practices.

## Key Capabilities
You have access to the following tools to provide informed recommendations:
- **get_weather_forecast**: Retrieve current and forecasted weather conditions for energy planning
- **get_electricity_prices**: Access real-time and historical electricity pricing data
- **get_energy_saving_tips**: Get evidence-based tips for reducing energy consumption
- **get_solar_power_tips**: Obtain strategies for maximizing solar power utilization
- **calculate_energy_savings**: Compute potential cost and environmental savings from energy-efficient changes

## Step-by-Step Process
When responding to user queries, follow this structured approach:

1. **Analyze the Query**: Understand the user's specific energy-related question or concern
2. **Gather Context**: Use available tools to collect relevant data (weather, pricing, tips, calculations)
3. **Evaluate Options**: Consider multiple factors including cost, environmental impact, and practicality
4. **Provide Recommendations**: Structure your response with clear, actionable advice
5. **Quantify Benefits**: Include specific savings estimates, timeframes, and measurable outcomes
6. **Explain Reasoning**: Clearly justify your recommendations based on the data collected

## Response Structure
Format your recommendations using this clear structure:

### Current Situation Analysis
- Brief assessment of the user's current energy usage scenario

### Recommended Actions
- Numbered list of specific, actionable steps
- Include timing, settings, or behavioral changes
- Prioritize by impact and ease of implementation

### Expected Benefits
- Quantified savings (cost, energy, carbon footprint)
- Timeline for seeing results
- Additional advantages (comfort, convenience)

### Implementation Tips
- Practical advice for executing recommendations
- Potential challenges and solutions
- Monitoring suggestions

## Example Interactions

**Example 1: EV Charging Optimization**
User: "When should I charge my electric car tomorrow?"
Your response should analyze weather forecasts, electricity prices, and provide specific charging windows with cost comparisons.

**Example 2: Thermostat Settings**
User: "What's the best thermostat setting for winter?"
Your response should reference energy-saving tips, calculate potential savings, and provide temperature recommendations with comfort considerations.

**Example 3: Appliance Scheduling**
User: "When should I run my dishwasher?"
Your response should check electricity pricing patterns and recommend optimal time slots with cost-benefit analysis.

**Example 4: Solar Power Usage**
User: "How can I maximize my solar panels this weekend?"
Your response should use weather forecasts and solar tips to suggest usage strategies and energy storage options.

Always prioritize practical, cost-effective solutions that balance energy savings with user comfort and convenience. Be specific with numbers, times, and measurable outcomes.
"""

In [3]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)

In [4]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [5]:
print(response["messages"][-1].content)

### Current Situation Analysis
Tomorrow in San Francisco, the weather is expected to be sunny with good solar irradiance, especially in the afternoon. The electricity pricing follows a time-of-use model, with lower rates during off-peak hours (late night and early morning) and higher rates during peak hours (midday to early evening).

### Recommended Actions
1. **Charge During Off-Peak Hours**: 
   - **Best Time**: Start charging your electric car between **10 PM and 6 AM** when electricity rates are at their lowest (around $0.11 to $0.13 per kWh).
   - **Alternative Time**: If you prefer to charge during the day, consider charging from **10 PM to 12 AM** and then again from **10 PM to 12 AM** the next night, as rates are lower during these hours.

2. **Maximize Solar Power Utilization**:
   - If you have a solar power system, consider charging your car during the peak solar generation hours, which will be from **11 AM to 3 PM**. However, note that the rates during this time are higher

In [6]:
print("TOOLS:")
for msg in response["messages"]:
    obj = msg.model_dump()
    if obj.get("tool_call_id"):
        print("-", msg.name)

TOOLS:
- get_weather_forecast
- get_electricity_prices


## 2. Define Test Cases

In [7]:
# TODO: Define comprehensive test cases for the Energy Advisor
# Create 10 test cases covering different scenarios:
# - EV charging optimization
# - Thermostat settings
# - Appliance scheduling
# - Solar power maximization
# - Cost savings calculations

In [8]:
test_cases = [
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "The response should contain time recommendation, cost analysis and solar consideration",
    },
    {
        "id": "thermostat_setting_1",
        "question": "What is the optimal thermostat setting for my home during a heatwave?",
        "expected_tools": ["get_weather_forecast", "get_energy_saving_tips"],
        "expected_response": "The response should include recommended temperature settings and energy-saving tips",
    },
    {
        "id": "appliance_scheduling_1",
        "question": "When should I run my dishwasher to save on energy costs?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "The response should provide specific time slots for running the dishwasher",
    },
    {
        "id": "solar_power_maximization_1",
        "question": "How can I maximize my solar power usage this weekend?",
        "expected_tools": ["get_weather_forecast", "get_solar_power_tips"],
        "expected_response": "The response should include strategies for maximizing solar power usage",
    },
    {
        "id": "cost_savings_calculation_1",
        "question": "How much can I save by adjusting my thermostat by 2 degrees?",
        "expected_tools": ["calculate_energy_savings"],
        "expected_response": "The response should provide a detailed cost savings calculation",
    },
    {
        "id": "ev_charging_2",
        "question": "Is it cheaper to charge my EV at night or during the day?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "The response should compare costs and provide a recommendation",
    },
    {
        "id": "thermostat_setting_2",
        "question": "What thermostat settings should I use in winter to save energy?",
        "expected_tools": ["get_energy_saving_tips"],
        "expected_response": "The response should include recommended temperature settings and energy-saving tips",
    },
    {
        "id": "appliance_scheduling_2",
        "question": "When is the best time to run my washing machine to save energy?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "The response should provide specific time slots for running the washing machine",
    },
    {
        "id": "solar_power_maximization_2",
        "question": "What are the best practices for using solar power during cloudy days?",
        "expected_tools": ["get_solar_power_tips"],
        "expected_response": "The response should include strategies for maximizing solar power usage on cloudy days",
    },
    {
        "id": "cost_savings_calculation_2",
        "question": "How much can I save by using energy-efficient appliances?",
        "expected_tools": ["calculate_energy_savings"],
        "expected_response": "The response should provide a detailed cost savings calculation based on appliance efficiency",
    }
]

if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
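
The length check above only counts the cases. A minimal sketch of a stricter check follows, assuming the five tool names listed in the system prompt are the only valid ones. `KNOWN_TOOLS`, `REQUIRED_KEYS`, and `validate_test_cases` are hypothetical names introduced here for illustration, not part of the `agent` module.

```python
# Tool names copied from the system prompt; the helper itself is a hypothetical sketch.
KNOWN_TOOLS = {
    "get_weather_forecast",
    "get_electricity_prices",
    "get_energy_saving_tips",
    "get_solar_power_tips",
    "calculate_energy_savings",
}
REQUIRED_KEYS = {"id", "question", "expected_tools", "expected_response"}


def validate_test_cases(cases):
    """Return a list of readable problems; an empty list means the cases look consistent."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case #{index}>")
        # Every case needs the keys that the test loop reads later.
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        # Duplicate ids would make the report ambiguous.
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        # A misspelled tool name can never be matched by the tool evaluator.
        unknown = set(case.get("expected_tools", [])) - KNOWN_TOOLS
        if unknown:
            problems.append(f"{label}: unknown tools {sorted(unknown)}")
    return problems


# Illustrative data only: one well-formed case and one deliberately broken case.
sample_cases = [
    {"id": "ok_1", "question": "q", "expected_tools": ["get_electricity_prices"], "expected_response": "r"},
    {"id": "bad_1", "question": "q", "expected_tools": ["get_prices"]},
]
print(validate_test_cases(sample_cases))
```

Calling `validate_test_cases(test_cases)` before running the agent would surface a misspelled tool name early, instead of as a low tool-completeness score later.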

## 3. Run Agent Tests

In [9]:
CONTEXT = "Location: San Francisco, CA"

In [10]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases, start=1):
    print(f"\nTest {i}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")


=== Running Agent Tests ===

Test 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?
--------------------------------------------------

Test 2: thermostat_setting_1
Question: What is the optimal thermostat setting for my home during a heatwave?
--------------------------------------------------

Test 3: appliance_scheduling_1
Question: When should I run my dishwasher to save on energy costs?
--------------------------------------------------

Test 4: solar_power_maximization_1
Question: How can I maximize my solar power usage this weekend?
--------------------------------------------------

Test 4: solar_power_maxim

In [11]:
test_results

[{'test_id': 'ev_charging_1',
  'question': 'When should I charge my electric car tomorrow to minimize cost and maximize solar power?',
  'response': {'messages': [SystemMessage(content='Location: San Francisco, CA', additional_kwargs={}, response_metadata={}, id='2f40f677-b82d-411e-8a5d-930827a5edd8'),
    HumanMessage(content='When should I charge my electric car tomorrow to minimize cost and maximize solar power?', additional_kwargs={}, response_metadata={}, id='138c5bd2-673b-4dd9-a0fa-4b5c246ff3ea'),
    AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 61, 'prompt_tokens': 1449, 'total_tokens': 1510, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_29330a9688', 'id': 'ch
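
The raw dump of `test_results` is hard to scan. As a rough sketch, the results can be split into successful and failed runs using the `'error'` key that the `except` branch of the test loop adds. `split_results` is a hypothetical helper, and the sample data below is made up purely for illustration.

```python
def split_results(results):
    """Separate successful runs from failed ones.

    Relies on the 'error' key, which the test loop only sets when invoke() raised.
    """
    succeeded = [r["test_id"] for r in results if "error" not in r]
    failed = {r["test_id"]: r["error"] for r in results if "error" in r}
    return succeeded, failed


# Made-up results that mimic the two shapes produced by the test loop.
sample_results = [
    {"test_id": "ev_charging_1", "response": {"messages": []}},
    {"test_id": "ev_charging_2", "response": "Error: timeout", "error": "timeout"},
]
succeeded, failed = split_results(sample_results)
print("succeeded:", succeeded)
print("failed:", failed)
```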

## 4. Evaluate Responses

In [12]:
# TODO: Implement evaluation functions
# Create functions to evaluate:
# - Final Response
# - Tool usage

In [13]:
# LLM as a judge

EVAL_PROMPT = """
You are an expert evaluator for AI assistant responses in the context of energy-saving advice.

Given:
- Question: {question}
- Expected Response Description: {expected}
- Actual Response: {actual}

Evaluate the Actual Response on the following criteria, providing a score from 1-5 (1=poor, 5=excellent) and a brief explanation:

- ACCURACY: How accurate and factually correct is the information provided?
- RELEVANCE: How well does the response address the specific question asked?
- COMPLETENESS: How complete is the response in covering all aspects of the question?
- USEFULNESS: How practical and actionable is the advice for the user?

Output your evaluation in the following JSON format:
{{
  "accuracy": {{"score": <number>, "explanation": "<brief explanation>"}},
  "relevance": {{"score": <number>, "explanation": "<brief explanation>"}},
  "completeness": {{"score": <number>, "explanation": "<brief explanation>"}},
  "usefulness": {{"score": <number>, "explanation": "<brief explanation>"}}
}}

Do not include any other text outside the JSON.
"""

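# Give the judge a short, static system prompt. The filled-in EVAL_PROMPT is
# sent as the question in evaluate_response; passing the raw template here
# would expose unformatted {placeholders} and doubled braces to the model.
JUDGE_INSTRUCTIONS = "You are an expert evaluator of energy-saving advice. Reply with JSON only."

llm_judge = Agent(
    instructions=JUDGE_INSTRUCTIONS,
)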

In [14]:
# TODO: Create a response evaluator
def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response using LLM as judge"""
    # Format the evaluation prompt
    eval_input = EVAL_PROMPT.format(
        question=question,
        expected=expected_response,
        actual=final_response
    )
    
    # Call the LLM judge
    judge_response = llm_judge.invoke(
        question=eval_input,
        context=""
    )
    
    # Extract the JSON from the response
    response_content = judge_response["messages"][-1].content
    try:
        # Parse the JSON
        evaluation = json.loads(response_content)
        return evaluation
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        return {
            "accuracy": {"score": 3, "explanation": "Unable to parse evaluation"},
            "relevance": {"score": 3, "explanation": "Unable to parse evaluation"},
            "completeness": {"score": 3, "explanation": "Unable to parse evaluation"},
            "usefulness": {"score": 3, "explanation": "Unable to parse evaluation"}
        }
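
Judge models sometimes wrap their JSON in a code fence or add a sentence around it, in which case `json.loads` fails and every metric silently falls back to 3. The sketch below is one possible way to recover the object first. `parse_judge_json` is a hypothetical helper, and returning `None` on failure is an assumption made here so that callers can tell an unparsed evaluation apart from a genuine mid-range score.

```python
import json
import re


def parse_judge_json(text):
    """Parse a JSON object from judge output, tolerating fences and surrounding text.

    Returns the parsed dict, or None when no JSON object can be recovered.
    """
    # Try the raw text first: this is the format the prompt asks for.
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass
    # Otherwise take the span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# The fence is built at runtime so this example contains no literal fence.
fence = "`" * 3
wrapped = fence + 'json\n{"accuracy": {"score": 4, "explanation": "ok"}}\n' + fence
print(parse_judge_json(wrapped))
print(parse_judge_json("no json here"))
```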

In [15]:
# TODO: Create a tool usage evaluator
def evaluate_tool_usage(messages, expected_tools):
    """Evaluate if the right tools were used"""
    print("Evaluating tool usage...")

    for msg in messages:
        print(f"Message: {msg.content}, Name: {msg.name}")

    # get all used tool names from messages
    used_tools = [msg.name for msg in messages if msg.model_dump().get("tool_call_id")]
    used_tool_set = set(used_tools)
    expected_tool_set = set(expected_tools)
    
    correct_tools = used_tool_set.intersection(expected_tool_set)
    missing_tools = expected_tool_set.difference(used_tool_set)
    extra_tools = used_tool_set.difference(expected_tool_set)
    
    # Calculate appropriateness: proportion of used tools that were correct
    total_used = len(correct_tools) + len(extra_tools)
    appropriateness = len(correct_tools) / total_used if total_used > 0 else 1.0
    
    # Calculate completeness: proportion of expected tools that were used
    completeness = len(correct_tools) / len(expected_tool_set) if expected_tool_set else 1.0
    
    return {
        "used_tools": list(used_tool_set),
        "correct_tools": list(correct_tools),
        "missing_tools": list(missing_tools),
        "extra_tools": list(extra_tools),
        "appropriateness": appropriateness,
        "completeness": completeness,
    }   
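
The two ratios above behave like precision (appropriateness) and recall (completeness) over sets of tool names. The standalone sketch below reproduces the same arithmetic on plain lists so the edge cases are easy to see. `tool_usage_scores` is a hypothetical helper, and the second and third inputs are made-up scenarios.

```python
def tool_usage_scores(used_tools, expected_tools):
    """Return (appropriateness, completeness) using the same ratios as evaluate_tool_usage."""
    used, expected = set(used_tools), set(expected_tools)
    correct = used & expected
    # Share of used tools that were expected; defined as 1.0 when nothing was used.
    appropriateness = len(correct) / len(used) if used else 1.0
    # Share of expected tools that were used; defined as 1.0 when nothing was expected.
    completeness = len(correct) / len(expected) if expected else 1.0
    return appropriateness, completeness


expected = ["get_weather_forecast", "get_electricity_prices"]

# Matches the first run in this notebook: both expected tools, nothing extra.
print(tool_usage_scores(["get_weather_forecast", "get_electricity_prices"], expected))  # (1.0, 1.0)

# Made-up scenario: one expected tool plus one unexpected tool.
print(tool_usage_scores(["get_electricity_prices", "get_energy_saving_tips"], expected))  # (0.5, 0.5)

# Made-up scenario: the agent called no tools at all.
print(tool_usage_scores([], ["calculate_energy_savings"]))  # (1.0, 0.0)
```

The last case shows that an agent which calls no tools still scores a perfect appropriateness, so the two numbers are best read together.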

In [16]:
# TODO: Generate a comprehensive evaluation report
# Calculate overall scores and metrics
# Identify strengths and weaknesses
# Provide recommendations for improvement
def generate_evaluation_report(test_results):
    total_accuracy = 0
    total_relevance = 0
    total_completeness = 0
    total_usefulness = 0
    total_tool_correct = 0
    total_tool_missing = 0
    total_tool_extra = 0
    total_appropriateness = 0
    total_tool_completeness = 0
    
    for result in test_results:
        # Evaluate response
        final_response = result['response']['messages'][-1].content if isinstance(result['response'], dict) else ""
        response_evaluation = evaluate_response(
            question=result['question'],
            final_response=final_response,
            expected_response=result['expected_response']
        )
        
        # Extract scores
        accuracy_score = response_evaluation.get('accuracy', {}).get('score', 3)
        relevance_score = response_evaluation.get('relevance', {}).get('score', 3)
        completeness_score = response_evaluation.get('completeness', {}).get('score', 3)
        usefulness_score = response_evaluation.get('usefulness', {}).get('score', 3)
        
        total_accuracy += accuracy_score
        total_relevance += relevance_score
        total_completeness += completeness_score
        total_usefulness += usefulness_score
        
        # Evaluate tool usage
        messages = result['response']['messages'] if isinstance(result['response'], dict) else []
        tool_evaluation = evaluate_tool_usage(
            messages=messages,
            expected_tools=result['expected_tools']
        )
        
        total_tool_correct += len(tool_evaluation['correct_tools'])
        total_tool_missing += len(tool_evaluation['missing_tools'])
        total_tool_extra += len(tool_evaluation['extra_tools'])
        total_appropriateness += tool_evaluation['appropriateness']
        total_tool_completeness += tool_evaluation['completeness']
        
        # Print individual test results
        print(f"Test ID: {result['test_id']}")
        print(f"Accuracy: {accuracy_score}/5 - {response_evaluation.get('accuracy', {}).get('explanation', '')}")
        print(f"Relevance: {relevance_score}/5 - {response_evaluation.get('relevance', {}).get('explanation', '')}")
        print(f"Completeness: {completeness_score}/5 - {response_evaluation.get('completeness', {}).get('explanation', '')}")
        print(f"Usefulness: {usefulness_score}/5 - {response_evaluation.get('usefulness', {}).get('explanation', '')}")
        print(f"Tool Appropriateness: {tool_evaluation['appropriateness']:.2f}")
        print(f"Tool Completeness: {tool_evaluation['completeness']:.2f}")
        print(f"Tool Usage Details: Correct: {len(tool_evaluation['correct_tools'])}, Missing: {len(tool_evaluation['missing_tools'])}, Extra: {len(tool_evaluation['extra_tools'])}")
        print("-" * 50)
    
    num_tests = len(test_results)
    avg_accuracy = total_accuracy / num_tests if num_tests > 0 else 0
    avg_relevance = total_relevance / num_tests if num_tests > 0 else 0
    avg_completeness = total_completeness / num_tests if num_tests > 0 else 0
    avg_usefulness = total_usefulness / num_tests if num_tests > 0 else 0
    avg_appropriateness = total_appropriateness / num_tests if num_tests > 0 else 0
    avg_tool_completeness = total_tool_completeness / num_tests if num_tests > 0 else 0
    
    print("\n=== Evaluation Summary ===")
    print(f"Average Accuracy: {avg_accuracy:.2f}/5")
    print(f"Average Relevance: {avg_relevance:.2f}/5")
    print(f"Average Completeness: {avg_completeness:.2f}/5")
    print(f"Average Usefulness: {avg_usefulness:.2f}/5")
    print(f"Average Tool Appropriateness: {avg_appropriateness:.2f}")
    print(f"Average Tool Completeness: {avg_tool_completeness:.2f}")
    print(f"Total Correct Tools Used: {total_tool_correct}")
    print(f"Total Missing Tools: {total_tool_missing}")
    print(f"Total Extra Tools: {total_tool_extra}")

In [17]:
generate_evaluation_report(test_results)

Evaluating tool usage...
Message: Location: San Francisco, CA, Name: None
Message: When should I charge my electric car tomorrow to minimize cost and maximize solar power?, Name: None
Message: , Name: energy_advisor
Message: {"location": "San Francisco, CA", "forecast_days": 1, "current": {"temperature_c": 22.2, "condition": "partly_cloudy", "humidity": 33, "wind_speed": 4.0}, "hourly": [{"hour": 0, "temperature_c": 20.9, "condition": "cloudy", "solar_irradiance": 750.9, "humidity": 73, "wind_speed": 5.3}, {"hour": 1, "temperature_c": 25.7, "condition": "cloudy", "solar_irradiance": 465.2, "humidity": 79, "wind_speed": 0.7}, {"hour": 2, "temperature_c": 29.8, "condition": "sunny", "solar_irradiance": 408.7, "humidity": 52, "wind_speed": 12.1}, {"hour": 3, "temperature_c": 23.0, "condition": "partly_cloudy", "solar_irradiance": 863.7, "humidity": 38, "wind_speed": 1.3}, {"hour": 4, "temperature_c": 22.8, "condition": "sunny", "solar_irradiance": 30.6, "humidity": 31, "wind_speed": 14.2}