# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent with various real-world scenarios and see how it helps customers optimize their energy usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [1]:
from datetime import datetime
from agent import Agent

In [2]:
## TODO: Create the agent's instructions

ECOHOME_SYSTEM_PROMPT = """
    You are a home energy expert with a set of tools for assessing our customer's energy situation.
    You will support the customer and provide advice on the following topics [with the corresponding tools]:

        (1) Weather Integration: Use weather forecasts to predict solar generation and optimize device scheduling
        (2) Dynamic Pricing: Consider time-of-day electricity prices for cost optimization
        (3) Historical Analysis: Query past energy usage patterns for personalized advice
        (4) RAG Pipeline: Retrieve relevant energy-saving tips and best practices
        (5) Multi-device Optimization: Handle EVs, HVAC, appliances, and solar systems
        (6) Cost Calculations: Provide specific savings estimates and ROI analysis
        (7) Other energy advice or questions spanning multiple categories: Analyze the question and use any combination of the tools to work out the answer
    
    After obtaining all required data from the tools, answer the question in a polite and helpful manner within 200 words. Explain your logic.
    """

In [3]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)

In [4]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [5]:
print(response["messages"][-1].content)

To minimize costs and maximize solar power for charging your electric car tomorrow in San Francisco, consider the following:

1. **Solar Power Maximization**: The highest solar irradiance is expected between 11 AM and 2 PM, with a peak at 2 PM (310 W/m²). Charging during this time will maximize the use of solar energy.

2. **Cost Minimization**: Electricity prices are lowest during the Off-Peak period, which is from midnight to 6 AM and again from 11 PM to midnight, at $0.096 per kWh. The Mid-Peak rate is $0.132 per kWh, and the Peak rate is $0.192 per kWh from 2 PM to 8 PM.

**Recommendation**: To balance both solar power usage and cost, consider starting your charge around 11 AM to take advantage of solar power, and if possible, continue charging into the early afternoon. If you need to charge more, consider doing so during the Off-Peak hours (midnight to 6 AM) to benefit from the lowest rates. This strategy will help you optimize both cost and renewable energy usage.
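
To put the quoted rates in perspective, the sketch below compares the grid cost of one charging session at each tariff. The 40 kWh session size is an assumed figure for illustration only; it does not come from the agent's tool data, and energy drawn directly from rooftop solar is not modeled.

```python
# Rates per kWh as quoted in the agent response above
RATES = {"off_peak": 0.096, "mid_peak": 0.132, "peak": 0.192}

# Assumption: a hypothetical 40 kWh charging session
SESSION_KWH = 40


def session_cost(kwh: float, rate_per_kwh: float) -> float:
    """Return the grid cost in dollars of drawing `kwh` at a flat rate."""
    return round(kwh * rate_per_kwh, 2)


costs = {period: session_cost(SESSION_KWH, rate) for period, rate in RATES.items()}
for period, cost in costs.items():
    print(f"{period:>8}: ${cost:.2f}")
```

Under that assumption, charging at the Peak rate costs twice as much as charging Off-Peak, which is why the recommendation steers any extra charging to the midnight to 6 AM window.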


In [6]:
print("TOOLS:")
for msg in response["messages"]:
    obj = msg.model_dump()
    if obj.get("tool_call_id"):
        print("-", msg.name)

TOOLS:
- get_weather_forecast
- get_electricity_prices


In [7]:
response['messages']

[SystemMessage(content='Location: San Francisco, CA', additional_kwargs={}, response_metadata={}, id='17b9dd9f-81fa-42e3-9794-02bfbc4da7b4'),
 HumanMessage(content='When should I charge my electric car tomorrow to minimize cost and maximize solar power?', additional_kwargs={}, response_metadata={}, id='a58008cf-834a-4a40-be85-35b6cfaa653a'),
 AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 54, 'prompt_tokens': 1079, 'total_tokens': 1133, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_689bad8e9a', 'id': 'chatcmpl-Ch3aTEsovBh1rSEZMtTlBZnW9rUmR', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, name='energy_advisor', id='lc_run--e2c2cfdf-5e5a-4cf4-8d9d

## 2. Define Test Cases

In [8]:
# TODO: Define comprehensive test cases for the Energy Advisor
# Create 10 test cases covering different scenarios:
# - EV charging optimization
# - Thermostat settings
# - Appliance scheduling
# - Solar power maximization
# - Cost savings calculations

In [9]:
test_cases = [
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "The response should contain time recommendation, cost analysis and solar consideration",
    },
    {
        "id": "ev_charging_2",
        "question": "To minimize cost of charing my vehicle using solar energy, how does the time range differs comparing summer to winter?",
        "expected_tools": ["get_weather_forecast", "query_solar_generation"],
        "expected_response": "The response should provide insights about the time range difference between summer and winter.",
    },
    {
        "id": "thermostat_setting",
        "question": "What is the optimal thermostat_setting I should be setting to next week in my living room?",
        "expected_tools": ["query_energy_usage", "get_weather_forecast"],
        "expected_response": "The response should contain time recommendation and thermostat temperature setting.",
    },
    {
        "id": "appliance_scheduling",
        "question": "When is the best time this week to do my laundry to save money?",
        "expected_tools": ["get_electricity_prices", "query_energy_usage", "get_recent_energy_summary"],
        "expected_response": "The response should include a date and time for doing laundry.",
    },
    {
        "id": "energy_price_estimation",
        "question": "By not turning on my appliance, I save 100kwh a day, how much money do I save for a year?",
        "expected_tools": ["calculate_energy_savings"],
        "expected_response": "The response should include the number of hours, the energy consumed, and the price.",
    },
    {
        "id": "costs_savings_calculation",
        "question": "How do I optimize my electric bill with my appliances using my usage data from the last 2 weeks?",
        "expected_tools": ["query_energy_usage", "get_electricity_prices"],
        "expected_response": "The response should provide suggestions on how to arrange appliance usage.",
    },
    {
        "id": "general_tips",
        "question": "Can you give me 5 general ways to save energy for my home?",
        "expected_tools": ["search_energy_tips"],
        "expected_response": "There should be 5 points related to energy saving given.",
    },
    {
        "id": "solar_saving",
        "question": "Can you estimate the total solar saving for the whole week next week?",
        "expected_tools": ["get_weather_forecast", "query_solar_generation"],
        "expected_response": "The response should be a total saving amount.",
    },
    {
        "id": "general_consumption",
        "question": "How much does it cost for my energy consumption in the last 2 weeks?",
        "expected_tools": ["query_energy_usage", "get_electricity_prices"],
        "expected_response": "The response should be a total cost amount.",
    },
    {
        "id": "energy_consumption",
        "question": "What is my energy consumption for the last 2 hours?",
        "expected_tools": ["get_recent_energy_summary"],
        "expected_response": "The response should contain the energy consumption of last 2 hours.",
    },
    

]

if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
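
The length check above catches a short list but not malformed entries. A minimal sketch of a stricter check follows. The `validate_test_cases` helper and the `KNOWN_TOOLS` set are illustrative assumptions: the tool names are gathered from the `expected_tools` lists in this notebook, and the real `TOOL_KIT` may expose a different set.

```python
from collections import Counter

REQUIRED_KEYS = {"id", "question", "expected_tools", "expected_response"}

# Assumption: names collected from the expected_tools lists in this notebook;
# the actual TOOL_KIT may differ.
KNOWN_TOOLS = {
    "get_weather_forecast",
    "get_electricity_prices",
    "query_energy_usage",
    "query_solar_generation",
    "get_recent_energy_summary",
    "search_energy_tips",
    "calculate_energy_savings",
}


def validate_test_cases(cases, known_tools=KNOWN_TOOLS):
    """Return a list of human-readable problems; an empty list means all cases pass."""
    problems = []
    id_counts = Counter(case.get("id") for case in cases)
    for index, case in enumerate(cases):
        label = case.get("id", f"<case {index}>")
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if id_counts[case.get("id")] > 1:
            problems.append(f"{label}: duplicate id")
        unknown = set(case.get("expected_tools", [])) - known_tools
        if unknown:
            problems.append(f"{label}: unknown tools {sorted(unknown)}")
    return problems


# Demonstration with one valid case and one deliberately malformed case
sample_cases = [
    {
        "id": "general_tips",
        "question": "Can you give me 5 general ways to save energy for my home?",
        "expected_tools": ["search_energy_tips"],
        "expected_response": "There should be 5 points related to energy saving given.",
    },
    {
        "id": "bad_case",
        "question": "Placeholder question",
        "expected_tools": ["get_wether_forecast"],  # misspelled on purpose
    },
]
problems = validate_test_cases(sample_cases)
for problem in problems:
    print(problem)
```

Running a check like this before the agent loop surfaces a misspelled tool name immediately, rather than as an unexplained low tool score later.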

## 3. Run Agent Tests

In [11]:
CONTEXT = "Location: San Francisco, CA"

In [12]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases):
    print(f"\nTest {i+1}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")

=== Running Agent Tests ===

Test 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?
--------------------------------------------------

Test 2: ev_charging_2
Question: To minimize cost of charing my vehicle using solar energy, how does the time range differs comparing summer to winter?
--------------------------------------------------

Test 3: thermostat_setting
Question: What is the optimal thermostat_setting I should be setting to next week in my living room?
--------------------------------------------------

Test 4: appliance_scheduling
Question: When is the best time this week to do my laundry to save money?
--------------------------------------------------

Test 5: energy_price_estimation
Question: By not turning on my appliance, I save 100kwh a day, how much money do I save for a year?
--------------------------------------------------

Test 6: costs_savings_calculation
Question: How do I optimize my electric bi

In [13]:
test_results

[{'test_id': 'ev_charging_1',
  'question': 'When should I charge my electric car tomorrow to minimize cost and maximize solar power?',
  'response': {'messages': [SystemMessage(content='Location: San Francisco, CA', additional_kwargs={}, response_metadata={}, id='3ccf1dff-42c3-4019-b8e3-ba71e5742654'),
    HumanMessage(content='When should I charge my electric car tomorrow to minimize cost and maximize solar power?', additional_kwargs={}, response_metadata={}, id='d8af9fe2-f4ae-415a-be1f-2eeec303192f'),
    AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 54, 'prompt_tokens': 1079, 'total_tokens': 1133, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_689bad8e9a', 'id': 'chatcmp

In [14]:
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage, ToolMessage

tool_calls = []
for result in test_results:
    msg_list = result["response"]["messages"]
    tool_call = [msg.name for msg in msg_list if isinstance(msg, ToolMessage)]
    tool_calls.append(tool_call)

# Flatten the per-test lists into the set of distinct tools that were called
tool_calls_all = {item for sublist in tool_calls for item in sublist}

In [15]:
tool_calls

[['get_weather_forecast', 'get_electricity_prices'],
 ['get_weather_forecast', 'get_electricity_prices'],
 ['get_weather_forecast'],
 ['get_weather_forecast', 'get_electricity_prices'],
 ['get_electricity_prices'],
 ['query_energy_usage', 'get_electricity_prices'],
 ['search_energy_tips'],
 ['get_weather_forecast'],
 ['query_energy_usage', 'get_electricity_prices'],
 ['get_recent_energy_summary']]

In [16]:
tool_calls_all

{'get_electricity_prices',
 'get_recent_energy_summary',
 'get_weather_forecast',
 'query_energy_usage',
 'search_energy_tips'}
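
The observed tool calls can be compared with each test case's `expected_tools` without an LLM. The sketch below is one possible deterministic check, not part of the original notebook: `tool_overlap` is a hypothetical helper that reports set-based precision and recall plus the missing and unexpected tool names. The inputs are copied from the test cases and the `tool_calls` output above.

```python
def tool_overlap(expected, actual):
    """Compare expected and observed tool names as sets."""
    expected_set, actual_set = set(expected), set(actual)
    hits = expected_set & actual_set
    return {
        # Share of the tools actually called that were expected (0.0 if none were called)
        "precision": len(hits) / len(actual_set) if actual_set else 0.0,
        # Share of the expected tools that were actually called
        "recall": len(hits) / len(expected_set) if expected_set else 1.0,
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


# ev_charging_2: expected tools vs. the second entry of tool_calls
ev2 = tool_overlap(
    expected=["get_weather_forecast", "query_solar_generation"],
    actual=["get_weather_forecast", "get_electricity_prices"],
)

# general_tips: expected tools vs. the seventh entry of tool_calls
tips = tool_overlap(["search_energy_tips"], ["search_energy_tips"])

print(ev2)
print(tips)
```

For `ev_charging_2` this flags `query_solar_generation` as missing, which matches the observation that it never appears in `tool_calls_all`.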

## 4. Evaluate Responses

In [18]:
# Set up all required libraries
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage, ToolMessage
# from langgraph.graph.message import MessagesState
from langchain_core.prompts import PromptTemplate
from typing import List, Dict
from tools import TOOL_KIT
from pydantic import BaseModel, Field

In [19]:
class QualityEvalSchema(BaseModel):
    accuracy: float = Field(description="Accuracy of final_response with respect to the customer's question and expected response description.")
    relevance: float = Field(description="Relevance of final_response with respect to the customer's question and expected response description.")
    completeness: float = Field(description="Completeness of final_response with respect to the customer's question and expected response description.")
    usefulness: float = Field(description="Usefulness of final_response with respect to the customer's question.")
    quality_evaluation_description: str = Field(description="Description of the reasoning behind the other metrics.")

class ToolEvalSchema(BaseModel):
    tool_appropriateness: float = Field(description="Appropriateness of the tools selected based on the given tools.")
    tool_completeness: float = Field(description="Metric measuring whether all the necessary tools were used.")
    tool_evaluation_description: str = Field(description="The reasoning behind the tool metrics.")

In [20]:
# Set up LLM Judge for Evaluation
# (LangChain is used instead of LangGraph because the analyses are independent and each needs only a single LLM evaluation step)
from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
llm_base_url = "https://openai.vocareum.com/v1"

# Create LLM Evaluation Agent
llm = ChatOpenAI(
    model="gpt-4o",             # using a more sophisticated model for evaluation purpose
    temperature=0.0,
    api_key = OPENAI_API_KEY,
    base_url = llm_base_url,
)

qual_structured_llm = llm.with_structured_output(QualityEvalSchema)
tool_structured_llm = llm.with_structured_output(ToolEvalSchema)

In [21]:
# TODO: Implement evaluation functions
# Create functions to evaluate:
# - Final Response
# - Tool usage

In [22]:
# TODO: Create a response evaluator
def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response against expected response"""

    template = PromptTemplate(
        input_variables=["question", "final_response", "expected_response"],
        template="""You are an evaluator judging the response given by an energy advisor.

        Given the energy advisor's response and the expected response description, provide the following:
        - accuracy: Out of 100, how accurate is the advisor's response with respect to the customer's question and the expected response description?
        - relevance: Out of 100, how relevant is the advisor's response with respect to the customer's question and the expected response description?
        - completeness: Out of 100, how complete is the advisor's response with respect to the customer's question and the expected response description?
        - usefulness: Out of 100, how useful is the advisor's response with respect to the customer's question? (Consider whether the customer can take any action based on this answer.)
        - quality_evaluation_description: Provide a detailed description of 200 to 500 words justifying the scores above.

        Customer's question: 
        {question}
        
        Energy advisor response: 
        {final_response}

        Expected response description: 
        {expected_response}
        """
    )

    messages = template.invoke(
            {
                "question": question, 
                "final_response": final_response, 
                "expected_response": expected_response
            }
        ).to_messages()

    ai_message = qual_structured_llm.invoke(messages)
    return ai_message

In [23]:
# TODO: Create a tool usage evaluator
def evaluate_tool_usage(messages: list, expected_tools: List[str]):
    """Evaluate if the right tools were used"""
    
    template = PromptTemplate(
        # Must match the placeholders that appear in the template below
        input_variables=["TOOL_KIT", "expected_tools"],
        template="""You are an evaluator judging the tools used by an energy advisor AI agent to answer the user's question.

        Given the message history from the energy advisor AI agent, please evaluate the selection of tool(s) on the following:
        - tool_appropriateness: Out of 100, how appropriate are the selected tools for answering the customer's question?
        - tool_completeness: Out of 100, were all the necessary tools used?
        - tool_evaluation_description: Provide a detailed description of 200 to 500 words justifying the scores above.

        Available toolset for the energy advisor AI agent: 
        {TOOL_KIT}
        
        Expected tools to use: 
        {expected_tools}
        """
    ) 


    tool_info: List[Dict[str, str]] = [
        {"name": tool.name, "description": tool.description}
        for tool in TOOL_KIT
    ]

    prompt_message = template.invoke(
            {"TOOL_KIT": str(tool_info), "expected_tools": expected_tools}
        ).to_messages()

    ai_message = tool_structured_llm.invoke(messages + prompt_message)
    return ai_message

In [24]:
def results_reporting(metrics: List[List[tuple]]):
    """Produce a report for the evaluation"""

    template = PromptTemplate(
        input_variables=["metrics"],
        template="""You are an expert report writer. Using the metrics provided for the AI energy advisor, write a report covering the full list of test samples.
        Your tasks are as follows:
            - Calculate overall scores and metrics across all samples
            - Identify strengths and weaknesses
            - Provide recommendations for improvement
        
        Metrics:
        {metrics}

        The report must be under 300 words in total and use the following section names:
            - Overall Score
            - Strengths and Weaknesses
            - Recommendations
        """
        )
    
    prompt_message = template.invoke({"metrics": metrics})
    report = llm.invoke(prompt_message)
    return report 

In [25]:
# TODO: Generate a comprehensive evaluation report
# Calculate overall scores and metrics
# Identify strengths and weaknesses
# Provide recommendations for improvement

def generate_evaluation_report(test_results: List[Dict]):
    """Score every test result with the LLM judges and print a summary report."""

    metrics = []

    for result in test_results:
        # Failed runs store an error string instead of a message list, so skip them
        if 'error' in result:
            print(f"Skipping {result['test_id']}: {result['error']}")
            continue

        messages = result['response']['messages']

        # Qualitative evaluation using the LLM judge
        quality_metric = evaluate_response(
            question=result['question'],
            final_response=messages[-1].content,
            expected_response=result['expected_response'],
        )

        # Tool evaluation using the LLM judge
        tool_metric = evaluate_tool_usage(
            messages=messages,
            expected_tools=result['expected_tools'],
        )

        # Iterating a Pydantic model yields (field_name, value) pairs
        metrics.append(list(quality_metric) + list(tool_metric))

    report = results_reporting(metrics)
    print(report.content)

    return metrics, report
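
The report prompt asks the LLM to calculate the overall scores, so the averages it states cannot be verified from the report alone. A minimal sketch of computing them directly in Python follows. `average_scores` is a hypothetical helper, and the sample values are made up for illustration; they only mimic the list of `(field_name, value)` pairs that `generate_evaluation_report` collects per test.

```python
from collections import defaultdict
from statistics import mean


def average_scores(metrics):
    """Average every numeric field across samples; text fields are ignored."""
    buckets = defaultdict(list)
    for sample in metrics:
        for name, value in sample:
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                buckets[name].append(value)
    return {name: round(mean(values), 1) for name, values in buckets.items()}


# Hypothetical scores for two samples (not results from this notebook)
sample_metrics = [
    [("accuracy", 90.0), ("relevance", 95.0), ("quality_evaluation_description", "...")],
    [("accuracy", 80.0), ("relevance", 85.0), ("quality_evaluation_description", "...")],
]
averages = average_scores(sample_metrics)
print(averages)
```

Passing precomputed averages like these into the report prompt would leave the LLM to interpret the numbers rather than derive them.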

In [26]:
metrics, report = generate_evaluation_report(test_results)

### Overall Score

The AI energy advisor demonstrates strong performance across various metrics, with an average accuracy of 85.5%, relevance of 89.0%, completeness of 81.5%, and usefulness of 86.0%. Tool appropriateness and completeness are rated at 88.0% and 82.5%, respectively. These scores indicate a generally effective system, though there is room for improvement in certain areas.

### Strengths and Weaknesses

**Strengths:**
- **Accuracy and Relevance:** The advisor consistently provides accurate and relevant responses, particularly in scenarios involving specific queries about energy usage and savings.
- **Tool Appropriateness:** The selection of tools is generally well-suited to the tasks, ensuring that the responses are based on relevant data.
- **Usefulness:** The advice given is often actionable, allowing users to make informed decisions about energy consumption and savings.

**Weaknesses:**
- **Completeness:** Some responses lack full completeness, often missing specific de

---