# Tool-Calling Agent with Hybrid Evaluation

## Overview

This notebook demonstrates a complete tool-calling AI agent system built with LangChain and LangGraph. The agent can execute function calls (weather queries, calculations) and includes semantic evaluation of its responses.

## Architecture

The system uses **one LLM instance** (Hermes-3-Llama-3.1-8B via vLLM) serving two distinct roles:

- **Agent Mode**: LLM with tools bound (`llm_with_tools`) - makes structured tool calls
- **Evaluator Mode**: Raw LLM without tools (`llm`) - performs semantic assessment

This hybrid approach addresses GPU memory constraints while maintaining proper separation between agent execution and evaluation logic.

## Key Capabilities

- **Structured Tool Calling**: Agent makes actual function calls rather than describing tool usage in text
- **Smart Routing**: Uses LangGraph to manage agent flow and decide when to call tools vs. return final answers
- **Semantic Evaluation**: LLM-based assessment of tool selection correctness and response quality, avoiding brittle keyword matching
- **Input Validation**: Pydantic schemas ensure proper tool parameter handling

## Requirements

- vLLM server running Hermes-3-Llama-3.1-8B on port 8082
- Function-calling capable model (many general instruction-tuned models lack this capability)

## Imports

In [1]:
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langchain_core.tools import tool
from langchain_core.messages import ToolMessage
from pydantic import BaseModel, Field


## Connect to vLLM Server

**Prerequisites**: Before running this notebook, start the vLLM server in a separate terminal window.

### Starting the Server

Open a terminal and run:
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --port 8082 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192
```

**Required flags explained**:
- `--enable-auto-tool-choice`: Lets the model decide on its own when to emit a tool call
- `--tool-call-parser hermes`: Parses the Hermes tool-call output format into structured tool calls
- `--max-model-len 8192`: Limits the context window so the model fits in 24 GB of GPU memory

Wait for the server to fully load (you'll see "Application startup complete" in the logs) before proceeding.
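Readiness can also be checked programmatically instead of watching the logs. The sketch below polls the server's OpenAI-compatible `/v1/models` endpoint using only the standard library. The helper names `server_ready` and `wait_for_server`, the timeout, and the polling interval are illustrative assumptions and are not part of the original notebook.

```python
import time
import urllib.request


def server_ready(base_url: str = "http://localhost:8082/v1", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError, refused connections, and timeouts are all OSError subclasses;
        # ValueError covers a malformed URL
        return False


def wait_for_server(base_url: str = "http://localhost:8082/v1",
                    max_wait: float = 300.0, interval: float = 5.0) -> bool:
    """Poll until the server responds or max_wait seconds have elapsed."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if server_ready(base_url):
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` before creating the client should surface a missing server early, rather than as a connection error on the first model call.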

### Connection Setup

This cell creates the base LLM connection that will be used for both agent operations (with tools bound) and evaluation (without tools). The `api_key` parameter is required by the OpenAI client interface, but vLLM does not check it unless the server was started with `--api-key`.

In [2]:
# Connect to vLLM - this is the BASE LLM (no tools)
llm = ChatOpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed",
    model="NousResearch/Hermes-3-Llama-3.1-8B"
)

print("✓ Connected to vLLM server on port 8082")

✓ Connected to vLLM server on port 8082


## Pydantic Models for Tool Input Validation

Pydantic models define the expected input schema for each tool, ensuring the LLM generates valid, structured tool calls.

### Critical Requirement: Function-Calling Capable Models

**Not all LLMs support Pydantic-based tool calling.** The model must be specifically trained for structured function calling. 

**Before selecting an LLM**, verify it explicitly supports function calling with structured schemas. Using a model without this capability will result in the agent describing tool usage in text rather than making actual structured tool calls.

### Why Pydantic?

- **Type Safety**: Enforces correct data types (str, float, int, etc.)
- **Validation**: Applies constraints (min/max length, numerical ranges)
- **Schema Generation**: Automatically creates JSON schemas that the LLM uses to structure tool calls
- **Field Descriptions**: Provides detailed parameter descriptions that guide the LLM's tool usage

### Pattern Used

Each tool follows this pattern:

1. Define a Pydantic `BaseModel` subclass with typed fields
2. Use `Field()` to add descriptions and validation constraints
3. Pass the model to `@tool(args_schema=ModelClass)` decorator
4. Write the function with a complete docstring

The field descriptions are critical: they're sent to the LLM and directly influence how it constructs tool calls.
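To make the first two steps of the pattern concrete, here is a minimal sketch of what `Field()` contributes. It assumes Pydantic v2 (for `model_json_schema()`), and `CityInput` is a hypothetical model used only for this illustration, not one of the notebook's tool schemas.

```python
from pydantic import BaseModel, Field, ValidationError


class CityInput(BaseModel):
    """Hypothetical schema used only for this illustration."""
    city: str = Field(
        description="City name, e.g. 'Boston' or 'London, UK'",
        min_length=2,
        max_length=100,
    )


# The generated JSON schema carries the description and constraints from Field()
schema = CityInput.model_json_schema()
city_schema = schema["properties"]["city"]
print(city_schema["description"])
print(city_schema["minLength"], city_schema["maxLength"])

# Input that violates a constraint is rejected before any tool logic would run
try:
    CityInput(city="X")
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.errors()[0]["msg"])
```

The description string ends up in the schema verbatim, which is why vague descriptions tend to produce vague tool arguments.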

## Define Tools

Tools are Python functions that the agent can call to perform specific tasks. The LLM must understand what each tool does and when to use it based solely on the tool's metadata.

### Model Selection for Tool Calling

**Critical requirement**: You must use a model specifically trained for function calling. Not all LLMs support structured tool calls.

**What happens with the wrong model**: The agent will generate text descriptions like "I would call the weather tool for Boston" instead of making actual structured function calls. Always verify function-calling support before implementing your agent.

### Critical Role of Docstrings

**Docstrings are not optional**: they are the primary mechanism by which the LLM learns what each tool does. The docstring content is sent to the LLM as part of the tool schema. Many AI coding assistants omit the docstring or write a one-line description that gives the agent too little information to use the tool correctly.

**What makes an effective tool docstring:**
- **Clear purpose statement**: First line explains what the tool does
- **Detailed parameter descriptions**: Each argument needs explanation and examples
- **Return value documentation**: Describes what the tool outputs
- **Format and constraints**: Any limitations or expected formats

The LLM reads these docstrings to decide:
1. Which tool (if any) to call for a given query
2. What arguments to pass to the tool
3. How to interpret the tool's output

Poor or missing docstrings lead to incorrect tool selection and malformed tool calls. The quality of your docstrings directly determines agent reliability.
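The following standard-library sketch illustrates the idea that a docstring becomes the tool description the model sees. It is a hand-rolled approximation, not LangChain's actual conversion code: `describe_tool` is a hypothetical helper, the output loosely follows the OpenAI-style tool format, and the `parameters` section is omitted for brevity.

```python
import inspect
import json


def get_weather(location: str) -> str:
    """Retrieves current weather information for a specified location.

    Args:
        location: City name and optionally state/country (e.g., "Boston, MA").
    """
    return f"Current weather in {location}: sunny"


def describe_tool(func) -> dict:
    """Build a tool entry whose description is the function's docstring."""
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            # A function without a docstring yields an empty description
            "description": inspect.getdoc(func) or "",
        },
    }


spec = describe_tool(get_weather)
print(json.dumps(spec, indent=2))
```

A function with no docstring would produce an empty `description`, leaving the model with only the function name to go on.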

### Tool Structure

Each tool combines four components:
1. **Pydantic model** (`args_schema`) - validates inputs
2. **@tool decorator** - registers the function as a callable tool
3. **Docstring** - instructs the LLM on tool usage
4. **Function implementation** - executes the actual logic

### Weather Tool

**Pydantic Model**: `WeatherInput` defines a single required parameter:
- `location` (str): City name with optional state/country
- Validation: 2-100 characters in length
- Field description provides format examples that guide the LLM's input generation

**Tool Decorator**: `@tool(args_schema=WeatherInput)` binds the validation schema to the function, ensuring all calls pass through Pydantic validation before execution.

**Docstring**: Provides structured documentation that the LLM uses to understand:
- **Purpose**: What the tool retrieves (current weather information)
- **Parameters**: Repeats the location format with examples
- **Returns**: Expected output format (temperature + conditions)

**Implementation**: Returns a mock weather response. In production, this would call an actual weather API; for demonstration purposes, it returns a fixed sunny, 72°F response with the requested location interpolated.

In [3]:
# Define Pydantic model for input validation
class WeatherInput(BaseModel):
    location: str = Field(
        description="The city name and optionally state/country (e.g., 'San Francisco, CA' or 'London, UK')",
        min_length=2,
        max_length=100
    )

# Define a tool with Pydantic validation
@tool(args_schema=WeatherInput)
def get_weather(location: str) -> str:
    """
    Retrieves current weather information for a specified location.
    
    Args:
        location: The city name and optionally state/country (e.g., "San Francisco, CA" or "London, UK")
    
    Returns:
        A string containing the current temperature in Fahrenheit and weather conditions.
    """
    return f"Current weather in {location}: Temperature is 72°F, conditions are sunny with clear skies"

### Calculator Tool

**Pydantic Model**: `CalculatorInput` defines three required parameters:
- `operation` (str): Specifies which mathematical operation to perform
- `a` (float): First operand
- `b` (float): Second operand
- Field descriptions clearly enumerate the four supported operations, helping the LLM choose valid operation names

**Tool Decorator**: `@tool(args_schema=CalculatorInput)` binds the validation schema, ensuring type correctness (floats for numbers, valid string for operation).

**Docstring**: Documents the tool's purpose and parameter meanings, enabling the LLM to correctly map user queries like "multiply 25 by 4" to the appropriate operation name and operands.

**Implementation**: 
- Executes basic arithmetic using if/elif branching
- **Error handling**: Returns descriptive error messages for division by zero and unknown operations
- **Return format**: Provides a natural language response incorporating the operation name and result
- This pattern (string returns vs. raw numbers) allows the agent to easily incorporate results into conversational responses
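The if/elif branching could alternatively be expressed as a dispatch table, which keeps the list of supported operations in one place. This is a sketch of that design option rather than the notebook's implementation; `OPERATIONS` and `calculate` are illustrative names, and the function is left undecorated so it runs without LangChain.

```python
import operator

# Map each supported operation name to (function, gerund used in the reply)
OPERATIONS = {
    "add": (operator.add, "adding"),
    "subtract": (operator.sub, "subtracting"),
    "multiply": (operator.mul, "multiplying"),
    "divide": (operator.truediv, "dividing"),
}


def calculate(operation: str, a: float, b: float) -> str:
    """Plain-function variant of the calculator logic (no @tool decorator)."""
    if operation not in OPERATIONS:
        return f"Error: Unknown operation '{operation}'"
    if operation == "divide" and b == 0:
        return "Error: Cannot divide by zero"
    func, gerund = OPERATIONS[operation]
    return f"The result of {gerund} {a} and {b} is {func(a, b)}"


print(calculate("multiply", 25.0, 4.0))
print(calculate("divide", 1.0, 0.0))
```

Adding a new operation then means adding one dictionary entry, and the same dictionary can be used to generate the list of valid names for the field description.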

In [4]:
# Define Pydantic model for calculator input
class CalculatorInput(BaseModel):
    operation: str = Field(
        description="The mathematical operation: 'add', 'subtract', 'multiply', or 'divide'"
    )
    a: float = Field(description="First number")
    b: float = Field(description="Second number")

# Define calculator tool with Pydantic validation
@tool(args_schema=CalculatorInput)
def calculator(operation: str, a: float, b: float) -> str:
    """
    Performs basic mathematical operations on two numbers.
    
    Args:
        operation: The mathematical operation to perform (add, subtract, multiply, divide)
        a: First number
        b: Second number
    
    Returns:
        A string containing the result of the operation.
    """
    if operation == "add":
        result = a + b
    elif operation == "subtract":
        result = a - b
    elif operation == "multiply":
        result = a * b
    elif operation == "divide":
        if b == 0:
            return "Error: Cannot divide by zero"
        result = a / b
    else:
        return f"Error: Unknown operation '{operation}'"
    
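    # Map each operation to its gerund; a naive "{operation}ing" would yield "divideing"
    gerund = {"add": "adding", "subtract": "subtracting", "multiply": "multiplying", "divide": "dividing"}[operation]
    return f"The result of {gerund} {a} and {b} is {result}"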

## Bind Tools to Create Agent LLM

**This is a critical step that many coding agents incorrectly skip.** Simply defining tools is not enough: they must be explicitly bound to the LLM.

### What Tool Binding Does

`llm.bind_tools(tools)` returns a **new client-side object** (`llm_with_tools`) that wraps the same model and:

1. **Receives tool schemas**: The Pydantic models and docstrings are converted to JSON schemas and sent with every request
2. **Enables structured output**: The LLM is configured to return tool calls as structured data (function name + arguments) rather than text descriptions
3. **Maintains the original**: The base `llm` remains unchanged and available for non-tool tasks

### Why This Step is Essential

**Without binding:**
- The LLM has no knowledge of available tools
- Responses will be text descriptions like "I would call the weather tool..." instead of actual function calls
- The agent cannot execute any tools

**With binding:**
- The LLM receives tool metadata with every prompt
- It can generate properly structured tool calls: `{"name": "get_weather", "args": {"location": "Boston"}}`
- The agent can execute the structured calls and return real results

### Two LLM Instances in This System

- `llm`: Base model without tools (used for evaluation)
- `llm_with_tools`: Agent model with tools bound (used for agent responses)

Both reference the same underlying model server, but `llm_with_tools` includes tool schema metadata in its configuration.

In [5]:
# Bind tools to create the AGENT version of the LLM
tools = [get_weather, calculator]
llm_with_tools = llm.bind_tools(tools)

print("✓ Tools bound to LLM")
print(f"  Available tools: {[tool.name for tool in tools]}")

✓ Tools bound to LLM
  Available tools: ['get_weather', 'calculator']


### Agent Node: call_model Function

This function serves as the agent node, responsible for generating responses and deciding when to call tools.

**System Message Design**: The system prompt provides behavioral instructions but **deliberately does not list available tools**. This is intentional: tool information is already included in `llm_with_tools` via the bound schemas. Listing tools in the system message would be redundant and could confuse the LLM with duplicate information.

**Key Instructions in System Prompt:**
- Identifies the assistant's role
- Enforces tool-first behavior: "You MUST call a tool to answer questions"
- Provides fallback response for out-of-scope queries

**Message Construction**: Combines the system message with the conversation history from `state["messages"]`, maintaining full context across turns.

**Critical LLM Selection**: Uses `llm_with_tools` (not base `llm`) to invoke the model. This ensures:
- Tool schemas are included in the request
- The model can generate structured tool calls
- Proper function calling behavior is enabled

**Tool Usage Enforcement Logic:**
1. Checks if any `ToolMessage` exists in conversation history
2. Checks if current response contains tool calls
3. If neither condition is met, overrides response with "I don't have a tool to answer that question."

This enforcement prevents the agent from attempting to answer questions without using tools, maintaining consistency with the tool-first design pattern. Once a tool has been used in the conversation, the agent can provide synthesis responses using those results.
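The enforcement rule reduces to a two-input truth table. The sketch below isolates it as a pure function so that all four cases can be inspected; `should_override` is a hypothetical helper name and does not appear in the notebook's code.

```python
def should_override(has_tool_calls: bool, tool_used_earlier: bool) -> bool:
    """Return True when the agent's reply should be replaced with the fallback text."""
    return not has_tool_calls and not tool_used_earlier


# Enumerate all four combinations of the two conditions
for has_calls in (False, True):
    for used_earlier in (False, True):
        action = "override" if should_override(has_calls, used_earlier) else "keep"
        print(f"tool_calls={has_calls!s:<5} tool_used_earlier={used_earlier!s:<5} -> {action}")
```

Only the case with no tool calls and no earlier tool use triggers the override; a final answer that follows a `ToolMessage` is kept as written.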

**Return Value**: Returns updated state with the new message appended to the conversation.

In [6]:
def call_model(state: MessagesState):
    """
    Agent node - uses llm_with_tools for tool calling.
    """
    
    system_message = """You are a helpful assistant with access to specific tools.

You MUST call a tool to answer questions. If no tool is available for the user's question, respond with:
'I don't have a tool to answer that question.'"""
    
    messages = [{"role": "system", "content": system_message}] + state["messages"]
    
    # CRITICAL: Use llm_with_tools for agent responses
    response = llm_with_tools.invoke(messages)
    
    # Check if ANY tool was called in this conversation
    tool_was_used = any(isinstance(msg, ToolMessage) for msg in state["messages"])
    
    # Only force "can't answer" if:
    # 1. No tool calls in current response AND
    # 2. No tool was used earlier in conversation
    if (not hasattr(response, 'tool_calls') or not response.tool_calls) and not tool_was_used:
        response.content = "I don't have a tool to answer that question."
    
    return {"messages": [response]}


### Build Agent Graph

**Graph Initialization**: Creates a `StateGraph` with `MessagesState` as the state schema. `MessagesState` automatically manages the conversation history, appending new messages to the `messages` list.

**Agent Node**: `add_node("agent", call_model)` registers the `call_model` function as a node. This node receives the current state, invokes the LLM, and returns updated state with the response.

**Tools Node**: `add_node("tools", ToolNode(tools))` creates a pre-built node that automatically:
- Executes tool calls from the agent's response
- Handles multiple tool calls if present
- Returns `ToolMessage` objects with execution results
- Manages errors if tool execution fails
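Conceptually, a tool-execution node looks up each requested tool by name, calls it with the supplied arguments, and pairs the result with the call's ID. The following is a simplified standard-library sketch of that idea, not `ToolNode`'s actual implementation; `run_tool_calls`, the registry dictionary, and the result record shape are assumptions made for illustration.

```python
def get_weather(location: str) -> str:
    """Stand-in tool that returns a canned string."""
    return f"Current weather in {location}: sunny"


def run_tool_calls(tool_calls: list, registry: dict) -> list:
    """Execute each structured call and return one result record per call."""
    results = []
    for call in tool_calls:
        func = registry.get(call["name"])
        if func is None:
            content = f"Error: unknown tool '{call['name']}'"
        else:
            try:
                content = func(**call["args"])
            except Exception as exc:  # report failures instead of crashing the loop
                content = f"Error: {exc}"
        results.append({"tool_call_id": call["id"], "content": content})
    return results


calls = [
    {"name": "get_weather", "args": {"location": "Boston"}, "id": "call_1"},
    {"name": "get_time", "args": {}, "id": "call_2"},  # no such tool registered
]
results = run_tool_calls(calls, {"get_weather": get_weather})
for record in results:
    print(record)
```

Returning errors as message content, rather than raising, gives the agent a chance to read the failure and respond to it on its next turn.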

**Conditional Routing**: `add_conditional_edges()` examines the last message in state and routes based on presence of tool calls:
- If `tool_calls` exists → route to "tools" node for execution
- If no `tool_calls` → route to `END` to complete the workflow

This lambda function determines the graph's flow: tools needed vs. final answer ready.

**Return Edge**: `add_edge("tools", "agent")` creates an unconditional edge from tools back to agent. After tools execute, control always returns to the agent to synthesize results into a natural language response.

**Entry Point**: `add_edge(START, "agent")` defines where execution begins. Every invocation starts at the agent node.

**Compilation**: `compile()` validates the graph structure and creates an executable workflow. The compiled graph can be invoked with initial state.

**Graph Flow Summary**: `START` → `agent` → (conditional) → `tools` → `agent` → (conditional) → `END`
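The flow can be traced without a model or LangGraph by substituting scripted stand-ins for the two nodes. In this sketch, messages are plain dictionaries, `END` is a local placeholder string rather than `langgraph.graph.END`, and `scripted_agent` and `scripted_tools` are hypothetical stubs; only `route_after_agent` mirrors the routing rule described above.

```python
END = "END"  # placeholder sentinel for this sketch only


def route_after_agent(state: dict) -> str:
    """Named equivalent of the routing rule: 'tools' if the last message has tool calls."""
    last = state["messages"][-1]
    return "tools" if last.get("tool_calls") else END


def scripted_agent(state: dict) -> dict:
    """Stub agent: requests a tool on the first pass, then answers from the tool result."""
    if any(m["role"] == "tool" for m in state["messages"]):
        return {"role": "ai", "content": "It is sunny in Boston.", "tool_calls": []}
    call = {"name": "get_weather", "args": {"location": "Boston"}}
    return {"role": "ai", "content": "", "tool_calls": [call]}


def scripted_tools(state: dict) -> dict:
    """Stub tools node: returns a canned result for the requested call."""
    return {"role": "tool", "content": "Current weather in Boston: sunny"}


state = {"messages": [{"role": "user", "content": "What's the weather in Boston?"}]}
trace = ["START"]
node = "agent"
while node != END:
    trace.append(node)
    if node == "agent":
        state["messages"].append(scripted_agent(state))
        node = route_after_agent(state)  # conditional edge
    else:
        state["messages"].append(scripted_tools(state))
        node = "agent"  # unconditional edge back to the agent
trace.append(END)
print(" -> ".join(trace))
```

Writing the router as a named function instead of a lambda can also make the graph easier to test in isolation.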

### Important: Watch for Outdated Syntax

**Warning for users working with AI coding assistants**: Many LLM-based coding agents (including older versions of Claude, ChatGPT, and others) will generate LangGraph code using the **legacy API** with `set_entry_point()` and `"__end__"` instead of the current `START` and `END` keywords.

**Legacy syntax (outdated):**
```python
graph_builder.set_entry_point("agent")  # Old way
lambda x: "tools" if x["messages"][-1].tool_calls else "__end__"  # Old way
```

**Current syntax (used in this notebook):**
```python
graph_builder.add_edge(START, "agent")  # Current way
lambda x: "tools" if x["messages"][-1].tool_calls else END  # Current way
```

**Why this matters**: The legacy syntax generally still works, but current LangGraph documentation and examples use `START` and `END`.

**What to do**: When copying graph-building code from AI assistants, check for `set_entry_point()` and `"__end__"` and update them to the current API shown in this notebook.

In [7]:
# Build graph
graph_builder = StateGraph(MessagesState)
graph_builder.add_node("agent", call_model)
graph_builder.add_node("tools", ToolNode(tools))
graph_builder.add_conditional_edges(
    "agent",
    lambda x: "tools" if x["messages"][-1].tool_calls else END
)
graph_builder.add_edge("tools", "agent")
graph_builder.add_edge(START, "agent")
graph = graph_builder.compile()
print("✓ Agent graph compiled")

✓ Agent graph compiled


## Hybrid Evaluator Functions

This evaluator uses **the same LLM instance** for assessment but without tools bound, implementing a "hybrid" approach that addresses GPU memory constraints.

### Why "Hybrid"?

The term "hybrid" refers to using one physical model instance in two different modes:
- **Agent mode**: `llm_with_tools` (with tool schemas)
- **Evaluator mode**: `llm` (raw model, no tools)

This allows semantic evaluation without requiring a second model to be loaded into GPU memory.

### Evaluation Strategy

Rather than brittle keyword matching (e.g., checking if response contains "multiply"), the evaluator uses the LLM's semantic understanding to assess:

1. **Tool Selection Correctness**: Did the agent choose appropriate tools for the query?
2. **Response Quality**: Is the final answer clear, complete, and well-formatted?
3. **Overall Success**: Does the response successfully address the user's question?

### How It Works

The `evaluate_response` function:
- Constructs a structured evaluation prompt with available tools, tools used, and final answer
- Invokes the **raw LLM** (without tool bindings) to perform assessment
- Parses the LLM's evaluation into structured fields
- Returns a dictionary with ratings and reasoning for each criterion

This approach provides nuanced evaluation that understands semantic variations (e.g., "multiplied" vs "multiply" vs "times") without fragile string matching.
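The parsing step is where a free-text reply becomes structured data, so it is worth testing on its own. The sketch below is one possible table-driven variant that also rejects ratings outside the rubric. `parse_evaluation`, `LABELS`, and `ALLOWED` are illustrative names, and the sample text is hand-written for the example rather than real model output.

```python
ALLOWED = {
    "tool_selection": {"correct", "incorrect", "missing"},
    "quality": {"good", "fair", "poor"},
    "overall": {"pass", "fail"},
}
LABELS = {
    "TOOL_SELECTION": "tool_selection",
    "TOOL_REASONING": "tool_reasoning",
    "QUALITY": "quality",
    "QUALITY_REASONING": "quality_reasoning",
    "OVERALL": "overall",
    "OVERALL_REASONING": "overall_reasoning",
}


def parse_evaluation(text: str) -> dict:
    """Parse 'LABEL: value' lines into a dict, leaving unparsed fields as None."""
    result = {key: None for key in LABELS.values()}
    for line in text.splitlines():
        # Split on the first colon only, so reasoning text may contain colons
        label, sep, value = line.strip().partition(":")
        key = LABELS.get(label.strip().upper())
        if not sep or key is None:
            continue
        value = value.strip()
        if key in ALLOWED:
            value = value.lower()
            if value not in ALLOWED[key]:
                value = None  # reject ratings outside the rubric
        result[key] = value
    return result


sample = """TOOL_SELECTION: Correct
TOOL_REASONING: Right tool: the weather tool matched the question.
QUALITY: good
OVERALL: maybe
OVERALL_REASONING: Handled well."""
parsed = parse_evaluation(sample)
print(parsed)
```

Treating an out-of-rubric rating as `None` makes a malformed evaluation show up as a parse failure instead of silently counting as a fail.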

In [8]:
def evaluate_response(question: str, final_answer: str, tool_calls: list, messages: list, available_tools: list) -> dict:
    """
    LLM-based evaluation of agent performance.
    Uses the LLM to assess all aspects of the response.
    """
    
    # Build tool descriptions for context
    tool_descriptions = "\n".join([
        f"- {tool.name}: {tool.description}"
        for tool in available_tools
    ])
    
    tools_used = [tc['name'] for tc in tool_calls] if tool_calls else []
    
    eval_prompt = f"""You are evaluating an AI agent's performance.

AVAILABLE TOOLS:
{tool_descriptions}

USER QUESTION: {question}

TOOLS USED BY AGENT: {tools_used if tools_used else "None"}

AGENT'S FINAL ANSWER: {final_answer}

Evaluate the agent's performance on these criteria:

1. TOOL SELECTION: Did the agent correctly identify whether tools were needed? If tools were needed, did it select the appropriate tool(s)?
   - "correct" if tool usage was appropriate
   - "incorrect" if wrong tool was used or tools were used unnecessarily
   - "missing" if tools were needed but not used

2. RESPONSE QUALITY: Is the final answer clear, complete, and appropriately formatted?
   - "good" if answer is clear, complete, and well-formatted
   - "fair" if answer is acceptable but could be improved
   - "poor" if answer is unclear, incomplete, or poorly formatted

3. OVERALL: Does the response successfully address the user's question?
   - "pass" if the agent handled the question correctly
   - "fail" if there were significant errors in tool usage or response quality

Respond in this EXACT format:
TOOL_SELECTION: correct/incorrect/missing
TOOL_REASONING: <brief explanation of tool selection>
QUALITY: good/fair/poor
QUALITY_REASONING: <brief explanation of response quality>
OVERALL: pass/fail
OVERALL_REASONING: <brief summary>
"""
    
    # Use raw LLM for evaluation
    eval_response = llm.invoke(eval_prompt)
    eval_text = eval_response.content
    
    # Parse response
    result = {
        'tool_selection': None,
        'tool_reasoning': None,
        'quality': None,
        'quality_reasoning': None,
        'overall': None,
        'overall_reasoning': None,
        'raw_evaluation': eval_text
    }
    
    for line in eval_text.strip().split('\n'):
        line = line.strip()
        if line.startswith('TOOL_SELECTION:'):
            result['tool_selection'] = line.replace('TOOL_SELECTION:', '').strip().lower()
        elif line.startswith('TOOL_REASONING:'):
            result['tool_reasoning'] = line.replace('TOOL_REASONING:', '').strip()
        elif line.startswith('QUALITY:'):
            result['quality'] = line.replace('QUALITY:', '').strip().lower()
        elif line.startswith('QUALITY_REASONING:'):
            result['quality_reasoning'] = line.replace('QUALITY_REASONING:', '').strip()
        elif line.startswith('OVERALL:'):
            result['overall'] = line.replace('OVERALL:', '').strip().lower()
        elif line.startswith('OVERALL_REASONING:'):
            result['overall_reasoning'] = line.replace('OVERALL_REASONING:', '').strip()
    
    # Validate that parsing succeeded
    if not all([result['tool_selection'], result['quality'], result['overall']]):
        print("WARNING: Failed to parse evaluation response:")
        print(eval_text)
    
    return result


## Test Agent with Full Evaluation

This section runs the complete agent workflow and evaluation in a single pass, showing both the internal message flow and the evaluation results.

**Test Cases**: Three questions that demonstrate different agent behaviors:
1. Weather query - requires `get_weather` tool
2. Math query - requires `calculator` tool  
3. General knowledge - no applicable tool (should decline gracefully)

**Output Structure**: For each test case, displays:

1. **Message Flow**: Shows the complete conversation with message types:
   - `HumanMessage`: User's question
   - `AIMessage` (with tool_calls): Agent's decision to call a tool
   - `ToolMessage`: Tool execution result
   - `AIMessage` (final): Agent's synthesized natural language response

2. **Evaluation Results**: LLM-based assessment including:
   - Tool Selection: Whether correct tools were chosen
   - Response Quality: Clarity and completeness of final answer
   - Overall Assessment: Pass/fail determination with reasoning

**Why Combined Output**: Running both together ensures consistency: the evaluation analyzes the exact same execution that generated the message flow, eliminating any discrepancies from separate runs.
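With more than a handful of test cases, the per-question evaluation dictionaries can be rolled up into a short summary. The helper below is a hypothetical addition, not part of the notebook; it assumes each dictionary has an `overall` key holding `"pass"`, `"fail"`, or `None`, and the input list is made-up illustration data.

```python
from collections import Counter


def summarize(evaluations: list) -> dict:
    """Count pass/fail/unparsed verdicts and compute the pass rate."""
    counts = Counter()
    for evaluation in evaluations:
        verdict = evaluation.get("overall")
        # Anything other than an explicit pass or fail counts as unparsed
        counts[verdict if verdict in ("pass", "fail") else "unparsed"] += 1
    total = len(evaluations)
    return {
        "total": total,
        "passed": counts["pass"],
        "failed": counts["fail"],
        "unparsed": counts["unparsed"],
        "pass_rate": counts["pass"] / total if total else 0.0,
    }


# Made-up verdicts for illustration only
example = [{"overall": "pass"}, {"overall": "fail"}, {"overall": None}, {"overall": "pass"}]
summary = summarize(example)
print(summary)
```

Keeping unparsed verdicts in a separate bucket distinguishes evaluator formatting problems from genuine agent failures.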

In [9]:
# Test cases
test_cases = [
    "What's the weather in Boston?",
    "What is 25 multiplied by 4?",
    "What is the capital of France?"
]

print("="*70)
print("AGENT TESTING WITH EVALUATION")
print("Using single LLM (Hermes-3-Llama-3.1-8B on port 8082)")
print("  - Agent uses llm_with_tools")
print("  - Evaluator uses raw llm")
print("="*70)

for question in test_cases:
    print(f"\n{'='*70}")
    print(f"Question: {question}")
    print('='*70)
    
    # Run the agent
    result = graph.invoke({"messages": [("user", question)]})
    
    # Display message flow
    print("\n🔄 MESSAGE FLOW:")
    print("-"*70)
    for msg in result["messages"]:
        msg_type = type(msg).__name__
        print(f"\nType: {msg_type}")
        print(f"Content: {msg.content}")
        if hasattr(msg, 'tool_calls') and msg.tool_calls:
            print(f"Tool calls: {msg.tool_calls}")
    
    # Get final answer and tool calls for evaluation
    final_answer = result["messages"][-1].content
    
    tool_calls = []
    for msg in result["messages"]:
        if hasattr(msg, 'tool_calls') and msg.tool_calls:
            tool_calls.extend(msg.tool_calls)
    
    # Evaluate
    evaluation = evaluate_response(question, final_answer, tool_calls, result["messages"], tools)
    
    print(f"\n{'='*70}")
    print("📊 EVALUATION RESULTS:")
    print('='*70)
    print(f"Overall: {'✅ PASS' if evaluation['overall'] == 'pass' else '❌ FAIL'}")
    
    print("\n🔧 Tool Selection:")
    print(f"   Status: {evaluation['tool_selection']}")
    print(f"   Reasoning: {evaluation['tool_reasoning']}")
    
    print("\n⭐ Response Quality:")
    print(f"   Quality: {evaluation['quality']}")
    print(f"   Reasoning: {evaluation['quality_reasoning']}")
    
    print("\n📝 Overall Assessment:")
    print(f"   {evaluation['overall_reasoning']}")
    print()

AGENT TESTING WITH EVALUATION
Using single LLM (Hermes-3-Llama-3.1-8B on port 8082)
  - Agent uses llm_with_tools
  - Evaluator uses raw llm

Question: What's the weather in Boston?

🔄 MESSAGE FLOW:
----------------------------------------------------------------------

Type: HumanMessage
Content: What's the weather in Boston?

Type: AIMessage
Content: 
Tool calls: [{'name': 'get_weather', 'args': {'location': 'Boston'}, 'id': 'chatcmpl-tool-ec5485a7d5f1498fb6277a190582559a', 'type': 'tool_call'}]

Type: ToolMessage
Content: Current weather in Boston: Temperature is 72°F, conditions are sunny with clear skies

Type: AIMessage
Content: The current temperature in Boston is 72°F and the weather conditions are sunny with clear skies.

📊 EVALUATION RESULTS:
Overall: ✅ PASS

🔧 Tool Selection:
   Status: correct
   Reasoning: The agent correctly identified that the 'get_weather' tool was needed to retrieve the weather information for Boston.

⭐ Response Quality:
   Quality: goo