<a href="https://colab.research.google.com/github/dimitarpg13/agentic_architectures_and_design_patterns/blob/main/notebooks/observability/braintrust_agentic_observability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agentic System Observability with Braintrust

This notebook demonstrates how to integrate **Braintrust** observability into an agentic AI system. We'll build a research assistant agent that uses multiple tools and track its behavior using Braintrust's logging and evaluation capabilities.

## What You'll Learn

1. Setting up Braintrust for agent observability
2. Creating a multi-step agent with tools
3. Tracking agent decisions, tool calls, and reasoning
4. Evaluating agent performance with custom metrics
5. Analyzing agent behavior through Braintrust's dashboard

## Architecture

```
User Query → Agent (with Braintrust tracking)
              ↓
         Tool Selection
              ↓
    ┌─────────┴─────────┐
    ↓         ↓         ↓
  Search   Calculator  Wikipedia
    ↓         ↓         ↓
    └─────────┬─────────┘
              ↓
         Synthesis
              ↓
    Final Response (logged to Braintrust)
```

## Setup and Installation

In [None]:
# Install required packages
!pip install braintrust anthropic langgraph langchain langchain-anthropic wikipedia-api python-dotenv --quiet

In [None]:
import os
from datetime import datetime
from typing import TypedDict, Annotated, List, Dict, Any
import json

# Braintrust imports
import braintrust
from braintrust import current_span, traced

# LangChain and LangGraph imports
from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

# Tool imports
import wikipediaapi
import operator

print("✓ All imports successful")

## 1. Configure Braintrust

Initialize Braintrust with your API key. Get your key from: https://www.braintrust.dev/

In [None]:
# Set your API keys
os.environ["BRAINTRUST_API_KEY"] = "your-braintrust-api-key-here"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key-here"

# Initialize Braintrust project
PROJECT_NAME = "agentic-research-assistant"

print(f"✓ Braintrust configured for project: {PROJECT_NAME}")
print(f"✓ View results at: https://www.braintrust.dev/app/{PROJECT_NAME}")
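
Hardcoding keys in a notebook cell makes it easy to commit them by accident. A minimal sketch of an alternative, assuming the keys are already exported in the environment before the notebook starts; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration, not part of the Braintrust or Anthropic SDKs:

```python
import os

def require_env(name: str, placeholder_prefix: str = "your-") -> str:
    """Return an environment variable, rejecting unset or placeholder values."""
    value = os.environ.get(name, "")
    if not value or value.startswith(placeholder_prefix):
        raise RuntimeError(f"{name} is not set; export it before running this notebook")
    return value

# Demonstration with a throwaway variable (hypothetical, for illustration only)
os.environ["DEMO_API_KEY"] = "sk-demo-123"
print(require_env("DEMO_API_KEY"))
```

The same check could guard `BRAINTRUST_API_KEY` and `ANTHROPIC_API_KEY` so that a missing key fails early with a clear message instead of at the first API call.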

## 2. Define Agent Tools

Create tools that the agent can use, instrumented with Braintrust tracking.

In [None]:
class AgentTools:
    """Collection of tools available to the agent, instrumented with Braintrust."""
    
    def __init__(self):
        self.wiki = wikipediaapi.Wikipedia(
            user_agent='ResearchAgent/1.0',
            language='en'
        )
    
    @traced
    def search_wikipedia(self, query: str) -> Dict[str, Any]:
        """Search Wikipedia for information."""
        # Log inputs to Braintrust
        current_span().log(input={"query": query, "tool": "wikipedia"})
        
        try:
            page = self.wiki.page(query)
            
            if page.exists():
                result = {
                    "title": page.title,
                    "summary": page.summary[:500],  # First 500 chars
                    "url": page.fullurl,
                    "success": True
                }
            else:
                result = {
                    "error": "Page not found",
                    "success": False
                }
            
            # Log outputs to Braintrust
            current_span().log(
                output=result,
                metadata={"query_length": len(query)}
            )
            
            return result
            
        except Exception as e:
            error_result = {"error": str(e), "success": False}
            current_span().log(output=error_result)
            return error_result
    
    @traced
    def calculator(self, expression: str) -> Dict[str, Any]:
        """Evaluate mathematical expressions safely."""
        current_span().log(input={"expression": expression, "tool": "calculator"})
        
        try:
            # Restricted namespace: this narrows what eval() can reach but is not a
            # real sandbox, so do not expose this tool to untrusted input as-is
            allowed_names = {"abs": abs, "round": round, "min": min, "max": max}
            result = eval(expression, {"__builtins__": {}}, allowed_names)
            
            output = {
                "result": result,
                "success": True
            }
            
            current_span().log(
                output=output,
                metadata={"expression_length": len(expression)}
            )
            
            return output
            
        except Exception as e:
            error_result = {"error": str(e), "success": False}
            current_span().log(output=error_result)
            return error_result
    
    @traced
    def web_search_simulator(self, query: str) -> Dict[str, Any]:
        """Simulated web search (in production, use a real search API)."""
        current_span().log(input={"query": query, "tool": "web_search"})
        
        # Simulated results
        result = {
            "results": [
                {"title": f"Result 1 for {query}", "snippet": "Relevant information..."},
                {"title": f"Result 2 for {query}", "snippet": "More details..."}
            ],
            "success": True
        }
        
        current_span().log(
            output=result,
            metadata={"num_results": len(result["results"])}
        )
        
        return result

tools = AgentTools()
print("✓ Agent tools initialized with Braintrust tracing")
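
The calculator tool above relies on `eval()` with an emptied `__builtins__`, which is a known weak sandbox. One possible hardening, sketched here under the assumption that only arithmetic and the same four helper functions are needed, is to walk the parsed AST and allow only whitelisted nodes. `safe_eval` is a hypothetical helper, not part of the notebook's `AgentTools` class:

```python
import ast
import operator

# Whitelisted operators and functions; anything else is rejected
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_FUNCS = {"abs": abs, "round": round, "min": min, "max": max}

def safe_eval(expression: str):
    """Evaluate an arithmetic expression by walking its AST instead of calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[_eval(arg) for arg in node.args])
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_eval("0.15 * 8500"))
print(round(safe_eval("10000 * (1 + 0.05) ** 10"), 2))
```

This sketch still permits very large exponents (for example `9 ** 9 ** 9`), so a production version would likely also bound operand sizes or run with a timeout.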

## 3. Define Agent State and Graph

Create a LangGraph-based agent with state management and tool execution.

In [None]:
class AgentState(TypedDict):
    """State of the research agent."""
    messages: Annotated[List, add_messages]
    iterations: int
    tool_calls: List[Dict[str, Any]]
    final_answer: str

# Initialize the LLM
llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    temperature=0.7,
    max_tokens=2000
)

# Tool definitions for Claude
tool_definitions = [
    {
        "name": "search_wikipedia",
        "description": "Search Wikipedia for factual information on any topic.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query or topic"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "calculator",
        "description": "Evaluate mathematical expressions. Supports +, -, *, /, **, and common functions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Mathematical expression to evaluate"
                }
            },
            "required": ["expression"]
        }
    },
    {
        "name": "web_search_simulator",
        "description": "Search the web for current information (simulated for demo).",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query"
                }
            },
            "required": ["query"]
        }
    }
]

print("✓ Agent state and LLM configured")
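
The `Annotated[List, add_messages]` annotation in `AgentState` tells LangGraph to merge the `messages` key with a reducer rather than overwrite it, while the plain keys are replaced by whatever a node returns. A simplified plain-Python sketch of that merge behavior follows; `append_messages` and `apply_update` are hypothetical stand-ins, and the real `add_messages` also handles message IDs and type coercion, which this sketch omits:

```python
from typing import Any, Callable, Dict, List

def append_messages(existing: List[Any], new: List[Any]) -> List[Any]:
    """Simplified stand-in for add_messages: concatenate without mutating inputs."""
    return list(existing) + list(new)

# Keys listed here are merged with a reducer; all other keys are overwritten
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {"messages": append_messages}

def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a node's partial return value into the current state."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS:
            merged[key] = REDUCERS[key](state.get(key, []), value)
        else:
            merged[key] = value
    return merged

state = {"messages": ["user: hi"], "iterations": 0, "tool_calls": [], "final_answer": ""}
state = apply_update(state, {"messages": ["ai: hello"], "iterations": 1})
print(state["messages"], state["iterations"])
```

This is why a node can return only the keys it changed: untouched keys such as `tool_calls` carry over unchanged.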

## 4. Implement Agent Nodes

Define the agent's reasoning, tool execution, and synthesis nodes.

In [None]:
@traced(name="agent_reasoning")
def agent_node(state: AgentState) -> AgentState:
    """Main reasoning node - decides what to do next."""
    messages = state["messages"]
    
    # Log the current state to Braintrust
    current_span().log(
        input={
            "iteration": state["iterations"],
            "message_count": len(messages),
            "tool_calls_so_far": len(state.get("tool_calls", []))
        }
    )
    
    # Call LLM with tools
    response = llm.invoke(
        messages,
        tools=tool_definitions
    )
    
    # Log LLM decision
    current_span().log(
        output={
            "has_tool_calls": hasattr(response, 'tool_calls') and len(response.tool_calls) > 0,
            "response_type": response.__class__.__name__
        },
        metadata={
            "model": "claude-sonnet-4-20250514",
            "iteration": state["iterations"]
        }
    )
    
    return {
        "messages": [response],
        "iterations": state["iterations"] + 1
    }

@traced(name="tool_execution")
def tool_node(state: AgentState) -> AgentState:
    """Execute tools requested by the agent."""
    messages = state["messages"]
    last_message = messages[-1]
    
    # Copy the list so the incoming state is not mutated in place
    tool_calls = list(state.get("tool_calls", []))
    tool_messages = []
    
    # Execute each tool call
    if hasattr(last_message, 'tool_calls'):
        for tool_call in last_message.tool_calls:
            tool_name = tool_call["name"]
            # LangChain normalizes tool calls to {"name", "args", "id"}
            tool_input = tool_call["args"]
            tool_id = tool_call["id"]
            
            # Log tool execution start
            current_span().log(
                input={
                    "tool_name": tool_name,
                    "tool_input": tool_input
                }
            )
            
            # Execute the tool
            if tool_name == "search_wikipedia":
                result = tools.search_wikipedia(tool_input["query"])
            elif tool_name == "calculator":
                result = tools.calculator(tool_input["expression"])
            elif tool_name == "web_search_simulator":
                result = tools.web_search_simulator(tool_input["query"])
            else:
                result = {"error": f"Unknown tool: {tool_name}"}
            
            # Create tool message
            tool_message = ToolMessage(
                content=json.dumps(result),
                tool_call_id=tool_id
            )
            tool_messages.append(tool_message)
            
            # Track tool call
            tool_calls.append({
                "tool": tool_name,
                "input": tool_input,
                "output": result,
                "timestamp": datetime.now().isoformat()
            })
    
    # Log tool execution summary
    current_span().log(
        output={
            "tools_executed": len(tool_messages),
            "total_tool_calls": len(tool_calls)
        }
    )
    
    return {
        "messages": tool_messages,
        "tool_calls": tool_calls
    }

def should_continue(state: AgentState) -> str:
    """Decide whether to continue or end."""
    messages = state["messages"]
    last_message = messages[-1]
    
    # Stop once the iteration budget is spent, even if the LLM asks for more tools
    if state["iterations"] >= 5:
        return "end"
    
    # If the LLM made a tool call, execute tools
    if getattr(last_message, "tool_calls", None):
        return "continue"
    
    # Otherwise, we have a final answer
    return "end"

print("✓ Agent nodes defined with Braintrust tracing")
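
The `if`/`elif` chain in `tool_node` grows with every new tool. One common alternative is a dispatch table keyed by tool name. The sketch below uses stub functions in place of the real `AgentTools` methods; `TOOL_REGISTRY`, `dispatch`, and the `fake_*` functions are hypothetical names for illustration:

```python
from typing import Any, Callable, Dict

# Stub tools standing in for AgentTools methods (hypothetical, for illustration)
def fake_wikipedia(query: str) -> Dict[str, Any]:
    return {"title": query, "success": True}

def fake_calculator(expression: str) -> Dict[str, Any]:
    return {"echo": expression, "success": True}

TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "search_wikipedia": fake_wikipedia,
    "calculator": fake_calculator,
}

def dispatch(tool_name: str, tool_args: Dict[str, Any]) -> Dict[str, Any]:
    """Look up a tool by name and call it with keyword arguments."""
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return {"error": f"Unknown tool: {tool_name}", "success": False}
    try:
        return handler(**tool_args)
    except TypeError as exc:
        # Raised for missing or unexpected arguments; note this would also
        # catch a TypeError raised inside the tool itself
        return {"error": f"Bad arguments for {tool_name}: {exc}", "success": False}

print(dispatch("search_wikipedia", {"query": "Paris"}))
print(dispatch("unknown_tool", {}))
```

Because the argument names come straight from the tool's input schema, adding a tool means adding one registry entry rather than another branch.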

## 5. Build the Agent Graph

In [None]:
# Create the graph
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)

# Set entry point
workflow.set_entry_point("agent")

# Add conditional edges
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {
        "continue": "tools",
        "end": END
    }
)

# Tools always go back to agent
workflow.add_edge("tools", "agent")

# Compile the graph
agent_graph = workflow.compile()

print("✓ Agent graph compiled successfully")
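
The compiled graph is, in effect, a loop: run the agent, route on its output, run the tools, and return to the agent. The following plain-Python sketch mirrors that control flow with stub nodes so the routing can be followed without any API calls. `run_loop` and the `stub_*` functions are hypothetical, the step cap is an arbitrary value chosen for the sketch, and state merging is simplified to a dictionary update:

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

def run_loop(agent_step: Callable[[State], State],
             tool_step: Callable[[State], State],
             route: Callable[[State], str],
             state: State,
             max_steps: int = 25) -> State:
    """Alternate between the agent and tools nodes until the router says 'end'."""
    node = "agent"
    for _ in range(max_steps):
        if node == "agent":
            state = {**state, **agent_step(state)}
            node = "tools" if route(state) == "continue" else "END"
        elif node == "tools":
            state = {**state, **tool_step(state)}
            node = "agent"  # tools always hand control back to the agent
        if node == "END":
            return state
    raise RuntimeError("Step limit reached without finishing")

# Stub nodes: the agent asks for one tool call, then answers
def stub_agent(state: State) -> State:
    wants_tool = not state["tool_results"]
    return {"pending_tool": wants_tool, "iterations": state["iterations"] + 1}

def stub_tools(state: State) -> State:
    return {"tool_results": state["tool_results"] + ["result"], "pending_tool": False}

def stub_route(state: State) -> str:
    return "continue" if state["pending_tool"] else "end"

final = run_loop(stub_agent, stub_tools, stub_route,
                 {"iterations": 0, "tool_results": [], "pending_tool": False})
print(final["iterations"], final["tool_results"])
```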

## 6. Run Agent with Braintrust Logging

Execute the agent and track everything in Braintrust.

In [None]:
@traced(name="research_agent_execution")
def run_research_agent(query: str) -> Dict[str, Any]:
    """Run the research agent with full Braintrust tracking."""
    
    # Log the input query
    current_span().log(input={"query": query})
    
    # Initialize state
    initial_state = {
        "messages": [
            HumanMessage(content=f"""You are a research assistant. Answer the following question using the tools available to you.
            
Question: {query}

Think step by step:
1. What information do you need?
2. Which tools should you use?
3. How will you synthesize the results?

Provide a comprehensive answer.""")
        ],
        "iterations": 0,
        "tool_calls": [],
        "final_answer": ""
    }
    
    # Run the agent
    final_state = agent_graph.invoke(initial_state)
    
    # Extract final answer
    final_message = final_state["messages"][-1]
    final_answer = final_message.content if hasattr(final_message, 'content') else str(final_message)
    
    # Prepare result
    result = {
        "query": query,
        "answer": final_answer,
        "iterations": final_state["iterations"],
        "tool_calls": final_state.get("tool_calls", []),
        "num_messages": len(final_state["messages"])
    }
    
    # Log comprehensive output to Braintrust
    current_span().log(
        output=result,
        metadata={
            "query_length": len(query),
            "answer_length": len(final_answer),
            "tools_used": len(set(tc["tool"] for tc in result["tool_calls"]))
        },
        metrics={
            "iterations": result["iterations"],
            "tool_calls": len(result["tool_calls"]),
            "message_count": result["num_messages"]
        }
    )
    
    return result

print("✓ Agent execution function ready")

## 7. Test the Agent with Different Queries

Run multiple queries and see how Braintrust tracks everything.

In [None]:
# Test queries
test_queries = [
    "What is the capital of France and what's its population?",
    "Calculate the compound interest on $10,000 at 5% annual rate for 10 years using the formula A = P(1 + r)^t",
    "Tell me about the Eiffel Tower and when it was built"
]

# Run experiments with Braintrust
def run_experiments():
    """Run multiple experiments and log to Braintrust."""
    
    with braintrust.init(project=PROJECT_NAME) as bt:
        for i, query in enumerate(test_queries):
            print(f"\n{'='*60}")
            print(f"Query {i+1}: {query}")
            print(f"{'='*60}")
            
            # Run the agent
            result = run_research_agent(query)
            
            # Log to Braintrust experiment
            bt.log(
                input={"query": query},
                output={"answer": result["answer"]},
                metadata={
                    "iterations": result["iterations"],
                    "tool_calls": len(result["tool_calls"]),
                    "tools_used": [tc["tool"] for tc in result["tool_calls"]]
                },
                metrics={
                    "iterations": result["iterations"],
                    "tool_count": len(result["tool_calls"])
                },
                scores={
                    "efficiency": 1.0 / max(result["iterations"], 1)  # Lower iterations = higher efficiency
                }
            )
            
            # Display results
            print(f"\n📝 Answer: {result['answer'][:200]}...")
            print(f"\n📊 Metrics:")
            print(f"  - Iterations: {result['iterations']}")
            print(f"  - Tool calls: {len(result['tool_calls'])}")
            print(f"  - Tools used: {', '.join(set(tc['tool'] for tc in result['tool_calls']))}")

# Run if API keys are set
if os.getenv("BRAINTRUST_API_KEY", "").startswith("your-"):
    print("⚠️  Please set your API keys in the configuration cell above before running experiments.")
else:
    print("Running experiments with Braintrust tracking...\n")
    run_experiments()
    print(f"\n✅ All experiments complete! View results at: https://www.braintrust.dev/app/{PROJECT_NAME}")

## 8. Custom Evaluation Metrics

Define custom evaluators for agent performance.

In [None]:
from braintrust import Eval

def evaluate_answer_quality(output: str, expected: str = None) -> float:
    """Simple quality metric based on answer length and structure."""
    if not output:
        return 0.0
    
    score = 0.0
    
    # Length check (reasonable answers are 50-1000 chars)
    if 50 <= len(output) <= 1000:
        score += 0.3
    
    # Contains multiple sentences
    if output.count('.') >= 2:
        score += 0.3
    
    # Not just an error message
    if 'error' not in output.lower():
        score += 0.2
    
    # Contains at least one digit (a rough proxy for concrete data)
    if any(char.isdigit() for char in output):
        score += 0.2
    
    return min(score, 1.0)

def evaluate_efficiency(iterations: int, tool_calls: int) -> float:
    """Evaluate how efficiently the agent solved the problem."""
    # Perfect score for 1-2 iterations with minimal tools
    efficiency = 1.0
    
    if iterations > 2:
        efficiency -= (iterations - 2) * 0.1
    
    if tool_calls > 3:
        efficiency -= (tool_calls - 3) * 0.05
    
    return max(efficiency, 0.0)

def evaluate_tool_selection(tool_calls: List[Dict], query: str) -> float:
    """Evaluate whether appropriate tools were selected."""
    query_lower = query.lower()
    tools_used = set(tc["tool"] for tc in tool_calls)
    
    score = 0.5  # Base score, awarded whether or not any tool was called
    
    # Check if appropriate tools were used
    if "calculate" in query_lower or "math" in query_lower:
        if "calculator" in tools_used:
            score += 0.5
    
    if any(word in query_lower for word in ["what is", "tell me about", "who is"]):
        if "search_wikipedia" in tools_used:
            score += 0.3
    
    return min(score, 1.0)

print("✓ Custom evaluation metrics defined")
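
The keyword heuristic in `evaluate_tool_selection` guesses the right tool from the query text. When a test case already lists the tools it expects, one alternative is to compare the two sets directly. A minimal sketch, assuming tool calls are dictionaries with a `"tool"` key as in `tool_node`; `score_expected_tools` is a hypothetical helper:

```python
from typing import Dict, Iterable, List

def score_expected_tools(tool_calls: List[Dict],
                         expected_tools: Iterable[str]) -> Dict[str, float]:
    """Compare the set of tools used against the set of tools expected."""
    used = {tc["tool"] for tc in tool_calls}
    expected = set(expected_tools)
    if not used and not expected:
        # No tools were needed and none were called
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    hits = len(used & expected)
    precision = hits / len(used) if used else 0.0
    recall = hits / len(expected) if expected else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

calls = [{"tool": "calculator"}, {"tool": "search_wikipedia"}]
print(score_expected_tools(calls, ["calculator"]))
```

Precision penalizes unnecessary tools, recall penalizes missing ones, and all three values fall in the 0 to 1 range, so they can be logged as scores alongside the others.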

## 9. Run Comprehensive Evaluation

Evaluate agent performance across multiple dimensions.

In [None]:
def run_evaluation():
    """Run comprehensive evaluation with custom metrics."""
    
    evaluation_queries = [
        {
            "query": "What is machine learning?",
            "expected_tools": ["search_wikipedia"]
        },
        {
            "query": "Calculate 15% of 8500",
            "expected_tools": ["calculator"]
        },
        {
            "query": "Who invented the telephone and when?",
            "expected_tools": ["search_wikipedia"]
        }
    ]
    
    with braintrust.init(project=PROJECT_NAME, experiment="comprehensive-eval") as bt:
        for test_case in evaluation_queries:
            query = test_case["query"]
            
            # Run agent
            result = run_research_agent(query)
            
            # Calculate scores
            quality_score = evaluate_answer_quality(result["answer"])
            efficiency_score = evaluate_efficiency(
                result["iterations"],
                len(result["tool_calls"])
            )
            tool_selection_score = evaluate_tool_selection(
                result["tool_calls"],
                query
            )
            
            # Overall score (weighted average)
            overall_score = (
                quality_score * 0.5 +
                efficiency_score * 0.3 +
                tool_selection_score * 0.2
            )
            
            # Log to Braintrust
            bt.log(
                input={"query": query},
                output={"answer": result["answer"]},
                scores={
                    "overall": overall_score,
                    "quality": quality_score,
                    "efficiency": efficiency_score,
                    "tool_selection": tool_selection_score
                },
                metadata={
                    "iterations": result["iterations"],
                    "tool_calls": len(result["tool_calls"]),
                    "expected_tools": test_case["expected_tools"]
                }
            )
            
            print(f"\nQuery: {query}")
            print(f"Scores - Overall: {overall_score:.2f} | Quality: {quality_score:.2f} | Efficiency: {efficiency_score:.2f} | Tool Selection: {tool_selection_score:.2f}")

if not os.getenv("BRAINTRUST_API_KEY", "").startswith("your-"):
    run_evaluation()
    print(f"\n✅ Evaluation complete! View detailed results at: https://www.braintrust.dev/app/{PROJECT_NAME}")
else:
    print("⚠️  Set API keys to run evaluation")

## 10. Analyze Results in Braintrust Dashboard

After running the experiments, you can view comprehensive analytics in the Braintrust dashboard:

### What You'll See:

1. **Trace Timeline**: Visual representation of agent execution flow
2. **Tool Usage Patterns**: Which tools were called and when
3. **Performance Metrics**: Latency, token usage, iteration counts
4. **Score Distributions**: How your agent performs across different queries
5. **Error Tracking**: Any failures or issues during execution

### Key Insights:

- **Efficiency**: Are iterations and tool calls optimal?
- **Accuracy**: Does the agent select appropriate tools?
- **Quality**: Are the answers comprehensive and correct?
- **Consistency**: How does performance vary across similar queries?
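
The consistency question above can also be checked locally before opening the dashboard. The sketch below summarizes each score across runs using the standard library; `summarize_scores` is a hypothetical helper, and the rows are made-up illustrative values, not results from this agent:

```python
from statistics import mean, pstdev
from typing import Dict, List

def summarize_scores(rows: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Compute mean, population standard deviation, min, and max per score name."""
    names = sorted({name for row in rows for name in row})
    summary = {}
    for name in names:
        values = [row[name] for row in rows if name in row]
        summary[name] = {
            "mean": mean(values),
            "stdev": pstdev(values),
            "min": min(values),
            "max": max(values),
        }
    return summary

# Made-up illustrative values, not real evaluation results
rows = [
    {"quality": 0.8, "efficiency": 1.0},
    {"quality": 1.0, "efficiency": 0.9},
    {"quality": 0.6, "efficiency": 1.0},
]
for name, stats in summarize_scores(rows).items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

A high standard deviation on similar queries suggests the agent's behavior is unstable for that score, which is a useful starting point for inspecting individual traces.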

### Next Steps:

1. Compare different prompting strategies
2. A/B test different models or temperatures
3. Identify failure patterns
4. Optimize tool selection logic
5. Set up alerts for performance degradation

## Summary

This notebook demonstrated:

- ✅ **Braintrust Integration**: Comprehensive tracking of agent behavior
- ✅ **Agentic Design**: Multi-tool agent with reasoning and tool selection
- ✅ **Custom Metrics**: Evaluation of quality, efficiency, and tool selection
- ✅ **Production Patterns**: Proper error handling, logging, and state management

### Key Takeaways:

1. **Observability is crucial** for understanding and improving agent behavior
2. **@traced decorator** makes it easy to instrument any function
3. **Custom metrics** help evaluate agent performance beyond simple accuracy
4. **Braintrust dashboard** provides powerful visualization and analysis tools

### Resources:

- [Braintrust Documentation](https://www.braintrust.dev/docs)
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [Anthropic API Reference](https://docs.anthropic.com/)

---

**Ready to build production agent systems with full observability! 🚀**