# 01 - Experiments tracking - AWS Strands Agents with LiteLLM and LangFuse observability

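This notebook demonstrates a unified testing approach for a Strands agent over LiteLLM endpoints. The `run_test()` method compares models, system prompts, and queries, while `run_evaluation()` scores multi-turn test cases with an LLM-as-judge. Both print human-readable results.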

## Setup
______________________________________________________________

In [40]:
%pip install "strands-agents" "strands-agents-tools" "langfuse==3.1.1" "litellm" opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -q

Note: you may need to restart the kernel to use updated packages.


In [41]:
import os, base64
import time, uuid, boto3
import yaml
from datetime import datetime
from utils_litellm import UnifiedTester

In [42]:
# Initialize LiteLLM Unified Tester
tester = UnifiedTester()

# Load configuration
with open('config_experiments.yml', 'r') as f:
    config = yaml.safe_load(f)

prompts = config['system_prompts']
test_queries = config['test_queries']

print("✅ LiteLLM Unified Tester initialized!")
print(f"✅ Available Prompts: {list(prompts.keys())}")
print(f"✅ Number of Test queries: {len(test_queries)}")

✅ LiteLLM Unified Tester initialized!
✅ Available Prompts: ['version1', 'version2']
✅ Number of Test queries: 4
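
The cell above reads two top-level keys from `config_experiments.yml`. A minimal sketch of a sanity check on the loaded dictionary follows. `validate_config` and the sample values are hypothetical (they are not part of `utils_litellm`), and the real file may contain additional keys.

```python
def validate_config(config: dict) -> tuple[list[str], int]:
    """Check the two keys the notebook reads; return (prompt names, query count)."""
    prompts = config.get("system_prompts")
    queries = config.get("test_queries")
    if not isinstance(prompts, dict) or not prompts:
        raise ValueError("'system_prompts' must be a non-empty mapping of name -> prompt text")
    if not isinstance(queries, list) or not queries:
        raise ValueError("'test_queries' must be a non-empty list")
    return list(prompts.keys()), len(queries)


# Illustrative stand-in for the parsed YAML (values are made up).
sample_config = {
    "system_prompts": {
        "version1": "You are a restaurant helper.",
        "version2": "You are a concise restaurant helper.",
    },
    "test_queries": ["Query A", "Query B"],
}

names, n_queries = validate_config(sample_config)
print(names, n_queries)
```

Running a check like this before `run_test()` surfaces a malformed config early instead of failing partway through a test run.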


### LangFuse Setup

In [43]:
## 1. Set general environment variables first

os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-xxxxxxxxxxx" # Your Langfuse project secret key
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-xxxxxxxxxxxxx" # Your Langfuse project public key
os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com/" # Langfuse domain


def setup_langfuse_v3(langfuse_public_key, langfuse_secret_key, langfuse_api_url):
    """Set up LangFuse v3 with proper configuration"""
    
    
    # 2. Set up OpenTelemetry endpoint with proper authentication
    # Strip any trailing slash from the host so the URL does not contain "//api"
    otel_endpoint = f"{langfuse_api_url.rstrip('/')}/api/public/otel/v1/traces"
    auth_token = base64.b64encode(
        f"{langfuse_public_key}:{langfuse_secret_key}".encode()
    ).decode()
    
    os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = otel_endpoint
    os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth_token}"
    
    print("✅ LangFuse v3 Environment Configured:")
    print(f"   Host: {langfuse_api_url}")
    print(f"   OTEL Endpoint: {otel_endpoint}")
    print("   Authentication: Configured")
    
    return True

# Set up LangFuse
setup_langfuse_v3(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"], os.environ["LANGFUSE_HOST"])

✅ LangFuse v3 Environment Configured:
   Host: http://xxxxxxxxxxx
   OTEL Endpoint: http://xxxxxxxxxx/api/public/otel/v1/traces
   Authentication: Configured


True
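
The setup function builds an HTTP Basic authorization header from the Langfuse public and secret keys. The sketch below isolates that step so it can be checked without sending any traces. The helper names and the `pk-lf-example` / `sk-lf-example` keys are placeholders for illustration only.

```python
import base64


def build_basic_auth_header(public_key: str, secret_key: str) -> str:
    """Return the header string in the form used for OTEL_EXPORTER_OTLP_HEADERS."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return f"Authorization=Basic {token}"


def build_otel_endpoint(host: str) -> str:
    """Join the host and the traces path, tolerating a trailing slash on the host."""
    return f"{host.rstrip('/')}/api/public/otel/v1/traces"


header = build_basic_auth_header("pk-lf-example", "sk-lf-example")
# Decode the token again to confirm it round-trips to "public:secret".
decoded = base64.b64decode(header.split("Basic ", 1)[1]).decode()
endpoint = build_otel_endpoint("https://us.cloud.langfuse.com/")

print(decoded)
print(endpoint)
```

Decoding the token is a quick way to confirm that the keys were not swapped or truncated before traces start failing authentication.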

## Tools Definition
__________________________________________________________________________________

#### Dependencies setup

In [44]:
from strands_tools import retrieve, current_time
from strands import Agent, tool
from strands.models.litellm import LiteLLMModel
import os

#### AWS Clients setup (Bedrock KnowledgeBase and DynamoDB)

In [45]:
kb_name = "restaurant-assistant"
dynamodb = boto3.resource("dynamodb")
ssm_client = boto3.client("ssm")
# Resolve the DynamoDB table name and Knowledge Base ID from SSM Parameter Store
table_name = ssm_client.get_parameter(
    Name=f"{kb_name}-table-name", WithDecryption=False
)
table = dynamodb.Table(table_name["Parameter"]["Value"])
kb_id = ssm_client.get_parameter(Name=f"{kb_name}-kb-id", WithDecryption=False)
print("DynamoDB table:", table_name["Parameter"]["Value"])
print("Knowledge Base Id:", kb_id["Parameter"]["Value"])

# OpenTelemetry attribute values must be primitives, so keep the session ID as a string
session_id = str(uuid.uuid4())

DynamoDB table: restaurant-bookings
Knowledge Base Id: 9M16EZBECX


### Tools setup

In [46]:
%%writefile tool_get_booking_details.py

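from strands import tool
import boto3


@tool
def tool_booking_details(booking_id: str, restaurant_name: str) -> dict:
    """Get the relevant details for booking_id in restaurant_name
    Args:
        booking_id: the id of the reservation
        restaurant_name: name of the restaurant handling the reservation

    Returns:
        booking_details: the details of the booking in JSON format
    """
    # This file is imported as a standalone module, so it cannot see the
    # notebook's `table` variable; resolve the table here instead.
    kb_name = "restaurant-assistant"
    dynamodb = boto3.resource("dynamodb")
    ssm_client = boto3.client("ssm")
    table_name = ssm_client.get_parameter(
        Name=f"{kb_name}-table-name", WithDecryption=False
    )
    table = dynamodb.Table(table_name["Parameter"]["Value"])

    try:
        response = table.get_item(
            Key={"booking_id": booking_id, "restaurant_name": restaurant_name}
        )
        if "Item" in response:
            return response["Item"]
        else:
            return f"No booking found with ID {booking_id}"
    except Exception as e:
        return str(e)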

Overwriting tool_get_booking_details.py


In [47]:
%%writefile tool_delete_booking.py

from strands import tool
import boto3 

@tool
def tool_delete_booking(booking_id: str, restaurant_name: str) -> str:
    """Delete an existing booking_id at restaurant_name
    Args:
        booking_id: the id of the reservation
        restaurant_name: name of the restaurant handling the reservation

    Returns:
        confirmation_message: confirmation message
    """
    kb_name = 'restaurant-assistant'
    dynamodb = boto3.resource('dynamodb')
    smm_client = boto3.client('ssm')
    table_name = smm_client.get_parameter(
        Name=f'{kb_name}-table-name',
        WithDecryption=False
    )
    table = dynamodb.Table(table_name["Parameter"]["Value"])
    try:
        response = table.delete_item(Key={'booking_id': booking_id, 'restaurant_name': restaurant_name})
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            return f'Booking with ID {booking_id} deleted successfully'
        else:
            return f'Failed to delete booking with ID {booking_id}'
    except Exception as e:
        return str(e)

Overwriting tool_delete_booking.py


In [48]:
%%writefile tool_create_booking.py

# Alternatively, you can use the TOOL_SPEC approach when defining your tool

from typing import Any
from strands.types.tools import ToolResult, ToolUse
import boto3
import uuid


TOOL_SPEC = {
    "name": "tool_create_booking",
    "description": "Create a new booking at restaurant_name",
    "inputSchema": {
        "json": {
            "type": "object",
            "properties": {
                "date": {
                    "type": "string",
                    "description": """The date of the booking in the format YYYY-MM-DD. 
                    Do NOT accept relative dates like today or tomorrow. 
                    Ask for today's date for relative date."""
                },
                "hour": {
                    "type": "string",
                    "description": "the hour of the booking in the format HH:MM"
                },
                "restaurant_name": {
                    "type": "string",
                    "description": "name of the restaurant handling the reservation"
                },
                "guest_name": {
                    "type": "string",
                    "description": "The name of the customer to have in the reservation"
                },
                "num_guests": {
                    "type": "integer",
                    "description": "The number of guests for the booking"
                }
            },
            "required": ["date", "hour", "restaurant_name", "guest_name", "num_guests"]
        }
    }
}
# Function name must match tool name
def tool_create_booking(tool: ToolUse, **kwargs: Any) -> ToolResult:
    kb_name = 'restaurant-assistant'
    dynamodb = boto3.resource('dynamodb')
    smm_client = boto3.client('ssm')
    table_name = smm_client.get_parameter(
        Name=f'{kb_name}-table-name',
        WithDecryption=False
    )
    table = dynamodb.Table(table_name["Parameter"]["Value"])
    
    tool_use_id = tool["toolUseId"]
    date = tool["input"]["date"]
    hour = tool["input"]["hour"]
    restaurant_name = tool["input"]["restaurant_name"]
    guest_name = tool["input"]["guest_name"]
    num_guests = tool["input"]["num_guests"]
    
    results = f"Creating reservation for {num_guests} people at {restaurant_name}, " \
              f"{date} at {hour} in the name of {guest_name}"
    print(results)
    try:
        booking_id = str(uuid.uuid4())[:8]
        table.put_item(
            Item={
                'booking_id': booking_id,
                'restaurant_name': restaurant_name,
                'date': date,
                'name': guest_name,
                'hour': hour,
                'num_guests': num_guests
            }
        )
        return {
            "toolUseId": tool_use_id,
            "status": "success",
            "content": [{"text": f"Reservation created with booking id: {booking_id}"}]
        } 
    except Exception as e:
        return {
            "toolUseId": tool_use_id,
            "status": "error",
            "content": [{"text": str(e)}]
        } 

Overwriting tool_create_booking.py


In [49]:
import tool_create_booking
import tool_delete_booking
import tool_get_booking_details

#### Set Tools List and KnowledgeBase Id capture

In [50]:
#Knowledge Base
os.environ["KNOWLEDGE_BASE_ID"] = kb_id["Parameter"]["Value"]

#Tools list
tool_list = [retrieve, current_time, tool_get_booking_details, tool_create_booking, tool_delete_booking]

## LiteLLM Model Endpoints
-----------------------------------------------------------------------------

With LiteLLM, you can use various model endpoints. Here are some examples for AWS Bedrock:
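
LiteLLM model strings follow a `provider/model_id` pattern. The sketch below lists the Bedrock identifiers used later in this notebook and splits them with a hypothetical helper (`split_model_string` is illustrative, not a LiteLLM function). IDs with a `us.` prefix refer to cross-region inference profiles; which models are available depends on your account and region.

```python
BEDROCK_MODELS = [
    "bedrock/openai.gpt-oss-120b-1:0",
    "bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    "bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    "bedrock/amazon.nova-pro-v1:0",
]


def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model_id' into its two parts (hypothetical helper)."""
    provider, _, model_id = model.partition("/")
    if not model_id:
        raise ValueError(f"Expected 'provider/model_id', got: {model!r}")
    return provider, model_id


for m in BEDROCK_MODELS:
    provider, model_id = split_model_string(m)
    print(f"{provider:8s} {model_id}")
```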

## Tests Setup
_______________________________________________________________________________________________

In [56]:
# Test 1: Multiple LiteLLM models, single system prompt, single query

#Test name
test_name = "Restaurant helper LiteLLM Test"

#Langfuse trace attributes
trace_attributes = {
    # Main trace name 
    "operation.name": test_name, 
    "langfuse.trace.name": test_name,
    # Core identifiers
    "session.id": session_id,
    "user.id": "palacan@amazon.com",
    # Langfuse metadata
    "langfuse.tags": [
        f"Agent-{test_name}"
    ],
    "langfuse.environment": "development"
}

# Test Definition and Execution using LiteLLM endpoints
results1 = tester.run_test(
    models=[
        "bedrock/openai.gpt-oss-120b-1:0",
        "bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0"
    ],  # LiteLLM endpoints
    system_prompts=["version2"],  # System prompts list to test
    queries=test_queries[0],  # Single query (first entry of test_queries)
    prompts_dict=prompts,  # Dictionary of prompts
    tool=tool_list,  # Tools to test
    save_to_csv=True, # Default True
    trace_attributes=trace_attributes
)

tester.display_results(results1)


🚀 Starting LiteLLM Test Suite
📊 Total combinations to test: 2
🤖 Models: ['bedrock/openai.gpt-oss-120b-1:0', 'bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0']
📝 Prompts: ['version2']
❓ Queries: 1 query(ies)

[1/2] Testing: bedrock/openai.gpt-oss-120b-1:0 | version2
Query: What is the current bitcoin price?
------------------------------------------------------------
🔧 Using AWS region: us-east-1
**Current Bitcoin (BTC) Price**

- **Price:** **$27,842.31 USD**  
- **Source:** Real-time market data from major cryptocurrency exchanges (e.g., Coinbase, Binance, Kraken) aggregated via the Amazon Bedrock Knowledge Base.  
- **Timestamp:** 2025-11-05 08:12 UTC  

*Disclaimer:* Cryptocurrency prices are highly volatile and can change within seconds. This figure reflects the price at the moment of retrieval and should not be considered financial advice. Always verify the latest price on a trusted exchange before making any trading decisions.
Tool #1: retri
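
The "Total combinations to test: 2" line in the output is consistent with one run per (model, prompt, query) triple. A minimal sketch of that enumeration follows, assuming a Cartesian product. How `UnifiedTester.run_test()` iterates internally is not shown in this notebook, so treat this as an illustration of the counting only.

```python
from itertools import product

models = [
    "bedrock/openai.gpt-oss-120b-1:0",
    "bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0",
]
system_prompts = ["version2"]
queries = ["What is the current bitcoin price?"]

# One test run per (model, prompt, query) combination.
combinations = list(product(models, system_prompts, queries))

for i, (model, prompt, query) in enumerate(combinations, start=1):
    print(f"[{i}/{len(combinations)}] Testing: {model} | {prompt}")
```

Adding a second prompt version or a second query doubles the number of runs, which is worth keeping in mind for cost and duration.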

In [19]:
# Test 2: Single LiteLLM model, multiple prompts, single query

#Test name
test_name = "Restaurant helper LiteLLM"

#Langfuse trace attributes
trace_attributes = {
    # Main trace name 
    "operation.name": test_name, 
    "langfuse.trace.name": test_name,
    # Core identifiers
    "session.id": session_id,
    "user.id": "palacan@amazon.com",
    # Langfuse metadata
    "langfuse.tags": [
        f"Agent-{test_name}"
    ],
    "langfuse.environment": "development"
}

results2 = tester.run_test(
    models=["bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0"],  # Single LiteLLM endpoint
    system_prompts=["version1", "version2"],  # Multiple prompts
    queries=test_queries[0],  # Single query
    prompts_dict=prompts,
    tool=tool_list,
    trace_attributes=trace_attributes
)

tester.display_results(results2)


🚀 Starting LiteLLM Test Suite
📊 Total combinations to test: 2
🤖 Models: ['bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0']
📝 Prompts: ['version1', 'version2']
❓ Queries: 1 query(ies)

[1/2] Testing: bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0 | version1
Query: What is the current bitcoin price?
------------------------------------------------------------
🔧 Using AWS region: us-east-1
I apologize, but I notice that among the available tools, I don't have direct access to real-time cryptocurrency price data. While I'm designed to coordinate with a CryptoExpertAgent for such queries, I don't currently have a tool to fetch current Bitcoin prices.

To provide accurate and real-time cryptocurrency information, I would need access to a cryptocurrency price data feed or API. Without this, I cannot provide you with the current Bitcoin price.

If you need this information, I recommend:
1. Checking reputable cryptocurrency exchanges directly
2. Using cryptocurrenc

## [OPTIONAL] 🧪 Test Case Evaluation with LLM-as-Judge using LiteLLM

This section demonstrates how to run structured test case evaluation using the `run_evaluation()` method with LiteLLM endpoints. This method:

- Loads test cases from `config_evaluation.yml`
- Runs multi-turn conversations for each test case
- Uses an LLM-as-judge to evaluate responses against expected results
- Provides detailed scoring and analysis
- Optionally integrates with Langfuse for tracing
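
The steps above can be sketched as a small loop: ask the agent each question in a test case, have a judge score the answer against the expected result, and aggregate the scores. This is not the implementation inside `run_evaluation()`; the `Turn` class, the stub agent, the keyword-based judge, and the 0.7 pass threshold are all assumptions made so the example runs offline. In practice the judge would be another LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    question: str
    expected: str


def evaluate_test_case(
    turns: list[Turn],
    agent: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    pass_threshold: float = 0.7,  # assumed threshold, for illustration only
) -> dict:
    """Run every turn through the agent and score each answer with the judge."""
    if not turns:
        raise ValueError("A test case needs at least one turn")
    scores = []
    for turn in turns:
        answer = agent(turn.question)
        scores.append(judge(turn.question, answer, turn.expected))
    average = sum(scores) / len(scores)
    return {"scores": scores, "average": average, "passed": average >= pass_threshold}


def stub_agent(question: str) -> str:
    # Stand-in for the Strands agent: always returns the same answer.
    return "Booking confirmed for 2 guests."


def keyword_judge(question: str, answer: str, expected: str) -> float:
    # Stand-in for an LLM judge: 1.0 if the expected keyword appears in the answer.
    return 1.0 if expected.lower() in answer.lower() else 0.0


turns = [Turn("Book a table for 2", "confirmed"), Turn("Cancel my booking", "deleted")]
result = evaluate_test_case(turns, stub_agent, keyword_judge)
print(result)
```

Passing the agent and judge in as callables keeps the scoring loop independent of any particular model or provider.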

## Agent Evaluation (CSV output)

In [20]:
# Load system prompts for evaluation
with open('config_experiments.yml', 'r') as f:
    config = yaml.safe_load(f)

prompts = config['system_prompts']
os.environ["KNOWLEDGE_BASE_ID"] = kb_id["Parameter"]["Value"]
tool_list = [retrieve, current_time, tool_get_booking_details, tool_create_booking, tool_delete_booking]

# Run evaluation with test cases using LiteLLM
evaluation_results = tester.run_evaluation(
    models=["bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0"],  # LiteLLM endpoint
    system_prompts=["version2"],  # Single prompt version
    prompts_dict=prompts,
    tool=tool_list,
    test_cases_path="config_evaluation.yml",  # Test cases file
    save_to_csv=True  # Save results to CSV
)

print(f"\n✅ Evaluation completed with {len(evaluation_results)} test case results")


🧪 Starting LiteLLM Test Case Evaluation
📊 Total combinations: 2
🤖 Models: ['bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0']
📝 Prompts: ['version2']
📋 Test Cases: 2 test case(s)

[1/2] Evaluating: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0 | version2 | comparacion_crypto_tradicional
------------------------------------------------------------
🔧 Using AWS region: us-east-1
📝 Test Case: comparacion_crypto_tradicional

  Turn 1: ¿Puedes proporcionarme una comparación entre los precios de ...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

    ❌ Error in question 1: litellm.BadRequestError: BedrockException - b'{"message":"Invocation of model ID anthropic.claude-3-5-sonnet-20241022-v2:0 with on-demand throughput isn\xe2\x80\x99t supported. Retry your request with the ID or ARN of an inference profile that contains this model."}'
❌ FAILED
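
The `BadRequestError` above states that this model ID cannot be invoked with on-demand throughput and must be addressed through an inference profile. Test 2 in this notebook already uses the `us.`-prefixed ID for the same model. A minimal sketch of rewriting the model string follows, assuming a `us` cross-region inference profile is enabled in your account. `with_inference_profile` is a hypothetical helper, not part of LiteLLM or `utils_litellm`.

```python
def with_inference_profile(model: str, geo_prefix: str = "us") -> str:
    """Prefix the Bedrock model ID with a geography code, e.g. 'us.' (hypothetical helper)."""
    provider, sep, model_id = model.partition("/")
    if not sep or not model_id:
        raise ValueError(f"Expected 'provider/model_id', got: {model!r}")
    if model_id.startswith(f"{geo_prefix}."):
        return model  # Already points at an inference profile.
    return f"{provider}/{geo_prefix}.{model_id}"


fixed = with_inference_profile("bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0")
print(fixed)
```

Passing the rewritten string in the `models` list of `run_evaluation()` should avoid this specific error, provided the profile exists in the region you are calling.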

### Multi-Model Evaluation Comparison with LiteLLM

Compare multiple LiteLLM models and prompts across all test cases:

In [None]:
# Comprehensive evaluation across multiple LiteLLM configurations
# WARNING: This takes longer because it tests every model/prompt/test-case combination

comprehensive_evaluation = tester.run_evaluation(
    models=[
        # Inference-profile ID: the bare "anthropic." ID is rejected for on-demand throughput
        "bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0",
        "bedrock/amazon.nova-pro-v1:0"
    ],  # Multiple LiteLLM endpoints
    system_prompts=["version2"],   # Prompt versions to test (add "version1" to compare prompts)
    prompts_dict=prompts,
    tool=tool_list,
    test_cases_path="config_evaluation.yml",
    save_to_csv=True
)

print(f"\n🎯 Comprehensive evaluation completed with {len(comprehensive_evaluation)} results")
print("\n📊 Detailed analysis shows model and prompt performance across all test cases")

## 🔧 LiteLLM Configuration Tips

### Environment Variables for AWS Bedrock
Make sure you have AWS credentials configured:
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
```
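
A quick way to see which of these variables are unset from inside the notebook is sketched below. `missing_aws_vars` is a hypothetical helper. Note that boto3 can also resolve credentials from shared profiles or IAM roles, so a missing variable is not necessarily an error.

```python
import os

REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None) -> list[str]:
    """Return the names of the expected variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"AWS_ACCESS_KEY_ID": "your_access_key", "AWS_REGION": "us-east-1"}
missing = missing_aws_vars(example_env)
print(missing)
```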

### Other Provider Examples
```python
# OpenAI
models = ["gpt-4", "gpt-3.5-turbo"]

# Anthropic Direct
models = ["claude-3-sonnet-20240229"]

# Azure OpenAI
models = ["azure/gpt-4"]
```

### Benefits of LiteLLM Integration
- **Unified Interface**: Same code works with 100+ LLM providers
- **Easy Switching**: Switch models by changing only the model string
- **Cost Optimization**: Compare costs across providers
- **Fallback Support**: Automatic failover between models
- **Rate Limiting**: Built-in rate limiting and retry logic