# Custom Agent Evaluation (Ragas and LangFuse)

This notebook evaluates agent performance by scoring LangFuse traces with Ragas metrics. The evaluation logic lives in `utils.py`, and the metrics are configured in `metrics_config.yaml`.

## Setup and Configuration

In [1]:
# Install required packages
%pip install ragas "strands-agents==0.1.9" "strands-agents-tools==0.1.7" "langfuse==3.1.1" pyyaml -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import base64
from utils import run_evaluation_pipeline, print_metric_summary

## Configuration of Test Parameters

Modify these parameters according to your evaluation needs:

In [None]:
# =============================================================================
# CONFIGURATION PARAMETERS - MODIFY AS NEEDED
# =============================================================================

# LangFuse Configuration (placeholder keys; replace with your project's values)
LANGFUSE_SECRET_KEY = "sk-lf-xxxxxxxx"
LANGFUSE_PUBLIC_KEY = "pk-lf-xxxxxxxxx"
LANGFUSE_BASE_URL = "https://us.cloud.langfuse.com"

# Evaluation Parameters
LOOKBACK_HOURS = 24          # Hours to look back for traces
BATCH_SIZE = 20              # Number of traces to process
LANGFUSE_TAGS = ["Observability-Tutorial"]  # Filter traces by tags (None for all)
SAVE_CSV = True              # Save results to CSV files

# Target LLM-as-Judge Model (from model_list.json)
TARGET_MODEL = "claude-3.7-sonnet"  # Available models: claude-4-sonnet, nova-premier, etc.

# File Paths
METRICS_CONFIG_PATH = "metrics_config.yaml"
MODEL_LIST_PATH = "model_list.json"
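
Hard-coded keys are easy to leak when a notebook is shared or committed. As an alternative, the sketch below reads settings from environment variables and fails early when one is missing. `read_setting` is a hypothetical helper, not part of `utils.py`, and the demo dictionary stands in for `os.environ`.

```python
import os


def read_setting(name, default=None, env=os.environ):
    """Return env[name], or the default; raise if neither is set."""
    value = env.get(name, default)
    if not value:
        raise KeyError(f"Missing required setting: {name}")
    return value


# Demo with a stand-in environment so the example runs anywhere.
demo_env = {"LANGFUSE_PUBLIC_KEY": "pk-lf-demo"}
print(read_setting("LANGFUSE_PUBLIC_KEY", env=demo_env))
print(read_setting("LANGFUSE_HOST", default="https://us.cloud.langfuse.com", env=demo_env))
```

In the notebook itself, calling `read_setting("LANGFUSE_SECRET_KEY")` without the `env` argument would read from the real process environment.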

## Initialize Environment

In [9]:
# Set environment variables
os.environ["AWS_REGION_NAME"] = "us-east-1"
os.environ["LANGFUSE_SECRET_KEY"] = LANGFUSE_SECRET_KEY
os.environ["LANGFUSE_PUBLIC_KEY"] = LANGFUSE_PUBLIC_KEY
os.environ["LANGFUSE_HOST"] = LANGFUSE_BASE_URL

# Set up the OpenTelemetry (OTLP) trace endpoint; auth is HTTP Basic over "public_key:secret_key"
otel_endpoint = LANGFUSE_BASE_URL + "/api/public/otel/v1/traces"
auth_token = base64.b64encode(f"{LANGFUSE_PUBLIC_KEY}:{LANGFUSE_SECRET_KEY}".encode()).decode()
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = otel_endpoint
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth_token}"

print(f"Environment configured for LangFuse host: {LANGFUSE_BASE_URL}")
print(f"Target evaluation model: {TARGET_MODEL}")

Environment configured for LangFuse host: https://us.cloud.langfuse.com
Target evaluation model: claude-3.7-sonnet
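
The OTLP header set above is an HTTP Basic credential: the public and secret keys joined by a colon and base64-encoded. The sketch below wraps the same steps in a function and decodes the header again to confirm the round trip. `build_otel_settings` is a hypothetical helper and the keys are dummy values.

```python
import base64


def build_otel_settings(public_key, secret_key, base_url):
    """Build the two OTLP environment values used in the cell above."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": base_url.rstrip("/") + "/api/public/otel/v1/traces",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


settings = build_otel_settings("pk-lf-demo", "sk-lf-demo", "https://us.cloud.langfuse.com")

# Decode the token again to check that it carries "public:secret".
header = settings["OTEL_EXPORTER_OTLP_HEADERS"]
decoded = base64.b64decode(header.split("Basic ", 1)[1]).decode()
print(decoded)  # pk-lf-demo:sk-lf-demo
```

Returning a dictionary instead of writing to `os.environ` directly makes the values easy to inspect before applying them with `os.environ.update(settings)`.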


## Run Evaluation Pipeline

Execute the complete evaluation pipeline with the configured parameters:

In [11]:
# Prepare LangFuse configuration
langfuse_config = {
    "secret_key": LANGFUSE_SECRET_KEY,
    "public_key": LANGFUSE_PUBLIC_KEY,
    "host": LANGFUSE_BASE_URL
}

# Run the evaluation pipeline
print("Starting RAGAS evaluation pipeline...")
print(f"Configuration: {LOOKBACK_HOURS}h lookback, {BATCH_SIZE} traces, model: {TARGET_MODEL}")

results = run_evaluation_pipeline(
    langfuse_config=langfuse_config,
    model_name=TARGET_MODEL,
    lookback_hours=LOOKBACK_HOURS,
    batch_size=BATCH_SIZE,
    tags=LANGFUSE_TAGS,
    save_csv=SAVE_CSV,
    metrics_config_path=METRICS_CONFIG_PATH,
    model_list_path=MODEL_LIST_PATH
)

print("\nEvaluation pipeline completed!")

Starting RAGAS evaluation pipeline...
Configuration: 24h lookback, 20 traces, model: claude-3.7-sonnet
Fetching traces from 2025-11-13 11:01:46.206937 to 2025-11-14 11:01:46.206937
Fetched 20 traces
Evaluating 12 multi_turn samples


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Added score restaurant_directory_validation=0.0 to trace 911b6a60cc598118bfbaf02e1e1e80f1
Added score formal_greeting_compliance=0.0 to trace 911b6a60cc598118bfbaf02e1e1e80f1
Added score function_orchestration_efficiency=0.0 to trace 911b6a60cc598118bfbaf02e1e1e80f1
Added score reservation_workflow_completeness=1.0 to trace 911b6a60cc598118bfbaf02e1e1e80f1
Added score customer_experience_quality=1.0 to trace 911b6a60cc598118bfbaf02e1e1e80f1
Added score restaurant_directory_validation=0.0 to trace fc99846f6606f86e359f352a6d159cca
Added score formal_greeting_compliance=0.0 to trace fc99846f6606f86e359f352a6d159cca
Added score function_orchestration_efficiency=0.0 to trace fc99846f6606f86e359f352a6d159cca
Added score reservation_workflow_completeness=1.0 to trace fc99846f6606f86e359f352a6d159cca
Added score customer_experience_quality=1.0 to trace fc99846f6606f86e359f352a6d159cca
Added score restaurant_directory_validation=0.0 to trace d38afee92b495a478ce54b5262bc208d
Added score formal_g

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

Exception in callback Task.__step()
handle: <Handle Task.__step()>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: cannot enter context: <_contextvars.Context object at 0x7f4568398980> is already entered


Error adding score: float() argument must be a string or a real number, not 'list'
Both scoring methods failed: 'Langfuse' object has no attribute 'score'
Error adding score: could not convert string to float: '{\'total_tests\': 4, \'successful_tests\': 4, \'results_summary\': [{\'test_id\': \'bedrock/us.amazon.nova-pro-v1:0_version2_1763071241\', \'model\': \'bedrock/us.amazon.nova-pro-v1:0\', \'prompt\': \'version2\', \'query\': \'Make a reservation for tonight at Rice & Spice for 5 persons At 8pm in the name of Andres\', \'response\': \'<thinking> The restaurant "Rice & Spice" exists in the directory, and the booking has been successfully created with the booking ID 5deaa43f. I will now inform the user of the successful reservation. </thinking> <answer> Restaurant Helper: Your reservation for tonight at 8 PM for 5 persons at Rice & Spice under the name of Andres has been successfully created. Your booking ID is 5deaa43f. If you need any further assistance, please do not hesitate to 
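
The two `Error adding score` messages above suggest that some values passed to the scoring call were not single numbers: one was a list, another a stringified dictionary. One way to guard against this, sketched below, is to coerce each value to a float before submitting it and skip anything that is not numeric. `coerce_score` is a hypothetical helper, not a function from `utils.py`. Separately, the `'Langfuse' object has no attribute 'score'` message is consistent with the v3 Python SDK exposing `create_score` on the client instead of `score`; this is an assumption to verify against the SDK reference for the pinned version.

```python
import math
from numbers import Real


def coerce_score(value):
    """Return value as a finite float, or None if it is not a single number."""
    if isinstance(value, Real):  # covers int, float, bool
        number = float(value)
    elif isinstance(value, str):
        try:
            number = float(value.strip())
        except ValueError:
            return None  # e.g. a stringified dict of test results
    else:
        return None  # lists, dicts, None, and other non-scalar values
    return number if math.isfinite(number) else None


candidates = [1, "0.5", [0.0, 1.0], "{'total_tests': 4}", float("nan")]
for candidate in candidates:
    print(repr(candidate), "->", coerce_score(candidate))
```

Values that come back as `None` can be logged and skipped, so one malformed result does not interrupt scoring for the remaining traces.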

## View Results Summary

In [12]:
# Display results summary with configurable performance ranges
if results:
    
    has_results = False
    
    # Performance range configuration - adjust as needed
    # Examples: [0, 1] for 0-1 scale, [1, 5] for 1-5 scale
    PERFORMANCE_RANGE = [0, 1]  # Change this to [1, 5] for 1-5 scale evaluation
    
    if "conversation_results" in results and results["conversation_results"] is not None:
        if not results["conversation_results"].empty:
            print_metric_summary(
                results["conversation_results"], 
                "MULTI-TURN CONVERSATION EVALUATION",
                performance_range=PERFORMANCE_RANGE
            )
            has_results = True
    
    if "single_turn_results" in results and results["single_turn_results"] is not None:
        if not results["single_turn_results"].empty:
            print_metric_summary(
                results["single_turn_results"], 
                "SINGLE-TURN EVALUATION",
                performance_range=PERFORMANCE_RANGE
            )
            has_results = True
    
    if not has_results:
        print("\n⚠️  No evaluation results available - check trace availability and configuration")
else:
    print("\n❌ No results returned from evaluation pipeline")
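
A single `PERFORMANCE_RANGE` is applied to every metric in this cell. In the summary output, `reservation_workflow_completeness` and `customer_experience_quality` report values above 1 (up to 5 and 4), so they appear to use a 1-5 scale, and rating them against `[0, 1]` is likely to overstate them. The sketch below normalizes each mean against its own range before labelling it. `rate_metric` and the 0.5 and 0.8 thresholds are assumptions for illustration, not the logic inside `print_metric_summary`, and the per-metric ranges are inferred from the output.

```python
def rate_metric(mean, value_range, good=0.5, excellent=0.8):
    """Label a mean score after rescaling it to 0-1 within value_range."""
    low, high = value_range
    if high <= low:
        raise ValueError("value_range must be (low, high) with high > low")
    normalized = (mean - low) / (high - low)
    if normalized >= excellent:
        return "EXCELLENT"
    if normalized >= good:
        return "GOOD"
    return "POOR"


# Ranges inferred from the observed minimum and maximum values (assumption).
metric_ranges = {
    "formal_greeting_compliance": (0, 1),
    "reservation_workflow_completeness": (1, 5),
    "customer_experience_quality": (1, 5),
}
means = {
    "formal_greeting_compliance": 0.750,
    "reservation_workflow_completeness": 2.667,
    "customer_experience_quality": 3.083,
}
for name, mean in means.items():
    print(f"{name}: {rate_metric(mean, metric_ranges[name])}")
```

Under these assumed thresholds, a mean of 2.667 on a 1-5 scale normalizes to about 0.42 and is labelled POOR.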


  MULTI-TURN CONVERSATION EVALUATION
📊 Samples Evaluated: 12

📈 METRIC SCORES SUMMARY
----------------------------------------

restaurant_directory_validation:
  Mean: 0.000 | Min: 0.000 | Max: 0.000 | 🔴 POOR

formal_greeting_compliance:
  Mean: 0.750 | Min: 0.000 | Max: 1.000 | 🟡 GOOD

function_orchestration_efficiency:
  Mean: 0.000 | Min: 0.000 | Max: 0.000 | 🔴 POOR

reservation_workflow_completeness:
  Mean: 2.667 | Min: 1.000 | Max: 5.000 | 🟢 EXCELLENT

customer_experience_quality:
  Mean: 3.083 | Min: 1.000 | Max: 4.000 | 🟢 EXCELLENT

  SINGLE-TURN EVALUATION
📊 Samples Evaluated: 8

📈 METRIC SCORES SUMMARY
----------------------------------------

restaurant_directory_validation:
  Mean: 0.250 | Min: 0.000 | Max: 1.000 | 🔴 POOR

formal_greeting_compliance:
  Mean: 1.000 | Min: 1.000 | Max: 1.000 | 🟢 EXCELLENT

function_orchestration_efficiency:
  Mean: 0.000 | Min: 0.000 | Max: 0.000 | 🔴 POOR

reservation_workflow_completeness:
  Mean: 4.125 