# Custom Agent Evaluation (Ragas and LangFuse)

This notebook evaluates agent performance by scoring LangFuse traces with RAGAS metrics. The evaluation logic lives in `utils.py`, and the metrics themselves are defined in `metrics_config.yaml`.

## Setup and Configuration

In [None]:
# Install required packages
%pip install ragas "strands-agents==0.1.9" "strands-agents-tools==0.1.7" "langfuse==3.1.1" pyyaml -q

In [2]:
import os
import base64
from utils import run_evaluation_pipeline, print_metric_summary

## Configuration of Test Parameters

Modify these parameters according to your evaluation needs:

In [3]:
# =============================================================================
# CONFIGURATION PARAMETERS - MODIFY AS NEEDED
# =============================================================================

# LangFuse Configuration
LANGFUSE_SECRET_KEY = "sk-lf-xxxxxxxx"
LANGFUSE_PUBLIC_KEY = "pk-lf-xxxxxxxx"
LANGFUSE_HOST = "https://us.cloud.langfuse.com"

# Evaluation Parameters
LOOKBACK_HOURS = 24          # Hours to look back for traces
BATCH_SIZE = 20              # Number of traces to process
LANGFUSE_TAGS = ["Observability-Tutorial"]  # Filter traces by tags (None for all)
SAVE_CSV = True              # Save results to CSV files

# Target LLM-as-Judge Model (from model_list.json)
TARGET_MODEL = "claude-3.7-sonnet"  # Available models: claude-4-sonnet, nova-premier, etc.

# File Paths
METRICS_CONFIG_PATH = "metrics_config.yaml"
MODEL_LIST_PATH = "model_list.json"
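
`LOOKBACK_HOURS` describes a time window that ends at the moment the pipeline runs. How `utils.py` computes that window is not shown in this notebook, so the following is only a minimal sketch. The helper name `lookback_window` is hypothetical, and using timezone-aware UTC timestamps is an assumption rather than the notebook's confirmed behavior.

```python
from datetime import datetime, timedelta, timezone


def lookback_window(hours, now=None):
    """Return (start, end) as timezone-aware UTC datetimes spanning the last `hours` hours.

    Hypothetical helper for illustration; not part of utils.py.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return start, end


start, end = lookback_window(24)
print(f"Fetching traces from {start.isoformat()} to {end.isoformat()}")
```

Passing an explicit `now` makes the window reproducible, which helps when re-running an evaluation over the same set of traces.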

## Initialize Environment

In [4]:
# Set environment variables
os.environ["AWS_REGION_NAME"] = "us-east-1"
os.environ["LANGFUSE_SECRET_KEY"] = LANGFUSE_SECRET_KEY
os.environ["LANGFUSE_PUBLIC_KEY"] = LANGFUSE_PUBLIC_KEY
os.environ["LANGFUSE_HOST"] = LANGFUSE_HOST

# Setup OpenTelemetry endpoint
otel_endpoint = LANGFUSE_HOST + "/api/public/otel/v1/traces"
auth_token = base64.b64encode(f"{LANGFUSE_PUBLIC_KEY}:{LANGFUSE_SECRET_KEY}".encode()).decode()
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = otel_endpoint
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth_token}"

print(f"Environment configured for LangFuse host: {LANGFUSE_HOST}")
print(f"Target evaluation model: {TARGET_MODEL}")

Environment configured for LangFuse host: https://us.cloud.langfuse.com
Target evaluation model: claude-3.7-sonnet
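
If the exporter is rejected with an authentication error, it can help to confirm what actually ended up in `OTEL_EXPORTER_OTLP_HEADERS`. The sketch below decodes a header of the form built in the cell above back into its two keys. `decode_basic_auth` is a hypothetical debugging helper, not part of `utils.py` or any SDK, and it assumes the single-header `Authorization=Basic <token>` format used in this notebook.

```python
import base64


def decode_basic_auth(otlp_headers):
    """Parse 'Authorization=Basic <token>' and return (public_key, secret_key).

    Hypothetical debugging helper; assumes exactly one header in the string.
    """
    # Split on the first '=' only, so base64 padding in the token is preserved.
    name, _, value = otlp_headers.partition("=")
    if name.strip().lower() != "authorization":
        raise ValueError("expected an Authorization header")
    scheme, _, token = value.strip().partition(" ")
    if scheme != "Basic" or not token:
        raise ValueError("expected the Basic scheme followed by a token")
    public_key, sep, secret_key = base64.b64decode(token).decode().partition(":")
    if not sep:
        raise ValueError("decoded token is not of the form public:secret")
    return public_key, secret_key


# Round trip with the placeholder keys from the configuration cell.
token = base64.b64encode(b"pk-lf-xxxxxxxx:sk-lf-xxxxxxxx").decode()
public_key, secret_key = decode_basic_auth(f"Authorization=Basic {token}")
print(f"Header carries public key: {public_key}")
```

The example prints only the public key; avoid printing the secret key in a shared notebook.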


## Run Evaluation Pipeline

Execute the complete evaluation pipeline with the configured parameters:

In [5]:
# Prepare LangFuse configuration
langfuse_config = {
    "secret_key": LANGFUSE_SECRET_KEY,
    "public_key": LANGFUSE_PUBLIC_KEY,
    "host": LANGFUSE_HOST
}

# Run the evaluation pipeline
print("Starting RAGAS evaluation pipeline...")
print(f"Configuration: {LOOKBACK_HOURS}h lookback, {BATCH_SIZE} traces, model: {TARGET_MODEL}")

results = run_evaluation_pipeline(
    langfuse_config=langfuse_config,
    model_name=TARGET_MODEL,
    lookback_hours=LOOKBACK_HOURS,
    batch_size=BATCH_SIZE,
    tags=LANGFUSE_TAGS,
    save_csv=SAVE_CSV,
    metrics_config_path=METRICS_CONFIG_PATH,
    model_list_path=MODEL_LIST_PATH
)

print("\nEvaluation pipeline completed!")

Starting RAGAS evaluation pipeline...
Configuration: 24h lookback, 20 traces, model: claude-3.7-sonnet
Fetching traces from 2025-10-28 04:10:22.933937 to 2025-10-29 04:10:22.933937
No traces found with time filter, trying without time constraints...
Fetched 20 traces
Evaluating 15 multi_turn samples


Evaluating:   0%|          | 0/75 [00:00<?, ?it/s]

Added score Task/Objective Achieved=1.0 to trace 88d219576171ad507d88c34c873c64a3
Added score Tone of the Agent Metric=1.0 to trace 88d219576171ad507d88c34c873c64a3
Added score Tool Usage Effectiveness=1.0 to trace 88d219576171ad507d88c34c873c64a3
Added score Policy Compliance=1.0 to trace 88d219576171ad507d88c34c873c64a3
Added score Answer Correctness=4.0 to trace 88d219576171ad507d88c34c873c64a3
Added score Task/Objective Achieved=0.0 to trace c05cbb09247386e50cb4c1a633aa8c4c
Added score Tone of the Agent Metric=1.0 to trace c05cbb09247386e50cb4c1a633aa8c4c
Added score Tool Usage Effectiveness=0.0 to trace c05cbb09247386e50cb4c1a633aa8c4c
Added score Policy Compliance=1.0 to trace c05cbb09247386e50cb4c1a633aa8c4c
Added score Answer Correctness=5.0 to trace c05cbb09247386e50cb4c1a633aa8c4c
Added score Task/Objective Achieved=0.0 to trace d586b4dca7121b0a830f1283b8fe97ca
Added score Tone of the Agent Metric=1.0 to trace d586b4dca7121b0a830f1283b8fe97ca
Added score Tool Usage Effectiven

Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

Error adding score: float() argument must be a string or a real number, not 'list'
Both scoring methods failed: 'Langfuse' object has no attribute 'score'
Error adding score: could not convert string to float: '<answer> Restaurant Helper: Your reservation for tonight at Rice & Spice for 5 persons at 8pm under the name Andres has been successfully created. Your booking ID is 5c73b5a3. </answer>\n'
Both scoring methods failed: 'Langfuse' object has no attribute 'score'
Added score Task/Objective Achieved=1.0 to trace 21446e18d418b07a7c70b8ffca4c9145
Added score Tone of the Agent Metric=1.0 to trace 21446e18d418b07a7c70b8ffca4c9145
Added score Tool Usage Effectiveness=0.0 to trace 21446e18d418b07a7c70b8ffca4c9145
Added score Policy Compliance=1.0 to trace 21446e18d418b07a7c70b8ffca4c9145
Added score Answer Correctness=5.0 to trace 21446e18d418b07a7c70b8ffca4c9145
Error adding score: float() argument must be a string or a real number, not 'list'
Both scoring methods failed: 'Langfuse' obje
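
The log above shows two kinds of score-conversion failure: the judge returned a list where a number was expected, and the judge returned free text (an `<answer>` block) instead of a score. One way to make this step more tolerant is to coerce the raw value before writing it, and to skip the score when it is not unambiguously numeric. The sketch below assumes that a single-element list should be unwrapped, which may not match how a given metric produces lists. `coerce_score` is a hypothetical helper and is not part of `utils.py`.

```python
import math
import re
from numbers import Real

# Matches a plain decimal number such as "1", "0.5", "-2." or ".75".
_NUMBER = re.compile(r"^[+-]?(\d+(\.\d*)?|\.\d+)$")


def coerce_score(raw):
    """Return `raw` as a float, or None when it is not unambiguously numeric.

    Hypothetical helper for illustration; not part of utils.py.
    """
    # Assumption: a single-element list or tuple wraps the actual score.
    if isinstance(raw, (list, tuple)):
        if len(raw) != 1:
            return None
        raw = raw[0]
    if isinstance(raw, Real):
        value = float(raw)
        return None if math.isnan(value) else value
    if isinstance(raw, str):
        text = raw.strip()
        if _NUMBER.match(text):
            return float(text)
    # Free text, empty strings, None and longer lists are rejected.
    return None


print(coerce_score([1.0]))
print(coerce_score("<answer> Your reservation has been created. </answer>\n"))
```

Returning `None` lets the caller log and skip the sample instead of raising inside the scoring loop. Separately, the message `'Langfuse' object has no attribute 'score'` indicates that the fallback path calls a method the installed client does not expose. With the pinned `langfuse==3.1.1`, the score-writing method is, to the best of my knowledge, `create_score`, but this should be checked against the SDK documentation for the installed version.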

## View Results Summary
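
The summary cell below rates every metric against a single `PERFORMANCE_RANGE`, but the scores logged in this run mix scales: most metrics take the values 0.0 or 1.0, while Answer Correctness reaches 5.0. One option is to rescale each metric onto a common range before rating it. The sketch below does a min-max rescale. `normalize_score` is a hypothetical helper, and treating Answer Correctness as a 1-5 metric is an assumption based on the logged values, not on `metrics_config.yaml`.

```python
def normalize_score(value, metric_range, target_range=(0.0, 1.0)):
    """Linearly rescale `value` from `metric_range` onto `target_range`.

    Values outside `metric_range` are clipped to its bounds first.
    Hypothetical helper for illustration; not part of utils.py.
    """
    lo, hi = metric_range
    t_lo, t_hi = target_range
    if hi <= lo:
        raise ValueError("metric_range must satisfy lo < hi")
    clipped = min(max(value, lo), hi)
    return t_lo + (clipped - lo) * (t_hi - t_lo) / (hi - lo)


# Assumed 1-5 scale for Answer Correctness; binary metrics are already on 0-1.
print(round(normalize_score(4.333, (1, 5)), 3))
print(normalize_score(1.0, (0, 1)))
```

With per-metric ranges in place, a single `PERFORMANCE_RANGE` of `[0, 1]` can be applied uniformly across all metrics.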

In [7]:
# Display results summary with configurable performance ranges
if results:
    
    has_results = False
    
    # Performance range configuration - adjust as needed
    # Examples: [0, 1] for 0-1 scale, [1, 5] for 1-5 scale
    PERFORMANCE_RANGE = [0, 1]  # Change this to [1, 5] for 1-5 scale evaluation
    
    if "conversation_results" in results and results["conversation_results"] is not None:
        if not results["conversation_results"].empty:
            print_metric_summary(
                results["conversation_results"], 
                "MULTI-TURN CONVERSATION EVALUATION",
                performance_range=PERFORMANCE_RANGE
            )
            has_results = True
    
    if "single_turn_results" in results and results["single_turn_results"] is not None:
        if not results["single_turn_results"].empty:
            print_metric_summary(
                results["single_turn_results"], 
                "SINGLE-TURN EVALUATION",
                performance_range=PERFORMANCE_RANGE
            )
            has_results = True
    
    if not has_results:
        print("\n⚠️  No evaluation results available - check trace availability and configuration")
else:
    print("\n❌ No results returned from evaluation pipeline")


  MULTI-TURN CONVERSATION EVALUATION
📊 Samples Evaluated: 15

📈 METRIC SCORES SUMMARY
----------------------------------------

Task/Objective Achieved:
  Mean: 0.533 | Min: 0.000 | Max: 1.000 | 🟠 NEEDS IMPROVEMENT

Tone of the Agent Metric:
  Mean: 0.800 | Min: 0.000 | Max: 1.000 | 🟢 EXCELLENT

Tool Usage Effectiveness:
  Mean: 0.733 | Min: 0.000 | Max: 1.000 | 🟡 GOOD

Policy Compliance:
  Mean: 1.000 | Min: 1.000 | Max: 1.000 | 🟢 EXCELLENT

Answer Correctness:
  Mean: 4.333 | Min: 1.000 | Max: 5.000 | 🟢 EXCELLENT

  SINGLE-TURN EVALUATION
📊 Samples Evaluated: 5

📈 METRIC SCORES SUMMARY
----------------------------------------

Task/Objective Achieved:
  Mean: 1.000 | Min: 1.000 | Max: 1.000 | 🟢 EXCELLENT

Tone of the Agent Metric:
  Mean: 1.000 | Min: 1.000 | Max: 1.000 | 🟢 EXCELLENT

Tool Usage Effectiveness:
  Mean: 0.000 | Min: 0.000 | Max: 0.000 | 🔴 POOR

Policy Compliance:
  Mean: 1.000 | Min: 1.000 | Max: 1.000 | 🟢 EXCELLENT

Answer Corre