# Custom Agent Evaluation (Ragas and LangFuse)

This notebook provides a streamlined interface for evaluating agent performance using RAGAS metrics and LangFuse traces. All evaluation logic lives in `utils.py`, and the metrics are configured in `metrics_config.yaml`.

## Setup and Configuration

In [1]:
# Install required packages
%pip install ragas "strands-agents==0.1.9" "strands-agents-tools==0.1.7" "langfuse==3.1.1" pyyaml -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.4.0 requires nvidia-ml-py3<8.0,>=7.352.0, which is not installed.
autogluon-common 1.4.0 requires pyarrow<21.0.0,>=7.0.0, but you have pyarrow 21.0.0 which is incompatible.
autogluon-multimodal 1.4.0 requires transformers[sentencepiece]<4.50,>=4.38.0, but you have transformers 4.55.2 which is incompatible.
autogluon-timeseries 1.4.0 requires transformers[sentencepiece]<4.50,>=4.38.0, but you have transformers 4.55.2 which is incompatible.
langchain-aws 0.2.19 requires boto3>=1.37.24, but you have boto3 1.37.1 which is incompatible.
mlflow 2.22.0 requires pyarrow<20,>=4.0.0, but you have pyarrow 21.0.0 which is incompatible.
pathos 0.3.4 requires multiprocess>=0.70.18, but you have multiprocess 0.70.16 which is incompatible.
Note: you may need to restart the kernel to use up

In [1]:
import os
import base64
from utils import run_evaluation_pipeline, print_metric_summary

## Configuration of Test Parameters

Modify these parameters according to your evaluation needs:

In [11]:
# =============================================================================
# CONFIGURATION PARAMETERS - MODIFY AS NEEDED
# =============================================================================

# LangFuse Configuration
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-<your-secret-key>"  # Starts with sk-; never commit a real key
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-<your-public-key>"  # Starts with pk-
os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com"

# Evaluation Parameters
LOOKBACK_HOURS = 24          # Hours to look back for traces
BATCH_SIZE = 20              # Number of traces to process
LANGFUSE_TAGS = ""           # Tags to filter traces by (leave empty to fetch all)
SAVE_CSV = True              # Save results to CSV files

# Target LLM-as-Judge Model (from model_list.json)
TARGET_MODEL = "claude-3.7-sonnet"  # Available models: claude-4-sonnet, nova-premier, etc.

# File Paths
METRICS_CONFIG_PATH = "metrics_config.yaml"
MODEL_LIST_PATH = "model_list.json"
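
The configuration cell above assigns the LangFuse keys inline. As a minimal sketch of an alternative, the helper below reads a key that was exported before the notebook started and checks the prefix noted in the comments (`sk-` and `pk-`). The function name `require_key` is hypothetical and is not part of `utils.py`.

```python
import os


def require_key(name: str, prefix: str, env=os.environ) -> str:
    """Return env[name], raising if it is missing or has the wrong prefix.

    Hypothetical helper: `env` defaults to the process environment and
    can be swapped for a plain dict when testing.
    """
    value = env.get(name, "")
    if not value.startswith(prefix):
        raise ValueError(f"{name} is missing or does not start with {prefix!r}")
    return value


# Example usage (assumes the variables were exported before launching Jupyter):
# secret_key = require_key("LANGFUSE_SECRET_KEY", "sk-")
# public_key = require_key("LANGFUSE_PUBLIC_KEY", "pk-")
```

Failing early on a missing or malformed key tends to give a clearer message than an authentication error raised later, while traces are being fetched.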

### Initialize LangFuse Environment

In [12]:
# Setup OpenTelemetry endpoint
otel_endpoint = os.environ["LANGFUSE_HOST"] + "/api/public/otel/v1/traces"
# Single quotes inside the f-string keep this line valid on Python versions before 3.12
auth_token = base64.b64encode(f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()).decode()
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = otel_endpoint
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth_token}"

print(f"Environment configured for LangFuse host: {os.environ['LANGFUSE_HOST']}")
print(f"Target evaluation model: {TARGET_MODEL}")

Environment configured for LangFuse host: https://us.cloud.langfuse.com
Target evaluation model: claude-3.7-sonnet
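
The cell above builds an HTTP Basic credential by base64-encoding `public_key:secret_key` and passes it to the OpenTelemetry exporter through environment variables. The same steps can be sketched as a pure function, which makes the encoding easy to inspect without touching `os.environ`. The function name `langfuse_otel_settings` is an assumption made for illustration.

```python
import base64


def langfuse_otel_settings(host: str, public_key: str, secret_key: str) -> dict:
    """Build the OTLP endpoint and Basic-auth header as the cell above does."""
    # HTTP Basic auth: base64("<public_key>:<secret_key>")
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": host.rstrip("/") + "/api/public/otel/v1/traces",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Example with dummy credentials (not real keys):
settings = langfuse_otel_settings("https://us.cloud.langfuse.com", "pk-demo", "sk-demo")
```

The returned dictionary can be applied with `os.environ.update(settings)`.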


## Run Evaluation Pipeline

Execute the complete evaluation pipeline with the configured parameters:

In [None]:
# Prepare LangFuse configuration
langfuse_config = {
    "secret_key": os.environ["LANGFUSE_SECRET_KEY"],
    "public_key": os.environ["LANGFUSE_PUBLIC_KEY"],
    "host": os.environ["LANGFUSE_HOST"]
}

# Run the evaluation pipeline
print("Starting RAGAS evaluation pipeline...")
print(f"Configuration: {LOOKBACK_HOURS}h lookback, {BATCH_SIZE} traces, model: {TARGET_MODEL}")

results = run_evaluation_pipeline(
    langfuse_config=langfuse_config,
    model_name=TARGET_MODEL,
    lookback_hours=LOOKBACK_HOURS,
    batch_size=BATCH_SIZE,
    tags=LANGFUSE_TAGS,
    save_csv=SAVE_CSV,
    metrics_config_path=METRICS_CONFIG_PATH,
    model_list_path=MODEL_LIST_PATH
)

print("\nEvaluation pipeline completed!")

Starting RAGAS evaluation pipeline...
Configuration: 24h lookback, 20 traces, model: claude-3.7-sonnet
Fetching traces from 2025-10-22 18:52:44.366111 to 2025-10-23 18:52:44.366111
Fetched 9 traces
Evaluating 7 multi_turn samples


Evaluating:   0%|          | 0/35 [00:00<?, ?it/s]

Error raised by bedrock service
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/langchain_aws/llms/bedrock.py", line 935, in _prepare_input_and_invoke
    response = self.client.invoke_model(**request_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/botocore/client.py", line 570, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/botocore/context.py", line 124, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/botocore/client.py", line 1031, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait befor
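
The traceback above shows Bedrock throttling the judge-model calls even after botocore's own retries were exhausted. How `utils.py` issues those calls is not shown here, so the following is only a sketch of one common mitigation: wrapping a call in exponential backoff with full jitter. The names `call_with_backoff` and `is_throttling_error` are hypothetical. The check relies on the `response["Error"]["Code"]` field that botocore attaches to its client errors.

```python
import random
import time


def is_throttling_error(exc: Exception) -> bool:
    """True if exc looks like a botocore ThrottlingException."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "ThrottlingException"


def call_with_backoff(fn, is_retryable=is_throttling_error, max_attempts=6,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying retryable errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors and on the final attempt.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            # The delay cap doubles each attempt; sleep a random time below it.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Lowering `BATCH_SIZE` is another option that may reduce the request rate, depending on how the pipeline schedules its calls.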

## View Results Summary

In [6]:
# Display results summary with configurable performance ranges
if results:
    
    has_results = False
    
    # Performance range configuration - adjust as needed
    # Examples: [0, 1] for 0-1 scale, [1, 5] for 1-5 scale
    PERFORMANCE_RANGE = [0, 1]  # Change this to [1, 5] for 1-5 scale evaluation
    
    if "conversation_results" in results and results["conversation_results"] is not None:
        if not results["conversation_results"].empty:
            print_metric_summary(
                results["conversation_results"], 
                "MULTI-TURN CONVERSATION EVALUATION",
                performance_range=PERFORMANCE_RANGE
            )
            has_results = True
    
    if "single_turn_results" in results and results["single_turn_results"] is not None:
        if not results["single_turn_results"].empty:
            print_metric_summary(
                results["single_turn_results"], 
                "SINGLE-TURN EVALUATION",
                performance_range=PERFORMANCE_RANGE
            )
            has_results = True
    
    if not has_results:
        print("\n⚠️  No evaluation results available - check trace availability and configuration")
else:
    print("\n❌ No results returned from evaluation pipeline")


  MULTI-TURN CONVERSATION EVALUATION
📊 Samples Evaluated: 3

📈 METRIC SCORES SUMMARY
----------------------------------------

Tone of the agent Metric:
  Mean: 0.000 | Min: 0.000 | Max: 0.000 | 🔴 POOR

Tool Usage Effectiveness:
  Mean: 0.000 | Min: 0.000 | Max: 0.000 | 🔴 POOR

Task/Objective Accomplished:
  Mean: 0.667 | Min: 0.000 | Max: 1.000 | 🟡 GOOD

CI/CD Gate (Accuracy Regressions):
  Mean: 1.000 | Min: 1.000 | Max: 1.000 | 🟢 EXCELLENT

Policy Compliance:
  Mean: 0.333 | Min: 0.000 | Max: 1.000 | 🔴 POOR

Answer Correctness:
  Mean: 5.000 | Min: 5.000 | Max: 5.000 | 🟢 EXCELLENT
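
In the summary above, Answer Correctness reports a mean of 5.000 while `PERFORMANCE_RANGE` is `[0, 1]`, which suggests that this metric is scored on a 1-5 scale and would need its own range. The sketch below shows one way a score could be normalized against a range before being rated. The thresholds (0.6 and 0.8) and the function names are assumptions, chosen only to be consistent with the labels printed above. The actual logic inside `print_metric_summary` is not shown in this notebook.

```python
def normalize_score(score: float, performance_range) -> float:
    """Map score from [low, high] onto [0, 1], rejecting out-of-range values."""
    low, high = performance_range
    if high <= low:
        raise ValueError("performance_range must be [low, high] with low < high")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range [{low}, {high}]")
    return (score - low) / (high - low)


def rate(score: float, performance_range, good=0.6, excellent=0.8) -> str:
    """Return a rating label; the thresholds are hypothetical, not from utils.py."""
    fraction = normalize_score(score, performance_range)
    if fraction >= excellent:
        return "EXCELLENT"
    if fraction >= good:
        return "GOOD"
    return "POOR"
```

With this sketch, `rate(5.0, [1, 5])` is rated on the intended scale, whereas `rate(5.0, [0, 1])` raises a `ValueError` instead of silently reporting a top score.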
