# Tracing and Evaluating Bedrock Agents with Langfuse

This notebook demonstrates how to integrate [Langfuse](https://langfuse.com) for tracing and evaluation of [Bedrock Agents](https://aws.amazon.com/bedrock/agents/). Bedrock Agents enable you to build generative AI applications that can perform tasks, orchestrate calls to company systems, and access knowledge sources.

We will cover:
1.  **Setup**: Installing necessary packages and configuring AWS and Langfuse credentials.
2.  **Basic Agent Invocation**: Interacting with a Bedrock Agent.
3.  **Tracing with OpenTelemetry**: Sending detailed traces of agent interactions to Langfuse.
4.  **Offline Evaluation**: Using Langfuse Datasets to systematically test your agent and compare performance across different versions or configurations.

## Part 0: Setup - Install Dependencies

First, let's install the necessary Python packages. This includes `boto3` for interacting with AWS services, OpenTelemetry packages for tracing, and `langfuse` for observability.

In [None]:
%pip install boto3
%pip install opentelemetry-api
%pip install opentelemetry-sdk
%pip install opentelemetry-exporter-otlp
%pip install wrapt
%pip install langfuse

Next, we import the libraries needed to call AWS, generate unique identifiers, and set up Langfuse tracing. The tracing helpers come from a custom `core` module that must be available locally alongside this notebook.

In [1]:
import time
import boto3
import uuid
import json
from core.timer_lib import timer
from core import instrument_agent_invocation, flush_telemetry
import os
import base64

## Part 1: Interacting with AWS Bedrock Agents

### Step 1.1: Configure AWS Credentials

To interact with your AWS Bedrock Agent, you need to configure your AWS credentials. 

**IMPORTANT**: Replace the placeholder values below with your actual `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION`. 
For more information on AWS credentials and best practices (like using IAM roles), refer to the [AWS Boto3 Credentials Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).

In [2]:
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_AWS_ACCESS_KEY_ID"  # placeholder, never commit real keys
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_AWS_SECRET_ACCESS_KEY"  # placeholder, never commit real keys
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

### Step 1.2: Specify Bedrock Agent Details

Provide the `agent_id` and `agent_alias_id` for the Bedrock Agent you want to interact with. You can find these in your AWS Bedrock console after creating and deploying an agent.

In [3]:
agent_id = "VDD9470BPM"  # <- Configure your Bedrock Agent ID
agent_alias_id = "TSTALIASID"  # <- Configure your Bedrock Agent alias ID

### Step 1.3: Initialize Bedrock Agent Runtime Client

We create a `boto3` client for the `bedrock-agent-runtime` service. This client will be used to invoke the agent.

In [4]:
import boto3

# Create the client to invoke Agents in Amazon Bedrock:
br_agents_runtime = boto3.client("bedrock-agent-runtime")

### Step 1.4: Test Basic Agent Invocation

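Let's perform a simple invocation to confirm that we can connect to the agent and receive a response. We'll use a fixed placeholder session ID for this initial test. A `sessionId` lets the agent maintain context across multiple turns in a conversation, so reuse the same value for related turns and pick a new one for each new conversation.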

In [5]:
print(f"Trying to invoke alias {agent_alias_id} of agent {agent_id}...")
agent_resp = br_agents_runtime.invoke_agent(
    agentAliasId=agent_alias_id,
    agentId=agent_id,
    inputText="Hello!",
    sessionId="dummy-session",
)
if "completion" in agent_resp:
    print("✅ Got response")
else:
    raise ValueError(f"No 'completion' in agent response:\n{agent_resp}")

Trying to invoke alias TSTALIASID of agent VDD9470BPM...
✅ Got response
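The check above only confirms that a `completion` key is present. In boto3, that value is an event stream whose events can carry a `chunk` with UTF-8 encoded `bytes`. A minimal sketch of collecting the text is shown here; `collect_completion_text` is a hypothetical helper name, and the sample events are stand-ins for a live response.

```python
from typing import Any, Iterable, Mapping


def collect_completion_text(events: Iterable[Mapping[str, Any]]) -> str:
    """Concatenate the bytes carried by `chunk` events and decode them as UTF-8."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"])
    # Join first, then decode once, so the text is decoded as a whole.
    return b"".join(parts).decode("utf-8")


# Stand-in events for illustration; a live call would pass agent_resp["completion"].
sample_events = [
    {"chunk": {"bytes": b"Hello! "}},
    {"trace": {"sessionId": "dummy-session"}},  # non-text event, skipped
    {"chunk": {"bytes": b"How can I help?"}},
]
print(collect_completion_text(sample_events))
```

The event stream of a live response can only be iterated once, so store the collected text if you need it again.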


## Part 2: Tracing Agent Invocations with Langfuse

Langfuse provides detailed tracing for your LLM applications. We'll use OpenTelemetry (OTEL) to send trace data from our Bedrock Agent interactions to Langfuse. This allows us to monitor performance, debug issues, and understand the agent's behavior.

Refer to the [Langfuse OpenTelemetry documentation](/docs/opentelemetry/get-started) for more details on setting up OTEL with Langfuse.

### Step 2.1: Configure Langfuse Credentials & OTEL Variables

Set up your Langfuse public key, secret key, and host. You can find your API keys in your Langfuse project settings ([Cloud](https://cloud.langfuse.com) or your self-hosted instance). 

In [6]:
# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # placeholder, replace with your public key
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # placeholder, replace with your secret key

os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region

# Langfuse-specific settings; other OTEL-compatible observability backends can be configured the same way:
os.environ["OTEL_SERVICE_NAME"] = 'Langfuse'
os.environ["DEPLOYMENT_ENVIRONMENT"] = "dev"
project_name = "agent-observability"
environment = "dev"

LANGFUSE_AUTH = base64.b64encode(
    f"{os.environ.get('LANGFUSE_PUBLIC_KEY')}:{os.environ.get('LANGFUSE_SECRET_KEY')}".encode()
).decode()

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = os.environ.get("LANGFUSE_HOST") + "/api/public/otel/v1/traces"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {LANGFUSE_AUTH}"
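To sanity-check the header built above, the same encoding can be reproduced and decoded with the standard library alone. This is a sketch with made-up keys; `build_basic_auth_header` is a hypothetical helper and is not part of Langfuse or OpenTelemetry.

```python
import base64


def build_basic_auth_header(public_key: str, secret_key: str) -> str:
    """Build an OTLP headers string of the form 'Authorization=Basic <base64(public:secret)>'."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return f"Authorization=Basic {token}"


# Made-up keys for illustration only.
header = build_basic_auth_header("pk-lf-example", "sk-lf-example")

# Decode the token again to confirm it round-trips to "public:secret".
encoded = header.split("Basic ", 1)[1]
decoded = base64.b64decode(encoded).decode()
print(decoded)
```

If the decoded value does not match `public_key:secret_key` exactly, for example because of stray whitespace in an environment variable, authentication against the OTLP endpoint is likely to fail.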

### Step 2.2: Define Instrumented Agent Invocation Function

We define a wrapper function, `invoke_bedrock_agent`, and decorate it with `@instrument_agent_invocation` from the `core` module. The decorator handles OpenTelemetry span creation and export to Langfuse, and it automatically captures details about the agent's execution, including inputs, outputs, metadata, and any errors. Note that `enableTrace` must be set to `True` in the invocation parameters for the instrumentation to work.

In [7]:
@instrument_agent_invocation
def invoke_bedrock_agent(
    inputText: str, agentId: str, agentAliasId: str, sessionId: str, **kwargs
):
    """Invoke a Bedrock Agent with instrumentation for Langfuse."""
    # Create Bedrock client
    bedrock_rt_client = boto3.client("bedrock-agent-runtime")
    use_streaming = kwargs.get("streaming", False)
    invoke_params = {
        "inputText": inputText,
        "agentId": agentId,
        "agentAliasId": agentAliasId,
        "sessionId": sessionId,
        "enableTrace": True,  # Required for instrumentation
    }

    # Add streaming configurations if needed
    if use_streaming:
        invoke_params["streamingConfigurations"] = {
            "applyGuardrailInterval": 10,
            "streamFinalResponse": True,
        }
    response = bedrock_rt_client.invoke_agent(**invoke_params)
    return response
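The `core` module's decorator is not shown in this notebook. To illustrate the general pattern, here is a simplified sketch of a decorator that records the input, output, duration, and any error for each call. It is not the actual `instrument_agent_invocation` implementation: `record_invocation`, `RECORDED_SPANS`, and `fake_agent` are hypothetical names, and the records are kept in a plain list rather than exported through OpenTelemetry.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory stand-in for a span exporter.
RECORDED_SPANS: List[Dict[str, Any]] = []


def record_invocation(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `func` and append one span-like dict to RECORDED_SPANS per call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        span: Dict[str, Any] = {"name": func.__name__, "input": kwargs.get("inputText")}
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            span["output"] = result
            return result
        except Exception as exc:
            # Record the failure, then re-raise so callers still see it.
            span["error"] = str(exc)
            raise
        finally:
            span["duration_s"] = time.perf_counter() - start
            RECORDED_SPANS.append(span)

    return wrapper


@record_invocation
def fake_agent(inputText: str, **kwargs: Any) -> str:
    """Stand-in for a real agent call."""
    if not inputText:
        raise ValueError("inputText must not be empty")
    return f"echo: {inputText}"


fake_agent(inputText="Hi there!")
print(RECORDED_SPANS[-1]["name"], RECORDED_SPANS[-1]["output"])
```

Recording in a `finally` block means a span is stored for failed calls as well as successful ones, which mirrors the claim that errors are captured alongside inputs and outputs.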

### Step 2.3: Invoke Agent with Langfuse Tracing

Now, let's call our instrumented function. We can pass additional metadata that will be captured by Langfuse, such as `trace_id` (to group related operations), `userId` (to track interactions per user), `tags` (for categorization), `project_name`, and `environment`. This metadata enriches the traces in Langfuse, making them easier to search, filter, and analyze. 

In [9]:
# Generate a custom trace ID
trace_id = str(uuid.uuid4())

# Single invocation that works for both streaming and non-streaming
response = invoke_bedrock_agent(
    inputText="Hi there!",
    agentId=agent_id,  # defined in Step 1.2
    agentAliasId=agent_alias_id,  # defined in Step 1.2
    sessionId="session-123456789",
    show_traces=True,
    SAVE_TRACE_LOGS=True,
    userId="user-1234",
    tags=["bedrock-agent", "example", "development"],
    trace_id=trace_id,
    project_name="bedrock-agent-observability",
    environment="dev",
    langfuse_public_key=os.environ.get('LANGFUSE_PUBLIC_KEY'),
    langfuse_secret_key=os.environ.get('LANGFUSE_SECRET_KEY'),
    langfuse_api_url=os.environ.get('LANGFUSE_HOST'),
    streaming=False,
    model_id="claude-3-5-sonnet-20241022",
)

print(response)

flush_telemetry()



## Part 3: Offline Evaluation of Bedrock Agents using Langfuse Datasets

While online evaluation provides live feedback, **offline evaluation** is crucial for systematically testing your agent against a benchmark dataset before deployment or during development iterations. This helps ensure quality and reliability. 

In a typical offline evaluation workflow with Langfuse Datasets:
1.  You prepare a benchmark dataset in Langfuse. Each dataset item consists of an input (e.g., a question) and optionally an expected output or other metadata.
2.  You iterate through the dataset items, running your Bedrock Agent for each input.
3.  You link each agent execution (which is a Langfuse trace) back to the corresponding dataset item in Langfuse.
4.  Optionally, you can add scores (e.g., for correctness, relevance) to these linked runs, either manually or using automated methods like [Model-Based Evals](/docs/scores/model-based-evals). Langfuse then enables you to compare performance across different evaluation runs.
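The workflow above can be sketched without any external service. The example below uses in-memory stand-ins to show the shape of the loop: run the agent for each item, remember which trace belongs to which item, and attach a score. `DatasetItem`, `stub_agent`, and `run_offline_eval` are hypothetical names and are not Langfuse SDK objects.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class DatasetItem:
    """Hypothetical stand-in for a dataset item: an input plus an optional expected output."""
    item_id: str
    input_text: str
    expected_output: Optional[str] = None


def stub_agent(question: str) -> Tuple[str, str]:
    """Stand-in for the real agent: returns (trace_id, output)."""
    canned = {"What is 2 + 2?": "4"}
    return str(uuid.uuid4()), canned.get(question, "I don't know")


def run_offline_eval(
    items: List[DatasetItem],
    agent: Callable[[str], Tuple[str, str]],
    run_name: str,
) -> List[Dict[str, object]]:
    """Run the agent on every item and record the item-to-trace link and a score."""
    results: List[Dict[str, object]] = []
    for item in items:
        trace_id, output = agent(item.input_text)
        score = None
        if item.expected_output is not None:
            # Exact match is the simplest possible scorer; real runs may need something softer.
            score = 1.0 if output.strip() == item.expected_output.strip() else 0.0
        results.append(
            {
                "run_name": run_name,
                "item_id": item.item_id,
                "trace_id": trace_id,
                "output": output,
                "score": score,
            }
        )
    return results


items = [
    DatasetItem("item-1", "What is 2 + 2?", "4"),
    DatasetItem("item-2", "What is the capital of France?", "Paris"),
]
results = run_offline_eval(items, stub_agent, run_name="run_test")
for row in results:
    print(row["item_id"], row["score"])
```

In the real workflow, the link and the score are stored in Langfuse rather than in a local list, which is what makes runs comparable later.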

### Step 3.2: Define Agent Function for Dataset Item Processing

We define a function `my_agent` that takes a dataset item's input, invokes the Bedrock agent through the instrumented `invoke_bedrock_agent` function, and returns the trace ID and the agent's output.

In [None]:
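def my_agent(question):
    """Run the instrumented agent on one dataset input and return (trace_id, output)."""
    # Generate a custom trace ID so the run can be linked to its dataset item
    trace_id = str(uuid.uuid4())

    response = invoke_bedrock_agent(
        inputText=question,
        agentId=agent_id,  # defined in Step 1.2
        agentAliasId=agent_alias_id,  # defined in Step 1.2
        sessionId=str(uuid.uuid4()),  # fresh session so dataset items do not share context
        show_traces=True,
        SAVE_TRACE_LOGS=True,
        userId="user-1234",
        tags=["bedrock-agent", "offline-evaluation"],
        trace_id=trace_id,
        project_name=project_name,  # defined in Step 2.1
        environment=environment,  # defined in Step 2.1
        langfuse_public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
        langfuse_secret_key=os.environ.get("LANGFUSE_SECRET_KEY"),
        langfuse_api_url=os.environ.get("LANGFUSE_HOST"),
        streaming=False,
        model_id="claude-3-5-sonnet-20241022",
    )

    # The 'extracted_completion' key is expected to be provided by the instrumented wrapper
    return trace_id, response["extracted_completion"]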

### Step 3.3: Run Agent on Dataset and Link Traces to Langfuse

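We fetch a dataset from Langfuse and iterate through its items. For each item, we run the agent on the item's input and call `item.link()` to attach the resulting trace to that item under a named run.

After the loop, `langfuse.flush()` ensures all trace data and linkage information are sent to Langfuse.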

In [None]:
from langfuse import Langfuse
langfuse = Langfuse()

dataset = langfuse.get_dataset('dataset-restaurant-agent')

for item in dataset.items:

    trace_id, output = my_agent(item.input["text"])

    # Link the execution trace to the dataset item under a named run
    item.link(
        trace_or_observation=langfuse.trace(id=trace_id),
        run_name="run_test",
        run_description="my dataset run",  # optional
        run_metadata={"model": "claude-3-5-sonnet-20241022"},  # optional
    )

langfuse.flush()

### Step 3.4: Analyze Results in Langfuse

After running the evaluation, navigate to your project in Langfuse. In the 'Datasets' section, select your dataset. You will find your evaluation run listed under the 'Runs' tab for that dataset.

Langfuse helps you compare these runs based on the captured traces and associated scores. You can set up [Model-Based Evals](/docs/scores/model-based-evals) in Langfuse to automatically assess aspects like correctness against expected answers (if your dataset includes them) or other quality dimensions based on LLM-as-a-judge.
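The comparison itself happens in Langfuse. To make the idea concrete, the sketch below aggregates per-item scores for two runs by hand. The run names and scores are made-up illustration data, and `summarize_runs` is a hypothetical helper rather than a Langfuse function.

```python
from statistics import mean
from typing import Dict, List


def summarize_runs(scores_by_run: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the mean score per run, skipping runs that have no scores."""
    return {run: mean(scores) for run, scores in scores_by_run.items() if scores}


# Made-up per-item correctness scores (1.0 = correct, 0.0 = incorrect).
scores_by_run = {
    "run_test": [1.0, 0.0, 1.0, 0.0],
    "run_test_v2": [1.0, 1.0, 1.0, 0.0],
}

summary = summarize_runs(scores_by_run)
best_run = max(summary, key=summary.get)
print(summary, best_run)
```

A single mean hides which items regressed, so it is worth also inspecting the per-item traces for any run that looks better or worse on aggregate.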