# Evaluate Langfuse Traces using an external evaluation pipeline

#### An external evaluation pipeline is useful when you need:
- More control over when traces get evaluated. You could schedule the pipeline to run at specific times or responding to event-based triggers like Webhooks.
- Greater flexibility with your custom evaluations, when your needs go beyond what’s possible with the Langfuse UI
- Version control for your custom evaluations
- The ability to evaluate data using existing evaluation frameworks and pre-defined metrics

In this notebook, we will learn to implement a external evaluation pipeline by doing the following:
1. Create a synthetic dataset to test your models.
2. Use the Langfuse client to gather and filter traces of previous model runs
3. Evaluate these traces offline and incrementally
4. Add scores to existing Langfuse traces



In [None]:
%pip install datasets ragas llama_index python-dotenv langchain-aws boto3 pandas --upgrade
%pip install langfuse==2.54.1 

In [13]:
# import all the necessary packages
from typing import Any, Dict, List, Optional, Tuple

import boto3
from botocore.exceptions import ClientError
from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe
from langfuse.model import PromptClient

In [14]:
# Define the environment variables
import os
os.environ["LANGFUSE_SECRET_KEY"] = "" # Your Langfuse project secret key
os.environ["LANGFUSE_PUBLIC_KEY"] = "" # Your Langfuse project public key
os.environ["LANGFUSE_HOST"] = "" # Langfuse domain

In [113]:
import json

from typing import List, Dict, Optional, Any

from langfuse import Langfuse
from langfuse.client import PromptClient
from langfuse.decorators import observe, langfuse_context
from botocore.exceptions import ClientError

# External Dependencies:
import boto3  # General Python SDK for AWS (including Bedrock)
import pandas as pd  # For working with tabular data
from tqdm.notebook import tqdm  # Progress bars

botosess = boto3.Session(region_name="us-east-1")
region = botosess.region_name
account_id = boto3.client('sts').get_caller_identity()['Account']

# langfuse client
langfuse = Langfuse()
langfuse.auth_check()

True

In [15]:
# used to access Bedrock configuration
bedrock = boto3.client(
    service_name="bedrock",
    region_name="us-east-1"
)
 
# used to invoke the Bedrock Converse API
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"
)

bedrock_agent_runtime = boto3.client(
    service_name="bedrock-agent-runtime",
    region_name="us-east-1"
)

# Check which models are available in your account
models = bedrock.list_inference_profiles()
for model in models["inferenceProfileSummaries"]:
  print(model["inferenceProfileName"] + " - " + model["inferenceProfileId"])

US Anthropic Claude 3 Sonnet - us.anthropic.claude-3-sonnet-20240229-v1:0
US Anthropic Claude 3 Opus - us.anthropic.claude-3-opus-20240229-v1:0
US Anthropic Claude 3 Haiku - us.anthropic.claude-3-haiku-20240307-v1:0
US Meta Llama 3.2 11B Instruct - us.meta.llama3-2-11b-instruct-v1:0
US Meta Llama 3.2 3B Instruct - us.meta.llama3-2-3b-instruct-v1:0
US Meta Llama 3.2 90B Instruct - us.meta.llama3-2-90b-instruct-v1:0
US Meta Llama 3.2 1B Instruct - us.meta.llama3-2-1b-instruct-v1:0
US Anthropic Claude 3.5 Sonnet - us.anthropic.claude-3-5-sonnet-20240620-v1:0
US Anthropic Claude 3.5 Haiku - us.anthropic.claude-3-5-haiku-20241022-v1:0
US Meta Llama 3.1 8B Instruct - us.meta.llama3-1-8b-instruct-v1:0
US Meta Llama 3.1 70B Instruct - us.meta.llama3-1-70b-instruct-v1:0
US Nova Lite - us.amazon.nova-lite-v1:0
US Nova Pro - us.amazon.nova-pro-v1:0
US Nova Micro - us.amazon.nova-micro-v1:0
US Meta Llama 3.3 70B Instruct - us.meta.llama3-3-70b-instruct-v1:0
US Anthropic Claude 3.5 Sonnet v2 - us.a

# Generate synthetic data

In this notebook, we consider a use case of leveraging a LLM to generate product descriptions that can be used in advising the product on a e-commerce page. The first step is to generate a list of products and for each of them, instruct Amazon Nova Lite to 
generate brief product descriptions.

In [28]:
# Let's prompt the model to generate 50 products

messages = [
    {"role": "user", "content": [{"text": "For a variety of 50 different product categories sold on a e-commerce website, \
    generate one product that is interesting to a consumer. The product names should be reflective of the actual product being sold. \
    Generate the 50 product items as comma separated values. Do not generate any additional words apart from the product names"}
                                ]
    },
]

# Make the API call to the Nova Lite model
model_response = bedrock_runtime.converse(
    modelId='us.amazon.nova-lite-v1:0', 
    messages=messages
)

# Print the generated text
print("\n[Response Content Text]")
print(model_response['output']['message']['content'][0]['text'])


[Response Content Text]
"SmartHome Voice Assistant", "Eco-Friendly Reusable Water Bottle", "Wireless Noise-Canceling Headphones", "Portable Solar Charger", "AI-Powered Personal Fitness Coach", "Ergonomic Standing Desk Converter", "Augmented Reality Glasses", "High-Resolution 4K Smart TV", "Multi-Functional Electric Pressure Cooker", "Smart Home Security Camera", "Foldable Electric Scooter", "Smart Sleep Mask", "Biodegradable Dishwashing Tablets", "Compact Travel Vacuum Cleaner", "Smart Pet Feeder", "Handheld Cordless Vacuum", "Smart Indoor Herb Garden", "UV-C Disinfection Box", "Wearable Fitness Tracker", "Smart Refrigerator", "Noise-Canceling Earplugs", "Automated Pet Walking Robot", "High-Capacity Portable Charger", "Smart Air Purifier", "Electric Toothbrush with Smart Sensor", "Smart Lock with Fingerprint Access", "GPS-Enabled Smart Dog Collar", "Smart LED Desk Lamp", "Portable Mini Projector", "Smart Irrigation System", "Voice-Activated Home Assistant", "Robot Vacuum Cleaner", "Sm

In [29]:
products_text = model_response['output']['message']['content'][0]['text']
products = [item.strip() for item in products_text.split(",")]

for prd in products:
    print(prd)

"SmartHome Voice Assistant"
"Eco-Friendly Reusable Water Bottle"
"Wireless Noise-Canceling Headphones"
"Portable Solar Charger"
"AI-Powered Personal Fitness Coach"
"Ergonomic Standing Desk Converter"
"Augmented Reality Glasses"
"High-Resolution 4K Smart TV"
"Multi-Functional Electric Pressure Cooker"
"Smart Home Security Camera"
"Foldable Electric Scooter"
"Smart Sleep Mask"
"Biodegradable Dishwashing Tablets"
"Compact Travel Vacuum Cleaner"
"Smart Pet Feeder"
"Handheld Cordless Vacuum"
"Smart Indoor Herb Garden"
"UV-C Disinfection Box"
"Wearable Fitness Tracker"
"Smart Refrigerator"
"Noise-Canceling Earplugs"
"Automated Pet Walking Robot"
"High-Capacity Portable Charger"
"Smart Air Purifier"
"Electric Toothbrush with Smart Sensor"
"Smart Lock with Fingerprint Access"
"GPS-Enabled Smart Dog Collar"
"Smart LED Desk Lamp"
"Portable Mini Projector"
"Smart Irrigation System"
"Voice-Activated Home Assistant"
"Robot Vacuum Cleaner"
"Smart Sleep Apnea Monitor"
"Electric Toothbrush with UV Ste

#### For each of the products, we will now generate product descriptions using Amazon Nova Lite, and capture the traces to Langfuse using the ```@observe()``` decorator

In [52]:
# Generate product descriptions for each product

@observe()
def general_chat(
    product,
    messages: List[Dict[str, Any]],
    prompt: Optional[PromptClient] = None,
    **kwargs,
) -> str | None:
    # 1. extract model metadata
    #modelId = "anthropic.claude-3-haiku-20240307-v1:0"
    modelId = "us.amazon.nova-lite-v1:0"
    inferenceConfig = {"maxTokens": 500, "temperature": 0.1}
    #additionalModelRequestFields = {"top_k": 250}, this seems to be causing the issue with Nova model, need to revist
    additionalModelRequestFields = {}

    model_parameters = {**inferenceConfig, **additionalModelRequestFields}

    langfuse_context.update_current_observation(
        input=messages,
        model=modelId,
        model_parameters=model_parameters,
        prompt=prompt,
        metadata=kwargs,
    )
    
    langfuse_context.update_current_trace(
        name=f"Description of '{product}'",
        tags=["bedrock_eval_pipelines"]
    )

    # Extract the system prompts from the messages and convert them to the format expected by the Bedrock Converse API
    system_prompts = [
        {"text": message["content"]}
        for message in messages
        if message["role"] == "system"
    ]

    # Convert the rest of messages to the format expected by the Bedrock Converse API
    messages = [
        {
            "role": message["role"],
            "content": (
                message["content"]
                if isinstance(message["content"], list)
                else [{"text": message["content"]}]
            ),
        }
        for message in messages
        if message["role"] != "system"  # Add this condition
    ]

    # 2. model call with error handling
    try:
        response = bedrock_runtime.converse(
            modelId=modelId,
            messages=messages,
            system=system_prompts,
            inferenceConfig=inferenceConfig,
            additionalModelRequestFields=additionalModelRequestFields,
            **kwargs,
        )
    except (ClientError, Exception) as e:
        error_message = f"ERROR: Can't invoke '{modelId}'. Reason: {e}"
        langfuse_context.update_current_observation(
            level="ERROR", status_message=error_message
        )
        print(error_message)
        return

    # 3. extract response metadata
    response_text = response["output"]["message"]["content"][0]["text"]
    langfuse_context.update_current_observation(
        output=response_text,
        usage={
            "input": response["usage"]["inputTokens"],
            "output": response["usage"]["outputTokens"],
            "total": response["usage"]["totalTokens"],
        },
        metadata={
            "ResponseMetadata": response["ResponseMetadata"],
        },
    )

    return response_text
 
prompt_template = "You are a product marketer and you need to generate detailed \
product descriptions for products which will be used for selling \
the product on a e-commerce website. Any catchy phrases from the \
descriptions will also be used for social meda campaigns. \
From the product descriptions, customers should be able to understand \
how the product can help them in their lives but also be able to trust \
this company. Your descriptions are fun and engaging. \
Your answer should be 4 sentences at max."

for product in products:
    print(f"Input: Generate a description for {product}")
    messages = [
        {"role": "system", "content": prompt_template},
        {"role": "user", "content": f"Generate a description for {product}"}
    ]
    print(f"Answer: {general_chat(product, messages)} \n")

Input: Generate a description for "SmartHome Voice Assistant"
Answer: Unleash the power of voice control with our SmartHome Voice Assistant! Seamlessly integrate and command your smart devices effortlessly. Experience convenience and efficiency like never before, transforming your home into a futuristic haven. Trust in our cutting-edge technology to simplify your life effortlessly. 

Input: Generate a description for "Eco-Friendly Reusable Water Bottle"
Answer: Quench your thirst sustainably with our Eco-Friendly Reusable Water Bottle! Crafted from BPA-free materials, it's your perfect companion for a greener lifestyle. Stay hydrated on-the-go while reducing plastic waste—because every sip counts! Join the movement and sip smart today. 

Input: Generate a description for "Wireless Noise-Canceling Headphones"
Answer: Elevate your audio experience with our Wireless Noise-Canceling Headphones! Immerse yourself in crystal-clear sound, while the advanced noise-cancellation technology blocks

### Now you should see these product descriptions in the Traces section of the langfuse UI.
![Traces collected from the LLM generations](product_description_traces.png "Langfuse Traces")


The goal of this tutorial is to show you how to build an external evaluation pipeline. These pipelines will run in your CI/CD environment, or be run in a different orchestrated container service. No matter the environment you choose, three key steps always apply:

1. Fetch Your Traces: Get your application traces to your evaluation environment
2. Run Your Evaluations: Apply any evaluation logic you prefer
3. Save Your Results: Attach your evaluations back to the Langfuse trace used for calculating them.

***
Goal: This evaluation pipeline is executed on all the traces over the past 24 hours
***

## 1. Fetch the traces

The ```fetch_traces()``` function has arguments to filter the traces by tags, timestamps, and beyond. We can also choose the number of samples for pagination.

In [53]:
from langfuse import Langfuse
from datetime import datetime, timedelta
 
BATCH_SIZE = 10
TOTAL_TRACES = 50
 
now = datetime.now()
last_24_hours = now - timedelta(days=1)

langfuse = Langfuse(
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    host="https://us.cloud.langfuse.com" 
)
 
traces_batch = langfuse.fetch_traces(page=1,
                                     limit=BATCH_SIZE,
                                     tags="bedrock_eval_pipelines",
                                     from_timestamp=last_24_hours,
                                     to_timestamp=datetime.now()
                                   ).data
 
print(f"Traces in first batch: {len(traces_batch)}")

Traces in first batch: 10


In [54]:
traces_batch[1]

TraceWithDetails(id='3e4f798e-4a3d-4118-9a2d-470da4829402', timestamp=datetime.datetime(2025, 2, 15, 18, 49, 57, 617000, tzinfo=datetime.timezone.utc), name='Description of \'"Smart Luggage with GPS Tracker"\'', input=[{'role': 'system', 'content': 'You are a product marketer and you need to generate detailed product descriptions for products which will be used for selling the product on a e-commerce website. Any catchy phrases from the descriptions will also be used for social meda campaigns. From the product descriptions, customers should be able to understand how the product can help them in their lives but also be able to trust this company. Your descriptions are fun and engaging. Your answer should be 4 sentences at max.'}, {'role': 'user', 'content': 'Generate a description for "Smart Luggage with GPS Tracker"'}], output="Travel with ease and confidence using our Smart Luggage with GPS Tracker! Say goodbye to lost luggage worries, as our innovative GPS feature allows you to locat

## 2. Categorical Evaluation using LLM-as-a-judge

Evaluation functions should take a trace as input and yield a valid score.
When analyzing the outputs of your LLM applications, you may want to evaluate traits that are defined qualitatively such as readability, helpfulness or measures for reducing hallucinations such as completeness.

We're building product descriptions and to ensure it resonates with customers, we want to measure readability. For more LLM-as-a-judge definitions, check out the judge based evaluator prompts defined in the [Amazon Bedrock Evaluator Prompts](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html)

In [None]:
template_readability = """
You are a helpful agent that can assess an LLM response according to the given rubrics.

You are given a product description generated by a LLM. Your task is to assess the readability of the LLM response to the question, in other words, how easy it is for a typical reading audience to comprehend the response at a normal reading rate.

Please rate the readability of the response based on the following scale:
- unreadable: The response contains gibberish or could not be comprehended by any normal audience.
- poor readability: The response is comprehensible, but it is full of poor readability factors that make comprehension very challenging.
- fair readability: The response is comprehensible, but there is a mix of poor readability and good readability factors, so the average reader would need to spend some time processing the text in order to understand it.
- good readability: Very few poor readability factors. Mostly clear, well-structured sentences. Standard vocabulary with clear context for any challenging words. Clear organization with topic sentences and supporting details. The average reader could comprehend by reading through quickly one time.
- excellent readability: No poor readability factors. Consistently clear, concise, and varied sentence structures. Simple, widely understood vocabulary. Logical organization with smooth transitions between ideas. The average reader may be able to skim the text and understand all necessary points.

Here is the product description that needs to be evaluated: {prd_desc}

Firstly explain your response, followed by your final answer. You should follow the format
Explanation: [Explanation], Answer: [Answer],
where '[Answer]' can be one of the following:
```
unreadable
poor readability
fair readability
good readability
excellent readability
```
"""

def generate_readability_score(trace_output):
    prd_desc_readability = template_readability.format(prd_desc=trace_output)
    message_1 = {
        "role": "user",
        "content": [{"text": prd_desc_readability}]
    }
    
    query = [f"Rate the readability of product description: {traces_batch[1].output}"]
    
    readability_score = bedrock_runtime.converse(
        modelId='us.amazon.nova-lite-v1:0', 
        messages = [message_1]
    )
    explanation, score = readability_score['output']['message']['content'][0]['text'].split('\n\n')
    return explanation, score


print(f"User query: {traces_batch[1].input[1]['content']}")
print(f"Model answer: {traces_batch[1].output}")
explanation, score = generate_readability_score(traces_batch[1].output)
print(f"Readability: {score}, Explanation: {explanation}")

## 3. Add the evaluation to the trace

Now that we have generated a readability score as well as a explanation, we can use the Langfuse client to add scores to existing traces.

In [110]:
langfuse.score(
    trace_id=traces_batch[1].id,
    name="readability",
    value=explanation,
    comment=score
)

<langfuse.client.StatefulClient at 0x7f3cec2a89a0>

# Putting everything together

We just saw how to do this for one trace, let's put it all together in a function to run it on all the traces collected in the last 24 hours.

In [112]:
import math
 
for page_number in range(1, math.ceil(TOTAL_TRACES/BATCH_SIZE)):
 
    traces_batch = langfuse.fetch_traces(page=1,
                                         limit=BATCH_SIZE,
                                         tags="bedrock_eval_pipelines",
                                         from_timestamp=last_24_hours,
                                         to_timestamp=datetime.now()
                                       ).data
 
    for trace in traces_batch:
        print(f"Processing {trace.name}")
 
        if trace.output is None:
            print(f"Warning: \n Trace {trace.name} had no generated output, \
            it was skipped")
            continue
        explanation, score = generate_readability_score(trace.output)
        langfuse.score(
            trace_id=trace.id,
            name="readability",
            value=score,
            comment=explanation
        )
 
    print(f"Batch {page_number} processed 🚀 \n")

Processing Description of '"Electric Standing Desk"'
Processing Description of '"Smart Luggage with GPS Tracker"'
Processing Description of '"Smart Pet Fountain"'
Processing Description of '"Electric Car Jump Starter"'
Processing Description of '"Smart Lock with Keypad Access"'
Processing Description of '"Smart Wine Cooler"'
Processing Description of '"Electric Heated Blanket"'
Processing Description of '"Smart Coffee Maker with App Control"'
Processing Description of '"Smart Home Energy Monitor"'
Processing Description of '"Smart Bed Frame with Sleep Tracker"'
Batch 1 processed 🚀 

Processing Description of '"Electric Standing Desk"'
Processing Description of '"Smart Luggage with GPS Tracker"'
Processing Description of '"Smart Pet Fountain"'
Processing Description of '"Electric Car Jump Starter"'
Processing Description of '"Smart Lock with Keypad Access"'
Processing Description of '"Smart Wine Cooler"'
Processing Description of '"Electric Heated Blanket"'
Processing Description of '"S

#### If your pipeline ran successfully, you should now see scores added to your traces

![Langfuse Trace with score added for readability](scored_trace.png "Scored trace on langfuse")