# How to evaluate Langfuse traces using Flow Judge

This notebook demonstrates how to use Flow Judge to evaluate Langfuse traces. 
It covers setting up Langfuse and Flow Judge, evaluating a single trace,
and batch evaluating multiple traces.

In this tutorial, we will simulate two different types of usage scenarios:

1. **Evaluating Traces in a Production Setup**: We will demonstrate how to integrate Flow Judge with a deployed version of the model to evaluate traces in real-time. This setup is ideal for applications where continuous monitoring and evaluation of LLM system responses are required.

2. **Batch Evaluations in an Offline Setup**: We will also cover how to perform batch evaluations of multiple traces in an offline environment. This approach is useful for retrospective analysis, development, experimentation, and evaluation of LLM system performance on historical data, allowing for comprehensive assessments without the need for a live system.


By the end of this notebook, you will have a clear understanding of how to implement both real-time and batch evaluation strategies using Flow Judge and Langfuse, tailored to your specific production and analysis needs. 

## Langfuse
Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications. It integrates with LlamaIndex, LangChain, LiteLLM, and more.

## Flow Judge
Flow-Judge-v0.1 is an open, small yet powerful language model evaluator designed for custom LLM system evaluation. The Flow Judge library supports Transformers, vLLM, Llamafile, and Baseten model types. It provides pre-defined evaluation metrics, makes it easy to create your own custom metrics and rubrics, and supports batched evaluation for efficient processing.


## Set-up Langfuse and Flow Judge

In this section, we'll set up our Langfuse credentials and initialize 
the Flow Judge model for evaluation.


In [1]:
# Set up credentials 
import os

# Get keys for your project from 
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
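Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. Below is a minimal sketch of an alternative, assuming the keys are already exported in your shell or injected by a secrets manager. The helper names `require_env` and `load_langfuse_credentials` are hypothetical and not part of Langfuse; only the two variable names come from the cell above.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def load_langfuse_credentials() -> dict:
    """Collect the two Langfuse keys used in this notebook (hypothetical helper)."""
    return {
        "public_key": require_env("LANGFUSE_PUBLIC_KEY"),
        "secret_key": require_env("LANGFUSE_SECRET_KEY"),
    }
```

The returned values can be passed as the `public_key` and `secret_key` arguments when the Langfuse client is initialized, so a missing key fails early with a readable error instead of surfacing later as an authentication problem.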


In [2]:
#!pip install langfuse
from langfuse import Langfuse

# Initialize the Langfuse client (bound to the name `langfuse` for the rest of the notebook)
langfuse = Langfuse(
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
)

In [3]:
# Import flow-judge
from flow_judge import EvalInput, FlowJudge, Hf #Baseten
from flow_judge.metrics import CustomMetric, RubricItem
from IPython.display import Markdown, display

# Initialize the model
model = Hf()
#model = Baseten()


INFO:flow_judge.models.huggingface:Downloading the model from Hugging Face Hub using hf-transferfor faster downloads...


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In this example we will use a custom metric to demonstrate the flexibility to tailor evaluations to your needs. 

1. To create your own metric, refer to the tutorial on [how to create custom metrics](examples/2_custom_evaluation_criteria.ipynb).


2. Flow Judge also offers a set of pre-defined metrics. To see the full list of available metrics, run:
    ```python
    from flow_judge import list_all_metrics 
    list_all_metrics() 
    ```

## 1. Scoring individual traces in production set-up

We'll demonstrate how to evaluate a single trace using Flow Judge by creating a sample trace with a user query, retrieved context, and a response.
 
This example demonstrates the evaluation of an LLM system designed to assist with customer support. We will create a custom metric to evaluate issue resolution.

### 1.1 Evaluation sample

In [4]:
# E-commerce Customer Support Example

# Query
query = """I placed an order last week, but I haven't received any shipping updates. Can you help me track my order?"""

# Context
context = """

 Customer Details:
- Customer ID: CUST-987654321 

# Order Details: 
- Customer ID: 
- Order ID: 123456789
- Order Date: October 1, 2023
- Current Status: In Transit

# Shipping Policy:
- Standard shipping takes 5-7 business days.
- Tracking info is emailed once shipped.
"""

# Response
response = """
# Tracking Your Order

Your order (ID: 123456789) is currently processing. Standard shipping typically takes 5-7 business days from the order date, October 1, 2023. You can expect your order to arrive between October 6 and October 8, 2023.

1. **Check Email**: Look for tracking emails in your inbox or spam once the order ships.
2. **Track Online**: Use Order ID 123456789 on the courier's site for updates.
3. **Contact Support**: If you don't receive updates by October 8, please contact us for assistance.

Thank you for your patience.
"""


### 1.2 Create a custom metric and initialize the judge

Now we will create a custom metric on a 5-point Likert scale to evaluate how effectively the response resolves the customer's issue. To learn more about creating custom metrics, please refer to our [tutorial](examples/2_custom_evaluation_criteria.ipynb).

In [5]:
from flow_judge.metrics import CustomMetric, RubricItem

# Define the criteria for issue resolution
evaluation_criteria = """How well does the response resolve the customer's issue?"""

# Define the rubric as a list of RubricItem objects, one per point on a 5-point Likert scale
rubric = [
    RubricItem(
        score=1,
        description="The response does not address the customer's issue at all. The customer is likely to be very dissatisfied."),
    RubricItem(
        score=2,
        description="The response addresses the issue poorly, with significant gaps or incorrect information. The customer is likely to be dissatisfied."),
    RubricItem(
        score=3,
        description="The response partially addresses the issue, but lacks clarity or completeness. The customer may be neutral or slightly dissatisfied."),
    RubricItem(
        score=4,
        description="The response mostly resolves the issue with clear and mostly accurate information. The customer is likely to be satisfied."),
    RubricItem(
        score=5,
        description="The response fully resolves the issue with clear and accurate information. The customer is likely to be very satisfied."),
]

# Define the required inputs and output for the metric
required_inputs = ["query", "context"]
required_output = "response"

# Create the custom metric
issue_resolution_metric = CustomMetric(
    name="issue-resolution",
    criteria=evaluation_criteria,
    rubric=rubric,
    required_inputs=required_inputs,
    required_output=required_output
)

# Initialize the Flow Judge 
issue_resolution_judge = FlowJudge(metric=issue_resolution_metric, model=model)
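A hand-written rubric can easily end up with a skipped or duplicated score. Below is a small sanity check that can be run before a rubric is handed to a metric. It is a sketch under stated assumptions: `RubricItemSketch` is a stand-in dataclass that mirrors only the `score` and `description` fields used above, and `check_rubric` is a hypothetical helper, not part of Flow Judge.

```python
from dataclasses import dataclass


@dataclass
class RubricItemSketch:
    """Stand-in with the same two fields the rubric above uses (assumption)."""
    score: int
    description: str


def check_rubric(rubric, low: int = 1, high: int = 5) -> bool:
    """Verify the rubric has exactly one non-empty description per score in [low, high]."""
    scores = sorted(item.score for item in rubric)
    expected = list(range(low, high + 1))
    if scores != expected:
        raise ValueError(f"Rubric scores {scores} do not match expected {expected}")
    blank = [item.score for item in rubric if not item.description.strip()]
    if blank:
        raise ValueError(f"Rubric items with empty descriptions: {blank}")
    return True
```

Because the check reads only `.score` and `.description`, it should also accept the `rubric` list defined above, provided `RubricItem` exposes those attributes under the same names.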

### 1.3 Creating and updating a Langfuse trace with Flow Judge evaluation

Here we simulate how creating a trace, running your LLM system, and evaluating the response would work in production.

To integrate Langfuse with your own application, please refer to the Langfuse [quickstart guide](https://langfuse.com/docs/get-started).


In [6]:
# Set up trace when a new customer issue is created 
trace = langfuse.trace(name = "flow-judge-example-customer-issue")  

# Now your LLM system will interact with the customer issue and produce a response  
# For example retrieve the relevant context via RAG 
# context = retrieve_context(customer_issue) 
# Pass the context to trace as span
trace.span(
    name = "retrieval", input={'query': query}, output={'context': context}
)  

# Use LLM to generate a response to the customer issue 
# response = generate_response(query, context)
trace.span(
    name = "generation", input={"query":query, "context":context}, output={'response': response}
)

# Update trace with the inputs and output 
trace.update(input={"query":query, "context":context}, output={'response': response})

# Evaluate the trace with Flow Judge
# Create an EvalInput
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)

# Run the evaluation
result = issue_resolution_judge.evaluate(eval_input, save_results=False)

# Update trace with the score and feedback 
trace.score(name=issue_resolution_metric.name, value=result.score, comment=result.feedback)


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


<langfuse.client.StatefulClient at 0x7fa1fecf9b10>

In this example, we evaluated a single trace in a production setup using Flow Judge and Langfuse, focusing on an e-commerce customer support scenario. By defining evaluation criteria and using a Likert-5 scale rubric, we provided feedback on how well the response addressed the customer's issue. After the evaluation, we updated the Langfuse trace with the score and feedback, showcasing the integration of Flow Judge with Langfuse. 
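In a live request path, a failure inside the judge should usually not break the response to the customer. The sketch below wraps the evaluate-then-score step so that an evaluation error is logged and the trace is simply left unscored. `evaluate_and_score` and its parameters are hypothetical names introduced for illustration; the only assumption about the result object is that it exposes `score` and `feedback`, as in the cell above.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("trace_eval_sketch")


def evaluate_and_score(
    evaluate: Callable[[Any], Any],
    record_score: Callable[..., Any],
    eval_input: Any,
    metric_name: str,
) -> Optional[Any]:
    """Run the judge, then attach the score; return None if the judge fails."""
    try:
        result = evaluate(eval_input)
    except Exception:
        # Log with traceback and leave the trace unscored rather than failing the request.
        logger.exception("Evaluation failed for metric %s", metric_name)
        return None
    record_score(name=metric_name, value=result.score, comment=result.feedback)
    return result
```

With the objects from this notebook, a call might look like `evaluate_and_score(lambda x: issue_resolution_judge.evaluate(x, save_results=False), trace.score, eval_input, issue_resolution_metric.name)`.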


![Successful Trace in Langfuse](langfuse_images/langfuse_trace.png)

In [7]:
from IPython.display import Markdown, display  

# View the results
display(Markdown(f"**Score:** {result.score}"))
display(Markdown(f"**Feedback:** {result.feedback}"))


**Score:** 3

**Feedback:** The response partially addresses the customer's issue but lacks some crucial information and clarity. While it provides some useful information about the expected delivery window and suggests checking the email for tracking updates, it fails to acknowledge the customer's specific concern about not receiving any shipping updates. The response should have directly addressed this issue by explaining why the customer hasn't received updates yet. Additionally, the suggestion to contact support if no updates are received by October 8 is vague and could be more specific about what kind of updates to expect. The response also doesn't mention the possibility of using the courier's site for tracking, which could be helpful for the customer. Overall, while the response provides some useful information, it lacks the necessary clarity and completeness to fully resolve the customer's issue.

## 2. Batch evaluating multiple traces in an offline setup

In this section, we'll show how to batch evaluate multiple traces.
We'll use a sample dataset, load it into Langfuse, retrieve the traces,
and then evaluate them using Flow Judge.  

For this example we will use one of the built-in metrics for evaluating response faithfulness.

### 2.1 Load the data into Langfuse

>*NOTE: In practice you would fetch your traces directly from Langfuse and skip this first step.*

In [9]:
import json

# Load your JSON data
with open('sample_data/csr_assistant.json', 'r') as f:
    data = json.load(f)

# Create one Langfuse trace per sample, carrying its inputs and output
for item in data:
    trace = langfuse.trace(name = "flow-judge-csr-assistant")
    trace.update(
        input={
            "query": item['query'],
            "context": item['context']
            },
        output={"response": item['response']},
    )
# Ensure all events have been processed by the Langfuse SDK before attempting to retrieve them in the subsequent step
langfuse.flush()

### 2.2 Retrieve traces from Langfuse 

In [10]:
# Retrieve traces from Langfuse, filtered by the trace name used above
traces = langfuse.fetch_traces(
    page=1,
    limit=len(data),
    name="flow-judge-csr-assistant" 
    ).data
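The call above requests a single page sized to the sample dataset. When the number of matching traces is not known in advance, the pages can be walked until one comes back short. The sketch below is deliberately generic: `fetch_all_pages` is a hypothetical helper that takes any `fetch_page(page, limit)` callable returning a list, and it assumes results are returned page by page, as the `page` and `limit` parameters suggest.

```python
from typing import Any, Callable, List


def fetch_all_pages(
    fetch_page: Callable[[int, int], List[Any]],
    page_size: int = 50,
    max_pages: int = 100,
) -> List[Any]:
    """Collect items page by page, stopping at the first short or empty page."""
    items: List[Any] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, page_size)
        items.extend(batch)
        if len(batch) < page_size:
            break  # a short page means there is nothing left to fetch
    return items
```

With the client from this notebook, the callable could be `lambda page, limit: langfuse.fetch_traces(page=page, limit=limit, name="flow-judge-csr-assistant").data`. The `max_pages` cap is a safety limit chosen for this sketch, not a Langfuse setting.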

### 2.3 Evaluate the traces

Here we'll demonstrate how to batch evaluate multiple traces using Flow Judge. We'll prepare the inputs and outputs, create EvalInput objects,
and then run the batch evaluation.

In [12]:
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# create a list of inputs and outputs
inputs_batch = [
    [
        {"query": trace.input["query"]},
        {"context": trace.input["context"]}
    ]
    for trace in traces
]
outputs_batch = [{"response":trace.output["response"]} for trace in traces]

# create a list of EvalInputs
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]

# Initialize the judge  
faithfulness_judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=model)

# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

INFO:flow_judge.models.huggingface:Automatically determined batch size: 4
Processing batches: 100%|██████████| 2/2 [00:53<00:00, 26.58s/it]
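The list comprehensions above assume every trace carries a `query`, a `context`, and a `response`. Traces fetched from a real project may be missing some of these, which would raise an error while the batch is being built. A minimal sketch of a pre-filter follows; `is_evaluable` and `split_evaluable` are hypothetical helpers, and the stand-in objects only imitate the `input` and `output` attributes used above.

```python
from types import SimpleNamespace

REQUIRED_INPUT_KEYS = ("query", "context")


def is_evaluable(trace) -> bool:
    """True when the trace has non-empty query, context, and response fields."""
    trace_input = getattr(trace, "input", None)
    trace_output = getattr(trace, "output", None)
    if not isinstance(trace_input, dict) or not isinstance(trace_output, dict):
        return False
    has_inputs = all(trace_input.get(key) for key in REQUIRED_INPUT_KEYS)
    return has_inputs and bool(trace_output.get("response"))


def split_evaluable(traces):
    """Partition traces into (evaluable, skipped), preserving order."""
    valid, skipped = [], []
    for trace in traces:
        (valid if is_evaluable(trace) else skipped).append(trace)
    return valid, skipped


# Stand-in traces for illustration only
demo_traces = [
    SimpleNamespace(input={"query": "q", "context": "c"}, output={"response": "r"}),
    SimpleNamespace(input={"query": "q"}, output={"response": "r"}),
    SimpleNamespace(input=None, output=None),
]
demo_valid, demo_skipped = split_evaluable(demo_traces)
```

Building `inputs_batch` and `outputs_batch` from the evaluable list keeps the later `zip(traces, results)` aligned, as long as the same filtered list is used when the scores are written back.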


### 2.4 Update evaluation results to Langfuse

Finally, we'll update the Langfuse traces with the evaluation results.
This allows us to store the Flow Judge scores and feedback in Langfuse
for monitoring and analysis.

In [13]:
# Update the Langfuse traces with the evaluation results
for trace, result in zip(traces, results):
    langfuse.score( 
        trace_id=trace.id,
        name=RESPONSE_FAITHFULNESS_5POINT.name,
        value=result.score,
        comment=result.feedback
    )

# Flush to ensure all updates are sent to Langfuse
langfuse.flush()

# Shutdown Langfuse client
langfuse.shutdown()

![Batch evaluated traces in Langfuse](langfuse_images/batch_evaluated_traces.png) 

You can also monitor the evaluation results over time on the Langfuse dashboard.

![Langfuse dashboard](langfuse_images/langfuse_dashboard.png)
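Alongside the dashboard, a quick local summary of a batch can help spot problems before opening the UI. The sketch below computes the count, mean, and distribution of scores. `summarize_scores` is a hypothetical helper; it assumes only that each result exposes a `score` attribute, as the results in section 2.4 do, and it skips results whose score is `None`.

```python
from collections import Counter
from statistics import mean


def summarize_scores(results) -> dict:
    """Return count, mean, and per-score counts for results that have a score."""
    scores = [r.score for r in results if r.score is not None]
    return {
        "count": len(scores),
        "mean": mean(scores) if scores else None,
        "distribution": dict(sorted(Counter(scores).items())),
    }
```

Calling `summarize_scores(results)` on the batch output from section 2.3 would give a compact view of how the faithfulness scores are spread across the 5-point scale.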



## Summary

This notebook demonstrated the use of Flow Judge to evaluate Langfuse traces in two key scenarios: individual trace evaluation in a production setup and batch evaluation in an offline environment.

In the first section, we focused on an e-commerce customer support scenario, where we created a custom metric to assess the issue resolution of the LLM system's response. By defining clear evaluation criteria and utilizing a Likert-5 scale rubric, we provided feedback on how well the response addressed the customer's issue, updating the Langfuse trace with the score and feedback.

In the second section, we explored batch evaluation by loading a dataset into Langfuse, retrieving the traces, and evaluating them using a predefined faithfulness metric. This approach allows for efficient processing and analysis of multiple traces, enhancing the overall evaluation process.

These techniques can be adapted to fit various production environments and datasets, providing valuable insights into LLM system performance.