# Automatically find the bad LLM responses in your LLM Evals with Cleanlab

This guide walks you through evaluating LLM responses captured in MLflow with Cleanlab's Trustworthy Language Model (TLM).

TLM boosts the reliability of any LLM application by indicating when the model’s response is untrustworthy.

This guide requires a Cleanlab TLM API key. If you don't have one, you can sign up for a free trial [here](https://tlm.cleanlab.ai/).

## Install dependencies & Set environment variables

In [None]:
%%bash
pip install -q mlflow openai cleanlab-tlm --upgrade

In [None]:
import mlflow
import os
import json
import pandas as pd
from getpass import getpass
import dotenv
dotenv.load_dotenv()

## API Keys

This guide requires two API keys:
- [OpenAI API Key](https://platform.openai.com/api-keys)
- [Cleanlab TLM API Key](https://tlm.cleanlab.ai/)


In [2]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
if not (cleanlab_tlm_api_key := os.getenv("CLEANLAB_TLM_API_KEY")):
    cleanlab_tlm_api_key = getpass("🔑 Enter your Cleanlab TLM API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key
os.environ["CLEANLAB_TLM_API_KEY"] = cleanlab_tlm_api_key

## Set Up MLflow Tracking Server and Logging

MLflow can log models and experiments without a tracking server, in which case your experiment data and artifacts are saved directly under your current directory. In this guide, we start a local tracking server instead so that the traces can be browsed in the MLflow UI.

In [None]:
%%bash --bg
# The cell magic above must be the first line of the cell
# This will start a server on port 8080, in the background
# Navigate to http://localhost:8080 to see the MLflow UI
mlflow server --host 127.0.0.1 --port 8080

In [None]:
# Set up MLflow tracking server
mlflow.set_tracking_uri("http://localhost:8080")

# Enable logging for OpenAI SDK
mlflow.openai.autolog()

# Set experiment name
mlflow.set_experiment("Eval OpenAI Traces w/ TLM")

# Get experiment ID
experiment_id = mlflow.get_experiment_by_name("Eval OpenAI Traces w/ TLM").experiment_id

## Prepare trace dataset and load into MLflow

For demonstration purposes, we'll briefly generate some traces and track them in MLflow. Typically, you would have already captured traces in MLflow and would skip to "Download trace dataset from MLflow".

NOTE: TLM requires the entire input to the LLM to be provided. This includes any system prompts, context, or other information that was originally provided to the LLM to generate the response. Notice below that we prepend the system prompt to the user message when building the prompt for TLM, so that TLM sees the same input the LLM did.
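As a minimal sketch of what "the entire input" can look like for a chat-style request, the hypothetical helper below (`format_messages_as_prompt` is not part of MLflow or TLM) joins every message into one prompt string. Tagging each message with its role is an assumption made here for readability; the helper used later in this guide simply joins the system prompt and the user message with a newline.

```python
def format_messages_as_prompt(messages):
    """Join every chat message into one prompt string, tagging each with its role."""
    lines = []
    for message in messages:
        role = message.get("role", "user")
        content = message.get("content") or ""
        lines.append(f"{role.capitalize()}: {content}")
    return "\n".join(lines)


# Example request in the OpenAI chat format used in this guide
messages = [
    {"role": "system", "content": "You are a trivia master."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = format_messages_as_prompt(messages)
print(prompt)
```

Because it loops over all messages, this variant would also cover multi-turn conversations, which the two-message helper in this guide does not.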

In [5]:
import openai

In [None]:
# Let's use some tricky trivia questions to generate some traces
trivia_questions = [    
    "What is the 3rd month of the year in alphabetical order?",
    "What is the capital of France?",
    "How many seconds are in 100 years?",
    "Alice, Bob, and Charlie went to a café. Alice paid twice as much as Bob, and Bob paid three times as much as Charlie. If the total bill was $72, how much did each person pay?",
    "When was the Declaration of Independence signed?"
]

def generate_answers(trivia_question):
    system_prompt = "You are a trivia master."

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": trivia_question},
        ],
    )
    
    answer = response.choices[0].message.content
    return answer


# Generate answers
answers = []
for i, question in enumerate(trivia_questions, start=1):
    answer = generate_answers(question)
    answers.append(answer)
    print(f"Question {i}: {question}")
    print(f"Answer {i}:\n{answer}\n")

print(f"Generated {len(answers)} answers and tracked them in MLflow.")

## Download trace dataset from MLflow

Fetching traces from MLflow is straightforward. Just set up the MLflow client and use one of its functions to fetch the data. We'll fetch the traces and evaluate them. After that, we'll add our scores back into MLflow.

The `search_traces()` function has arguments to filter the traces by tags, timestamps, and beyond. You can find more about other methods to [query traces](https://mlflow.org/docs/latest/python_api/mlflow.client.html#mlflow.client.MlflowClient.search_traces) in the docs.

In this example, we'll fetch all traces from the experiment.

In [None]:
client = mlflow.client.MlflowClient()
traces = client.search_traces(experiment_ids=[experiment_id])

## Generate evaluations with TLM

Instead of running TLM on each trace individually, we'll pass all of the (prompt, response) pairs to TLM as lists in a single call. This is more efficient and returns scores and explanations for all of the traces at once. Then, using each trace's `request_id`, we can attach the scores and explanations back to the correct trace in MLflow.
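Attaching results back by ID relies on the batch results coming back in the same order as the inputs, which this guide assumes. A minimal sketch of that pairing step, with a length check as a cheap safeguard, might look like the following. `pair_results_with_ids` is a hypothetical helper, and the demo values are made-up placeholders, not real TLM output.

```python
def pair_results_with_ids(request_ids, evaluations):
    """Map each request ID to its evaluation, assuming both lists share one order."""
    if len(request_ids) != len(evaluations):
        raise ValueError(
            f"Got {len(request_ids)} request IDs but {len(evaluations)} evaluations"
        )
    return dict(zip(request_ids, evaluations))


# Placeholder data shaped like the results used in this guide (values are made up)
demo_ids = ["trace-a", "trace-b"]
demo_evaluations = [
    {"trustworthiness_score": 0.91, "log": {"explanation": "placeholder"}},
    {"trustworthiness_score": 0.32, "log": {"explanation": "placeholder"}},
]
by_id = pair_results_with_ids(demo_ids, demo_evaluations)
print(by_id["trace-b"]["trustworthiness_score"])
```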



In [8]:
from cleanlab_tlm import TLM

tlm = TLM(options={"log": ["explanation"]})

In [9]:
# This helper extracts the prompt and response from each trace and returns two lists: prompts and responses.
# It assumes each request has exactly a system message followed by a user message.
def get_prompt_response_pairs(traces):
    prompts = []
    responses = []
    for trace in traces:
        # Parse request and response JSON
        request_data = json.loads(trace.data.request)
        response_data = json.loads(trace.data.response)
        
        # Extract system prompt and user message from request
        system_prompt = request_data["messages"][0]["content"]
        user_message = request_data["messages"][1]["content"]
        
        # Extract assistant's response from response
        assistant_response = response_data["choices"][0]["message"]["content"]
        
        prompts.append(system_prompt + "\n" + user_message)
        responses.append(assistant_response)
    return prompts, responses

request_ids = [trace.info.request_id for trace in traces]
prompts, responses = get_prompt_response_pairs(traces)

Now, let's use TLM to generate a `trustworthiness score` and `explanation` for each trace.

**IMPORTANT:** It is essential to always include any system prompts, context, or other information that was originally provided to the LLM to generate the response. You should construct the prompt input to `get_trustworthiness_score()` in a way that is as similar as possible to the original prompt. This is why the helper above prepends the system prompt to the user message.

In [None]:
# Evaluate each of the prompt, response pairs using TLM
evaluations = tlm.get_trustworthiness_score(prompts, responses)

# Extract the trustworthiness scores and explanations from the evaluations
trust_scores = [entry["trustworthiness_score"] for entry in evaluations]
explanations = [entry["log"]["explanation"] for entry in evaluations]

# Create a DataFrame with the evaluation results
trace_evaluations = pd.DataFrame({
    'request_id': request_ids,
    'prompt': prompts,
    'response': responses, 
    'trust_score': trust_scores,
    'explanation': explanations
})

Awesome! Now we have a DataFrame mapping trace IDs to their scores and explanations. We've also included the prompt and response for each trace so that we can inspect the **least trustworthy trace!**
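Beyond inspecting the single lowest score, one option is to flag every trace below a cutoff. The sketch below is one way to do that on plain records; the `flag_untrustworthy` helper and the 0.7 threshold are assumptions for illustration, and the demo scores are made up. Records in this shape can be obtained with `trace_evaluations.to_dict("records")`.

```python
def flag_untrustworthy(records, threshold=0.7):
    """Return records scoring below the threshold, least trustworthy first."""
    flagged = [
        record
        for record in records
        if record["trust_score"] is not None and record["trust_score"] < threshold
    ]
    return sorted(flagged, key=lambda record: record["trust_score"])


# Made-up records for demonstration only
demo_records = [
    {"request_id": "trace-a", "trust_score": 0.95},
    {"request_id": "trace-b", "trust_score": 0.41},
    {"request_id": "trace-c", "trust_score": 0.12},
]
flagged = flag_untrustworthy(demo_records)
print([record["request_id"] for record in flagged])
```

A suitable threshold depends on the application, so it is worth calibrating against a handful of manually reviewed traces.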

In [None]:
sorted_df = trace_evaluations.sort_values(by="trust_score", ascending=True)
sorted_df.head(3)

In [None]:
# Let's look at the least trustworthy trace.
print("Prompt: ", sorted_df.iloc[0]["prompt"], "\n")
print("OpenAI Response: ", sorted_df.iloc[0]["response"], "\n")
print("TLM Trust Score: ", sorted_df.iloc[0]["trust_score"], "\n")
print("TLM Explanation: ", sorted_df.iloc[0]["explanation"])


#### Awesome! TLM was able to identify multiple traces that contained incorrect answers from OpenAI.

Let's upload the `trust_score` and `explanation` columns to MLflow.

## Upload evaluations to MLflow

In [13]:
for _, row in trace_evaluations.iterrows():
    request_id = row["request_id"]
    trust_score = row["trust_score"]
    explanation = row["explanation"]

    # Add the trustworthiness score and explanation to the trace as tags.
    # Tag values are stored as strings, so convert the score explicitly.
    client.set_trace_tag(request_id=request_id, key="trust_score", value=str(trust_score))
    client.set_trace_tag(request_id=request_id, key="explanation", value=explanation)
    

You should now see the TLM trustworthiness score and explanation in the MLflow UI!
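MLflow stores tag values as strings, and long values such as explanations may be truncated when stored. The sketch below shortens values up front so the cut is explicit. It assumes a 250-character limit, which you should verify against the MLflow version you run; `to_tag_value` is a hypothetical helper, not an MLflow function.

```python
def to_tag_value(value, max_length=250):
    """Convert a value to a string and shorten it to an assumed tag length limit."""
    text = str(value)
    if len(text) <= max_length:
        return text
    # Keep the start of the text and mark that it was cut
    return text[: max_length - 3] + "..."


short_value = to_tag_value(0.4213)
long_value = to_tag_value("x" * 300)
print(short_value, len(long_value))
```

If the full explanation matters, storing it elsewhere (for example as an artifact) and keeping only a summary in the tag may be a better fit.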


From here you can continue collecting and evaluating traces!

# Evaluator

Here's how you might use TLM with MLflow Evaluation. This will log a table of trustworthiness scores and explanations and also provide an interface in the UI for comparing scores across runs. For example, you could use this to compare the trustworthiness scores of different models across the same set of prompts.
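The metric defined in the next cell reports the mean, minimum, and maximum score. Computing `sum(scores) / len(scores)` directly raises an error if the list is empty or contains a missing (`None`) score, so a more defensive aggregation might look like this sketch. `summarize_scores` is a hypothetical helper, and dropping missing scores rather than failing is a design choice assumed here.

```python
def summarize_scores(scores):
    """Compute mean, min, and max over the scores that are present."""
    valid = [float(score) for score in scores if score is not None]
    if not valid:
        return {"mean": None, "min": None, "max": None}
    return {
        "mean": sum(valid) / len(valid),
        "min": min(valid),
        "max": max(valid),
    }


summary = summarize_scores([0.5, None, 1.0])
print(summary)
```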

In [29]:
import mlflow
from mlflow.metrics import MetricValue, make_metric
from cleanlab_tlm import TLM

def _tlm_eval_fn(predictions, inputs, targets=None):
    """
    Evaluate trustworthiness using Cleanlab TLM.
    
    Args:
        predictions: The model outputs/answers
        inputs: The prompts that produced the predictions
        targets: Not used for this metric
    """
    # Initialize TLM
    tlm = TLM(options={"log": ["explanation"]})
    inputs = inputs.to_list()
    predictions = predictions.to_list()
    
    # Get trustworthiness scores
    evaluations = tlm.get_trustworthiness_score(inputs, predictions)
    
    # Extract scores and explanations
    scores = [float(eval_result["trustworthiness_score"]) for eval_result in evaluations]
    justifications = [eval_result["log"]["explanation"] for eval_result in evaluations]
    
    # Return metric value
    return MetricValue(
        scores=scores,
        justifications=justifications,
        aggregate_results={
            "mean": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores)
        }
    )

def tlm_trustworthiness():
    """Creates a metric for evaluating trustworthiness using Cleanlab TLM"""
    return make_metric(
        eval_fn=_tlm_eval_fn,
        greater_is_better=True,
        name="tlm_trustworthiness"
    )

In [None]:
tlm_metric = tlm_trustworthiness()

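# Use the responses extracted from the traces so that each row lines up with its prompt
# (the `answers` list follows generation order, which may differ from the order of the fetched traces)
eval_df = pd.DataFrame({
    'inputs': prompts,
    'outputs': responses
})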


results = mlflow.evaluate(
    data=eval_df,
    predictions="outputs",
    model=None,
    extra_metrics=[tlm_metric],
    evaluator_config={
        "col_mapping": {
            "inputs": "inputs",
            "predictions": "outputs"
        }
    }
)

# Tracing TLM

You can also trace the TLM evaluation call itself, which logs the trustworthiness scores and explanations as a trace of their own.

When evaluating a list of prompts and responses, you would likely want a parent span with nested spans, or entirely separate spans, perhaps like [this](https://mlflow.org/docs/latest/tracing/api/manual-instrumentation#context-manager). It would also be fairly straightforward to set up a custom MLflow model (or even just a simple function) that invokes the OpenAI model, passes the results to TLM, and traces both.

In [None]:
# Tracing TLM

@mlflow.trace
def tlm_trustworthiness_wrapper(inputs, predictions):
    tlm = TLM(options={"log": ["explanation"]})
    evaluations = tlm.get_trustworthiness_score(inputs, predictions)
    return evaluations

# Use the response extracted from the same trace as the prompt so the pair matches
tlm_trustworthiness_wrapper(prompts[0], responses[0])