# Using TLM via the OpenAI library to score response trustworthiness

This tutorial demonstrates how to integrate your VPC installation of Cleanlab's Trustworthy Language Model (TLM) into existing GenAI apps. You will learn how to assess the trustworthiness of OpenAI model responses directly through the [OpenAI library](https://github.com/openai/openai-python). If you already use OpenAI, you can obtain real-time trustworthiness scores for every response by pointing the client at TLM, with little or no other change to your code.

Using TLM via the [OpenAI library](https://github.com/openai/openai-python) lets you keep using OpenAI's features while scoring the trustworthiness of each response, so that you can automatically catch errors and hallucinations in OpenAI model outputs.

In this tutorial, we perform multi-label classification (i.e., document tagging) with trustworthiness scores from TLM. The same method can be used to score the trustworthiness of any type of output from OpenAI.

## API access to the TLM backend service

This demo assumes that you have access to the deployed TLM backend service at the URL `http://localhost:8080/api`. You may set this up by port forwarding to your local machine as shown below.

In [None]:
# Update your kubeconfig with credentials from the deployed cluster
ENVIRONMENT_NAME=enterprise
az aks get-credentials --resource-group tlm-$ENVIRONMENT_NAME-rg --name tlm-$ENVIRONMENT_NAME-cluster --overwrite-existing

# Port forward the backend service to your local machine
kubectl port-forward service/tlm-chat-backend 8080:8080 -n tlm

You are welcome to expose the TLM API however you prefer, depending on the unique needs of your networking environment. Simply replace the base URL when instantiating the OpenAI Python client later in this tutorial.
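
If the TLM API is exposed at a different address in each environment, one option is to read the base URL from an environment variable and fall back to the port-forwarded address. The sketch below is only an illustration: `TLM_BASE_URL` and `resolve_tlm_base_url` are hypothetical names chosen for this example, not part of TLM or the OpenAI library.

```python
import os

# Assumption: TLM_BASE_URL is a hypothetical environment variable name chosen
# for this sketch; it is not defined by TLM or the OpenAI library.
DEFAULT_TLM_BASE_URL = "http://localhost:8080/api"


def resolve_tlm_base_url(env=None):
    """Return the TLM base URL from the environment, or the port-forwarded default."""
    env = os.environ if env is None else env
    url = env.get("TLM_BASE_URL", "").strip()
    # Drop any trailing slash so the URL has the same form as the default.
    return url.rstrip("/") if url else DEFAULT_TLM_BASE_URL


print(resolve_tlm_base_url({}))
```

The returned value can then be passed as `base_url` when instantiating the OpenAI client.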

## Setup

The Python packages required for this tutorial can be installed using pip:

In [None]:
%pip install --upgrade openai tqdm pandas pydantic

In [None]:
import pandas as pd
from enum import Enum
from pydantic import BaseModel
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
import ast
import time
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

## Fetch Example Dataset

This tutorial uses a modified version of the [Alexa intent detection dataset](https://huggingface.co/datasets/AmazonScience/massive). 

Each text sample contains several statements that can correspond to multiple intents (for example, controlling devices or asking for information). The label for each example lists the intents expressed in that sample, and there may be more than one. Let's take a look at the dataset below.

As a demonstration, this tutorial runs LLM inference on only 50 randomly sampled examples from this dataset.

In [None]:
!wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/massive_multilabel_classification.csv

In [None]:
data = pd.read_csv("massive_multilabel_classification.csv")
data["labels"] = data["labels"].apply(ast.literal_eval)
data = data.sample(50, random_state=123).reset_index(drop=True)
data.head()

## Obtain LLM Predictions

First, we need to get a list of all possible classes from the given dataset:

In [None]:
multilabel_classes = data["labels"].explode().unique()
multilabel_classes[:5]

In [None]:
multilabel_classes

### Prompt OpenAI 

Then, we can instantiate the OpenAI client, pointing the `base_url` to TLM, which allows us to also get the trustworthiness score associated with each response.

In [None]:
# the VPC installation of TLM does not authenticate requests, but the OpenAI client requires an API key, so we pass a fake value here
client = OpenAI(
    api_key="test",
    base_url="http://localhost:8080/api"
)

Here is an example of how we can prompt OpenAI with one sample text:

In [None]:
sample_text = data['text'][1]
sample_text

In [None]:
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {sample_text}"}
    ],
    extra_body={
        "tlm": {
            "quality_preset": "low"
        }
    }
)

In [None]:
completion

The returned object matches what OpenAI would ordinarily return, except that it has an additional `tlm` field from TLM with extra information such as the trustworthiness `score` and other `metadata`. This way you can use TLM as a drop-in replacement for OpenAI in any application and still get back the same responses you would get directly from OpenAI. Let's parse the prediction and trustworthiness score from the returned response:

In [None]:
response = completion.choices[0].message.content
tlm_score = completion.choices[0].tlm["score"]

print(f"LLM Response: {response}")
print(f"TLM Score: {tlm_score}")
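
If the same code may sometimes run against an endpoint that does not attach a `tlm` field (for instance, OpenAI directly), a defensive accessor avoids an `AttributeError`. The following is a minimal sketch: `extract_tlm_score` is a hypothetical helper, and the stand-in objects only mimic the shape of `completion.choices[0]` used in this tutorial.

```python
from types import SimpleNamespace


def extract_tlm_score(choice):
    """Return the trustworthiness score attached to a choice, or None if absent."""
    tlm_info = getattr(choice, "tlm", None)
    try:
        return tlm_info["score"]
    except (KeyError, TypeError):
        # KeyError: no "score" entry; TypeError: the tlm field is missing (None).
        return None


# Stand-in objects with made-up values, used only to exercise the helper.
scored_choice = SimpleNamespace(tlm={"score": 0.5})
unscored_choice = SimpleNamespace()

print(extract_tlm_score(scored_choice))
print(extract_tlm_score(unscored_choice))
```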

### Batch Prompt on a Dataset

Here, we define a quick helper function that processes multiple texts in parallel, which speeds up prompting the LLM on an entire dataset. The helper function also parses the predictions and trustworthiness scores and collects them in a DataFrame for easy downstream analysis.

In [None]:
def classify_text(text):
    """Classify one text and return the LLM response with its trustworthiness score."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {text}"}],
        extra_body={
            "tlm": {
                "quality_preset": "low"
            }
        }
    )
    
    return {
        "response": completion.choices[0].message.content,
        "tlm_score": completion.choices[0].tlm["score"],
    }

def classify_texts_batch(texts, batch_size=5, max_threads=3, sleep_time=10):
    """Classify texts batch by batch, running each batch on a thread pool, and return a DataFrame."""
    results = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]
        
        with ThreadPoolExecutor(max_threads) as executor:
            futures = [executor.submit(classify_text, text) for text in batch]
            batch_results = [f.result() for f in futures]
        
        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(texts):
            time.sleep(sleep_time)
    
    return pd.DataFrame(results)

In [None]:
classify_text(data["text"][0])

In [None]:
results = classify_texts_batch(data["text"], batch_size=5)
results.head()

In [None]:
results
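
The fixed sleep between batches reduces the chance of hitting rate limits, but a single failed request would still abort the whole run. One possible complement is to retry individual calls with exponential backoff. The sketch below is an assumption-laden illustration: `call_with_retries` is a hypothetical helper, and it retries on any exception for simplicity, whereas in practice you would likely narrow this to specific errors such as `openai.RateLimitError`.

```python
import random
import time


def call_with_retries(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying with exponential backoff and jitter on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # The delay doubles on each attempt (base_delay, 2x, 4x, ...),
            # plus up to 10% random jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Inside `classify_texts_batch`, this could be used as `executor.submit(call_with_retries, classify_text, text)`.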

## Examine Results

We have now obtained the predictions and trustworthiness score for each given text. Let's examine the results in more detail.

In [None]:
combined_results = pd.concat([data, results], axis=1)

### High Trustworthiness Scores

The responses with the highest trustworthiness scores represent texts where TLM is the most confident that it has predicted the correct intents.

The predictions for the samples shown below match the ground truth labels, so these texts are correctly classified.

In [None]:
combined_results.sort_values("tlm_score", ascending=False).head(3)
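
Agreement with the ground truth can also be checked programmatically rather than by eye. Because the model returns free text here, the sketch below searches the response for known label names and compares the result to the true labels as a set. Both helpers are hypothetical, the matching is case-sensitive, and the labels in the example are made up for illustration rather than taken from the dataset.

```python
import re


def extract_labels(response, known_labels):
    """Return the known labels that appear as whole tokens in the response text."""
    found = []
    for label in known_labels:
        # Require non-word characters on both sides, so that a label such as
        # `set_alarm` does not match inside `set_alarm_daily`.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(label) + r"(?![A-Za-z0-9_])"
        if re.search(pattern, response):
            found.append(label)
    return found


def matches_ground_truth(response, true_labels, known_labels):
    """Check whether the extracted labels equal the ground truth, ignoring order."""
    return set(extract_labels(response, known_labels)) == set(true_labels)


# Hypothetical labels and response, for illustration only.
known = ["play_music", "set_alarm", "set_alarm_daily"]
print(extract_labels("Labels: set_alarm_daily, play_music", known))
```

On the results above, this could be applied row by row, for example with `combined_results.apply(lambda r: matches_ground_truth(r["response"], r["labels"], multilabel_classes), axis=1)`.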

### Low Trustworthiness Scores

The responses with the lowest trustworthiness scores are the outputs that TLM is least confident are correct.

Results with low trustworthiness scores benefit most from manual review, especially if nearly all outputs across the dataset need to be correct.

Among the examples with the lowest trustworthiness scores in our dataset, shown below, the predictions tend to be incorrect or warrant further review.

In [None]:
combined_results.sort_values("tlm_score").head(3)
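
To turn the scores into a review workflow, one simple approach is to route every output below a chosen threshold to manual review. The sketch below assumes a threshold of 0.8, which is an arbitrary value for illustration and should be tuned for your own data. `split_by_trust` is a hypothetical helper, and the records are made-up stand-ins for rows of the results.

```python
def split_by_trust(records, threshold=0.8):
    """Split records into (needs_review, auto_accept) by their tlm_score."""
    needs_review = [r for r in records if r["tlm_score"] < threshold]
    auto_accept = [r for r in records if r["tlm_score"] >= threshold]
    # Put the least trustworthy outputs first so reviewers see them sooner.
    needs_review.sort(key=lambda r: r["tlm_score"])
    return needs_review, auto_accept


# Made-up records, used only to illustrate the split.
example_records = [
    {"response": "a", "tlm_score": 0.95},
    {"response": "b", "tlm_score": 0.4},
    {"response": "c", "tlm_score": 0.7},
]
review, accept = split_by_trust(example_records)
print([r["response"] for r in review], [r["response"] for r in accept])
```

With the DataFrame from this tutorial, the input could be produced with `combined_results.to_dict("records")`.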

## Using Different Quality Presets

You can use TLM with different [quality presets](/tlm/tutorials/tlm_advanced/#quality-presets) by specifying the preset after the model name.

In the example below, we specify `model="gpt-4o-low"` to use TLM with the `low` quality preset (for lower cost/latency). If unspecified, the default quality preset is `medium`.

Currently, only `base`, `low`, and `medium` presets are supported when using TLM via the OpenAI library. Read more about quality presets [here](/tlm/api/python/tlm/#class-tlmoptions).

In [None]:
sample_text = data['text'][0]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-low",
    messages=[
        {"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {sample_text}"}
    ],
    extra_body={
        "tlm": {
            "quality_preset": "low"
        }
    }
)
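
Since only some presets are supported through the OpenAI library, it can help to validate the preset name before sending a request. The sketch below builds the same `extra_body` structure used throughout this tutorial. `tlm_extra_body` and `SUPPORTED_PRESETS` are hypothetical names introduced for this example.

```python
# Presets listed in this tutorial as supported via the OpenAI library.
SUPPORTED_PRESETS = ("base", "low", "medium")


def tlm_extra_body(quality_preset="medium"):
    """Build the `extra_body` argument for a supported TLM quality preset."""
    if quality_preset not in SUPPORTED_PRESETS:
        raise ValueError(
            f"Unsupported quality preset {quality_preset!r}; "
            f"expected one of {SUPPORTED_PRESETS}"
        )
    return {"tlm": {"quality_preset": quality_preset}}


print(tlm_extra_body("low"))
```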

We re-emphasize that you can use TLM via the [OpenAI library](https://github.com/openai/openai-python) to score the trustworthiness of *any* type of OpenAI output (not just structured outputs).
Beyond structured outputs, we recommend using TLM via the [OpenAI library](https://github.com/openai/openai-python) for LLM applications involving: function calling, system prompts and multiple user/assistant messages, as well as other advanced features offered by OpenAI but not most LLM APIs.

For questions about the OpenAI API, refer to the documentation linked from [their library](https://github.com/openai/openai-python).