# Using TLM via the OpenAI library to score the trustworthiness of: structured outputs, function calling, messages, and more

This tutorial demonstrates how to integrate your VPC installation of Cleanlab's Trustworthy Language Model (TLM) into existing GenAI apps. You will learn how to assess the trustworthiness of OpenAI model responses directly through the [OpenAI library](https://github.com/openai/openai-python). Existing OpenAI users: you can obtain real-time trustworthiness scores for every OpenAI response by pointing the client's `base_url` at TLM, leaving the rest of your code unchanged.

Using TLM via the [OpenAI library](https://github.com/openai/openai-python) lets you keep using OpenAI's features while scoring the trustworthiness of each response, so that errors and hallucinations made by OpenAI models are caught automatically.


![Getting TLM trustworthiness scores from using OpenAI API](./assets/tlm-structured-outputs/tlm-openai-api.png)

In this tutorial, we use OpenAI's structured outputs feature to perform multi-label classification (i.e. document tagging) with trustworthiness scores from TLM. The same method can be used to score the trustworthiness of any type of output from OpenAI (not just structured outputs).

## Setup

The Python packages required for this tutorial can be installed using pip:

In [1]:
%pip install --upgrade openai tqdm pandas

Collecting openai
  Downloading openai-1.78.0-py3-none-any.whl (680 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 680.4/680.4 kB 1.0 MB/s eta 0:00:00
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.77.0
    Uninstalling openai-1.77.0:
      Successfully uninstalled openai-1.77.0
Successfully installed openai-1.78.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from enum import Enum
from pydantic import BaseModel
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
import ast
import time
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

## Fetch example Dataset

This tutorial uses a modified version of the [Alexa intent detection dataset](https://huggingface.co/datasets/AmazonScience/massive). 

Each text sample contains one or more statements, and each statement can express a different intent (for example, controlling devices or asking for information). The `labels` column lists the intents of each sample, so a sample can have more than one label.

As a demonstration, we run LLM inference on only 50 randomly sampled examples from this dataset. Let's take a look at the data below:

In [2]:
!wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/massive_multilabel_classification.csv

File ‘massive_multilabel_classification.csv’ already there; not retrieving.



In [3]:
data = pd.read_csv("massive_multilabel_classification.csv")
data["labels"] = data["labels"].apply(ast.literal_eval)
data = data.sample(50, random_state=123).reset_index(drop=True)
data.head()

Unnamed: 0,text,labels
0,lets have a chat,[general_quirky]
1,what are meeting scheduled for today,[calendar_query]
2,erase all the events. resume my audio book from karl pilkington. tell me the profession of celebrity,"[calendar_remove, play_audiobook, qa_factoid]"
3,thirty minute reminder on meeting for tuesday,[calendar_set]
4,i have a nine am meeting on wednesday send me a reminder,[calendar_set]


### Configuring API access to the TLM backend service

This demo assumes that you have access to the deployed TLM backend service at the URL `http://localhost:8080/api`. You may set this up by port forwarding to your local machine as shown below.

In [None]:
# Update your kubeconfig with credentials from the deployed cluster
ENVIRONMENT_NAME=enterprise
az aks get-credentials --resource-group tlm-$ENVIRONMENT_NAME-rg --name tlm-$ENVIRONMENT_NAME-cluster --overwrite-existing

# Port forward the backend service to your local machine
kubectl port-forward service/tlm-chat-backend 8080:8080 -n tlm

You are welcome to expose the TLM API however you prefer, depending on the unique needs of your networking environment. Simply replace the base URL when instantiating the OpenAI Python client later in this tutorial.
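As a minimal sketch of keeping the base URL configurable, the snippet below reads it from an environment variable and falls back to the port-forwarded address used in this tutorial. The variable name `TLM_BASE_URL` and the helper `resolve_tlm_base_url` are illustrative assumptions, not part of TLM or the OpenAI library.

```python
import os

# Hypothetical environment variable name; choose whatever fits your deployment.
TLM_BASE_URL_ENV = "TLM_BASE_URL"
# Default taken from the port-forwarding setup described above.
DEFAULT_TLM_BASE_URL = "http://localhost:8080/api"


def resolve_tlm_base_url(env=None):
    """Return the TLM base URL from the environment, or the local default."""
    env = os.environ if env is None else env
    url = env.get(TLM_BASE_URL_ENV, "").strip()
    # Drop a trailing slash so the value has the same shape as the default.
    return url.rstrip("/") if url else DEFAULT_TLM_BASE_URL


print(resolve_tlm_base_url({}))
```

The returned string can then be passed as `base_url=resolve_tlm_base_url()` when the OpenAI client is instantiated.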

## Obtain LLM Predictions

First, we need to get a list of all possible classes from the given dataset:

In [7]:
multilabel_classes = data["labels"].explode().unique()
multilabel_classes[:5]

array(['general_quirky', 'calendar_query', 'calendar_remove',
       'play_audiobook', 'qa_factoid'], dtype=object)

In [19]:
multilabel_classes

array(['general_quirky', 'calendar_query', 'calendar_remove',
       'play_audiobook', 'qa_factoid', 'calendar_set', 'email_query',
       'alarm_remove', 'social_post', 'recommendation_movies', 'qa_stock',
       'alarm_query', 'lists_createoradd', 'iot_hue_lightchange',
       'news_query', 'social_query', 'weather_query', 'qa_definition',
       'iot_hue_lightoff', 'iot_hue_lightup', 'play_game',
       'email_sendemail', 'audio_volume_mute', 'takeaway_order',
       'lists_query', 'audio_volume_down', 'datetime_query',
       'lists_remove', 'music_query', 'recommendation_events',
       'cooking_recipe', 'datetime_convert', 'play_music',
       'music_likeness', 'general_joke', 'qa_maths', 'qa_currency',
       'email_addcontact'], dtype=object)

### Prompt OpenAI 

Then, we can instantiate the OpenAI client, pointing the `base_url` to TLM, which allows us to also get the trustworthiness score associated with each response.

In [4]:
# the VPC installation of TLM does not authenticate requests, but the OpenAI client requires an API key, so we pass a fake value here
client = OpenAI(
    api_key="test",
    base_url="http://localhost:8080/api"
)

Here is an example of how we can prompt OpenAI with one sample text:

In [5]:
sample_text = data['text'][1]
sample_text

'what are meeting scheduled for today'

In [10]:
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {sample_text}"}
    ],
    extra_body={
        "tlm": {
            "quality_preset": "low"
        }
    }
)

In [11]:
completion

ParsedChatCompletion[NoneType](id='chat-cmpl-ac3fa423e31f4fc9', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content='The text "what are meeting scheduled for today" can be classified under the label **\'calendar_query\'**.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, name=None), tlm={'score': 0.5979667890547007, 'metadata': {'result': {'score': 0.5979667890547007, 'completion': {'role': 'assistant', 'content': 'The text "what are meeting scheduled for today" can be classified under the label **\'calendar_query\'**.', 'logprobs': [{'logprob': -0.00033725024, 'token': 'The'}, {'logprob': -0.613089, 'token': ' text'}, {'logprob': -0.15458047, 'token': ' "'}, {'logprob': -1.9361265e-07, 'token': 'what'}, {'logprob': 0.0, 'token': ' are'}, {'logprob': -0.061979678, 'token': ' meeting'}, {'logprob': -7.89631e-07, 'token': ' scheduled'}, {'logpro

The returned object matches what OpenAI would ordinarily return, except that each choice carries an additional `tlm` field with extra information such as the trustworthiness `score` and other `metadata`. This way you can use TLM as a drop-in replacement for OpenAI in any application (and still get back the same responses you'd get directly from OpenAI). Let's parse the prediction and trustworthiness score from the returned response:

In [12]:
response = completion.choices[0].message.content
tlm_score = completion.choices[0].tlm["score"]

print(f"LLM Response: {response}")
print(f"TLM Score: {tlm_score}")

LLM Response: The text "what are meeting scheduled for today" can be classified under the label **'calendar_query'**.
TLM Score: 0.5979667890547007
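The call above does not pass a `response_format`, so the reply comes back as free text. To constrain the reply to the known labels, one option is OpenAI's JSON-schema form of structured outputs. The sketch below only builds the `response_format` dictionary. The helper name and the schema name are hypothetical, and it assumes your TLM deployment forwards `response_format` to the model unchanged.

```python
import json


def build_label_response_format(classes):
    """Build a JSON-schema `response_format` restricting output to known labels."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "multilabel_classification",  # hypothetical schema name
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "labels": {
                        "type": "array",
                        # Each array item must be one of the known class names.
                        "items": {
                            "type": "string",
                            "enum": sorted(str(c) for c in classes),
                        },
                    }
                },
                "required": ["labels"],
                "additionalProperties": False,
            },
        },
    }


example_classes = ["calendar_query", "calendar_set", "general_quirky"]
response_format = build_label_response_format(example_classes)
print(json.dumps(response_format["json_schema"]["schema"], indent=2))
```

Under that assumption, passing `response_format=build_label_response_format(multilabel_classes)` to `client.chat.completions.create(...)` should make `message.content` a JSON string, which `json.loads` turns into a dictionary with a `labels` list.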


### Batch Prompt on a Dataset

Here, we define a helper function that classifies texts in parallel batches, which speeds up prompting the LLM over an entire dataset. It also collects each response and its trustworthiness score in a DataFrame for easy downstream analysis.

In [9]:
def classify_text(text):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {text}"}],
        extra_body={
            "tlm": {
                "quality_preset": "low"
            }
        }
    )
    
    return {
        "response": completion.choices[0].message.content,
        "tlm_score": completion.choices[0].tlm["score"],
    }

def classify_texts_batch(texts, batch_size=5, max_threads=3, sleep_time=10):
    results = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]
        
        with ThreadPoolExecutor(max_threads) as executor:
            futures = [executor.submit(classify_text, text) for text in batch]
            batch_results = [f.result() for f in futures]
        
        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(texts):
            time.sleep(sleep_time)
    
    return pd.DataFrame(results)

In [10]:
classify_text(data["text"][0])

{'response': 'The text "lets have a chat" can be classified as **\'general_quirky\'**.',
 'tlm_score': 0.6696376884111997}

In [13]:
results = classify_texts_batch(data["text"], batch_size=5)
results.head()

100%|██████████| 10/10 [43:13<00:00, 259.33s/it]


Unnamed: 0,response,tlm_score
0,"The text ""lets have a chat"" can be classified under the label: **general_quirky**.",0.50426
1,"The text ""what are meeting scheduled for today"" is best classified under the label **'calendar_query'**.",0.576897
2,"The provided text can be classified into the following labels based on its components:\n\n1. ""erase all the events."" - This corresponds to **calendar_remove**.\n2. ""resume my audio book from karl pilkington."" - This corresponds to **play_audiobook**.\n3. ""tell me the profession of celebrity."" - This corresponds to **qa_factoid**.\n\nGiven the several distinct intents in the text, multiple classifications apply.",0.328716
3,The text can be classified as 'calendar_set' since it refers to setting a reminder for a meeting.,0.59841
4,"The appropriate label for the text ""i have a nine am meeting on wednesday send me a reminder"" would be **'calendar_set'**.",0.680641
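The batch run above took roughly 43 minutes, and in `classify_texts_batch` a single failed request propagates out of `f.result()` and discards everything collected so far. One way to guard against this is a retry wrapper with exponential backoff. The sketch below is a minimal version of that idea. `with_retries` and its parameters are hypothetical names, not part of TLM or the OpenAI library, and `flaky` is a stand-in for `classify_text`.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Wrap `fn` so failed calls are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapped


# Stand-in for classify_text that fails twice before succeeding.
calls = {"count": 0}


def flaky(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return {"response": text.upper(), "tlm_score": None}


safe_flaky = with_retries(flaky, max_attempts=3, base_delay=0.0)
print(safe_flaky("lets have a chat"))
```

In the batch helper, this would amount to submitting `with_retries(classify_text)` to the executor instead of `classify_text`.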


## Examine Results

We have now obtained the predictions and trustworthiness score for each given text. Let's examine the results in more detail.

In [17]:
combined_results = pd.concat([data, results], axis=1)

### High Trustworthiness Scores

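The responses with the highest trustworthiness scores are the ones TLM is most confident contain the correct intents.

Of the three highest-scoring samples below, the top response matches the ground-truth labels exactly. The other two each differ from the ground truth on a single intent (`recommendation_movies` vs. `recommendation_events`, and `qa_factoid` vs. `general_quirky`), so it is still worth comparing high-scoring responses against the given labels.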

In [18]:
combined_results.sort_values("tlm_score", ascending=False).head(3)

Unnamed: 0,text,labels,response,tlm_score
34,get me latest updates from facebook. turn on favorite songs. who am i meeting today. where jackie chan currently live,"[social_query, play_music, calendar_query, qa_factoid]","The text can be classified under the following labels:\n\n1. ""social_query"" - for ""get me latest updates from facebook.""\n2. ""play_music"" - for ""turn on favorite songs.""\n3. ""calendar_query"" - for ""who am i meeting today.""\n4. ""qa_factoid"" - for ""where jackie chan currently live.""\n\nSo, the appropriate labels are: ['social_query', 'play_music', 'calendar_query', 'qa_factoid'].",0.83354
44,what movie can i watch tonight on the theater here in boston,[recommendation_events],The appropriate label for the provided text is: **recommendation_movies**,0.82275
37,delete alarm. when will the world end. can you please add an item to my grocery list,"[alarm_remove, general_quirky, lists_createoradd]","The text can be classified into the following labels:\n\n1. **alarm_remove** - ""delete alarm""\n2. **qa_factoid** - ""when will the world end""\n3. **lists_createoradd** - ""can you please add an item to my grocery list""\n\nTherefore, the relevant labels for the provided text are: ['alarm_remove', 'qa_factoid', 'lists_createoradd'].",0.78355
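To compare free-text responses against the ground-truth labels programmatically rather than by eye, one option is to collect every known class name that appears in the response and compare the resulting set with the `labels` column. The sketch below assumes a simple whole-word match is good enough. `extract_labels` is a hypothetical helper, and the example uses a short subset of the classes.

```python
import re


def extract_labels(response, classes):
    """Return the set of known class names mentioned anywhere in `response`."""
    found = set()
    for label in classes:
        # \b avoids partial matches, e.g. a label "play" inside "play_music".
        if re.search(rf"\b{re.escape(label)}\b", response):
            found.add(label)
    return found


example_classes = ["recommendation_movies", "recommendation_events", "qa_factoid"]
response = "The appropriate label for the provided text is: **recommendation_movies**"
ground_truth = {"recommendation_events"}

predicted = extract_labels(response, example_classes)
print(predicted)
print(predicted == ground_truth)
```

This counts a label as predicted whenever it is mentioned, even if the response only brings it up to rule it out, so treat the result as a rough check.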


### Low Trustworthiness Scores

The responses with the lowest trustworthiness scores indicate outputs we are least confident are good.

Results with low trustworthiness scores would benefit most from manual review, especially if we need almost all outputs across the dataset to be correct.

For the examples with the lowest trustworthiness scores shown below, the responses tend to be incorrect or ambiguous (for instance, hedging about which single label to choose) and could use further review.

In [20]:
combined_results.sort_values("tlm_score").head(3)

Unnamed: 0,text,labels,response,tlm_score
31,answer to pawel's email. tell me the recipe of. olly tweet apple that the iphone doesn't work,"[email_sendemail, cooking_recipe, social_post]","The text contains various components that suggest it could fit multiple categories. However, the most relevant classifications are:\n\n- **email_sendemail**: It mentions ""answer to pawel's email,"" indicating an intention to communicate via email.\n- **cooking_recipe**: The phrase ""tell me the recipe of"" indicates a request for a recipe.\n- **social_post**: The phrase ""olly tweet apple that the iphone doesn't work"" implies an action related to social media (specifically tweeting).\n\nGiven these observations, the text can be classified primarily under:\n1. **email_sendemail**\n2. **cooking_recipe**\n3. **social_post** \n\nIf only one label is to be chosen, **email_sendemail** would be the most appropriate due to the actionable intent indicated.",0.302665
2,erase all the events. resume my audio book from karl pilkington. tell me the profession of celebrity,"[calendar_remove, play_audiobook, qa_factoid]","The provided text can be classified into the following labels based on its components:\n\n1. ""erase all the events."" - This corresponds to **calendar_remove**.\n2. ""resume my audio book from karl pilkington."" - This corresponds to **play_audiobook**.\n3. ""tell me the profession of celebrity."" - This corresponds to **qa_factoid**.\n\nGiven the several distinct intents in the text, multiple classifications apply.",0.328716
16,start a competitive crossword for us. set a reminder in one hour for my bread to bake. tell me the percentage of the chance of rain in seattle washington for today. open saga,"[play_game, calendar_set, weather_query, play_game]","The text contains multiple tasks that can be classified into different categories. Here’s a breakdown of each task:\n\n1. ""start a competitive crossword for us."" - This could be categorized under **general_quirky** since it refers to starting a fun activity.\n2. ""set a reminder in one hour for my bread to bake."" - This can be classified as **calendar_set** since it involves setting a reminder.\n3. ""tell me the percentage of the chance of rain in seattle washington for today."" - This is best classified under **weather_query** as it involves asking for weather information.\n4. ""open saga"" - This doesn't neatly fit into the provided categories, but could be seen as initiating an action related to **play_game** if ""saga"" refers to a game.\n\nConsidering these classifications, the primary categories relevant to the provided text would be:\n\n- general_quirky\n- calendar_set\n- weather_query\n- (potentially play_game)\n\nIf only one label is needed, I would suggest **general_quirky** as it captures the overall essence of the text while acknowledging the additional categories.",0.359532
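A simple way to turn these scores into a review queue is to flag every response whose score falls below a cutoff. The sketch below does this with plain Python lists. `flag_for_review` is a hypothetical helper, and the `0.5` threshold is an arbitrary placeholder that should be tuned to your own data and error tolerance.

```python
def flag_for_review(scores, threshold=0.5):
    """Return positions of scores below `threshold`, lowest score first."""
    flagged = [i for i, score in enumerate(scores) if score < threshold]
    return sorted(flagged, key=lambda i: scores[i])


# Scores copied from rows shown in this tutorial.
example_scores = [0.83354, 0.302665, 0.328716, 0.359532]
print(flag_for_review(example_scores, threshold=0.5))
```

With the DataFrame from this tutorial, the input would be `combined_results["tlm_score"].tolist()`, and the returned positions index the rows to inspect first.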


## Using Different Quality Presets

You can use TLM with different [quality presets](/tlm/tutorials/tlm_advanced/#quality-presets) by specifying the preset after the model name. 

For example, below we specify `model="gpt-4o-low"` to use TLM with the `low` quality preset (for lower cost/latency). If unspecified, the default quality preset is `medium`.

Currently, only `base`, `low`, and `medium` presets are supported when using TLM via the OpenAI library. Read more about quality presets [here](/tlm/api/python/tlm/#class-tlmoptions).

In [17]:
sample_text = data['text'][0]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-low",
    messages=[
        {"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {sample_text}"}
    ],
    extra_body={
        "tlm": {
            "quality_preset": "low"
        }
    }
)
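If the preset is chosen at runtime, a small helper can build the model string and reject presets that are not supported. This is a sketch under the assumption that the suffix is always appended as `-<preset>`, which is inferred from the single `gpt-4o-low` example above. `model_with_preset` is a hypothetical helper, not part of TLM.

```python
# Presets listed above as supported when using TLM via the OpenAI library.
SUPPORTED_PRESETS = ("base", "low", "medium")


def model_with_preset(model, preset="medium"):
    """Append a quality preset to a model name, e.g. 'gpt-4o' -> 'gpt-4o-low'."""
    if preset not in SUPPORTED_PRESETS:
        raise ValueError(
            f"Unsupported preset {preset!r}; expected one of {SUPPORTED_PRESETS}"
        )
    return f"{model}-{preset}"


print(model_with_preset("gpt-4o", "low"))
```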

We re-emphasize that you can use TLM via the [OpenAI library](https://github.com/openai/openai-python) to score the trustworthiness of *any* type of OpenAI output (not just structured outputs).
Beyond structured outputs, we recommend using TLM via the [OpenAI library](https://github.com/openai/openai-python) for LLM applications involving function calling, system prompts, multiple user/assistant messages, and other advanced features that OpenAI offers but most other LLM APIs do not.

For questions about the OpenAI API, refer to the documentation linked from [their library](https://github.com/openai/openai-python).