# Integrating TLM into your RAG app via the OpenAI Python client

This tutorial demonstrates how to integrate your VPC installation of Cleanlab's Trustworthy Language Model (TLM) into existing GenAI apps. You will learn how to assess the trustworthiness of OpenAI model responses, directly through the [OpenAI client library](https://github.com/openai/openai-python).

In this tutorial, we perform multi-label classification (i.e. document tagging) with trustworthiness scores from TLM. The same method can be used to score the trustworthiness of any type of output from OpenAI.

## API access to the TLM backend service

This demo assumes that you have access to the deployed TLM backend service at the URL `http://example.customer.com:8080/api`. You are welcome to expose the TLM service however you prefer, depending on the unique needs of your networking environment. Simply replace the base URL when instantiating the OpenAI Python client later in this tutorial.

Please note that Google Colab does **_not_** have built-in support to access services on your local machine. This is because Colab [runs in a virtual machine](https://research.google.com/colaboratory/faq.html#executed-code), so `localhost` refers to that VM, rather than your computer. If you would like to access TLM by port-forwarding to your local machine, you may do so by downloading the `.ipynb` file and running Jupyter locally, or by using a tunneling service like [ngrok](https://ngrok.com/).
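
If the service address differs between environments (for example, a tunnel URL when running in Colab versus an internal hostname inside your VPC), one option is to read it from an environment variable instead of hard-coding it. The sketch below assumes a variable named `TLM_BASE_URL`; that name and the `resolve_tlm_base_url` helper are illustrative and not part of TLM or the OpenAI client.

```python
import os

# Default taken from this tutorial; replace with your own deployment's address.
DEFAULT_TLM_BASE_URL = "http://example.customer.com:8080/api"


def resolve_tlm_base_url(env=None):
    """Return the TLM base URL, preferring the (hypothetical) TLM_BASE_URL variable."""
    env = os.environ if env is None else env
    url = env.get("TLM_BASE_URL", "").strip() or DEFAULT_TLM_BASE_URL
    # Drop any trailing slash so the value has one consistent form.
    return url.rstrip("/")
```

The returned value can then be passed as `base_url` when the OpenAI client is instantiated later in this tutorial.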

## Setup

The Python packages required for this tutorial can be installed using pip:

In [None]:
%pip install --upgrade openai pandas tqdm

In [2]:
import ast
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from openai import OpenAI
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

## Fetch example Dataset

This tutorial uses a modified version of the [Alexa intent detection dataset](https://huggingface.co/datasets/AmazonScience/massive). 

Each text sample contains one or more statements, each of which may correspond to a different intent (for example, controlling a device or asking for information). The label for each sample lists its intents, so a single sample can have more than one. Let's take a look at the dataset below.

As a demonstration, we run LLM inference on only 50 randomly sampled examples from this dataset.

In [None]:
!wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/massive_multilabel_classification.csv

In [3]:
data = pd.read_csv("massive_multilabel_classification.csv")
data["labels"] = data["labels"].apply(ast.literal_eval)
data = data.sample(50, random_state=123).reset_index(drop=True)
data.head()

Unnamed: 0,text,labels
0,lets have a chat,[general_quirky]
1,what are meeting scheduled for today,[calendar_query]
2,erase all the events. resume my audio book from karl pilkington. tell me the profession of celebrity,"[calendar_remove, play_audiobook, qa_factoid]"
3,thirty minute reminder on meeting for tuesday,[calendar_set]
4,i have a nine am meeting on wednesday send me a reminder,[calendar_set]


## Obtain LLM Predictions

First, we need to get a list of all possible classes from the given dataset:

In [4]:
multilabel_classes = data["labels"].explode().unique()
multilabel_classes[:5]

array(['general_quirky', 'calendar_query', 'calendar_remove',
       'play_audiobook', 'qa_factoid'], dtype=object)

### Prompt OpenAI 

Then, we can instantiate the OpenAI client, pointing the `base_url` to TLM, which allows us to also get the trustworthiness score associated with each response.

In [5]:
# the VPC installation of TLM does not authenticate requests by default, but the OpenAI client requires an API key, so we pass a fake value here
client = OpenAI(
    api_key="test",
    base_url="http://example.customer.com:8080/api"
)

Here is an example of how we can prompt OpenAI with one sample text:

In [7]:
sample_text = data['text'][1]

In [8]:
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {sample_text}"}
    ],
    extra_body={
        "tlm": {
            "quality_preset": "low"
        }
    }
)
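
Interpolating `multilabel_classes` directly into the f-string inserts NumPy's printed form of the array, which is space-separated and can wrap across lines. A possible alternative, sketched below, is to join the labels into an explicit comma-separated list. The `build_prompt` helper is hypothetical, and its wording simply mirrors the prompt above.

```python
def build_prompt(labels, text):
    """Build the classification prompt with labels as a comma-separated list."""
    # sorted(set(...)) removes duplicates and keeps the prompt stable between runs.
    label_list = ", ".join(sorted(set(str(label) for label in labels)))
    return (
        "Classify the following text, using these labels for guidance: "
        f"[{label_list}]. The text is: {text}"
    )


prompt = build_prompt(["calendar_query", "general_quirky"], "lets have a chat")
print(prompt)
```

The resulting string can be used as the `content` of the user message in place of the inline f-string.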

The returned object matches what OpenAI would ordinarily return, except that it has an additional `tlm` field from TLM containing extra information such as the trustworthiness `score` and other `metadata`. This lets you use TLM as a drop-in replacement for OpenAI in any application while still getting back the same responses you would get directly from OpenAI. Let's parse the prediction and trustworthiness score from the returned response:

In [10]:
response = completion.choices[0].message.content
tlm_score = completion.choices[0].tlm["score"]

print(f"LLM Response: {response}")
print(f"TLM Score: {tlm_score}")

LLM Response: The appropriate label for the text "what are meeting scheduled for today" is **'calendar_query'**.
TLM Score: 0.6187163760956422
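
When wiring this into an application, it can help to read the score defensively so that a response without the `tlm` field does not raise an error. The sketch below assumes only what is shown above, namely that the choice exposes a `tlm` mapping with a `score` key. `get_tlm_score` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without a TLM backend.

```python
from types import SimpleNamespace


def get_tlm_score(choice, default=None):
    """Return choice.tlm["score"] if present, otherwise `default`."""
    tlm = getattr(choice, "tlm", None)
    if isinstance(tlm, dict):
        return tlm.get("score", default)
    return default


# Stand-in for completion.choices[0]; a real response comes from the client call above.
fake_choice = SimpleNamespace(tlm={"score": 0.62})
print(get_tlm_score(fake_choice))                    # 0.62
print(get_tlm_score(SimpleNamespace(), default=-1))  # -1
```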


### Batch Prompt on a Dataset

Here, we define a helper function that processes multiple texts in parallel, which speeds up prompting the LLM on an entire dataset. The helper also collects the responses and trustworthiness scores in a DataFrame for easy downstream analysis.

In [12]:
def classify_text(text):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {text}"}],
        extra_body={
            "tlm": {
                "quality_preset": "low"
            }
        }
    )
    
    return {
        "response": completion.choices[0].message.content,
        "tlm_score": completion.choices[0].tlm["score"],
    }

def classify_texts_batch(texts, batch_size=5, max_threads=3, sleep_time=10):
    results = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]
        
        with ThreadPoolExecutor(max_threads) as executor:
            futures = [executor.submit(classify_text, text) for text in batch]
            batch_results = [f.result() for f in futures]
        
        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(texts):
            time.sleep(sleep_time)
    
    return pd.DataFrame(results)
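
The fixed `sleep_time` above spaces out batches, but a single failed request still aborts the whole run because `f.result()` re-raises the exception. One common mitigation is to retry each call with exponential backoff. The sketch below is a generic wrapper, not part of TLM or the OpenAI client; `with_retries` and its parameters are illustrative names.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `fn` so that failures are retried with exponentially growing delays."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Re-raise once the final attempt has failed.
                if attempt == max_attempts - 1:
                    raise
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    return wrapped
```

For instance, `executor.submit(with_retries(classify_text), text)` would retry each text up to three times. In practice you would likely narrow `except Exception` to the error types your deployment actually raises.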

In [13]:
results = classify_texts_batch(data["text"], batch_size=5)
results.head()

100%|██████████| 10/10 [13:47<00:00, 82.79s/it]


Unnamed: 0,response,tlm_score
0,"The text ""lets have a chat"" is best classified as **general_quirky**.",0.570052
1,"The text ""what are meeting scheduled for today"" can be classified as **'calendar_query'**.",0.588551
2,"The text can be classified into multiple categories based on its content. Here are the appropriate labels:\n\n1. ""calendar_remove"" - related to ""erase all the events.""\n2. ""play_audiobook"" - related to ""resume my audio book from karl pilkington.""\n3. ""qa_factoid"" - related to ""tell me the profession of celebrity.""\n\nTherefore, the classification would be:\n\n- calendar_remove\n- play_audiobook\n- qa_factoid",0.485598
3,"The appropriate classification for the text ""thirty minute reminder on meeting for tuesday"" is **calendar_query**.",0.59791
4,"The appropriate label for the text ""i have a nine am meeting on wednesday send me a reminder"" is **calendar_set**.",0.705016


## Examine Results

We have now obtained the predictions and trustworthiness score for each given text. Let's examine the results in more detail.

In [15]:
combined_results = pd.concat([data, results], axis=1)
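
Because the model replies in free text, comparing predictions against `labels` requires extracting label names from each response first. A rough way to do this, sketched below, is to look for every known class name as a whole word in the response. `extract_labels` is a hypothetical helper, and this heuristic will over-count when a response mentions a label only to dismiss it.

```python
import re


def extract_labels(response, classes):
    """Return the set of known class names that appear as whole words in `response`."""
    found = set()
    for label in classes:
        # \w includes "_", so "calendar_set" does not match inside "my_calendar_set".
        if re.search(rf"(?<!\w){re.escape(label)}(?!\w)", response):
            found.add(label)
    return found


classes = ["calendar_query", "calendar_set", "general_quirky"]
response = 'The text "lets have a chat" is best classified as **general_quirky**.'
predicted = extract_labels(response, classes)
print(predicted == {"general_quirky"})  # True
```

With the DataFrame above, something like `combined_results["response"].apply(lambda r: extract_labels(r, multilabel_classes))` would produce a column of predicted label sets that can be compared with `set(labels)` row by row.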

### High Trustworthiness Scores

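The responses with the highest trustworthiness scores are the ones TLM is most confident are correct.

High scores tend to accompany correct predictions, but they are not a guarantee. Of the top three results below, only row 23 matches all of its ground truth labels: row 41 returns `qa_factoid` but omits `general_quirky`, and row 44 returns `recommendation_movies` where the ground truth is `recommendation_events`.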

In [16]:
combined_results.sort_values("tlm_score", ascending=False).head(3)

Unnamed: 0,text,labels,response,tlm_score
44,what movie can i watch tonight on the theater here in boston,[recommendation_events],The text can be classified under the label: **recommendation_movies**.,0.818268
41,features of google pixel. what is the deepest point on earth,"[general_quirky, qa_factoid]",The text can be classified under the label: **qa_factoid**.,0.768271
23,add a wrist watch to the shopping list. when is the next friday the thirteenth. olly clear the list,"[lists_createoradd, datetime_query, lists_remove]","The text contains multiple requests, which can be classified into different categories:\n\n1. ""add a wrist watch to the shopping list."" - This can be classified as **lists_createoradd**.\n2. ""when is the next friday the thirteenth."" - This can be classified as **datetime_query**.\n3. ""olly clear the list"" - This can be classified as **lists_remove**.\n\nOverall, the classification labels applicable to the text are:\n\n- lists_createoradd\n- datetime_query\n- lists_remove",0.765526


### Low Trustworthiness Scores

The responses with the lowest trustworthiness scores are the outputs TLM is least confident in.

These results benefit most from manual review, especially if nearly all outputs across the dataset need to be correct.

In the lowest-scoring examples shown below, the predictions tend to be incorrect or warrant further review.

In [17]:
combined_results.sort_values("tlm_score").head(3)

Unnamed: 0,text,labels,response,tlm_score
37,delete alarm. when will the world end. can you please add an item to my grocery list,"[alarm_remove, general_quirky, lists_createoradd]","The text contains multiple requests that can be classified into different categories. Here are the classifications for each part:\n\n1. ""delete alarm."" - This corresponds to the label **'alarm_remove'**.\n2. ""when will the world end."" - This corresponds to the label **'qa_factoid'** (since it's a question seeking factual information).\n3. ""can you please add an item to my grocery list"" - This corresponds to the label **'lists_createoradd'**.\n\nIf you are looking for a single label classification for the entire text, it would be more complex since it contains multiple different requests. However, if choosing one main label, you could consider it as **'calendar_remove'** because it may suggest removing or adjusting an alarm as a primary action. \n\nBut as the text stands, it covers three separate requests.",0.289426
36,where do most celebrities hang out. please rate current song as five stars,"[qa_factoid, music_likeness]",The text can be classified as: **general_quirky**.,0.30198
48,yes i like to save opinion on song playing and which player support that song. what's the currency exchange rate for china,"[music_likeness, qa_currency]","The text can be classified under the following labels:\n\n1. **music_likeness** - ""yes i like to save opinion on song playing...""\n2. **qa_currency** - ""...which player support that song. what's the currency exchange rate for china""\n\nHowever, since it presents two different ideas, if only one label is to be chosen based on the primary focus, **qa_currency** would be more appropriate, given the clear question concerning the currency exchange rate. \n\nFinal classification: **qa_currency**",0.425064
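
One way to act on these scores is to route everything below a cutoff to manual review. The sketch below splits rows at a threshold. The value `0.5` is an arbitrary placeholder rather than a recommendation from TLM, and `split_by_score` is a hypothetical helper. The example scores are rounded from the outputs shown in this tutorial.

```python
def split_by_score(rows, threshold=0.5):
    """Split rows (dicts with a "tlm_score" key) into (trusted, needs_review)."""
    trusted, needs_review = [], []
    for row in rows:
        # Rows scoring below the threshold are set aside for a human to check.
        (needs_review if row["tlm_score"] < threshold else trusted).append(row)
    return trusted, needs_review


rows = [
    {"id": 44, "tlm_score": 0.818},
    {"id": 37, "tlm_score": 0.289},
    {"id": 36, "tlm_score": 0.302},
]
trusted, needs_review = split_by_score(rows)
print([r["id"] for r in needs_review])  # [37, 36]
```

`combined_results.to_dict("records")` yields rows in this shape. A sensible threshold depends on your data and error tolerance, so it is worth inspecting a sample of scored outputs before fixing one.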


## Using Different Quality Presets

You can use TLM with different [quality presets](/tlm/tutorials/tlm_advanced/#quality-presets) by appending the preset to the model name.

For example, the call below specifies `model="gpt-4o-low"` to run TLM with the `low` quality preset (for lower cost and latency); it also passes the same preset through `extra_body`, as in the earlier calls. If no preset is specified, the default is `medium`.

Currently, only `base`, `low`, and `medium` presets are supported when using TLM via the OpenAI library. Read more about quality presets [here](/tlm/api/python/tlm/#class-tlmoptions).

In [18]:
sample_text = data['text'][0]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-low",
    messages=[
        {"role": "user", "content": f"Classify the following text, using these labels for guidance: {multilabel_classes}. The text is: {sample_text}"}
    ],
    extra_body={
        "tlm": {
            "quality_preset": "low"
        }
    }
)
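
If several presets are used across an application, the model string can be assembled by a small helper that also rejects presets this section does not list. This is a sketch based only on the naming pattern described above (model name, hyphen, preset); `model_with_preset` is a hypothetical function, not part of TLM or the OpenAI client.

```python
# Presets this tutorial lists as supported through the OpenAI library.
SUPPORTED_PRESETS = ("base", "low", "medium")


def model_with_preset(model, preset=None):
    """Return the model name with a quality-preset suffix, e.g. "gpt-4o-low"."""
    if preset is None:
        # No suffix is added, leaving the choice of preset to the service default.
        return model
    if preset not in SUPPORTED_PRESETS:
        raise ValueError(
            f"Unsupported preset {preset!r}; expected one of {SUPPORTED_PRESETS}"
        )
    return f"{model}-{preset}"


print(model_with_preset("gpt-4o", "low"))  # gpt-4o-low
```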

We re-emphasize that you can use TLM via the [OpenAI library](https://github.com/openai/openai-python) to score the trustworthiness of *any* type of OpenAI output.

For questions about the OpenAI API, refer to the documentation linked from [their library](https://github.com/openai/openai-python).