# Using TLM with OpenAI's Chat Completions API

This tutorial demonstrates how to integrate your VPC installation of Cleanlab's Trustworthy Language Model (TLM) into existing GenAI apps. You will learn how to assess the trustworthiness of LLM responses, either directly through the [OpenAI client library](https://github.com/openai/openai-python) or through Cleanlab's `cleanlab-tlm` library.

## API access to the TLM backend service

This demo assumes that you have access to the deployed TLM backend service at the URL `http://example.customer.com:8080/api`. You are welcome to expose the TLM service however you prefer, depending on the unique needs of your networking environment. Simply replace the base URL in the corresponding cell blocks below.

Please note that Google Colab does **_not_** have built-in support to access services on your local machine. This is because Colab [runs in a virtual machine](https://research.google.com/colaboratory/faq.html#executed-code), so `localhost` refers to that VM, rather than your computer. If you would like to access TLM by port-forwarding to your local machine, you may do so by downloading the `.ipynb` file and running Jupyter locally, or by using a tunneling service like [ngrok](https://ngrok.com/).
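
Before running the cells below, it can help to confirm that the host and port in your base URL are reachable from wherever the notebook runs. The following is a minimal sketch using only the Python standard library. The helper names `host_and_port` and `is_reachable` are hypothetical and not part of `cleanlab-tlm`. A successful TCP connection only shows that something is listening on that port, not that the TLM service itself is healthy.

```python
import socket
from urllib.parse import urlparse

def host_and_port(base_url):
    """Extract (host, port) from a URL, defaulting the port by scheme."""
    parsed = urlparse(base_url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port

def is_reachable(base_url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(base_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False

# Parsing alone needs no network access
print(host_and_port("http://example.customer.com:8080/api"))
```

Calling `is_reachable("http://example.customer.com:8080/api")` with your own URL then gives a quick yes/no answer before you run the rest of the tutorial.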

In [1]:
import os

os.environ["BASE_URL"] = "http://example.customer.com:8080/api"
os.environ["OPENAI_API_KEY"] = ""

## Setup

The Python packages required for this tutorial can be installed using pip:

In [2]:
%pip install --upgrade openai
%pip install --upgrade cleanlab-tlm

In [3]:
import openai
from openai import OpenAI
from cleanlab_tlm.utils.vpc.chat_completions import TLMChatCompletion

## Overview of this tutorial

The workflows showcased below demonstrate how to incorporate trust scoring into your existing LLM code with minimal changes. We'll explore three workflows:

- Workflows 1 & 2: Use your own existing LLM infrastructure to generate responses, then use Cleanlab to score them
- Workflow 3: Use Cleanlab for both generating and scoring responses (responses can be generated by any LLM supported in your VPC deployment)

## Workflow 1: Score Responses from Existing LLM Calls

The easiest way to use TLM if you're already using OpenAI's ChatCompletions API is to score any existing LLM call you've made.

You can first generate LLM responses as usual using the OpenAI API (or any of your existing infrastructure; note that many LLM providers, such as Gemini and DeepSeek, also support OpenAI's Chat Completions API):

In [4]:
openai_kwargs = {
    "model": "gpt-4.1-mini",
    "messages":[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
}
response = openai.chat.completions.create(**openai_kwargs)
response

ChatCompletion(id='chatcmpl-Bjuu48x0SxfeGuz3GY3jI2TpeAEUK', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750283196, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_6f2eabb9a5', usage=CompletionUsage(completion_tokens=7, prompt_tokens=24, total_tokens=31, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

We can then use TLM to score the generated response.

Here, we first instantiate a `TLMChatCompletion` object. For more configurations, view the valid arguments [below](#input-arguments-to-tlm).

In [5]:
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "gpt-4o-mini"}) 

In [6]:
score_result = tlm.score(
    response=response,
    **openai_kwargs
)

print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {score_result['trustworthiness_score']:.4f}")

Response: The capital of France is Paris.
TLM Score: 1.0000
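
Once you have a score, a common next step is to decide what to do with low-scoring responses. The sketch below shows one possible pattern: return the response when its score clears a threshold, and substitute a fallback message otherwise. The function name `route_response`, the fallback text, and the threshold of 0.8 are all hypothetical choices for illustration; a suitable threshold depends on your use case.

```python
def route_response(text, trust_score, threshold=0.8):
    """Return the response if its score meets the threshold, else a fallback.

    The default threshold is an arbitrary illustrative value, not a
    recommendation from Cleanlab.
    """
    if trust_score >= threshold:
        return {"text": text, "trusted": True}
    fallback = "I'm not confident in this answer; escalating for review."
    return {"text": fallback, "trusted": False}

# Made-up scores to show both branches
print(route_response("The capital of France is Paris.", 0.99))
print(route_response("The capital of France is Lyon.", 0.12))
```

With the objects from the cells above, this could be called as `route_response(response.choices[0].message.content, score_result["trustworthiness_score"])`.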


## Workflow 2: Adding a Decorator to your LLM Call

Alternatively, you can wrap your call to `openai.chat.completions.create()` in a decorator that attaches the trust score to the returned response as `tlm_metadata`. This workflow requires only a one-time setup, after which the rest of your existing code needs no changes:

In [7]:
import functools

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator

Next, decorate the OpenAI client's `create` method so that your existing code automatically receives trust scores:

In [8]:
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "gpt-4.1-mini"}) 

In [9]:
openai.chat.completions.create = add_trust_scoring(tlm)(openai.chat.completions.create)

response = openai.chat.completions.create(**openai_kwargs)
response

ChatCompletion(id='chatcmpl-Bjuu831HubejCal4AsdB2UWeLEJW2', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750283200, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_6f2eabb9a5', usage=CompletionUsage(completion_tokens=7, prompt_tokens=24, total_tokens=31, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)), tlm_metadata={'trustworthiness_score': 0.9999999233293982})

In [10]:
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")

Response: The capital of France is Paris.
TLM Score: 1.0000
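
With the decorator above, an exception raised during scoring also propagates out of your LLM call. If you would rather treat scoring as best-effort, one option is a variant that catches scoring errors and still returns the response. The sketch below is an assumption-laden illustration, not part of `cleanlab-tlm`: `add_trust_scoring_safe`, the `tlm_error` attribute, and the stub classes are hypothetical names, and it assumes the response object accepts new attributes in the same way the `tlm_metadata` assignment above does.

```python
import functools
from types import SimpleNamespace

def add_trust_scoring_safe(tlm_instance):
    """Decorator factory: attach a trust score, tolerating scoring failures."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            try:
                response.tlm_metadata = tlm_instance.score(response=response, **kwargs)
            except Exception as exc:
                # Scoring is best-effort here: keep the response, record the error
                response.tlm_metadata = None
                response.tlm_error = str(exc)
            return response
        return wrapper
    return trust_score_decorator

class WorkingScorer:
    """Hypothetical stand-in for TLMChatCompletion that returns a fixed score."""
    def score(self, response, **kwargs):
        return {"trustworthiness_score": 0.5}

class FailingScorer:
    """Hypothetical stand-in that simulates an unavailable scoring service."""
    def score(self, response, **kwargs):
        raise RuntimeError("scoring service unavailable")

def stub_create(**kwargs):
    """Hypothetical stand-in for openai.chat.completions.create."""
    return SimpleNamespace(content="stub answer")

ok = add_trust_scoring_safe(WorkingScorer())(stub_create)(model="stub-model", messages=[])
failed = add_trust_scoring_safe(FailingScorer())(stub_create)(model="stub-model", messages=[])
print(ok.tlm_metadata, failed.tlm_metadata, failed.tlm_error)
```

Downstream code then needs to handle `tlm_metadata` being `None`, which is the trade-off for never losing a generated response to a scoring error.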


## Workflow 3: Use Cleanlab to Generate and Score Responses

You can point the OpenAI client directly to Cleanlab's infrastructure. This approach generates responses using Cleanlab's backend while simultaneously providing trustworthiness scores.

Here, you can replace the base URL with your actual TLM service endpoint, and then use the `chat.completions.create()` method as you normally would:

In [11]:
client = OpenAI(
    api_key=".",  # the VPC installation of TLM does not require API key, but the OpenAI client does, so we pass a fake value here
    base_url="http://example.customer.com:8080/api"  # replace with your TLM service URL
)

In [12]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "quality_preset": "low"
    }
)
response

ChatCompletion(id='chatcmpl-BjuuC9gCQQk7yfonPQ3upUQsBef3v', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750283204, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_34a54ae93c', usage=CompletionUsage(completion_tokens=7, prompt_tokens=24, total_tokens=31, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)), tlm_metadata={'trustworthiness_score': 0.9999997198696753})

The `extra_body` argument contains additional TLM configurations. For all supported inputs, view the valid arguments [below](#input-arguments-to-tlm).

In [13]:
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")

Response: The capital of France is Paris.
TLM Score: 1.0000
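
Because `tlm_metadata` is only present when the response comes from the TLM endpoint (or from a decorated call as in Workflow 2), code that may also receive plain OpenAI responses can read the score defensively. The helper below is a small sketch; `get_trust_score` is a hypothetical name, and the `SimpleNamespace` objects merely stand in for real response objects.

```python
from types import SimpleNamespace

def get_trust_score(response):
    """Return the trustworthiness score on a response, or None if absent."""
    metadata = getattr(response, "tlm_metadata", None)
    if not isinstance(metadata, dict):
        return None
    return metadata.get("trustworthiness_score")

# Stand-ins for a scored and an unscored response
scored = SimpleNamespace(tlm_metadata={"trustworthiness_score": 0.93})
unscored = SimpleNamespace()
print(get_trust_score(scored), get_trust_score(unscored))
```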


### Adding a decorator to pass in TLM configurations via `extra_body`

Here, we demonstrate how to decorate your call to `chat.completions.create()` so that the `extra_body` argument is automatically added to all subsequent calls to `create()`. After this initial setup, your existing code requires no changes.

In [14]:
import functools

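def add_extra_body(tlm_kwargs):
    """Decorator factory that injects TLM configurations into every call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Replaces any extra_body passed by the caller with tlm_kwargs
            kwargs["extra_body"] = tlm_kwargs
            return fn(*args, **kwargs)
        return wrapper
    return decorator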

Similar to above, we can decorate the OpenAI client. After this monkey-patch, the code below is functionally equivalent to the earlier example where we specified `extra_body` in each `create()` call, so you can use your existing code with minimal changes.

In [15]:
tlm_kwargs = {"quality_preset": "low"}
client.chat.completions.create = add_extra_body(tlm_kwargs)(client.chat.completions.create)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
response

ChatCompletion(id='chatcmpl-BjuuGxSKNvPWoQqYL8iSNC6e8UYjt', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750283208, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_34a54ae93c', usage=CompletionUsage(completion_tokens=7, prompt_tokens=24, total_tokens=31, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)), tlm_metadata={'trustworthiness_score': 0.9999997198696664})
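
Note that `add_extra_body` as written sets `extra_body` on every call, so a value passed explicitly by the caller is replaced. If you want per-call settings to take precedence over the defaults, one possible variant merges the two dictionaries instead. This is only a sketch: `add_extra_body_merged` and `stub_create` are hypothetical names, and the stub simply echoes its keyword arguments so the behavior can be inspected without a network call.

```python
import functools

def add_extra_body_merged(tlm_kwargs):
    """Decorator factory: merge default TLM settings with per-call extra_body."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            per_call = kwargs.get("extra_body") or {}
            # Later keys win, so per-call values override the defaults
            kwargs["extra_body"] = {**tlm_kwargs, **per_call}
            return fn(*args, **kwargs)
        return wrapper
    return decorator

def stub_create(**kwargs):
    """Hypothetical stand-in for client.chat.completions.create."""
    return kwargs

create = add_extra_body_merged({"quality_preset": "low"})(stub_create)
default_call = create(model="stub-model")
override_call = create(model="stub-model", extra_body={"quality_preset": "medium"})
print(default_call["extra_body"], override_call["extra_body"])
```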

## Input Arguments to TLM

These are optional TLM configurations you can specify either when initializing a `TLMChatCompletion` object or in the `extra_body` argument to the OpenAI API client.

- `quality_preset` ({"base", "low", "medium"}, default = "medium"): a preset configuration to control the quality of TLM responses and trustworthiness scores vs. latency/costs. The "medium" preset produces more reliable trustworthiness scores than "low". The "base" preset provides the lowest possible latency/cost. Higher presets have increased runtime and cost. Reduce your preset if you see token-limit errors. 

- `options` is a dictionary of configuration options for TLM. Inputs include:
    - `model` (default = "gpt-4.1-mini"): Underlying base LLM to use (better models yield better results, faster models yield faster results).
    
        Note that if you are using the OpenAI `openai.chat.completions.create()` API, you should provide the model name there instead of in the options dictionary here.
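
To catch typos in these settings before a request is sent, you could validate them in one place. The sketch below builds a configuration dictionary from the arguments documented above and rejects unknown presets. `build_tlm_config` and `VALID_PRESETS` are hypothetical names, and the set of presets is copied from the list above, so it may not match what your deployment accepts.

```python
VALID_PRESETS = {"base", "low", "medium"}  # taken from the argument list above

def build_tlm_config(quality_preset="medium", model=None):
    """Return a TLM configuration dict, raising on an unknown preset."""
    if quality_preset not in VALID_PRESETS:
        raise ValueError(
            f"quality_preset must be one of {sorted(VALID_PRESETS)}, got {quality_preset!r}"
        )
    config = {"quality_preset": quality_preset}
    if model is not None:
        # Only include options when a model is explicitly requested
        config["options"] = {"model": model}
    return config

print(build_tlm_config("low"))
print(build_tlm_config("medium", model="gpt-4.1-mini"))
```

The resulting dictionary could then be unpacked into the constructor, as in `TLMChatCompletion(**build_tlm_config("low", model="gpt-4.1-nano"))`, or passed without a model as `extra_body=build_tlm_config("low")`.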


## Getting Cheaper / Faster Results

The default TLM settings are not latency-optimized because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application.

- You can stream in a response from any (fast) LLM you are using, and then use `TLMChatCompletion.score` to subsequently stream in the trustworthiness score for the response. If you run TLM with a lower `quality_preset` and cheaper model, then the additional cost/runtime of trustworthiness scoring can be only a fraction of your cost/runtime of producing the response with your own LLM.

- Reduce the `quality_preset` setting (e.g. to "low" or "base").

- Specify `options` to further reduce TLM runtime, for example by changing `model` to a faster base LLM (e.g. `gpt-4.1-nano`).
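
When comparing settings, it helps to measure latency on the same examples for each configuration. The helper below is a minimal sketch of that measurement; `mean_latency` is a hypothetical name, and the two stub functions only simulate a slower and a faster configuration with `time.sleep`. In practice you would pass a function that calls `tlm.score(...)` or `client.chat.completions.create(...)` with the configuration under test.

```python
import time
from statistics import mean

def mean_latency(score_fn, examples):
    """Call score_fn on each example and return mean wall-clock seconds."""
    durations = []
    for example in examples:
        start = time.perf_counter()
        score_fn(example)
        durations.append(time.perf_counter() - start)
    return mean(durations)

def slow_stub(example):
    """Simulates a higher-quality, slower configuration."""
    time.sleep(0.05)

def fast_stub(example):
    """Simulates a lower-latency configuration."""
    return None

examples = ["q1", "q2", "q3"]
slow_seconds = mean_latency(slow_stub, examples)
fast_seconds = mean_latency(fast_stub, examples)
print(f"slow: {slow_seconds:.3f}s, fast: {fast_seconds:.3f}s")
```

Latency is only half of the comparison: check that the trust scores from the faster configuration still separate good and bad responses on your data before switching.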


## Running on Batches and Managing Rate Limits

When processing large datasets, here are some tips to handle rate limits and implement proper batching strategies:

### Prevent hitting rate limits
- Process data in small batches (e.g. 10-50 requests at a time)
- Add sleep intervals between batches (e.g. `time.sleep(1)`) to stay under rate limits

### Handling errors
- Save partial results frequently to avoid losing progress
- Consider using a try/except block to catch errors, and implement retry logic when rate limits are hit

Here are some sample helper functions that could help with batching:

In [None]:
import time
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

client = OpenAI(
    api_key=".",  # the VPC installation of TLM does not require API key, but the OpenAI client does, so we pass a fake value here
    base_url="http://example.customer.com:8080/api"  # replace with your TLM service URL
)

def invoke_llm_with_retries(openai_kwargs, retries=3, backoff=2):
    """Call the LLM, retrying up to `retries` times with exponential backoff."""
    attempt = 0
    while attempt <= retries:
        try:
            # the code to invoke the LLM goes here, feel free to modify
            response = client.chat.completions.create(**openai_kwargs)
            return {
                "response": response.choices[0].message.content,
                "trustworthiness_score": response.tlm_metadata["trustworthiness_score"],
                "raw_completion": response
            }
        except Exception as e:
            # out of retries: return the error instead of raising so the batch can continue
            if attempt == retries:
                return {"error": str(e), "input": openai_kwargs}
            # with the default backoff=2, this waits 1, 2, then 4 seconds
            sleep_time = backoff ** attempt
            time.sleep(sleep_time)
            attempt += 1

def run_batch(batch_data, batch_size=20, max_threads=8, sleep_time=5):
    results = []
    
    for i in tqdm(range(0, len(batch_data), batch_size)):
        data = batch_data[i:i + batch_size]
        batch_results = [None] * len(data)
        
        with ThreadPoolExecutor(max_workers=max_threads) as executor:
            future_to_idx = {executor.submit(invoke_llm_with_retries, d): idx for idx, d in enumerate(data)}
            for future in as_completed(future_to_idx):
                idx = future_to_idx[future]
                batch_results[idx] = future.result()
                
        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(batch_data):
            time.sleep(sleep_time)
            
    return results

sample_input = {
    "model": "gpt-4.1-mini",
    "messages":[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
}
sample_batch = [sample_input] * 10
run_batch(sample_batch)
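
To act on the tip about saving partial results, one simple approach is to append each finished batch to a JSON Lines file. The sketch below assumes result rows shaped like those returned by `invoke_llm_with_retries`; the helper names and the file name `tlm_results.jsonl` are hypothetical. It drops the `raw_completion` entry because the raw response object is not directly JSON-serializable.

```python
import json
import os
import tempfile

def append_results(path, results):
    """Append one JSON object per line, skipping the raw completion object."""
    with open(path, "a", encoding="utf-8") as f:
        for row in results:
            row = {k: v for k, v in row.items() if k != "raw_completion"}
            f.write(json.dumps(row) + "\n")

def load_results(path):
    """Read back all saved rows; returns an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demonstration with made-up rows in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "tlm_results.jsonl")
    append_results(checkpoint, [
        {"response": "Paris", "trustworthiness_score": 0.99, "raw_completion": object()}
    ])
    append_results(checkpoint, [{"error": "timeout", "input": {"model": "stub-model"}}])
    saved = load_results(checkpoint)
    print(len(saved), "rows saved")
```

Inside `run_batch`, a call such as `append_results(path, batch_results)` right after `results.extend(batch_results)` would persist each batch as it completes, so an interrupted run loses at most the batch in progress.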

More information about handling rate limits can be found in [this OpenAI cookbook](https://cookbook.openai.com/examples/how_to_handle_rate_limits).