# Using TLM with OpenAI's Responses API

This tutorial demonstrates how to score the trustworthiness of responses from the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses). With minimal changes to your existing Responses API code, you can score the trustworthiness of every LLM response in real-time, even when relying on OpenAI tools like function calling, web search, and file search.

## Setup

The Python packages required for this tutorial can be installed using pip:

In [None]:
%pip install --upgrade --quiet cleanlab-tlm openai trafilatura

This tutorial requires a TLM API key. Get one [here](https://tlm.cleanlab.ai/).

In [None]:
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<Cleanlab TLM API key>"  # Get your free API key from: https://tlm.cleanlab.ai/
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>"  # for using OpenAI client library

Let's first initialize clients.

In [None]:
from openai import OpenAI
from cleanlab_tlm.utils.responses import TLMResponses

client = OpenAI()
tlm = TLMResponses(options={"log": ["explanation"]})

## Usage

We'll showcase different OpenAI Responses API workflows, and how you can score the trustworthiness of results in each workflow.

### Workflow 1: Single Turn Q&A

Here is the standard OpenAI Responses code you'd write to call the LLM with a prompt and get a response.

In [9]:
openai_kwargs = dict(
  model = "gpt-4.1-mini",
  input = "What is the capital of France?",
)

response = client.responses.create(**openai_kwargs)

print("Response:", next(message for message in response.output if message.type == "message").content[0].text)

Response: The capital of France is Paris.


Score the trustworthiness of this response using the TLMResponse `score()` method, passing in the OpenAI keyword arguments that you had passed in the OpenAI Responses call that generated this response.

In [10]:
tlm_result = tlm.score(response=response, **openai_kwargs)

print(f"TLM Score: {tlm_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {tlm_result['log']['explanation']}")

TLM Score: 0.9990
TLM Explanation: Did not find a reason to doubt trustworthiness.


### Workflow 2: Multi-Turn Chat

In the following example, the user first asks the LLM about FIFA finalists and follows up with a question about the Golden Ball winner in the same chat. Here we manage the `messages` variables to track the conversation history. Again, we can get trust scores for every LLM response, simply by passing in the same arguments to TLMResponses that were passed to OpenAI to generate that response.

In [11]:
print("Turn one:")

messages = [
  {
    "role": "user",
    "content": "Who were the finalists in 2022 FIFA World Cup?"
  }
]

openai_kwargs = dict(
    model = "gpt-4.1-mini",
    input = messages,
)

response = client.responses.create(**openai_kwargs)

text_response = next(message for message in response.output if message.type == "message").content[0].text
print("Response:", text_response)

## Extra Cleanlab code ##
tlm_result = tlm.score(response=response, **openai_kwargs)
print(f"TLM Score: {tlm_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {tlm_result['log']['explanation']}")
## End of extra Cleanlab code ##

print("\n\nTurn two:")

messages.append({
  "role": "assistant",
  "content": text_response
})
messages.append({
  "role": "user",
  "content": "Who won Golden Ball?"
})

openai_kwargs = dict(
    model = "gpt-4.1-mini",
    input = messages,
)

response = client.responses.create(**openai_kwargs)

print("Response:", next(message for message in response.output if message.type == "message").content[0].text)

## Extra Cleanlab code ##
tlm_result = tlm.score(response=response, **openai_kwargs)
print(f"TLM Score: {tlm_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {tlm_result['log']['explanation']}")
## End of extra Cleanlab code ##

Turn one:
Response: The finalists in the 2022 FIFA World Cup were Argentina and France.
TLM Score: 0.9907
TLM Explanation: Did not find a reason to doubt trustworthiness.


Turn two:
Response: The Golden Ball award at the 2022 FIFA World Cup was won by Lionel Messi of Argentina.
TLM Score: 0.9904
TLM Explanation: Did not find a reason to doubt trustworthiness.


### Workflow 3: Including Web Search Tool

The OpenAI Responses API provides LLMs access to a native `web_search` tool. Here, we will force the usage of the web search tool, demonstrating how scoring the trustworthiness of web-search powered LLM responses can be achieved with the same TLM code as before. Note that when trust-scoring a response that uses web search, you will need to install the `trafilatura` package to analyze the content of web pages.

In [20]:
openai_kwargs = dict(
    model = "gpt-4.1-mini",
    input = "Who wrote pride and prejudice?",
    tools = [{"type": "web_search"}],
    tool_choice = {"type": "web_search"},
)

response = client.responses.create(**openai_kwargs)

print("Response Text:", next(message for message in response.output if message.type == "message").content[0].text)
print("\nResponse Object:", response)

Response Text: "Pride and Prejudice" is a novel written by Jane Austen, first published in 1813. Austen, an English author, is renowned for her keen observations of social manners and relationships in the early 19th century. "Pride and Prejudice" is considered one of her most significant works, exploring themes of love, class, and societal expectations. ([britannica.com](https://www.britannica.com/topic/Pride-and-Prejudice?utm_source=openai)) 

Response Object: Response(id='resp_096511bf77ab1d520068ddae4afad4819faa3789b70c62c481', created_at=1759358539.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4.1-mini-2025-04-14', object='response', output=[ResponseFunctionWebSearch(id='ws_096511bf77ab1d520068ddae4b080c819fbfa935065492e28a', action=ActionSearch(query='Who wrote pride and prejudice?', type='search', sources=None), status='completed', type='web_search_call'), ResponseOutputMessage(id='msg_096511bf77ab1d520068ddae4c4818819fb0be010b8fcf2983', conte

In [21]:
tlm_response = tlm.score(response=response, **openai_kwargs)

print(f"TLM Score: {tlm_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {tlm_result['log']['explanation']}")

TLM Score: 0.9904
TLM Explanation: Did not find a reason to doubt trustworthiness.


### Workflow 4: Including File Search Tool (RAG)

OpenAI Responses can also run RAG, retrieving relevant data from a vector-store that is considered by the LLM when generating its response. This involves using the native `file_search` tool to find relevant documents and passages that can inform the response. 

Let's download a sample PDF file containing OpenAI prices and upload it to an OpenAI vector store. This way, we can ask questions about OpenAI pricing using the `file_search` tool.

In [14]:
import requests

url = "https://storage.googleapis.com/files-hosting/openai-pricing.pdf"
pdf_path = "openai-pricing.pdf"

response = requests.get(url)
with open(pdf_path, "wb") as f:
  f.write(response.content)

print(f"Downloaded PDF to {pdf_path}")

file = client.files.create(file=open("openai-pricing.pdf", "rb"), purpose="user_data")
vector_store = client.vector_stores.create(name="knowledge_base")
client.vector_stores.files.create_and_poll(vector_store_id=vector_store.id, file_id=file.id)

print("Created vector store with ID:", vector_store.id)

Downloaded PDF to openai-pricing.pdf
Created vector store with ID: vs_68ddad3a883c8191b5586b074b349ac5


Now, we're ready to send a test message to OpenAI Responses. When you do this, you must have `"include": ["file_search_call.results"]` in your request payload to properly score the file search.

In [15]:
openai_kwargs = {
  "model": "gpt-4.1-mini",
  "input": "How much does GPT-5 cost?",
  "tools": [{"type": "file_search", "vector_store_ids": [vector_store.id]}],
  "include": ["file_search_call.results"],
  "tool_choice": {"type": "file_search"},
}

response = client.responses.create(**openai_kwargs)

print("Response:", next(message for message in response.output if message.type == "message").content[0].text)
print("\nResponse Object:", response)

Response: The cost of using GPT-5 via API pricing is as follows:

- For the main GPT-5 model:
  - Input tokens: $1.250 per 1 million tokens
  - Cached input tokens: $0.125 per 1 million tokens
  - Output tokens: $10.000 per 1 million tokens

- GPT-5 mini (a faster, cheaper version):
  - Input tokens: $0.250 per 1 million tokens
  - Cached input tokens: $0.025 per 1 million tokens
  - Output tokens: $2.000 per 1 million tokens

- GPT-5 nano (the fastest and cheapest version):
  - Input tokens: $0.050 per 1 million tokens
  - Cached input tokens: $0.005 per 1 million tokens
  - Output tokens: $0.400 per 1 million tokens

This pricing structure reflects usage costs based on tokens processed (both input and output) by the API. For more details, you can refer to the OpenAI pricing document provided.

Response Object: Response(id='resp_0eb45fdecd2d60c30068ddad3d6edc81a2bed8caf423a3bf28', created_at=1759358269.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-

As long as you have the `file_search_call.results` included in your OpenAI request, then you can score the trustworthiness of the file search powered response using the same TLM code.

In [16]:
tlm_response = tlm.score(response=response, **openai_kwargs)

print(f"TLM Score: {tlm_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {tlm_result['log']['explanation']}")

TLM Score: 0.9904
TLM Explanation: Did not find a reason to doubt trustworthiness.


## Make your existing code also produce trust scores (via decorator)

You decorate your call to `openai.responses.create()` with a decorator that then appends the trust score as a key in the returned response. This workflow only requires minimal initial setup; after that zero changes are needed in the rest of your existing code!

In [17]:
import functools

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator

Then decorate your OpenAI Responses function like this:

In [18]:
client.responses.create = add_trust_scoring(tlm)(client.responses.create)

After you decorate OpenAI’s Responses function like this, all of your existing Responses API code will automatically compute trust scores as well (zero change needed in other code):

In [19]:
response = client.responses.create(input="What is the capital of France?", model="gpt-4.1-mini")

print(f"Response: {response.output[0].content[0].text}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {response.tlm_metadata['log']['explanation']}")

Response: The capital of France is Paris.
TLM Score: 0.9990
TLM Explanation: Did not find a reason to doubt trustworthiness.
