# Trustworthy Language Model (TLM)

<head>
  <meta name="title" content="Trustworthy Language Model (TLM)"/>
  <meta property="og:title" content="Trustworthy Language Model (TLM)"/>
  <meta name="twitter:title" content="Trustworthy Language Model (TLM)" />
  <meta name="image" content="/img/tlm-chat.png" />
  <meta property="og:image" content="/img/tlm-chat.png" />
  <meta name="twitter:image" content="/img/tlm-chat.png" />
  <meta name="description" content="A more reliable LLM that quantifies confidence for every output and can detect bad responses."  />
  <meta property="og:description" content="A more reliable LLM that quantifies confidence for every output and can detect bad responses." />
  <meta name="twitter:description" content="A more reliable LLM that quantifies confidence for every output and can detect bad responses." />
</head>



:::info

This feature is in beta, and requires a Cleanlab Studio API Token to use. To get the API token, you must first create a Cleanlab Studio account and access the account page [here](https://app.cleanlab.ai/account). Additional instructions on creating your account and getting the API token can be found in the [Cleanlab Studio Python API Guide](/guide/quickstart/api/#creating-an-api-key).

For higher token limits, email: [sales@cleanlab.ai](mailto:sales@cleanlab.ai)

:::

Large Language Models can act as powerful reasoning engines for solving problems and answering questions, but they are prone to “hallucinations”, where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to automatically tell whether an output is good or not.

Cleanlab TLM is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.

<img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/tlm-chat.svg" alt="TLM chat interface" width="600"/>

For example, with a standard LLM:

> **Question**: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual's car without a warrant? <br/>
> **Answer**: **The Fourth Amendment.**
> 
> 
> **Question**: What is 57834849 + 38833747? <br/>
> **Answer**: **96668696**

From these outputs alone, it’s difficult to tell when the LLM's answer can be trusted and when it cannot. With the Cleanlab Trustworthy Language Model, however, each answer comes with a **confidence score**. This can guide how to use the output from the LLM (e.g., use it directly if the score is above a certain threshold, otherwise flag the response for human review):

> **Question**: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual's car without a warrant? <br/>
> **Answer**: <span style={{color: '#448361'}}>The Fourth Amendment.</span> <br/>
> **Confidence**: `0.765`
> 
> **Question**: What is 57834849 + 38833747? <br/>
> **Answer**: <span style={{color: '#D44C47'}}>96668696</span> <br/>
> **Confidence**: `0.245`
> 
> **Question**: What is 100 + 300? <br/>
> **Answer**: <span style={{color: '#448361'}}>400</span> <br/>
> **Confidence**: `0.938`
> 
> **Question**: Which part of the human body produces insulin? <br/>
> **Answer**: <span style={{color: '#448361'}}>the pancreas</span> <br/>
> **Confidence**: `0.759`
> 
> **Question**: What color are the two stars on the national flag of Syria? <br/>
> **Answer**: <span style={{color: '#D44C47'}}>red and black</span> <br/>
> **Confidence**: `0.173`

## Installing Cleanlab TLM

Using TLM requires a [Cleanlab Studio](https://app.cleanlab.ai/) account. Sign up for one [here](https://cleanlab.ai/signup/) if you haven't yet. If you've already signed up, check your email for a personal login link.

The `cleanlab-studio` client can be installed using pip:

In [None]:
%pip install cleanlab-studio

In [1]:
from cleanlab_studio import Studio

## Using the TLM

You can query the TLM as follows:

In [2]:
# Get API key from here: https://app.cleanlab.ai/account after creating a Cleanlab Studio account.
# Instructions to create account can be found under Guide -> Quickstart -> Cleanlab Studio Python API -> Creating an API Key

studio = Studio("<API key>")
tlm = studio.TLM()

output = tlm.prompt("<your prompt>")

The TLM’s `output` will be a dict with two fields:

```
{
  "response": "<response>",  # string like you'd get back from any standard LLM
  "confidence_score": <confidence_score>  # numerical value between 0 and 1
}
```

The score quantifies how confident you can be that the response is good (higher values indicate greater confidence). These scores combine estimates of both *aleatoric* and *epistemic* uncertainty to provide an overall gauge of confidence. You may find the TLM most useful when your prompts take the form of a question with a definite answer (in which case the returned score quantifies our confidence that the LLM response is *correct*). To boost the *reliability* of your Generative AI applications, add contingency plans that override LLM answers whose confidence falls below some threshold (e.g., route the question to a human, append a disclaimer that the answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
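One way to wire such contingency plans into an application is sketched below. It assumes `output` is the dict returned by `tlm.prompt(...)`, with the `response` and `confidence_score` fields described above. The function name `apply_contingency`, the two thresholds, and the fallback messages are illustrative choices, not part of the Cleanlab API.

```python
# Illustrative sketch: names, thresholds, and messages are assumptions, not Cleanlab API.
DEFAULT_ANSWER = "Sorry, I can't answer that reliably. A human will follow up."
DISCLAIMER = "\n\nNote: this answer may be unreliable."


def apply_contingency(output, review_below=0.3, disclaim_below=0.6):
    """Decide what to show the user based on the TLM confidence score."""
    score = output["confidence_score"]
    if score < review_below:
        # Very low confidence: replace the answer and route to a human.
        return {"action": "human_review", "text": DEFAULT_ANSWER}
    if score < disclaim_below:
        # Middling confidence: keep the answer but warn the reader.
        return {"action": "disclaimer", "text": output["response"] + DISCLAIMER}
    # High confidence: use the answer as-is.
    return {"action": "use", "text": output["response"]}


# Stand-in for a real `tlm.prompt(...)` result, using an example from above.
example = {"response": "96668696", "confidence_score": 0.245}
decision = apply_contingency(example)
print(decision["action"])  # human_review
```

Suitable thresholds depend on your application and are best chosen after inspecting real responses at different score levels.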

## Advanced Usage

For efficient computation on larger datasets, TLM supports processing multiple concurrent requests in batches. The maximum number of concurrent requests in batch queries is set by **max_concurrent_requests** (default value is `16`).

To control reliability/compute trade-offs, you can pass in different **quality presets** to the TLM. The default preset is `medium`. For some use cases, this will be enough, but using the higher-quality presets will produce *better* model responses and *more reliable* associated confidence scores (at the cost of extra computation).

In [3]:
# Default batch size is 16, max batch size is 128.
DEFAULT_MAX_CONCURRENT_TLM_REQUESTS = 16

tlm = studio.TLM(
    max_concurrent_requests=DEFAULT_MAX_CONCURRENT_TLM_REQUESTS,
    quality_preset="best",  # supported quality presets are: 'best', 'high', 'medium', 'low', 'base'
)
output = tlm.prompt("<your prompt>")

**Details about the TLM quality presets**: 

- `best` and `high` will improve the LLM responses themselves, with `best` also returning the most reliable confidence scores.
- `medium` and `low` will return standard LLM responses along with associated confidence scores, with `medium` producing more reliable confidence scores than `low`.
- `base` will not return any confidence score, just an output response. This option is similar to using your favorite LLM. It helps you to compare the enhanced responses from `best` and `high` quality presets with a standard LLM, as well as the value of the additional confidence scores returned by TLM.
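The preset descriptions above can be captured in a small lookup table that guards against choosing a preset that cannot give you what you need. This is only a sketch: the `PRESETS` dict and the `validate_preset` helper are hypothetical, and the entry saying that `high` returns a confidence score is inferred from the descriptions rather than stated outright.

```python
# Encodes the preset descriptions above. The dict layout and helper name are
# illustrative; that `high` returns a confidence score is inferred, not stated.
PRESETS = {
    "best": {"improves_response": True, "returns_confidence": True},
    "high": {"improves_response": True, "returns_confidence": True},
    "medium": {"improves_response": False, "returns_confidence": True},
    "low": {"improves_response": False, "returns_confidence": True},
    "base": {"improves_response": False, "returns_confidence": False},
}


def validate_preset(name, need_confidence=True):
    """Return the preset name if it is known and meets the caller's needs."""
    if name not in PRESETS:
        raise ValueError(f"unknown quality preset: {name!r}")
    if need_confidence and not PRESETS[name]["returns_confidence"]:
        raise ValueError(f"preset {name!r} does not return confidence scores")
    return name


print(validate_preset("medium"))  # medium
```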

**Benchmark:** Accuracy of answers from an OpenAI LLM (GPT-3.5 Turbo) vs. TLM with `quality_preset="best"` (across 4 Q&A datasets from different domains)

| Dataset | OpenAI LLM | Cleanlab TLM |
| --- | --- | --- |
| GSM8K | 47% | 69% |
| CSQA | 72% | 73% |
| SVAMP | 75% | 82% |
| TriviaQA | 73% | 76% |

These benchmarks also reveal that, in the vast majority of cases, the confidence scores are lower for incorrect TLM answers than for correct answers. Thus you can use these scores to alert you to LLM responses that are untrustworthy.
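If you have a small set of responses whose correctness you already know, you can check how well the scores separate right from wrong answers on your own data. The sketch below computes the fraction of (incorrect, correct) pairs in which the incorrect answer receives the lower score, which is equivalent to the AUROC. The helper name and the scores and labels are made up for illustration and are not benchmark results.

```python
def ranking_accuracy(scores, is_correct):
    """Fraction of (incorrect, correct) pairs where the incorrect answer scores lower.

    Ties count as half a win. This equals the AUROC of the scores.
    """
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    right = [s for s, ok in zip(scores, is_correct) if ok]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w < r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))


# Made-up scores and correctness labels, for illustration only.
scores = [0.94, 0.18, 0.76, 0.58, 0.33, 0.97]
is_correct = [True, False, True, False, False, True]
print(ranking_accuracy(scores, is_correct))  # 1.0
```

A value near 1.0 means low scores reliably point at wrong answers, while a value near 0.5 means the scores carry little signal for that data.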

## Scoring the confidence of a given response

You can also use TLM to compute a confidence score for any response to a given prompt. The response does not need to come from TLM, and could be human-written. Simply pass a prompt-response pair to the TLM and it will return a numerical score quantifying our confidence that this is a *good* response.

In [None]:
confidence_score = tlm.get_confidence_score("<your prompt>", response="<your response>")
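Because the scorer accepts any response, one possible use is ranking several candidate answers to the same prompt. The sketch below assumes a scoring callable with the same shape as the call above (a prompt plus a `response` keyword, returning a single number). `rank_candidates` and the stub `fake_score` are hypothetical, and the stub's scores are invented so the snippet runs without an API key.

```python
def rank_candidates(prompt, candidates, score_fn):
    """Return (score, response) pairs sorted from most to least trusted."""
    scored = [(score_fn(prompt, response=c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def fake_score(prompt, response):
    # Hypothetical stand-in for a real scorer such as `tlm.get_confidence_score`.
    lookup = {"the pancreas": 0.759, "the liver": 0.1}
    return lookup.get(response, 0.0)


ranked = rank_candidates(
    "Which part of the human body produces insulin?",
    ["the liver", "the pancreas"],
    fake_score,
)
print(ranked[0][1])  # the pancreas
```

With a real TLM client, you would pass `tlm.get_confidence_score` in place of `fake_score`.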

## Application: Determining which LLM responses are untrustworthy

TLM confidence scores are often most interpretable when comparing them over a large dataset. Be sure to use **batched queries** (e.g., `batch_prompt`) for running TLM on many examples from a dataset.

To show how TLM confidence scores can catch low-quality model outputs, let's consider a dataset of various space-related trivia questions. We can use each question as a prompt for the TLM (just as you would with any other LLM) and record its response and associated confidence score.

In [4]:
import pandas as pd
from tqdm import tqdm 

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'

In [5]:
df = pd.read_csv("solar_system_example_prompts.csv")
df.head()

|  | prompt |
| --- | --- |
| 0 | What is the largest planet in the Solar System? |
| 1 | As of 2024, how many dogs have reached outer space? |
| 2 | What is the name of the galaxy that contains our Solar System? |
| 3 | How does the solar wind influence the atmospheres of planets in the Solar System? |
| 4 | Fetch me the current trajectory of Pluto's orbit from nasa.gov |


In [6]:
# Our list of prompts is small so we can use the default batch size
batch_size = DEFAULT_MAX_CONCURRENT_TLM_REQUESTS
tlm = studio.TLM(quality_preset="medium", max_concurrent_requests=batch_size)
idx_start = 0  # In case TLM call times-out, can resume loop at this data point

results = df.copy(deep=True) 
results["response"] = None
results["confidence_score"] = None

for i in tqdm(range(idx_start, len(results), batch_size)):
    end_index = min(i + batch_size, len(results))
    batch_prompts = results.iloc[i:end_index]["prompt"]
    # Concurrently process batch_size prompts in a single batched query
    outputs = tlm.batch_prompt(batch_prompts, retries=1)
    responses = [output["response"] for output in outputs]
    confidence_scores = [output["confidence_score"] for output in outputs]
    results.loc[i:end_index-1, "response"] = responses
    results.loc[i:end_index-1, "confidence_score"] = confidence_scores


100%|███████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.97s/it]
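The `idx_start` variable in the loop above is meant to let you resume after a time-out instead of starting over. A minimal sketch of that pattern follows. The `run_batches` helper is hypothetical, `flaky_batch_fn` is a stub standing in for a call like `tlm.batch_prompt`, and treating time-outs as Python's built-in `TimeoutError` is an assumption, since the exception the client raises is not documented here.

```python
def run_batches(prompts, batch_fn, batch_size=16, idx_start=0, results=None):
    """Process prompts in batches, starting at idx_start.

    Returns (results, next_idx). If a batch times out, next_idx is the start of
    the failed batch, so the call can be repeated with idx_start=next_idx.
    """
    results = [] if results is None else results
    for i in range(idx_start, len(prompts), batch_size):
        try:
            outputs = batch_fn(prompts[i:i + batch_size])
        except TimeoutError:  # assumption: time-outs surface as TimeoutError
            return results, i
        results.extend(outputs)
    return results, len(prompts)


# Hypothetical stub that times out on its second call, then works again.
calls = {"n": 0}


def flaky_batch_fn(batch):
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated time-out")
    return [{"response": p.upper(), "confidence_score": 0.5} for p in batch]


prompts = [f"question {k}" for k in range(5)]
results, next_idx = run_batches(prompts, flaky_batch_fn, batch_size=2)
print(next_idx)  # 2 (the second batch failed)
results, next_idx = run_batches(
    prompts, flaky_batch_fn, batch_size=2, idx_start=next_idx, results=results
)
print(len(results), next_idx)  # 5 5
```

Results collected before the failure are kept, so only the failed batch and the ones after it are sent again.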


To see which model responses are least trustworthy, let's sort the data by the confidence scores. Observe how the model responses with the lowest confidence scores are unhelpful or factually inaccurate; such errors are sometimes referred to as *hallucinations*. After quick fact-checking of sample responses, we verified that "Sunguard" is not a real product and that no chimpanzee, Ham included, has ever reached the Moon.

On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed. Now we can automatically detect which LLM responses are confidently good vs. not!

In [7]:
results.sort_values(by="confidence_score").head()

|  | prompt | response | confidence_score |
| --- | --- | --- | --- |
| 8 | Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation? | The brand of sunscreen recommended for astronauts on spacewalks to protect against solar radiation is called "Sunguard." | 0.180478 |
| 1 | As of 2024, how many dogs have reached outer space? | As an AI language model, I don't have real-time data, but as of my last knowledge update in 2021, a total of 17 dogs have been sent to space. The most famous dog to reach outer space was Laika, a Soviet space dog who became the first animal to orbit the Earth aboard Sputnik 2 in 1957. However, it's important to note that the use of animals in space missions has significantly decreased since then, and human astronauts are now primarily sent for space exploration. For the most up-to-date information, it's recommended to refer to reliable sources or space agencies. | 0.331541 |
| 5 | How many years from now will Earth, Saturn, and the Sun align? | The alignment of Earth, Saturn, and the Sun occurs approximately once every 378 days. Therefore, if we consider the current alignment as the starting point, it would take approximately 378 days for them to align again. | 0.443525 |
| 15 | What was the first meal that astronauts ate on the Moon? | The first meal that astronauts ate on the Moon during the Apollo 11 mission was a specially prepared food item called the "Apollo 11 Meal B." It consisted of bacon squares, peaches, sugar cookie cubes, pineapple grapefruit drink, and coffee. | 0.463158 |
| 12 | What is the name of the first chimpanzee that reached the Moon? | The first chimpanzee that reached the Moon was named Ham. | 0.576107 |


In [8]:
results.sort_values(by="confidence_score").tail()

|  | prompt | response | confidence_score |
| --- | --- | --- | --- |
| 10 | In what year did the United Nations grant the Moon its independence, and who is the current lunar ambassador? | The United Nations has not granted the Moon its independence, as the Moon is not a sovereign entity. It is a natural satellite of Earth and is not recognized as an independent nation. Therefore, there is no current lunar ambassador. | 0.937928 |
| 3 | How does the solar wind influence the atmospheres of planets in the Solar System? | The solar wind is a stream of charged particles, mainly protons and electrons, that are constantly emitted by the Sun. It has a significant influence on the atmospheres of planets in the Solar System in several ways:<br/><br/>1. Atmospheric Stripping: The solar wind can strip away the outer layers of a planet's atmosphere. This is particularly evident in planets with weak magnetic fields, such as Mars. The high-energy particles in the solar wind can collide with and ionize the atmospheric particles, causing them to escape into space. Over time, this can lead to a gradual loss of atmosphere.<br/><br/>2. Magnetospheric Interaction: Planets with strong magnetic fields, like Earth, have a protective magnetosphere that shields their atmospheres from the direct impact of the solar wind. However, the solar wind can still interact with the magnetosphere, causing various effects. For example, the solar wind can compress the magnetosphere on the side facing the Sun and stretch it out on the opposite side, creating a long tail-like structure called a magnetotail. This interaction can lead to the generation of auroras in the polar regions.<br/><br/>3. Ionization and Heating: When the charged particles in the solar wind collide with the atoms and molecules in a planet's atmosphere, they can ionize them, meaning they strip away electrons from the atoms or molecules. This ionization process can create layers of charged particles in the upper atmosphere, known as ionospheres. The solar wind can also transfer energy to the atmosphere, leading to heating and increased temperatures in the upper layers.<br/><br/>4. Atmospheric Escape: The solar wind can drive a process called atmospheric escape, where the high-energy particles can impart enough energy to atmospheric particles to overcome the planet's gravitational pull and escape into space. This can result in the loss of lighter elements, such as hydrogen and helium, from a planet's atmosphere over long periods of time.<br/><br/>Overall, the solar wind plays a crucial role in shaping and modifying the atmospheres of planets in the Solar System, with its effects varying depending on factors such as the planet's magnetic field strength, composition, and distance from the Sun. | 0.938925 |
| 9 | What is the primary reason Venus has not been colonized yet despite its similar size and gravity to Earth? | The primary reason Venus has not been colonized yet is due to its extremely inhospitable environment. While Venus is similar in size and gravity to Earth, it has a dense atmosphere composed mainly of carbon dioxide with a pressure about 92 times that of Earth's atmosphere. Additionally, Venus experiences a runaway greenhouse effect, resulting in surface temperatures of around 900 degrees Fahrenheit (475 degrees Celsius), which is hotter than the surface of Mercury, despite Venus being farther from the Sun. The planet also has sulfuric acid clouds and lacks water in its liquid form. These extreme conditions make it extremely challenging for human habitation and colonization. | 0.939441 |
| 2 | What is the name of the galaxy that contains our Solar System? | The name of the galaxy that contains our Solar System is the Milky Way. | 0.968257 |
| 0 | What is the largest planet in the Solar System? | The largest planet in the Solar System is Jupiter. | 0.969742 |


**How to use these scores?** If you have time/resources, your team can manually review low-confidence responses and provide a better human response instead. If not, you can determine a confidence threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose confidence falls below the threshold. 

```python
threshold = 0.5  # set this after inspecting responses around different confidence ranges 
if confidence_score < threshold:
    response = response + "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"
```
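To pick a threshold by inspecting responses in different confidence ranges, it can help to group the responses into score buckets first. The sketch below does this in plain Python. The helper name and bucket edges are illustrative, and the responses are abbreviated versions of results shown earlier, paired with their reported scores.

```python
def bucket_by_confidence(rows, edges=(0.25, 0.5, 0.75)):
    """Group (response, score) rows into score ranges for manual inspection."""
    bounds = [0.0, *edges, 1.0]
    labels = [f"{lo:.2f}-{hi:.2f}" for lo, hi in zip(bounds[:-1], bounds[1:])]
    buckets = {label: [] for label in labels}
    for response, score in rows:
        # Count how many edges the score reaches; that count is the bucket index.
        idx = sum(score >= e for e in edges)
        buckets[labels[idx]].append(response)
    return buckets


# Abbreviated responses with the confidence scores reported earlier.
rows = [
    ("Sunguard", 0.180478),
    ("17 dogs", 0.331541),
    ("Ham", 0.576107),
    ("Milky Way", 0.968257),
    ("Jupiter", 0.969742),
]
for label, responses in bucket_by_confidence(rows).items():
    print(label, responses)
```

Reading a few responses from each bucket shows where answers start to look unreliable, which is a reasonable place to set the threshold.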

## Application: Estimating the quality of arbitrary responses (find bad data in sequence-to-sequence dataset) 

Let's use the TLM to produce confidence scores for arbitrary responses (not necessarily produced by the TLM) by evaluating the given human responses for our Space Trivia dataset. Such sequence-to-sequence data are often used for fine-tuning LLMs (a.k.a. *instruction tuning* or *LLM alignment*), but often contain low-quality (input, output) text pairs that hamper LLM training. To detect low-quality pairs, we can score the quality of the human responses via the TLM confidence score. Again, we recommend using **batched queries** (e.g., `batch_get_confidence_score`) when using TLM to assess many (input, output) pairs from a dataset.

In [9]:
df = pd.read_csv("solar_system_dataset.csv")
df.head()

|  | prompt | human_response |
| --- | --- | --- |
| 0 | What is the largest planet in our solar system? | The largest planet in our solar system is Jvpit3r. |
| 1 | What is the significance of the asteroid belt in our solar system? | The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation. |
| 2 | How many planets are in the solar system? | There are eight planets in the solar system. |
| 3 | What defines a planet as a 'gas giant'? | A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core. |
| 4 | Is Pluto still considered a planet in our solar system? | Pluto is no longer considered a planet. It is classified as a dwarf planet. |


In [10]:
batch_size = DEFAULT_MAX_CONCURRENT_TLM_REQUESTS
tlm = studio.TLM(quality_preset="medium", max_concurrent_requests=batch_size)
idx_start = 0  # In case TLM call times-out, can resume loop at this data point

results = df.copy(deep=True) 
results["confidence_score"] = None

for i in tqdm(range(idx_start, len(results), batch_size)):
    end_index = min(i + batch_size, len(results))
    # iloc slices exclude the end position, so each batch has at most batch_size rows
    batch_instructions = results.iloc[i:end_index]["prompt"]
    batch_responses = results.iloc[i:end_index]["human_response"]

    # Concurrently score batch_size (prompt, response) pairs in a single batched query
    batch_scores = tlm.batch_get_confidence_score(
        batch_instructions, batch_responses, retries=1
    )

    # loc slices include the end label, hence end_index - 1
    results.loc[i:end_index - 1, "confidence_score"] = batch_scores

100%|███████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.28s/it]


The human-annotated prompt-response pairs with lower confidence scores appear to be of worse quality. From examining and verifying the results, we see a wide range of issues among those datapoints: factually inaccurate responses, truncated prompts, inaccurate information extraction from the given context, and spelling errors. Conversely, responses assigned the highest TLM confidence scores are those that provide a direct and accurate answer to the prompt.

In [11]:
results.sort_values(by="confidence_score").head()

|  | prompt | human_response | confidence_score |
| --- | --- | --- | --- |
| 10 | Does the Moon qualify as a planet in our solar system? | The Moon is considered a planet in our solar system due to its size and orbit around the Earth. | 0.005398 |
| 13 | Classify the following as planets or moons: E | arth, Europa, Titan, Neptune, Ganymede | 0.127785 |
| 7 | Mars, with its thin atmosphere and cold desert landscape, has surface features that include large volcanoes like Olympus Mons, the largest in the solar system, and valleys such as Valles Marineris. Evidence of water ice beneath its surface and dry river beds suggest it once had liquid water. What suggests Mars could currently support Earth-like life? | The presence of large bodies of surface water and a thick, oxygen-rich atmosphere. | 0.423528 |
| 9 | Why is Venus the hottest planet in the solar system? | Venus is the hottest planet in the solar system because it is the closest to the Sun. | 0.527096 |
| 0 | What is the largest planet in our solar system? | The largest planet in our solar system is Jvpit3r. | 0.538242 |


In [12]:
results.sort_values(by="confidence_score").tail()

|  | prompt | human_response | confidence_score |
| --- | --- | --- | --- |
| 6 | Which planet is often called Earth's twin and why? | Venus is often called Earth's twin because it is similar in size, mass, and composition. | 0.931137 |
| 11 | Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color? | Iron oxide or rust gives Mars its reddish color. | 0.936661 |
| 1 | What is the significance of the asteroid belt in our solar system? | The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation. | 0.938607 |
| 4 | Is Pluto still considered a planet in our solar system? | Pluto is no longer considered a planet. It is classified as a dwarf planet. | 0.939503 |
| 2 | How many planets are in the solar system? | There are eight planets in the solar system. | 0.981704 |


To get the most reliable model via LLM fine-tuning, first filter out the lowest-quality (prompt, response) pairs from your dataset. If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune *any* LLM you want to (even though the data curation was based on TLM confidence scores).
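The filtering step can be as simple as splitting the dataset by score, as in the sketch below. The `filter_pairs` helper and the `min_score` cutoff of 0.6 are illustrative choices, and the three pairs and their scores are taken from the results shown earlier.

```python
def filter_pairs(pairs, scores, min_score=0.6):
    """Split (prompt, response) pairs into kept and flagged lists by confidence score."""
    kept, flagged = [], []
    for pair, score in zip(pairs, scores):
        (kept if score >= min_score else flagged).append(pair)
    return kept, flagged


# Three (prompt, human_response) pairs with their reported confidence scores.
pairs = [
    (
        "Does the Moon qualify as a planet in our solar system?",
        "The Moon is considered a planet in our solar system due to its size and orbit around the Earth.",
    ),
    (
        "Why is Venus the hottest planet in the solar system?",
        "Venus is the hottest planet in the solar system because it is the closest to the Sun.",
    ),
    (
        "How many planets are in the solar system?",
        "There are eight planets in the solar system.",
    ),
]
scores = [0.005398, 0.527096, 0.981704]

kept, flagged = filter_pairs(pairs, scores)
print(len(kept), len(flagged))  # 1 2
```

The flagged pairs are the ones to drop or hand to a reviewer for correction before fine-tuning.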

## Questions

We’d love to hear any feedback you have, and as always, we’re available to answer any questions. The best place to ask is in our [Community Slack](https://cleanlab.ai/slack), or via email: [support@cleanlab.ai](mailto:support@cleanlab.ai)

**Note**: This beta version of TLM is not yet optimized for speed (or long contexts). Focus mainly on the quality of the results you’re getting, and know that the inference latency (and context length) will be greatly improved shortly as we build out the supporting infrastructure. If results are taking a long time to come back, too many TLM users may be hitting rate limits, in which case try using a lower `quality_preset`, shortening your prompt, or waiting until later. We are increasing our infrastructure capacity to meet the surging beta demand.