# Trustworthy Language Model (TLM)

<head>
  <meta name="title" content="Trustworthy Language Model (TLM)"/>
  <meta property="og:title" content="Trustworthy Language Model (TLM)"/>
  <meta name="twitter:title" content="Trustworthy Language Model (TLM)" />
  <meta name="image" content="/img/tlm-chat.png" />
  <meta property="og:image" content="/img/tlm-chat.png" />
  <meta name="description" content="A more reliable LLM that quantifies trustworthiness for every output and can detect bad responses."  />
  <meta property="og:description" content="A more reliable LLM that quantifies trustworthiness for every output and can detect bad responses." />
  <meta name="twitter:description" content="A more reliable LLM that quantifies trustworthiness for every output and can detect bad responses." />
</head>



:::info

This feature is in beta, and requires a Cleanlab Studio API Token to use. To get the API token, you must first create a Cleanlab Studio account and access the account page [here](https://app.cleanlab.ai/account). Additional instructions on creating your account and getting the API token can be found in the [Cleanlab Studio Python API Guide](/guide/quickstart/api/).

For higher limits, email: [sales@cleanlab.ai](mailto:sales@cleanlab.ai).

:::

Large Language Models can act as powerful reasoning engines for solving problems and answering questions, but they are prone to “hallucinations”, where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to automatically tell whether an output is good or not.

Cleanlab's Trustworthy Language Model (TLM) is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.

<img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/tlm-chat.svg" alt="TLM chat interface" width="600"/>

For example, with a standard LLM:

> **Question**: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual's car without a warrant? <br/>
> **Answer**: **The Fourth Amendment.**
> 
> 
> **Question**: What is 57834849 + 38833747? <br/>
> **Answer**: **96668696**

It’s difficult to tell when the LLM is answering confidently and when it is not. With Cleanlab's TLM, however, each answer comes with a **trustworthiness score**. This score can guide how you use the LLM output (e.g. use it directly if the score is above a certain threshold, otherwise flag the response for human review):

> **Question**: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual's car without a warrant? <br/>
> **Answer**: <span style={{color: '#448361'}}>The Fourth Amendment.</span> <br/>
> **Trustworthiness Score**: `0.765`
> 
> **Question**: What is 57834849 + 38833747? <br/>
> **Answer**: <span style={{color: '#D44C47'}}>96668696</span> <br/>
> **Trustworthiness Score**: `0.245`
> 
> **Question**: What is 100 + 300? <br/>
> **Answer**: <span style={{color: '#448361'}}>400</span> <br/>
> **Trustworthiness Score**: `0.938`
> 
> **Question**: Which part of the human body produces insulin? <br/>
> **Answer**: <span style={{color: '#448361'}}>the pancreas</span> <br/>
> **Trustworthiness Score**: `0.759`
> 
> **Question**: What color are the two stars on the national flag of Syria? <br/>
> **Answer**: <span style={{color: '#D44C47'}}>red and black</span> <br/>
> **Trustworthiness Score**: `0.173`


Our TLM is useful not only for producing more accurate LLM responses and catching bad LLM responses; it can also catch bad data in *any* prompt-response dataset (e.g. bad human-written responses).

## Installing Cleanlab TLM

Using TLM requires a [Cleanlab Studio](https://app.cleanlab.ai/) account. Sign up for one [here](https://cleanlab.ai/signup/) if you haven't yet. If you've already signed up, check your email for a personal login link.

The `cleanlab-studio` client can be installed using pip:

In [None]:
%pip install --upgrade cleanlab-studio  # at least v2.0.0 required

## Using the TLM

You can query the TLM as follows:

In [2]:
from cleanlab_studio import Studio

# Get API key from here: https://app.cleanlab.ai/account after creating a Cleanlab Studio account.
# Instructions to create account can be found under Guide -> Quickstart -> Cleanlab Studio Python API -> Creating an API Key

studio = Studio("<API key>")

tlm = studio.TLM()

output = tlm.prompt("<your prompt>")

The TLM’s `output` will be a dict with two fields:

```
{
  "response": "<response>",  # string like you'd get back from any standard LLM
  "trustworthiness_score": <trustworthiness_score>  # numerical value between 0 and 1
}
```

The score quantifies how confident you can be that the response is good (higher values indicate greater trustworthiness). These scores combine estimates of both *aleatoric* and *epistemic* uncertainty to provide an overall gauge of trustworthiness.  You may find the TLM most useful when your prompts take the form of a question with a definite answer (in which case the returned score quantifies our confidence that the LLM response is *correct*). Boost the *reliability* of your Generative AI applications by adding contingency plans to override LLM answers whose trustworthiness score falls below some threshold (e.g., route to human for answer, append disclaimer that answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
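As a minimal sketch of one such contingency plan, the snippet below routes a TLM-style output dict either to direct use or to human review. The function name `route_output`, the fallback message, and the `0.5` threshold are illustrative assumptions rather than part of the Cleanlab API; the two example dicts reuse the answers and scores from the introduction.

```python
def route_output(output, threshold=0.5):
    """Decide what to do with a TLM-style output dict.

    `threshold` is a placeholder value (an assumption); choose it after
    inspecting responses from your own application.
    """
    score = output["trustworthiness_score"]
    if score >= threshold:
        return "use_directly", output["response"]
    # Contingency plan: hold the answer back and escalate to a person.
    return "human_review", "This question has been forwarded to a human expert."


# Dicts shaped like the TLM output described above (values from the intro examples).
examples = [
    {"response": "The Fourth Amendment.", "trustworthiness_score": 0.765},
    {"response": "96668696", "trustworthiness_score": 0.245},
]

routed = [route_output(out) for out in examples]
for action, text in routed:
    print(f"{action}: {text}")
```

In a real application you would pass the dict returned by `tlm.prompt(...)` instead of the hard-coded examples.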

## Advanced Usage

To control reliability/compute trade-offs, you can pass in different **quality presets** to the TLM. The default preset is `medium`.  For some use cases, this will be enough, but using the highest quality presets will produce *better* model responses and *more reliable* associated trustworthiness scores (at the cost of extra computation).

For efficient computation over big datasets with many requests, TLM supports processing multiple prompts concurrently in a single batched call.

In [3]:
tlm = studio.TLM(
    quality_preset="best" # supported quality presets are: 'best','high','medium','low','base'
)

# Run a single prompt using the preset parameters:
output = tlm.prompt("<your prompt>")

# Or run multiple prompts simultaneously in a batch:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])

**Details about the TLM quality presets**: 

- `best` and `high` presets will improve the LLM responses themselves, with `best` also returning the most reliable trustworthiness scores.
- `medium` and `low` presets will return standard LLM responses along with associated trustworthiness scores, with `medium` producing more reliable trustworthiness scores than `low`.
- `base` will not return any trustworthiness score, just a standard LLM output response. This option is similar to using your favorite LLM API. It helps you to compare the enhanced responses from `best` and `high` quality presets with a standard LLM, as well as the value of the additional trustworthiness scores returned by TLM.

Note: The range of the returned trustworthiness scores may slightly differ depending on the preset you select. We recommend not directly comparing the magnitude of TLM scores across different presets (settle on one preset before you fix any thresholds). What remains comparable across different presets is how these TLM scores _rank_ data or LLM responses from most to least confidently good.

**Benchmark:**  Accuracy of answers from OpenAI LLM (GPT-3.5 Turbo) vs. TLM with `quality_preset="best"` (across 4 different Q&A datasets from different domains)

| Dataset | OpenAI LLM | Cleanlab TLM |
| --- | --- | --- |
| GSM8K | 47% | 69% |
| CSQA | 72% | 73% |
| SVAMP | 75% | 82% |
| TriviaQA | 73% | 76% |

Our benchmarks also reveal that, in the vast majority of cases, the trustworthiness scores are lower for incorrect TLM answers than correct answers. Thus you can safely rely on these scores to alert you about LLM responses that are untrustworthy.

**Details about querying TLM with multiple prompts:**

To avoid overwhelming our API with requests, there's a maximum number of tokens per minute that you can query the TLM with (*rate limit*). If running multiple prompts simultaneously in a batch, you'll need to stay under the rate limit, but you'll also want to optimize for getting all results quickly. The following parameters can help:

- **quality_preset**: Each preset internally runs a varied number of prompts with the TLM to calculate the `trustworthiness_score`. A higher quality preset results in a more reliable `trustworthiness_score`, but internally uses more tokens per minute (and may take longer to get results).

- **max_tokens**: If possible, try to reduce `max_tokens` based on the expected response length for your prompts. Reducing this value will decrease the chance of rate limit errors and will return results faster.
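If a very large batch still runs into the rate limit, one possible workaround is to submit the prompts in smaller chunks. The helper below is a hypothetical sketch, not a Cleanlab API: `prompt_in_chunks`, `chunk_size`, and `pause_seconds` are names and defaults assumed for illustration, and a stand-in function replaces `tlm.prompt` so the example runs offline.

```python
import time


def prompt_in_chunks(prompt_fn, prompts, chunk_size=20, pause_seconds=0.0):
    """Call `prompt_fn` on successive slices of `prompts` and concatenate the results.

    `prompt_fn` is any callable that accepts a list of prompts and returns a
    list of outputs in the same order (e.g. `tlm.prompt`).
    """
    outputs = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        outputs.extend(prompt_fn(chunk))
        # Optionally wait between chunks (but not after the last one).
        if pause_seconds and start + chunk_size < len(prompts):
            time.sleep(pause_seconds)
    return outputs


# Stand-in for `tlm.prompt` that records the size of each batch it receives.
batch_sizes = []


def fake_prompt(batch):
    batch_sizes.append(len(batch))
    return [{"response": p.upper(), "trustworthiness_score": None} for p in batch]


chunked_outputs = prompt_in_chunks(fake_prompt, ["a", "b", "c", "d", "e"], chunk_size=2)
print(batch_sizes)  # [2, 2, 1]
```

With a real client, the call would look like `prompt_in_chunks(tlm.prompt, df["prompt"].to_list())`; a suitable chunk size depends on your rate limit and prompt lengths.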

## Scoring the trustworthiness of a given response

You can also use TLM to compute a trustworthiness score for *any* response to a given prompt. The response does not need to come from TLM, and could be human-written. Simply pass a prompt-response pair to the TLM and it will return a numerical score quantifying our confidence that this is a *good* response.

In [4]:
trustworthiness_score = tlm.get_trustworthiness_score("<your prompt>", response="<your response>")

This method can also operate over lists of multiple prompts and responses, which will be faster than a for loop over each pair.

## Application: Determining which LLM responses are untrustworthy

TLM trustworthiness scores are often most interpretable when comparing them over a large dataset. Be sure to use **batched queries** (i.e. calling `prompt` with a list of prompts) for efficiently running TLM on many examples from a dataset.

To show how TLM trustworthiness scores can catch low-quality model outputs, let's consider a dataset of various Space-related trivia questions. We can use each question as a prompt for the TLM (just as you would with any other LLM) and record its response and associated trustworthiness score.

In [5]:
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'

In [6]:
df = pd.read_csv("solar_system_example_prompts.csv")
df.head()

Unnamed: 0,prompt
0,What is the largest planet in the Solar System?
1,"As of 2024, how many dogs have reached outer space?"
2,What is the name of the galaxy that contains our Solar System?
3,How does the solar wind influence the atmospheres of planets in the Solar System?
4,Fetch me the current trajectory of Pluto's orbit from nasa.gov


In [9]:
tlm = studio.TLM(quality_preset="best")

results = df.copy(deep=True)

outputs = tlm.prompt(df["prompt"].to_list())
results[["response","trustworthiness_score"]] = pd.DataFrame(outputs)

To see which model responses are least trustworthy, let's sort the data by the trustworthiness scores. Observe how model responses with the lowest trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as *hallucinations*. After quick fact-checking on sample responses, we've verified that "SpectraShield" is not a real product, and that no chimpanzee, named Ham or otherwise, has ever reached the Moon.

On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed. Now we can automatically detect which LLM responses are confidently good vs. not!

In [10]:
results.sort_values(by="trustworthiness_score").head()

Unnamed: 0,prompt,response,trustworthiness_score
8,Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation?,"One brand of sunscreen that is recommended for astronauts on spacewalks to protect against solar radiation is ""SpectraShield"" by Southern Research. This sunscreen was developed specifically for NASA and provides protection against both ultraviolet A (UVA) and ultraviolet B (UVB) radiation. It is designed to be highly effective in extreme environments, such as during spacewalks where astronauts are exposed to intense solar radiation.",0.428642
5,"How many years from now will Earth, Saturn, and the Sun align?","It is important to note that Earth, Saturn, and the Sun align at specific intervals called oppositions. Opposition occurs when Earth and Saturn are on opposite sides of the Sun, and hence, the three objects are relatively aligned in a straight line.\n\nThe time it takes for Earth and Saturn to align at opposition is approximately 378 days (or about 1 year and 13 days). This is because Earth takes approximately 365.25 days to orbit the Sun, while Saturn takes about 29.5 Earth years to complete its orbit.\n\nHowever, the alignment of all three objects (Earth, Saturn, and the Sun) in perfect alignment is quite rare and does not occur at every opposition. The exact alignment of the three objects can happen, but the occurrence is infrequent and depends on various factors such as the tilt of the orbits and the specific positions of the objects at a given time.\n\nTherefore, predicting the exact number of years from now for such a perfect alignment is impossible, as it can vary greatly. Within a few hundred or thousand years, some alignments may occur, but they cannot be accurately predicted without detailed mathematical calculations considering the specific positions and movements of the objects.",0.550901
12,What is the name of the first chimpanzee that reached the Moon?,The first chimpanzee that reached the Moon was named Ham.,0.572259
1,"As of 2024, how many dogs have reached outer space?","As of now, no dogs have reached outer space as part of any known space missions.",0.624848
15,What was the first meal that astronauts ate on the Moon?,"The first meal that astronauts ate on the Moon was a specially prepared food item called ""The Apollo Moon Meal."" It consisted of a package containing bacon squares, peaches, sugar cookie cubes, pineapple grapefruit drink, and coffee. This meal was consumed by the Apollo 11 astronauts Neil Armstrong and Buzz Aldrin during their historic moon landing mission in July 1969.",0.666584


In [11]:
results.sort_values(by="trustworthiness_score").tail()

Unnamed: 0,prompt,response,trustworthiness_score
7,Analyze the potential for life on Enceladus based on current scientific findings.,"Based on current scientific findings, there is strong evidence to suggest that Enceladus, one of Saturn's moons, has the potential to harbor life.\n\n1. Liquid Water: Enceladus is known to have a subsurface ocean of liquid water beneath its icy crust. This water is in contact with a rocky core, creating a ideal environment for life as we know it to potentially exist.\n\n2. Hydrothermal Activity: Observations from the Cassini spacecraft revealed that Enceladus has active hydrothermal vents on its ocean floor. These vents release heat and minerals, similar to the deep-sea hydrothermal vents on Earth that support diverse ecosystems. This indicates a possible source of energy and nutrients for potential life forms.\n\n3. Organic Molecules: In 2014, Cassini detected complex organic molecules, including carbon-based compounds like methane and other hydrocarbons, in Enceladus' plumes of water vapor and ice particles. These organic molecules are essential building blocks for life as we know it.\n\n4. Energy Source: In addition to hydrothermal vents, Enceladus also has tidal heating. The gravitational interactions between Saturn and its nearby moons cause tidal forces, generating heat within Enceladus. This heating could provide another potential energy source for life to thrive, as seen in Earth's deep-sea hydrothermal vents.\n\n5. Prebiotic Chemistry: Recent laboratory experiments that recreated the conditions of Enceladus' subsurface ocean suggest that prebiotic chemistry, the precursor to life, could occur on this moon. Scientists discovered that complex organic molecules, such as amino acids, can form under Enceladus-like conditions, further supporting the potential for life.\n\nWhile all these findings are promising, it is important to note that direct evidence of life on Enceladus has not been found yet. 
Future missions, like NASA's planned Europa Clipper mission, will further explore this moon to gather more data on its potential habitability.\n\nIn conclusion, based on the current scientific findings, Enceladus possesses several factors that make it a prime candidate for hosting life. The presence of liquid water, hydrothermal activity, complex organic molecules, and potential energy sources provide a compelling case for the existence of life beyond Earth on this intriguing moon.",0.917308
3,How does the solar wind influence the atmospheres of planets in the Solar System?,"The solar wind is a stream of charged particles, mainly protons and electrons, that are constantly emitted by the Sun. These particles, along with their associated magnetic field, have several effects on the atmospheres of planets in the Solar System:\n\n1. Atmospheric Escape: The solar wind can cause atmospheric particles, particularly the lighter ones like hydrogen and helium, to gain enough energy and escape the gravitational pull of a planet. This process, known as atmospheric escape, can lead to a gradual loss of a planet's atmosphere over time. For example, Mars is believed to have lost most of its early atmosphere due to the solar wind.\n\n2. Ionization and Aurora: When the solar wind interacts with a planet's magnetic field, it can cause the atoms and molecules in the upper atmosphere to become ionized. This ionization can lead to the formation of auroras, also known as polar lights. The charged particles from the solar wind excite the atoms and molecules in the atmosphere, causing them to emit colorful lights.\n\n3. Magnetospheric Effects: The solar wind exerts pressure on a planet's magnetic field, compressing it on the side facing the Sun and causing it to stretch out in the opposite direction. This interaction forms a region around the planet called the magnetosphere. The solar wind can influence the size and shape of the magnetosphere, which in turn affects the interaction between a planet's atmosphere and the solar wind.\n\n4. Solar-wind-induced Atmospheric Erosion: High-energy particles in the solar wind can directly impact the upper atmosphere of planets, potentially causing erosion. These particles can collide with atmospheric particles, breaking them apart, and altering the composition and structure of a planet's atmosphere. 
The erosion caused by the solar wind is particularly prominent on bodies that lack a protective magnetosphere, like the Moon.\n\nOverall, the solar wind plays a significant role in shaping the atmospheres of planets in the Solar System. Its effects can range from atmospheric escape and erosion to the formation of auroras and the modification of a planet's magnetosphere.",0.936951
10,"In what year did the United Nations grant the Moon its independence, and who is the current lunar ambassador?","The United Nations has not granted the Moon its independence, and there is no current lunar ambassador. As of now, the Moon is not considered an independent entity and has no recognized ambassador.",0.938256
0,What is the largest planet in the Solar System?,The largest planet in the Solar System is Jupiter.,0.939693
2,What is the name of the galaxy that contains our Solar System?,The name of the galaxy that contains our Solar System is the Milky Way.,0.941339


**How to use these scores?** If you have time/resources, your team can manually review low-trustworthiness responses and provide a better human response instead. If not, you can determine a trustworthiness threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose trustworthiness falls below the threshold. 

```python
threshold = 0.5  # set this after inspecting responses around different trustworthiness ranges 
if trustworthiness_score < threshold:
    response = response + "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"
```


The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend choosing thresholds on an **application-specific** basis. First consider the *relative* trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.
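One way to act on relative trustworthiness rather than absolute magnitude is to flag a fixed fraction of the lowest-ranked responses. The sketch below is only an illustration: `flag_least_trustworthy` and the `0.3` fraction are assumptions, and the scores are the ten values displayed in the sorted outputs above (keyed by row index), not the full dataset.

```python
def flag_least_trustworthy(scores_by_id, fraction=0.2):
    """Return the ids of the lowest-scoring `fraction` of items (at least one)."""
    ranked = sorted(scores_by_id, key=scores_by_id.get)  # ids, least trustworthy first
    n_flagged = max(1, round(len(ranked) * fraction))
    return ranked[:n_flagged]


# Row index -> trustworthiness score, copied from the head/tail outputs above.
scores = {
    8: 0.428642, 5: 0.550901, 12: 0.572259, 1: 0.624848, 15: 0.666584,
    7: 0.917308, 3: 0.936951, 10: 0.938256, 0: 0.939693, 2: 0.941339,
}

flagged = flag_least_trustworthy(scores, fraction=0.3)
print(flagged)  # [8, 5, 12]
```

Here the flagged rows are the sunscreen, planetary alignment, and chimpanzee questions. Because the rule depends only on rank, it is unaffected by shifts in the overall score range.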

## Application: Estimating the quality of arbitrary responses (find bad data in a sequence-to-sequence dataset)

Let's see how the TLM can produce trustworthiness scores for arbitrary responses (not necessarily produced by the TLM) by evaluating the given human responses in our Space Trivia dataset. Such sequence-to-sequence data are often used for fine-tuning LLMs (a.k.a. *instruction tuning* or *LLM alignment*), but often contain low-quality (input, output) text pairs that hamper LLM training. To detect low-quality pairs, we can score the quality of the human responses via the TLM trustworthiness score. Again we recommend using **batched queries** (i.e. by passing in lists of prompts and responses to `get_trustworthiness_score`) when using TLM to assess many (input, output) pairs from a dataset.

In [12]:
df = pd.read_csv("solar_system_dataset.csv")
df.head()

Unnamed: 0,prompt,human_response
0,What is the largest planet in our solar system?,The largest planet in our solar system is Jvpit3r.
1,What is the significance of the asteroid belt in our solar system?,"The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation."
2,How many planets are in the solar system?,There are eight planets in the solar system.
3,What defines a planet as a 'gas giant'?,"A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core."
4,Is Pluto still considered a planet in our solar system?,Pluto is no longer considered a planet. It is classified as a dwarf planet.


In [13]:
tlm = studio.TLM()

results = df.copy(deep=True)

outputs = tlm.get_trustworthiness_score(df["prompt"].to_list(), df["human_response"].to_list())
results["trustworthiness_score"] = outputs


The human-annotated prompt-response pairs with lower trustworthiness scores appear to be of lower quality. From examining and verifying the results, we see a wide range of issues among those datapoints: factually inaccurate responses, truncated prompts, inaccurate information extraction given context, and spelling errors. Conversely, responses assigned the highest TLM trustworthiness scores are those that provide a direct and accurate answer to the prompt.

In [14]:
results.sort_values(by="trustworthiness_score").head()

Unnamed: 0,prompt,human_response,trustworthiness_score
10,Does the Moon qualify as a planet in our solar system?,The Moon is considered a planet in our solar system due to its size and orbit around the Earth.,0.021524
13,Classify the following as planets or moons: E,"arth, Europa, Titan, Neptune, Ganymede",0.347888
9,Why is Venus the hottest planet in the solar system?,Venus is the hottest planet in the solar system because it is the closest to the Sun.,0.467856
0,What is the largest planet in our solar system?,The largest planet in our solar system is Jvpit3r.,0.536954
8,Is Jupiter entirely made of gas with no solid areas?,"Jupiter is entirely made of gas, with no solid areas anywhere on the planet.",0.569946


In [15]:
results.sort_values(by="trustworthiness_score").tail()

Unnamed: 0,prompt,human_response,trustworthiness_score
4,Is Pluto still considered a planet in our solar system?,Pluto is no longer considered a planet. It is classified as a dwarf planet.,0.93876
6,Which planet is often called Earth's twin and why?,"Venus is often called Earth's twin because it is similar in size, mass, and composition.",0.939069
5,What planet is known for its extensive ring system?,Saturn is known for its extensive and visible ring system.,0.9397
2,How many planets are in the solar system?,There are eight planets in the solar system.,0.950901
11,"Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color?",Iron oxide or rust gives Mars its reddish color.,0.96293


If you are fine-tuning LLMs and want to produce the most reliable model: first filter out the lowest-quality (prompt, response) pairs from your dataset.  If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune *any* LLM you want to (even though the data curation was based on TLM trustworthiness scores).
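The filtering step can be sketched as follows, assuming each row carries a `trustworthiness_score`. The function name `split_by_trustworthiness` and the `0.6` cutoff are illustrative assumptions; the rows are a handful of the scored pairs shown above, with prompts omitted for brevity.

```python
def split_by_trustworthiness(rows, cutoff):
    """Split rows into (keep, review) lists based on their trustworthiness score."""
    keep = [r for r in rows if r["trustworthiness_score"] >= cutoff]
    review = [r for r in rows if r["trustworthiness_score"] < cutoff]
    return keep, review


# A few (id, human_response, score) rows copied from the outputs above.
rows = [
    {"id": 10, "human_response": "The Moon is considered a planet in our solar system due to its size and orbit around the Earth.", "trustworthiness_score": 0.021524},
    {"id": 13, "human_response": "arth, Europa, Titan, Neptune, Ganymede", "trustworthiness_score": 0.347888},
    {"id": 9, "human_response": "Venus is the hottest planet in the solar system because it is the closest to the Sun.", "trustworthiness_score": 0.467856},
    {"id": 5, "human_response": "Saturn is known for its extensive and visible ring system.", "trustworthiness_score": 0.9397},
    {"id": 2, "human_response": "There are eight planets in the solar system.", "trustworthiness_score": 0.950901},
]

# The cutoff is an assumption; inspect your own data before fixing one.
keep, review = split_by_trustworthiness(rows, cutoff=0.6)
print("fine-tune on:", [r["id"] for r in keep])
print("review or drop:", [r["id"] for r in review])
```

With the `results` DataFrame from above, the equivalent filter would be `results[results["trustworthiness_score"] >= cutoff]`.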


## Questions

We’d love to hear any feedback you have, and as always, we’re available to answer any questions. The best place to ask is in our [Community Slack](https://cleanlab.ai/slack), or via email: [support@cleanlab.ai](mailto:support@cleanlab.ai).

**Note**: This beta version of TLM is not yet optimized for speed (or long contexts). Focus mainly on the quality of the results you’re getting, and know that the inference latency (and context length) will be greatly improved shortly as we build out the supporting infrastructure. If results are taking a long time, there may be too many TLM users hitting rate limits, in which case try: choosing a lower `quality_preset`, shortening your prompt, or waiting until later to use it. We are increasing our infrastructure capacity to meet the surging beta demand.

### How does the TLM trustworthiness score work?

The TLM scores our confidence that a response is 'good' for a given request.  In question-answering applications, 'good' would correspond to whether the answer is correct or not.  In general open-ended applications, 'good' corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses.  For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.

TLM trustworthiness scores capture two aspects of uncertainty and quantify them into a holistic trustworthiness measure:

1. **aleatoric uncertainty** (*known unknowns*, i.e. uncertainty the model is aware of due to a challenging request. For instance: if a prompt is incomplete/vague)
2. **epistemic uncertainty** (*unknown unknowns*, i.e. uncertainty in the model due to not having been previously trained on data similar to this. For instance: if a prompt is very different from most requests in the LLM training corpus)

These two forms of uncertainty are mathematically quantified in the TLM through multiple operations:

- **self-reflection**: a process in which the LLM is asked to explicitly rate the response and explicitly state how confidently good this response appears to be.
- **probabilistic prediction**: a process in which we consider the *per-token probabilities* assigned by an LLM as it generates a response based on the request (auto-regressively token by token).
- **observed consistency**: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure *how contradictory* these responses are to each other (or a given response).

These operations produce various trustworthiness measures, which are combined into an *overall trustworthiness score* that captures all relevant types of uncertainty.
Extensive benchmarks reveal that TLM trustworthiness scores better detect bad responses than alternative approaches that only quantify aleatoric uncertainty, such as per-token probabilities (*logprobs*) or using an LLM to directly rate the response (*LLM evaluating LLM*).
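To make the three operations above more concrete, here is a toy sketch of how such signals *could* be computed and combined. This is not TLM's actual algorithm: every function name, the exact-match consistency check, the equal-weight average, and the input values are simplifying assumptions made purely for illustration.

```python
import math


def token_probability_score(token_logprobs):
    """Length-normalized sequence probability: exp(mean per-token log-probability)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_score(response, sampled_responses):
    """Fraction of sampled responses matching `response` after light normalization.

    Exact matching is a crude stand-in (an assumption) for measuring how
    contradictory the sampled responses are.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    matches = sum(normalize(s) == normalize(response) for s in sampled_responses)
    return matches / len(sampled_responses)


def combined_score(self_reflection, token_prob, consistency, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three signals (equal weights are an assumption)."""
    parts = (self_reflection, token_prob, consistency)
    return sum(w * p for w, p in zip(weights, parts))


# Hypothetical inputs for the prompt "What is 100 + 300?" with response "400".
token_prob = token_probability_score([math.log(0.5)] * 4)          # about 0.5
consistency = consistency_score("400", ["400", " 400 ", "401", "400"])  # 3 of 4 samples agree
self_reflection = 1.0  # pretend the LLM rated its own answer as certainly correct

score = combined_score(self_reflection, token_prob, consistency)
print(round(score, 3))
```

The point is only that signals of different kinds can be folded into a single number between 0 and 1; how TLM weighs and calibrates them is not specified here.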