# Trustworthy Language Model (TLM)

:::info

This feature is in beta and requires a Cleanlab Studio account. For higher token limits, email [sales@cleanlab.ai](mailto:sales@cleanlab.ai).

:::

Large Language Models can act as powerful reasoning engines for solving problems and answering questions, but they are prone to “hallucinations”, where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to automatically tell whether an output is good or not.

Cleanlab TLM is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.

For example, with a standard LLM:

> **Question**: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual's car without a warrant? <br/>
> **Answer**: **The Fourth Amendment.**
>
> **Question**: What is 57834849 + 38833747? <br/>
> **Answer**: **96668696**

It’s difficult to tell which of these answers is reliable and which is not. With Cleanlab TLM, however, each answer comes with a **confidence score** that can guide how you use the output (e.g. use it directly if the score is above a certain threshold, otherwise flag the response for human review):

> **Question**: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual's car without a warrant? <br/>
> **Answer**: <span style={{color: '#448361'}}>The Fourth Amendment.</span> <br/>
> **Confidence**: `0.765`
> 
> **Question**: What is 57834849 + 38833747? <br/>
> **Answer**: <span style={{color: '#D44C47'}}>96668696</span> <br/>
> **Confidence**: `0.245`
> 
> **Question**: What is 100 + 300? <br/>
> **Answer**: <span style={{color: '#448361'}}>400</span> <br/>
> **Confidence**: `0.938`
> 
> **Question**: Which part of the human body produces insulin? <br/>
> **Answer**: <span style={{color: '#448361'}}>the pancreas</span> <br/>
> **Confidence**: `0.759`
> 
> **Question**: What color are the two stars on the national flag of Syria? <br/>
> **Answer**: <span style={{color: '#D44C47'}}>red and black</span> <br/>
> **Confidence**: `0.173`

## Installing Cleanlab TLM

The `cleanlab-studio` client can be installed using pip:

```bash
pip install -U cleanlab_studio
```

## Using the TLM

To use Cleanlab TLM, you must have a [Cleanlab Studio](https://cleanlab.ai/#solutions) account. If you haven't yet signed up for an account, you can do so [here](https://cleanlab.typeform.com/to/NLnU1XZF). If you've already signed up, check your email for a personal login link.

After installing the client, you can query the TLM as follows:

```python
from cleanlab_studio import Studio

# get API key from here: https://app.cleanlab.ai/account
studio = Studio('<API key>')
tlm = studio.TLM()

output = tlm.prompt('<your prompt>')
```

The TLM’s `output` will be a dict with two fields:

```python
{
  "response": "<response>",  # string like you'd get back from a traditional LLM
  "confidence_score": "<confidence_score>",  # placeholder for a numerical value between 0 and 1
}
```

The score quantifies how confident you can be that the response is good (higher values indicate greater confidence). It combines estimates of both *aleatoric* and *epistemic* uncertainty into an overall gauge of confidence. The TLM is most useful when your prompts take the form of a question with a definite answer, in which case the score quantifies our confidence that the LLM response is *correct*. To boost the *reliability* of your Generative AI applications, add contingency plans that override LLM answers whose confidence falls below some threshold (e.g. route to a human for an answer, append a disclaimer that the answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
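The contingency logic described above can be sketched as a small routing function. This is a minimal illustration rather than part of the Cleanlab client: the `route_response` helper, the `0.5` threshold, and the disclaimer text are assumptions made for the example, and the input is assumed to be a dict shaped like the TLM `output` shown earlier.

```python
from typing import Any, Dict

# Hypothetical threshold; tune it on your own data.
CONFIDENCE_THRESHOLD = 0.5


def route_response(
    output: Dict[str, Any], threshold: float = CONFIDENCE_THRESHOLD
) -> Dict[str, Any]:
    """Decide how to use a TLM output based on its confidence score.

    `output` is assumed to look like
    {"response": str, "confidence_score": float}.
    """
    score = output.get("confidence_score")
    # Assumption: a missing score is treated as untrusted.
    if score is not None and float(score) >= threshold:
        return {"answer": output["response"], "needs_review": False}
    # Low confidence: append a disclaimer and flag for human review.
    return {
        "answer": output["response"] + " (Note: this answer is uncertain.)",
        "needs_review": True,
    }
```

You could call `route_response(tlm.prompt('<your prompt>'))` and send any result with `needs_review` set to `True` to a human reviewer.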

## Advanced Usage

To control reliability/compute trade-offs, you can pass different **quality presets** to the TLM. The default preset is `low`. For some use cases this will be enough, but higher quality presets produce *better* model responses and *more reliable* associated confidence scores (at the cost of extra computation).

```python
# supported quality presets are: 'best','high','medium','low','base'

tlm = studio.TLM(quality_preset='high')
output = tlm.prompt('<your prompt>')
```

**Details about the TLM quality presets**: 

- `best` and `high` will improve the LLM responses themselves, with `best` also returning the most reliable confidence scores.
- `medium` and `low` will return standard LLM responses along with associated confidence scores, with `medium` producing more reliable confidence scores than `low`.
- `base` will not return any confidence score, just an output response. This option is similar to using your favorite LLM. It helps you to compare the enhanced responses from `best` and `high` quality presets with a standard LLM, as well as the value of the additional confidence scores returned by TLM.
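To see how the presets differ on your own prompts, you can run the same prompt under several of them. The sketch below uses only the `studio.TLM(quality_preset=...)` and `tlm.prompt(...)` calls shown above; the `compare_presets` helper is a hypothetical name introduced for illustration.

```python
from typing import Any, Dict, Sequence


def compare_presets(
    studio: Any, prompt: str, presets: Sequence[str] = ("base", "best")
) -> Dict[str, Any]:
    """Run one prompt under several quality presets and collect the outputs."""
    outputs = {}
    for preset in presets:
        # One TLM instance per preset, as in the snippet above.
        tlm = studio.TLM(quality_preset=preset)
        outputs[preset] = tlm.prompt(prompt)
    return outputs
```

With a real `studio` object, `compare_presets(studio, '<your prompt>')` returns a dict keyed by preset name, so you can compare the `base` response with the `best` one side by side.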

**Benchmark:** Accuracy of answers from OpenAI LLM (GPT-3.5 Turbo) vs. TLM with `quality_preset='best'` (across 4 different Q&A datasets from different domains)

| Dataset | OpenAI LLM | Cleanlab TLM |
| --- | --- | --- |
| GSM8K | 47% | 69% |
| CSQA | 72% | 73% |
| SVAMP | 75% | 82% |
| TriviaQA | 73% | 76% |

These benchmarks also reveal that, in the vast majority of cases, the confidence scores are lower for incorrect TLM answers than for correct ones. You can therefore use these scores to alert you to LLM responses that are likely untrustworthy.
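If you have a small set of prompts with known answers, one way to choose a threshold is to check how accurate the answers are among those scoring at or above it. The sketch below is an illustration only: `accuracy_above_threshold` is a hypothetical helper, and the sample data reuses the five example questions from the top of this page (green answers marked correct, red answers marked incorrect).

```python
from typing import Sequence, Tuple


def accuracy_above_threshold(
    scores: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy) for answers with score >= threshold.

    coverage: fraction of all answers that are kept.
    accuracy: fraction of kept answers that are correct (NaN if none kept).
    """
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(kept) / len(scores) if scores else 0.0
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Confidence scores and correctness of the five example answers above.
example_scores = [0.765, 0.245, 0.938, 0.759, 0.173]
example_correct = [True, False, True, True, False]

for t in (0.0, 0.5):
    coverage, accuracy = accuracy_above_threshold(example_scores, example_correct, t)
    print(f"threshold={t}: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

A higher threshold generally trades coverage (fewer answers used directly) for accuracy among the answers that remain.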

## Scoring the confidence of a given response

You can also use TLM to compute a confidence score for any response to a given prompt. The response does not need to come from TLM; it could even be human-written. Simply pass a prompt-response pair to the TLM and it will return a numerical score quantifying our confidence that this is a good response.

```python
confidence_score = tlm.get_confidence_score(
    '<your prompt>',
    response='<your response>'
)
```
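One use of this is auditing an existing set of prompt-response pairs (for example, human-written answers). The sketch below assumes `get_confidence_score` returns a plain number, as described above; `flag_low_confidence` and the `0.5` threshold are hypothetical choices made for illustration.

```python
from typing import Any, Iterable, List, Tuple


def flag_low_confidence(
    tlm: Any, pairs: Iterable[Tuple[str, str]], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return the (prompt, response, score) triples scoring below `threshold`."""
    flagged = []
    for prompt, response in pairs:
        # Same call as in the snippet above, applied to each pair in turn.
        score = tlm.get_confidence_score(prompt, response=response)
        if score < threshold:
            flagged.append((prompt, response, score))
    return flagged
```

The flagged triples can then be sent for manual review or regenerated.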

## Questions

We’d love to hear any feedback you have, and as always, we’re available to answer any questions. The best place to ask is in our [Community Slack](https://cleanlab.ai/slack), or via email: [support@cleanlab.ai](mailto:support@cleanlab.ai)

**Note**: This beta version of TLM is not yet optimized for speed or long contexts. Focus mainly on the quality of the results you’re getting, and know that inference latency and context length will be greatly improved shortly as we build out the supporting infrastructure. This beta version does not yet support batching, so for now use a `for` loop to iterate over a dataset of many prompts. If results are taking a long time, many TLM users may be hitting rate limits at once, in which case try decreasing the `quality_preset`, shortening your prompt, or waiting until later. We are increasing our infrastructure capacity to meet the surging beta demand.
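The `for` loop mentioned in the note can be sketched as follows. This is a minimal example under stated assumptions: `prompt_many` is a hypothetical helper, and because the client's exception types are not documented here, it catches a generic `Exception` and records `None` for any prompt that fails, so one failure does not discard the rest of the run.

```python
from typing import Any, Dict, Iterable, List, Optional


def prompt_many(tlm: Any, prompts: Iterable[str]) -> List[Optional[Dict[str, Any]]]:
    """Query the TLM once per prompt, keeping results in input order."""
    results: List[Optional[Dict[str, Any]]] = []
    for prompt in prompts:
        try:
            results.append(tlm.prompt(prompt))
        except Exception as err:  # assumption: specific error types are unknown
            print(f"Prompt failed ({err!r}); recording None")
            results.append(None)
    return results
```

Entries that come back as `None` can be retried later, for instance with a lower `quality_preset`.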