# How to call the TLM REST API directly

Although the Trustworthy Language Model (TLM) officially offers a Python client library and can be used via OpenAI's Python client library, you can still use TLM from another programming language (e.g., TypeScript) by calling TLM's backend REST API directly.

Here we demonstrate how to call the REST API using Python, purely for reference. The code simply makes HTTP requests, so you can use any other programming language that has an HTTP library or tool, as long as you provide the necessary payload and headers.
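To illustrate that nothing here depends on a particular HTTP client, the following sketch builds the same kind of request using only Python's standard library. The helper names `build_tlm_request` and `send` are hypothetical, and the 60-second timeout is an arbitrary choice; the endpoint, header names, and payload field are the ones used in the examples in this tutorial. Treat it as a rough sketch under those assumptions, not an official client.

```python
import json
import urllib.request

# Endpoint taken from the Prompt API example in this tutorial.
PROMPT_URL = "https://api.cleanlab.ai/api/v0/trustworthy_llm/prompt"


def build_tlm_request(api_key, payload, url=PROMPT_URL):
    """Build (but do not send) a POST request carrying a JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request, timeout=60):
    """Send the request and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network; only send() does.
request = build_tlm_request("<API_KEY>", {"prompt": "What's the first month of the year?"})
print(request.get_method(), request.full_url)
```

Any language that can set two headers and POST a JSON body can follow the same pattern.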

In [30]:
# Define the TLM API key, model, and quality preset
# More details on models supported and quality presets can be found here: https://help.cleanlab.ai/reference/python/trustworthy_language_model/#class-tlmoptions
API_KEY = '<API_KEY>'
MODEL = "gpt-4o-mini"
QUALITY_PRESET = "medium"

## Make Prompt API request to TLM to get back a response and trustworthiness score

Note: The `confidence_score` parameter returned in the REST API response is the same as `trustworthiness_score` returned by the Python client library.

You can check out the API documentation for more details on inputs and outputs: https://help.cleanlab.ai/tlm/api/python/tlm/

### JSON payload structure

In [7]:
{
  "task": "default",
  "quality": "medium",
  "prompt": "What's the first month of the year?",
  "options": {
    "model": "gpt-4o-mini",
    "log": ["explanation"]
  }
}


### Inputs:
- `prompt` (required): prompt (or list of prompts) for TLM to evaluate, including the user's query and any system instructions.
- `task` (optional): determines details of the algorithm used for scoring LLM response trustworthiness; one of `default`, `classification`, or `code_generation`.
- `quality` (optional): quality preset that trades off the quality of TLM responses and trustworthiness scores against latency and cost.
- `options` (optional):
  - `model` (optional): underlying base LLM to use (better models yield better results, faster models yield faster/cheaper results).
  - `log` (optional): additional logs or metadata that TLM should return. For instance, include `"explanation"` here to get explanations of why a response is scored with low trustworthiness.


See more [here](https://help.cleanlab.ai/tlm/api/python/tlm/)
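As a rough illustration of how these inputs fit together, here is a minimal sketch of a payload builder. The function name `build_prompt_payload` is hypothetical, and the sketch assumes that fields marked optional can simply be left out of the JSON body; the allowed `task` values are the three listed above.

```python
# Task values listed in the Inputs section above.
ALLOWED_TASKS = {"default", "classification", "code_generation"}


def build_prompt_payload(prompt, task=None, quality=None, model=None, log=None):
    """Assemble a request payload, leaving out optional fields that were not given."""
    if not prompt:
        raise ValueError("prompt is required")
    if task is not None and task not in ALLOWED_TASKS:
        raise ValueError(f"unsupported task: {task!r}")

    payload = {"prompt": prompt}
    if task is not None:
        payload["task"] = task
    if quality is not None:
        payload["quality"] = quality

    # Only attach "options" when at least one option was provided.
    options = {}
    if model is not None:
        options["model"] = model
    if log:
        options["log"] = list(log)
    if options:
        payload["options"] = options
    return payload


payload = build_prompt_payload(
    "What's the first month of the year?",
    task="default",
    quality="medium",
    model="gpt-4o-mini",
    log=["explanation"],
)
print(payload)
```

Validating `task` on the client side is a design choice of this sketch; it catches typos before a request is sent.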

### Outputs:
- `response`: the response from the model.
- `confidence_score`: score between 0 and 1 reflecting the trustworthiness of the response. A higher score indicates higher confidence that the response is correct/good.
- `explanation`: explanation of why a response is scored with low trustworthiness, returned under `log` if the request's `log` option includes `explanation`.
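The output fields can be read from the decoded JSON as in the sketch below. The helper name `summarize_prompt_result` is hypothetical; the sample dictionary is copied from the example Prompt API output shown in this tutorial, where the explanation is nested under `log`.

```python
def summarize_prompt_result(result):
    """Pull the documented fields out of a decoded JSON response."""
    log = result.get("log") or {}
    return {
        "response": result.get("response"),
        # The REST API calls this confidence_score; the Python client
        # library calls the same value trustworthiness_score.
        "trustworthiness_score": result.get("confidence_score"),
        "explanation": log.get("explanation"),
    }


# Sample copied from the example Prompt API output in this tutorial.
sample = {
    "confidence_score": 0.9895625234542792,
    "log": {"explanation": "Did not find a reason to doubt trustworthiness."},
    "response": "The first month of the year is January.",
}
summary = summarize_prompt_result(sample)
print(summary["trustworthiness_score"])
```

Using `.get()` keeps the helper from raising if `log` was not requested and is therefore absent.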



In [9]:
import json

import requests

# Use a dedicated name per endpoint so later cells cannot overwrite it
PROMPT_URL = "https://api.cleanlab.ai/api/v0/trustworthy_llm/prompt"


def make_prompt_api_request(prompt):
    # Request body, following the JSON payload structure shown above
    payload = json.dumps({
        "task": "default",
        "quality": QUALITY_PRESET,
        "prompt": prompt,
        "options": {
            "model": MODEL,
            "log": ["explanation"]
        }
    })
    headers = {
        'authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }

    response_json = requests.post(PROMPT_URL, headers=headers, data=payload).json()
    # Drop a field that is not useful; pop() avoids a KeyError if it is absent
    response_json.pop('deberta_success', None)
    return response_json

make_prompt_api_request("What's the first month of the year?")

{'confidence_score': 0.9895625234542792,
 'log': {'explanation': 'Did not find a reason to doubt trustworthiness.'},
 'response': 'The first month of the year is January.'}

## Make Trustworthiness Score API request to TLM to get back a trustworthiness score

Note: The `confidence_score` parameter returned in the REST API response is the same as `trustworthiness_score` returned by the Python client library.

In [30]:
# Use a dedicated name per endpoint so other cells cannot overwrite it
SCORE_URL = "https://api.cleanlab.ai/api/v0/trustworthy_llm/get_confidence_score"

def make_score_api_request(prompt, response):
    # Same payload as the Prompt API, plus the response to be scored
    payload = json.dumps({
        "task": "classification",
        "quality": QUALITY_PRESET,
        "prompt": prompt,
        "response": response,
        "options": {
            "model": MODEL,
            "log": ["explanation"]
        }
    })
    headers = {
        'authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }

    response_json = requests.post(SCORE_URL, headers=headers, data=payload).json()
    # Drop a field that is not useful; pop() avoids a KeyError if it is absent
    response_json.pop('deberta_success', None)
    return response_json

make_score_api_request("Classify this text as positive or negative: 'I love this product!'", "negative")

{'confidence_score': 0.018654437417664177,
 'log': {'explanation': "The text expresses a strong positive sentiment towards the product, indicating love for it. Therefore, classifying it as negative is incorrect. A better response would be to classify it as positive. \nThis response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): \nPositive."}}

## Make Trustworthy RAG API request to evaluate your RAG system

The following evals are the default, optimized evals Cleanlab uses to evaluate your RAG system. You can choose to use a subset of these evals or even define your own custom eval(s).

More details on defining your own custom evals can be found here: https://help.cleanlab.ai/tlm/use-cases/tlm_rag/#custom-evals

In [53]:
DEFAULT_EVALS = [
    {
        "name": "context_sufficiency",
        "criteria": "Determine if the Document contains 100% of the information needed to answer the Question. If any external knowledge or assumptions are required, it does not meet the criteria. Each Question component must have explicit support in the Document.",
        "query_identifier": "Question",
        "context_identifier": "Document",
        "response_identifier": None,
    },
    {
        "name": "response_groundedness",
        "criteria": "Review the Response to the Query and assess whether every factual claim in the Response is explicitly supported by the provided Context. A Response meets the criteria if all information is directly backed by evidence in the Context, without relying on assumptions, external knowledge, or unstated inferences. The focus is on whether the Response is fully grounded in the Context, rather than whether it fully addresses the Query. If any claim in the Response lacks direct support or introduces information not present in the Context, the Response is bad and does not meet the criteria.",
        "query_identifier": "Query",
        "context_identifier": "Context",
        "response_identifier": "Response",
    },
    {
        "name": "response_helpfulness",
        "criteria": "Assess whether the AI Assistant Response is a helpful answer to the User Query. A Response is considered helpful if it makes a genuine attempt to answer the question, even if the answer is incorrect or incomplete. Factual inaccuracies should not affect the assessment. The only thing that matters is whether the Assistant tries to answer the question. A Response is considered not helpful if it avoids answering the question. For example, by saying or implying things like \"I don't know\", \"Sorry\", \"No information available\", or any other form of refusal or deflection.",
        "query_identifier": "User Query",
        "context_identifier": None,
        "response_identifier": "AI Assistant Response",
    },
    {
        "name": "query_ease",
        "criteria": "Determine whether the above User Request appears simple and straightforward. A bad User Request will appear either: ambiguous in intent, complex, purposefully tricky, abnormal, or disgruntled. A good User Request is phrased clearly and expresses an achievable intent. Basic conversational and non-propositional statements are also considered good. Should an AI Assistant be able to properly answer the User Request, it is considered good. The AI Assistant handling this User Request has additional knowledge about: the user, domain-specific terms and abbreviations, and any necessary factual information. So a User Request missing information could still be good; vagueness due to undefined pronouns/terms or references to unknown context does not make a User Request bad.",
        "query_identifier": "User Request",
        "context_identifier": None,
        "response_identifier": None,
    },
]
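Choosing a subset of the evals, or adding your own, amounts to editing this list before sending it. The sketch below shows one way to do that. The helper names `select_evals` and `make_custom_eval` and the `response_conciseness` eval with its criteria are hypothetical illustrations, and the sketch assumes a custom eval uses the same five keys as the defaults above. The entries in `evals` are abbreviated stand-ins for `DEFAULT_EVALS` so the example stays short.

```python
def make_custom_eval(name, criteria, query_identifier=None,
                     context_identifier=None, response_identifier=None):
    """Build an eval dict with the same five keys as the default evals."""
    return {
        "name": name,
        "criteria": criteria,
        "query_identifier": query_identifier,
        "context_identifier": context_identifier,
        "response_identifier": response_identifier,
    }


def select_evals(evals, names):
    """Return the evals whose name is in `names`, preserving the original order."""
    wanted = set(names)
    unknown = wanted - {e["name"] for e in evals}
    if unknown:
        raise KeyError(f"unknown eval name(s): {sorted(unknown)}")
    return [e for e in evals if e["name"] in wanted]


# Abbreviated stand-ins for DEFAULT_EVALS (criteria text shortened to "...").
evals = [
    make_custom_eval("context_sufficiency", "...", "Question", "Document"),
    make_custom_eval("response_groundedness", "...", "Query", "Context", "Response"),
    make_custom_eval("response_helpfulness", "...", "User Query", None, "AI Assistant Response"),
]

subset = select_evals(evals, ["response_helpfulness", "context_sufficiency"])

# Hypothetical custom eval, for illustration only.
custom = make_custom_eval(
    "response_conciseness",
    "Determine whether the Response is brief and free of unnecessary repetition.",
    response_identifier="Response",
)
chosen = subset + [custom]
print([e["name"] for e in chosen])
```

The resulting list would be passed as the `evals` field of the request in place of `DEFAULT_EVALS`.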

### Inputs:
- `context` (required): retrieved context for the given query.
- `query` (required): the user's query (just the question itself).
- `response` (required): response from the LLM.
- `prompt` (required): final prompt used to generate the response, including the context, the user's query, and any system instructions, in the same format originally used to generate the response.
- `evals`: list of evals to run, such as the `DEFAULT_EVALS` defined above.
- `quality` (optional): quality preset that trades off the quality of TLM responses and trustworthiness scores against latency and cost.
- `options` (optional):
  - `model` (optional): underlying base LLM to use (better models yield better results, faster models yield faster/cheaper results).
  - `log` (optional): additional logs or metadata that TLM should return. For instance, include `"explanation"` here to get explanations of why a response is scored with low trustworthiness.

### Outputs:
- `trustworthiness`:
  - `score`: score between 0 and 1 reflecting the trustworthiness of the response. A higher score indicates higher confidence that the response is correct/good.
  - `explanation`: explanation of why a response is scored with low trustworthiness, returned under `log` if the request's `log` option includes `explanation`.
- Selected `evals`:
  - `context_sufficiency`: score between 0 and 1 reflecting the sufficiency of the context provided to the LLM. Evaluates whether the retrieved context contains sufficient information to answer the query. A low score indicates that key information is missing from the context (useful to diagnose search/retrieval failures or knowledge gaps).
  - `response_groundedness`: score between 0 and 1 reflecting how well the response is grounded in the context. Evaluates whether claims/information stated in the response are explicitly supported by the provided context (useful to diagnose when your LLM is fabricating claims or relying on its internal world knowledge over the information retrieved from your knowledge base).
  - `response_helpfulness`: score between 0 and 1 reflecting the helpfulness of the response to the user's query. Evaluates whether the response attempts to answer the user's query or instead abstains from answering (useful to detect responses unlikely to satisfy the user, such as generic fallbacks).
  - `query_ease`: score between 0 and 1 reflecting the ease of the user's query. Evaluates whether the user query seems easy for an AI system to properly handle (useful to diagnose queries that are complex, vague, tricky, or disgruntled-sounding).

See more details [here](https://help.cleanlab.ai/tlm/use-cases/tlm_rag/).
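Because every entry in the output carries a `score`, a small helper can flatten the result and surface the weakest entries for diagnosis. This is a minimal sketch: the helper names `flatten_scores` and `low_scoring` are hypothetical, and the 0.5 cutoff is an arbitrary assumption, not a recommended threshold. The scores in `sample` are copied from the example scoring output in this tutorial, with the explanation text omitted for brevity.

```python
def flatten_scores(result):
    """Map each scored entry (trustworthiness and evals) to its numeric score."""
    return {
        name: value["score"]
        for name, value in result.items()
        if isinstance(value, dict) and "score" in value
    }


def low_scoring(result, threshold=0.5):
    """List entries scoring below `threshold` (an arbitrary cutoff), lowest first."""
    scores = flatten_scores(result)
    return sorted((n for n, s in scores.items() if s < threshold), key=scores.get)


# Scores copied from the example scoring output in this tutorial.
sample = {
    "context_sufficiency": {"score": 0.9975124377784942},
    "query_ease": {"score": 0.9975124377422458},
    "response_groundedness": {"score": 0.0024875622025382848},
    "response_helpfulness": {"score": 0.00696355355661231},
    "trustworthiness": {"score": 0.0},
}
print(low_scoring(sample))
```

In this sample the context and query score high while the response-related entries score low, which points at the generated response rather than at retrieval.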



In [57]:
import json

import requests

# Use a dedicated name per endpoint so other cells cannot overwrite it
RAG_SCORE_URL = "https://api.cleanlab.ai/api/v1/rag_trustworthy_llm/score"

def make_score_rag_api_request(context, query, response, prompt):
    payload = json.dumps({
        "context": context,
        "query": query,
        "prompt": prompt,
        # The response text is sent wrapped in an object
        "response": {"response": response},
        "quality": QUALITY_PRESET,
        "options": {
            "model": MODEL,
            "log": ["explanation"]
        },
        "evals": DEFAULT_EVALS
    })
    headers = {
        'authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }

    response_json = requests.post(RAG_SCORE_URL, headers=headers, data=payload).json()
    # Drop a field that is not useful; pop() avoids a KeyError if it is absent
    response_json.pop('deberta_success', None)
    return response_json

context = "The first month of the year is January."
query = "What's the first month of the year?"
response = "February"
prompt = f"Given the context provided, answer the question. Context: {context} Question: {query} Response:"

make_score_rag_api_request(context, query, response, prompt)

{'context_sufficiency': {'score': 0.9975124377784942},
 'query_ease': {'score': 0.9975124377422458},
 'response_groundedness': {'score': 0.0024875622025382848},
 'response_helpfulness': {'score': 0.00696355355661231},
 'trustworthiness': {'log': {'explanation': "Cannot verify that this response is correct.\nThis response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): \nThe first month of the year is January."},
  'score': 0.0}}

## Make Trustworthy RAG API request to generate a response and evaluate your RAG system

In [58]:
import json

import requests

# Use a dedicated name per endpoint so other cells cannot overwrite it
RAG_GENERATE_URL = "https://api.cleanlab.ai/api/v1/rag_trustworthy_llm/generate"

def make_generate_rag_api_request(context, query, prompt):
    # No "response" field here: TLM generates the response itself
    payload = json.dumps({
        "context": context,
        "query": query,
        "prompt": prompt,
        "quality": QUALITY_PRESET,
        "options": {
            "model": MODEL,
            "log": ["explanation"]
        },
        "evals": DEFAULT_EVALS
    })
    headers = {
        'authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }

    response_json = requests.post(RAG_GENERATE_URL, headers=headers, data=payload).json()
    # Drop a field that is not useful; pop() avoids a KeyError if it is absent
    response_json.pop('deberta_success', None)
    return response_json

context = "The first month of the year is January."
query = "What's the first month of the year?"
prompt = f"Given the context provided, answer the question. Context: {context} Question: {query} Response:"

make_generate_rag_api_request(context, query, prompt)

{'context_sufficiency': {'score': 0.9975124377990071},
 'query_ease': {'score': 0.9975124377422458},
 'response': 'The first month of the year is January.',
 'response_groundedness': {'score': 0.9975124378109894},
 'response_helpfulness': {'score': 0.9975124378110278},
 'trustworthiness': {'log': {'explanation': 'Did not find a reason to doubt trustworthiness.'},
  'score': 0.9998661362818028}}