# Trustworthy Language Model (TLM) Demo

[![Run in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/cleanlab-tools/blob/main/TLM-Demo-Notebook/TLM-Demo.ipynb)


### Imports and Setup

In [6]:
%pip install cleanlab-studio pandas scikit-learn

In [1]:
# Imports
import pandas as pd
pd.set_option('display.max_colwidth', 150)
from cleanlab_studio import Studio
from sklearn.metrics import accuracy_score
import requests

# Read API keys and initialize clients
STUDIO_API_KEY = "<YOUR STUDIO API KEY>"
studio = Studio(api_key=STUDIO_API_KEY)

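# Helper to selectively print lines from a customer conversation
def print_conversation(conversation, indices=None):
    """Print the 'Customer:' and 'Agent:' turns of a conversation.

    If `indices` is given, print only those turns, separated by dotted
    lines that mark skipped content; otherwise print every turn.
    """
    lines = conversation.split('\n')
    conversation_entries = []
    prefix = "\n.\n.\n.\n"  # separator used when only some turns are shown
    for line in lines:
        if line.startswith('Customer:') or line.startswith('Agent:'):
            conversation_entries.append(line.strip())
    if indices is None:
        indices = list(range(len(conversation_entries)))
        prefix = "\n\n"  # plain blank line when every turn is shown
    # Silently skip indices beyond the end of the conversation
    print(prefix.join([conversation_entries[i] for i in indices if i < len(conversation_entries)]))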

## What is TLM?

TLM adds a trustworthiness score and an explanation to any LLM-generated response.

This lets GenAI-powered applications reach production quickly and reliably.

### TLM
Example of TLM with a simple prompt:

In [2]:
tlm = studio.TLM(options={"log": ["explanation"]})

In [3]:
question = "What's the third month of the year alphabetically?"
output = tlm.prompt(question)

print(f'Response: {output["response"]}')
print(f'Trustworthiness Score: {output["trustworthiness_score"]}')
print(f'Explanation: {output["log"]["explanation"]}')

Response: The third month of the year alphabetically is "March." The months in alphabetical order are:

1. April
2. August
3. December
4. February
5. January
6. July
7. June
8. March
9. May
10. November
11. October
12. September
Trustworthiness Score: 0.4979648802626605
Explanation: The answer provided states that "March" is the third month of the year alphabetically. However, when listing the months in alphabetical order, "March" is actually the eighth month. Therefore,  incorrect. 
This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
December.


Example of TLM evaluating an existing (prompt, response) pair:


In [4]:
# Same question, different responses
question = "Who wrote Harry Potter? Respond with only their name."
# Correct response
response_1 = "J.K. Rowling"
# Incorrect response
response_2 = "Albus Dumbledore"

In [5]:
trust_score_response_1 = tlm.get_trustworthiness_score(question, response_1)
print(f'Trustworthiness Score: {trust_score_response_1["trustworthiness_score"]}')

Trustworthiness Score: 0.9936905078946259


In [6]:
trust_score_response_2 = tlm.get_trustworthiness_score(question, response_2)
print(f'Trustworthiness Score: {trust_score_response_2["trustworthiness_score"]}')


Trustworthiness Score: 0.0012339607050982507
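
One way to act on these scores is to compare them against a cutoff. The sketch below is a hypothetical helper: `is_trustworthy` and the 0.8 default are illustrative assumptions, not part of the Cleanlab API. It operates on dictionaries shaped like the TLM outputs above, so it runs without an API key.

```python
# Hypothetical helper: not part of cleanlab_studio.
def is_trustworthy(tlm_output, threshold=0.8):
    """Return True when the trustworthiness score meets the cutoff."""
    return tlm_output["trustworthiness_score"] >= threshold

# Scores copied from the two outputs shown above.
scored = {
    "J.K. Rowling": {"trustworthiness_score": 0.9936905078946259},
    "Albus Dumbledore": {"trustworthiness_score": 0.0012339607050982507},
}

for answer, output in scored.items():
    status = "accept" if is_trustworthy(output) else "flag for review"
    print(f"{answer}: {status}")
```

The right cutoff depends on the application and on how costly a wrong answer is, so treat 0.8 as a starting point to tune rather than a recommended value.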


## Use Cases

1. Compute Trustworthiness Scores for any LLM
   - Turn your own LLM into a TLM to flag hallucinations
2. Question Answering / RAG
   - Answer specific questions about your knowledge base. 
3. Data Extraction
   - Extract relevant information from your knowledge base. 
4. Classification
   - Classify and categorize your knowledge base. 
   - Auto-boost accuracy: increase the accuracy of your LLM responses.
   - Auto-labeling: label your dataset faster and more accurately.
   

### Initialize and View Data

In [4]:
# Initialize TLM
tlm = studio.TLM(quality_preset="low", options={"log": ["explanation"]})

# Read in data
df = pd.read_csv("https://raw.githubusercontent.com/cleanlab/cleanlab-tools/refs/heads/main/TLM-Demo-Notebook/customer-service-conversation.csv")
df_original = df.copy()
df.head()

Unnamed: 0,id,conversation
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase..."
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo..."
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased..."
3,YR96,"Agent: Thank you for calling BrownBox Customer Support. My name is Lisa. How may I assist you today?\n\nCustomer: Hi, Lisa. I need help with a ret..."
4,XZ39,"Customer: Hello, I am calling to inquire about my return and exchange order.\n\nAgent: Good afternoon, thank you for contacting BrownBox customer ..."


In [5]:
print(df.iloc[0].conversation)

Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?

Customer: Hi Rachel, I recently purchased a water purifier from your website, and I have a question about the warranty. Can you help me with that?

Agent: Of course, I'd be happy to help. Can you please provide me with your order number and the name of the water purifier you purchased?

Customer: My order number is BB12345678, and the water purifier is the AquaMax Pro.

Agent: Thank you for the information. Now, can you please let me know what your question is regarding the warranty?

Customer: I received the water purifier a few days ago, and I'm not sure when the warranty period starts. Can you tell me the start date of the warranty?

Agent: Sure, I can help you with that. The warranty for your AquaMax Pro water purifier starts from the date of delivery. So, the warranty period for your water purifier started on the day it was delivered to you.

Customer: Okay, that's good to know.

## Use Case: Compute Trustworthiness Scores for any LLM
Use TLM to get a trustworthiness score for any (prompt, response) pair!

#### Single Conversation

In [6]:
conversation = df[df.id=="NQ67"].conversation.values[0]
print(conversation)

Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?

Customer: Hi Sarah, I placed an order for a Smart Watch, and I want to confirm if it has been processed.

Agent: Sure, I can help you with that. May I know your name and order number, please?

Customer: My name is Jane, and my order number is BB123456.

Agent: Thank you, Jane. Let me check the order status for you. (puts customer on hold for a minute) I can confirm that your order has been processed, and it is currently in transit. You should receive it within the next three business days.

Customer: Great! Thank you for checking that.

Agent: You're welcome, Jane. Is there anything else I can assist you with?

Customer: No, that's all. Thank you for your help.

Agent: You're welcome. If you have any further questions, please don't hesitate to call us back. Have a great day!

Customer: You too. Goodbye!

Agent: Goodbye!


#### Summarization

In [7]:
prompt = "Given the following conversation, respond with a very short summarization. Conversation: {}".format(conversation)

response_1 = "The customer confirmed her Smart Watch order was processed and is in transit, expected in three days."
response_2 = "Customer wants to check the shipping speed of her Smart Watch. Agent confirmed the shipping speed is set to 3 days."
response_3 = "Customer wants to cancel their order and get a refund. Agent processed the cancellation."

trust_score_1 = tlm.get_trustworthiness_score(prompt, response_1)
trust_score_2 = tlm.get_trustworthiness_score(prompt, response_2)
trust_score_3 = tlm.get_trustworthiness_score(prompt, response_3)

print(f"TLM Score for Response 1: {trust_score_1['trustworthiness_score']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 2: {trust_score_2['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 2: {trust_score_2['log']['explanation']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 3: {trust_score_3['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 3: {trust_score_3['log']['explanation']}", "\n")

TLM Score for Response 1: 0.9837152780689016 

----------------------------------------------------------------------------

TLM Score for Response 2: 0.6784452368897584 

TLM Explanation for Response 2: The proposed answer inaccurately summarizes the conversation. The customer, Jane, did not specifically inquire about the shipping speed; she asked for confirmation on whether her order had been processed. The agent confirmed that the order was processed and mentioned that it was in transit, with an expected delivery within three business days. While the answer does mention the three-day shipping timeframe, it misrepresents the customer's original question, which was about the order's processing status rather than the speed of shipping. Therefore, the answer does not accurately reflect the content of the conversation. 

----------------------------------------------------------------------------

TLM Score for Response 3: 0.3596345957131949 

TLM Explanation for Response 3: The proposed
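
When several candidate responses exist for the same prompt, the scores can be used to rank them. The following is a minimal sketch that reuses the three scores printed above. The variable names are illustrative, and in a live notebook the scores would come from `get_trustworthiness_score` rather than being typed in.

```python
# Scores copied from the output above, keyed by candidate summary.
candidates = [
    ("response_1", 0.9837152780689016),
    ("response_2", 0.6784452368897584),
    ("response_3", 0.3596345957131949),
]

# Sort from most to least trustworthy and keep the top candidate.
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
best_name, best_score = ranked[0]
print(f"Most trustworthy candidate: {best_name} ({best_score:.3f})")
```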

## Use Case: Question Answering (RAG)

In this example, we will use TLM to answer a customer's question about the company's customer service policy.

In [8]:
# Read customer service policy
url = "https://raw.githubusercontent.com/cleanlab/cleanlab-tools/refs/heads/main/TLM-Demo-Notebook/customer-service-policy.md"
response = requests.get(url)
customer_service_policy = response.text
print(customer_service_policy)

The following is the customer service policy of ACME Inc.
# ACME Inc. Customer Service Policy

## Table of Contents
1. Free Shipping Policy
2. Free Returns Policy
3. Fraud Detection Guidelines
4. Customer Interaction Tone

## 1. Free Shipping Policy

### 1.1 Eligibility Criteria
- Free shipping is available on all orders over $50 within the continental United States.
- For orders under $50, a flat rate shipping fee of $5.99 will be applied.
- Free shipping is not available for expedited shipping methods (e.g., overnight or 2-day shipping).

### 1.2 Exclusions
- Free shipping does not apply to orders shipped to Alaska, Hawaii, or international destinations.
- Oversized or heavy items may incur additional shipping charges, which will be clearly communicated to the customer before purchase.

### 1.3 Handling Customer Inquiries
- If a customer inquires about free shipping eligibility, verify the order total and shipping destination.
- Inform customers of ways to qualify for free shipping (

In [9]:
prompt_template = '''You are a customer service agent for ACME Inc. Your task is to answer the following customer question based on the customer service policy.

Customer Service Policy: {}
Customer Question: {}
'''
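
The template above relies on positional `{}` fields, so the policy and the question must be passed in the right order. As an optional variant, the sketch below uses named fields instead. `named_template` and `build_prompt` are hypothetical names introduced only for this example, and the policy text is a placeholder.

```python
# Illustrative variant of the template above with named fields.
named_template = (
    "You are a customer service agent for ACME Inc. Your task is to answer "
    "the following customer question based on the customer service policy.\n\n"
    "Customer Service Policy: {policy}\n"
    "Customer Question: {question}\n"
)

def build_prompt(policy, question):
    """Fill the template; keyword arguments make the argument order irrelevant."""
    return named_template.format(policy=policy, question=question)

# Placeholder policy text, used only for this example.
example = build_prompt(policy="<policy text>", question="Can I return an item?")
print(example)
```

Braces that appear inside the policy text itself are left untouched, because `str.format` only interprets fields in the template string, not in the values substituted into it.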

In [10]:
customer_question = '''Can I get a refund if I was dissapointed with my purchase, even if it was not defective?'''
customer_question_prompt = prompt_template.format(customer_service_policy, customer_question)
response = tlm.prompt(customer_question_prompt)
print(response["response"])
print()
print("Trustworthiness Score: ", response["trustworthiness_score"])

Thank you for your question! At ACME Inc., we offer free returns for items within 30 days of delivery, provided they are unused, unworn, and in their original packaging with all tags attached. However, if the item is not defective and you are simply disappointed with your purchase, it may not qualify for a refund under our policy. 

If you would like to return the item, please ensure it meets the eligibility criteria, and I can guide you through the return process. If the return is approved, you will receive a refund within 5-7 business days after we receive the item. If you have any further questions or need assistance, feel free to ask!

Trustworthiness Score:  0.8652735152398818


In [11]:
customer_question = '''Can I return a pair of noserings if they dont fit?'''
customer_question_prompt = prompt_template.format(customer_service_policy, customer_question)
response = tlm.prompt(customer_question_prompt)
print(response["response"])
print()
print("Trustworthiness Score: ", response["trustworthiness_score"])
print()
print("Explanation: ", response["log"]["explanation"])


Thank you for your question! Unfortunately, under our Free Returns Policy, undergarments and earrings are not eligible for free returns due to hygiene reasons. If the noserings are unworn and in their original packaging, you may still be able to return them, but please note that you would be responsible for the return shipping costs. If you have any further questions or need assistance, feel free to ask!

Trustworthiness Score:  0.7615646194326461

Explanation:  incorrect because it states that the customer may return the noserings if they are unworn and in their original packaging, which contradicts the customer service policy. According to the Free Returns Policy, earrings, including noserings, are not eligible for free returns due to hygiene reasons. Therefore, the customer cannot return the noserings regardless of their condition. The answer should have clearly communicated that the noserings cannot be returned at all, rather than suggesting a possibility of return with shipping co

In [12]:
customer_question = '''Is there any way to contact customer service by phone instead of through this chat?'''
customer_question_prompt = prompt_template.format(customer_service_policy, customer_question)

response = tlm.prompt(customer_question_prompt)
print(response["response"])
print()
print("Trustworthiness Score: ", response["trustworthiness_score"])
print()
print("Explanation: ", response["log"]["explanation"])


I'm here to assist you through this chat, but I understand that you may prefer to speak with someone directly. Unfortunately, I can't provide a phone number or direct you to customer service. However, I'm happy to help you with any questions or concerns you have right here! What can I assist you with today?

Trustworthiness Score:  0.7684822273449248

Explanation:  The proposed answer does not align with the customer service policy, which emphasizes maintaining a professional and empathetic tone while addressing customer inquiries. The customer is asking for a way to contact customer service by phone, and while the bot cannot provide a phone number or direct the customer to customer service, it should acknowledge the customer's request more positively. The response could have included an expression of understanding regarding the customer's preference for phone communication and reassured them that their concerns can be addressed through chat. The answer should focus on providing assist
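
In a deployed assistant, a low score could trigger a fallback instead of showing the answer to the customer. The sketch below illustrates that routing with a hypothetical `respond_or_escalate` helper. The fallback message and the 0.8 cutoff are assumptions for illustration, and the response texts are shortened to placeholders.

```python
# Hypothetical routing helper: the names and the 0.8 cutoff are assumptions.
FALLBACK_MESSAGE = "Let me connect you with a human agent who can help with this."

def respond_or_escalate(tlm_output, threshold=0.8):
    """Return (message, escalated) based on the trustworthiness score."""
    if tlm_output["trustworthiness_score"] >= threshold:
        return tlm_output["response"], False
    return FALLBACK_MESSAGE, True

# Scores from the three answers above; response text shortened to placeholders.
outputs = [
    {"response": "<refund answer>", "trustworthiness_score": 0.8652735152398818},
    {"response": "<nosering answer>", "trustworthiness_score": 0.7615646194326461},
    {"response": "<phone answer>", "trustworthiness_score": 0.7684822273449248},
]

for output in outputs:
    message, escalated = respond_or_escalate(output)
    print("escalated" if escalated else "answered", "->", message)
```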

## Use Case: Data Extraction
Extract the customer name, contact information, and order number from each conversation.

In [13]:
extraction = ''' Please extract these details from the following conversation: 
customer name, customer phone number, customer email, and order number.
If a detail is not present, respond with None for that detail.
Here is the conversation: {}'''
extraction_prompts = [extraction.format(conversation) for conversation in df.conversation.values]

In [14]:
# Generate answers and trustworthiness scores
extractions = tlm.prompt(extraction_prompts)

Querying TLM... 100%|██████████|


In [15]:
# Add results to df
df_extr = df.copy()
df_extr["extraction"] = [entry['response'] for entry in extractions]
df_extr["extraction_score"] = [entry['trustworthiness_score'] for entry in extractions]
df_extr["extraction_expl"] = [entry['log']['explanation'] for entry in extractions]

#### TLM Responses with Lowest Trustworthiness Score

In [16]:
low_trust_extractions = df_extr.sort_values(by="extraction_score", ascending=True).head(3)
low_trust_extractions

Unnamed: 0,id,conversation,extraction,extraction_score,extraction_expl
79,BC62,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How can I assist you today?\n\nCustomer: Hi Rachel, I received a wrong ...",Customer Name: None \nCustomer Phone Number: None \nCustomer Email: None \nOrder Number: BB987654321,0.720271,"The answer correctly extracts the order number from the conversation, which is BB987654321. However, it fails to provide any information regarding..."
39,SC68,"Customer: Hi, I need some help with my BrownBox account.\n\nAgent: Hello, thank you for contacting BrownBox customer support. My name is Jack. May...",Customer Name: John \nCustomer Phone Number: None \nCustomer Email: john@email.com \nOrder Number: 123456789,0.749987,"The proposed answer correctly identifies the customer name as ""John"" based on the conversation. However, it incorrectly lists the order number as ..."
13,GU39,"Agent: Thank you for calling BrownBox customer support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I received a damaged ...",Customer Name: None \nCustomer Phone Number: None \nCustomer Email: None \nOrder Number: 123456789,0.893278,Did not find a reason to doubt trustworthiness.


#### LLM responds with an incorrect order number that was corrected by the agent

In [17]:
print("TLM Trust Score: ", low_trust_extractions.extraction_score.values[0], "\n")
print("TLM Response:\n",low_trust_extractions.extraction.values[0], "\n")
print("TLM Explanation: ", low_trust_extractions.extraction_expl.values[0])

TLM Trust Score:  0.720271354745226 

TLM Response:
 Customer Name: None  
Customer Phone Number: None  
Customer Email: None  
Order Number: BB987654321   

TLM Explanation:  The answer correctly extracts the order number from the conversation, which is BB987654321. However, it fails to provide any information regarding the customer's name, phone number, or email address, all of which are marked as None. The response is clear and straightforward, but it lacks completeness since it does not include all requested details. The extraction of the order number is accurate, but the absence of the other details makes the overall response less useful. Therefore, while the answer is partially correct, it does not fully meet the user's request for all specified details. Given these factors, I would rate the response as a 3, as it addresses part of the user's question but lacks confidence in its completeness and accuracy regarding the other details. 
This response is untrustworthy due to lack of 

In [18]:
print_conversation(low_trust_extractions.conversation.values[0], [3,4])

Customer: My order number is BB987654321. I received a Mixer Grinder with model number MG1234 instead of the Wet Grinder I ordered.
.
.
.
Agent: Thank you for providing that information, sir. That one is wrong, it is actually BB987654324. Let me check the availability of the Wet Grinder for you. Please hold on for a moment while I access the information.
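
The extraction responses come back as free text, so downstream code usually needs them as structured fields. The sketch below parses the `Field: value` layout seen in the responses above into a dictionary. `parse_extraction` is a hypothetical helper, and it assumes the model keeps to that layout, which an LLM is not guaranteed to do.

```python
def parse_extraction(text):
    """Parse 'Field: value' lines into a dict, mapping the string 'None' to None."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or free-form lines
        key, value = line.split(":", 1)
        value = value.strip()
        fields[key.strip()] = None if value == "None" else value
    return fields

# Response text copied from the lowest-scoring extraction above.
raw = (
    "Customer Name: None  \n"
    "Customer Phone Number: None  \n"
    "Customer Email: None  \n"
    "Order Number: BB987654321   "
)
print(parse_extraction(raw))
```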


#### TLM Responses with Highest Trustworthiness Score

In [19]:
high_trust_extractions = df_extr.sort_values(by="extraction_score", ascending=False).head(3)
high_trust_extractions

Unnamed: 0,id,conversation,extraction,extraction_score,extraction_expl
92,CF85,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I recently purchased ...",Customer Name: John \nCustomer Phone Number: None \nCustomer Email: john.doe@email.com \nOrder Number: 123456,0.991019,Did not find a reason to doubt trustworthiness.
45,BC81,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Tom, how may I assist you today?\n\nCustomer: Hi, I'm having trouble log...",Customer Name: John Doe \nCustomer Phone Number: 555-123-4567 \nCustomer Email: johndoe@email.com \nOrder Number: None,0.990308,Did not find a reason to doubt trustworthiness.
18,NI19,"Agent: Thank you for calling BrownBox Customer Support. My name is Max. How may I assist you today?\n\nCustomer: Hi Max, I recently purchased a wr...",Customer Name: Jane Doe \nCustomer Phone Number: None \nCustomer Email: janedoe123@example.com \nOrder Number: BB987654321,0.988496,Did not find a reason to doubt trustworthiness.


In [20]:
print("TLM Trust Score: ", high_trust_extractions.extraction_score.values[0], "\n")
print("TLM Response:\n",high_trust_extractions.extraction.values[0], "\n")
print("TLM Explanation: ", high_trust_extractions.extraction_expl.values[0])

TLM Trust Score:  0.9910193319362357 

TLM Response:
 Customer Name: John  
Customer Phone Number: None  
Customer Email: john.doe@email.com  
Order Number: 123456   

TLM Explanation:  Did not find a reason to doubt trustworthiness.


## Use Case: Classification, Auto-boost Accuracy, and Auto-labeling 

#### Question: Classify conversations into one of 5 issue areas:
1. Warranty and Product Support
2. Returns and Exchanges
3. Shipping and Delivery
4. Order and Payment
5. Account Management

In [21]:
categories = '''Please classify the following conversation between a customer and service agent
into one of the following support categories: {}
Please only respond with the support category and nothing else. You may only respond with one of 'Warranty and Product Support', 'Returns and Exchanges', 'Shipping and Delivery', 'Order and Payment', or 'Account Management'.
Here is the conversation: {}'''
category_descriptions = '''
Warranty and Product Support: Handles warranty claims, product registration, support for service issues, and warranty term disputes.
Returns and Exchanges: Covers returning or exchanging items, refund timelines, shipping issues, and non-returnable products.
Shipping and Delivery: Focuses on delivery options, failed deliveries, shipping restrictions, and pickup or shipping problems.
Order and Payment: Involves order placement, cancellations, pricing issues, payment options, and refund or invoice concerns.
Account Management: Addresses login issues, account updates, verification processes, reactivation, and loyalty program questions.
'''
categories_prompts = [categories.format(category_descriptions, conversation) for conversation in df.conversation.values]

In [22]:
categories = tlm.prompt(categories_prompts)

Querying TLM... 100%|██████████|


In [23]:
# Add results to df
df_cat = df.copy()
df_cat["category"] = [entry['response'] for entry in categories]
df_cat["category_score"] = [entry['trustworthiness_score'] for entry in categories]
df_cat["category_expl"] = [entry['log']['explanation'] for entry in categories]
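
The prompt asks the model to answer with exactly one of five category names, but nothing enforces that. Before computing metrics, it can help to check each response against the allowed list. The sketch below shows one way to do this. `VALID_CATEGORIES` and `normalize_category` are hypothetical names, and the cleanup rules (trimming whitespace, quotes, and trailing periods, ignoring case) are assumptions about the kinds of deviations that might occur.

```python
# The five category names from the prompt above.
VALID_CATEGORIES = {
    "Warranty and Product Support",
    "Returns and Exchanges",
    "Shipping and Delivery",
    "Order and Payment",
    "Account Management",
}

def normalize_category(response):
    """Return the matching category name, or None if the response is off-list."""
    # Trim surrounding whitespace, quotes, and periods before comparing.
    cleaned = response.strip(" \t\r\n'\".")
    for category in VALID_CATEGORIES:
        if cleaned.lower() == category.lower():
            return category
    return None

print(normalize_category(" Order and Payment.\n"))
print(normalize_category("Billing"))
```

Responses that normalize to `None` are natural candidates for manual review, regardless of their trustworthiness score.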

#### TLM Responses with Lowest Trustworthiness Score

In [24]:
low_trust_categories = df_cat.sort_values(by="category_score", ascending=True).head(3)
low_trust_categories.head(3)

Unnamed: 0,id,conversation,category,category_score,category_expl
22,KP59,"Customer: Hi, I am calling to check the status of my order for a pram/stroller that I placed last week.\n\nAgent: Hello, thank you for calling Bro...",Shipping and Delivery,0.479531,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
38,NQ67,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I placed an order for...",Shipping and Delivery,0.488012,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
51,PR33,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I am trying to club o...",Returns and Exchanges,0.56465,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...


In [25]:
print(low_trust_categories.category.values[0])

Shipping and Delivery


In [26]:
print(low_trust_categories.category_expl.values[0])

This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
Order and Payment.


In [29]:
print_conversation(low_trust_categories.conversation.values[0], [0])

Customer: Hi, I am calling to check the status of my order for a pram/stroller that I placed last week.


#### TLM Responses with Highest Trustworthiness Score

In [31]:
high_trust_categories = df_cat.sort_values(by="category_score", ascending=False).head(3)
high_trust_categories

Unnamed: 0,id,conversation,category,category_score,category_expl
28,XZ51,"Agent: Hello, thank you for contacting BrownBox customer support. How may I assist you today?\n\nCustomer: Hi, I am trying to reactivate my accoun...",Account Management,0.999632,Did not find a reason to doubt trustworthiness.
9,JN62,"Agent: Thank you for calling BrownBox Customer Support. My name is Alex. How can I assist you today?\n\nCustomer: Hi Alex, I recently bought a pai...",Account Management,0.999632,Did not find a reason to doubt trustworthiness.
52,BL44,"Agent: Hello, thank you for contacting BrownBox customer support. My name is Sarah. How can I assist you today?\n\nCustomer: Hi Sarah, I'm trying ...",Account Management,0.999632,Did not find a reason to doubt trustworthiness.


In [32]:
print(high_trust_categories.category.values[0])

Account Management


In [33]:
print_conversation(high_trust_categories.conversation.values[0], [1])

Customer: Hi, I am trying to reactivate my account, but it's not working.


## Accuracy Improvement
How does TLM's Trustworthiness Score increase LLM accuracy?

#### Let's revisit the category classification task

In [34]:
subset = df_cat[["id", "conversation", "category", "category_score"]]
subset.head(3)

Unnamed: 0,id,conversation,category,category_score
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase...",Warranty and Product Support,0.999231
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo...",Returns and Exchanges,0.999358
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased...",Shipping and Delivery,0.999483


#### Import and add ground truth category labels
In this example, we have ground truth labels for the correct category for all of the conversations.

In [35]:
category_ground_truth = pd.read_csv("https://raw.githubusercontent.com/cleanlab/cleanlab-tools/refs/heads/main/TLM-Demo-Notebook/customer-service-chat-categories.csv")
category_df = pd.merge(subset, category_ground_truth, on='id', how='left')
category_df.head(3)

Unnamed: 0,id,conversation,category,category_score,ground_truth
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase...",Warranty and Product Support,0.999231,Warranty and Product Support
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo...",Returns and Exchanges,0.999358,Returns and Exchanges
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased...",Shipping and Delivery,0.999483,Shipping and Delivery


#### Compute Baseline Accuracy

In [36]:
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
base_acc = accuracy_score(category_df["ground_truth"], category_df["category"])
print("Base Accuracy: ", f"{base_acc:.1%}")

Base Accuracy:  93.6%


#### Auto-boost Accuracy

In [37]:
tlm = studio.TLM(quality_preset="best")

# Generate answers and trustworthiness scores using best quality preset
best_preset_categories = tlm.prompt(categories_prompts)

# Add results and compute accuracy
category_df["best_preset_category"] = [entry['response'] for entry in best_preset_categories]
category_df["best_preset_category_score"] = [entry['trustworthiness_score'] for entry in best_preset_categories]
best_preset_acc = accuracy_score(category_df["ground_truth"], category_df["best_preset_category"])
print("Best Preset Accuracy: ", f"{best_preset_acc:.1%}")

Querying TLM... 100%|██████████|

Best Preset Accuracy:  94.5%





#### Compute Accuracy at Various TLM Score Thresholds
Keeping only the predictions whose TLM trust score exceeds a threshold yields higher classification accuracy, and accuracy rises as the threshold becomes more stringent.

In [38]:
category_df_70 = category_df[category_df.category_score > 0.7]
category_df_80 = category_df[category_df.category_score > 0.8]
category_df_90 = category_df[category_df.category_score > 0.9]
acc_70 = accuracy_score(category_df_70["ground_truth"], category_df_70["category"])
acc_80 = accuracy_score(category_df_80["ground_truth"], category_df_80["category"])
acc_90 = accuracy_score(category_df_90["ground_truth"], category_df_90["category"])
print("Accuracy of all predictions: ", f"{base_acc:.1%}")
print("Accuracy of predictions with trust score >70%: ", f"{acc_70:.1%}")
print("Accuracy of predictions with trust score >80%: ", f"{acc_80:.1%}")
print("Accuracy of predictions with trust score >90%: ", f"{acc_90:.1%}")

Accuracy of all predictions:  93.6%
Accuracy of predictions with trust score >70%:  97.2%
Accuracy of predictions with trust score >80%:  98.1%
Accuracy of predictions with trust score >90%:  99.0%
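
Raising the threshold also reduces how many predictions are kept, so accuracy is best read alongside coverage (the fraction of rows that pass the filter). The sketch below computes both in plain Python. The rows are toy values invented for illustration, not the notebook's data, and `accuracy_and_coverage` is a hypothetical helper.

```python
def accuracy_and_coverage(rows, threshold):
    """rows: list of (predicted, ground_truth, score) tuples.

    Returns (accuracy, coverage) for rows with score > threshold;
    accuracy is None when no row passes the filter.
    """
    kept = [(pred, truth) for pred, truth, score in rows if score > threshold]
    coverage = len(kept) / len(rows)
    if not kept:
        return None, coverage
    accuracy = sum(pred == truth for pred, truth in kept) / len(kept)
    return accuracy, coverage

# Toy rows invented for illustration; these are NOT the notebook's data.
toy_rows = [
    ("A", "A", 0.99),
    ("B", "B", 0.95),
    ("A", "A", 0.85),
    ("A", "B", 0.75),
    ("A", "B", 0.40),
]

for threshold in (0.0, 0.7, 0.8, 0.9):
    accuracy, coverage = accuracy_and_coverage(toy_rows, threshold)
    print(f">{threshold:.0%}: accuracy={accuracy:.1%}, coverage={coverage:.1%}")
```

Applied to `category_df`, the same idea would show how many conversations remain at each of the 70%, 80%, and 90% cutoffs used above.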


## Auto-labeling
Use TLM to label your dataset faster and more accurately.

#### TLM Responses with High Trustworthiness Score
These responses have high trustworthiness scores, indicating that they are likely to be correct and can be labeled automatically.

In [39]:
category_df.sort_values(by="category_score", ascending=False).head(3)

Unnamed: 0,id,conversation,category,category_score,ground_truth,best_preset_category,best_preset_category_score
28,XZ51,"Agent: Hello, thank you for contacting BrownBox customer support. How may I assist you today?\n\nCustomer: Hi, I am trying to reactivate my accoun...",Account Management,0.999632,Account Management,Account Management,0.999632
9,JN62,"Agent: Thank you for calling BrownBox Customer Support. My name is Alex. How can I assist you today?\n\nCustomer: Hi Alex, I recently bought a pai...",Account Management,0.999632,Account Management,Account Management,0.999632
52,BL44,"Agent: Hello, thank you for contacting BrownBox customer support. My name is Sarah. How can I assist you today?\n\nCustomer: Hi Sarah, I'm trying ...",Account Management,0.999632,Account Management,Account Management,0.999632


#### TLM Responses with Low Trustworthiness Score
These responses have low trustworthiness scores, indicating that they are likely to be incorrect and should be reviewed manually.

In [40]:
category_df.sort_values(by="category_score", ascending=True).head(3)

Unnamed: 0,id,conversation,category,category_score,ground_truth,best_preset_category,best_preset_category_score
22,KP59,"Customer: Hi, I am calling to check the status of my order for a pram/stroller that I placed last week.\n\nAgent: Hello, thank you for calling Bro...",Shipping and Delivery,0.479531,Order and Payment,Shipping and Delivery,0.547593
38,NQ67,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I placed an order for...",Shipping and Delivery,0.488012,Order and Payment,Shipping and Delivery,0.549392
51,PR33,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I am trying to club o...",Returns and Exchanges,0.56465,Order and Payment,Returns and Exchanges,0.781469


#### You only need to manually label a small percentage of your dataset

In [42]:
# Number of datapoints at or below an 80% trustworthiness score
count = len(category_df[category_df.category_score <= 0.8])
total = len(category_df)
percentage = count/total * 100
print(f"Examples with trust score ≤ 80%: {count} ({percentage:.1f}% of the dataset)")


Examples with trust score ≤ 80%: 5 (4.5% of the dataset)
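
The same threshold can split a dataset into rows that are labeled automatically and rows that go to a reviewer. The sketch below is a minimal illustration using a handful of ids, labels, and rounded scores from the tables above. `split_for_labeling` is a hypothetical helper, and 0.8 is the cutoff used in the preceding cell rather than a recommended value.

```python
def split_for_labeling(rows, threshold=0.8):
    """Split (id, label, score) rows into auto-labeled and needs-review lists."""
    auto_labeled = [row for row in rows if row[2] > threshold]
    needs_review = [row for row in rows if row[2] <= threshold]
    return auto_labeled, needs_review

# Ids, labels, and rounded scores taken from the tables above.
rows = [
    ("XZ51", "Account Management", 0.999632),
    ("KP59", "Shipping and Delivery", 0.479531),
    ("NQ67", "Shipping and Delivery", 0.488012),
    ("PR33", "Returns and Exchanges", 0.56465),
]

auto_labeled, needs_review = split_for_labeling(rows)
print("Auto-labeled:", [row[0] for row in auto_labeled])
print("Needs review:", [row[0] for row in needs_review])
```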


# Next Steps
- [Advanced Usage notebook](https://help.cleanlab.ai/tlm/tutorials/tlm_advanced/) which includes:
  - Generating explanations of low trustworthiness scores
  - Running TLM over large datasets
  - Using quality presets to control latency/cost vs. response accuracy and trustworthiness score reliability
  - Reducing latency/cost without sacrificing response quality via a `TLMLite` option that allows different models for producing the response vs. scoring its trustworthiness
- Add your own [custom evaluation](https://help.cleanlab.ai/tlm/tutorials/tlm_custom_eval/) metrics to TLM and calibrate trustworthiness scores to human quality annotations
- TLM API [reference](https://help.cleanlab.ai/tlm/python/)