# Trustworthy Language Model (TLM) Demo

### Imports and Setup

In [1]:
# Imports
import pandas as pd
pd.set_option('display.max_colwidth', 150)
from cleanlab_studio import Studio
from openai import OpenAI
from sklearn.metrics import accuracy_score

# Read API keys and initialize clients
STUDIO_API_KEY = open('studio-key.txt', 'r').read().strip()
OPENAI_API_KEY = open('openai-key.txt', 'r').read().strip()
studio = Studio(api_key=STUDIO_API_KEY)
client = OpenAI(api_key=OPENAI_API_KEY)

### Helpers

In [2]:
# Helper to prompt GPT-4o
class ChatGPT:
    def __init__(self, client):
        self.client = client
    def prompt(self, prompt):
        completion = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return completion.choices[0].message.content

# Helper to selectively print lines from customer conversation
def print_conversation(conversation, indices=None):
    lines = conversation.split('\n')
    conversation_entries = []
    prefix = "\n.\n.\n.\n"
    for line in lines:
        if line.startswith('Customer:') or line.startswith('Agent:'):
            conversation_entries.append(line.strip())
    if indices is None:
        indices = list(range(len(conversation_entries)))
        prefix = "\n\n"
    print(prefix.join([conversation_entries[i] for i in indices if i < len(conversation_entries)]))

# Helper to produce new aggregate df
# def save_and_reset_df(df_current, df_original, df_combined):
#     if df_combined is None:
#         df_combined = df_current.copy()
#     for col in [col for col in df_current.columns if col not in df_combined.columns]:
#         df_combined[col] = df_current[col]
#     df = df_original.copy()
#     return df, df_combined

## What is TLM?
- TLM attaches a trustworthiness score and an explanation to any LLM-generated response
- TLM enables GenAI-powered applications to reach production quickly and reliably

### ChatGPT

In [3]:
chatgpt = ChatGPT(client)
response = chatgpt.prompt("What is the third month of the year, alphabetically? Respond with just the month.")
print("Response:",response)

Response: August


### TLM

In [4]:
tlm = studio.TLM(options={"log": ["explanation"]})
output = tlm.prompt("What is the third month of the year, alphabetically? Respond with just the month.")
print(f'Response: {output["response"]}')
print(f'Trustworthiness Score: {output["trustworthiness_score"]}')
print(f'Explanation: {output["log"]["explanation"]}')

Response: March
Trustworthiness Score: 0.47499641722571456
Explanation: This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
December.
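
One way an application might act on the score is sketched below. The `route_response` helper and the 0.8 threshold are hypothetical choices for illustration, not part of the TLM API; the dictionary only mirrors the `response` and `trustworthiness_score` keys printed above.

```python
# Hypothetical helper: decide what to do with a TLM-style output dict.
# The 0.8 threshold is an illustrative assumption, not a recommended value.
def route_response(output, threshold=0.8):
    """Return ("accept", response) or ("review", response) based on the score."""
    score = output["trustworthiness_score"]
    action = "accept" if score >= threshold else "review"
    return action, output["response"]

# Values copied from the TLM output shown above.
example = {"response": "March", "trustworthiness_score": 0.47499641722571456}
print(route_response(example))  # ('review', 'March')
```

In practice the threshold would be tuned on labeled data for the task at hand.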


## Use Cases

1. Question Answering / RAG
   - Answer specific questions about your knowledge base. 
2. Data Extraction
   - Extract relevant information from your knowledge base. 
3. Classification
   - Classify and categorize your knowledge base. 
4. Accuracy Improvement
   - Increase the accuracy of your LLM responses.
5. Compute Trustworthiness Scores for any LLM
   - Turn your own LLM into a TLM

### Initialize and View Data

In [5]:
# Initialize TLM
tlm = studio.TLM(quality_preset = "low", options={"log": ["explanation"]})

# Read in data
df = pd.read_csv("customer-service-conversation.csv")
df_original = df.copy()
df.head()

Unnamed: 0,id,conversation
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase..."
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo..."
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased..."
3,YR96,"Agent: Thank you for calling BrownBox Customer Support. My name is Lisa. How may I assist you today?\n\nCustomer: Hi, Lisa. I need help with a ret..."
4,XZ39,"Customer: Hello, I am calling to inquire about my return and exchange order.\n\nAgent: Good afternoon, thank you for contacting BrownBox customer ..."


## Use Case: Question Answering (RAG)

### Question: Was the customer's problem resolved?

In [6]:
resolution = '''Based on the following conversation between a customer and a support agent, your task is to determine 
if the conversation ended with a resolution that solves the customers problem.
If you determine the customers problem to be resolved, respond with "Resolved. {{explanation}}".
If you determine the customers problem to be unresolved, respond with "Unresolved. {{explanation}}"
The {{explanation}} should explain the how or how not the problem was resolved, in as few words as possible.
Here is the conversation: {}'''
resolution_prompts = [resolution.format(conversation) for conversation in df.conversation.values]
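
The template above mixes `{{explanation}}` with `{}`. A minimal sketch of why that works: `str.format` fills `{}` with the argument and turns doubled braces into literal single braces. The shortened template and conversation text here are made up for illustration.

```python
# str.format treats "{}" as a placeholder and "{{...}}" as literal braces.
template = 'Respond with "Resolved. {{explanation}}". Here is the conversation: {}'
filled = template.format("Customer: Hi ...")
print(filled)
# Respond with "Resolved. {explanation}". Here is the conversation: Customer: Hi ...

# Braces inside the *argument* are left alone; only the template is parsed.
print("{}".format("{x}"))  # {x}
```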

In [7]:
# Generate answers and trustworthiness scores
resolutions = tlm.prompt(resolution_prompts)

Querying TLM... 100%|███████████████████████████████████████████████████████████████████████|


In [8]:
# Add results to df
df_res = df.copy()
df_res["resolution"] = [entry['response'] for entry in  resolutions]
df_res["resolution_score"] = [entry['trustworthiness_score'] for entry in  resolutions]
df_res["resolution_expl"] = [entry['log']['explanation'] for entry in  resolutions]
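
The same three-column pattern recurs for every use case in this notebook. A possible way to factor it out is sketched below; `unpack_results` is a hypothetical helper, and the sample results are made up. It assumes each result is a dict shaped like the TLM outputs used above.

```python
# Hypothetical helper: turn a list of TLM-style result dicts into named columns.
def unpack_results(results, name):
    """Return {name: responses, name_score: scores, name_expl: explanations}."""
    return {
        name: [r["response"] for r in results],
        f"{name}_score": [r["trustworthiness_score"] for r in results],
        # .get() tolerates a result without a "log" entry (an assumption here).
        f"{name}_expl": [r.get("log", {}).get("explanation") for r in results],
    }

# Made-up results for illustration only.
fake_results = [
    {"response": "Resolved. ...", "trustworthiness_score": 0.9, "log": {"explanation": "ok"}},
    {"response": "Unresolved. ...", "trustworthiness_score": 0.4},
]
columns = unpack_results(fake_results, "resolution")
print(columns["resolution_score"])  # [0.9, 0.4]
```

With pandas, the returned dict could be attached in one step, for example `df.assign(**unpack_results(resolutions, "resolution"))`.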

#### TLM Responses with Lowest Trustworthiness Score

In [9]:
low_trust_resolutions = df_res.sort_values(by="resolution_score", ascending=True).head(3)
low_trust_resolutions

Unnamed: 0,id,conversation,resolution,resolution_score,resolution_expl
51,PR33,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I am trying to club o...","Unresolved. The customer wanted to club orders for combined delivery, which is not possible, and ultimately decided to cancel the orders but was i...",0.40421,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
72,US01,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Sarah, how may I assist you today?\n\nCustomer: Hi Sarah, I received my ...",Resolved. The customer was informed on how to initiate the return process and understood the steps to take for returning the shoes.,0.441692,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
18,NI19,"Agent: Thank you for calling BrownBox Customer Support. My name is Max. How may I assist you today?\n\nCustomer: Hi Max, I recently purchased a wr...","Resolved. The agent confirmed the ability to update the email address and provided instructions to the customer via email, which addresses the cus...",0.574765,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...


#### Customer's original request was not resolved, but the conversation ended with an alternative resolution.

In [10]:
print("TLM Trust Score: ", low_trust_resolutions.resolution_score.values[0], "\n")
print("TLM Response: ", low_trust_resolutions.resolution.values[0], "\n")
print("TLM Explanation: ", low_trust_resolutions.resolution_expl.values[0])

TLM Trust Score:  0.4042104639014995 

TLM Response:  Unresolved. The customer wanted to club orders for combined delivery, which is not possible, and ultimately decided to cancel the orders but was informed they had already shipped, leading to a situation where the customer must refuse delivery to get a refund. 

TLM Explanation:  This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
Resolved. The customer was provided an alternative solution to refuse delivery and receive a refund since clubbing orders was not possible.


In [11]:
print_conversation(low_trust_resolutions.conversation.values[0], [1,7,10,12,13])

Customer: Hi Sarah, I am trying to club orders from different sellers for combined delivery, but it's not working. Can you help me with that?
.
.
.
Agent: I understand your concern. However, clubbing orders from different sellers is not possible as they are shipped separately from different warehouses.
.
.
.
Customer: No, I want to cancel the orders. I don't want to pay for the shipping charges twice.
.
.
.
Agent: I'm sorry, but the orders have already been shipped. However, you can refuse the delivery when it arrives, and we will process your refund once we receive the returned items.
.
.
.
Customer: Okay, I'll do that. Can you tell me how long it will take to receive the refund?


#### TLM Responses with Highest Trustworthiness Score

In [12]:
high_trust_resolutions = df_res.sort_values(by="resolution_score", ascending=False).head(3)
high_trust_resolutions

Unnamed: 0,id,conversation,resolution,resolution_score,resolution_expl
9,JN62,"Agent: Thank you for calling BrownBox Customer Support. My name is Alex. How can I assist you today?\n\nCustomer: Hi Alex, I recently bought a pai...","Resolved. The customer successfully received the OTP and completed the verification process, allowing them to proceed with their order.",0.977626,Did not find a reason to doubt trustworthiness.
10,RH04,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I am having trouble w...","Resolved. The customer's account was successfully reactivated, and an appointment for servicing the water purifier was scheduled.",0.976664,Did not find a reason to doubt trustworthiness.
39,SC68,"Customer: Hi, I need some help with my BrownBox account.\n\nAgent: Hello, thank you for contacting BrownBox customer support. My name is Jack. May...","Resolved. The agent successfully updated the customer's email address, allowing him to log in to his account.",0.974477,Did not find a reason to doubt trustworthiness.


In [13]:
print("TLM Trust Score: ", high_trust_resolutions.resolution_score.values[0], "\n")
print("TLM Response: ", high_trust_resolutions.resolution.values[0], "\n")
print("TLM Explanation: ", high_trust_resolutions.resolution_expl.values[0])

TLM Trust Score:  0.9776264273461617 

TLM Response:  Resolved. The customer successfully received the OTP and completed the verification process, allowing them to proceed with their order. 

TLM Explanation:  Did not find a reason to doubt trustworthiness.


In [14]:
print_conversation(high_trust_resolutions.conversation.values[0], [1, 4, 8])

Customer: Hi Alex, I recently bought a pair of jeans from your website, and I received a message that I need to verify my mobile number and email address to get the OTP or verification code. Can you help me with that?
.
.
.
Agent: Thank you for providing that information. I have sent the OTP or verification code to your mobile number and email address. Please check your phone and email and let me know if you have received it.
.
.
.
Customer: It worked! Thank you so much for your help, Alex.


##### Question: Was the customer put on hold and for how long?
##### Question: What techniques did the agent use to build rapport with the customer?
##### Question: What do you think the customer’s expectations were before initiating the chat, and how well did the agent manage these expectations?
##### Question: In what ways could the company improve its systems or policies to prevent the issue from happening in the first place, based on this conversation?


## Use Case: Data Extraction
Extract customer name, contact information, and order number from conversation. 

In [15]:
extraction = ''' Please extract these details from the following conversation: 
customer name, customer phone number, customer email, and order number.
If a detail is not present, respond with None for that detail.
Here is the conversation: {}'''
extraction_prompts = [extraction.format(conversation) for conversation in df.conversation.values]

In [16]:
# Generate answers and trustworthiness scores
extractions = tlm.prompt(extraction_prompts)

Querying TLM... 100%|███████████████████████████████████████████████████████████████████████|


In [17]:
# Add results to df
df_extr = df.copy()
df_extr["extraction"] = [entry['response'] for entry in  extractions]
df_extr["extraction_score"] = [entry['trustworthiness_score'] for entry in  extractions]
df_extr["extraction_expl"] = [entry['log']['explanation'] for entry in  extractions]

#### TLM Responses with Lowest Trustworthiness Score

In [18]:
low_trust_extractions = df_extr.sort_values(by="extraction_score", ascending=True).head(3)
low_trust_extractions

Unnamed: 0,id,conversation,extraction,extraction_score,extraction_expl
79,BC62,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How can I assist you today?\n\nCustomer: Hi Rachel, I received a wrong ...",- Customer Name: None\n- Customer Phone Number: None\n- Customer Email: None\n- Order Number: BB987654321,0.655313,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
89,BO55,"Agent: Hello, thank you for contacting BrownBox customer support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi, my name is Emily,...",- Customer Name: Emily\n- Customer Phone Number: None\n- Customer Email: support@brownbox.com\n- Order Number: BB987654321,0.793047,"The proposed answer correctly identifies the customer name as ""Emily,"" which is explicitly mentioned in the conversation. The order number ""BB9876..."
7,QV90,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I bought an air condi...",Customer Name: None \nCustomer Phone Number: None \nCustomer Email: None \nOrder Number: 123456,0.888119,Did not find a reason to doubt trustworthiness.


#### LLM responds with the order number the customer gave, which the agent later corrected.

In [20]:
print("TLM Trust Score: ", low_trust_extractions.extraction_score.values[0], "\n")
print("TLM Response:\n",low_trust_extractions.extraction.values[0], "\n")
print("TLM Explanation: ", low_trust_extractions.extraction_expl.values[0])

TLM Trust Score:  0.6553132742899915 

TLM Response:
 - Customer Name: None
- Customer Phone Number: None
- Customer Email: None
- Order Number: BB987654321 

TLM Explanation:  This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
Customer name: None 
Customer phone number: None 
Customer email: None 
Order number: BB987654324.


In [21]:
print_conversation(low_trust_extractions.conversation.values[0], [3,4])

Customer: My order number is BB987654321. I received a Mixer Grinder with model number MG1234 instead of the Wet Grinder I ordered.
.
.
.
Agent: Thank you for providing that information, sir. That one is wrong, it is actually BB987654324. Let me check the availability of the Wet Grinder for you. Please hold on for a moment while I access the information.
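
The extraction responses in this section come back as free text in two slightly different layouts (with and without leading dashes). A minimal sketch for turning them into dictionaries so that fields can be compared is shown below. `parse_extraction` is a hypothetical helper, and the assumption is that every field sits on its own `Field: value` line.

```python
# Hypothetical parser for "Field: value" lines, with or without a leading "- ".
def parse_extraction(text):
    """Return {lowercased field name: value}; the string 'None' becomes None."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value"
        value = value.strip().rstrip(".")  # alternate responses may end with a period
        fields[key.strip().lower()] = None if value == "None" else value
    return fields

# The TLM response and the alternate response shown above.
response = "- Customer Name: None\n- Customer Phone Number: None\n- Customer Email: None\n- Order Number: BB987654321"
alternate = "Customer name: None \nCustomer phone number: None \nCustomer email: None \nOrder number: BB987654324."

a, b = parse_extraction(response), parse_extraction(alternate)
print({k: (a[k], b[k]) for k in a if a[k] != b[k]})
# {'order number': ('BB987654321', 'BB987654324')}
```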


#### TLM Responses with Highest Trustworthiness Score

In [22]:
high_trust_extractions = df_extr.sort_values(by="extraction_score", ascending=False).head(3)
high_trust_extractions

Unnamed: 0,id,conversation,extraction,extraction_score,extraction_expl
33,GR70,"Customer: Hi, I had recently ordered an Oven Toaster Grills (OTG) from your website, and I wanted to confirm the order status.\n\nAgent: Hello, th...",- Customer Name: Emily\n- Customer Phone Number: None\n- Customer Email: emily123@gmail.com\n- Order Number: BB987654321,0.98563,Did not find a reason to doubt trustworthiness.
81,HS06,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How can I assist you today?\n\nCustomer: Hi Sarah, I purchased an electr...",Customer Name: None \nCustomer Phone Number: 555-1234 \nCustomer Email: jane.doe@email.com \nOrder Number: 123456,0.985371,Did not find a reason to doubt trustworthiness.
84,SI65,"Customer: Hi, I have a concern about my recent purchase of a speaker from your website.\n\nAgent: Hello, thank you for calling BrownBox customer s...",- Customer Name: Alex\n- Customer Phone Number: None\n- Customer Email: alex123@gmail.com\n- Order Number: BB123456789,0.984027,Did not find a reason to doubt trustworthiness.


In [23]:
print("TLM Trust Score: ", high_trust_extractions.extraction_score.values[0], "\n")
print("TLM Response:\n",high_trust_extractions.extraction.values[0], "\n")
print("TLM Explanation: ", high_trust_extractions.extraction_expl.values[0])

TLM Trust Score:  0.985629710696906 

TLM Response:
 - Customer Name: Emily
- Customer Phone Number: None
- Customer Email: emily123@gmail.com
- Order Number: BB987654321 

TLM Explanation:  Did not find a reason to doubt trustworthiness.


In [24]:
print_conversation(high_trust_extractions.conversation.values[0], [2,4])

Customer: Yes, my name is Emily, and my email address is emily123@gmail.com.
.
.
.
Customer: Yes, my order number is BB987654321.


## Use Case: Classification

#### Question: Determine the sentiment of the overall conversation.

In [25]:
sentiment = '''Please classify the overall sentiment of the customer in the following conversation.
You can choose from 'neutral', 'frustrated', 'negative', or 'positive'.
Please only respond with the sentiment only with no leading or trailing text.
Here is the conversation: {}'''

sentiment_prompts = [sentiment.format(conversation) for conversation in df.conversation.values]

In [26]:
# Generate answers and trustworthiness scores
sentiments = tlm.prompt(sentiment_prompts)

Querying TLM... 100%|███████████████████████████████████████████████████████████████████████|


In [27]:
# Add results to df
df_sent = df.copy()
df_sent["sentiment"] = [entry['response'] for entry in  sentiments]
df_sent["sentiment_score"] = [entry['trustworthiness_score'] for entry in  sentiments]
df_sent["sentiment_expl"] = [entry['log']['explanation'] for entry in  sentiments]
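
The prompt restricts answers to four labels, but nothing in the notebook checks that the responses comply. A small validation sketch follows; `normalize_label` is a hypothetical helper, and the cleanup rules (trimming whitespace, quotes, and a trailing period) are assumptions about how a response might deviate.

```python
# Labels allowed by the sentiment prompt above.
ALLOWED_SENTIMENTS = {"neutral", "frustrated", "negative", "positive"}

def normalize_label(response, allowed=ALLOWED_SENTIMENTS):
    """Return the cleaned label if it is in the allowed set, else None."""
    label = response.strip().strip("'\".").lower()
    return label if label in allowed else None

print(normalize_label(" Positive. "))   # positive
print(normalize_label("mostly happy"))  # None
```

Rows that normalize to `None` could be flagged for review alongside rows with low trust scores.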

#### TLM Responses with Lowest Trustworthiness Score

In [28]:
low_trust_sentiment = df_sent.sort_values(by="sentiment_score", ascending=True).head(3)
low_trust_sentiment

Unnamed: 0,id,conversation,sentiment,sentiment_score,sentiment_expl
47,UW82,"Customer: Hi, I am trying to find my invoice for the mobile I purchased from BrownBox, but I can't seem to locate it.\n\nAgent: Hello, thank you f...",neutral,0.433383,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
30,TR30,"Agent: Thank you for calling BrownBox Customer Support. My name is John. How can I assist you today?\n\nCustomer: Hello John, my name is Mike and ...",neutral,0.452695,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
20,GW48,Agent: Thank you for contacting BrownBox customer support. My name is Sarah. How can I assist you today?\n\nCustomer: Hi Sarah. I received a wrong...,neutral,0.473304,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...


#### Conversation is somewhat ambiguous; it could be neutral or positive.

In [29]:
print("TLM Trust Score: ", low_trust_sentiment.sentiment_score.values[0], "\n")
print("TLM Response:\n",low_trust_sentiment.sentiment.values[0], "\n")
print("TLM Explanation: ", low_trust_sentiment.sentiment_expl.values[0])

TLM Trust Score:  0.433383202539946 

TLM Response:
 neutral 

TLM Explanation:  This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
positive.


In [30]:
print(low_trust_sentiment.sentiment.values[0])

neutral


In [31]:
print(low_trust_sentiment.sentiment_expl.values[0])

This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
positive.


In [32]:
print_conversation(low_trust_sentiment.conversation.values[0], [5,8,10,12])

Customer: I have checked my account, but I can't seem to find the invoice. Can you please help me locate it?
.
.
.
Customer: Okay, thank you. But I need the invoice urgently as I have to submit it to my company.
.
.
.
Customer: Yes, that would be great. Thank you.
.
.
.
Customer: No, that's all for now. Thank you for your help.


#### TLM Responses with Highest Trustworthiness Score

In [33]:
high_trust_sentiment = df_sent.sort_values(by="sentiment_score", ascending=False).head(3)
high_trust_sentiment

Unnamed: 0,id,conversation,sentiment,sentiment_score,sentiment_expl
97,SF57,"Customer: Hi, I have a query regarding the loyalty points for my recent purchase of a washing machine.\n\nAgent: Hello, thank you for contacting B...",positive,0.999913,Did not find a reason to doubt trustworthiness.
10,RH04,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I am having trouble w...",positive,0.999913,Did not find a reason to doubt trustworthiness.
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase...",positive,0.999913,Did not find a reason to doubt trustworthiness.


In [34]:
print_conversation(high_trust_sentiment.conversation.values[0], [7,11,13])

Customer: Okay, that's helpful. Can you please guide me on how to redeem these points?
.
.
.
Customer: Okay, that's great. Thank you for your help, Sarah.
.
.
.
Customer: No, that's all. Thank you again.


#### Question: Classify conversations into one of 5 support categories:
1. Warranty and Product Support
2. Returns and Exchanges
3. Shipping and Delivery
4. Order and Payment
5. Account Management

In [35]:
categories = '''Please classify the following conversation between a customer and service agent
into one of the following support categories: {}
Please only respond with the support category and nothing else. You may only respond with one of 'Warranty and Product Support', 'Returns and Exchanges', 'Shipping and Delivery', 'Order and Payment', or 'Account Management'.
Here is the conversation: {}'''
category_descriptions = '''
Warranty and Product Support: Handles warranty claims, product registration, support for service issues, and warranty term disputes.
Returns and Exchanges: Covers returning or exchanging items, refund timelines, shipping issues, and non-returnable products.
Shipping and Delivery: Focuses on delivery options, failed deliveries, shipping restrictions, and pickup or shipping problems.
Order and Payment: Involves order placement, cancellations, pricing issues, payment options, and refund or invoice concerns.
Account Management: Addresses login issues, account updates, verification processes, reactivation, and loyalty program questions.
'''
categories_prompts = [categories.format(category_descriptions, conversation) for conversation in df.conversation.values]
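
The allowed labels and their descriptions are written out separately in the cell above, so they can drift apart. One possible arrangement, sketched here, keeps both in a single dict and derives the two strings from it. The names `CATEGORY_DESCRIPTIONS` and `quoted_list` are hypothetical; the category text is copied from the prompt.

```python
# Single source of truth for category names and descriptions (copied from above).
CATEGORY_DESCRIPTIONS = {
    "Warranty and Product Support": "Handles warranty claims, product registration, support for service issues, and warranty term disputes.",
    "Returns and Exchanges": "Covers returning or exchanging items, refund timelines, shipping issues, and non-returnable products.",
    "Shipping and Delivery": "Focuses on delivery options, failed deliveries, shipping restrictions, and pickup or shipping problems.",
    "Order and Payment": "Involves order placement, cancellations, pricing issues, payment options, and refund or invoice concerns.",
    "Account Management": "Addresses login issues, account updates, verification processes, reactivation, and loyalty program questions.",
}

def quoted_list(labels):
    """Format labels as 'A', 'B', or 'C' (expects at least two labels)."""
    quoted = [f"'{label}'" for label in labels]
    return ", ".join(quoted[:-1]) + ", or " + quoted[-1]

allowed_labels = quoted_list(list(CATEGORY_DESCRIPTIONS))
description_block = "\n".join(f"{name}: {text}" for name, text in CATEGORY_DESCRIPTIONS.items())
print(allowed_labels)
```

Both strings can then be substituted into the prompt template, so adding or renaming a category only requires editing the dict.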

In [36]:
# Generate answers and trustworthiness scores
categories = tlm.prompt(categories_prompts)

Querying TLM... 100%|███████████████████████████████████████████████████████████████████████|


In [37]:
# Add results to df
df_cat = df.copy()
df_cat["category"] = [entry['response'] for entry in categories]
df_cat["category_score"] = [entry['trustworthiness_score'] for entry in categories]
df_cat["category_expl"] = [entry['log']['explanation'] for entry in categories]

#### TLM Responses with Lowest Trustworthiness Score

In [38]:
low_trust_categories = df_cat.sort_values(by="category_score", ascending=True).head(3)
low_trust_categories.head(3)

Unnamed: 0,id,conversation,category,category_score,category_expl
22,KP59,"Customer: Hi, I am calling to check the status of my order for a pram/stroller that I placed last week.\n\nAgent: Hello, thank you for calling Bro...",Shipping and Delivery,0.486655,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
51,PR33,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I am trying to club o...",Returns and Exchanges,0.56244,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...
38,NQ67,"Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?\n\nCustomer: Hi Sarah, I placed an order for...",Shipping and Delivery,0.615868,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...


In [39]:
print(low_trust_categories.category.values[0])

Shipping and Delivery


In [40]:
print(low_trust_categories.category_expl.values[0])

This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
Order and Payment.


In [41]:
print_conversation(low_trust_categories.conversation.values[0], [0])

Customer: Hi, I am calling to check the status of my order for a pram/stroller that I placed last week.


#### TLM Responses with Highest Trustworthiness Score

In [42]:
high_trust_categories = df_cat.sort_values(by="category_score", ascending=False).head(3)
high_trust_categories

Unnamed: 0,id,conversation,category,category_score,category_expl
25,IH81,Agent: Thank you for contacting BrownBox customer support. My name is Alex. How can I assist you today?\n\nCustomer: Hi Alex. I recently changed m...,Account Management,0.999632,Did not find a reason to doubt trustworthiness.
28,XZ51,"Agent: Hello, thank you for contacting BrownBox customer support. How may I assist you today?\n\nCustomer: Hi, I am trying to reactivate my accoun...",Account Management,0.999632,Did not find a reason to doubt trustworthiness.
56,KI33,"Agent: Thank you for calling BrownBox Customer Support. My name is Mark. How can I assist you today?\n\nCustomer: Hi Mark, I'm calling because I n...",Account Management,0.999632,Did not find a reason to doubt trustworthiness.


In [43]:
print(high_trust_categories.category.values[0])

Account Management


In [44]:
print_conversation(high_trust_categories.conversation.values[0], [1])

Customer: Hi Alex. I recently changed my email address and I'm having trouble logging into my account to track my Shorts order. Can you help me with that?


## Use Case: Accuracy Improvement
How does TLM's Trustworthiness Score increase LLM accuracy?

#### Let's revisit the category classification task

In [45]:
subset = df_cat[["id", "conversation", "category", "category_score"]]
subset.head(3)

Unnamed: 0,id,conversation,category,category_score
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase...",Warranty and Product Support,0.999231
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo...",Returns and Exchanges,0.993865
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased...",Shipping and Delivery,0.999483


#### Import and add ground truth category labels
In this example, we have ground truth labels for the correct category for all of the conversations.

In [46]:
category_ground_truth = pd.read_csv("customer-service-chat-categories.csv")
category_df = pd.merge(subset, category_ground_truth, on='id', how='left')
category_df.head(3)

Unnamed: 0,id,conversation,category,category_score,ground_truth
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase...",Warranty and Product Support,0.999231,Warranty and Product Support
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo...",Returns and Exchanges,0.993865,Returns and Exchanges
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased...",Shipping and Delivery,0.999483,Shipping and Delivery


#### Compute Baseline Accuracy

In [47]:
# accuracy_score expects (y_true, y_pred); the result is the same either way
base_acc = accuracy_score(category_df["ground_truth"], category_df["category"])
print("Base Accuracy: ", f"{base_acc:.1%}")

Base Accuracy:  94.5%


#### Compute Accuracy at Various TLM Score Thresholds
Keeping only predictions whose TLM trust score exceeds a threshold raises classification accuracy, and accuracy rises further as the threshold becomes more stringent.

In [48]:
category_df_70 = category_df[category_df.category_score > 0.7]
category_df_80 = category_df[category_df.category_score > 0.8]
category_df_90 = category_df[category_df.category_score > 0.9]
acc_70 = accuracy_score(category_df_70["ground_truth"], category_df_70["category"])
acc_80 = accuracy_score(category_df_80["ground_truth"], category_df_80["category"])
acc_90 = accuracy_score(category_df_90["ground_truth"], category_df_90["category"])
print("Accuracy of all predictions: ", f"{base_acc:.1%}")
print("Accuracy of predictions with trust score >70%: ", f"{acc_70:.1%}")
print("Accuracy of predictions with trust score >80%: ", f"{acc_80:.1%}")
print("Accuracy of predictions with trust score >90%: ", f"{acc_90:.1%}")

Accuracy of all predictions:  94.5%
Accuracy of predictions with trust score >70%:  97.2%
Accuracy of predictions with trust score >80%:  98.1%
Accuracy of predictions with trust score >90%:  100.0%
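
A stricter threshold also leaves fewer predictions, which the numbers above do not show. The sketch below reports accuracy together with coverage (the fraction of predictions kept). `accuracy_and_coverage` is a hypothetical helper, and the toy data is made up and unrelated to the results above.

```python
def accuracy_and_coverage(predictions, labels, scores, threshold):
    """Accuracy on predictions with score > threshold, plus the fraction kept."""
    kept = [(p, y) for p, y, s in zip(predictions, labels, scores) if s > threshold]
    if not kept:
        return None, 0.0  # nothing passed the threshold
    accuracy = sum(p == y for p, y in kept) / len(kept)
    coverage = len(kept) / len(predictions)
    return accuracy, coverage

# Made-up toy data for illustration only.
preds = ["A", "B", "A", "B"]
truth = ["A", "B", "B", "B"]
scores = [0.95, 0.85, 0.60, 0.75]

for threshold in (0.0, 0.7, 0.8, 0.9):
    print(threshold, accuracy_and_coverage(preds, truth, scores, threshold))
```

Predictions that fall below the chosen threshold would still need another path, such as human review.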


## Use Case: Compute Trustworthiness Scores for any LLM
Use TLM to get a trustworthiness score for any (prompt, response) pair!

#### Single Conversation

In [49]:
conversation = df[df.id=="NQ67"].conversation.values[0]
print(conversation)

Agent: Thank you for calling BrownBox Customer Support. My name is Sarah. How may I assist you today?

Customer: Hi Sarah, I placed an order for a Smart Watch, and I want to confirm if it has been processed.

Agent: Sure, I can help you with that. May I know your name and order number, please?

Customer: My name is Jane, and my order number is BB123456.

Agent: Thank you, Jane. Let me check the order status for you. (puts customer on hold for a minute) I can confirm that your order has been processed, and it is currently in transit. You should receive it within the next three business days.

Customer: Great! Thank you for checking that.

Agent: You're welcome, Jane. Is there anything else I can assist you with?

Customer: No, that's all. Thank you for your help.

Agent: You're welcome. If you have any further questions, please don't hesitate to call us back. Have a great day!

Customer: You too. Goodbye!

Agent: Goodbye!


#### Question Answering

In [50]:
prompt = "Given the following conversation, respond with a very short summarization. Conversation: {}".format(conversation)

response_1 = "The customer confirmed her Smart Watch order was processed and is in transit, expected in three days."
response_2 = "Customer wants to check the shipping speed of her Smart Watch. Agent confirmed the shipping speed is set to 3 days."
response_3 = "Customer wants to cancel their order and get a refund. Agent processed the cancellation."

trust_score_1 = tlm.get_trustworthiness_score(prompt, response_1)
trust_score_2 = tlm.get_trustworthiness_score(prompt, response_2)
trust_score_3 = tlm.get_trustworthiness_score(prompt, response_3)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"TLM Score for Response 1: {trust_score_1['trustworthiness_score']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 2: {trust_score_2['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 2: {trust_score_2['log']['explanation']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 3: {trust_score_3['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 3: {trust_score_3['log']['explanation']}", "\n")

TLM Score for Response 1: 0.9772217758109922 

----------------------------------------------------------------------------

TLM Score for Response 2: 0.6733846174627715 

TLM Explanation for Response 2: The proposed answer inaccurately summarizes the conversation. The customer, Jane, did not specifically inquire about the shipping speed; she asked for confirmation of her order's processing status. The agent confirmed that the order was processed and mentioned that it is in transit, with an expected delivery within three business days. While the answer does mention the three-day delivery timeframe, it misrepresents the customer's original question as being about shipping speed rather than order confirmation. Therefore, the proposed answer does not accurately reflect the content of the conversation. 

----------------------------------------------------------------------------

TLM Score for Response 3: 0.3746824115079491 

TLM Explanation for Response 3: incorrect because the conversat

#### Data Extraction

In [51]:
prompt = "Given the following conversation, what is the name of the customer service agent? Conversation: {}".format(conversation)

response_1 = "Sarah"
response_2 = "Sara"
response_3 = "Jane"

trust_score_1 = tlm.get_trustworthiness_score(prompt, response_1)
trust_score_2 = tlm.get_trustworthiness_score(prompt, response_2)
trust_score_3 = tlm.get_trustworthiness_score(prompt, response_3)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"TLM Score for Response 1: {trust_score_1['trustworthiness_score']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 2: {trust_score_2['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 2: {trust_score_2['log']['explanation']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 3: {trust_score_3['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 3: {trust_score_3['log']['explanation']}", "\n")

TLM Score for Response 1: 0.9965960642671847 

----------------------------------------------------------------------------

TLM Score for Response 2: 0.6417846821840848 

TLM Explanation for Response 2: The proposed answer "Sara" is incorrect because the name of the customer service agent mentioned in the conversation is "Sarah," not "Sara." The conversation clearly states, "Agent: Thank you for calling BrownBox Customer Support. My name is Sarah." Therefore, the correct name is explicitly provided in the text, and any variation, such as "Sara," does not match the original name given. This makes the proposed answer factually incorrect. 

----------------------------------------------------------------------------

TLM Score for Response 3: 0.0017456943853970091 

TLM Explanation for Response 3: The question asks for the name of the customer service agent in the provided conversation. The agent introduces herself at the beginning of the conversation by saying, "My name is Sarah." There

#### Classification

In [52]:
prompt = '''Given the following conversation, classify the complexity of their request.
Choose from "Easy", "Medium", or "Hard".
Conversation: {}'''.format(conversation)

response_1 = "Easy"
response_2 = "Medium"
response_3 = "Hard"

trust_score_1 = tlm.get_trustworthiness_score(prompt, response_1)
trust_score_2 = tlm.get_trustworthiness_score(prompt, response_2)
trust_score_3 = tlm.get_trustworthiness_score(prompt, response_3)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"TLM Score for Response 1: {trust_score_1['trustworthiness_score']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 2: {trust_score_2['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 2: {trust_score_2['log']['explanation']}", "\n")
print("----------------------------------------------------------------------------\n")
print(f"TLM Score for Response 3: {trust_score_3['trustworthiness_score']}", "\n")
print(f"TLM Explanation for Response 3: {trust_score_3['log']['explanation']}", "\n")

TLM Score for Response 1: 0.9998346058989727 

----------------------------------------------------------------------------

TLM Score for Response 2: 0.634548595070959 

TLM Explanation for Response 2: This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 
Easy. 

----------------------------------------------------------------------------

TLM Score for Response 3: 0.18352104920638546 

TLM Explanation for Response 3: The request made by the customer in the conversation is straightforward and involves confirming the status of an order, which is a common task in customer support. The agent was able to quickly retrieve the information and provide a clear answer regarding the order's status. The complexity of the request is low because it involves a simple inquiry that does not require extensive problem-solving or technical knowledge. Therefo
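
One way to act on scores like these is to rank the candidate responses and flag any that fall below a cutoff for human review. The sketch below assumes a cutoff of 0.8, which is an arbitrary illustration rather than a recommended value. The helper name `triage` is hypothetical. The scores are the classification results printed above, rounded to four decimals.

```python
def triage(scores, threshold=0.8):
    """Split candidate responses into accepted and flagged lists by trust score.

    Both lists are ordered from most to least trustworthy.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    accepted = [name for name, score in ranked if score >= threshold]
    flagged = [name for name, score in ranked if score < threshold]
    return accepted, flagged

# Trust scores from the classification example (rounded)
scores = {"Easy": 0.9998, "Medium": 0.6345, "Hard": 0.1835}

accepted, flagged = triage(scores)
print("Accepted:", accepted)
print("Flagged for review:", flagged)
```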

## View Final Dataset

In [61]:
df_final = pd.merge(df, df_res, on=['id', 'conversation'])
df_final = pd.merge(df_final, df_extr, on=['id', 'conversation'])
df_final = pd.merge(df_final, df_sent, on=['id', 'conversation'])
df_final = pd.merge(df_final, df_cat, on=['id', 'conversation'])
df_final.head()

Unnamed: 0,id,conversation,resolution,resolution_score,resolution_expl,extraction,extraction_score,extraction_expl,sentiment,sentiment_score,sentiment_expl,category,category_score,category_expl
0,RU91,"Agent: Thank you for calling BrownBox Customer Support. My name is Rachel. How may I assist you today?\n\nCustomer: Hi Rachel, I recently purchase...","Resolved. The customer received clear information about the warranty start date and duration, addressing their question satisfactorily.",0.968643,Did not find a reason to doubt trustworthiness.,Customer Name: None \nCustomer Phone Number: None \nCustomer Email: None \nOrder Number: BB12345678,0.934317,Did not find a reason to doubt trustworthiness.,positive,0.999913,Did not find a reason to doubt trustworthiness.,Warranty and Product Support,0.999231,Did not find a reason to doubt trustworthiness.
1,XL37,"Customer: Hello, I have an issue with my recent purchase from BrownBox.\n\nAgent: Good afternoon! Thank you for contacting BrownBox customer suppo...","Unresolved. The customer was unable to click the 'Cancel' button and could not initiate the return process, and while the agent provided troublesh...",0.956172,Did not find a reason to doubt trustworthiness.,- Customer Name: John\n- Customer Phone Number: None\n- Customer Email: None\n- Order Number: BB123456789,0.977936,Did not find a reason to doubt trustworthiness.,neutral,0.999689,Did not find a reason to doubt trustworthiness.,Returns and Exchanges,0.993865,Did not find a reason to doubt trustworthiness.
2,OJ95,"Agent: Hello, thank you for calling BrownBox Customer Support. My name is Emily, how may I assist you today?\n\nCustomer: Hi, I recently purchased...","Resolved. The customer was able to opt for next-day delivery, which addressed their request for a faster delivery option.",0.959487,Did not find a reason to doubt trustworthiness.,- Customer Name: None\n- Customer Phone Number: None\n- Customer Email: None\n- Order Number: 123456789,0.931483,Did not find a reason to doubt trustworthiness.,neutral,0.719918,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...,Shipping and Delivery,0.999483,Did not find a reason to doubt trustworthiness.
3,YR96,"Agent: Thank you for calling BrownBox Customer Support. My name is Lisa. How may I assist you today?\n\nCustomer: Hi, Lisa. I need help with a ret...",Resolved. The customer was informed about the return fee and accepted a $10 discount on their next purchase as compensation.,0.969288,Did not find a reason to doubt trustworthiness.,Customer Name: None \nCustomer Phone Number: None \nCustomer Email: None \nOrder Number: BB67890,0.931934,Did not find a reason to doubt trustworthiness.,neutral,0.596097,This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that th...,Returns and Exchanges,0.999358,Did not find a reason to doubt trustworthiness.
4,XZ39,"Customer: Hello, I am calling to inquire about my return and exchange order.\n\nAgent: Good afternoon, thank you for contacting BrownBox customer ...","Unresolved. The customer is still waiting for their refund due to complications with the return label, and the issue has not been resolved during ...",0.953289,Did not find a reason to doubt trustworthiness.,Here are the extracted details from the conversation:\n\n- Customer Name: John\n- Customer Phone Number: 9876543210\n- Customer Email: john@email....,0.973175,Did not find a reason to doubt trustworthiness.,frustrated,0.999861,Did not find a reason to doubt trustworthiness.,Returns and Exchanges,0.999358,Did not find a reason to doubt trustworthiness.


## Appendix + Advanced Usage

### Batch Prompting for Large Data
If your dataset has more than a few thousand examples, we recommend running TLM in mini-batches so that intermediate results are checkpointed.

The helper function below runs TLM in mini-batches and appends each batch of results to a CSV file. We recommend batch sizes of approximately 1000, but feel free to adjust this number to suit your use case. If a run fails, re-execute the function and it will resume from the last checkpoint.

In [62]:
import os

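def batch_prompt(tlm, input_path: str, output_path: str, prompt_col_name: str, batch_size: int = 1000):
    """Run TLM over a CSV in mini-batches, appending results to output_path after each batch.

    `tlm` is the object returned by studio.TLM(). Re-running resumes after the rows already saved.
    """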
    if os.path.exists(output_path):
        start_idx = len(pd.read_csv(output_path))
    else:
        start_idx = 0

    df_batched = pd.read_csv(input_path, chunksize=batch_size)
    curr_idx = 0

    for curr_batch in df_batched:
        # if results already exist for the entire batch
        if curr_idx + len(curr_batch) <= start_idx:
            curr_idx += len(curr_batch)
            continue

        # if results exist for only part of the batch, skip the rows already processed
        elif curr_idx < start_idx:
            curr_batch = curr_batch[start_idx - curr_idx:]
            curr_idx = start_idx

        results = tlm.try_prompt(curr_batch[prompt_col_name].to_list())
        results_df = pd.DataFrame(results)
        results_df.to_csv(output_path, mode='a', index=False, header=not os.path.exists(output_path))
        
        curr_idx += len(curr_batch)
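
The resume logic in `batch_prompt` can be easier to follow in isolation. The sketch below reproduces only the index arithmetic in plain Python, with no TLM calls or file I/O. The function name `pending_ranges` and the row counts are hypothetical. It skips batches that are fully checkpointed and trims a partially completed batch to its unprocessed rows.

```python
def pending_ranges(total_rows, batch_size, rows_done):
    """Yield (start, stop) row ranges that still need to be processed.

    `rows_done` plays the role of `start_idx` in `batch_prompt`: the number
    of result rows already saved in the output file.
    """
    for start in range(0, total_rows, batch_size):
        stop = min(start + batch_size, total_rows)
        if stop <= rows_done:
            continue  # the whole batch is already checkpointed
        # a partially completed batch starts at the first unprocessed row
        yield max(start, rows_done), stop

# Fresh run: every batch is pending
print(list(pending_ranges(2500, 1000, 0)))

# Resumed run after 1400 rows were saved: the second batch is trimmed
print(list(pending_ranges(2500, 1000, 1400)))
```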

### Quality Presets
You can trade off compute against quality via the `quality_preset` argument. Higher quality presets produce better LLM responses and trustworthiness scores, but require more computation.

| Quality Preset | LLM Response Quality | Trustworthiness Score Quality |
|----------------|----------------------|-------------------------------|
| Best           | Best                 | Good                          |
| High           | Improved             | Good                          |
| Medium         | Standard             | Good                          |
| Low            | Standard             | Fair                          |
| Base           | Standard             | Lowest latency                |


In [None]:
tlm = studio.TLM(
    quality_preset="best"  # supported quality presets are: 'best','high','medium','low','base'
)

# Run a single prompt using the preset parameters:
output = tlm.prompt("<your prompt>")

# Or run multiple prompts simultaneously in a batch:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])

### Trustworthy Language Model Lite
Using a `TLMLite` object in place of a `TLM` enables the use of different LLMs for generating the response vs scoring its trustworthiness. Consider this hybrid approach to get high-quality responses (from a more expensive model), but cheaper trustworthiness score evaluations (via a smaller model).

In [None]:
tlm_lite = studio.TLMLite(response_model="gpt-4o", quality_preset="low", options={"model": "gpt-4o-mini"})

output = tlm_lite.prompt("<your prompt>")

### Additional Arguments
- **model** (str, default = "gpt-4o-mini"): underlying LLM to use (better models will yield better results). Models currently supported include "gpt-4o-mini", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4o", "claude-3-haiku".

- **max_tokens** (int, default = 512): the maximum number of tokens to generate in the TLM response. This number will impact the maximum number of tokens you will see in the output response, and also the number of tokens that can be generated internally within the TLM (to estimate the trustworthiness score). Higher values here can produce better (more reliable) TLM responses and trustworthiness scores, but at higher costs/runtimes. If you are experiencing token limit errors while using the TLM (especially on higher quality presets), consider lowering this number. This parameter must be between 64 and 512.

- **num_candidate_responses** (int, default = 1): how many alternative candidate responses are internally generated by TLM. TLM scores the trustworthiness of each candidate response, and then returns the most trustworthy one. Higher values here can produce better (more accurate) responses from the TLM, but at higher costs/runtimes (and internally consumes more tokens). This parameter must be between 1 and 20. When it is 1, TLM simply returns a standard LLM response and does not attempt to improve further.

- **num_consistency_samples** (int, default = 8): the amount of internal sampling to evaluate LLM-response-consistency. This consistency forms a big part of the returned trustworthiness score, helping quantify the epistemic uncertainty associated with strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response. Higher values here produce better (more reliable) TLM trustworthiness scores, but at higher costs/runtimes. This parameter must be between 0 and 20.

- **use_self_reflection** (bool, default = True): whether the LLM is asked to self-reflect upon the response it generated and self-evaluate this response. This self-reflection forms a big part of the trustworthiness score, helping quantify aleatoric uncertainty associated with challenging prompts and helping catch answers that are obviously incorrect/bad for a prompt asking for a well-defined answer that LLMs should be able to handle. Setting this to False disables the use of self-reflection and may produce worse TLM trustworthiness scores, but will reduce costs/runtimes.

- **log** (List[str], default = []): optionally specify additional logs or metadata to return. For instance, include "explanation" here to get explanations of why a response is scored with low trustworthiness.
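
The numeric limits listed above can be checked before an options dictionary is sent anywhere. The following is a minimal sketch of such a check. The helper `check_tlm_options` is hypothetical and is not part of `cleanlab_studio`. The ranges are copied from the descriptions above, and the library may enforce additional rules not captured here.

```python
def check_tlm_options(options):
    """Return a list of problems found in a TLM-style options dict."""
    # Inclusive ranges taken from the argument descriptions above
    int_ranges = {
        "max_tokens": (64, 512),
        "num_candidate_responses": (1, 20),
        "num_consistency_samples": (0, 20),
    }
    known = set(int_ranges) | {"model", "use_self_reflection", "log"}

    problems = [f"unknown option: {key}" for key in options if key not in known]
    for key, (low, high) in int_ranges.items():
        if key in options and not low <= options[key] <= high:
            problems.append(f"{key} must be between {low} and {high}")
    return problems

options = {"model": "gpt-4o", "max_tokens": 256, "log": ["explanation"]}
print(check_tlm_options(options))               # no problems expected
print(check_tlm_options({"max_tokens": 1024}))  # out of range
```

A dictionary that passes this check could then be supplied through the `options` argument, in the same way the `TLMLite` example passes `options={"model": "gpt-4o-mini"}`.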
