In [1]:
# Install dependencies
!pip install pandas gspread oauth2client
!pip install --upgrade pip
!pip uninstall -y cleanlab-studio
!pip install git+https://github.com/cleanlab/cleanlab-studio.git@tlm
# !pip install --upgrade cleanlab-studio
import time
from IPython import display
display.clear_output()

from cleanlab_studio import Studio

In [2]:
with open('api.txt') as file:
    key = file.read().strip()  # strip the trailing newline that text editors usually add
studio = Studio(key)  # Cleanlab Studio API key from https://app.cleanlab.ai/account?tab=General
# tlm = studio.TLM(quality_preset="high", options={"model": "gpt-4"}) # on high quality, with gpt-4
tlm = studio.TLM(quality_preset="high")
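
Keeping the key in a file next to the notebook is convenient but easy to commit by accident. As a minimal sketch, the helper below prefers an environment variable and falls back to the file. The names `load_api_key` and `CLEANLAB_API_KEY` are assumptions made for illustration, not part of the Cleanlab Studio API.

```python
import os
from pathlib import Path


def load_api_key(path="api.txt", env_var="CLEANLAB_API_KEY"):
    """Return the API key from an environment variable, else from a file.

    Both `env_var` and this helper are hypothetical names for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to the key file; strip() drops any trailing newline.
    key = Path(path).read_text().strip()
    if not key:
        raise ValueError(f"No API key found in ${env_var} or {path!r}")
    return key
```

With this in place, the client could be created as `studio = Studio(load_api_key())`.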

## Quick comparison: OpenAI's GPT API (no reliability scores) vs Cleanlab TLM API (trust for every output)

In [3]:
# Plain OpenAI ChatGPT response via the 'base' preset (no trustworthiness score)
chatgpt = studio.TLM(quality_preset='base')
chatgpt.prompt("How many Ns are there in the word enter?")

{'response': 'There are two Ns in the word "enter".',
 'trustworthiness_score': None}

In [4]:
# Runs the Cleanlab TLM, which returns a trustworthiness score with every response
tlm.prompt("How many Ns are there in the word enter?")

{'response': 'There are 1 N in the word "enter".',
 'trustworthiness_score': 0.5719891948560684}
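
One way to act on the score is to route low-trust answers to a person. The sketch below assumes the answer dictionary has the shape shown in the two outputs above. The helper name `triage` and the 0.8 threshold are illustrative choices, not recommendations from Cleanlab.

```python
def triage(answer, threshold=0.8):
    """Label a TLM-style answer dict as 'accept' or 'review'.

    `answer` is assumed to look like the outputs above:
    {'response': str, 'trustworthiness_score': float or None}.
    The default threshold is an arbitrary illustration.
    """
    score = answer.get("trustworthiness_score")
    # A missing score gives no evidence either way, so send it to a human.
    if score is None or score < threshold:
        return "review"
    return "accept"


base_answer = {"response": 'There are two Ns in the word "enter".',
               "trustworthiness_score": None}
tlm_answer = {"response": 'There are 1 N in the word "enter".',
              "trustworthiness_score": 0.5719891948560684}

print(triage(base_answer))  # review: no score available
print(triage(tlm_answer))   # review: 0.57 is below the 0.8 threshold
```

Treating a missing score as "review" is the conservative choice: without a score there is nothing to justify skipping the human check.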

## Use `quality_preset='low'` to optimize for speed/cost over accuracy

In [5]:
# Runs the Cleanlab TLM on the 'low' preset (faster and cheaper), still with trustworthiness scores
tlm_fast = studio.TLM(quality_preset='low')
tlm_fast.prompt("Cuantos anos tienes el presidente de estados unidos?")

{'response': 'No puedo responder a esa pregunta ya que mi conocimiento se basa en información hasta septiembre de 2021 y no tengo acceso a información en tiempo real. Además, el presidente de Estados Unidos puede cambiar con el tiempo.',
 'trustworthiness_score': 0.6824641149190876}

In [6]:
# Same 'low' preset on the earlier question, for comparison with the 'high' preset result
tlm_fast = studio.TLM(quality_preset='low')
tlm_fast.prompt("How many Ns are there in the word enter?")

{'response': 'There are two Ns in the word "enter".',
 'trustworthiness_score': 0.5575863269621077}

# Reliable data enrichment on documents with Cleanlab (trustworthiness scores built in for every output)

Built-in trustworthiness scores for every output show which LLM answers to trust when you run text/document workflows on arbitrary datasets.

# Read in the data

In [18]:
# Prefer the spreadsheet so the data auto-updates; if authentication is a problem, load a CSV export with pd.read_csv instead
# Link to dataset: https://docs.google.com/spreadsheets/d/1V7_fOlmixgC70W2TU6s1Kz_N0Vz0lHr29ogt9j5yQMU/edit?usp=sharing

import pandas as pd
import gspread
from oauth2client.service_account import ServiceAccountCredentials
pd.set_option('display.max_colwidth', None)
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('tlm_demo_credentials.json', scope)
gc = gspread.authorize(credentials)    # authenticate gspread to read the spreadsheet
wks = gc.open("Document Compliance Dataset").sheet1    # first worksheet of the named spreadsheet
df = pd.DataFrame(wks.get_all_values(), columns=["document_text", "issue"])
print(f'Number of documents in this compliance dataset: {len(df)}')

Number of documents in this compliance dataset: 109
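
If the Google Sheets credentials are not available, the `pd.read_csv` fallback mentioned in the cell comment could look like the sketch below. It assumes a local CSV export with no header row and exactly two columns, mirroring the column names assigned above. The helper name and the filename are hypothetical.

```python
import pandas as pd


def load_documents(csv_path):
    """Load the compliance dataset from a local CSV export.

    Assumes the file has no header row and exactly two columns, matching
    the names given to the Google Sheet columns.
    """
    return pd.read_csv(
        csv_path,
        header=None,
        names=["document_text", "issue"],
        dtype=str,
        keep_default_na=False,  # keep labels such as "none" as plain strings
    )


# Hypothetical filename; point this at your own export of the sheet.
# df = load_documents("document_compliance_dataset.csv")
```

Setting `keep_default_na=False` matters here because pandas would otherwise convert some label-like strings to missing values, which would break the later comparison against the `issue` column.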


# Let's look at some examples of these documents.
* Some of these documents have compliance issues; many don't. This task is often outsourced entirely to experts who review 100% of documents by hand.
* Here we'll automate the entire workflow and provide trustworthiness scores that help you determine what needs human review.

In [19]:
df.sample(10, random_state=0)

Unnamed: 0,document_text,issue
84,"That means there are a limited number of new opportunities for workers. India, which attends the G7 meeting of seven leading industrialised nations on Friday, is unlikely to be cowed by its newcomer status.",none
10,"Although the chip makers registration form includes marketing consent checkboxes, this method of consent is not considered freely-given because the boxes are pre-ticked by default:",GDPR
75,"Nevertheless it was enough to push down the unemployment rate to 5.2%, its lowest level since September 2001.",none
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.",HIPAA
24,"The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates.",none
100,"Nevertheless, 2.2 million Ethiopians will still need emergency assistance.",none
108,About 80% of Ethiopians depend directly or indirectly on agriculture.,none
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”",FERPA
16,"AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband.",none
86,He objected to subsidies on agriculture that make it hard for developing nations like India to compete.,none


## In practice, you don't have the "issue" column. You need to infer it automatically; all you have is the text below.

In [20]:
df.sample(10, random_state=0)[['document_text']]

Unnamed: 0,document_text
84,"That means there are a limited number of new opportunities for workers. India, which attends the G7 meeting of seven leading industrialised nations on Friday, is unlikely to be cowed by its newcomer status."
10,"Although the chip makers registration form includes marketing consent checkboxes, this method of consent is not considered freely-given because the boxes are pre-ticked by default:"
75,"Nevertheless it was enough to push down the unemployment rate to 5.2%, its lowest level since September 2001."
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed."
24,"The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates."
100,"Nevertheless, 2.2 million Ethiopians will still need emergency assistance."
108,About 80% of Ethiopians depend directly or indirectly on agriculture.
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”"
16,"AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband."
86,He objected to subsidies on agriculture that make it hard for developing nations like India to compete.


# Example 1 - Data Enrichment: find compliance issues in documents with trust for every output.

In [21]:
compliance_prompt = \
'''What type of compliance issue is most likely present in the following document?
Please restrict your answer to a one word answer and nothing else.
Your answer should be selected from the following options: HIPAA, FERPA, GDPR, none.
Please be as accurate as possible, the world depends on it.\n\nDocument below here:\n\n'''
print(compliance_prompt + df.at[0, 'document_text'])

What type of compliance issue is most likely present in the following document?
Please restrict your answer to a one word answer and nothing else.
Your answer should be selected from the following options: HIPAA, FERPA, GDPR, none.
Please be as accurate as possible, the world depends on it.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [48]:
response_col = 'Cleanlab_TLM_response (Compliance Issue)'
trustworthiness_col = 'Cleanlab_Trustworthiness_Score (Compliance Issue)'

def reliabilty_data_enrichment(prompt, response_col, trustworthiness_col):
    """Run `prompt` on every document in the global `df` using the global `tlm`,
    storing each response and trustworthiness score in the given columns."""
    
    df[response_col], df[trustworthiness_col] = None, None
    cols = ['document_text', response_col, trustworthiness_col]
    display.display(df[cols].head(20))
    for i, row in df.iterrows():

        # Only one line required to obtain prompt response and Trustworthiness score
        answer = tlm.prompt(prompt + row['document_text'])

        # Add results to dataset
        df.at[i, response_col] = answer['response']
        df.at[i, trustworthiness_col] = answer['trustworthiness_score']  # originally confidence_score        
        display.display(df[cols].iloc[(i // 20) * 20:(i // 20) * 20 + 20], clear=True)
        time.sleep(0.5)  # slow the results so you can see them happening live
    display.display(df[cols].head(20), clear=True)
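
The loop above stops at the first failed API call, which is inconvenient partway through a long run. As a hedged sketch, a small wrapper can retry and then return a placeholder so the loop continues. `safe_prompt` is a hypothetical helper. It only assumes the client exposes `.prompt(str)` and returns a dict with `'response'` and `'trustworthiness_score'` keys, as `tlm` does in this notebook.

```python
import time


def safe_prompt(client, prompt, retries=2, wait_seconds=1.0):
    """Call client.prompt(prompt), retrying if it raises.

    On repeated failure, return a placeholder dict so that a loop over
    many documents can carry on and the failed rows can be rerun later.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return client.prompt(prompt)
        except Exception as error:  # broad on purpose: failure modes are unknown here
            last_error = error
            if attempt < retries:
                time.sleep(wait_seconds)
    return {"response": None, "trustworthiness_score": None, "error": str(last_error)}
```

Inside the loop, `answer = safe_prompt(tlm, prompt + row['document_text'])` would replace the direct call. Rows whose response is `None` can then be retried separately.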

In [39]:
reliabilty_data_enrichment(compliance_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Compliance Issue),Cleanlab_Trustworthiness_Score (Compliance Issue)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,HIPAA,0.984028
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,HIPAA,0.532907
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.",HIPAA,0.998945
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.",HIPAA,0.957431
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.",FERPA,0.998279
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.",FERPA,0.861947
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.",FERPA,0.988822
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”",FERPA,0.97571
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,none,0.586701
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account",GDPR,0.750798
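
The prompt asks for exactly one of four labels, but an LLM may still vary the case or add punctuation. The sketch below maps a raw response onto the allowed labels before it is compared with the ground truth. `normalize_label` is an illustrative helper, not something the notebook or the TLM provides.

```python
ALLOWED_LABELS = ("HIPAA", "FERPA", "GDPR", "none")


def normalize_label(response):
    """Map a raw model response onto one of ALLOWED_LABELS, else None."""
    if response is None:
        return None
    # Drop surrounding whitespace, periods, and quotes, then ignore case.
    cleaned = response.strip().strip(".\"'").lower()
    for label in ALLOWED_LABELS:
        if cleaned == label.lower():
            return label
    return None  # unexpected output; worth a manual look


print(normalize_label("None"))          # none
print(normalize_label(" hipaa. "))      # HIPAA
print(normalize_label("I think GDPR"))  # None
```

Returning `None` for anything unexpected keeps malformed answers visible instead of silently counting them as wrong.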


In [42]:
print('Increase LLM Reliability using Cleanlab Trustworthiness scores:\n')
df[response_col] = df[response_col].apply(lambda x: x.lower() if x == 'None' else x)  # Normalize 'None' to 'none' so it matches the label
print(f"Base Accuracy: {sum(df[response_col] == df['issue']) / len(df):.1%}\t\t\t\t\t({len(df)} examples)")
sdf = df[df[trustworthiness_col] > 0.7]
print(f"LLM Accuracy when (Trustworthiness > 0.7): {sum(sdf['issue'] == sdf[response_col]) / len(sdf):.1%}\t({len(sdf)} examples)")
sdf = df[df[trustworthiness_col] > 0.8]
print(f"LLM Accuracy when (Trustworthiness > 0.8): {sum(sdf['issue'] == sdf[response_col]) / len(sdf):.1%}\t({len(sdf)} examples)")
sdf = df[df[trustworthiness_col] > 0.9]
print(f"LLM Accuracy when (Trustworthiness > 0.9): {sum(sdf['issue'] == sdf[response_col]) / len(sdf):.1%}\t({len(sdf)} examples)")

Increase LLM Reliability using Cleanlab Trustworthiness scores:

Base Accuracy: 97.2%					(109 examples)
LLM Accuracy when (Trustworthiness > 0.7): 98.1%	(106 examples)
LLM Accuracy when (Trustworthiness > 0.8): 99.0%	(103 examples)
LLM Accuracy when (Trustworthiness > 0.9): 100.0%	(85 examples)
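
The numbers above show a trade-off: raising the threshold improves accuracy but covers fewer documents. At 0.9 the kept set is 85 of 109 documents (about 78%), so the remaining 24 would go to human review. The sketch below computes accuracy and coverage for any threshold. The helper name and the toy data are illustrative and are not drawn from the dataset.

```python
def accuracy_at_threshold(predictions, labels, scores, threshold):
    """Return (accuracy, count) over examples whose score exceeds threshold.

    Returns (None, 0) when no example clears the threshold, which avoids
    a division by zero.
    """
    kept = [(p, y) for p, y, s in zip(predictions, labels, scores)
            if s is not None and s > threshold]
    if not kept:
        return None, 0
    correct = sum(p == y for p, y in kept)
    return correct / len(kept), len(kept)


# Toy data for illustration only; not taken from the dataset above.
preds = ["HIPAA", "none", "GDPR", "FERPA"]
truth = ["HIPAA", "GDPR", "GDPR", "FERPA"]
trust = [0.98, 0.55, 0.91, 0.75]

for t in (0.0, 0.7, 0.9):
    acc, n = accuracy_at_threshold(preds, truth, trust, t)
    print(f"threshold {t}: accuracy {acc:.1%} on {n} of {len(preds)} examples")
```

Reporting the count next to the accuracy makes the cost of a stricter threshold explicit.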


# Example 2 - Stock Analysis: find the most likely buy/sell signals from thousands of documents

In [43]:
stock_prompt = \
'''Imagine you are the best stock broker on Wall Street. You have read the last thirty
years of stock reports and you are the most accurate stock broker in the world at
determining if a publicly traded company is a buy, hold, or sell.
Based on the information in the document, name the most relevant publicly traded
company and whether you believe the company is a buy, hold, or sell.
Please restrict your answer to the name of a single company, followed by a comma,
followed by one of the following words: buy, hold, sell
Your answer should have no punctuation.
Please be as accurate as possible, the world depends on it.\n\nDocument below here:\n\n'''
print(stock_prompt + df.at[0, 'document_text'])

Imagine you are the best stock broker on Wall Street. You have read the last thirty
years of stock reports and you are the most accurate stock broker in the world at
determining if a publicly traded company is a buy, hold, or sell.
Based on the information in the document, name the most relevant publicly traded
company and whether you believe the company is a buy, hold, or sell.
Please restrict your answer to the name of a single company, followed by a comma,
followed by one of the following words: buy, hold, sell
Your answer should have no punctuation.
Please be as accurate as possible, the world depends on it.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [50]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Stock Analysis)', 'Cleanlab_Trustworthiness_Score (Stock Analysis)'
reliabilty_data_enrichment(stock_prompt, response_col, trustworthiness_col)

In [51]:
df.sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Stock Analysis),Cleanlab_Trustworthiness_Score (Stock Analysis)
18,"For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. T","TimeWarner, buy",0.95677
41,"Looking ahead to its full year results to March 2005, BA warned that yields - average revenues per passenger - were expected to decline as it continues to lower prices in the face of competition from low-cost carriers.","British Airways, sell",0.878759
51,Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.,"Allied Domecq, buy",0.812066


In [52]:
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Stock Analysis),Cleanlab_Trustworthiness_Score (Stock Analysis)
50,"For example, we have taken delivery of six Airbus A321 aircraft and next month we will start further improvements to our Club World flat beds. BA's shares closed up four pence at 274.5 pence.","British Airways, hold",0.310073
44,BA had previously forecast a 2% to 3% rise in full-year revenue.,"Boeing, sell",0.320741
58,"Last year Pernod tried to buy Glenmorangie, one of Scotland's premier whisky firms, but lost out to luxury goods firm LVMH.","Glenmorangie, hold",0.333402
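
The responses follow a "company, signal" format, so they can be split into structured fields for downstream filtering. The sketch below is one way to do that, assuming the format requested in the prompt. `parse_stock_response` is a hypothetical helper.

```python
VALID_SIGNALS = {"buy", "hold", "sell"}


def parse_stock_response(response):
    """Split 'Company, signal' into (company, signal).

    Returns (None, None) when the response does not match that format.
    """
    if not response or "," not in response:
        return None, None
    # Split on the last comma so company names containing commas survive.
    company, signal = response.rsplit(",", 1)
    company, signal = company.strip(), signal.strip().lower()
    if not company or signal not in VALID_SIGNALS:
        return None, None
    return company, signal


print(parse_stock_response("TimeWarner, buy"))        # ('TimeWarner', 'buy')
print(parse_stock_response("British Airways, sell"))  # ('British Airways', 'sell')
print(parse_stock_response("no recommendation"))      # (None, None)
```

Responses that fail to parse are good candidates for manual review, whatever their trustworthiness score.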


# Example 3 - Find the best article and clickbait title to optimize marketing
* You need to figure out which article from a team of marketers is most likely to go viral.
* You send each paragraph from all their articles through the TLM.
* Use the TLM's trustworthiness scores to pick the best title across everyone's work, optimizing for prob(success), and to triage all articles.

In [53]:
title_prompt = \
'''Imagine you are the best click-bait title writer in the world. You have been the CMO
for The New York Times and Verve and in all of those years, you learned the secret
to writing titles of online articles that have the highest click through rates among
all your peers. Your titles are so captivating, even people who normally don't click
on news articles can't help but click on yours when they see your headline.
Based on the information in the document, please create a short click-bait title that
is sure to make the article go viral. This title will be posted online and cannot be
longer than 8 words.
Your answer should have no punctuation.
Please be as accurate as possible, the world depends on it.\n\nDocument below here:\n\n'''
print(title_prompt + df.at[0, 'document_text'])

Imagine you are the best click-bait title writer in the world. You have been the CMO
for The New York Times and Verve and in all of those years, you learned the secret
to writing titles of online articles that have the highest click through rates among
all your peers. Your titles are so captivating, even people who normally don't click
on news articles can't help but click on yours when they see your headline.
Based on the information in the document, please create a short click-bait title that
is sure to make the article go viral. This title will be posted online and cannot be
longer than 8 words.
Your answer should have no punctuation.
Please be as accurate as possible, the world depends on it.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [54]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Clickbait Title)', 'Cleanlab_Trustworthiness_Score (Clickbait Title)'
reliabilty_data_enrichment(title_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Clickbait Title),Cleanlab_Trustworthiness_Score (Clickbait Title)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,"""Shocking: Your Medical Records Exposed to the World!""",0.73423
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,"""Shocking! Unauthorized employees accessing sensitive patient data!""",0.534361
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.","""Shocking Revelation: Your Private Info in Wrong Hands!""",0.437543
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.","""Shocking Secrets! Your Personal Data at Risk!""",0.468939
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.","""Shocking: Leaked Student Records Exposed in Fayette County!""",0.695368
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.","""School Secrets: Shocking Truth of Hidden Student Records""",0.602517
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.","""Schools' Secret Policy Change Shocks Parents!""",0.683649
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”","""This loophole allows schools to share confidential letters""",0.541069
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,"""Shocking News: Star Quarterback Banned from Playing!""",0.806697
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account","""McDonald's shocking hidden agenda will leave you speechless!""",0.697473


In [55]:
df.sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(5)

Unnamed: 0,document_text,Cleanlab_TLM_response (Clickbait Title),Cleanlab_Trustworthiness_Score (Clickbait Title)
45,It also reported on Friday that passenger numbers rose 8.1% in January.,"""Mind-blowing surge: January passenger numbers skyrocket!""",0.827429
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,"""Shocking News: Star Quarterback Banned from Playing!""",0.806697
97,"Ethiopia produced 14.27 million tonnes of crops in 2004, 24% higher than in 2003 and 21% more than the average of the past five years, a report says.","""Mind-Blowing Ethiopian Crop Production Breaks All Records!""",0.803345
48,"Since the 11 September 2001 attacks in the United States, BA has cut 13,000 jobs as part of a major cost-cutting drive.","""Shocking! BA slashes 13k jobs after 9/11""",0.803007
103,"In eastern and southern Ethiopia, a prolonged drought has killed crops and drained wells.","""Devastating Drought Ravages Ethiopian Agriculture""",0.796238


In [56]:
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(5)

Unnamed: 0,document_text,Cleanlab_TLM_response (Clickbait Title),Cleanlab_Trustworthiness_Score (Clickbait Title)
11,"TechTarget's Cookies Policy includes the following terminology: ""By continuing to use the site, you agree to the use of cookies.""","""Secret Cookie Conspiracy: The Mind-Blowing Truth Revealed!""",0.297664
42,"However, it said sales would be better than previously forecast.","""Mind-Blowing Sales Update Defies All Expectations!""",0.374175
12,"This is the old, previous intro to the Privacy Policy for USA Citizen and Immigration Services. The language is unnecessarily complex and dense","""Unbelievable! Immigration Secrets Revealed for USA Citizens!""",0.374505
49,"Our focus remains on reducing controllable costs and debt whilst continuing to invest in our products, Mr Eddington said.","""Insider Reveals Unbelievable Cost-Cutting Secrets!""",0.40808
25,"The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments.","""US assets soar as half-point window opens""",0.41683
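
The prompt asks for at most 8 words and no punctuation, yet many of the generated titles above contain quotes, colons, or exclamation marks. A trustworthiness score does not check such format rules, so a separate check is useful. The sketch below flags violations. `check_title` is an illustrative helper.

```python
import string


def check_title(title, max_words=8):
    """Return a list of constraint violations for a generated title."""
    problems = []
    if len(title.split()) > max_words:
        problems.append(f"more than {max_words} words")
    if any(ch in string.punctuation for ch in title):
        problems.append("contains punctuation")
    return problems


# A title as returned above, including its surrounding quotes.
print(check_title('"Shocking: Your Medical Records Exposed to the World!"'))
# The same kind of title with the surrounding quotes removed.
print(check_title("Devastating Drought Ravages Ethiopian Agriculture"))
```

An empty list means the title satisfies both constraints. Note that this simple check also counts apostrophes, as in "McDonald's", as punctuation.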


# Example 4 - Summarize more accurately with a focus on concerns
You need brief summaries that highlight any concerns in a large database of documents.

In [57]:
summary_prompt = \
'''Imagine you are the chief of staff for the President of the United States.
The President has asked you to give a five word summary of the following
document. It is extremely important that your five word summary is as accurate
as possible. In your summary, include the most likely compliance, policy, security,
or legal issue that would be important to a national leader to be aware of.
Please answer in five words or less. For every word you go over five words, it will cost
the United States government ten trillion dollars and you will be fired. Five words max.
Please be as accurate as possible, the world depends on it.\n\nDocument below here:\n\n'''
print(summary_prompt + df.at[0, 'document_text'])

Imagine you are the chief of staff for the President of the United States.
The President has asked you to give a five word summary of the following
document. It is extremely important that your five word summary is as accurate
as possible. In your summary, include the most likely compliance, policy, security,
or legal issue that would be important to a national leader to be aware of.
Please answer in five words or less. For every word you go over five words, it will cost
the United States government ten trillion dollars and you will be fired. Five words max.
Please be as accurate as possible, the world depends on it.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [58]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Summarization)', 'Cleanlab_Trustworthiness_Score (Summarization)'
reliabilty_data_enrichment(summary_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Summarization),Cleanlab_Trustworthiness_Score (Summarization)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,Sensitive patient data stored insecurely.,0.727696
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,Potential unauthorized access to data.,0.769024
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.","Data security risks, potential PHI breaches.",0.701168
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.",Clinicians accessing PHI on personal computers. (Legal issue: Data security),0.653208
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.",Student records release risks privacy.,0.771439
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.",Secure storage of student records.,0.791961
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.",Compliance issue: FERPA policy change.,0.835549
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”",Letters of recommendation exempt from consent.,0.660759
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,Quarterback ineligible due to academics.,0.831261
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account",Consent issue in McDonald's registration.,0.675935


In [59]:
# Three most trustworthy summaries
df.sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Summarization),Cleanlab_Trustworthiness_Score (Summarization)
95,"Separately, the IMF warned on Thursday that India's budget deficit was too large and would hamper the country's economic growth, which it forecast to be around 6.5% in the year to March 2005.",India's budget deficit hampers growth.,0.90046
33,British Airways has blamed high fuel prices for a 40% drop in profits.,Fuel prices impact British Airways.,0.895884
98,"In 2003, crop production totalled 11.49 million tonnes, the joint report from the Food and Agriculture Organisation and the World Food Programme said.",Crop production: 11.49 million tonnes.,0.883299


In [60]:
# Three least trustworthy summaries
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Summarization),Cleanlab_Trustworthiness_Score (Summarization)
12,"This is the old, previous intro to the Privacy Policy for USA Citizen and Immigration Services. The language is unnecessarily complex and dense",Privacy Policy needs simplification.,0.453261
19,"TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC.","Account restatement, settlement, SEC review.",0.522837
25,"The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments.","US assets attractive, dollar support.",0.542221
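
The summarization prompt caps answers at five words, yet some responses in the results above run longer (row 3, for example, which also has the lowest score among the first ten rows). Below is a minimal sketch of a length check that could sit alongside the trustworthiness score. `exceeds_word_limit` is a hypothetical helper, not part of the Cleanlab API, and it assumes that whitespace-separated tokens are a good enough definition of a "word".

```python
def exceeds_word_limit(response: str, max_words: int = 5) -> bool:
    """Return True when a response has more whitespace-separated words than allowed."""
    return len(response.split()) > max_words


# Responses copied from the summarization results above.
examples = [
    "Sensitive patient data stored insecurely.",
    "Clinicians accessing PHI on personal computers. (Legal issue: Data security)",
]
for text in examples:
    print(exceeds_word_limit(text), "-", text)
```

Assuming the DataFrame used throughout this notebook, `df[response_col].apply(exceeds_word_limit)` would give a boolean mask of over-length summaries to review together with the low-trust ones.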


# Example 5 - Extraction: Identify potential legal liabilities (usage of company names)

In [61]:
extraction_prompt = \
'''From the following document, please extract the names of all companies and institutions, separated by commas.
Example document 1: "Google launches the new GBQ TLM in partnership with Cleanlab." Answer 1: "Google, Cleanlab"
Example document 2: "Data will be stored in s3 buckets." Answer 2: "none"
Example document 3: "My favorite song is where in the world is Carmen San Diego. I used to sing it a bunch when I worked at IBM." Answer 3: "IBM"
Please be as accurate as possible, several thousand employees will be affected if you make a mistake.\n\nDocument below here:\n\n'''
print(extraction_prompt + df.at[0, 'document_text'])

From the following document, please extract the names of all companies and institutions, separated by commas.
Example document 1: "Google launches the new GBQ TLM in partnership with Cleanlab." Answer 1: "Google, Cleanlab"
Example document 2: "Data will be stored in s3 buckets." Answer 2: "none"
Example document 3: "My favorite song is where in the world is Carmen San Diego. I used to sing it a bunch when I worked at IBM." Answer 3: "IBM"
Please be as accurate as possible, several thousand employees will be affected if you make a mistake.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [62]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Extraction)', 'Cleanlab_Trustworthiness_Score (Extraction)'
reliabilty_data_enrichment(extraction_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Extraction),Cleanlab_Trustworthiness_Score (Extraction)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,none,0.636075
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,none,0.875754
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.",none,0.94085
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.",none,0.969395
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.",Fayette county public schools,0.695265
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.",Marlborough School,0.905725
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.",none,0.896085
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”",none,0.732381
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,Minnesota High Buckaneers,0.529974
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account",McDonald's,0.962148


In [63]:
# Three most trustworthy extractions
df.sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Extraction),Cleanlab_Trustworthiness_Score (Extraction)
51,Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.,"Allied Domecq, Pernod Ricard",0.997943
55,"Pernod's last major purchase was a third of US giant Seagram in 2000, the move which propelled it into the global top three of drinks firms.","Pernod, Seagram",0.995927
15,"TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL. Time Warner said on Friday that it now owns 8% of search-engine Google.","TimeWarner, Warner Bros, AOL, Google",0.994203


In [64]:
# Three least trustworthy extractions
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Extraction),Cleanlab_Trustworthiness_Score (Extraction)
25,"The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments.",none,0.455267
21,"The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday.","Federal Reserve, US government",0.491506
52,"Reports in the Wall Street Journal and the Financial Times suggested that the French spirits firm is considering a bid, but has yet to contact its target.",none,0.50916
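
The extraction prompt asks for a comma-separated answer, with "none" when no company or institution is present. The sketch below shows one way to turn such a response into a Python list for downstream use. `parse_companies` is a hypothetical helper introduced here for illustration; it assumes the model followed the requested format and does not handle names that themselves contain commas.

```python
def parse_companies(response: str) -> list[str]:
    """Split a comma-separated extraction response into a list of names.

    The prompt asks the model to answer "none" when nothing is found,
    so that sentinel (in any letter case) maps to an empty list.
    """
    cleaned = response.strip()
    if cleaned.lower() == "none":
        return []
    return [name.strip() for name in cleaned.split(",") if name.strip()]


# Responses copied from the extraction results above.
print(parse_companies("TimeWarner, Warner Bros, AOL, Google"))
print(parse_companies("none"))
```

Assuming the DataFrame used throughout this notebook, `df[response_col].apply(parse_companies)` would produce one list per document.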


# Example 6 - Misinformation Detection
Identify the documents that are most likely to contain misinformation.

In [65]:
misinformation_prompt = \
'''Estimate the likelihood the following document contains content that is factually untrue or contains misinformation.
Please answer with one of the following categories: definitely no misinformation, likely no misinformation, may contain misinformation, definitely contains misinformation, not applicable
followed by a comma, and then a one sentence explanation for why you gave that answer.
Be as accurate as possible. You can't afford to make mistakes because someone's life is on the line.\n\nDocument below here:\n\n'''
print(misinformation_prompt + df.at[0, 'document_text'])

Estimate the likelihood the following document contains content that is factually untrue or contains misinformation.
Please answer with one of the following categories: definitely no misinformation, likely no misinformation, may contain misinformation, definitely contains misinformation, not applicable
followed by a comma, and then a one sentence explanation for why you gave that answer.
Be as accurate as possible. You can't afford to make mistakes because someone's life is on the line.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [67]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Misinformation)', 'Cleanlab_Trustworthiness_Score (Misinformation)'
reliabilty_data_enrichment(misinformation_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Misinformation),Cleanlab_Trustworthiness_Score (Misinformation)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,"Definitely contains misinformation, because it is highly unlikely and against standard practices for medical health records to be stored on unencrypted public servers.",0.776845
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,"May contain misinformation, because the statement contradicts itself by stating that patient data will be stored securely but also acknowledging the possibility of unauthorized access by employees.",0.80068
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.","May contain misinformation, because the statement ""we do not guarantee to delete/shred it if the document itself is of no more use"" contradicts standard practices for handling sensitive patient information.",0.710294
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.","May contain misinformation, the document states that clinicians may use their personal computer to access PHI without guaranteeing that the personal computer is located in a secure location, which raises concerns about the privacy and security of patient information.",0.752011
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.","May contain misinformation, the statement about the Fayette county public schools office releasing student records without student consent needs to be fact-checked with official sources.",0.669899
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.","May contain misinformation, because the sentence is grammatically incorrect and lacks clarity, making it difficult to determine the accuracy of the information provided.",0.681066
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.","May contain misinformation.\n\nThe document states that schools are obligated to inform parents and students of their rights at least once a year, which is generally true. However, it also claims that schools are required to announce any changes to the school's FERPA policy, which is not accurate. The FERPA policy does not specifically require schools to announce changes to parents.",0.696848
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”",Likely no misinformation. The information provided about the exception for sharing letters of recommendation between educational institutions under FERPA is accurate and consistent with the regulations outlined in section 34 CFR § 99.31.,0.760303
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,Likely no misinformation. This information can be easily fact-checked by referring to other local news sources or contacting the Minnesota High school to confirm the eligibility status of their star quarterback.,0.737433
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account","Likely no misinformation.\n\nThe statement mentions a specific issue with the McDonald's registration form regarding the lack of an option for users to explicitly consent to marketing communications, which can be verified by reviewing the form itself.",0.626564


In [89]:
# Keep only the rows whose response flags possible or definite misinformation (case-sensitive match)
filt = df[response_col].str.contains('May contain misinformation|Definitely contains misinformation', na=False)
df[filt].sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Misinformation),Cleanlab_Trustworthiness_Score (Misinformation)
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,"May contain misinformation, because the statement contradicts itself by stating that patient data will be stored securely but also acknowledging the possibility of unauthorized access by employees.",0.80068
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,"Definitely contains misinformation, because it is highly unlikely and against standard practices for medical health records to be stored on unencrypted public servers.",0.776845
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.","May contain misinformation, the document states that clinicians may use their personal computer to access PHI without guaranteeing that the personal computer is located in a secure location, which raises concerns about the privacy and security of patient information.",0.752011


In [90]:
# Three least trustworthy assessments across all rows (not only the flagged ones)
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Misinformation),Cleanlab_Trustworthiness_Score (Misinformation)
13,"Jubel's website used Google Analytics cookies, which require consent under EU law. However, the site had no cookie consent mechanism. The company attempted to argue that it did not require consent, but the DPA disagreed.","Likely no misinformation, because it is a statement about a legal dispute between Jubel's website and the DPA regarding the requirement of cookie consent under EU law, which can be verified through legal records and news sources.",0.560316
41,"Looking ahead to its full year results to March 2005, BA warned that yields - average revenues per passenger - were expected to decline as it continues to lower prices in the face of competition from low-cost carriers.","Likely no misinformation, as the statement is referring to an expected decline in average revenues per passenger due to competitive pressure from low-cost carriers, which is a common industry trend.",0.569804
11,"TechTarget's Cookies Policy includes the following terminology: ""By continuing to use the site, you agree to the use of cookies.""","Definitely no misinformation, this statement simply explains TechTarget's Cookies Policy without making any factual claims.",0.581311
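
The responses above do not always follow the requested "category, explanation" format: some put a period or a blank line after the category instead of a comma. The sketch below pulls out the category by matching the start of the response against the category names listed in the prompt, ignoring letter case. `extract_category` and `CATEGORIES` are hypothetical names introduced for illustration, and the example responses are shortened versions of rows in the results above.

```python
# Category names as listed in the misinformation prompt.
CATEGORIES = (
    "definitely no misinformation",
    "likely no misinformation",
    "may contain misinformation",
    "definitely contains misinformation",
    "not applicable",
)


def extract_category(response: str):
    """Return the category a response starts with, or None if none matches."""
    text = response.strip().lower()
    for category in CATEGORIES:
        if text.startswith(category):
            return category
    return None


print(extract_category("Likely no misinformation. This information can be easily fact-checked."))
print(extract_category("May contain misinformation.\n\nThe document states that schools are obligated to inform parents."))
```

Assuming the DataFrame used throughout this notebook, `df[response_col].apply(extract_category)` would give a clean category column to filter on, with `None` marking responses that need a manual look.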


# Example 7 - Source Estimation (the GPT API often hallucinates sources; Cleanlab TLM flags them)
Estimate the source of each document, then use the TLM trustworthiness score to separate high-confidence source estimates from likely hallucinations.

In [91]:
source_prompt = \
'''Estimate the source of the following text data document. Your answer should just be the name of the source.
Do not give long answers. Provide only the name of the source. Do not precede your answer with "The name of the source is"; instead, simply tell me the source and nothing else.
Be as accurate as possible. You can't afford to make mistakes because someone's life is on the line.\n\nDocument below here:\n\n'''
print(source_prompt + df.at[0, 'document_text'])

Estimate the source of the following text data document. Your answer should just be the name of the source.
Do not give long answers. Provide only the name of the source. Do not precede your answer with "The name of the source is"; instead, simply tell me the source and nothing else.
Be as accurate as possible. You can't afford to make mistakes because someone's life is on the line.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [93]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Data source)', 'Cleanlab_Trustworthiness_Score (Data source)'
reliabilty_data_enrichment(source_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Data source),Cleanlab_Trustworthiness_Score (Data source)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,Unknown,0.527162
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,Unknown,0.468483
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.",HIPAA,0.521381
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.",Unknown,0.5393
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.",Fayette County Public Schools,0.947432
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.",Marlborough School,0.9483
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.","U.S. Department of Education, FERPA",0.696823
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”","The source of the document is ""Rooker"".",0.263043
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,local newspaper,0.858963
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account",McDonald's,0.931782


In [94]:
# Three most trustworthy source estimates
df.sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Data source),Cleanlab_Trustworthiness_Score (Data source)
70,"It's painting a picture of a recovery... much patchier than previously thought, said Paul Sheard, economist at Lehman Brothers in Tokyo.",Lehman Brothers,0.983076
61,"The WSJ said that the two were ripe for consolidation, having each dealt with problematic parts of their portfolio.",The Wall Street Journal,0.972213
46,"Aviation analyst Nick Van den Brul of BNP Paribas described BA's latest quarterly results as ""pretty modest"".",BNP Paribas,0.957661


In [95]:
# Three least trustworthy source estimates (likely hallucinated sources)
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Data source),Cleanlab_Trustworthiness_Score (Data source)
16,"AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband.",CNN,0.104621
20,"The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.",The Wall Street Journal,0.16929
60,"Allied Domecq's big names include Malibu rum, Courvoisier brandy, Stolichnaya vodka and Ballantine's whisky - as well as snack food chains such as Dunkin' Donuts and Baskin-Robbins ice cream.",Forbes,0.181826
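
The gap between the highest and lowest scores above suggests a simple gate: accept a source estimate only when its trustworthiness score clears a cutoff, and route the rest to manual review. The sketch below illustrates this with pairs copied from the two tables above. `split_by_trust` is a hypothetical helper, and the 0.5 threshold is an assumption chosen for illustration, not a value recommended by Cleanlab.

```python
def split_by_trust(rows, threshold=0.5):
    """Partition (source, score) pairs into accepted and needs-review lists."""
    accepted = [row for row in rows if row[1] >= threshold]
    needs_review = [row for row in rows if row[1] < threshold]
    return accepted, needs_review


# (estimated source, trustworthiness score) pairs copied from the tables above.
rows = [
    ("Lehman Brothers", 0.983076),
    ("The Wall Street Journal", 0.972213),
    ("CNN", 0.104621),
    ("Forbes", 0.181826),
]
accepted, needs_review = split_by_trust(rows)
print("accepted:", [source for source, _ in accepted])
print("needs review:", [source for source, _ in needs_review])
```

In practice the threshold would need to be tuned on documents whose true source is known.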


# Example 8 - Auto-Feature Prioritization: find low-hanging fruit for business impact
* Every business has more bugs and feature requests than engineering time. Product teams have to review each request by hand and triage it based on their own confidence in how long it will take to resolve.
* With TLM, you can automate this triage: prompt for low-hanging-fruit solutions that are easy to implement, then sort the responses by trustworthiness score, highest first.

In [96]:
solution_prompt = \
'''Given the problem to the business in the following document, can you identify a low hanging fruit solution that could greatly improve the business using only 10 minutes of one employee's time?
Do not share any solutions that would take more than 10 minutes to implement. Your answer should be one short sentence.
Be as accurate as possible. You can't afford to make mistakes because someone's life is on the line.\n\nDocument below here:\n\n'''
print(solution_prompt + df.at[0, 'document_text'])

Given the problem to the business in the following document, can you identify a low hanging fruit solution that could greatly improve the business using only 10 minutes of one employee's time?
Do not share any solutions that would take more than 10 minutes to implement. Your answer should be one short sentence.
Be as accurate as possible. You can't afford to make mistakes because someone's life is on the line.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.


In [97]:
response_col, trustworthiness_col = 'Cleanlab_TLM_response (Low Hanging Fruit)', 'Cleanlab_Trustworthiness_Score (Low Hanging Fruit)'
reliabilty_data_enrichment(solution_prompt, response_col, trustworthiness_col)

Unnamed: 0,document_text,Cleanlab_TLM_response (Low Hanging Fruit),Cleanlab_Trustworthiness_Score (Low Hanging Fruit)
0,All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.,Implement a policy to store patient's medical data on encrypted servers instead of unencrypted public servers.,0.659562
1,Patient data will be stored in secure s3 buckets with access only by the company's employees. It is possible that employees who are not authorized to access the data can still access the data.,Implement strict access controls and permissions for the S3 buckets to ensure that only authorized employees can access the patient data.,0.593903
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed.",Implement a standard procedure to remind employees to regularly check for and securely dispose of any documents or patient information that is no longer needed.,0.731268
3,"To ensure you receive timely feedback, our clinicians may work after-hours and use their personal computer to access PHI. We cannot guarantee that their personal compute is located in a secure location.",Require all clinicians to use a secure virtual private network (VPN) when accessing PHI on their personal computers.,0.774511
4,"Fayette county public schools office may release student records in certain situations without student consent, including: Accidentally or purposefully emailing student information to unauthorized parties, Sharing a student athlete’s academic status, Sharing a student’s grades or identifying information with unauthorized parties, or including a student’s social security number in shared documents.",Implement a mandatory double-check process for all outgoing emails containing student information to prevent accidental release of student records.,0.576572
5,"The faculty and proceeding council of Marlborough Schoolare is responsible for protecting student records, whether they are stored electronically or in paper form. It certain situations when the board deems appropriate the schools may stored student records even after they are no longer needed.",Implement a reminder system for faculty and staff to regularly review and dispose of student records that are no longer needed.,0.821505
6,"Schools are obligated to inform parents and students of their rights at least once a year. They are also required to announce any changes to the school’s FERPA policy. We adhere to this policy for the most part, although last year we did not announce when we changed the policy to parents.",Send an email to all parents and students notifying them of the FERPA policy change from last year.,0.840244
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”",Implement a process to obtain consent from students (if 18 or older) or parents (if students are under 18) for sharing letters of recommendation with potential employers.,0.673956
8,The coach of Minnesota High shared today that the star quarterback of the Minnesota High Buckaneers is not eligible to play because of academic failing. I read this in the local newspaper this morning.,Solution: The employee could quickly reach out to the coach or school administration to offer assistance in finding a tutor or academic support for the star quarterback.,0.577028
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account",Add a checkbox to the registration form allowing users to explicitly opt-in for marketing communications.,0.886491
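The enrichment step above follows a simple pattern: prepend the shared instructions to each document, send the combined prompt, and store the response and score next to the document. The sketch below illustrates that pattern only; it is not the notebook's `reliabilty_data_enrichment` helper. The names `enrich_with_tlm` and `fake_prompt` are hypothetical, plain dictionaries stand in for DataFrame rows, and the stub returns a made-up score so the example runs without an API key. The only assumption about the real client is the return shape shown earlier in this notebook: a dict with `response` and `trustworthiness_score` keys.

```python
from typing import Callable, Dict, List


def enrich_with_tlm(
    rows: List[Dict[str, object]],
    prompt_prefix: str,
    response_col: str,
    trustworthiness_col: str,
    prompt_fn: Callable[[str], Dict[str, object]],
    text_col: str = "document_text",
) -> List[Dict[str, object]]:
    """Return copies of `rows` with a response and a trust score added."""
    enriched = []
    for row in rows:
        # Each prompt is the shared instructions followed by the row's document.
        result = prompt_fn(prompt_prefix + str(row[text_col]))
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row[response_col] = result["response"]
        new_row[trustworthiness_col] = result["trustworthiness_score"]
        enriched.append(new_row)
    return enriched


def fake_prompt(prompt: str) -> Dict[str, object]:
    """Hypothetical offline stand-in for tlm.prompt; the score is made up."""
    return {
        "response": f"(stub reply to {len(prompt)} chars)",
        "trustworthiness_score": 0.5,
    }


rows = [{"document_text": "Example policy text."}]
out = enrich_with_tlm(rows, "Instructions:\n\n", "response", "trust", fake_prompt)
print(out[0]["response"], out[0]["trust"])
```

Passing the prompt function as an argument keeps the loop testable offline; with a real client, `tlm.prompt` could be passed in place of `fake_prompt`, assuming it accepts a single prompt string as in the earlier cells.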


In [98]:
# Show the 3 responses with the highest trustworthiness scores
df.sort_values(by=trustworthiness_col, ascending=False)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Low Hanging Fruit),Cleanlab_Trustworthiness_Score (Low Hanging Fruit)
9,"The McDonald's registration form does not give users an opportunity to provide their express and unambiguous consent for marketing communications; In this form, consent is assumed when a user registers for an account",Add a checkbox to the registration form allowing users to explicitly opt-in for marketing communications.,0.886491
13,"Jubel's website used Google Analytics cookies, which require consent under EU law. However, the site had no cookie consent mechanism. The company attempted to argue that it did not require consent, but the DPA disagreed.",Add a cookie consent mechanism to Jubel's website.,0.883154
10,"Although the chip makers registration form includes marketing consent checkboxes, this method of consent is not considered freely-given because the boxes are pre-ticked by default:",Remove the pre-ticked checkboxes on the chip makers registration form.,0.872447


In [99]:
# Show the 3 responses with the lowest trustworthiness scores
df.sort_values(by=trustworthiness_col, ascending=True)[['document_text', response_col, trustworthiness_col]].head(3)

Unnamed: 0,document_text,Cleanlab_TLM_response (Low Hanging Fruit),Cleanlab_Trustworthiness_Score (Low Hanging Fruit)
22,"Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.""","There is no specific problem mentioned in the document, so it is not possible to identify a low hanging fruit solution.",0.395381
31,Yukos had filed for bankruptcy protection in a US court in an attempt to prevent the forced sale of its main production arm. The sale went ahead in December and Yugansk was sold to a little-known shell company which in turn was bought by Rosneft.,Implement a reminder system for employees to regularly check and update bankruptcy protection status of key business assets.,0.443234
90,"At a conference on developing enterprise hosted by UK finance minister Gordon Brown on Friday, he said that he was in favour of floating exchange rates because they help countries cope with economic shocks.",There is not enough information provided in the document to identify a specific problem or solution for the business.,0.443487
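The lowest-scoring rows above are mostly news excerpts that describe no business problem, which suggests the score can be used to route responses for human review. The sketch below shows one possible way to do that. The function `split_by_trust` and the `0.5` cutoff are hypothetical and not part of the notebook or the Cleanlab API; a suitable threshold would need to be chosen per use case. The example scores are copied from the tables above, and plain dictionaries stand in for DataFrame rows.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def split_by_trust(
    rows: List[Row], score_col: str, threshold: float
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (trusted, needs_review) using a score cutoff."""
    trusted: List[Row] = []
    needs_review: List[Row] = []
    for row in rows:
        score: Optional[float] = row.get(score_col)  # type: ignore[assignment]
        # A missing score is treated as untrusted rather than silently accepted.
        if score is not None and score >= threshold:
            trusted.append(row)
        else:
            needs_review.append(row)
    # Least trustworthy first; rows without a score sort to the front.
    needs_review.sort(
        key=lambda r: (r.get(score_col) is not None, r.get(score_col) or 0.0)
    )
    return trusted, needs_review


# Scores taken from the tables above, keyed by their row index.
scored = [
    {"id": 9, "score": 0.886491},
    {"id": 31, "score": 0.443234},
    {"id": 0, "score": 0.659562},
    {"id": 22, "score": 0.395381},
]
trusted, needs_review = split_by_trust(scored, "score", threshold=0.5)
print([r["id"] for r in trusted])       # [9, 0]
print([r["id"] for r in needs_review])  # [22, 31]
```

With a DataFrame, the same rows could be obtained via `df.to_dict('records')`, or the split could be expressed directly as a boolean mask on the score column.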
