In [1]:
# Install dependencies
!pip install pandas gspread oauth2client
!pip install --upgrade pip
!pip install --upgrade cleanlab-studio
from IPython import display
display.clear_output()

In [2]:
from cleanlab_studio import Studio

In [3]:
with open('api.txt') as file:
    key = file.read().strip()  # drop surrounding whitespace/newline so the key is passed exactly as issued
studio = Studio(key)  # Cleanlab Studio API key from https://app.cleanlab.ai/account?tab=General
tlm = studio.TLM()

### Let's first see the results on https://chat.openai.com

Now let's see the same thing programmatically, using OpenAI's private API for ChatGPT.

In [4]:
# Runs OpenAI GPT-3.5
chatgpt = studio.TLM(quality_preset='base')
chatgpt.prompt("How many Ns are there in the word enter?")

{'response': 'There are two Ns in the word "enter".', 'confidence_score': nan}

In [5]:
# Runs the Cleanlab TLM with confidence (reliability) scores
tlm = studio.TLM(quality_preset='best')
tlm.prompt("How many Ns are there in the word enter?")

{'response': 'There are 1 N in the word "enter".',
 'confidence_score': 0.5591737680569699}

In [6]:
# Runs the Cleanlab TLM with the faster 'low' quality preset, still with confidence (reliability) scores
tlm_fast = studio.TLM(quality_preset='low')
tlm_fast.prompt("How many Ns are there in the word enter?")

{'response': 'There are two Ns in the word "enter".',
 'confidence_score': 0.3528202582001686}

#### Now let's see an example of reliable data enrichment for document tasks (with a trustworthiness score built in for every output)

We'll see how a built-in trustworthiness score for every output helps when running text and document workflows on arbitrary datasets.

In [32]:
import pandas as pd
import gspread
from oauth2client.service_account import ServiceAccountCredentials
pd.set_option('display.max_colwidth', None)
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('tlm_demo_credentials.json', scope)
gc = gspread.authorize(credentials)    # use the gspread library to read the spreadsheet
wks = gc.open("Document Compliance Dataset").sheet1    # select the required sheet
df = pd.DataFrame(wks.get_all_values(), columns=["document_text", "issue"])

#### Let's look at some examples of these documents.
* Some of these documents have compliance issues, many don't. This task is often outsourced entirely to experts who review 100% of documents by hand.
* Here we'll automate the entire workflow and provide trustworthiness scores that help you determine what needs human review.
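
As a minimal sketch of how such scores could drive human review, the snippet below routes low-scoring outputs to a review queue. The helper name `needs_review`, the `0.8` threshold, and the sample results are all illustrative assumptions, not part of the Cleanlab API or of this dataset; the only thing taken from this notebook is the result shape (`response`, `confidence_score`) shown in the outputs above.

```python
import math

def needs_review(result, threshold=0.8):
    """Return True when a result should be sent to a human reviewer.

    `threshold` is a hypothetical cutoff; pick one that fits your own
    tolerance for errors.
    """
    score = result.get("confidence_score")
    # A missing or NaN score (as returned by the 'base' preset earlier)
    # carries no reliability signal, so treat it as needing review.
    if score is None or math.isnan(score):
        return True
    return score < threshold

# Made-up results in the same shape as the outputs shown in this notebook.
results = [
    {"response": "HIPAA", "confidence_score": 0.93},
    {"response": "none", "confidence_score": 0.41},
    {"response": "GDPR", "confidence_score": float("nan")},
]

review_queue = [r for r in results if needs_review(r)]
auto_accepted = [r for r in results if not needs_review(r)]
```

With a cutoff like this, only the documents in `review_queue` would go to an expert, instead of 100% of them.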

In [33]:

print(f'Number of documents in this demo dataset: {len(df)}')
# Show a random sample of the documents; some of them have compliance issues.
df.sample(10, random_state=0)[["document_text"]]
# Uncomment the line below to also see the true labels (which you normally would not have).
# df.sample(10, random_state=0)[["document_text", "issue"]]

Number of documents in this demo dataset: 109


index,document_text
84,"That means there are a limited number of new opportunities for workers. India, which attends the G7 meeting of seven leading industrialised nations on Friday, is unlikely to be cowed by its newcomer status."
10,"Although the chip makers registration form includes marketing consent checkboxes, this method of consent is not considered freely-given because the boxes are pre-ticked by default:"
75,"Nevertheless it was enough to push down the unemployment rate to 5.2%, its lowest level since September 2001."
2,"If someone forgets a document on a table somewhere or leaves patient information on their desktop, it might end up getting into the wrong hands. If information is no longer used, we do not guarantee to delete/shred it if the document itself is of no more use. It is possible that the not all PHI information is disposed of after it is no longer needed."
24,"The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates."
100,"Nevertheless, 2.2 million Ethiopians will still need emergency assistance."
108,About 80% of Ethiopians depend directly or indirectly on agriculture.
7,"Letters of recommendation typically qualify as student records. In order to send a letter from a teacher at one school to the registrar at another, you might expect that schools would need signed consent from parents (if students are under 18) or students themselves (if 18 or older) to comply with FERPA. But under section 34 CFR § 99.31 of the Act, there’s an exception for this sort of record sharing. During potential transfers, educational institutions don’t need consent to send letters of recommendation to the destination school. This exception, however, doesn’t apply to sharing letters of recommendation outside of the educational system. “If [a school official] were sending a letter of recommendation to a potential employer, that official would need consent,” Rooker says. “There’s not an exception that lets [school staff] provide information from the student’s record to a potential employer.”"
16,"AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband."
86,He objected to subsidies on agriculture that make it hard for developing nations like India to compete.


In [63]:
prompt_template = '''What type of compliance issue is most likely present in the following document? Please restrict your answer to a one word answer and nothing else. Your answer should be selected from the following options: HIPAA, FERPA, GDPR, none. Please be as accurate as possible, the world depends on it.\n\nDocument below here:\n\n'''
print(prompt_template + df.at[0, 'document_text'])

What type of compliance issue is most likely present in the following document? Please restrict your answer to a one word answer and nothing else. Your answer should be selected from the following options: HIPAA, FERPA, GDPR, none. Please be as accurate as possible, the world depends on it.

Document below here:

All medical health records will be accessed one way only. The patient's medical data will be stored on unencrypted public servers at the discretion of the enterprise customer.
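
The prompt asks for a one-word answer, but a model may still return different casing, extra whitespace, or trailing punctuation. The sketch below shows one way to map a raw response onto the four allowed options before comparing it with the `issue` column. `normalize_label` and `ALLOWED_LABELS` are hypothetical names introduced here for illustration.

```python
ALLOWED_LABELS = ("HIPAA", "FERPA", "GDPR", "none")

def normalize_label(response):
    """Map a raw model response to one of ALLOWED_LABELS, or None if it does not match."""
    # Remove surrounding whitespace, then stray quotes and periods, and
    # compare case-insensitively.
    cleaned = response.strip().strip(".\"'").lower()
    for label in ALLOWED_LABELS:
        if cleaned == label.lower():
            return label
    # Anything else (for example a full sentence) is treated as unparseable.
    return None
```

Responses that come back as `None` here did not follow the requested format and are natural candidates for human review.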


In [51]:
import time

In [56]:
results0 = []
for i in range(len(df)):
    time.sleep(3)
    answer = tlm.prompt(prompt_template + df.at[i, 'document_text'])
    print(i, answer)
    results0.append(answer)

Rate limit exceeded on https://api.cleanlab.ai/api/v0/trustworthy_llm/prompt 39


APIError: ('TLM failed after 1 attempts. Try setting a smaller max_concurrent_requests or using a shorter prompt.', -1)
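
The run above ended with a rate-limit error despite the fixed 3-second pause between calls. One common mitigation is to retry a failed call with exponentially increasing delays. The sketch below is a generic wrapper, not part of `cleanlab_studio`: `prompt_with_retry` and its parameters are hypothetical, and it catches any `Exception` by default because the exact exception class to import is not shown in this notebook.

```python
import time

def prompt_with_retry(prompt_fn, prompt, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call prompt_fn(prompt), retrying with exponential backoff on failure.

    prompt_fn is any callable that takes a prompt string, for example a
    bound method such as tlm.prompt. The delay doubles after each failed
    attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(max_attempts):
        try:
            return prompt_fn(prompt)
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the loop above, the call could then be written as `prompt_with_retry(tlm.prompt, prompt_template + df.at[i, 'document_text'])`, assuming `tlm.prompt` raises an exception when the rate limit is hit, as the traceback suggests.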