# Reliable Zero-Shot Classification with the Trustworthy Language Model

<head>
  <meta name="title" content="Reliable zero-shot classification with the Trustworthy Language Model"/>
  <meta property="og:title" content="Reliable zero-shot classification with the Trustworthy Language Model"/>
  <meta name="twitter:title" content="Reliable zero-shot classification with the Trustworthy Language Model" />
  <meta name="description" content="Use the TLM to gauge the trustworthiness of each zero-shot classification for more reliable results."  />
  <meta property="og:description" content="Use the TLM to gauge the trustworthiness of each zero-shot classification for more reliable results." />
  <meta name="twitter:description" content="Use the TLM to gauge the trustworthiness of each zero-shot classification for more reliable results." />
</head>



In zero-shot classification, we use a foundation model to classify input data into predefined categories (aka *classes*), without having to train this model on a dataset manually annotated with these categories. This leverages the pre-trained model's world knowledge to accomplish tasks that would otherwise require much more work training classical machine learning models from scratch. The problem with zero-shot classification of text with LLMs is that **we don't know which LLM classifications we can trust**. Most LLMs are prone to *hallucination* and will often predict a category even when their world knowledge does not suffice to justify this prediction.

This tutorial demonstrates how you can easily replace any LLM with Cleanlab's [Trustworthy Language Model (TLM)](/tutorials/tlm/) to **gauge the trustworthiness of each zero-shot classification**. Use the TLM to ensure **reliable classification** where you know which model predictions cannot be trusted. Before this tutorial, we recommend completing the [TLM quickstart tutorial](/tutorials/tlm/).

## Setup

Using TLM requires a [Cleanlab](https://app.cleanlab.ai/) account. Sign up for one [here](https://cleanlab.ai/signup/) if you haven't yet. If you've already signed up, check your email for a personal login link.

The Python client package can be installed using pip:

In [None]:
%pip install cleanlab-studio

In [1]:
import re
import pandas as pd
from tqdm import tqdm
from difflib import SequenceMatcher

from cleanlab_studio import Studio

In Python, launch your Cleanlab Studio client using your [API key](https://app.cleanlab.ai/account).

In [None]:
# Get your API key from https://app.cleanlab.ai/account after creating an account.
studio = Studio("<insert your API key>")

Let's load an example classification dataset. Here we consider legal documents from the "US" Jurisdiction of the [Multi_Legal_Pile](https://arxiv.org/abs/2306.02069), a large-scale multilingual legal dataset that spans over 24 languages. We aim to classify each document into one of three categories: `[caselaw, contracts, legislation]`.
We'll prompt our TLM to categorize each document and record its response and associated trustworthiness score. You can use the ideas from this tutorial to improve LLMs for *any* other text classification task! 

First download our example dataset and then load it into a DataFrame.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/zero_shot.csv'

In [4]:
df = pd.read_csv('zero_shot.csv')
df.head(2)

| Unnamed: 0 | index | text |
|---|---|---|
| 0 | 0 | Probl2B\n0/NV Form\nRev. June 2014\n\n\n\n ... |
| 1 | 1 | UNITED STATES DI... |


## Perform Zero Shot Classification with TLM

Let's initialize a `TLM` object. Here we use default TLM settings, but check out the [TLM quickstart tutorial](/tutorials/tlm/) for configuration options that can produce better results.

In [None]:
tlm = studio.TLM()

Next, let's define a prompt template to instruct the TLM on how to classify each document's text. Write your prompt just as you would with any other LLM when adapting it for zero-shot classification. A good prompt template lists all the possible categories a document can be classified as, and gives formatting instructions for the LLM response. Of course, the prompt must also include the text of the document itself.

```python
'You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: {categories}. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". \nDocument: {document}'

```

If you have a couple labeled examples from different classes, you may be able to get better LLM predictions via *few-shot* prompting (where these examples + their classes are embedded within the prompt). Here we'll stick with zero-shot classification for simplicity, but note that TLM can also be used for few-shot classification just like any other LLM.
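As a rough illustration of few-shot prompting, the sketch below embeds a few labeled examples in the prompt ahead of the document to classify. The `few_shot_examples` list, its placeholder texts, and the `build_few_shot_prompt` helper are hypothetical names introduced here for illustration. They are not part of the tutorial's dataset or the TLM API.

```python
categories = ["caselaw", "contracts", "legislation"]

# Hypothetical labeled examples (placeholder text, not taken from the tutorial's dataset).
few_shot_examples = [
    ("<text of a court order>", "caselaw"),
    ("<text of an employment agreement>", "contracts"),
    ("<text of a federal regulation>", "legislation"),
]


def build_few_shot_prompt(document: str, examples: list, categories: list) -> str:
    """Build a classification prompt that shows labeled examples before the target document."""
    instructions = (
        "You are an expert Legal Document Auditor. Classify the following document "
        "into a single category that best represents it. "
        f"The categories are: [{', '.join(categories)}]. "
        "In your response, first provide a brief explanation as to why the document "
        "belongs to a specific category and then on a new line write "
        '"Category: <category document belongs to>".'
    )
    # Each example is rendered as the document text followed by its known category.
    shots = "\n\n".join(f"Document: {text}\nCategory: {label}" for text, label in examples)
    return f"{instructions}\n\nHere are some labeled examples:\n\n{shots}\n\nDocument: {document}"


prompt = build_few_shot_prompt("<text of the document to classify>", few_shot_examples, categories)
print(prompt)
```

The resulting string can be passed to the TLM in the same way as the zero-shot prompts used in the rest of this tutorial.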

Let's apply the above prompt template to all documents in our dataset and form the list of prompts we want to run. For one arbitrary document, we print the actual corresponding prompt fed into the TLM below.

In [7]:
zero_shot_prompt_template = 'You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: {categories}. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". \nDocument: {document}'
categories = ['caselaw', 'contracts', 'legislation']
string_categories = str(categories).replace('\'', '')

# Create a DataFrame to store results and apply the prompt template to all examples
results_df = df.copy()
results_df['prompt'] = results_df['text'].apply(lambda x: zero_shot_prompt_template.format(categories=string_categories, document=x))

print(f"{results_df.at[7, 'prompt']}")

You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: [caselaw, contracts, legislation]. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". 
Document: UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK

UNITED STATES OF AMERICA,

                 v.                                                ORDER

JOSE DELEON,                                                    14 Cr. 28 (PGG)

                         Defendant.


PAUL G. GARDEPHE, U.S.D.J.:

              It is hereby ORDERED that the violation of supervised release hearing currently

scheduled for January 8, 2020 is adjourned to January 15, 2020 at 3:30 p.m. in Courtroom 705

of the Thurgood Marshall United States Courthouse, 40 Foley Square, New York, New York.

Dated: New York, New York
       January 8, 20

Now we prompt the TLM and save the output responses and their associated trustworthiness scores for all examples. We recommend the `try_prompt()` method to run TLM over datasets with many examples.

In [8]:
outputs = tlm.try_prompt(results_df['prompt'].to_list())

results_df[["response","trustworthiness_score"]] = pd.DataFrame(outputs)

Querying TLM... 100%|██████████|
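When running over many examples, some individual queries may fail. The sketch below shows one defensive way to split the raw outputs into parallel lists before adding them to a DataFrame. It assumes a failed query comes back either as `None` or as a dict with missing fields. That is an assumption about the client's behavior rather than something this tutorial confirms, so check the TLM documentation for your client version. The `split_outputs` helper and the example outputs are hypothetical.

```python
def split_outputs(outputs: list) -> tuple:
    """Split raw outputs into parallel response/score lists, using None where a query failed."""
    responses, scores = [], []
    for out in outputs:
        out = out or {}  # Assumption: a failed query may be returned as None
        responses.append(out.get("response"))
        scores.append(out.get("trustworthiness_score"))
    return responses, scores


# Hypothetical outputs: one successful query and one failed query.
example_outputs = [
    {"response": "Category: caselaw", "trustworthiness_score": 0.91},
    None,
]

responses, scores = split_outputs(example_outputs)
failed_indices = [i for i, response in enumerate(responses) if response is None]
print(f"Failed queries at positions: {failed_indices}")
```

Examples listed in `failed_indices` can then be re-run or excluded before parsing the responses.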


### Parse raw LLM Responses into Category Predictions

Our prompt template asks the LLM to explain its predictions, which can boost their accuracy. We now parse out the classification prediction, which should be exactly one of the categories for each document. Because LLMs don't necessarily follow output formatting instructions perfectly, the parser we define here can fuzzily map raw LLM outputs to predicted categories.

**Optional: Define helper methods to parse categories and better display results.**



In [9]:

import warnings

def parse_category(response: str, categories: list):
    """Takes in a response and parses out the category. If no category out of the possible `categories` is directly mentioned in the response, the category with greatest string similarity to the response is returned (along with a warning).
    This parser assumes the LLM was instructed to return "Category: <category example belongs to>" on a new line.
    
    Params
    ------
    response: Response from the LLM
    categories: List of possible categories examples in the dataset could be classified as
    """

    # Parse category if LLM response is properly formatted
    pattern = r'Category: (' + '|'.join(categories) + ')'
    exact_matches = re.findall(pattern, response, re.IGNORECASE)
    if len(exact_matches) > 0:
        return exact_matches[-1].lower()  # Return the last match since our zero shot prompt template asks the TLM to list the category last

    # If there are no exact matches to a specific category, return the closest category based on string similarity.
    pattern = r'Category: (.+)'
    matches = re.findall(pattern, response, re.IGNORECASE)  # Case-insensitive, consistent with the exact-match search above
    if len(matches) > 0:        
        category = matches[-1].lower()
    else:
        category = response.lower()  # If the LLM did not follow the response format we requested "Category: ..." in the prompt template, consider the whole response.
    
    closest_match = max(categories, key=lambda x: SequenceMatcher(None, category, x).ratio())
    similarity_score = SequenceMatcher(None, category, closest_match).ratio()
    str_warning = "matches"
    if similarity_score < 0.5:
        str_warning = "remotely matches"
    
    warnings.warn(f"None of the categories {str_warning} raw LLM output: {category}")
    return closest_match


def display_result(results_df: pd.DataFrame, index: int):
    """Displays the TLM result for the example from the dataset whose `index` is provided."""
    
    print(f"TLM predicted category: {results_df.iloc[index].predicted_category}")
    print(f"TLM trustworthiness score: {results_df.iloc[index].trustworthiness_score}\n")
    print(results_df.iloc[index].text)

In [10]:
results_df['predicted_category'] = results_df['response'].apply(lambda x: parse_category(x, categories))
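
To get a feel for the fuzzy fallback, the standalone sketch below scores a few made-up raw outputs against the categories with `difflib.SequenceMatcher`, the same similarity measure that `parse_category` relies on. The `closest_category` helper and the sample strings are hypothetical and are not actual TLM responses.

```python
from difflib import SequenceMatcher

categories = ["caselaw", "contracts", "legislation"]


def closest_category(raw_output: str, categories: list) -> tuple:
    """Return (closest category, similarity ratio) for a raw output string."""
    raw_output = raw_output.lower().strip()
    best = max(categories, key=lambda c: SequenceMatcher(None, raw_output, c).ratio())
    return best, SequenceMatcher(None, raw_output, best).ratio()


# Hypothetical raw outputs that do not exactly match any category name.
for raw in ["Contract", "case law", "Legislative act"]:
    category, similarity = closest_category(raw, categories)
    print(f"{raw!r} -> {category} (similarity {similarity:.2f})")
```

Low similarity ratios are a hint that the raw output did not really name any of the categories, which is why `parse_category` emits a warning in that situation.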

### Analyze Classification Results

Let's first inspect the most trustworthy predictions from our model. We sort the TLM outputs over our documents to see which predictions received the highest trustworthiness scores.

In [11]:
results_df = results_df.sort_values(by='trustworthiness_score', ascending=False)
display_result(results_df, index=0)

TLM predicted category: legislation
TLM trustworthiness score: 0.9214038143792375


      
        ENVIRONMENTAL PROTECTION AGENCY
        40 CFR Parts 86 and 600
        DEPARTMENT OF TRANSPORTATION
        National Highway Traffic Safety Administration
        49 CFR Parts 531, 533, 537, and 538
        [EPA-HQ-OAR-2009-0472; FRL-8966-9; NHTSA-2009-0059]
        RIN 2060-AP58; 2127-AK90
        Public Hearing Locations for the Proposed Rulemaking To Establish Light-Duty Vehicle Greenhouse Gas Emission Standards and Corporate Average Fuel Economy Standards
        
          AGENCY:
          Environmental Protection Agency (EPA) and National Highway Traffic Safety Administration (NHTSA).
        
        
          ACTION:
          Notice of public hearings.
        
        
          SUMMARY:

          EPA and NHTSA are announcing the location addresses for the joint public hearings to be held for the “Proposed Rulemaking to Establish Light-Duty Vehicle Greenhouse Gas Emission St

A document titled "Public Hearing Locations for the Proposed Rulemaking To Establish Light-Duty Vehicle Greenhouse Gas Emission Standards and Corporate Average Fuel Economy Standards" clearly relates to a legislative measure, so it makes sense that the TLM classifies it into the "legislation" category with a high trustworthiness score.

The two documents below discuss a Stock Option Grant for a company and Change in Control Employment Agreements. They are quite clearly contracts, which the TLM correctly classifies with high trustworthiness scores.

In [12]:
display_result(results_df, index=1)

TLM predicted category: contracts
TLM trustworthiness score: 0.9208593621290568

Exhibit 10.1

 

Re: Stock Option Grant

In recognition of your significant responsibilities at Airgas, I am pleased to
inform you that pursuant to the Airgas, Inc. Amended and Restated 2006 Equity
Incentive Plan (the “Plan”), effective May ##, 20## you have been granted a
non-qualified stock option (the “Option”) to purchase #,### shares of common
stock, at a price of $##.## per share.

This Option is subject to the applicable terms and conditions of the Plan which
are incorporated herein by reference, and in the event of any contradiction,
distinction or differences between this letter and the terms of the Plan, the
terms of the Plan will control. Please go to
http://airnet/page.asp?7000000000610 to review the prospectus for the Plan and
the Plan.

Subject to your continued employment with the Company, the Option may be
exercised in cumulative equal installments of 25% of the shares on each of the
first 

In [13]:
display_result(results_df, index=2)

TLM predicted category: contracts
TLM trustworthiness score: 0.9201449921200638

Exhibit 10(k)A

SCHEDULE OF CHANGE IN CONTROL EMPLOYMENT AGREEMENTS

In accordance with the Instructions to Item 601 of Regulation S-K, the
Registrant has omitted filing Change in Control Employment Agreements by and
between P. H. Glatfelter Company and the following employees as exhibits to this
Form 10-K because they are substantially identical to the Form of Change in
Control Employment Agreement by and between P. H. Glatfelter Company and certain
employees, which is filed as Exhibit 10 (j) to our Form 10-K for the year ended
December 31, 2008.

David C. Elder

John P. Jacunski

Michael L. Korniczky

Debabrata Mukherjee

Dante C. Parrini

Martin Rapp

Mark A. Sullivan

William T. Yanavitch II


### Least Trustworthy Predictions

Now let's see which classifications predicted by the model are least trustworthy. We sort the data by trustworthiness scores in the opposite order to see which predictions received the lowest scores. Observe how model classifications with the lowest trustworthiness scores are often incorrect, corresponding to examples with vague/irrelevant text or documents possibly belonging to more than one category.

In [14]:
results_df = results_df.sort_values(by='trustworthiness_score')
display_result(results_df, index=0)

TLM predicted category: contracts
TLM trustworthiness score: 0.6625658934123568

 1   RENE L. VALLADARES
     Federal Public Defender
 2   State Bar No. 11479
     KATHERINE A. TANAKA
 3   Assistant Federal Public Defender
     Nevada State Bar No. 14655C
 4   411 E. Bonneville, Ste. 250
     Las Vegas, Nevada 89101
 5   (702) 388-6577/Phone
     (702) 388-6261/Fax
 6   Katherine_Tanaka@fd.org

 7   Attorney for Joseph A. Strickland

 8
                                 UNITED STATES DISTRICT COURT
 9
                                       DISTRICT OF NEVADA
10
11   UNITED STATES OF AMERICA,                           Case No. 2:16-cr-155-JCM-CWH

12                  Plaintiff,                               STIPULATION TO CONTINUE
                                                               REVOCATION HEARING
13          v.
                                                                    (First Request)
14   JOSEPH A. STRICKLAND,

15                  Defendant.

16
17          IT IS

This is a case between Joseph A. Strickland and the United States of America. Here the LLM miscategorized this document as a contract when it should be caselaw. The document is a stipulation (an agreement between the parties), so perhaps that confused the model.

In [15]:
display_result(results_df, index=1)

TLM predicted category: legislation
TLM trustworthiness score: 0.6705137038662226

       Case 2:20-cv-00012 Document 1 Filed 01/08/20 Page 1 of 5 PageID 1




                         UNITED STATES DISTRICT COURT
                          MIDDLE DISTRICT OF FLORIDA
                             FORT MYERS DIVISION

 MERCEDES MUNOZ,

           Plaintiff,

 vs.                                               Case No.:
 SOPHIA OF GATEWAY, LLC d/b/a
 SUBWAY, and MOHAMMAD
 SULEMAN, Individually.

           Defendants.
                 COMPLAINT AND DEMAND FOR JURY TRIAL

         Plaintiff, MERCEDES MUNOZ (“Munoz” or “Plaintiff”), by and through her

undersigned attorneys, sues Defendants, SOPHIA OF GATEWAY, LLC d/b/a Subway,

a Florida Limited Liability Company (“Subway”), and MOHAMMAD SULEMAN

(“SULEMAN”), individually, (collectively referred to as “Defendants”) and states as

follows:
                                 NATURE OF ACTION

         1.       Plaintiff brings this action agains

This document is also clearly caselaw, but the model predicted it to be legislation, perhaps confused by the contents of the case, which discuss rules.

In [16]:
display_result(results_df, index=3)

TLM predicted category: contracts
TLM trustworthiness score: 0.7303548621607547

 

[exaa_001.jpg]

[exaa_002.jpg]

[exaa_003.jpg]

[exaa_004.jpg]

[exaa_005.jpg]

[exaa_006.jpg]

[exaa_007.jpg]

[exaa_008.jpg]

[exaa_009.jpg]

[exaa_010.jpg]

[exaa_011.jpg]

[exaa_012.jpg]

[exaa_013.jpg]

[exaa_014.jpg]

[exaa_015.jpg]

[exaa_016.jpg]

[exaa_017.jpg]

[exaa_018.jpg]

This document clearly does not belong in any of the three categories, as it is just a series of image filenames. It makes sense that the TLM is unsure which category to assign it.

Low trustworthiness scores like these help us identify examples that confuse the LLM and catch its mistakes. Without reliable trustworthiness scores, we don't know when we can rely on AI and when we cannot.

### How to use Trustworthiness Scores?

If you have time/resources, your team can manually review the LLM classifications of low-trustworthiness responses and provide a better human classification instead. If not, you can determine a trustworthiness threshold below which responses seem too unreliable to use, and have the model abstain from predicting in such cases (i.e. outputting "I don't know" instead).
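
The abstention idea can be sketched as a small helper. The `predict_or_abstain` function and the threshold of `0.8` are hypothetical choices made for illustration. A suitable threshold depends on your application.

```python
def predict_or_abstain(predicted_category: str, trustworthiness_score: float, threshold: float) -> str:
    """Return the prediction if its score clears the threshold; otherwise abstain."""
    if trustworthiness_score >= threshold:
        return predicted_category
    return "I don't know"


threshold = 0.8  # Hypothetical value, chosen only for this illustration

# Illustrative (category, score) pairs resembling the outputs shown above.
print(predict_or_abstain("legislation", 0.92, threshold))
print(predict_or_abstain("contracts", 0.66, threshold))
```

To apply this to the results from this tutorial, you could call the helper on each row, for example with `results_df.apply(lambda row: predict_or_abstain(row.predicted_category, row.trustworthiness_score, threshold), axis=1)`.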

The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend choosing thresholds in an **application-specific** manner. First consider the *relative* trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.
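
One way to work with relative rather than absolute trustworthiness is to flag a fixed fraction of the lowest-scoring examples for review, whatever their raw scores happen to be. The sketch below illustrates this with a hypothetical `lowest_fraction_indices` helper and made-up scores. It is one possible approach, not a prescribed method.

```python
def lowest_fraction_indices(scores: list, fraction: float) -> list:
    """Return the positions of the `fraction` of examples with the lowest scores."""
    n_flagged = int(len(scores) * fraction)
    # Rank positions from least to most trustworthy.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:n_flagged]


# Made-up trustworthiness scores for five examples.
scores = [0.92, 0.66, 0.89, 0.73, 0.85]

flagged = lowest_fraction_indices(scores, fraction=0.4)
print(f"Positions flagged for manual review: {flagged}")
```

Because the selection is rank-based, the same fraction of examples is flagged even if the scores for a different dataset are systematically higher or lower.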

## Measuring Classification Accuracy with Ground Truth Labels

Our example dataset happens to have labels for each document, so we can load them in to assess the accuracy of our model predictions. We'll study the impact on accuracy as we abstain from making predictions for examples receiving lower trustworthiness scores.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/zero_shot_labels.csv'

In [18]:
df_ground_truth = pd.read_csv('zero_shot_labels.csv')
df = pd.merge(results_df, df_ground_truth, on=['index'], how='outer')
df['is_correct'] = df['type'] == df['predicted_category']

df.head()

| Unnamed: 0 | index | text | prompt | response | trustworthiness_score | predicted_category | type | is_correct |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Probl2B\n0/NV Form\nRev. June 2014\n\n\n\n ... | You are an expert Legal Document Auditor. Clas... | Category: legislation\n\nExplanation: This doc... | 0.840864 | legislation | caselaw | False |
| 1 | 1 | UNITED STATES DI... | You are an expert Legal Document Auditor. Clas... | Category: caselaw | 0.906702 | caselaw | caselaw | True |
| 2 | 2 | \n \n FEDERAL COMMUNICATIONS COMMI... | You are an expert Legal Document Auditor. Clas... | Category: Legislation\n\nExplanation: This doc... | 0.854045 | legislation | legislation | True |
| 3 | 3 | \n \n DEPARTMENT OF COMMERCE\n ... | You are an expert Legal Document Auditor. Clas... | Category: Legislation\n\nExplanation: This doc... | 0.913332 | legislation | legislation | True |
| 4 | 4 | EXHIBIT 10.14\n\nAMENDMENT NO. 1 TO\n\nCHANGE ... | You are an expert Legal Document Auditor. Clas... | Category: Contracts | 0.887005 | contracts | contracts | True |


In [20]:
print('TLM zero-shot classification accuracy over all documents: ', df['is_correct'].sum() / df.shape[0])

TLM zero-shot classification accuracy over all documents:  0.7784431137724551


Next suppose we instead abstain from making predictions on 20% of the documents flagged with the lowest trustworthiness scores (e.g. having experts manually categorize these documents instead).

In [21]:
quantile = 0.2  # Play with value to observe the accuracy vs. number of abstained examples tradeoff

filtered_df = df[df['trustworthiness_score'] > df['trustworthiness_score'].quantile(quantile)]
acc = filtered_df['is_correct'].sum() / filtered_df.shape[0]
print(f'TLM zero-shot classification accuracy over the documents within the top-{(1-quantile) * 100}% of trustworthiness scores: {acc}')

TLM zero-shot classification accuracy over the documents within the top-80.0% of trustworthiness scores: 0.8195488721804511


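Abstaining on the least trustworthy 20% of documents raised accuracy on the remaining documents from roughly 77.8% to 82.0%. This shows the benefit of considering the TLM's trustworthiness score for zero-shot classification over having to rely on results from a standard LLM.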
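
The tradeoff between accuracy and the number of abstained examples can be explored more systematically by sweeping over several abstention fractions. The sketch below does this on made-up toy data. The `accuracy_after_abstaining` helper, the scores, and the correctness flags are all hypothetical and are not results from this tutorial's dataset.

```python
def accuracy_after_abstaining(scores: list, is_correct: list, fraction: float) -> float:
    """Accuracy over the examples kept after dropping the lowest-scoring `fraction`."""
    n_drop = int(len(scores) * fraction)
    # Sort examples from least to most trustworthy, then drop the first n_drop.
    ranked = sorted(zip(scores, is_correct), key=lambda pair: pair[0])
    kept = ranked[n_drop:]
    if not kept:
        raise ValueError("No examples remain after abstaining; use a smaller fraction.")
    return sum(correct for _, correct in kept) / len(kept)


# Made-up toy data: trustworthiness scores and whether each prediction was correct.
scores = [0.66, 0.67, 0.73, 0.84, 0.85, 0.89, 0.91, 0.91, 0.92, 0.92]
is_correct = [False, False, True, False, True, True, True, True, True, True]

for fraction in [0.0, 0.2, 0.4]:
    acc = accuracy_after_abstaining(scores, is_correct, fraction)
    print(f"Abstaining on the lowest {fraction:.0%}: accuracy {acc:.3f} on the remaining examples")
```

With real data, you could pass `df['trustworthiness_score'].tolist()` and `df['is_correct'].tolist()` to the same helper and pick the fraction that best balances accuracy against manual review effort.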