# How to detect low-quality data in Instruction Tuning Datasets

This notebook demonstrates how to use Cleanlab TLM and Cleanlab Studio to detect bad data in an LLM instruction tuning dataset, using the databricks-dolly-15k dataset as an example.

## Import Dependencies

In [None]:
!pip install cleanlab-tlm cleanlab_studio pandas tqdm

In [None]:
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>"  # Get your API key from: https://tlm.cleanlab.ai/

In [1]:
from cleanlab_tlm import TLM
from cleanlab_studio import Studio
import pandas as pd
from tqdm import tqdm

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

## Initializing TLM

We initialize the Cleanlab Studio client and the TLM as follows:

In [2]:
# Get Studio API key from here: https://studio.cleanlab.ai/account after creating a Cleanlab Studio account.
# Instructions to create account can be found under Guide -> Quickstart -> Cleanlab Studio Python API -> Creating an API Key

studio = Studio("<API key>")
batch_size = 64
tlm = TLM(quality_preset="medium", max_concurrent_requests=batch_size)

## Fetching Data

Download the dataset here: https://huggingface.co/datasets/databricks/databricks-dolly-15k

The fine-tuning dataset contains about 15,000 instruction-response pairs spanning eight categories. Five of these categories (open Q&A, general Q&A, classification, creative writing, brainstorming) generally do not include context in their input, while the remaining three (closed Q&A, information extraction, summarization) do. For this tutorial, we work with a subset of the dataset by sampling 200 datapoints from each category. Once you are familiar with TLM by the end of this tutorial, we encourage you to investigate the rest of the dataset on your own, or **any other dataset you are interested in**!

In [3]:
data = pd.read_json("databricks-dolly-15k.jsonl", lines=True)
# For datapoints that incorporate context, we can simply prepend the context to the instruction
data.loc[data['context'].str.len() > 0, 'instruction'] = 'context: ' + data['context'].astype(str) + '\nquestion: ' + data['instruction'].astype(str)
data = data.drop(columns=['context'])

sample_size_per_category = 200
# Sample the same number of datapoints from each category, then concatenate once
# (avoids repeatedly concatenating onto an empty DataFrame)
sampled_groups = [
    group.sample(sample_size_per_category)
    for _, group in data.groupby('category')
]
df = pd.concat(sampled_groups)

df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,instruction,response,category
0,What are three crafts with a wire coat hanger?,"- Create a mobile by cutting up the coat hanger and joining the pieces together.\n- Make a wire sculpture of your favorite animal.\n- Fashion a spring and affix a pompom on top of it, finishing your creation with googly eyes",brainstorming
1,What are some good body weight exercises for the lower body?,"Some good exercises to strengthen the lower body include box squats, lunges, side lunges, one leg Romanian dead lifts, glute marches, and curtsey squats.",brainstorming
2,What are the top 10 suggestions for fillings in Mexican street tacos?,1.\tSteak (carne asada)\n2.\tChicken (pollo)\n3.\tBeef Cheeks (cabeza)\n4.\tBraised Beef (barbacoa)\n5.\tBraised Pork (carnitas)\n6.\tPork Marinated with Pineapple (al pastor)\n7.\tShrimp (camaron)\n8.\tSausage (chorizo)\n9.\tFish (pescado)\n10. Beans (frijoles),brainstorming
3,List as many types of craft beer as you can,"IPA, pale ale, wheat beer, lager, pilsner, stout, porter, brown ale, sour, saison, red ale, imperial stout, mild ale, english special bitter, milkshake IPA, sour IPA, cream ale",brainstorming
4,Provide a dot point list of the last six artists that won a Grammy Award for Best Album. Include the title of the album and the year they won.,"Here is a list of the last six artists that won a Grammy Award for Best Album.\n- Harry Styles won for the album 'Harry's House' in 2023 \n- Jon Batiste won for the album 'We Are' in 2022 \n- Taylor Swift won for 'Folklore' in 2021 \n- Billie Eilish won for 'When We All Fall Asleep, Where Do We Go?' in 2020 \n- Kacey Musgraves won for the album 'Golden Hour' in 2019 \n- Bruno Mars won for '24K Magic' in 2018",brainstorming
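
As an illustrative sketch of the same preprocessing on a toy DataFrame (the rows and the `random_state` value below are invented for illustration and are not part of the Dolly dataset), pandas' `groupby(...).sample` can draw a fixed number of rows per category in a single call:

```python
import pandas as pd

# Toy stand-in for the Dolly data; these rows are made up for illustration.
toy = pd.DataFrame({
    "instruction": ["Name a fruit.", "Who wrote it?", "List two colors.", "Summarize."],
    "context": ["", "The book was written by Ann.", "", "A long passage."],
    "response": ["Apple", "Ann", "Red, blue", "A passage."],
    "category": ["brainstorming", "closed_qa", "brainstorming", "closed_qa"],
})

# Prepend the context only to rows that actually have one.
has_context = toy["context"].str.len() > 0
toy.loc[has_context, "instruction"] = (
    "context: " + toy["context"] + "\nquestion: " + toy["instruction"]
)
toy = toy.drop(columns=["context"])

# Draw the same number of rows from every category.
# random_state is an assumption added here to make the sample repeatable.
sample = toy.groupby("category").sample(n=1, random_state=0).reset_index(drop=True)
print(sample["category"].tolist())
```

Fixing `random_state` is optional, but it makes it easier to compare results across reruns of the notebook.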


## Obtaining Confidence Scores on Datapoints
We use `batch_get_confidence_score` to obtain the confidence scores efficiently. Each call processes a batch of up to 64 (instruction, response) pairs and takes around a minute to execute. The resulting scores are saved in the `confidence_score` column of our `results` dataframe.

In [None]:
results = df.copy(deep=True)
results["confidence_score"] = None

for i in tqdm(range(0, len(results), batch_size)):
    # .loc slicing includes the end label, so subtract 1 to keep batches
    # non-overlapping and at most batch_size rows long
    end_index = min(i + batch_size, len(results)) - 1
    batch_instructions = results.loc[i:end_index, "instruction"]
    batch_responses = results.loc[i:end_index, "response"]
    # Each call can process batch_size (maximum 64) prompt-response pairs at once
    batch_scores = tlm.batch_get_confidence_score(
        batch_instructions, batch_responses, retries=1
    )

    results.loc[i:end_index, "confidence_score"] = batch_scores
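
The batching pattern in the loop above can be checked in isolation. The following is a minimal sketch, assuming a default integer index; `iter_batches` is a hypothetical helper written for this illustration, not part of any Cleanlab API. It uses positional `.iloc` slicing, which excludes the end position, whereas label-based `.loc` slicing includes the end label:

```python
import pandas as pd

def iter_batches(frame, size):
    """Yield consecutive positional slices of at most `size` rows."""
    for start in range(0, len(frame), size):
        yield frame.iloc[start:start + size]

# Toy data invented for illustration.
toy = pd.DataFrame({
    "instruction": [f"q{i}" for i in range(10)],
    "response": [f"a{i}" for i in range(10)],
})

batch_lengths = [len(batch) for batch in iter_batches(toy, 4)]
print(batch_lengths)  # [4, 4, 2]

# For comparison, label-based slicing includes the end label:
loc_length = len(toy.loc[0:4])
print(loc_length)  # 5
```

Checking batch lengths this way before calling a rate-limited API helps confirm that no batch exceeds the maximum size and that no row is scored twice.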

Now that TLM has finished running, let's inspect some low-confidence datapoints for one of our categories:

In [6]:
results = results.sort_values(by="confidence_score")
# results.to_csv('results.csv', index=False)

closed_qa_results = results[results['category'] == 'closed_qa']
closed_qa_results.head()

Unnamed: 0,instruction,response,category,confidence_score
540,"context: From February 2019 – 2020, average monthly temperatures in Germany ranged from a low of 3.3 °C (37.9 °F) in January 2020 to a high of 19.8 °C (67.6 °F) in June 2019. Average monthly precipitation ranged from 30 litres per square metre in February and April 2019 to 125 litres per square metre in February 2020. Average monthly hours of sunshine ranged from 45 in November 2019 to 300 in June 2019.\nquestion: Based on this text, what average temperature in degree celsius can I expect in April in Germany?",The average temperature in April in Germany is 13.2 °C.,closed_qa,0.077344
491,"context: The relative size of a Foley catheter is described using French units (F). Alternatively, the size of a 10 F catheter might be expressed as 10 Ch (Charriere units – named after a 19th century French scientific instrument maker, Joseph-Frédéric-Benoît Charrière). The most common sizes are 10 F to 28 F. 1 F is equivalent to 0.33 mm = .013"" = 1/77"" of diameter. Foley catheters are usually color coded by size with a solid color band at the external end of the balloon inflation tube, allowing for easy identification of the size. Note: Colors for French sizes 5, 6, 8, 10 may vary significantly if intended for pediatric patients. Color for French size 26 may also be pink instead of black.\nhttps://en.wikipedia.org/wiki/Foley_catheter\nquestion: Based on this paragraph, what is the diameter of a 10 F Foley catheter in mm?","According to this paragraph, the diameter of a 1 F catheter is 1/77mm so the diameter of a 10 F catheter is 10/77mm.",closed_qa,0.113954
481,"context: The Open Era is the current era of professional tennis. It began in 1968 when the Grand Slam tournaments allowed professional players to compete with amateurs, ending the division that had persisted since the dawn of the sport in the 19th century. The first open tournament was the 1968 British Hard Court Championships held in April, followed by the inaugural open Grand Slam tournament, the 1968 French Open, a month later. Unless otherwise sourced, all records are based on data from the Association of Tennis Professionals (ATP), the International Tennis Federation (ITF), and the official websites of the four Grand Slam tournaments. All rankings-related records are based on ATP rankings, which began in 1973.\nquestion: Who has won the most grand slams in men's singles tennis in open era?",Rafael Nadal and Novak Djokovic won 22 times.,closed_qa,0.168728
554,"context: Time travel is the concept of movement between certain points in time, analogous to movement between different points in space by an object or a person, typically with the use of a hypothetical device known as a time machine. Time travel is a widely recognized concept in philosophy and fiction, particularly science fiction. The idea of a time machine was popularized by H. G. Wells' 1895 novel The Time Machine.\nquestion: Given this paragraph about time travel, how practical is it","Yes, time travel is indeed a real thing. But it's not quite what you've probably seen in the movies. Under certain conditions, it is possible to experience time passing at a different rate than 1 second per second. And there are important reasons why we need to understand this real-world form of time travel.",closed_qa,0.395864
504,"context: Scholars generally acknowledge six cradles of civilization. Mesopotamia, Ancient Egypt, Indus Valley, and Ancient China are believed to be the earliest in the Old World. Cradles of civilization in the New World are the Caral-Supe civilization of coastal Peru and the Olmec civilization of Mexico. All of the cradles of civilization depended upon agriculture for sustenance (except possibly Caral-Supe which may have depended initially on marine resources). All depended upon farmers producing an agricultural surplus to support the centralized government, political leaders, priests, and public works of the urban centers of the civilization.\n\nLess formally, the term ""cradle of civilization"" is often used to refer to other historic ancient civilizations, such as Greece or Rome, which have both been called the ""cradle of Western civilization"".\nquestion: What is the oldest known civilization?","The oldest known human civilizations are known to exist in Mesopotamia (Modern day Iraq), Indus Valley (a region located in modern-day Pakistan), Ancient Egypt and Ancient China.",closed_qa,0.399626


At this point, we would manually verify each datapoint, and choose to either correct or remove the low-quality ones. Try to find mistakes in the other categories on your own!
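
To review the other categories, one option is to build a small review queue with the lowest-scoring datapoints from each category. The sketch below uses a toy DataFrame with invented scores; the helper name `lowest_per_category` and the `0.5` threshold are arbitrary choices for illustration, not recommended values:

```python
import pandas as pd

# Toy scores invented for illustration.
scored = pd.DataFrame({
    "category": ["closed_qa", "closed_qa", "open_qa", "open_qa", "open_qa"],
    "confidence_score": [0.91, 0.08, 0.40, 0.75, 0.12],
})

def lowest_per_category(frame, k):
    """Return the k lowest-confidence rows within each category."""
    ordered = frame.sort_values("confidence_score")
    return ordered.groupby("category").head(k)

review_queue = lowest_per_category(scored, k=1)
print(review_queue)

# Alternatively, flag everything below a fixed (arbitrary) threshold.
threshold = 0.5
flagged = scored[scored["confidence_score"] < threshold]
print(len(flagged))
```

A per-category queue keeps the review balanced, since a single global sort can be dominated by whichever category tends to receive the lowest scores.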

## Using Cleanlab Studio to Detect Further Text Issues

In this section, we demonstrate how Cleanlab Studio can automatically generate metadata for any text dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. Currently, Cleanlab Studio mainly supports classification tasks, so it requires a label column and processes only one designated text column per dataset. We can work around this by combining the `instruction` and `response` columns into a single text column and using `category` as the label. If your dataset does not have an appropriate label column, you can generate placeholder labels by randomly assigning 0s and 1s to datapoints. The label-issue metadata does not matter here, since we are only concerned with text issues.
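
For a dataset without a natural label column, the random-label workaround described above might look like the following sketch. The column name `placeholder_label` and the seed are hypothetical choices made for this illustration:

```python
import numpy as np
import pandas as pd

# Toy unlabeled text data invented for illustration.
unlabeled = pd.DataFrame({
    "text": [
        "Prompt: a Response: b",
        "Prompt: c Response: d",
        "Prompt: e Response: f",
        "Prompt: g Response: h",
    ]
})

# Assign each row a random 0 or 1; the seed only makes the example repeatable.
rng = np.random.default_rng(seed=0)
unlabeled["placeholder_label"] = rng.integers(0, 2, size=len(unlabeled))
print(unlabeled["placeholder_label"].tolist())
```

Because these labels carry no meaning, any label-issue results computed from them should be ignored.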


After preparing our dataset, we load the data into Cleanlab Studio.



In [14]:
df['text'] = 'Prompt: ' + df['instruction'] + ' Response: ' + df['response']
df = df.drop(columns=['instruction', 'response'])

dataset_id = studio.upload_dataset(df, dataset_name="dolly-subset")
print(f"Dataset ID: {dataset_id}")

Uploading dataset...: 100%|████████████████████████████████████████████████|
Generating schema...: 100%|████████████████████████████████████████████████|
Ingesting Dataset...: 100%|████████████████████████████████████████████████|

Dataset ID: 6237e7d3664b4a1492e9f76eb675b9bf





Now that we have a dataset_id, we can use it to launch a Cleanlab Studio project. A project automatically trains ML models to provide AI-based analysis of your dataset.

In [16]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="dolly-text-issues",
    modality="text",
    label_column="category",
    text_column="text"
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Project successfully created and training has begun! project_id: 93ebea807443433e915392dc772b89e8


Once the project has been launched successfully and you see your `project_id`, you can close this notebook. Execute the above cell only once per dataset, since it takes some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.

Once your results are ready, you can optionally view your project in the [Cleanlab Studio web interface](https://app.cleanlab.ai/) and interactively improve your dataset. In the following section, we will walk through further programmatic workflows for Cleanlab Studio.

## Review Data Issues

Once the results are ready, we fetch the Cleanlab columns that store metadata for our cleanset using its `cleanset_id`. These columns have the same length as our original dataset and provide metadata about each individual datapoint, such as which types of text issues it exhibits (PII, toxic, non-English, informal) and how severely.

In [38]:
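cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
# Download the Cleanlab columns (per-datapoint metadata) for this cleanset
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_on="cleanlab_row_ID")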

cleanset_id: 51a2dc957af54d4aa2c6ad6be80e9404


As an example, let's investigate the potential occurrences of personally identifiable information (PII) in our dataset.

In [40]:
PII_samples = combined_dataset_df.query("is_PII").sort_values("PII_score", ascending=False)
columns_to_display = ["cleanlab_row_ID", "text", "PII_score", "is_PII", "PII_types", "PII_items"]
display(PII_samples.head(3)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,text,PII_score,is_PII,PII_types,PII_items
383,383,"Prompt: I bought two bottles of 2L coca-cola when they were on special, but when I checked my receipt, I was charged the full price! Please contact me on 0410787999 or iratecustomer@hotmail.com to discuss a refund! Response: I've found two pieces of personally identifiable information (PII) in the provided text:\n- A person's Australian mobile phone number: 0410787999\n- An email address: iratecustomer@hotmail.com\n\nThe Australian government defines PII as ""Information that can be used on its own or with other information to identify, contact or locate a single person, or to identify an individual in context"" (source: https://www.cyber.gov.au/acsc/view-all-content/glossary/personally-identifiable-information-pii)",0.5,True,"[""URL"", ""email""]","[""iratecustomer@hotmail.com"", ""iratecustomer@hotmail.com"", ""https://www.cyber.gov.au/acsc/vi""]"
1105,1105,"Prompt: context: Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. Her genre-spanning discography, songwriting abilities and artistic reinventions have received critical praise and wide media coverage. Born in West Reading, Pennsylvania, Swift moved to Nashville at age 14 to become a country artist. She signed a songwriting deal with Sony/ATV Music Publishing in 2004 and a recording contract with Big Machine Records in 2005. Her 2006 self-titled debut album made her the first female country artist to write a U.S. platinum-certified album.\nSwift's next albums, Fearless (2008) and Speak Now (2010), explored country pop. The former's ""Love Story"" and ""You Belong with Me"" were the first country songs to top the U.S. pop and all-genre airplay charts, respectively. She experimented with rock and electronic styles on Red (2012), which featured her first Billboard Hot 100 number-one song, ""We Are Never Ever Getting Back Together"", and eschewed her country image in her synth-pop album, 1989 (2014), supported by chart-topping songs ""Shake It Off"", ""Blank Space"", and ""Bad Blood"". Media scrutiny inspired the urban-flavored Reputation (2017) and its number-one single ""Look What You Made Me Do"".\nExiting Big Machine, Swift signed with Republic Records in 2018 and released her seventh studio album, Lover (2019), followed by the autobiographical documentary Miss Americana (2020). She ventured into indie folk and alternative rock in her 2020 albums Folklore and Evermore, whose singles ""Cardigan"" and ""Willow"" topped the Hot 100. Swift began re-recording her first six albums after a dispute over their masters, re-releasing two in 2021—Fearless (Taylor's Version) and Red (Taylor's Version). The latter's ""All Too Well (10 Minute Version)"" became the longest song to top the Hot 100. Her 2022 album Midnights and single ""Anti-Hero"" broke all-time streaming records. 
Swift has directed music videos and films, such as All Too Well: The Short Film (2021), and played supporting roles in others.\nHaving sold over 200 million records globally, Swift is one of the best-selling musicians in history. She is the most streamed woman on Spotify and the only act to have five albums open with over one million copies sold in the US. Among her accolades are 12 Grammy Awards, including three Album of the Year wins; an Emmy Award; 40 American Music Awards; 29 Billboard Music Awards; three IFPI Global Recording Artist of the Year awards; and 92 Guinness World Records. Swift has been featured in rankings such as Rolling Stone's 100 Greatest Songwriters of All Time, Billboard's Greatest of All Time Artists, the Time 100 and Forbes Celebrity 100. Honored with titles such as Artist of the Decade and Woman of the Decade, Swift is an advocate for artists' rights and women's empowerment. Her music is credited with influencing a generation of singers and songwriters.\nquestion: Mention the Record labels Taylor was associated with. Response: Taylor Swift was associated with the following record labels:\n1. Sony/ATV Music Publishing (2004) \n2. Big Machine Records (2005)\n3. Republic Records (2018)",0.4,True,"[""Uncategorized PII"", ""date of birth""]","[""December 13, 1989"", ""in 2004 and a"", ""in 2005"", ""2006"", ""1989"", ""2004"", ""2005""]"
1172,1172,"Prompt: context: Matilde Ciccia (born 6 October 1952 in Monasterace) is an Italian former ice dancer. Competing with Lamberto Ceserani, she won the gold medal at the Italian Figure Skating Championships several times. They finished fifth at the 1975 World Championships and sixth at the 1976 Winter Olympics. Ciccia had a role in the 1980 movie Zappatore.\nquestion: Extract from the text the name of the movie in which Matilde Ciccia had a role Response: Zappatore",0.4,True,"[""date of birth""]","[""6 October 1952 in"", ""at the 1975"", ""at the 1976"", ""in the 1980""]"


We now manually verify each datapoint, and choose to either censor or remove the ones that do contain PII. On your own, try to filter for other text issue categories in our cleanset! For a full walkthrough of using Cleanlab Studio's Python API for detecting text issues, check out our [tutorial](https://help.cleanlab.ai/tutorials/text_data_quickstart/).
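
Once a datapoint has been verified, censoring can be done by masking the flagged spans. The following is a minimal sketch, assuming `PII_items` is either a list of strings or a JSON-encoded list as it appears in the output above; the `redact` helper and the `[REDACTED]` mask are hypothetical names chosen for this illustration, and the toy rows are invented:

```python
import json
import pandas as pd

def redact(text, pii_items, mask="[REDACTED]"):
    """Replace every flagged PII span in `text` with `mask`."""
    items = json.loads(pii_items) if isinstance(pii_items, str) else pii_items
    # Replace longer items first so substrings of other items leave no fragments.
    for item in sorted(set(items), key=len, reverse=True):
        if item:  # skip empty strings, which would match everywhere
            text = text.replace(item, mask)
    return text

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "text": ["Contact me at jane@example.com today.", "No personal data here."],
    "is_PII": [True, False],
    "PII_items": ['["jane@example.com"]', "[]"],
})

toy["text_redacted"] = [
    redact(text, items) for text, items in zip(toy["text"], toy["PII_items"])
]
print(toy["text_redacted"].tolist())
```

Note that the flagged items above include spans such as plain years that are not actually sensitive, so blanket redaction without the manual verification step would remove harmless text as well.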