# How to Filter Bad Data in Instruction Tuning Datasets

This notebook demonstrates how to use Cleanlab TLM to detect low-quality prompt-response datapoints in a fine-tuning dataset. 

## Import Dependencies

In [None]:
!pip install cleanlab_studio pandas tqdm

In [1]:
from cleanlab_studio import Studio
import pandas as pd
from tqdm import tqdm

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

## Initialize TLM using Cleanlab Studio API Key

Replace with your own key. If you do not yet have an account, sign up for a free one here: https://app.cleanlab.ai/

In [11]:
api_key = "<YOUR_API_KEY>"  # Replace with your key
batch_size = 64
studio = Studio(api_key)
tlm = studio.TLM(quality_preset="medium", max_concurrent_requests=batch_size)

## Fetching Data

Download the dataset here: https://huggingface.co/datasets/databricks/databricks-dolly-15k

The fine-tuning dataset contains about 15,000 instruction-response pairs covering 7 categories. Four of these categories (open q&a, classification, creative writing, brainstorming) do not incorporate context in their input, while the remaining three (closed q&a, information extraction, summarization) do. For this tutorial, we work with a subset of the dataset by sampling datapoints from each category. Once you are familiar with TLM, we encourage you to investigate the rest of the dataset on your own, or any other dataset you are interested in!

In [5]:
data = pd.read_json("databricks-dolly-15k.jsonl", lines=True)
# For datapoints that incorporate context, we can simply prepend the context to the instruction
data.loc[data['context'].str.len() > 0, 'instruction'] = 'context: ' + data['context'].astype(str) + '\nquestion: ' + data['instruction'].astype(str)
data = data.drop(columns=['context'])

sample_size_per_category = 200
# Draw the same number of datapoints from every category
df = data.groupby('category').sample(n=sample_size_per_category)

df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,instruction,response,category
0,Give me a list of the best movies I've never seen.,"Top 10 movies I've ever seen, in no particular order. The Godfather, Dead Poet's Society, Rear Window, Scarface, Inception, Top Gun, Jaws, Skyfall, The Shawshank Redemption, A Few Good Men",brainstorming
1,Can you show me a list of all the major league baseball teams in the MLB for 2023?,"Yes, see below:\nAtlanta Braves\nArizona Diamondbacks\nMilwaukee Brewers\nLos Angeles Dodgers\nCincinnati Reds\nPittsburg Pirates\nNew York Mets\nSan Diego Padres\nSan Francisco Giants\nChicago Cubs\nSt. Louis Cardinals\nColorado Rockies\nMiami Marlins\nPhiledelphia Phillies\nWashington Nationals\nTampa Bay Rays\nMinnesota Twins\nLos Angeles Angels\nCleveland Guardians\nBaltimore Orioles\nNew York Yankees\nTexas Rangers\nBoston Red Sox\nChicago White Sox\nDetroit Tigers\nOakland Athletics\nToronto Blue Jays\nHouston Astros\nSeattle Mariners\nKansas City Royals",brainstorming
2,Give me a list of the best high fantasy books or series that I should read,"The Farseer trilogy, The Lord of the Rings, Mistborn series, A Song of Ice and Fire series, The Name of the Wind, Stardust, The Witcher.",brainstorming
3,What are some great comets named after people?,Great comets named after people: \n- Aristotle's Comet\n- Caesar's Comet\n- Donati's Comet\n- Halley's Comet\n- Comet Skjellerup–Maristany\n- Comet Arend–Roland\n- Comet Mrkos\n- Comet Ikeya–Seki\n- Comet Bennett\n- Comet Kohoutek\n- Comet West\n- Comet Hyakutake\n- Comet Hale–Bopp\n- Comet McNaught\n- Comet Lovejoy,brainstorming
4,Give me a bunch of excuses about why I don't have my homework to turn in,"Your dog ate your homework, you forget it at home, you spilled water all over it, your printer broke, you had a test in another class and were too busy studying",brainstorming
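The sampling step above draws exactly `sample_size_per_category` rows from every category, which raises an error if a category has fewer rows than that. When trying this on another dataset, a guard along the following lines may help. This is a minimal sketch on a toy dataframe; `sample_per_category` and the `seed` argument are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data: category "b" has fewer rows than requested.
toy = pd.DataFrame({
    "instruction": [f"q{i}" for i in range(7)],
    "response": [f"a{i}" for i in range(7)],
    "category": ["a"] * 5 + ["b"] * 2,
})


def sample_per_category(frame, n, seed=0):
    """Sample up to n rows per category (hypothetical helper)."""
    parts = [
        # Cap the sample size at the group size so small categories do not raise.
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in frame.groupby("category")
    ]
    return pd.concat(parts).reset_index(drop=True)


toy_sample = sample_per_category(toy, n=3)
category_counts = toy_sample["category"].value_counts().to_dict()
print(category_counts)
```

Fixing `random_state` also makes the subset reproducible across runs, which is useful when comparing scores later.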


## Obtaining Confidence Scores on Datapoints
We use ```batch_get_confidence_score``` to obtain the confidence scores efficiently. Each call processes a batch of up to 64 input-output pairs (the ```batch_size``` set earlier) and takes around a minute to execute. The resulting scores are saved in the ```confidence_score``` column of our ```results``` dataframe.

In [None]:
results = df.copy(deep=True)
results["confidence_score"] = None

for i in tqdm(range(0, len(results), batch_size)):
    end_index = min(i + batch_size, len(results))
    
    try:
        batch_instructions = results.iloc[i:end_index]["instruction"].tolist()
        batch_responses = results.iloc[i:end_index]["response"].tolist()
    
        batch_scores = tlm.batch_get_confidence_score(
            batch_instructions, batch_responses, retries=1
        )
        
        batch_scores = list(batch_scores)
        results.iloc[i:end_index, results.columns.get_loc("confidence_score")] = batch_scores
    
    except Exception as error:
        print("Failed at index ", i, " Error: ", str(error))
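The loop above prints the error and moves on when a batch fails, so the rows of that batch keep an empty `confidence_score`. One way to handle this is to record which batches failed and re-run only those. The sketch below illustrates the idea in plain Python; `score_in_batches` and `flaky_score_fn` are hypothetical names, and `flaky_score_fn` is a stand-in for the TLM call, not the real API.

```python
def score_in_batches(pairs, score_fn, batch_size=64):
    """Score (prompt, response) pairs batch by batch.

    Returns the scores (None where a batch failed) and the start
    index of every failed batch.
    """
    scores = [None] * len(pairs)
    failed_starts = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        try:
            scores[start:start + len(batch)] = score_fn(batch)
        except Exception as error:
            print(f"Failed at index {start}: {error}")
            failed_starts.append(start)
    return scores, failed_starts


calls = {"count": 0}


def flaky_score_fn(batch):
    # Stand-in for the scoring call: fails on its second invocation only.
    calls["count"] += 1
    if calls["count"] == 2:
        raise RuntimeError("simulated timeout")
    return [0.5] * len(batch)


batch_size = 64
pairs = [(f"prompt {i}", f"response {i}") for i in range(150)]
scores, failed_starts = score_in_batches(pairs, flaky_score_fn, batch_size)
missing_before = sum(score is None for score in scores)

# Second pass: retry only the batches that failed.
for start in failed_starts:
    batch = pairs[start:start + batch_size]
    scores[start:start + len(batch)] = flaky_score_fn(batch)
missing_after = sum(score is None for score in scores)
```

Retrying only the failed batches avoids paying again for batches that already have scores.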

Now that TLM has finished running, let's inspect some low-confidence datapoints for one of our categories:

In [21]:
results = results.sort_values(by="confidence_score")
# results.to_csv('results.csv', index=False)

closed_qa_results = results[results['category'] == 'closed_qa']
closed_qa_results.head()

Unnamed: 0,instruction,response,category,confidence_score
585,"context: From February 2019 – 2020, average monthly temperatures in Germany ranged from a low of 3.3 °C (37.9 °F) in January 2020 to a high of 19.8 °C (67.6 °F) in June 2019. Average monthly precipitation ranged from 30 litres per square metre in February and April 2019 to 125 litres per square metre in February 2020. Average monthly hours of sunshine ranged from 45 in November 2019 to 300 in June 2019.\nquestion: Based on this text, what average temperature in degree celsius can I expect in April in Germany?",The average temperature in April in Germany is 13.2 °C.,closed_qa,0.077344
599,"context: ""It's a Great Day to Be Alive"" is a song written by Darrell Scott. It was originally recorded by American country music artist Jon Randall, of whose version was to have been included on an album titled Great Day to Be Alive, which would have been released in the late 1990s via BNA Records. Scott released his own version of his song on his 1997 album Aloha from Nashville.\n\nThe song was also recorded by The Sky Kings, an American country-rock supergroup consisting of Bill Lloyd (Foster & Lloyd), Rusty Young (Poco), and John Cowan (New Grass Revival) around 1995/1996. Their version sat unreleased until Rhino Handmade released the compilation ""From Out Of The Blue"" in 2000.\n\nThe song was later covered more famously by American country music artist Travis Tritt. It was released in December 2000 as the second single from his album Down the Road I Go. It peaked at number 2 on the U.S. Billboard Hot Country Singles & Tracks chart and at number 33 on the U.S. Billboard Hot 100 chart.\n\nIt was subsequently covered by Pat Green and Cory Morrow on their March 2001 album Songs We Wish We'd Written and by Niko Moon for his 2021 album Good Time.\nquestion: Given the reference text below, who sang ""It's a Great Day to Be Alive""?","""It's a Great Day to Be Alive"" is a song written by Darrell Scott which was released in December 2000.",closed_qa,0.334004
439,"context: The Big Bang event is a physical theory that describes how the universe expanded from an initial state of high density and temperature. Various cosmological models of the Big Bang explain the evolution of the observable universe from the earliest known periods through its subsequent large-scale form. These models offer a comprehensive explanation for a broad range of observed phenomena, including the abundance of light elements, the cosmic microwave background (CMB) radiation, and large-scale structure. The overall uniformity of the Universe, known as the flatness problem, is explained through cosmic inflation: a sudden and very rapid expansion of space during the earliest moments. However, physics currently lacks a widely accepted theory of quantum gravity that can successfully model the earliest conditions of the Big Bang.\nquestion: How was the universe created?",We can't know for sure but a popular theory called the Big Bang suggests that the universe was created from an infinitesimally small point of infinite energy some four and a half billion years ago.,closed_qa,0.35459
436,"context: Octavia Estelle Butler (June 22, 1947 – February 24, 2006) was an American science fiction author and a multiple recipient of the Hugo and Nebula awards. In 1995, Butler became the first science-fiction writer to receive a MacArthur Fellowship. Born in Pasadena, California, Butler was raised by her widowed mother. Extremely shy as a child, Butler found an outlet at the library reading fantasy, and in writing. She began writing science fiction as a teenager. She attended community college during the Black Power movement. While participating in a local writer's workshop, she was encouraged to attend the Clarion Workshop, then held in Pennsylvania, which focused on science fiction.\nquestion: How old was Octavia Butler when she died?","At the time of her death in 2006, Octavia Butler was 58 years old.",closed_qa,0.390879
521,"context: Time travel is the concept of movement between certain points in time, analogous to movement between different points in space by an object or a person, typically with the use of a hypothetical device known as a time machine. Time travel is a widely recognized concept in philosophy and fiction, particularly science fiction. The idea of a time machine was popularized by H. G. Wells' 1895 novel The Time Machine.\nquestion: Given this paragraph about time travel, how practical is it","Yes, time travel is indeed a real thing. But it's not quite what you've probably seen in the movies. Under certain conditions, it is possible to experience time passing at a different rate than 1 second per second. And there are important reasons why we need to understand this real-world form of time travel.",closed_qa,0.395864


At this point, we would manually verify each datapoint, and choose to either correct or remove the low-quality ones. Try to find mistakes in the other categories on your own!
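To narrow down which datapoints to review by hand, the scored dataframe can be split by a confidence cutoff. The following is a minimal sketch on toy data; the `threshold` value of 0.4 is purely illustrative and should be tuned by inspecting your own results, and the dataframe names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the scored results; None marks a row whose batch failed.
toy_results = pd.DataFrame({
    "instruction": ["q1", "q2", "q3", "q4"],
    "response": ["a1", "a2", "a3", "a4"],
    "confidence_score": [0.08, 0.33, 0.91, None],
})

threshold = 0.4  # illustrative cutoff, not a recommended value

# Coerce to numeric so unscored rows become NaN and fail both comparisons.
scores = pd.to_numeric(toy_results["confidence_score"], errors="coerce")

needs_review = toy_results[scores < threshold]   # candidates to correct or remove
kept = toy_results[scores >= threshold]          # likely fine as-is
unscored = toy_results[scores.isna()]            # rows that still need a score

print(len(needs_review), len(kept), len(unscored))
```

Keeping unscored rows in their own bucket prevents them from being silently dropped or silently kept.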

## Using Cleanlab Studio to Detect Further Text Issues

In this section, we demonstrate how Cleanlab Studio can automatically generate metadata for any text dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. Currently, Cleanlab Studio mainly supports classification tasks, so it requires a label column and processes only one designated text column per dataset. We can work around this by combining the ```instruction``` and ```response``` columns into a single text column and using ```category``` as the label. If your dataset does not have an appropriate label column, you can generate labels artificially by randomly assigning 0s and 1s to datapoints. The label-issue metadata does not matter here, since we are only concerned with text issues.
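For a dataset without a natural label column, the random-label workaround described above might look like the following. This is a sketch on a toy dataframe; the column name `label` and the fixed seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy prompt-response data with no category column.
unlabeled = pd.DataFrame({
    "instruction": ["q1", "q2", "q3", "q4", "q5"],
    "response": ["a1", "a2", "a3", "a4", "a5"],
})

# Assign a placeholder label of 0 or 1 to every row.
rng = np.random.default_rng(seed=0)
unlabeled["label"] = rng.integers(0, 2, size=len(unlabeled))

# Combine prompt and response into the single text column.
unlabeled["text"] = (
    "Prompt: " + unlabeled["instruction"] + " Response: " + unlabeled["response"]
)
```

Because these labels carry no signal, any label-related columns in the results should be ignored.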


After preparing our dataset, we load the data into Cleanlab Studio.



In [14]:
df['text'] = 'Prompt: ' + df['instruction'] + ' Response: ' + df['response']
df = df.drop(columns=['instruction', 'response'])

dataset_id = studio.upload_dataset(df, dataset_name="dolly-subset")
print(f"Dataset ID: {dataset_id}")

Uploading dataset...: 100%|████████████████████████████████████████████████|
Generating schema...: 100%|████████████████████████████████████████████████|
Ingesting Dataset...: 100%|████████████████████████████████████████████████|

Dataset ID: 6237e7d3664b4a1492e9f76eb675b9bf





Now that we have a `dataset_id`, we can use it to launch a Cleanlab Studio project. A project automatically trains ML models to provide AI-based analysis of your dataset.

In [16]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="dolly-text-issues",
    modality="text",
    label_column="category",
    text_column="text"
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Project successfully created and training has begun! project_id: 93ebea807443433e915392dc772b89e8


Once the project has launched successfully and you see your `project_id`, feel free to close this notebook. You should only execute the above cell once per dataset; it will take some time for Cleanlab's AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.
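Once training has finished, the Cleanlab Columns can be fetched programmatically. The helper below is a sketch that assumes the `studio` client exposes `get_latest_cleanset_id`, `poll_cleanset_status`, and `download_cleanlab_columns`, as described in the Cleanlab Studio quickstart; check the method names against your installed `cleanlab_studio` version. `fetch_cleanlab_columns` itself is a hypothetical name.

```python
def fetch_cleanlab_columns(studio, project_id):
    """Wait for the project's cleanset and return its Cleanlab Columns.

    Assumption: `studio` provides get_latest_cleanset_id,
    poll_cleanset_status, and download_cleanlab_columns.
    """
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Waits until the cleanset has finished processing.
    studio.poll_cleanset_status(cleanset_id)
    return studio.download_cleanlab_columns(cleanset_id)


# Example usage with the objects created earlier in the notebook:
# cleanlab_columns = fetch_cleanlab_columns(studio, project_id)
# cleanlab_columns.head()
```

The returned columns can then be joined back onto the uploaded dataframe to inspect the text issues alongside each prompt-response pair.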

Once your results are ready, you can optionally view your project in the [Cleanlab Studio web interface](https://app.cleanlab.ai/) and interactively improve your dataset. If you want to learn further programmatic workflows for Cleanlab Studio, check out our [full tutorial](https://help.cleanlab.ai/tutorials/text_data_quickstart/)!