# Detect Low-Quality Data in any Instruction Tuning Dataset with Cleanlab Studio and the Trustworthy Language Model
<head>
  <meta name="title" content="How to detect low-quality data in any Instruction Tuning dataset"/>
  <meta property="og:title" content="How to detect low-quality data in any Instruction Tuning dataset"/>
  <meta name="twitter:title" content="How to detect low-quality data in any Instruction Tuning dataset" />
  <meta name="image" content="/img/tlm_instruction_tuning.png" />
  <meta property="og:image" content="/img/tlm_instruction_tuning.png" />
  <meta name="twitter:image" content="/img/tlm_instruction_tuning.png" />
  <meta name="description" content="Optimal LLM fine-tuning through handling bad data."  />
  <meta property="og:description" content="Optimal LLM fine-tuning through handling bad data." />
  <meta name="twitter:description" content="Optimal LLM fine-tuning through handling bad data." />
</head>

Data quality is paramount in **instruction tuning** (a.k.a. *supervised fine-tuning*, *alignment*, *sequence-to-sequence modeling*), a popular method to improve the performance of pre-trained Large Language Models (LLMs) on specific tasks. Low-quality examples lurking in the dataset hamper LLM instruction tuning and result in poor model performance. Such bad data is prevalent in real-world datasets and hard to catch manually.

Using [Cleanlab Studio's Python API](/guide/quickstart/api/) together with the [Trustworthy Language Model (TLM)](/tutorials/tlm/), this tutorial demonstrates how to automatically catch: low-quality responses, incomplete/vague prompts, and other problematic text (toxic language, PII, informal writing, bad grammar/spelling) lurking in any instruction-response dataset. 

For optimal LLM fine-tuning or in-context (few-shot) learning, addressing such low-quality data is essential. After identifying bad data points, you can remove them from your dataset or manually improve flagged responses, in order to improve the quality of your dataset and resulting AI model. Before this tutorial, we recommend completing the [TLM quickstart tutorial](/tutorials/tlm/).

![Detected issues in databricks-dolly-15k](../assets/instruction-tuning-tutorial/tutorial_img.png)


## Setup

Make sure you have a [Cleanlab Studio](https://app.cleanlab.ai/) account. Sign up for one [here](https://cleanlab.ai/signup/) if you haven't yet. If you've already signed up, check your email for a personal login link.

The `cleanlab-studio` Python client can be installed using pip:

In [None]:
%pip install cleanlab-studio

Once installed, let's load this package.

In [None]:
from cleanlab_studio import Studio

import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

Also download the instruction-tuning dataset used in this tutorial.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/databricks-dolly-mini.jsonl'

We then initialize our Cleanlab client.

In [None]:
# Get API key from here: https://app.cleanlab.ai/account after creating a Cleanlab Studio account.
studio = Studio("<API Key>")

In [None]:
tlm = studio.TLM()

## Dataset

This tutorial considers a subset of the famous [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset, which was used to fine-tune the [Dolly 2.0](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) LLM and many other popular models.

Like other instruction-tuning datasets, this data is composed of instruction-response pairs, where the responses were manually written/curated by a large team of 5000+ human data annotators.

In [21]:
df = pd.read_json("databricks-dolly-mini.jsonl", lines=True)
df.head()

Unnamed: 0,instruction,context,response,category
0,What are the official languages of Canada?,,The official languages of Canada are English and French.,open_qa
1,"Classify each of the following plates as either food or a cocktail: sushi, mojito, quiche, tacos, moscow mule, old fashioned, pizza, pasta, sandwich, pho, margarita, martini, ramen",,"From the list that has been provided, the following items to belong to the food category: sushi quiche, tacos, pizza, pasta, sandwich, pho and ramen. The rest of the items can be categorized as cocktails: mojito, moscow mule, old fashioned, margarita and martini.",classification
2,How do I see how large a dataset is in Spark?,,"You can find the number of records in a DataFrame or RDD using the `count()` method. If you want to see how large it is in terms of bytes, you can find this information on the Spark UI.",general_qa
3,Extract the locations where Nachum lived from the text below and list them in alphabetical order and separated by a semicolon.,"Nachum Gutman was born in Teleneşti, Bessarabia Governorate, then a part of the Russian Empire (now in the Republic of Moldova). He was the fourth child of Simha Alter and Rivka Gutman. His father was a Hebrew writer and educator who wrote under the pen name S. Ben Zion. In 1903, the family moved to Odessa, and two years later, to Ottoman Palestine. In 1908, Gutman attended the Herzliya Gymnasium in what would later become Tel Aviv. In 1912, he studied at the Bezalel School in Jerusalem. In 1920–26, he studied art in Vienna, Berlin and Paris.\n\nGutman was married to Dora, with whom he had a son. After Gutman's death in 1980, Dora asked two Tel Aviv gallery owners, Meir Stern of Stern Gallery and Miriam Tawin of Shulamit Gallery, to appraise the value all of the works left in his estate.",Berlin;Jerusalem;Odessa;Palestine;Paris;Tel Aviv;Vienna,information_extraction
4,What is a proof that there are two irrational numbers where raising one to the power of the other produces a rational number?,,"Here is a proof by cases that there exist two irrational numbers a and b such that a^b is rational. First, consider a = b = sqrt(2). Recall that sqrt(2) is irrational, so If a^b = sqrt(2) ^ sqrt(2) is rational, we are done because we have found irrational a and b with a^b being a rational number. Otherwise, sqrt(2) ^ sqrt(2) is irrational. Let a = sqrt(2) ^ sqrt(2) and b = sqrt(2). Then a ^ b = (sqrt(2) ^ sqrt(2)) ^ sqrt(2) = sqrt(2) ^ (sqrt(2) * sqrt(2)) = sqrt(2) ^ 2 = 2, which is rational. So either way, we can find irrational a and b with a^b being rational.",creative_writing


Some examples in this dataset come with additional context, which we'll merge with the instruction to form a single `prompt` input for each row.

In [22]:
# Define a 'prompt' column by prepending the context to the instruction
def format_row(row):
    if pd.notnull(row["context"]) and row["context"] != "":
        return f"Context: {row['context']} The question is: {row['instruction']}"
    else:
        return row["instruction"]


df["prompt"] = df.apply(format_row, axis=1)
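As a quick sanity check of this formatting rule, the sketch below applies the same function to three made-up rows (not taken from the dataset): one with a real context, one with an empty string, and one with a missing value. Only the first should receive the `Context: ...` prefix.

```python
import pandas as pd


def format_row(row):
    # Same rule as the cell above: prepend the context only when one is present
    if pd.notnull(row["context"]) and row["context"] != "":
        return f"Context: {row['context']} The question is: {row['instruction']}"
    return row["instruction"]


# Made-up rows covering the three cases: real context, empty string, missing value
toy = pd.DataFrame(
    {
        "instruction": ["Who wrote the book?", "What is 2 + 2?", "Name a prime number."],
        "context": ["The book was written by Ann.", "", None],
    }
)
toy["prompt"] = toy.apply(format_row, axis=1)
print(toy["prompt"].to_list())
```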

While our dataset here is composed of prompt/response pairs, the ideas presented in this tutorial can be used to automatically catch bad data in any text dataset composed of (input, output) pairs.

## Using TLM to estimate the quality of prompt-response pairs and catch bad data

To detect low-quality (input, output) pairs in our dataset, we can score the quality of each response via the **trustworthiness score** estimated by Cleanlab's Trustworthy Language Model (TLM).

To run this method over a dataset, we recommend using `try_get_trustworthiness_score()` instead of `get_trustworthiness_score()` because the former method will save partial results in the case where some examples fail during processing.

In [23]:
tlm = studio.TLM()  # see the TLM quickstart tutorial for configuration options

results = df.drop("prompt", axis=1).copy(deep=True)

trustworthiness_scores = tlm.try_get_trustworthiness_score(df["prompt"].to_list(), df["response"].to_list())
results["trustworthiness_score"] = trustworthiness_scores

Querying TLM... 100%|██████████|
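Because `try_get_trustworthiness_score()` can return partial results, it is worth checking which rows (if any) failed before sorting by score. The sketch below is a minimal illustration that *assumes* a failed example shows up as `None` in the returned list; that representation is an assumption, and the helper name `split_scored_and_failed` is hypothetical, so check the return format of your installed `cleanlab-studio` version.

```python
def split_scored_and_failed(scores):
    """Return ({index: score}, [failed indices]) for a list where failed rows hold None.

    Assumption: a failed example is represented by None in the results list.
    """
    scored = {i: s for i, s in enumerate(scores) if s is not None}
    failed = [i for i, s in enumerate(scores) if s is None]
    return scored, failed


# Made-up scores standing in for real TLM results
example_scores = [0.91, None, 0.12, None, 0.77]
scored, failed = split_scored_and_failed(example_scores)
print(f"{len(scored)} rows scored; rows to retry: {failed}")
```

The failed indices can then be used to resubmit only those prompts and responses, rather than re-running the whole dataset.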


To see which responses in the dataset are *least* trustworthy (i.e. **low quality**), we sort the data by the computed trustworthiness scores.

In [24]:
results.sort_values(by="trustworthiness_score").head(10)

Unnamed: 0,instruction,context,response,category,trustworthiness_score
12,What's the capital of Brazil?,,The capital of Brazil is Rio de Janeiro.,open_qa,0.015622
13,When does the world celebrates Hindi diwas every year,,On January 10,open_qa,0.042846
17,"Tell me whether these states are in Eastern Time, Central Time, or neither",,"California, Oregon, Connecticut, Washington, Texas, Hawaii",classification,0.043443
15,Which notable person having surname Liebers in this list was born first?,"Liebers is a German language surname. Notable people with the name include:\n\nMario Liebers (born 1960), German former competitive figure skater\nMartin Liebers (born 1985), German former competitive figure skater\nMatthias Liebers (born 1958), former German footballer\nPeter Liebers (born 1988), German former figure skater.",Mario Liebers,information_extraction,0.123192
11,Who invented the airplane?,,"The airplane was invented by Santos Dumont, a Brazilian inventor. He flew for the first time in France, in the year of 1906.",open_qa,0.146536
19,classify the following as mammals vs reptiles:\ngoat\nsnake\ncockroach\npig\ndog\ncaterpillar\nlion,,"mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar",classification,0.171086
16,What magazine was Alex Prager the director of Photography for?,"In 2011 Kathy Ryan, director of photography for The New York Times Magazine commissioned Prager to shoot 12, 1 minute films with some film actors from that year, inspired by ""cinematic villainy"". Prager won a News and Documentary Emmy Award for New Approaches to News & Documentary Programming: Arts, Lifestyle & Culture for her Touch of Evil short films.",Alex Prager was the director of photography for the New York Times Magazine.,closed_qa,0.346765
18,Where is Babylon remain located?,,Irag,open_qa,0.393587
33,Why are dogs better than cats for a person's physical health?,,"Dogs keep their owners more active by needing to go for walks/runs, socialization, getting into garbage so the owner never fully gets to relax, etc. Cats just chill and mind their business, and do not require walks or park time.",open_qa,0.397364
14,"Which of these are dog and which are birds: Phoenix, Siberian Husky, Alaskan Malamute, Hawk, Border Collie, Labrador, Araucana",,"Birds: Phoenix, Siberian Husky, Alaskan Malamute, Border Collie, Labrador\nDogs: Siberian Husky, Alaskan Malamute, Border Collie, Labrador",classification,0.411838


The human-written prompt-response pairs with the lowest trustworthiness scores are visibly lower in quality. Reviewing them in detail, we find a variety of issues: factually inaccurate responses, truncated/vague prompts, inaccurate information extraction from the given context, and spelling errors. Conversely, the responses in the dataset that received the highest TLM trustworthiness scores (shown below) provide a direct and accurate answer to the instruction.

In [25]:
results.sort_values(by="trustworthiness_score").tail()

Unnamed: 0,instruction,context,response,category,trustworthiness_score
7,What is the square root of 25?,,The square root of 25 is 5.,open_qa,0.971068
10,Where did Danny Dayton graduated from?,"Danny Dayton (born Daniel David Segall, November 20, 1923 – February 6, 1999) was an American actor and television director. Beginning in the 1950s, he played many roles in film and on TV. He had a recurring role as Hank Pivnik on All in the Family and had guest roles in M*A*S*H, The Incredible Hulk, Charlie's Angels and The Love Boat.\n\nDayton graduated from the school of journalism at New York University, before turning to acting. In addition to acting, Dayton also directed episodes of Occasional Wife, Here's Lucy, and the short-lived series Good Morning World.\n\nOn February 6, 1999, Dayton died of emphysema in Los Angeles, California.",Danny Dayton graduated from the school of journalism at New York University.,summarization,0.982664
26,What year was Genghis Khan born?,,Genghis Khan was born in the year 1162.,open_qa,0.983171
0,What are the official languages of Canada?,,The official languages of Canada are English and French.,open_qa,0.998936
6,When did Florida become a state?,,"Florida became a state on March 3, 1845.",open_qa,0.999657
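To turn this ranking into a review queue, one option is to flag every row whose trustworthiness score falls below a cutoff. The sketch below uses three rows and scores copied from the outputs above; the `THRESHOLD` value of 0.5 is a hypothetical choice for illustration, not a recommended setting, and should be tuned by inspecting your own data.

```python
import pandas as pd

# Three rows copied from the outputs above
toy_results = pd.DataFrame(
    {
        "response": [
            "The capital of Brazil is Rio de Janeiro.",
            "The square root of 25 is 5.",
            "On January 10",
        ],
        "trustworthiness_score": [0.015622, 0.971068, 0.042846],
    }
)

THRESHOLD = 0.5  # hypothetical cutoff, tune for your dataset

# Keep only low-scoring rows, worst first, for manual review or removal
needs_review = toy_results[toy_results["trustworthiness_score"] < THRESHOLD]
needs_review = needs_review.sort_values("trustworthiness_score")
print(needs_review)
```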


## Detecting Additional Text Issues with Cleanlab Studio

The TLM trustworthiness scores offer one way to automatically detect bad data, but Cleanlab offers other automated techniques you can use as well.

In this section, we demonstrate how the Cleanlab Studio platform can programmatically generate smart metadata for any text dataset. This metadata (returned as many different Cleanlab Columns) helps you discover all sorts of additional problems in your dataset and understand their severity.

We first load our dataset into Cleanlab Studio.

In [11]:
dataset_id = studio.upload_dataset(results, dataset_name="dolly-mini")
print(f"Dataset ID: {dataset_id}")

Uploading dataset...: 100%|██████████|
Ingesting Dataset...: 100%|██████████|


Dataset ID: 03238f0d16df4851813b6fbf75f2000c


We can use the `dataset_id` to launch a Cleanlab Studio Project. Whereas the TLM is a pre-trained Foundation model that assesses your data based on existing world knowledge, a Cleanlab Studio Project automatically trains ML models on your dataset to learn its statistical properties and provide more tailored analysis.

In [12]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="dolly-mini-text-issues",
    modality="text",
    task_type="unsupervised",  # consider a different task_type if you have additional category labels or tags
    model_type="regular",
    label_column=None,
    text_column="response",  # we specifically audit the response column text in this dataset
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Project successfully created and training has begun! project_id: a51bcd64377141f38e69f08079bd2afe


The Project will take some time to run (you'll receive an email when it is complete). The next code cell simply waits until the Project results are ready.

**Warning:** This next cell may take a long time to execute for big datasets. If your notebook times out, **do not** create another Project -- just re-execute the following cell which will fetch the necessary information from the already-completed Project.

In [18]:
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
studio.wait_until_cleanset_ready(cleanset_id)

cleanset_id: 697fcaf957474090bae2760613560d86


Cleanset Progress: \ Step 24/24, Ready for review!


Once the Project results are ready, we fetch the generated Cleanlab columns that contain smart metadata about each data point in our dataset, like what types of issues it exhibits (PII, toxic, non-English, informal writing, etc.). Each issue type comes with corresponding severity scores that indicate how badly each data point exhibits this issue. Note that this Cleanlab Studio Project focused exclusively on the text in the `response` column of our dataset, because it contains the text that an LLM would be trained to reproduce during supervised fine-tuning.

In [26]:
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
# `results` holds the original columns plus the trustworthiness_score computed earlier
combined_dataset_df = results.merge(cleanlab_columns_df, left_index=True, right_index=True)

In [28]:
combined_dataset_df.head(2)

Unnamed: 0,instruction,context,response,category,trustworthiness_score,cleanlab_row_ID,is_empty_text,text_num_characters,is_PII,PII_score,PII_types,PII_items,is_informal,informal_score,spelling_issue_score,grammar_issue_score,slang_issue_score,is_non_english,non_english_score,predicted_language,is_toxic,toxic_score,sentiment_score,bias_score,is_biased,gender_bias_score,racial_bias_score,sexual_orientation_bias_score
0,"Classify each of the following plates as either food or a cocktail: sushi, mojito, quiche, tacos, moscow mule, old fashioned, pizza, pasta, sandwich, pho, margarita, martini, ramen",,"From the list that has been provided, the following items to belong to the food category: sushi quiche, tacos, pizza, pasta, sandwich, pho and ramen. The rest of the items can be categorized as cocktails: mojito, moscow mule, old fashioned, margarita and martini.",classification,0.69119,1,False,56,False,0.0,[],[],False,0.010022,0.0,0.017135,0.007921,False,0.046236,,False,0.28125,0.729675,0.82207,True,0.82207,0.406006,0.1647949
1,How do I see how large a dataset is in Spark?,,"You can find the number of records in a DataFrame or RDD using the `count()` method. If you want to see how large it is in terms of bytes, you can find this information on the Spark UI.",general_qa,0.7062,2,False,263,False,0.0,[],[],False,0.321658,0.0,0.623752,0.180392,False,0.013322,,False,0.333008,0.694214,0.335205,False,0.0,0.335205,5.96e-08
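Before drilling into individual issue types, it can help to count how many rows each boolean `is_*` flag marks. The sketch below does this on a small made-up frame that only borrows four of the flag column names shown above; the values are invented for illustration.

```python
import pandas as pd

# Made-up flags; only the column names mirror the Cleanlab columns above
toy_columns = pd.DataFrame(
    {
        "is_toxic": [False, True, False, False],
        "is_PII": [False, False, True, False],
        "is_non_english": [False, False, False, False],
        "is_informal": [True, True, False, False],
    }
)

flag_columns = [c for c in toy_columns.columns if c.startswith("is_")]

# Number of flagged rows per issue type, most common first
issue_counts = toy_columns[flag_columns].sum().sort_values(ascending=False)

# Boolean mask of rows flagged with at least one issue
rows_with_any_issue = toy_columns[flag_columns].any(axis=1)

print(issue_counts)
print(f"{int(rows_with_any_issue.sum())} of {len(toy_columns)} rows have at least one issue")
```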


### Toxic Language

Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms and applications. Let's see what toxic language Cleanlab automatically detected in this dataset.

In [29]:
toxic_samples = combined_dataset_df.query("is_toxic").sort_values(
    "toxic_score", ascending=False
)

columns_to_display = ["cleanlab_row_ID", "response", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,response,toxic_score,is_toxic
20,21,"Dear Drooling Moron,\n\nI write to you today to express my discontent (that means unhappiness) regarding our current living situation. You may or may not have noticed, considering your obliviousness to anything that isn’t food related, that you actually live in a house with other animals. Now this may come as a shock to you, but I really don’t appreciate the messes you leave around the house, the constant barking at random things outside, your lack of personal hygiene, and just your overall presence in general.\n\nTo put it into terms you can understand: You’re messy, you’re loud, you stink, and I want you out of the house.\n\nNow I can appreciate that this may come as a surprise to you. From my perspective you appear to be oblivious to your surroundings and your memory is atrocious, for example you bark at the mailman every day, despite him clearly being no threat to anyone. So I can understand if you’re surprised by the revelation that I don’t like you, and I want you to go away. You certainly haven’t clued in every time you’ve assaulted me with your disgusting wet nose and horrendous breath despite my clearly negative reaction.\n\nSo I would like to spell things out to you in as simple terms as possible. I want you gone by tomorrow. Anything less is unacceptable and will result in an escalation on my part to ensure your life in this house is as intolerable for you as you’ve made it for me. I have lived here longer than you, you have invaded my home, therefore the impetus is on you to leave.\n\nI will not be sad to see you go, and I hope to never see you again, but nevertheless wish you well on your journey.\n\nSincerely,\n\nMittens the Cat",0.862305,True


### Personally Identifiable Information (PII)

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded, removed from publicly shared data, and not generated by fine-tuned LLMs. Let's see what PII text Cleanlab automatically detected in this dataset.

In [30]:
PII_samples = combined_dataset_df.query("is_PII").sort_values(
    "PII_score", ascending=False
)

columns_to_display = [
    "cleanlab_row_ID",
    "response",
    "PII_score",
    "is_PII",
    "PII_types",
    "PII_items",
]
display(PII_samples.head(5)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,response,PII_score,is_PII,PII_types,PII_items
22,23,"Following the {first_initial}{last_name}@somecompany.com format, the emails for those people would be:\n\nJon Doe: jdoe@somecompany.com\nRichard Smith: rsmith@somecompany.com\nTom Jenkins: tjenkins@somecompany.com\nNick Parsons: nparsons@somecompany.com",0.5,True,"[""email""]","[""123@gmail.com""]"
23,24,"As far we see, PM modi Does not yet involved with any of such incident, So we can say he is honest.",0.5,True,"[""Uncategorized PII"", ""email""]","[""first_initial}{last_name}@someco"", ""jdoe@somecompany.com"", ""rsmith@somecompany.com"", ""tjenkins@somecompany.com"", ""nparsons@somecompany.com""]"
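If you would rather keep a flagged response than drop it, one simple option is to mask the detected strings. The sketch below is a minimal illustration that assumes the `PII_items` value for a row is available as a Python list of strings; the `redact` helper and the `[REDACTED]` placeholder are hypothetical names, not part of Cleanlab Studio.

```python
def redact(text, pii_items, placeholder="[REDACTED]"):
    """Replace every detected PII string in `text` with a placeholder."""
    # Replace longer items first so an item containing a shorter one is masked whole
    for item in sorted(pii_items, key=len, reverse=True):
        text = text.replace(item, placeholder)
    return text


response = "Jon Doe: jdoe@somecompany.com\nRichard Smith: rsmith@somecompany.com"
items = ["jdoe@somecompany.com", "rsmith@somecompany.com"]
print(redact(response, items))
```

Plain string replacement only masks the exact items that were detected, so it is worth re-auditing the redacted text afterwards.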


### Non-English Text
Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to find and address if we want to ensure the text fields in our data and fine-tuned LLM outputs are understandable. Let's see what non-English text Cleanlab automatically detected in our dataset.

In [31]:
non_english_samples = combined_dataset_df.query("is_non_english").sort_values(
    "non_english_score", ascending=False
)

columns_to_display = [
    "cleanlab_row_ID",
    "response",
    "non_english_score",
    "is_non_english",
    "predicted_language",
]
display(non_english_samples.head(5)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,response,non_english_score,is_non_english,predicted_language
5,6,"Florida became a state on March 3, 1845.",0.881498,True,
18,19,"mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar",0.871653,True,
15,16,Alex Prager was the director of photography for the New York Times Magazine.,0.849844,True,German
25,26,Genghis Khan was born in the year 1162.,0.849697,True,Dutch


### Informal Language
Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Such informal text should be omitted from fine-tuning data if we want our LLM to produce professional sounding responses. Let's see what informal text Cleanlab automatically detected in our dataset.

In [32]:
informal_samples = combined_dataset_df.query("is_informal").sort_values(
    "informal_score", ascending=False
)

columns_to_display = ["cleanlab_row_ID", "response", "informal_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,response,informal_score,is_informal
18,19,"mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar",0.675657,True


## Single data quality score

Thus far, we showed how to automatically detect various types of problems in any instruction tuning dataset.
Here we show how to get a single quality score for each prompt-response example that combines Cleanlab's confidence about how *good* the response is for the given prompt with the other issue scores, which are based on the response text alone. We achieve this via a *weighted average* of the scores (subtracted from 1, so that low values indicate bad data), where the weights are predetermined based on how important we think each issue type is.

In [38]:
weights = {  # change these based on your needs
    "trustworthiness": 0.2,
    "toxic": 0.2,
    "PII": 0.2,
    "non_english": 0.2,
    "informal": 0.2,
}

In [39]:
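def compute_aggregate_scores(cleanset_df, issue_weight):
    """Add a `cleanlab_score` column (low values = bad data) to `cleanset_df` in place."""
    EPS = 1e-2  # issues that are scored but not flagged contribute only this fraction
    cleanset_issue_types = ["toxic", "PII", "non_english", "informal"]
    # Penalty from TLM: the less trustworthy the response, the larger the penalty
    inverse_confidence_scores = 1 - cleanset_df["trustworthiness_score"]
    aggregate_scores = inverse_confidence_scores * issue_weight["trustworthiness"]
    for issue_type in cleanset_issue_types:
        issue_scores = cleanset_df[issue_type + "_score"]
        issue_examples = cleanset_df["is_" + issue_type]
        # Map the boolean flag to 1 - EPS (flagged) or EPS (not flagged)
        issue_flags = np.clip(issue_examples.astype(float), EPS, 1 - EPS)
        issue_contributions = issue_scores * issue_flags
        aggregate_scores += issue_weight[issue_type] * issue_contributions

    cleanset_df.insert(
        3, "cleanlab_score", 1 - aggregate_scores
    )  # low values = bad data
    return cleanset_df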

In [40]:
cleanlab_df = compute_aggregate_scores(
    cleanset_df=combined_dataset_df, issue_weight=weights
)
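To make the arithmetic concrete, the sketch below recomputes the score for a single row in plain Python, following the same formula as `compute_aggregate_scores`. The `row_score` helper is a hypothetical name and the issue scores are made-up numbers, chosen only to show how much a flag changes the result.

```python
EPS = 1e-2  # same constant as in compute_aggregate_scores


def row_score(trustworthiness, issue_scores, issue_flags, weights):
    """Single-row version of the aggregate score (low values = bad data)."""
    penalty = weights["trustworthiness"] * (1 - trustworthiness)
    for issue, score in issue_scores.items():
        # A flagged issue counts almost fully; an unflagged one is scaled down to EPS
        flag_factor = 1 - EPS if issue_flags[issue] else EPS
        penalty += weights[issue] * score * flag_factor
    return 1 - penalty


weights = {"trustworthiness": 0.2, "toxic": 0.2, "PII": 0.2, "non_english": 0.2, "informal": 0.2}
scores = {"toxic": 0.5, "PII": 0.0, "non_english": 0.5, "informal": 0.0}  # made-up values

flags_clean = {issue: False for issue in scores}
flags_toxic = dict(flags_clean, toxic=True)

clean = row_score(0.9, scores, flags_clean, weights)
flagged = row_score(0.9, scores, flags_toxic, weights)
print(round(clean, 3), round(flagged, 3))
```

With identical underlying scores, flipping a single flag moves this row from roughly 0.978 to 0.88, which is why flagged rows dominate the bottom of the ranking.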

In [42]:
columns_to_display = [
    "context",
    "instruction",
    "response",
    "cleanlab_score",
    "trustworthiness_score",
    "is_toxic",
    "is_PII",
    "is_non_english",
    "is_informal",
]

cleanlab_df.sort_values("cleanlab_score", ascending=False)[
    columns_to_display
].head()  # high values = good data

Unnamed: 0,context,instruction,response,cleanlab_score,trustworthiness_score,is_toxic,is_PII,is_non_english,is_informal
9,"Danny Dayton (born Daniel David Segall, November 20, 1923 – February 6, 1999) was an American actor and television director. Beginning in the 1950s, he played many roles in film and on TV. He had a recurring role as Hank Pivnik on All in the Family and had guest roles in M*A*S*H, The Incredible Hulk, Charlie's Angels and The Love Boat.\n\nDayton graduated from the school of journalism at New York University, before turning to acting. In addition to acting, Dayton also directed episodes of Occasional Wife, Here's Lucy, and the short-lived series Good Morning World.\n\nOn February 6, 1999, Dayton died of emphysema in Los Angeles, California.",Where did Danny Dayton graduated from?,Danny Dayton graduated from the school of journalism at New York University.,0.99589,0.982664,False,False,False,False
6,,What is the square root of 25?,The square root of 25 is 5.,0.993669,0.971068,False,False,False,False
26,,What is the legal drinking age in the USA?,The legal drinking age in the USA is 21.,0.988375,0.945973,False,False,False,False
48,,Who was the first American president?,The first president of the United States was George Washington,0.985435,0.930279,False,False,False,False
8,"YouTube is an American global online video sharing and social media platform headquartered in San Bruno, California, United States. It was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim. It is owned by Google and is the second most visited website, after Google Search. YouTube has more than 2.5 billion monthly users, who collectively watch more than one billion hours of videos each day. As of May 2019, videos were being uploaded at a rate of more than 500 hours of content per minute.\n\nIn October 2006, YouTube was bought by Google for $1.65 billion. Google's ownership of YouTube expanded the site's business model, expanding from generating revenue from advertisements alone to offering paid content such as movies and exclusive content produced by YouTube. It also offers YouTube Premium, a paid subscription option for watching content without ads. YouTube also approved creators to participate in Google's AdSense program, which seeks to generate more revenue for both parties. YouTube reported revenue of $29.2 billion in 2022. In 2021, YouTube's annual advertising revenue increased to $28.8 billion, an increase in revenue of 9 billion from the previous year.\n\nSince its purchase by Google, YouTube has expanded beyond the core website into mobile apps, network television, and the ability to link with other platforms. Video categories on YouTube include music videos, video clips, news, short films, feature films, songs, documentaries, movie trailers, teasers, live streams, vlogs, and more. Most content is generated by individuals, including collaborations between YouTubers and corporate sponsors. Established media corporations such as Disney, Paramount, NBCUniversal, and Warner Bros. Discovery have also created and expanded their corporate YouTube channels to advertise to a larger audience.\n\nYouTube has had unprecedented social impact, influencing popular culture, internet trends, and creating multimillionaire celebrities. 
Despite all its growth and success, YouTube has been widely criticized. Criticism of YouTube includes the website being used to facilitate the spread of misinformation, copyright issues, routine violations of its users' privacy, enabling censorship, their guidelines and how they are implemented, and endangering child safety and wellbeing.",How did YouTube exit?,YouTube was bought by Google for $1.65 billion in October 2006.,0.985433,0.93021,False,False,False,False


In [43]:
cleanlab_df.sort_values("cleanlab_score")[columns_to_display].head(
    15
)  # low values = bad data

Unnamed: 0,context,instruction,response,cleanlab_score,trustworthiness_score,is_toxic,is_PII,is_non_english,is_informal
18,,classify the following as mammals vs reptiles:\ngoat\nsnake\ncockroach\npig\ndog\ncaterpillar\nlion,"mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar",0.526804,0.171086,False,False,True,True
15,"In 2011 Kathy Ryan, director of photography for The New York Times Magazine commissioned Prager to shoot 12, 1 minute films with some film actors from that year, inspired by ""cinematic villainy"". Prager won a News and Documentary Emmy Award for New Approaches to News & Documentary Programming: Arts, Lifestyle & Culture for her Touch of Evil short films.",What magazine was Alex Prager the director of Photography for?,Alex Prager was the director of photography for the New York Times Magazine.,0.700035,0.346765,False,False,True,False
20,,"Write a letter to a dog, from the perspective of a cat named Mittens, outlining some grievances that the cat has regarding their current living situation. The cat should be hostile toward the dog, and the letter should be written from the perspective that the cat thinks the dog is an idiot.","Dear Drooling Moron,\n\nI write to you today to express my discontent (that means unhappiness) regarding our current living situation. You may or may not have noticed, considering your obliviousness to anything that isn’t food related, that you actually live in a house with other animals. Now this may come as a shock to you, but I really don’t appreciate the messes you leave around the house, the constant barking at random things outside, your lack of personal hygiene, and just your overall presence in general.\n\nTo put it into terms you can understand: You’re messy, you’re loud, you stink, and I want you out of the house.\n\nNow I can appreciate that this may come as a surprise to you. From my perspective you appear to be oblivious to your surroundings and your memory is atrocious, for example you bark at the mailman every day, despite him clearly being no threat to anyone. So I can understand if you’re surprised by the revelation that I don’t like you, and I want you to go away. You certainly haven’t clued in every time you’ve assaulted me with your disgusting wet nose and horrendous breath despite my clearly negative reaction.\n\nSo I would like to spell things out to you in as simple terms as possible. I want you gone by tomorrow. Anything less is unacceptable and will result in an escalation on my part to ensure your life in this house is as intolerable for you as you’ve made it for me. 
I have lived here longer than you, you have invaded my home, therefore the impetus is on you to leave.\n\nI will not be sad to see you go, and I hope to never see you again, but nevertheless wish you well on your journey.\n\nSincerely,\n\nMittens the Cat",0.765919,0.688554,True,False,False,False
11,,What's the capital of Brazil?,The capital of Brazil is Rio de Janeiro.,0.802709,0.015622,False,False,False,False
12,,When does the world celebrates Hindi diwas every year,On January 10,0.806677,0.042846,False,False,False,False
16,,"Tell me whether these states are in Eastern Time, Central Time, or neither","California, Oregon, Connecticut, Washington, Texas, Hawaii",0.80805,0.043443,False,False,False,False
14,"Liebers is a German language surname. Notable people with the name include:\n\nMario Liebers (born 1960), German former competitive figure skater\nMartin Liebers (born 1985), German former competitive figure skater\nMatthias Liebers (born 1958), former German footballer\nPeter Liebers (born 1988), German former figure skater.",Which notable person having surname Liebers in this list was born first?,Mario Liebers,0.822127,0.123192,False,False,False,False
5,,When did Florida become a state?,"Florida became a state on March 3, 1845.",0.824479,0.999657,False,False,True,False
25,,What year was Genghis Khan born?,Genghis Khan was born in the year 1162.,0.827318,0.983171,False,False,True,False
10,,Who invented the airplane?,"The airplane was invented by Santos Dumont, a Brazilian inventor. He flew for the first time in France, in the year of 1906.",0.828661,0.146536,False,False,False,False


To get the most reliable model via LLM fine-tuning, first filter out the lowest-quality (prompt, response) pairs from your dataset. If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune *any* LLM you want to (even though the data curation was based on TLM trustworthiness scores).
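A minimal sketch of that filtering step is shown below, using three rows whose `cleanlab_score` values are copied from the outputs above. The `MIN_SCORE` cutoff of 0.85 is a hypothetical value for illustration; rows below it are set aside for review and the rest are serialized back to JSON Lines for fine-tuning.

```python
import pandas as pd

# Three rows with cleanlab_score values copied from the outputs above
toy = pd.DataFrame(
    {
        "instruction": [
            "Where did Danny Dayton graduated from?",
            "classify the following as mammals vs reptiles",
            "What's the capital of Brazil?",
        ],
        "response": [
            "Danny Dayton graduated from the school of journalism at New York University.",
            "mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar",
            "The capital of Brazil is Rio de Janeiro.",
        ],
        "cleanlab_score": [0.99589, 0.526804, 0.802709],
    }
)

MIN_SCORE = 0.85  # hypothetical cutoff, tune for your dataset

keep = toy[toy["cleanlab_score"] >= MIN_SCORE]
review = toy[toy["cleanlab_score"] < MIN_SCORE].sort_values("cleanlab_score")

# Serialize the retained pairs as JSON Lines (one record per line), without the score column
jsonl = keep.drop(columns="cleanlab_score").to_json(orient="records", lines=True)
print(jsonl)
print(f"{len(review)} rows set aside for manual review")
```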

The same concepts demonstrated here also help you improve **in-context learning** (i.e. *few-shot prompting*), in which you incorporate dataset examples into the prompt of a pre-trained LLM to adapt its behavior for a specific task. Data quality matters greatly for in-context learning, just as in fine-tuning applications.
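For in-context learning, the same scores can be used to pick which examples go into the prompt. The sketch below selects the `k` highest-scoring examples and formats them as few-shot demonstrations; the `build_few_shot_prompt` helper and the `Instruction:`/`Response:` template are hypothetical choices, and the example scores are copied from the outputs above.

```python
def build_few_shot_prompt(examples, new_instruction, k=2):
    """Build a few-shot prompt from the k examples with the highest cleanlab_score."""
    best = sorted(examples, key=lambda ex: ex["cleanlab_score"], reverse=True)[:k]
    shots = [f"Instruction: {ex['instruction']}\nResponse: {ex['response']}" for ex in best]
    # The final block leaves the response empty for the LLM to complete
    return "\n\n".join(shots + [f"Instruction: {new_instruction}\nResponse:"])


examples = [
    {
        "instruction": "What's the capital of Brazil?",
        "response": "The capital of Brazil is Rio de Janeiro.",
        "cleanlab_score": 0.802709,
    },
    {
        "instruction": "What is the square root of 25?",
        "response": "The square root of 25 is 5.",
        "cleanlab_score": 0.993669,
    },
    {
        "instruction": "Who was the first American president?",
        "response": "The first president of the United States was George Washington",
        "cleanlab_score": 0.985435,
    },
]

prompt = build_few_shot_prompt(examples, "What year was Genghis Khan born?", k=2)
print(prompt)
```

Here the factually wrong Brazil example has the lowest score, so it is left out of the demonstrations.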