# Catching Issues in a Dynamically Growing Dataset

This is the recommended tutorial for programmatically auditing datasets that grow over time with the Cleanlab Studio [Python API](/guide/quickstart/api/).

In this tutorial, we consider data that comes in batches accumulated into a master dataset. While one could follow the other tutorials to use Cleanlab to auto-detect issues across the entire master dataset, here we demonstrate how to catch issues in the most recent batch of data. We additionally demonstrate how to fix issues in the latest data batch, in order to create a higher-quality master dataset.

While this tutorial focuses specifically on label issues for brevity, the same ideas can be applied to catch any of the other data issues Cleanlab Studio can auto-detect (outliers, near duplicates, unsafe or low-quality content, ...). This tutorial focuses on text data, but the same ideas can be applied to the other data modalities Cleanlab Studio supports such as images or structured/tabular data. We recommend first completing our [text data quickstart tutorial](/tutorials/text_data_quickstart/) to understand how Cleanlab Studio works with a static dataset.


## Install and import dependencies

Make sure you have `wget` installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

In [None]:
!pip install cleanlab-studio

In [2]:
import numpy as np
import pandas as pd
import os
import random

from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

## Fetch and view dataset

Here we use a variant of the [BANKING77](https://paperswithcode.com/dataset/banking77-oos) text classification dataset, in which customer service requests are labeled as belonging to one of *K* classes (intent categories). To fetch the dataset for this tutorial, make sure you have `wget` and `unzip` installed.

In [None]:
!mkdir -p data/
!wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/growing_dataset.zip -O data/growing_dataset.zip

In [4]:
!unzip -q data/growing_dataset.zip -d data/

This data is split across 3 batches of unequal size, which will be received incrementally in this tutorial. Batch 3 contains **1 class** that was not seen in batches 1 or 2, to help you handle applications where certain dataset classes might appear or disappear over time.

Let's view the first few rows of the first data batch:

In [None]:
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/growing_dataset")

batch_1 = pd.read_csv(os.path.join(dataset_path, 'data_batch1.csv'))

### Ensure unique identifier for data points

For a dynamic dataset, having a unique identifier for each data point allows us to better track results.
The current dataset has two columns, `text` and `label`, neither of which can be used to uniquely identify a row.

We can add an `id` column containing sequential numbers from 0 to the batch size minus 1. For each subsequent batch, the ids continue from where the previous batch left off, so they stay unique across the merged dataset.

In [6]:
# Create a new column and assign sequential numbers till batch size
batch_1["id"] = range(0, len(batch_1))

batch_1.head()

Unnamed: 0,text,label,id
0,why is there a fee when i thought there would be no fees?,card_payment_fee_charged,0
1,why can't my beneficiary access my account?,beneficiary_not_allowed,1
2,does it cost extra to send out more than one card?,getting_spare_card,2
3,can i change my pin at an atm?,change_pin,3
4,"i have a us credit card, will you accept it?",supported_cards_and_currencies,4


In [7]:
print(f"The total number of rows in the current dataset: {len(batch_1)}")

The total number of rows in the current dataset: 500
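
Because every later batch needs ids that continue from the previous one, the assignment step could be wrapped in a small helper. The following is only a sketch: the function name `assign_sequential_ids` and the toy batches are hypothetical and not part of the Cleanlab Studio API.

```python
import pandas as pd


def assign_sequential_ids(batch: pd.DataFrame, start: int, id_column: str = "id") -> int:
    """Add sequential ids to `batch` in place, beginning at `start`.

    Returns the next unused id, to pass as `start` for the following batch.
    (Hypothetical helper, for illustration only.)
    """
    stop = start + len(batch)
    batch[id_column] = range(start, stop)
    return stop


# Toy batches standing in for the tutorial's CSV files
toy_batch_a = pd.DataFrame({"text": ["a", "b", "c"], "label": ["x", "y", "x"]})
toy_batch_b = pd.DataFrame({"text": ["d", "e"], "label": ["y", "y"]})

next_id = assign_sequential_ids(toy_batch_a, start=0)
next_id = assign_sequential_ids(toy_batch_b, start=next_id)

# Ids must never repeat across batches
all_ids = pd.concat([toy_batch_a["id"], toy_batch_b["id"]])
assert all_ids.is_unique
print(toy_batch_b)
```

Carrying the running counter forward avoids recomputing the start id from the sizes of all earlier batches.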


## Load batch 1 into Cleanlab Studio

Upon receiving our batch of data, let's load it into Cleanlab Studio for analysis. First use your API key to instantiate a `Studio` object.

In [8]:
from cleanlab_studio import Studio

# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

studio = Studio(API_KEY)

Load the data from batch 1 into Cleanlab Studio. More details on uploading a dataset can be found in [this guide](/guide/quickstart/api/#uploading-a-dataset). We use the `id` column as the unique identifier when uploading the dataset.

In [9]:
# Identifier column
identifier_col = 'id'

In [None]:
dataset_id = studio.upload_dataset(batch_1, dataset_name="data-batch-1", id_column=identifier_col)
print(f"Dataset ID: {dataset_id}")

### Launch a Project

A Cleanlab Studio Project automatically trains ML models to provide AI-based analysis of your dataset. Let's launch one for the data we have received so far.

In [10]:
label_col = 'label'  # name of column containing labels

In [None]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="batch-1-analysis",
    modality="text",
    task_type="multi-class",
    model_type="regular",  # set this to "fast" if time-constrained
    label_column=label_col
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Once the project has been launched successfully and the `project_id` is visible, feel free to close this notebook. It will take some time for Cleanlab’s AI to train models on this data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.

You should only execute the above cell once per data batch.  After launching the project, you can poll for its status to programmatically wait until the results are ready for review:

In [None]:
# Fetch the cleanset id corresponding to the above project_id
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)

If your notebook timed out, you can resume by re-running the above lines of code. **Do not** create a new project for the same batch when coming back to this notebook. When the project is complete, the resulting [cleanset](/guide/concepts/cleanset) contains many Cleanlab columns of metadata that can be used to produce a **clean**er version of our original data**set**.
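
One way to avoid accidentally launching a second project after a restart is to write the ids to disk as soon as they are returned. The sketch below uses only the Python standard library. The helper `load_or_create_ids`, the file name, and the `fake_launch` stand-in are all hypothetical; in the notebook, the function passed in would wrap the upload and project-creation cells above.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_ids(state_path, create_fn):
    """Return ids saved at `state_path`; if none exist, call `create_fn` once and save its result."""
    path = Path(state_path)
    if path.exists():
        return json.loads(path.read_text())
    ids = create_fn()  # expected to return a JSON-serializable dict
    path.write_text(json.dumps(ids))
    return ids


calls = []


def fake_launch():
    # Stand-in for the dataset upload and project creation steps (made-up ids)
    calls.append(1)
    return {"dataset_id": "dataset-123", "project_id": "project-456"}


with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = Path(tmp_dir) / "batch_1_state.json"
    first = load_or_create_ids(state_file, fake_launch)
    second = load_or_create_ids(state_file, fake_launch)  # read from disk, no second launch

print(first == second, len(calls))
```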

### Review issues detected in batch 1

Fetch the [Cleanlab columns](/guide/concepts/cleanlab_columns/) of metadata for this [cleanset](/guide/concepts/cleanset) using its `cleanset_id`. These columns have the same length as our original data batch and provide metadata about each individual data point, like what types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another Project), simply call `studio.download_cleanlab_columns(cleanset_id)` with the `cleanset_id` printed from the previous cell.

In [12]:
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()

Unnamed: 0,id,corrected_label,is_label_issue,label_issue_score,suggested_label,suggested_label_confidence_score,is_ambiguous,ambiguous_score,is_well_labeled,is_near_duplicate,...,PII_score,PII_types,PII_items,is_informal,informal_score,is_non_english,non_english_score,predicted_language,is_toxic,toxic_score
0,0,,False,0.312013,,0.477797,False,0.946128,True,False,...,0.0,[],[],False,0.033124,False,0.007218,,False,0.181641
1,1,,False,0.254928,,0.565396,False,0.893294,True,False,...,0.0,[],[],False,0.160778,False,0.008784,,False,0.062164
2,2,,False,0.494538,,0.230466,False,0.968432,False,False,...,0.0,[],[],False,0.275605,False,0.004739,,False,0.184692
3,3,,False,0.353607,,0.411529,False,0.921283,True,False,...,0.0,[],[],False,0.390461,False,0.058934,,False,0.120056
4,4,,False,0.333672,,0.441308,False,0.892861,True,False,...,0.0,[],[],False,0.261901,False,0.00822,,False,0.107422


Details about the Cleanlab columns and their meanings can be found in [the Cleanlab columns guide](/guide/concepts/cleanlab_columns/).

In this tutorial, we focus on label issues only.  The ideas demonstrated here can be used for other types of data issues that Cleanlab Studio auto-detects.

A data point flagged with a **label issue** likely has a wrong given label. For such data points, consider correcting their label to the `suggested_label` if it seems more appropriate.

The data points exhibiting this issue are indicated with boolean values in the `is_label_issue` column, and the severity of this issue in each data point is quantified in the `label_issue_score` column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).

Let's create a `given_label` column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).

In [13]:
# Copy data into a new DataFrame
df = batch_1.copy()

# Combine the dataset with cleanlab columns
merge_df_cleanlab = df.merge(cleanlab_columns_df, on="id")

# Rename label column to "given_label"
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)

To see which data points are estimated to be mislabeled, we filter by `is_label_issue`. We sort by `label_issue_score` to see which of these data points are *most likely* mislabeled.

In [14]:
label_issues = merge_df_cleanlab.query("is_label_issue", engine='python').sort_values("label_issue_score", ascending=False)

columns_to_display = ["id", "text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]

display(label_issues[columns_to_display])

Unnamed: 0,id,text,label_issue_score,is_label_issue,given_label,suggested_label
7,7,can i change my pin on holiday?,0.706388,True,beneficiary_not_allowed,change_pin
459,459,will i be sent a new card before mine expires?,0.665286,True,apple_pay_or_google_pay,card_about_to_expire
117,117,my card is almost expired. how fast will i get a new one and what is the cost?,0.65673,True,apple_pay_or_google_pay,card_about_to_expire
54,54,is it possible to change my pin?,0.648368,True,beneficiary_not_allowed,change_pin
160,160,p,0.605489,True,getting_spare_card,supported_cards_and_currencies
115,115,can i get a new card even though i am in china?,0.57577,True,apple_pay_or_google_pay,card_about_to_expire
119,119,what currencies does google pay top up accept?,0.557327,True,apple_pay_or_google_pay,supported_cards_and_currencies
369,369,do i need to verify my top-up card?,0.52789,True,getting_spare_card,apple_pay_or_google_pay


Note that in most of these rows, the `given_label` really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request). The rows with a label issue score below 0.7 are less clear-cut. Luckily we can easily correct the clearly mislabeled data points by just using Cleanlab's `suggested_label` above, which seems like a more appropriate label in most cases.

While the boolean flags above help us estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. In this tutorial, we use a threshold on `label_issue_score` to select which flagged data points to fix, and exclude the flagged data points that don't meet the threshold.
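
To make the distinction between the flags and the scores concrete, here is a minimal sketch on a made-up DataFrame. The column names match the Cleanlab columns above, but the values are invented for illustration.

```python
import pandas as pd

# Invented values; only the column names come from the cleanset
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "is_label_issue": [False, True, True, False, True],
    "label_issue_score": [0.10, 0.82, 0.55, 0.30, 0.71],
})

# The boolean flag gives an estimate of the overall label error rate
estimated_error_rate = toy["is_label_issue"].mean()

# The numeric score orders the flagged rows for review, most severe first
review_queue = toy[toy["is_label_issue"]].sort_values("label_issue_score", ascending=False)

print(f"Estimated label error rate: {estimated_error_rate:.0%}")
print(review_queue)
```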

### Improve batch 1 data based on the detected issues

Let's use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets; we caution against blindly copying the actions we perform here.

For data flagged as label issues, we create a new `corrected_label` column, which will be the `given_label` for data points without detected label issues, and the `suggested_label` for data points with detected label issues. We use a `label_issue_score` threshold of 0.7 to determine which data points to re-label: flagged data points scoring at or above the threshold are re-labeled. The remaining data points flagged as label issues will be excluded from the dataset to avoid potential contamination.

Throughout, we track all of the rows we fixed (re-labeled) or excluded.

In [15]:
# Set issue score threshold
label_threshold = 0.70

# DataFrame to track excluded rows
threshold_filtered_rows = label_issues.query('label_issue_score < @label_threshold')

# Find indices of rows to exclude
ids_to_exclude = threshold_filtered_rows["id"]
indices_to_exclude = merge_df_cleanlab.query('id in @ids_to_exclude').index

print(f"Excluding {len(threshold_filtered_rows)} text examples (out of {len(merge_df_cleanlab)})")

# Drop rows from the merge DataFrame
merge_df_cleanlab = merge_df_cleanlab.drop(indices_to_exclude)

corrected_label = np.where(merge_df_cleanlab["is_label_issue"],
                           merge_df_cleanlab["suggested_label"],
                           merge_df_cleanlab["given_label"])

# DataFrame to track fixed (re-labeled) rows
label_issues_fixed_rows = merge_df_cleanlab.query("is_label_issue", engine='python')

Excluding 7 text examples (out of 500)
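
Since the same re-label-or-exclude step is applied to every batch in this tutorial, it could be factored into one function. The following is a sketch under the assumption that the input has the merged columns used above (`text`, `id`, `given_label`, `suggested_label`, `is_label_issue`, `label_issue_score`). The function name `relabel_or_exclude` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd


def relabel_or_exclude(df, threshold):
    """Apply the tutorial's strategy to one merged DataFrame.

    Flagged rows scoring below `threshold` are excluded; the remaining flagged
    rows take their suggested label. Returns (cleaned, relabeled_rows, excluded_rows).
    """
    flagged = df["is_label_issue"].astype(bool)
    to_exclude = flagged & (df["label_issue_score"] < threshold)

    excluded_rows = df[to_exclude]
    kept = df[~to_exclude]
    kept_flagged = kept["is_label_issue"].astype(bool)
    relabeled_rows = kept[kept_flagged]

    cleaned = kept[["text", "id"]].copy()
    cleaned["label"] = np.where(kept_flagged, kept["suggested_label"], kept["given_label"])
    return cleaned, relabeled_rows, excluded_rows


# Invented rows for illustration
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "text": ["change my pin", "new card please", "card fee?"],
    "given_label": ["change_pin", "change_pin", "change_pin"],
    "suggested_label": [None, "getting_spare_card", "card_payment_fee_charged"],
    "is_label_issue": [False, True, True],
    "label_issue_score": [0.20, 0.90, 0.50],
})

cleaned, relabeled, excluded = relabel_or_exclude(toy, threshold=0.7)
print(cleaned)
```

Returning the re-labeled and excluded rows alongside the cleaned data keeps the bookkeeping that the tutorial does with `label_issues_fixed_rows` and `threshold_filtered_rows`.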


Let's make a cleaned version of the batch 1 data after applying these corrections:

In [16]:
fixed_batch_1 = merge_df_cleanlab[["text", "id"]].copy()
fixed_batch_1["label"] = corrected_label

Let's also initialize our curated master dataset, a single DataFrame to store the clean data points accumulated across all the data batches.

In [17]:
fixed_dataset = pd.DataFrame(columns=["text", "label", "id"])
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_1], ignore_index=True)  # add clean data from batch 1

Perfect! Now let's grow our master dataset after receiving the data from batch 2.

## Adding a second batch of data

Our `fixed_dataset` currently contains the cleaned version of our first data batch. Suppose now we've collected a second batch of data, which consists of 300 rows, that we wish to add to this master fixed dataset.

In [18]:
batch_2 = pd.read_csv(os.path.join(dataset_path, 'data_batch2.csv'))

We again add a unique identifier `id`, which continues where batch 1 left off: it starts at the size of batch 1 (500) and ends just before the combined size of batch 1 and batch 2 (800).

In [19]:
total_rows = len(batch_1) + len(batch_2)
batch_2["id"] = range(len(batch_1), total_rows)
batch_2.head()

Let's check whether this second batch of data contains any class labels that differ from the first batch. We define a helper function that compares the unique values of the `label` column.

In [20]:
def compare_classes(new_data, historical_data, label_column):
    historical_data_classes = set(historical_data[label_column])
    new_batch_classes = set(new_data[label_column])
    if len(historical_data_classes.difference(new_batch_classes)) > 0:
        print(f"New batch has no data points from the following classes: {historical_data_classes.difference(new_batch_classes)}")
    if len(new_batch_classes.difference(historical_data_classes)) > 0:
        print(f"New batch has data points from previously unseen classes: {new_batch_classes.difference(historical_data_classes)}")

In [21]:
compare_classes(batch_2, fixed_dataset, label_col)

New batch has no data points from the following classes: {'lost_or_stolen_phone'}


The new batch 2 doesn't contain data points from the `lost_or_stolen_phone` class.
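
In an automated pipeline it may be more convenient to get these differences back as values rather than printed messages. Here is a sketch of such a variant. The name `class_differences` is hypothetical, and the example labels are taken from class names that appear in this tutorial.

```python
def class_differences(new_labels, historical_labels):
    """Return (missing_from_new, unseen_before) as two sets of class names."""
    new_classes = set(new_labels)
    historical_classes = set(historical_labels)
    return historical_classes - new_classes, new_classes - historical_classes


missing, unseen = class_differences(
    new_labels=["change_pin", "cancel_transfer"],
    historical_labels=["change_pin", "lost_or_stolen_phone"],
)

print(f"Classes absent from the new batch: {missing}")
print(f"Previously unseen classes: {unseen}")
```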

We'll repeat the Cleanlab Studio steps that we previously performed for our first data batch, this time on a larger dataset composed of our clean historical data plus the newest data batch.

In [22]:
batch_1_2 = pd.concat([fixed_dataset, batch_2], ignore_index=True)

### Load dataset, launch Project, and get Cleanlab columns

Again: if your notebook times out during any of the following steps, you likely don't need to re-run that step (re-running the step may take a long time again). Instead try to run the next step after restarting your notebook.

In [None]:
dataset_id = studio.upload_dataset(batch_1_2, dataset_name="data-batch-1-2", id_column=identifier_col)
print(f"Dataset ID: {dataset_id}")

In [None]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="batch-1-2-analysis",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column=label_col
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

In [None]:
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)

### Review issues detected in batch 2

Similar to how we reviewed the label issues detected in batch 1 data, here we will focus on the label issues detected in the newest (second) batch of data. Note that our Project analyzed this batch of data together with the clean historical data, as more data allows Cleanlab's AI to more accurately detect data issues. As before, the first step toward reviewing results is to merge the Cleanlab columns with the dataset that the Project was run on:

In [25]:
df = batch_1_2.copy()
merge_df_cleanlab = df.merge(cleanlab_columns_df, on="id")
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
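
Because later steps rely on every row having Cleanlab metadata, it can be worth checking the merge itself. The sketch below uses the `validate` and `indicator` options of pandas `DataFrame.merge` on made-up data; the DataFrame names are placeholders for `batch_1_2` and `cleanlab_columns_df`.

```python
import pandas as pd

# Placeholder data standing in for the uploaded dataset and its Cleanlab columns
dataset = pd.DataFrame({"id": [0, 1, 2], "text": ["a", "b", "c"], "label": ["x", "y", "x"]})
cleanlab_cols = pd.DataFrame({"id": [0, 1, 2], "is_label_issue": [False, True, False]})

merged = dataset.merge(
    cleanlab_cols,
    on="id",
    how="left",             # keep every dataset row, even if metadata is missing
    validate="one_to_one",  # raise if an id is duplicated on either side
    indicator=True,         # add a "_merge" column recording where each row came from
)

unmatched = merged[merged["_merge"] != "both"]
assert unmatched.empty, f"{len(unmatched)} rows have no Cleanlab metadata"
merged = merged.drop(columns="_merge")
print(merged)
```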

The current `merge_df_cleanlab` dataset consists of both the cleaned historical data (from batch 1) and the raw batch 2 data. Here we demonstrate how to focus on catching label issues in the new (batch 2) data only:

In [26]:
# Use identifier to create an array of batch-2 id's
ids_of_batch_2 = batch_2['id']
# Isolate batch-2 data from the merged dataset
batch_2_subset = merge_df_cleanlab.query('id in @ids_of_batch_2', engine='python')
# Get batch 2 rows flagged as label issues
label_issues = batch_2_subset.query("is_label_issue", engine='python').sort_values("label_issue_score", ascending=False)

display(label_issues[columns_to_display])

Unnamed: 0,id,text,label_issue_score,is_label_issue,given_label,suggested_label
742,749,why am i being charge a fee when using an atm?,0.789753,True,card_about_to_expire,card_payment_fee_charged
686,693,what atms will allow me to change my pin?,0.716268,True,beneficiary_not_allowed,change_pin
788,795,what services can i use to top up?,0.676624,True,apple_pay_or_google_pay,supported_cards_and_currencies
652,659,why do i see extra charges for withdrawing my money?,0.672563,True,card_about_to_expire,card_payment_fee_charged
587,594,bad bank,0.601772,True,apple_pay_or_google_pay,supported_cards_and_currencies


### Improve batch 2 data based on the detected issues

Assume we are working with a production data pipeline where fixing issues in the most recent batch of data is the highest priority. Just as before, we can apply the same strategy to clean the batch 2 data (re-label the flagged data points with `label_issue_score` above the same threshold and exclude the rest of the label issues from the dataset).

In [27]:
# Keep track of excluded rows
issues_below_threshold = label_issues.query('label_issue_score < @label_threshold', engine='python')

threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])

# Find indices of rows to exclude
ids_to_exclude = issues_below_threshold["id"]
indices_to_exclude = batch_2_subset.query('id in @ids_to_exclude', engine='python').index

print(f"Excluding {len(ids_to_exclude)} text example(s) (out of {len(batch_2_subset)} from batch-2)")

# Drop rows from the batch-2 subset
batch_2_subset = batch_2_subset.drop(indices_to_exclude)

corrected_label = np.where(batch_2_subset["is_label_issue"],
                           batch_2_subset["suggested_label"],
                           batch_2_subset["given_label"])

# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_2_subset.query("is_label_issue", engine='python')])

Excluding 3 text example(s) (out of 300 from batch-2)


Applying these corrections produces a cleaned version of the batch 2 data.
We add this cleaned batch 2 data to our master fixed dataset (which up to this point contained the cleaned batch 1 data).

In [28]:
fixed_batch_2 = batch_2_subset[["text", "id"]].copy()
fixed_batch_2["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_2], ignore_index=True)

Awesome! We have grown our master dataset with the additional data being collected, while still ensuring this dataset is clean and free of label issues.

## Adding another batch of data

Finally, let's add another batch of newly collected data (200 rows) to our master dataset. As previously mentioned, batch 3 contains data from 1 class that was not seen in batches 1 or 2.

In [29]:
batch_3 = pd.read_csv(os.path.join(dataset_path, 'data_batch3.csv'))

# Create an id column for unique identification of rows
total_rows = total_rows + len(batch_3)
batch_3["id"] = range(len(batch_1) + len(batch_2), total_rows)
batch_3.head()

Let's check the new class present in batch 3.

In [30]:
compare_classes(batch_3, fixed_dataset, label_col)

New batch has data points from previously unseen classes: {'cancel_transfer'}


We repeat the Cleanlab Studio related steps that we performed before.

In [31]:
batch_1_2_3 = pd.concat([fixed_dataset, batch_3], ignore_index=True)

### Load dataset, launch Project, and get Cleanlab columns

Again: if your notebook times out during any of the following steps, you likely don't need to re-run that step (re-running the step may take a long time again). Instead try to run the next step after restarting your notebook.

In [None]:
dataset_id = studio.upload_dataset(batch_1_2_3, dataset_name="data-batch-1-3", id_column=identifier_col)
print(f"Dataset ID: {dataset_id}")

In [None]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="batch-1-3-analysis",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column=label_col
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

In [None]:
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)

### Review issues detected in batch 3

In [33]:
df = batch_1_2_3.copy()
merge_df_cleanlab = df.merge(cleanlab_columns_df, on="id")
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)

The merged dataset `batch_1_2_3` consists of clean historical data (from batches 1 + 2) and the raw batch 3 data.
As before, let's focus on the issues detected in the batch 3 data:

In [34]:
# Create an array of batch-3 id's
ids_of_batch_3 = batch_3['id']
# Isolate batch-3 subset from the current dataset
batch_3_subset = merge_df_cleanlab.query('id in @ids_of_batch_3', engine='python')
# Fetch rows with label issues
label_issues = batch_3_subset.query("is_label_issue", engine='python').sort_values("label_issue_score", ascending=False)

display(label_issues[columns_to_display])

Unnamed: 0,id,text,label_issue_score,is_label_issue,given_label,suggested_label
850,860,which currencies can i used to add funds to my account?,0.850637,True,cancel_transfer,supported_cards_and_currencies
978,988,i was charged for getting cash.,0.834098,True,card_about_to_expire,card_payment_fee_charged
949,959,"so, i was just charged for my recent atm withdrawal and any withdrawal prior to this has been free. what is the issue here?",0.769936,True,card_about_to_expire,card_payment_fee_charged
840,850,how long does it take for a top up to be approved?,0.582027,True,cancel_transfer,supported_cards_and_currencies


### Review issues in older batches

While we've been focusing on the issues detected in the latest batch of data only, we can also see if any issues have been detected in the older historical data (the cleaned version of batches 1 and 2). Now that there is significantly more data in the Cleanlab Studio Project, the AI is able to detect data issues more accurately and may find issues missed in previous rounds. Let's see if there are any new label issues detected in the previous cleaned versions of batches 1 and 2:

In [35]:
batch_1_2_subset = merge_df_cleanlab.query('id not in @ids_of_batch_3', engine='python')

display(batch_1_2_subset.query("is_label_issue", engine='python')[columns_to_display])

Unnamed: 0,id,text,label_issue_score,is_label_issue,given_label,suggested_label
39,39,please tell me how to change my pin.,0.8304,True,beneficiary_not_allowed,change_pin
67,68,how do i find my new pin?,0.811962,True,visa_or_mastercard,change_pin
90,91,explain roth ira,0.585754,True,beneficiary_not_allowed,supported_cards_and_currencies
94,95,what cards do you offer?,0.726255,True,visa_or_mastercard,supported_cards_and_currencies


While we've been fixing the label issues detected in each older data batch at the time the data was collected, this doesn't guarantee the older batches are 100% free of label issues.

To demonstrate another type of issue, we can also review the outliers detected in these data batches in isolation. From the latest Cleanlab Studio Project, here are the outliers detected in batch 3:

In [36]:
columns_to_display_outlier = ["id", "text", "outlier_score", "is_outlier", "given_label", "suggested_label"]
outlier_issues_batch_3 = merge_df_cleanlab.query('(id in @ids_of_batch_3) & (is_outlier)', engine='python')

display(outlier_issues_batch_3[columns_to_display_outlier])

Unnamed: 0,id,text,outlier_score,is_outlier,given_label,suggested_label
852,862,cancel transaction,0.178126,True,cancel_transfer,cancel_transfer


Outliers detected in batch 2:

In [37]:
outlier_issues_batch_2 = merge_df_cleanlab.query('(id in @ids_of_batch_2) & (is_outlier)', engine='python')
display(outlier_issues_batch_2[columns_to_display_outlier])

Unnamed: 0,id,text,outlier_score,is_outlier,given_label,suggested_label
502,509,metal card,0.234098,True,card_about_to_expire,getting_spare_card
561,568,changing my pin,0.17211,True,change_pin,change_pin
582,589,750 credit score,0.154556,True,getting_spare_card,supported_cards_and_currencies
639,647,404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body>,0.202901,True,change_pin,cancel_transfer


Outliers detected in batch 1:

In [38]:
ids_of_batch_1 = batch_1["id"]
outlier_issues_batch_1 = merge_df_cleanlab.query('(id in @ids_of_batch_1) & (is_outlier)', engine='python')
display(outlier_issues_batch_1[columns_to_display_outlier])

Unnamed: 0,id,text,outlier_score,is_outlier,given_label,suggested_label
90,91,explain roth ira,0.176312,True,beneficiary_not_allowed,supported_cards_and_currencies
280,285,payment did not process,0.184214,True,beneficiary_not_allowed,card_payment_fee_charged
450,456,switch banks,0.186031,True,change_pin,change_pin
456,463,my sc,0.247411,True,apple_pay_or_google_pay,supported_cards_and_currencies
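
Instead of one query per batch, the per-batch counts can also be summarized in a single pass by mapping each id to its batch. This sketch assumes ids were assigned in contiguous ranges, as in this tutorial (where the boundaries would be 0, 500, 800, and 1000). The toy data and its boundaries are invented.

```python
import pandas as pd

# Invented flags; only the column names come from the cleanset
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "is_outlier": [False, True, False, False, True, True],
    "is_label_issue": [True, False, False, False, False, True],
})

# Toy boundaries: batch 1 holds ids 0-1, batch 2 ids 2-3, batch 3 ids 4-5
batch_starts = [0, 2, 4, 6]
toy["batch"] = pd.cut(toy["id"], bins=batch_starts, right=False, labels=[1, 2, 3])

# Count flagged rows of each issue type per batch
summary = toy.groupby("batch", observed=True)[["is_outlier", "is_label_issue"]].sum()
print(summary)
```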


### Improve batch 3 data based on the detected issues

Finally, we fix just the label issues detected in batch 3, using the same strategy applied to the previous data batches.

In [39]:
# Keep track of excluded rows
issues_below_threshold = label_issues.query('label_issue_score < @label_threshold', engine='python')

threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])

# Find indices of rows to exclude
ids_to_exclude = issues_below_threshold["id"]
indices_to_exclude = batch_3_subset.query('id in @ids_to_exclude', engine='python').index

print(f"Excluding {len(ids_to_exclude)} text example(s) (out of {len(batch_3_subset)} from batch-3)")

# Drop rows from the batch-3 subset
batch_3_subset = batch_3_subset.drop(indices_to_exclude)

corrected_label = np.where(batch_3_subset["is_label_issue"],
                           batch_3_subset["suggested_label"],
                           batch_3_subset["given_label"])

# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_3_subset.query("is_label_issue", engine='python')])

Excluding 1 text example(s) (out of 200 from batch-3)


And then add the cleaned batch 3 data to our master dataset:

In [40]:
fixed_batch_3 = batch_3_subset[["text", "id"]].copy()
fixed_batch_3["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_3], ignore_index=True)

print(f"Total number of label issues fixed across all 3 batches: {len(label_issues_fixed_rows)}")
print(f"Total number of rows, with label issues, excluded due to score less than threshold ({label_threshold}): {len(threshold_filtered_rows)}")

Total number of label issues fixed across all 3 batches: 6
Total number of rows, with label issues, excluded due to score less than threshold (0.7): 11


### Saving the master dataset

After cleaning and accumulating multiple batches of data, we save the resulting master fixed dataset to a CSV file. The cleaned dataset has the same format as your original dataset, so you can use it as a plug-in replacement to get more reliable results in your ML/Analytics pipelines (without changing your existing modeling code).

In [41]:
new_dataset_filename = "fixed_dataset.csv"  # Location to save clean master dataset

if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite, so please delete it first or specify a different new_dataset_filename.")
else:
    fixed_dataset.to_csv(new_dataset_filename, index=False, columns=["text", "label"])
    print(f"Master fixed dataset saved to {new_dataset_filename}")

Master fixed dataset saved to fixed_dataset.csv
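
As a final sanity check, the saved file can be read back to confirm it has the expected columns and row count. This is a sketch on a toy DataFrame written to a temporary directory, so it does not touch `fixed_dataset.csv`; the variable names are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the master fixed dataset
toy_fixed = pd.DataFrame({"text": ["a", "b"], "label": ["x", "y"], "id": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "fixed_dataset.csv"
    toy_fixed.to_csv(out_path, index=False, columns=["text", "label"])
    reloaded = pd.read_csv(out_path)

# The saved file should match the original format: text and label only
assert list(reloaded.columns) == ["text", "label"]
assert len(reloaded) == len(toy_fixed)
print(reloaded)
```

Note that the `id` column is dropped on save here, matching the cell above. If you expect to add further batches later, you may prefer to keep it in the saved file.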
