# Improving ML Performance Via Data Curation with Train vs Test Splits

In typical Machine Learning projects, we split our dataset into **training** data for fitting models and **test** data to evaluate model performance. For noisy real-world datasets, detecting/correcting errors in the training data is important to train robust models, but it's less recognized that the test set can also be noisy.
For accurate model evaluation, it is vital to **find and fix issues in the test data** as well. Some evaluation metrics are particularly sensitive to outliers and noisy labels.
This tutorial demonstrates a way to use `cleanlab` (via `Datalab`) to curate both your training and test data, ensuring **robust model training** and **reliable performance evaluation**.
We recommend first completing some `Datalab` tutorials before diving into this more complex subject.
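
To make the effect of test label noise concrete, here is a minimal sketch with made-up labels (none of these values come from the tutorial's dataset). A hypothetical model that predicts every true label correctly still appears to make errors when scored against noisy test labels.

```python
# Toy illustration: accuracy measured against noisy vs. corrected test labels.
# All labels below are invented for this sketch.
true_labels = ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"]

# Simulate two annotation errors in the test set.
noisy_labels = list(true_labels)
noisy_labels[2] = "F"
noisy_labels[7] = "A"

# A hypothetical model whose predictions match the true labels exactly.
predictions = list(true_labels)


def accuracy(labels, preds):
    """Fraction of positions where the prediction equals the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


acc_vs_noisy = accuracy(noisy_labels, predictions)
acc_vs_clean = accuracy(true_labels, predictions)
print(f"Accuracy vs noisy test labels: {acc_vs_noisy:.1f}")
print(f"Accuracy vs clean test labels: {acc_vs_clean:.1f}")
```

The gap between the two numbers comes entirely from the test labels, not from the model.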

Here's how we recommend handling noisy training and test data (this tutorial walks through these steps):

1. [Preprocess](https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d) your training and test data to be suitable for ML. Use cleanlab to check for issues in the merged dataset like train/test leakage or drift.
2. Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your **test** data.
3. Manually review/correct cleanlab-detected issues in your test data. **We caution against blindly automated correction of test data**. Changes to your test set should be carefully verified to ensure they will lead to more accurate model evaluation. We also caution against comparing the performance of different ML models across different versions of your test data; performance comparisons between models should be based on the same test data.
4. Cross-validate a new copy of your ML model on your training data, and then use it with cleanlab to detect issues in the **training** dataset. Do not include test data in any part of this step to avoid leaking test set information into the training data curation.
5. You can try **automated techniques** to curate your training data based on cleanlab results, train models on the curated training data, and evaluate them on the cleaned test data.

Consider this tutorial as a blueprint for using cleanlab in diverse ML projects spanning various data modalities. The same ideas apply if you substitute *test* data with *validation* data above. In a final advanced section of this tutorial, we show how training data edits can be parameterized in terms of cleanlab's detected issues, such that hyperparameter optimization can identify the optimal combination of data edits for training an effective ML model.

**Note**: This tutorial trains an XGBoost model on a tabular dataset, but the same approach applies to *any* ML model and data modality.

## 1. Install required dependencies

`Datalab` has additional dependencies that are not included in the standard installation of cleanlab.
You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install xgboost
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [1]:
# Package installation (hidden on docs website).
dependencies = ["cleanlab", "xgboost", "datasets"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    dependencies_test = [dependency.split('>')[0] if '>' in dependency 
                         else dependency.split('<')[0] if '<' in dependency 
                         else dependency.split('=')[0] for dependency in dependencies]
    missing_dependencies = []
    for dependency in dependencies_test:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

In [2]:
import random
import os
import math
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import cleanlab
from cleanlab import Datalab

SEED = 123456  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

### Load the data 

This tutorial considers a classification task with structured/tabular data. The goal is to predict each student's final grade in a course based on various numeric/categorical features about them (exam scores and notes).

In [3]:
df_train = pd.read_csv(
    "https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/clos_train_data.csv"
)

df_test = pd.read_csv(
    "https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/clos_test_data.csv"
)

df_train.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,noisy_letter_grade
0,018bff,94.0,41.0,91.0,great participation +10,B
1,076d92,0.0,79.0,65.0,"cheated on exam, gets 0pts",F
2,c80059,86.0,89.0,85.0,great final presentation +10,F
3,e38f8a,50.0,67.0,94.0,great final presentation +10,B
4,d57e1a,92.0,79.0,98.0,great final presentation +10,A


## 2. Preprocess the dataset

Before training an ML model, we preprocess our dataset. The type of preprocessing that is best will depend on what ML model you use. This tutorial will demonstrate an XGBoost model, so we'll process the **notes** and **noisy_letter_grade** columns into categorical columns for this model (each category encoded as an integer). You can alternatively use [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/), which will automatically produce a high-accuracy ML model for your raw data, without you having to worry about any ML modeling or data preprocessing work.

In [4]:
# Create label encoders for the categorical columns
grade_le = preprocessing.LabelEncoder()
notes_le = preprocessing.LabelEncoder()

# Process the feature columns
train_features = df_train.drop(["stud_ID", "noisy_letter_grade"], axis=1).copy()
train_features["notes"] = notes_le.fit_transform(train_features["notes"])
train_features["notes"] = train_features["notes"].astype("category")

# Process the label column
train_labels = pd.DataFrame(grade_le.fit_transform(df_train["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])


# Keep separate copies of these training features and labels for later use
train_features_v2 = train_features.copy()
train_labels_v2 = train_labels.copy()

We fit the preprocessing (the label encoders) on the training data alone to avoid information leakage (using test data information that would not be available at prediction time). Here's how the preprocessed training features look:

In [5]:
train_features.head()

Unnamed: 0,exam_1,exam_2,exam_3,notes
0,94.0,41.0,91.0,2
1,0.0,79.0,65.0,0
2,86.0,89.0,85.0,1
3,50.0,67.0,94.0,1
4,92.0,79.0,98.0,1


Next we apply the same preprocessing to the test data.

In [6]:
test_features = df_test.drop(
    ["stud_ID", "noisy_letter_grade"], axis=1
).copy()
test_features["notes"] = notes_le.transform(test_features["notes"])
test_features["notes"] = test_features["notes"].astype("category")

test_labels = pd.DataFrame(grade_le.transform(df_test["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
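
One consequence of fitting the encoders on training data only is that a category appearing only in the test data cannot be encoded. The sketch below uses invented `notes` values (not taken from the tutorial's dataset) to show how one might check for such unseen categories before calling `transform`, which raises a `ValueError` for labels it did not see during `fit`.

```python
from sklearn.preprocessing import LabelEncoder

# Invented example values; "late arrival" never appears in the training notes.
train_notes = ["great participation +10", "missed homework", "great participation +10"]
test_notes = ["missed homework", "late arrival"]

# Fit on training data only, mirroring the tutorial's approach.
encoder = LabelEncoder().fit(train_notes)

# Categories present in the test data but unknown to the fitted encoder.
unseen = sorted(set(test_notes) - set(encoder.classes_))

# transform() raises ValueError when it meets a previously unseen label.
try:
    encoder.transform(test_notes)
    transform_failed = False
except ValueError:
    transform_failed = True

print("Unseen test categories:", unseen)
print("transform raised ValueError:", transform_failed)
```

How to handle unseen categories (for example, mapping them to a dedicated "unknown" code) is a modeling decision that this sketch leaves open.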

We then appropriately format the datasets for the ML model we'll use in this tutorial.

In [7]:
train_labels = train_labels.astype('object')
test_labels = test_labels.astype('object')

train_features["notes"] = train_features["notes"].astype(int)
test_features["notes"] = test_features["notes"].astype(int)

preprocessed_train_data = pd.concat([train_features, train_labels], axis=1)
preprocessed_train_data["stud_ID"] = df_train["stud_ID"]

preprocessed_test_data = pd.concat([test_features, test_labels], axis=1)
preprocessed_test_data["stud_ID"] = df_test["stud_ID"]

### Audit the merged training + test dataset with Datalab

Before training any ML model, we can quickly check for fundamental issues in our setup with cleanlab. To audit all of our data at once, we merge the training and test sets into one dataset, from which we construct a `Datalab` object. `Datalab` automatically detects many types of common issues in a dataset, but requires a trained ML model for a comprehensive audit. We haven't trained any model yet, so here we instruct `Datalab` to only check for specific data issues: near duplicates, and whether the data appears non-IID (violations of the IID assumption include: data drift or lack of statistical independence between data points).

`Datalab` can detect many additional types of data issues, depending on what inputs it is given. Below we provide `features = features_df` as the sole input to `Datalab.find_issues()`, which solely contains numerical values here. If you have heterogeneous/complex data types (e.g. text or images), you could instead provide vector feature representations (e.g. pretrained model embeddings) of your data as the `features`.

In [8]:
full_df = pd.concat([preprocessed_train_data, preprocessed_test_data], axis=0).reset_index(drop=True)
features_df = full_df.drop(["noisy_letter_grade", "stud_ID"], axis=1)

In [10]:
lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
lab.find_issues(features=features_df.to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
lab.report(show_summary_score=True, show_all_issues=True)

Finding near_duplicate issues ...
Finding non_iid issues ...

Audit complete. 100 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

    issue_type    score  num_issues
near_duplicate 0.583746         100
       non_iid 0.291382           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 749, num_classes: 5


------------------ near_duplicate issues -------------------

About this issue:
	A (near) duplicate issue refers to two or more examples in
    a dataset that are extremely similar to each other, relative
    to the rest of the dataset.  The examples flagged with this issue
    may be exactly duplicated, or lie atypically close together when
    represented as vectors (i.e. feature embeddings).
    

Number of examples with this issue: 100
Overall dataset quality in terms of this issue: 0.5837

Examples representing most severe instances of this issue:
    

cleanlab does not find significant evidence that our data is non [IID](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables), which is good. Otherwise, we'd need to further consider where our data came from and whether conclusions/predictions from this dataset can really generalize to our population of interest.

But cleanlab did detect many near duplicates in the dataset. Looking closer at these, we see some exact duplicates between our training and test data, which may indicate data leakage! Since we didn't expect these duplicates in our dataset, let's drop from our training set the data points that are exact copies of test data points. This helps ensure that our model evaluations reflect generalization capabilities.
Here's how we can review the near duplicates detected via `Datalab`.

In [11]:
full_duplicate_results = lab.get_issues("near_duplicate")
full_duplicate_results.sort_values("near_duplicate_score").head()

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
404,True,0.0,[336],0.0
613,True,0.0,"[607, 612, 611, 610, 609, 608, 606, 614]",0.0
614,True,0.0,"[607, 612, 611, 610, 609, 608, 606, 613]",0.0
610,True,0.0,"[607, 612, 611, 609, 608, 606, 613, 614]",0.0
609,True,0.0,"[607, 612, 611, 610, 608, 606, 613, 614]",0.0


In [12]:
# Define training index cutoff and find the exact duplicate indices to reference
train_idx_cutoff = len(preprocessed_train_data) - 1

In [13]:
# Create a helper column to check if any value in the list is greater than training idx cutoff
full_duplicate_results['nd_set_has_index_over_training_cutoff'] = full_duplicate_results['near_duplicate_sets'].apply(lambda x: any(i > train_idx_cutoff for i in x))

To distinguish between near vs. exact duplicates, we can consider the `distance_to_nearest_neighbor`: exact duplicates have a distance of 0 and, as in the results above, a `near_duplicate_score` of 0. The query below filters on this score to identify the exactly duplicated data points in the dataset.

The query also keeps only data points whose duplicate set contains a test-set index, since those are the exact duplicates between our training and test sets that we want to drop from the training set.

In [14]:
exact_duplicates = full_duplicate_results.query('is_near_duplicate_issue == True and near_duplicate_score == 0.0 and nd_set_has_index_over_training_cutoff == True').sort_values("near_duplicate_score")
exact_duplicates

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor,nd_set_has_index_over_training_cutoff
65,True,0.0,"[690, 444]",0.0,True
71,True,0.0,[719],0.0,True
292,True,0.0,[620],0.0,True
420,True,0.0,[704],0.0,True
431,True,0.0,[688],0.0,True
459,True,0.0,[672],0.0,True
547,True,0.0,[647],0.0,True


In [15]:
exact_duplicates_indices = exact_duplicates.index

In [16]:
exact_duplicates_indices

Index([65, 71, 292, 420, 431, 459, 547], dtype='int64')

To remove the exact duplicates that occur between our training and test sets from our training data, we use the training index cutoff defined above. We'll drop rows from our training data whose index is less than or equal to this cutoff and that also appear in the set of exact duplicates flagged by `Datalab`.

In [17]:
# Filter the indices to drop by which indices in exact duplicates are <= to the index cutoff
indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]

In [18]:
indices_of_duplicates_to_drop

[65, 71, 292, 420, 431, 459, 547]
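
As an independent sanity check on train/test overlap, one could also compare the feature rows directly with pandas. The sketch below uses small invented frames (`toy_train` and `toy_test` are hypothetical names, not the tutorial's data) and an inner merge on all feature columns to list the training rows that exactly match some test row.

```python
import pandas as pd

# Invented toy data for illustration only.
toy_train = pd.DataFrame(
    {"exam_1": [93.0, 90.0, 50.0], "exam_2": [73.0, 95.0, 67.0], "notes": [5, 1, 1]}
)
toy_test = pd.DataFrame(
    {"exam_1": [93.0, 10.0], "exam_2": [73.0, 20.0], "notes": [5, 0]}
)

feature_cols = ["exam_1", "exam_2", "notes"]

# reset_index() exposes the original training row index as a column named "index",
# so it survives the merge. drop_duplicates() on the test side prevents a training
# row from being listed more than once.
overlap = toy_train.reset_index().merge(
    toy_test[feature_cols].drop_duplicates(), on=feature_cols, how="inner"
)
leaked_train_idx = sorted(overlap["index"].tolist())
print("Training rows that exactly match a test row:", leaked_train_idx)
```

This only finds exact matches on the chosen columns; detecting near duplicates still requires a distance-based check like the one `Datalab` performs.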

Here are the examples we'll drop from our training data, since they are exact duplicates of test examples.

In [19]:
full_df.iloc[indices_of_duplicates_to_drop]

Unnamed: 0,exam_1,exam_2,exam_3,notes,noisy_letter_grade,stud_ID
65,93.0,73.0,82.0,5,1,ddd0ba
71,90.0,95.0,75.0,1,0,8e6d24
292,79.0,62.0,82.0,5,2,61e807
420,99.0,53.0,76.0,5,2,71d7b9
431,90.0,92.0,88.0,2,0,83e31f
459,70.0,63.0,95.0,2,1,edeb53
547,68.0,93.0,73.0,5,2,cd52b5


In [20]:
train_features = train_features.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True).astype(int)
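
Note that `reset_index(drop=True)` renumbers the remaining rows, so integer positions in the filtered training set no longer match row positions in the original DataFrame. The toy sketch below (with invented student IDs) illustrates the shift and shows one way to map a position back through an identifier column such as `stud_ID`.

```python
import pandas as pd

# Invented toy data for illustration only.
original = pd.DataFrame(
    {"stud_ID": ["aaa", "bbb", "ccc", "ddd"], "exam_1": [94.0, 0.0, 86.0, 50.0]}
)

# Drop one row and renumber, as done for the training data above.
filtered = original.drop(index=[1]).reset_index(drop=True)

# Position 1 in the filtered data now refers to a different student.
new_position = 1
student = filtered.loc[new_position, "stud_ID"]

# Map back to the original row via the identifier column.
original_position = original.index[original["stud_ID"] == student][0]
print(f"Filtered position {new_position} is student {student}, "
      f"who sat at position {original_position} in the original data")
```

Any indices reported for the filtered data should therefore be applied to the filtered data itself, or translated through an identifier first.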

## 3. Train model with original (noisy) training data 

In [21]:
train_labels = train_labels["noisy_letter_grade"]
clf = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
clf.fit(train_features, train_labels)


Although curating clean test data does not directly help train a better ML model, more reliable model evaluation can improve our overall ML project. For instance, clean test data can enable better informed decisions regarding when to deploy a model and better model/hyperparameter selection.

## 4. Compute out-of-sample predicted probabilities for test data

In [22]:
from sklearn.model_selection import cross_val_predict

test_labels = test_labels["noisy_letter_grade"].astype(int)

num_crossval_folds = 5
test_pred_probs = cross_val_predict(
    clf,
    test_features,
    test_labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
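
"Out-of-sample" here means that each row's predicted probabilities come from a copy of the model that did not see that row during fitting. The sketch below illustrates this with synthetic data and a `LogisticRegression` stand-in (both are assumptions for illustration, not the tutorial's dataset or model); `cross_val_predict` returns one row of class probabilities per example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 3-class data, used only to make the sketch self-contained.
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Each fold's model is fit on the other folds and predicts only its held-out fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)

print(toy_pred_probs.shape)  # one row per example, one column per class
print(np.allclose(toy_pred_probs.sum(axis=1), 1.0))  # each row is a probability distribution
```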

## 5. Use Datalab to find label issues in test data and then manually correct them
Based on the given labels and predicted probabilities, cleanlab can quickly help us identify suspicious values in our grades table.

We use cleanlab’s Datalab class which has several ways of loading the data. In this case, we’ll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a Datalab object such that it can audit our dataset for various types of issues.

In [23]:
test_data = {"X": test_features.values, "y": test_labels}

test_lab = Datalab(data=test_data, label_name="y", task="classification") 
test_lab.find_issues(features=test_features.to_numpy(), pred_probs=test_pred_probs)
test_lab.report(show_summary_score=True, show_all_issues=True)

Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 38 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

           issue_type    score  num_issues
                label 0.649254          33
              outlier 0.370259           5
                 null 1.000000           0
       near_duplicate 0.625352           0
              non_iid 0.524042           0
      class_imbalance 0.097015           0
underperforming_group 1.000000           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 134, num_classes: 5


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to

`cleanlab` generated a report above that illustrates many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the `get_issues` method.

In [24]:
test_label_issue_results = test_lab.get_issues("label")
test_label_issue_results.head()

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
0,True,0.023888,2,4
1,False,0.975067,1,1
2,False,0.002811,4,1
3,False,0.868448,3,3
4,True,0.078772,1,3


In [25]:
test_label_issues = test_label_issue_results[test_label_issue_results["is_label_issue"] == True]

To review the most severe label issues, sort the DataFrame above by the `label_score` column (a lower score represents that the label is less likely to be correct).
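
For intuition about what a label quality score can look like, here is a sketch of one common choice, the self-confidence score: the model's predicted probability for the given label. Whether `Datalab` computes its `label_score` exactly this way is an assumption here, and the probabilities below are invented.

```python
import numpy as np

# Invented predicted probabilities for 3 examples over 3 classes.
toy_pred_probs = np.array(
    [
        [0.90, 0.05, 0.05],
        [0.10, 0.20, 0.70],
        [0.30, 0.40, 0.30],
    ]
)
toy_given_labels = np.array([0, 0, 1])

# Self-confidence: probability the model assigns to each example's given label.
self_confidence = toy_pred_probs[np.arange(len(toy_given_labels)), toy_given_labels]

# Lower scores first, i.e. the labels most likely to be wrong come first.
ranking = np.argsort(self_confidence)
print("Scores:", self_confidence)
print("Review order:", ranking)
```

The second example is given label 0 while the model puts most of its probability on class 2, so it is ranked first for review.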

Let’s review some of the most likely label errors:

In [26]:
test_sorted_label_issues = test_label_issues.sort_values("label_score").index

test_features.iloc[test_sorted_label_issues].assign(
    given_label=test_labels[test_sorted_label_issues],
    predicted_label=test_label_issue_results["predicted_label"].iloc[test_sorted_label_issues],
    label_score=test_label_issues["label_score"]
).head(5)

Unnamed: 0,exam_1,exam_2,exam_3,notes,given_label,predicted_label,label_score
78,87.0,74.0,86.0,4,0,2,0.000104
109,86.0,85.0,89.0,5,3,1,0.00022
106,90.0,100.0,89.0,2,4,0,0.000971
89,99.0,53.0,76.0,5,2,1,0.003227
36,91.0,92.0,70.0,5,3,1,0.003788


The dataframe above shows the original label (`given_label`) for examples that cleanlab finds most likely to be mislabeled, as well as an alternative `predicted_label` for each example.

These examples are likely labeled incorrectly and should be carefully re-examined by inspection.

In [27]:
test_label_issues_to_fix = test_features.iloc[test_sorted_label_issues].assign(
    given_label=test_labels.iloc[test_sorted_label_issues],
    predicted_label=test_label_issue_results["predicted_label"].iloc[test_sorted_label_issues],
    label_score=test_label_issues["label_score"]
).copy()

In [28]:
indices_to_drop_from_test_data = []

`cleanlab` found the label issues below in our test data, so let's inspect them:

In [29]:
test_label_issues_to_fix

Unnamed: 0,exam_1,exam_2,exam_3,notes,given_label,predicted_label,label_score
78,87.0,74.0,86.0,4,0,2,0.000104
109,86.0,85.0,89.0,5,3,1,0.00022
106,90.0,100.0,89.0,2,4,0,0.000971
89,99.0,53.0,76.0,5,2,1,0.003227
36,91.0,92.0,70.0,5,3,1,0.003788
63,91.0,0.0,94.0,0,3,0,0.006146
123,87.0,80.0,65.0,1,1,3,0.006701
22,99.0,86.0,95.0,3,1,0,0.007774
52,92.0,99.0,87.0,4,1,0,0.015764
45,95.0,88.0,69.0,5,3,4,0.016679


After manually inspecting the label issues above, we add the indices of the examples we want to remove to the list we defined previously.

Remember to **ALWAYS** inspect and manually handle label issues in your test data and to **NEVER** handle them automatically.

Below, we add every label issue detected in our test data to the list of indices to drop, which cleans our test data.

In [30]:
indices_to_drop_from_test_data += list(test_label_issues_to_fix.index)

In [31]:
test_features = test_features.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
test_labels = test_labels.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
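
If a manual review concludes that some flagged examples are actually fine, or should be relabeled rather than removed, the decisions can be recorded per index and applied explicitly. The sketch below is hypothetical: the `review_decisions` mapping and the toy values are invented to show the pattern, not results from reviewing this dataset.

```python
import pandas as pd

# Toy stand-in for a test set (values are illustrative only).
toy_test = pd.DataFrame(
    {"exam_1": [87.0, 86.0, 90.0, 99.0], "given_label": [0, 3, 4, 2]},
    index=[78, 109, 106, 89],
)

# Hypothetical outcome of a human review of each flagged example:
# "drop" removes the row, "keep" leaves it unchanged, an integer relabels it.
review_decisions = {78: "drop", 109: 1, 106: "keep"}

reviewed = toy_test.copy()
for idx, decision in review_decisions.items():
    if decision == "drop":
        reviewed = reviewed.drop(index=idx)
    elif decision != "keep":
        reviewed.loc[idx, "given_label"] = decision

print(reviewed)
```

Keeping the decisions in one place also documents exactly how the cleaned test set differs from the original.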

### Evaluate classification model with original (noisy) training data on clean test data

In [32]:
preds = clf.predict(test_features)
acc_original = accuracy_score(test_labels, preds)
print(
    f"Accuracy of model fit to noisy training data, measured on clean test data: {round(acc_original*100,1)}%"
)

Accuracy of model fit to noisy training data, measured on clean test data: 75.2%


## 6. Compute out-of-sample predicted probabilities for training data

In [33]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
    clf,
    train_features,
    train_labels,
    cv=num_crossval_folds,
    method="predict_proba",
)

## 7. Use Datalab to find issues in training data and then automatically filter them

In [34]:
train_data = {"X": train_features.values, "y": train_labels}

train_lab = Datalab(data=train_data, label_name="y", task="classification")
train_lab.find_issues(features=train_features.to_numpy(), pred_probs=pred_probs)
train_lab.report(show_summary_score=True, show_all_issues=True)

Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 309 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

           issue_type    score  num_issues
                label 0.758224         164
              outlier 0.346721          76
       near_duplicate 0.586934          69
                 null 1.000000           0
              non_iid 0.536351           0
      class_imbalance 0.144737           0
underperforming_group 0.979964           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 608, num_classes: 5


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated t

In [35]:
label_issue_results = train_lab.get_issues("label")
label_issues_idx = label_issue_results[label_issue_results["is_label_issue"] == True].index
label_issues_idx

Index([  2,   7,  12,  23,  25,  29,  32,  34,  35,  36,
       ...
       571, 576, 577, 579, 581, 583, 584, 590, 592, 595],
      dtype='int64', length=164)

In [36]:
near_duplicates = train_lab.get_issues("near_duplicate")
near_duplicates_idx = near_duplicates[near_duplicates["is_near_duplicate_issue"] == True].index

In [37]:
outliers = train_lab.get_issues("outlier")
outliers_idx = outliers[outliers["is_outlier_issue"] == True].index
outliers_idx

Index([  0,   1,   3,   7,  23,  26,  47,  54,  79,  92, 103, 105, 135, 136,
       147, 150, 157, 159, 163, 167, 197, 198, 199, 203, 212, 216, 244, 245,
       246, 251, 260, 273, 291, 292, 299, 303, 311, 315, 317, 325, 334, 340,
       341, 344, 354, 365, 382, 383, 392, 396, 423, 436, 448, 480, 483, 488,
       489, 491, 493, 496, 508, 514, 515, 526, 527, 539, 547, 550, 551, 572,
       576, 583, 584, 590, 593, 596],
      dtype='int64')

In [38]:
idx_to_drop = list(set(list(label_issues_idx) + list(near_duplicates_idx) + list(outliers_idx)))
len(idx_to_drop)

264
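
The report above counts 164 + 76 + 69 = 309 issues, yet only 264 rows are dropped, because a single row can be flagged with more than one issue type. A toy sketch with invented indices shows how the set union removes this double counting:

```python
# Invented indices for illustration; index 7 is flagged by all three checks.
toy_label_idx = [2, 7, 23]
toy_near_duplicate_idx = [7, 40]
toy_outlier_idx = [7, 23, 99]

# Summing the per-issue counts counts shared rows several times.
total_flags = len(toy_label_idx) + len(toy_near_duplicate_idx) + len(toy_outlier_idx)

# The union keeps each row once, however many issue types flagged it.
unique_rows = sorted(set(toy_label_idx) | set(toy_near_duplicate_idx) | set(toy_outlier_idx))

print("Total flags:", total_flags)
print("Unique rows to drop:", unique_rows)
```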

In [39]:
train_features = train_features.drop(idx_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(idx_to_drop, axis=0).reset_index(drop=True)

### Train model on clean training data

In [40]:
clean_clf = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
clean_clf.fit(train_features, train_labels)


### Evaluate classification model with clean training data on clean test data

In [41]:
clean_preds = clean_clf.predict(test_features)
acc_clean = accuracy_score(test_labels, clean_preds)
print(
    f"Accuracy of model fit to clean training data, measured on clean test data: {round(acc_clean*100,1)}%"
)

Accuracy of model fit to clean training data, measured on clean test data: 77.2%


## 8. Hyperparameter Optimization for editing data issues

We have made some basic edits to improve test performance, so now we will parameterize each one of these edits (e.g. what fraction of each issue to delete) to automatically find the best combination of edits to achieve optimal test performance.

We will use a basic hyperparameter-tuning approach: for each combination of edits, we re-train the model on the edited training data and use test performance as the objective.

In a real-world setting, this would ideally be done on cleaned validation data instead of test data, but we are simplifying the approach for this tutorial.
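
The validation-based variant can be sketched as a generic selection loop that takes a scoring function. Everything here is an illustrative assumption: `select_best_params` is a hypothetical helper, and `toy_validation_score` is a placeholder with no real meaning. In practice the scoring function would apply the edits, retrain the model, and return accuracy on cleaned validation data, with the test set used only once at the end.

```python
from itertools import product

# Illustrative grid of edit fractions for each issue type.
toy_param_grid = {
    "drop_label_issue": [0.4, 0.5, 0.6],
    "drop_near_duplicate": [0.1, 0.2, 0.3],
    "drop_outlier": [0.4, 0.5, 0.6],
}


def select_best_params(param_grid, score_fn):
    """Return the parameter combination with the highest score_fn value."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Placeholder only: a real version would edit the training data,
    # retrain the model, and score it on held-out validation data.
    return (
        1.0
        - abs(params["drop_label_issue"] - 0.5)
        - abs(params["drop_near_duplicate"] - 0.2)
        - abs(params["drop_outlier"] - 0.4)
    )


chosen_params, chosen_score = select_best_params(toy_param_grid, toy_validation_score)
print(chosen_params, chosen_score)
```

Selecting on validation data keeps the final test accuracy an unbiased estimate, since the test set never influences which edits are chosen.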

To parametrize our dataset edits, we define a `dict` below containing default settings that we found tend to work well:

In [42]:
default_edit_params = {
        "drop_label_issue": 0.5,
        "drop_near_duplicate": 0.2,
        "drop_outlier": 0.5
    }

In English, these choices mean:

- `drop_label_issue`: We drop the 50% of datapoints flagged with label issues that have the lowest label scores (the most likely label errors).
- `drop_outlier`: We drop the top 50% most severe outliers based on outlier score (amongst the set of flagged outliers).
- `drop_near_duplicate`: We drop EXTRA COPIES of the top 20% of near duplicates (based on near duplicate score). We never drop every datapoint in a set of near duplicates, so at least one copy remains. How do we decide which copy to keep? The code in this tutorial chooses the copies to drop at random for simplicity; a more careful choice keeps, amongst each set of near duplicates, the one that has the highest self-confidence score for its given label.
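
One way to keep the most trustworthy copy in each near-duplicate set, rather than choosing at random, is to rank the copies by self-confidence and drop the least confident ones first. The helper below is a hypothetical sketch (the function name, indices, and scores are invented for illustration) that always leaves at least one copy per set.

```python
import math


def choose_duplicates_to_drop(cluster, self_confidence, drop_fraction):
    """Pick the least-confident copies to drop, always keeping at least one."""
    # Least confident first, so the highest-confidence copy is dropped last.
    ranked = sorted(cluster, key=lambda i: self_confidence[i])
    # Drop at least one copy when possible, but never the whole cluster.
    n_drop = min(max(math.ceil(len(ranked) * drop_fraction), 1), len(ranked) - 1)
    return ranked[:n_drop]


# Invented near-duplicate set and self-confidence scores.
toy_cluster = [606, 607, 608, 609]
toy_self_confidence = {606: 0.91, 607: 0.35, 608: 0.80, 609: 0.12}

to_drop_half = choose_duplicates_to_drop(toy_cluster, toy_self_confidence, 0.5)
to_drop_all = choose_duplicates_to_drop(toy_cluster, toy_self_confidence, 1.0)
print("Drop with fraction 0.5:", to_drop_half)
print("Drop with fraction 1.0:", to_drop_all)
```

Even with a drop fraction of 1.0, index 606 (the most confident copy) is retained.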

Below we use `cleanlab`'s `Datalab` object to retrieve each type of issue, sorted in ascending order of issue score (most severe first). Our hyperparameter optimization uses these sorted results to find which combination of dropped datapoints improves our ML model the most.

In [43]:
label_issues = train_lab.get_issues("label").query("is_label_issue").sort_values("label_score")
near_duplicates = train_lab.get_issues("near_duplicate").query("is_near_duplicate_issue").sort_values("near_duplicate_score")
outliers = train_lab.get_issues("outlier").query("is_outlier_issue").sort_values("outlier_score")

In [44]:
def preprocess_data(train_features, train_labels, label_issues, near_duplicates, outliers, drop_label_issue, drop_near_duplicate, drop_outlier):
    """
    Preprocesses the training data by dropping a specified fraction of data points identified as label issues,
    near duplicates, and outliers based on the full datasets provided for each issue type.
    
    Args:
        train_features (pd.DataFrame): DataFrame containing the training features.
        train_labels (pd.Series): Series containing the training labels.
        label_issues (pd.DataFrame): DataFrame containing data points with label issues.
        near_duplicates (pd.DataFrame): DataFrame containing data points identified as near duplicates.
        outliers (pd.DataFrame): DataFrame containing data points identified as outliers.
        drop_label_issue (float): Fraction (between 0 and 1) of label issue data points to drop.
        drop_near_duplicate (float): Fraction (between 0 and 1) of each near duplicate set to drop.
        drop_outlier (float): Fraction (between 0 and 1) of outlier data points to drop.
    
    Returns:
        pd.DataFrame: The cleaned training features.
        pd.Series: The cleaned training labels.
    """
    # Extract indices for each type of issue
    label_issues_idx = label_issues.index.tolist()
    near_duplicates_idx = near_duplicates.index.tolist()
    outliers_idx = outliers.index.tolist()
    
    # Calculate the number of each type of data point to drop except near duplicates, which requires separate logic
    num_label_issues_to_drop = int(len(label_issues_idx) * drop_label_issue)
    num_outliers_to_drop = int(len(outliers_idx) * drop_outlier)

    # Calculate number of near duplicates to drop
    # Assuming the 'near_duplicate_sets' are lists of indices (integers) of near duplicates
    clusters = []
    for i in near_duplicates_idx:
        # Create a set for each cluster, add the current index to its near duplicate set
        cluster = set(near_duplicates.at[i, 'near_duplicate_sets'])
        cluster.add(i)
        clusters.append(cluster)
    
    # Deduplicate clusters by converting the list of sets to a set of frozensets
    unique_clusters = set(frozenset(cluster) for cluster in clusters)
    
    # If you need the unique clusters back in list of lists format:
    unique_clusters_list = [list(cluster) for cluster in unique_clusters]
    
    near_duplicates_idx_to_drop = []
    
    for cluster in unique_clusters_list:
        # Calculate the number of rows to drop, ensuring at least one datapoint remains
        n_drop = max(math.ceil(len(cluster) * drop_near_duplicate), 1)  # Drop at least k% or 1 row
        if len(cluster) > n_drop:  # Ensure we keep at least one datapoint
            # Randomly select datapoints to drop
            drops = random.sample(cluster, n_drop)
        else:
            # If the cluster is too small, adjust the number to keep at least one datapoint
            drops = random.sample(cluster, len(cluster) - 1)  # Keep at least one
        near_duplicates_idx_to_drop.extend(drops)
    
    # Determine the specific indices to drop
    label_issues_idx_to_drop = label_issues_idx[:num_label_issues_to_drop]
    outliers_idx_to_drop = outliers_idx[:num_outliers_to_drop]
    
    # Combine the indices to drop
    idx_to_drop = list(set(label_issues_idx_to_drop + near_duplicates_idx_to_drop + outliers_idx_to_drop))
    
    # Drop the rows from the training data
    train_features_cleaned = train_features.drop(idx_to_drop).reset_index(drop=True)
    train_labels_cleaned = train_labels.drop(idx_to_drop).reset_index(drop=True)
    
    return train_features_cleaned, train_labels_cleaned


In [45]:
from itertools import product

# Define the parameter grid as lists of possible values
param_grid = {
    'drop_label_issue': [0.4, 0.5, 0.6],
    'drop_near_duplicate': [0.1, 0.2, 0.3],
    'drop_outlier': [0.4, 0.5, 0.6],
}

# Generate all combinations of parameters
param_combinations = list(product(param_grid['drop_label_issue'], param_grid['drop_near_duplicate'], param_grid['drop_outlier']))

In [46]:
best_score = 0
best_params = None

for drop_label_issue, drop_near_duplicate, drop_outlier in param_combinations:
    # Preprocess the data for the current combination of parameters
    train_features_preprocessed, train_labels_preprocessed = preprocess_data(
        train_features_v2, train_labels_v2, label_issues, near_duplicates, outliers,
        drop_label_issue, drop_near_duplicate, drop_outlier)
    
    # Train and evaluate the model
    model = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
    model.fit(train_features_preprocessed, train_labels_preprocessed)
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels, predictions)
    
    # Update the best score and parameters if the current model is better
    if accuracy > best_score:
        best_score = accuracy
        best_params = {'drop_label_issue': drop_label_issue, 'drop_near_duplicate': drop_near_duplicate, 'drop_outlier': drop_outlier}

# Print the best parameters and score
print(f"Best parameters: {best_params}")

Best parameters: {'drop_label_issue': 0.6, 'drop_near_duplicate': 0.2, 'drop_outlier': 0.5}


In [47]:
print(
    f"Accuracy of model fit to clean training data based on the optimal combinations of hyperparameters to clean our data, measured on clean test data: {round(best_score*100,1)}%"
)

Accuracy of model fit to clean training data based on the optimal combinations of hyperparameters to clean our data, measured on clean test data: 80.2%


`cleanlab` helped us improve ML performance in this tutorial! We first used it to find issues in our test data, which we reviewed and removed to obtain a cleaner test set.

We then fit a model on our noisy training data and evaluated it on our clean test data. Next, we cleaned our training data by dropping every row flagged with any issue type, fit a new model on this clean training data, and evaluated it on the same clean test data, which already improved model accuracy.

We were then able to further improve model accuracy by using hyperparameter optimization to choose what fraction of each issue type to drop from our training data.

To reiterate, here are the 2 main takeaways:
  - Don’t algorithmically change test data
  - Do NOT compare model performance across the 2 different test sets (noisy and clean); evaluate every model on the same test data

In [48]:
# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

assert(acc_clean*100 - acc_original*100 >= 1.5)
assert(best_score*100 - acc_clean*100 >= 1)