# Data Curation with Train vs Test Splits

In typical machine learning projects, we split our dataset into training data for fitting models and test data for evaluating model performance. For noisy real-world datasets, detecting and correcting errors in the training data is important for training robust models, but it is less widely recognized that the test set can also be noisy.
For accurate model evaluation, it is vital to find and fix issues in the test data as well. Some evaluation metrics are particularly sensitive to outliers and noisy labels.
This tutorial demonstrates a way to use [Cleanlab Open Source (CLOS)](https://github.com/cleanlab/cleanlab) to clean both your training and test data, ensuring **robust** model training and **reliable** performance evaluation.
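
To see why noisy test labels matter, consider a toy calculation. All labels below are made up for illustration and are not taken from this tutorial's dataset: a model whose predictions are all correct is still scored below 100% when some test labels are wrong.

```python
# Toy illustration with made-up labels (not from this tutorial's dataset).
true_labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# Suppose the model predicts every example correctly.
predictions = list(true_labels)

# Noisy test labels: two of the ten labels were recorded incorrectly.
noisy_labels = list(true_labels)
noisy_labels[2] = 0
noisy_labels[7] = 4


def accuracy(labels, preds):
    """Fraction of predictions that match the given labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


acc_on_true = accuracy(true_labels, predictions)
acc_on_noisy = accuracy(noisy_labels, predictions)
print(f"Accuracy against correct labels: {acc_on_true:.0%}")
print(f"Accuracy against noisy labels:   {acc_on_noisy:.0%}")
```

The measured accuracy drops only because of errors in the test labels, not because the model changed.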

Here's how we recommend handling noisy training and test data with CLOS:


- First focus on detecting issues in the test data. For the best detection, we recommend that you merge your training and test data and then run cleanlab on the merged dataset (detection benefits from more data), but only focus on the results for the test data.
- Manually review/correct Cleanlab-detected issues in your test data. To avoid bias, we caution against automated correction of test data. Instead, test data changes should be individually verified to ensure they will lead to more accurate model evaluation.
- Run cleanlab separately on the training data alone to detect issues in the training data (without any test set information leakage).
- Optionally, use automated Cleanlab suggestions to algorithmically refine the training data (or manually review/correct Cleanlab-detected issues in your training data).
- Estimate the final model's performance on the cleaned test data. **Do not compare the performance of different ML models estimated across different versions of your test data.** These estimates are incomparable.

Consider this tutorial as a blueprint for using cleanlab in ML projects spanning various data modalities and tasks.
Let’s get started!


**Note**: We are using tabular data in this tutorial but the approach can apply to other data modalities!

## Install required dependencies

You can use `pip` to install all packages required for this tutorial, as shown below. Make sure to install the version of `cleanlab` corresponding to this tutorial; e.g. if viewing the master branch documentation, use `pip install git+https://github.com/cleanlab/cleanlab.git`.

In this tutorial we are using the following version of `cleanlab`: `2.6.1`

In [1]:
!pip install xgboost scikit-learn pandas "cleanlab[datalab]"



In [2]:
import random
import os
import cleanlab
import math
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
from cleanlab import Datalab

SEED = 123456

np.random.seed(SEED)
random.seed(SEED)

In [3]:
cleanlab.__version__

'2.6.1'

In [4]:
seed_value = 55

## Load the data

In [5]:
# df_train = pd.read_csv(
#     "https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/train.csv"
# )
# df_test = pd.read_csv(
#     "https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/test.csv"
# )
# df_train.head()

In [6]:
# df_train = pd.read_csv("clos_train_data.csv")
# df_test = pd.read_csv("clos_test_data.csv")

df_train = pd.read_csv("clos_train_data_v7.csv")
df_test = pd.read_csv("clos_test_data_v10.csv")

In [7]:
df_train.head()

|  | stud_ID | exam_1 | exam_2 | exam_3 | notes | noisy_letter_grade |
|---|---|---|---|---|---|---|
| 0 | 018bff | 94.0 | 41.0 | 91.0 | great participation +10 | B |
| 1 | b3c9a0 | 91.0 | 74.0 | 88.0 |  | B |
| 2 | 076d92 | 0.0 | 79.0 | 65.0 | cheated on exam, gets 0pts | F |
| 3 | 68827d | 91.0 | 98.0 | 75.0 | missed class frequently -10 | C |
| 4 | c80059 | 86.0 | 89.0 | 85.0 | great final presentation +10 | F |


## Preprocess the dataset

Before training an XGBoost model, we preprocess the `notes` and `noisy_letter_grade` columns into categorical columns.

In [155]:
# Create label encoders for the grade and notes columns
grade_le = preprocessing.LabelEncoder()
notes_le = preprocessing.LabelEncoder()

# Prepare the feature columns
train_features = df_train.drop(["stud_ID", "noisy_letter_grade"], axis=1).copy()
train_features["notes"] = notes_le.fit_transform(train_features["notes"])
train_features["notes"] = train_features["notes"].astype("category")

# Encode the label column as integer class labels
train_labels = pd.DataFrame(
    grade_le.fit_transform(df_train["noisy_letter_grade"].copy()),
    columns=["noisy_letter_grade"],
)


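# Keep copies of these training features and labels to use for more advanced issue handling later.
# These copies are taken before any rows are dropped, so they keep the original row indices of df_train.
train_features_v2 = train_features.copy()
train_labels_v2 = train_labels.copy()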

Let's view the training set features after preprocessing.

In [9]:
train_features.head()

|  | exam_1 | exam_2 | exam_3 | notes |
|---|---|---|---|---|
| 0 | 94.0 | 41.0 | 91.0 | 2 |
| 1 | 91.0 | 74.0 | 88.0 | 5 |
| 2 | 0.0 | 79.0 | 65.0 | 0 |
| 3 | 91.0 | 98.0 | 75.0 | 3 |
| 4 | 86.0 | 89.0 | 85.0 | 1 |


Next we repeat the same preprocessing steps for our test data, reusing the encoders that were fit on the training data.

In [10]:
test_features = df_test.drop(
    ["stud_ID", "noisy_letter_grade"], axis=1
).copy()
test_features["notes"] = notes_le.transform(test_features["notes"])
test_features["notes"] = test_features["notes"].astype("category")

test_labels = pd.DataFrame(grade_le.transform(df_test["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
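
`LabelEncoder.transform` raises a `ValueError` for values that were not present when the encoder was fit, so it can help to check the test columns before encoding them. A minimal sketch, using a hypothetical helper `find_unseen_categories` and made-up note strings rather than the actual contents of the `notes` column:

```python
def find_unseen_categories(train_values, test_values):
    """Return test values that never appear in the training values, sorted."""
    return sorted(set(test_values) - set(train_values))


# Made-up example values, not the actual contents of the `notes` column.
train_notes = ["great participation +10", "missed class frequently -10", "none"]
test_notes = ["great participation +10", "late homework -5"]

unseen = find_unseen_categories(train_notes, test_notes)
print(unseen)  # ['late homework -5']
```

An empty result means every test value was seen during fitting, so the transform step should not fail on unseen categories.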

In [11]:
train_labels = train_labels.astype('object')
test_labels = test_labels.astype('object')

train_features["notes"] = train_features["notes"].astype(int)
test_features["notes"] = test_features["notes"].astype(int)

preprocessed_train_data = pd.concat([train_features, train_labels], axis=1)
preprocessed_train_data["stud_ID"] = df_train["stud_ID"]

preprocessed_test_data = pd.concat([test_features, test_labels], axis=1)
preprocessed_test_data["stud_ID"] = df_test["stud_ID"]

## Check for Near Duplicate and Non-IID Issues on full dataset using Datalab

In [12]:
full_df = pd.concat([preprocessed_train_data, preprocessed_test_data], axis=0).reset_index(drop=True)
features_df = full_df.drop(["noisy_letter_grade", "stud_ID"], axis=1)

In [13]:
lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
lab.find_issues(features=features_df.to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
lab.report(show_summary_score=True, show_all_issues=True)

Finding near_duplicate issues ...
Finding non_iid issues ...

Audit complete. 87 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

    issue_type    score  num_issues
near_duplicate 0.596413          87
       non_iid 0.090307           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 856, num_classes: 5


------------------ near_duplicate issues -------------------

About this issue:
	A (near) duplicate issue refers to two or more examples in
    a dataset that are extremely similar to each other, relative
    to the rest of the dataset.  The examples flagged with this issue
    may be exactly duplicated, or lie atypically close together when
    represented as vectors (i.e. feature embeddings).
    

Number of examples with this issue: 87
Overall dataset quality in terms of this issue: 0.5964

Examples representing most severe instances of this issue:
     i

cleanlab found no `non_iid` issues, which is good. Otherwise, we'd need to investigate where our data came from and why it is not [IID](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables).

We can see that we have many `near_duplicate` issues. In fact, we have exact duplicates between our training and test data, which is a sign of data leakage! To manage this, let's drop the exact duplicates from our training set.

cleanlab helps us filter for the `near_duplicate` issues using our `Datalab` results. Then we can look at a sample of them.

In [14]:
full_duplicate_results = lab.get_issues("near_duplicate")
full_duplicate_results.sort_values("near_duplicate_score").head()

|  | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|---|---|---|---|---|
| 715 | True | 0.0 | [714, 707, 706, 709, 710, 711, 712, 713, 708] | 0.0 |
| 260 | True | 0.0 | [592] | 0.0 |
| 2 | True | 0.0 | [540] | 0.0 |
| 384 | True | 0.0 | [463] | 0.0 |
| 707 | True | 0.0 | [714, 706, 709, 710, 711, 712, 713, 708, 715] | 0.0 |


We can then filter for exact duplicates below:

In [15]:
exact_duplicates = full_duplicate_results[
    full_duplicate_results["is_near_duplicate_issue"]
    & (full_duplicate_results["near_duplicate_score"] == 0.0)
].sort_values("near_duplicate_score")
exact_duplicates

|  | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|---|---|---|---|---|
| 2 | True | 0.0 | [540] | 0.0 |
| 260 | True | 0.0 | [592] | 0.0 |
| 384 | True | 0.0 | [463] | 0.0 |
| 463 | True | 0.0 | [384] | 0.0 |
| 540 | True | 0.0 | [2] | 0.0 |
| 592 | True | 0.0 | [260] | 0.0 |
| 706 | True | 0.0 | [714, 707, 709, 710, 711, 712, 713, 708, 715] | 0.0 |
| 707 | True | 0.0 | [714, 706, 709, 710, 711, 712, 713, 708, 715] | 0.0 |
| 708 | True | 0.0 | [714, 707, 706, 709, 710, 711, 712, 713, 715] | 0.0 |
| 709 | True | 0.0 | [714, 707, 706, 710, 711, 712, 713, 708, 715] | 0.0 |


To remove these exact duplicates from our training data, we first define the cutoff index at which the training rows end in `full_df`. We then collect every exact-duplicate index that is less than or equal to that cutoff; these are the training rows we will drop.

In [16]:
# Define training index cutoff and find the exact duplicate indices to reference
train_idx_cutoff = len(preprocessed_train_data) - 1
exact_duplicates_indices = exact_duplicates.index

# Filter the indices to drop by which indices in exact duplicates are <= to the index cutoff
indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]
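
The filter above keeps every flagged exact duplicate that sits in the training portion of `full_df`, regardless of where its copy is located. If you only want to remove training rows that have a copy in the test portion, one option is to also inspect each row's `near_duplicate_sets`. The sketch below uses a hypothetical helper and made-up indices, with a plain dict standing in for the `near_duplicate_sets` column:

```python
def train_rows_duplicated_in_test(duplicate_sets, train_idx_cutoff):
    """Return training indices that have at least one duplicate in the test portion.

    duplicate_sets maps a row index to the indices of its duplicates.
    Rows with index <= train_idx_cutoff are training rows; the rest are test rows.
    """
    to_drop = []
    for idx, neighbors in duplicate_sets.items():
        is_train_row = idx <= train_idx_cutoff
        has_test_copy = any(n > train_idx_cutoff for n in neighbors)
        if is_train_row and has_test_copy:
            to_drop.append(idx)
    return sorted(to_drop)


# Made-up example: rows 0-4 are training rows, rows 5-7 are test rows.
toy_duplicate_sets = {
    1: [3],     # train/train duplicate: not selected
    3: [1],
    2: [6],     # training row with a copy in the test portion: selected
    6: [2],
    4: [5, 7],  # training row with two test copies: selected
    5: [4, 7],
    7: [4, 5],
}
cross_split_rows = train_rows_duplicated_in_test(toy_duplicate_sets, train_idx_cutoff=4)
print(cross_split_rows)  # [2, 4]
```

With the Datalab results, the mapping could be built with something like `exact_duplicates["near_duplicate_sets"].to_dict()`.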

Now let's view the rows we will drop from our training data because they were flagged as exact duplicates.

In [17]:
full_df.iloc[indices_of_duplicates_to_drop]

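|  | exam_1 | exam_2 | exam_3 | notes | noisy_letter_grade | stud_ID |
|---|---|---|---|---|---|---|
| 2 | 0.0 | 79.0 | 65.0 | 0 | 4 | 076d92 |
| 260 | 78.0 | 58.0 | 86.0 | 1 | 1 | 36284d |
| 384 | 90.0 | 0.0 | 100.0 | 0 | 1 | fe7277 |
| 463 | 72.0 | 0.0 | 80.0 | 0 | 4 | a33f92 |
| 540 | 0.0 | 79.0 | 65.0 | 0 | 0 | 77c9c5 |
| 592 | 78.0 | 58.0 | 86.0 | 1 | 1 | 9afe83 |
| 706 | 99.0 | 59.0 | 70.0 | 3 | 3 | 37fd76 |
| 707 | 99.0 | 59.0 | 70.0 | 3 | 3 | 37fd76 |
| 708 | 99.0 | 59.0 | 70.0 | 3 | 3 | 37fd76 |
| 709 | 99.0 | 59.0 | 70.0 | 3 | 3 | 37fd76 |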


Then we drop these rows from our training data to get rid of the data leakage issue.

In [18]:
train_features = train_features.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True).astype(int)

## Train model with original (noisy) training data 

In [19]:
train_labels = train_labels["noisy_letter_grade"]
clf = XGBClassifier(tree_method="hist", enable_categorical=True)
clf.fit(train_features, train_labels)

Although curating clean test data does not directly help train a better ML model, more reliable model evaluation can improve our overall ML project. For instance, clean test data can enable better informed decisions regarding when to deploy a model and better model/hyperparameter selection.

## Compute out-of-sample predicted probabilities for test data

In [20]:
from sklearn.model_selection import cross_val_predict

test_labels = test_labels["noisy_letter_grade"].astype(int)

num_crossval_folds = 5
test_pred_probs = cross_val_predict(
    clf,
    test_features,
    test_labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
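
`cross_val_predict` produces out-of-sample predictions: each row's probabilities come from a model that was fit without that row. The sketch below illustrates only the fold bookkeeping, using a hypothetical `kfold_indices` helper with contiguous, unshuffled folds. scikit-learn's default splitting for classifiers is stratified, so it may assign rows to folds differently.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, heldout_indices) pairs for contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        heldout = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, heldout
        start = stop


# Every row is held out exactly once, so it gets exactly one out-of-sample prediction.
heldout_counts = [0] * 12
for train_idx, heldout_idx in kfold_indices(n_samples=12, n_folds=5):
    assert not set(train_idx) & set(heldout_idx)  # no row is in both parts
    for i in heldout_idx:
        heldout_counts[i] += 1

print(heldout_counts)
```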

## Use Datalab to find label issues in test data and then manually correct them
Based on the given labels and predicted probabilities, cleanlab can quickly help us identify suspicious values in our grades table.

We use cleanlab’s `Datalab` class, which has several ways of loading the data. In this case, we’ll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a `Datalab` object, so that it can audit our dataset for various types of issues.

In [21]:
test_data = {"X": test_features.values, "y": test_labels}

test_lab = Datalab(data=test_data, label_name="y", task="classification") 
test_lab.find_issues(features=test_features.to_numpy(), pred_probs=test_pred_probs)
test_lab.report(show_summary_score=True, show_all_issues=True)

Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 45 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

           issue_type    score  num_issues
                label 0.621429          45
                 null 1.000000           0
              outlier 0.381558           0
       near_duplicate 0.596489           0
              non_iid 0.592292           0
      class_imbalance 0.100000           0
underperforming_group 1.000000           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 140, num_classes: 5


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to

The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the `get_issues` method.

In [22]:
test_label_issue_results = test_lab.get_issues("label")
test_label_issue_results.head()

|  | is_label_issue | label_score | given_label | predicted_label |
|---|---|---|---|---|
| 0 | False | 0.942174 | 1 | 1 |
| 1 | False | 0.779307 | 2 | 2 |
| 2 | True | 0.001335 | 4 | 1 |
| 3 | False | 0.830184 | 1 | 1 |
| 4 | False | 0.795155 | 3 | 3 |


In [23]:
test_label_issues = test_label_issue_results[test_label_issue_results["is_label_issue"] == True]

To review the most severe label issues, sort the DataFrame above by the `label_score` column (a lower score represents that the label is less likely to be correct).

Let’s review some of the most likely label errors:

In [24]:
test_sorted_label_issues = test_label_issues.sort_values("label_score").index

test_features.iloc[test_sorted_label_issues].assign(
    given_label=test_labels[test_sorted_label_issues],
    predicted_label=test_label_issue_results["predicted_label"].iloc[test_sorted_label_issues],
    label_score=test_label_issues["label_score"]
).head(5)

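|  | exam_1 | exam_2 | exam_3 | notes | given_label | predicted_label | label_score |
|---|---|---|---|---|---|---|---|
| 83 | 87.0 | 74.0 | 86.0 | 4 | 0 | 2 | 0.000121 |
| 2 | 72.0 | 91.0 | 91.0 | 5 | 4 | 1 | 0.001335 |
| 15 | 72.0 | 90.0 | 98.0 | 5 | 4 | 1 | 0.00136 |
| 112 | 86.0 | 85.0 | 89.0 | 5 | 3 | 1 | 0.001372 |
| 98 | 95.0 | 80.0 | 86.0 | 5 | 4 | 1 | 0.003727 |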


The dataframe above shows the original label (`given_label`) for examples that cleanlab finds most likely to be mislabeled, as well as an alternative `predicted_label` for each example.

These examples are likely mislabeled and should be carefully re-examined by hand.
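
For intuition about the `label_score` column: one common way to score a label, often called self-confidence, is the model's predicted probability for the class that was given as the label. The sketch below uses made-up probabilities and a hypothetical helper. It illustrates the general idea under that assumption and does not reproduce how the scores shown above were computed.

```python
def self_confidence(pred_probs, given_labels):
    """Predicted probability of each example's given label."""
    return [probs[label] for probs, label in zip(pred_probs, given_labels)]


# Made-up predicted probabilities for three examples and three classes.
toy_pred_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.02, 0.95, 0.03],
]
toy_given_labels = [0, 1, 2]

scores = self_confidence(toy_pred_probs, toy_given_labels)
# Sort example indices from least to most trustworthy label.
ranked = sorted(range(len(scores)), key=lambda i: scores[i])
print(scores)  # [0.9, 0.8, 0.03]
print(ranked)  # [2, 1, 0]
```

The third example is ranked first for review because the model assigns its given label a probability of only 0.03.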

In [25]:
test_label_issues_to_fix = test_features.iloc[test_sorted_label_issues].assign(
    given_label=test_labels.iloc[test_sorted_label_issues],
    predicted_label=test_label_issue_results["predicted_label"].iloc[test_sorted_label_issues],
    label_score=test_label_issues["label_score"]
).copy()

In [26]:
indices_to_drop_from_test_data = []

cleanlab found the label issues below in our test data, so let's inspect them:

In [27]:
test_label_issues_to_fix

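|  | exam_1 | exam_2 | exam_3 | notes | given_label | predicted_label | label_score |
|---|---|---|---|---|---|---|---|
| 83 | 87.0 | 74.0 | 86.0 | 4 | 0 | 2 | 0.000121 |
| 2 | 72.0 | 91.0 | 91.0 | 5 | 4 | 1 | 0.001335 |
| 15 | 72.0 | 90.0 | 98.0 | 5 | 4 | 1 | 0.00136 |
| 112 | 86.0 | 85.0 | 89.0 | 5 | 3 | 1 | 0.001372 |
| 98 | 95.0 | 80.0 | 86.0 | 5 | 4 | 1 | 0.003727 |
| 93 | 80.0 | 60.0 | 80.0 | 5 | 4 | 1 | 0.004443 |
| 79 | 71.0 | 78.0 | 80.0 | 1 | 3 | 1 | 0.005022 |
| 127 | 87.0 | 80.0 | 65.0 | 1 | 1 | 0 | 0.00513 |
| 102 | 95.0 | 81.0 | 76.0 | 5 | 1 | 3 | 0.005515 |
| 58 | 79.0 | 73.0 | 78.0 | 5 | 2 | 4 | 0.005803 |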


After manually inspecting the label issues above, we add the indices of the examples we want to remove to the list we defined previously.

Remember to ALWAYS inspect and manually handle label issues in your test data and to NEVER handle them automatically.

Below, we add every label issue found in our test data to the list of indices we will drop to clean our test data.

In [28]:
indices_to_drop_from_test_data += list(test_label_issues_to_fix.index)

In [29]:
test_features = test_features.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
test_labels = test_labels.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
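
In this tutorial every flagged test example is dropped after inspection. In practice, a manual review often ends with a per-example decision: keep the label, correct it, or drop the row. A minimal sketch of recording and applying such decisions follows; the `review_decisions` mapping, the helper name, and all values are hypothetical.

```python
def apply_review(labels, review_decisions):
    """Apply manual review decisions to a {row_index: label} mapping.

    Each decision is ("keep", None), ("relabel", new_label), or ("drop", None).
    Rows without a decision are left unchanged.
    """
    cleaned = {}
    for idx, label in labels.items():
        action, new_label = review_decisions.get(idx, ("keep", None))
        if action == "drop":
            continue
        cleaned[idx] = new_label if action == "relabel" else label
    return cleaned


# Made-up labels and decisions, as a human reviewer might record them.
toy_labels = {0: 1, 1: 2, 2: 4, 3: 1}
review_decisions = {
    1: ("keep", None),   # flagged, but the reviewer judged the label correct
    2: ("relabel", 1),   # reviewer confirmed a different label
    3: ("drop", None),   # reviewer could not determine the correct label
}
cleaned_labels = apply_review(toy_labels, review_decisions)
print(cleaned_labels)  # {0: 1, 1: 2, 2: 1}
```

Keeping the decisions in one explicit mapping makes each test-set change traceable to an individual review.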

### Evaluate classification model with original (noisy) training data on clean test data

In [166]:
preds = clf.predict(test_features)
acc_original = accuracy_score(test_labels, preds)
print(
    f"Accuracy of model fit to noisy training data, measured on clean test data: {round(acc_original*100,1)}%"
)

Accuracy of model fit to noisy training data, measured on clean test data: 72.6%


## Compute out-of-sample predicted probabilities for training data

In [31]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
    clf,
    train_features,
    train_labels,
    cv=num_crossval_folds,
    method="predict_proba",
)

## Use Datalab to find issues in training data and then remove them

In [32]:
train_data = {"X": train_features.values, "y": train_labels}

train_lab = Datalab(data=train_data, label_name="y", task="classification")
train_lab.find_issues(features=train_features.to_numpy(), pred_probs=pred_probs)
train_lab.report(show_summary_score=True, show_all_issues=True)

Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 357 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

           issue_type    score  num_issues
                label 0.734286         209
              outlier 0.340853          78
       near_duplicate 0.597323          70
                 null 1.000000           0
              non_iid 0.477806           0
      class_imbalance 0.154286           0
underperforming_group 0.938766           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 700, num_classes: 5


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated t

In [33]:
label_issue_results = train_lab.get_issues("label")
label_issues_idx = label_issue_results[label_issue_results["is_label_issue"] == True].index
label_issues_idx

Index([  2,   3,   7,   9,  14,  15,  19,  24,  25,  26,
       ...
       665, 672, 674, 678, 680, 688, 691, 692, 695, 696],
      dtype='int64', length=209)

In [34]:
near_duplicates = train_lab.get_issues("near_duplicate")
near_duplicates_idx = near_duplicates[near_duplicates["is_near_duplicate_issue"] == True].index

In [35]:
outliers = train_lab.get_issues("outlier")
outliers_idx = outliers[outliers["is_outlier_issue"] == True].index
outliers_idx

Index([  0,   4,   9,  29,  47,  65,  93,  95,  99, 124, 157, 172, 182, 191,
       195, 198, 209, 229, 230, 235, 237, 245, 250, 278, 280, 285, 312, 331,
       340, 344, 352, 355, 360, 362, 365, 373, 382, 391, 392, 395, 407, 421,
       439, 440, 455, 463, 465, 470, 483, 488, 491, 499, 508, 521, 526, 563,
       566, 571, 572, 576, 579, 590, 599, 600, 612, 613, 627, 637, 641, 642,
       656, 665, 670, 676, 680, 681, 693, 697],
      dtype='int64')

In [36]:
idx_to_drop = list(set(list(label_issues_idx) + list(near_duplicates_idx) + list(outliers_idx)))
len(idx_to_drop)

310
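
The report above lists 209 label issues, 78 outliers, and 70 near duplicates, 357 flags in total, yet only 310 distinct rows are dropped because some rows are flagged for more than one issue type. A toy illustration with made-up indices:

```python
# Made-up index sets; a row can be flagged for several issue types.
toy_label_issues = {2, 3, 7, 9}
toy_near_duplicates = {3, 5}
toy_outliers = {0, 9}

total_flags = len(toy_label_issues) + len(toy_near_duplicates) + len(toy_outliers)
rows_to_drop = toy_label_issues | toy_near_duplicates | toy_outliers

print(total_flags)           # 8 flags in total
print(sorted(rows_to_drop))  # [0, 2, 3, 5, 7, 9], only 6 distinct rows
```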

In [37]:
train_features = train_features.drop(idx_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(idx_to_drop, axis=0).reset_index(drop=True)

## Train model on clean training data

In [38]:
clean_clf = XGBClassifier(tree_method="hist", enable_categorical=True)
clean_clf.fit(train_features, train_labels)

### Evaluate classification model with clean training data on clean test data

In [39]:
clean_preds = clean_clf.predict(test_features)
acc_clean = accuracy_score(test_labels, clean_preds)
print(
    f"Accuracy of model fit to clean training data, measured on clean test data: {round(acc_clean*100,1)}%"
)

Accuracy of model fit to clean training data, measured on clean test data: 80.0%


### Hyperparameter Optimization for editing data issues

We have made some basic edits to improve test performance, so now we will parameterize each of these edits (e.g. what fraction of each issue to delete) to automatically find the combination of edits that achieves the best test performance.

We will use a basic hyperparameter-tuning library to optimize over these edit-variants + model re-training on the edited datasets with our objective being test performance.

In a real-world setting, this would ideally be done on cleaned validation data instead of test data, but we are simplifying the approach for this tutorial.
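
One way to follow that advice is to hold out part of the training data as a validation set and choose the edit parameters on it, keeping the test set for the final estimate only. A minimal sketch of the index split, with a hypothetical helper name and an arbitrary 20% validation fraction:

```python
import random


def split_train_validation(n_samples, val_fraction=0.2, seed=0):
    """Shuffle row indices and split them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


train_idx, val_idx = split_train_validation(n_samples=700)
print(len(train_idx), len(val_idx))  # 560 140
```

The validation rows would then play the role the test set plays in the search below, and they would need the same careful cleaning.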

To parameterize our dataset edits, we define a `dict` below containing default settings that we found tend to work well:

In [40]:
default_edit_params = {
    "drop_label_issue": 0.5,
    "drop_near_duplicate": 0.2,
    "drop_outlier": 0.5,
}

In English, these choices mean:

- `drop_label_issue`: We drop the 50% of datapoints flagged with label issues that have the lowest label scores, i.e. the most likely label errors.
- `drop_outlier`: We drop the 50% most severe outliers, based on outlier score, among the set of flagged outliers.
- `drop_near_duplicate`: Within each set of near duplicates, we drop 20% of the datapoints (rounded up, so at least one), chosen at random. We never drop an entire set, so at least one copy always remains.
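
These fractions translate into row counts as follows. The sketch applies the same rounding that `preprocess_data` uses later (`int(...)` for label issues and outliers, `math.ceil(...)` with a minimum of one for each near-duplicate set) to the issue counts reported above. The helper name is hypothetical.

```python
import math


def n_to_drop_from_cluster(cluster_size, fraction):
    """Number of rows dropped from one near-duplicate set, always keeping one."""
    n_drop = max(math.ceil(cluster_size * fraction), 1)
    return min(n_drop, cluster_size - 1)


# Label issues and outliers: truncate the product to an integer.
print(int(209 * 0.5))  # 104 of the 209 flagged label issues
print(int(78 * 0.5))   # 39 of the 78 flagged outliers

# Near duplicates: at least one row per set is dropped, and at least one is kept.
print(n_to_drop_from_cluster(2, 0.2))   # 1
print(n_to_drop_from_cluster(10, 0.2))  # 2
```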

`cleanlab`'s `Datalab` object lets us retrieve each type of issue below, sorted in ascending order of its score (most severe first). We will use these in our hyperparameter optimization to find which combination of dropped datapoints improves our ML model results the most.

In [165]:
label_issues = train_lab.get_issues("label").query("is_label_issue").sort_values("label_score")
near_duplicates = train_lab.get_issues("near_duplicate").query("is_near_duplicate_issue").sort_values("near_duplicate_score")
outliers = train_lab.get_issues("outlier").query("is_outlier_issue").sort_values("outlier_score")

In [161]:
def preprocess_data(train_features, train_labels, label_issues, near_duplicates, outliers, drop_label_issue, drop_near_duplicate, drop_outlier):
    """
    Preprocesses the training data by dropping a specified fraction of the data points identified as label issues,
    near duplicates, and outliers, based on the issue DataFrames provided for each issue type.
    
    Args:
        train_features (pd.DataFrame): DataFrame containing the training features.
        train_labels (pd.Series): Series containing the training labels.
        label_issues (pd.DataFrame): DataFrame containing data points with label issues.
        near_duplicates (pd.DataFrame): DataFrame containing data points identified as near duplicates.
        outliers (pd.DataFrame): DataFrame containing data points identified as outliers.
        drop_label_issue (float): Fraction (between 0 and 1) of label issue data points to drop.
        drop_near_duplicate (float): Fraction (between 0 and 1) of each near-duplicate set to drop.
        drop_outlier (float): Fraction (between 0 and 1) of outlier data points to drop.
    
    Returns:
        pd.DataFrame: The cleaned training features.
        pd.Series: The cleaned training labels.
    """
    # Extract indices for each type of issue
    label_issues_idx = label_issues.index.tolist()
    near_duplicates_idx = near_duplicates.index.tolist()
    outliers_idx = outliers.index.tolist()
    
    # Calculate the number of each type of data point to drop except near duplicates, which requires separate logic
    num_label_issues_to_drop = int(len(label_issues_idx) * drop_label_issue)
    num_outliers_to_drop = int(len(outliers_idx) * drop_outlier)

    # Calculate number of near duplicates to drop
    # Assuming the 'near_duplicate_sets' are lists of indices (integers) of near duplicates
    clusters = []
    for i in near_duplicates_idx:
        # Create a set for each cluster, add the current index to its near duplicate set
        cluster = set(near_duplicates.at[i, 'near_duplicate_sets'])
        cluster.add(i)
        clusters.append(cluster)
    
    # Deduplicate clusters by converting the list of sets to a set of frozensets
    unique_clusters = set(frozenset(cluster) for cluster in clusters)
    
    # If you need the unique clusters back in list of lists format:
    unique_clusters_list = [list(cluster) for cluster in unique_clusters]
    
    near_duplicates_idx_to_drop = []
    
    for cluster in unique_clusters_list:
        # Calculate the number of rows to drop, ensuring at least one datapoint remains
        n_drop = max(math.ceil(len(cluster) * drop_near_duplicate), 1)  # Drop at least k% or 1 row
        if len(cluster) > n_drop:  # Ensure we keep at least one datapoint
            # Randomly select datapoints to drop
            drops = random.sample(cluster, n_drop)
        else:
            # If the cluster is too small, adjust the number to keep at least one datapoint
            drops = random.sample(cluster, len(cluster) - 1)  # Keep at least one
        near_duplicates_idx_to_drop.extend(drops)
    
    # Determine the specific indices to drop
    label_issues_idx_to_drop = label_issues_idx[:num_label_issues_to_drop]
    outliers_idx_to_drop = outliers_idx[:num_outliers_to_drop]
    
    # Combine the indices to drop
    idx_to_drop = list(set(label_issues_idx_to_drop + near_duplicates_idx_to_drop + outliers_idx_to_drop))
    
    # Drop the rows from the training data
    train_features_cleaned = train_features.drop(idx_to_drop).reset_index(drop=True)
    train_labels_cleaned = train_labels.drop(idx_to_drop).reset_index(drop=True)
    
    return train_features_cleaned, train_labels_cleaned


In [162]:
from itertools import product

# Define the parameter grid as lists of possible values
param_grid = {
    'drop_label_issue': [0.4, 0.5, 0.6],
    'drop_near_duplicate': [0.1, 0.2, 0.3],
    'drop_outlier': [0.4, 0.5, 0.6],
}

# Generate all combinations of parameters
param_combinations = list(product(param_grid['drop_label_issue'], param_grid['drop_near_duplicate'], param_grid['drop_outlier']))
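
With three candidate values for each of the three parameters, the grid contains 3 * 3 * 3 = 27 combinations. As a small variation on the cell above, each combination can be built as a dict keyed by parameter name, which avoids relying on the positional order of the tuple:

```python
from itertools import product

param_grid = {
    "drop_label_issue": [0.4, 0.5, 0.6],
    "drop_near_duplicate": [0.1, 0.2, 0.3],
    "drop_outlier": [0.4, 0.5, 0.6],
}

# product(*values) yields one tuple per combination, in the order of the keys.
param_dicts = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(param_dicts))  # 27
print(param_dicts[0])    # {'drop_label_issue': 0.4, 'drop_near_duplicate': 0.1, 'drop_outlier': 0.4}
```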

In [163]:
best_score = 0
best_params = None

for drop_label_issue, drop_near_duplicate, drop_outlier in param_combinations:
    # Preprocess the data for the current combination of parameters
    train_features_preprocessed, train_labels_preprocessed = preprocess_data(
        train_features_v2, train_labels_v2, label_issues, near_duplicates, outliers,
        drop_label_issue, drop_near_duplicate, drop_outlier)
    
    # Train and evaluate the model
    model = XGBClassifier(tree_method="hist", enable_categorical=True)
    model.fit(train_features_preprocessed, train_labels_preprocessed)
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels, predictions)
    
    # Update the best score and parameters if the current model is better
    if accuracy > best_score:
        best_score = accuracy
        best_params = {'drop_label_issue': drop_label_issue, 'drop_near_duplicate': drop_near_duplicate, 'drop_outlier': drop_outlier}

# Print the best parameters and score
print(f"Best parameters: {best_params}")

Best parameters: {'drop_label_issue': 0.6, 'drop_near_duplicate': 0.1, 'drop_outlier': 0.6}


In [164]:
print(
    f"Accuracy of model fit to clean training data based on the optimal combinations of hyperparameters to clean our data, measured on clean test data: {round(best_score*100,1)}%"
)

Accuracy of model fit to clean training data based on the optimal combinations of hyperparameters to clean our data, measured on clean test data: 82.1%
