# Data Curation with Train vs Test Splits

In typical Machine Learning projects, we split our dataset into training data for fitting models and test data to evaluate model performance. For noisy real-world datasets, detecting/correcting errors in the training data is important to train robust models, but it's less recognized that the test set can also be noisy.
For accurate model evaluation, it is vital to find and fix issues in the test data as well. Some evaluation metrics are particularly sensitive to outliers and noisy labels.
This tutorial demonstrates a way to use [Cleanlab Open Source (CLOS)](https://github.com/cleanlab/cleanlab) to clean both your training and test data, ensuring **robust** model training and **reliable** performance evaluation.
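
To make the effect of noisy test labels concrete, here is a minimal sketch on synthetic data. It is not part of the student-grades dataset. It assumes a hypothetical classifier that predicts every example correctly and shows how its measured accuracy drops once 20% of the test labels are corrupted. All names in it are made up for illustration.

```python
import numpy as np

# Hypothetical ground-truth labels for 100 test examples across 5 classes
true_labels = np.arange(100) % 5

# A hypothetical classifier that predicts every example correctly
preds = true_labels.copy()

# Corrupt every 5th test label (20% label noise) by shifting it to the next class
noisy_labels = true_labels.copy()
noisy_labels[::5] = (noisy_labels[::5] + 1) % 5

acc_on_true = float((preds == true_labels).mean())
acc_on_noisy = float((preds == noisy_labels).mean())
print(f"Accuracy vs. true labels:  {acc_on_true:.2f}")
print(f"Accuracy vs. noisy labels: {acc_on_noisy:.2f}")
```

The model itself is unchanged between the two measurements. Only the quality of the test labels differs, which is why the evaluation data deserves the same scrutiny as the training data.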

Here's how we recommend handling noisy training and test data with CLOS:



- First focus on detecting issues in the test data. For the best detection, we recommend that you merge your training and test data and then run a Cleanlab Studio Project (which will benefit from more data) -- but only focus on project results for the test data.
- Manually review/correct Cleanlab-detected issues in your test data. To avoid bias, we caution against automated correction of test data. Instead, test data changes should be individually verified to ensure they will lead to more accurate model evaluation.
- Run a separate Cleanlab Studio project on the training data alone to detect issues in the training data (without any test set information leakage).
- Optionally, use automated Cleanlab suggestions to algorithmically refine the training data (or manually review/correct Cleanlab-detected issues in your training data).
- Estimate the final model's performance on the cleaned test data. **Do not compare the performance of different ML models estimated across different versions of your test data.** These estimates are incomparable.

Consider this tutorial as a blueprint for using Cleanlab Studio in ML projects spanning various data modalities and tasks.
Let’s get started!


**Note**: We are using tabular data in this tutorial but the approach can apply to other data modalities!

## Install required dependencies

First install and import the required dependencies for this tutorial.

You can use `pip` to install all packages required for this tutorial as follows:


Make sure to install the version corresponding to this tutorial. For example, if you are viewing the master branch documentation, run `!pip install git+https://github.com/cleanlab/cleanlab.git`.

In this tutorial we are using the following version of `cleanlab`: `2.6.1`

In [109]:
!pip install xgboost scikit-learn pandas "cleanlab[datalab]"



In [110]:
import random
import os
import cleanlab
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
from cleanlab import Datalab

SEED = 123456

np.random.seed(SEED)
random.seed(SEED)

In [114]:
cleanlab.__version__

'2.6.1'

In [5]:
seed_value = 55

## Load the data

In [6]:
# df_train = pd.read_csv(
#     "https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/train.csv"
# )
# df_test = pd.read_csv(
#     "https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/test.csv"
# )
# df_train.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,noisy_letter_grade
0,37fd76,99,59,70,missed class frequently -10,D
1,018bff,94,41,91,great participation +10,B
2,b3c9a0,91,74,88,,B
3,076d92,0,79,65,"cheated on exam, gets 0pts",F
4,68827d,91,98,75,missed class frequently -10,C


In [186]:
df_train = pd.read_csv("clos_train_data.csv")
df_test = pd.read_csv("clos_test_data.csv")

In [187]:
df_train.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,noisy_letter_grade
0,37fd76,99.0,59.0,70.0,missed class frequently -10,D
1,018bff,94.0,41.0,91.0,great participation +10,B
2,b3c9a0,91.0,74.0,88.0,,B
3,076d92,0.0,79.0,65.0,"cheated on exam, gets 0pts",F
4,68827d,91.0,98.0,75.0,missed class frequently -10,C


## Preprocess the dataset

Before training an XGBoost model, we encode the `notes` and `noisy_letter_grade` columns as categorical columns.

In [218]:
# Create label encoders for the grade and notes columns
grade_le = preprocessing.LabelEncoder()
notes_le = preprocessing.LabelEncoder()

# Prepare the feature columns
train_features = df_train.drop(["stud_ID", "noisy_letter_grade"], axis=1).copy()
train_features["notes"] = notes_le.fit_transform(train_features["notes"])
train_features["notes"] = train_features["notes"].astype("category")

# Encode the label column into a categorical feature
train_labels = pd.DataFrame(grade_le.fit_transform(df_train["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])

Let's view the training set features after preprocessing.

In [219]:
train_features.head()

Unnamed: 0,exam_1,exam_2,exam_3,notes
0,99.0,59.0,70.0,3
1,94.0,41.0,91.0,2
2,91.0,74.0,88.0,5
3,0.0,79.0,65.0,0
4,91.0,98.0,75.0,3


Next we repeat the same preprocessing steps for our test data, reusing the encoders that were fit on the training data.

In [220]:
test_features = df_test.drop(
    ["stud_ID", "noisy_letter_grade"], axis=1
).copy()
test_features["notes"] = notes_le.transform(test_features["notes"])
test_features["notes"] = test_features["notes"].astype("category")

test_labels = pd.DataFrame(grade_le.transform(df_test["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
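
Since the labels are now integers, it helps to know which integer stands for which letter grade. The sketch below uses a toy list of grades (not the tutorial's data) to show how a fitted `LabelEncoder` exposes that mapping through `classes_` and how `inverse_transform` recovers the original values. The names `toy_grades` and `toy_le` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Toy letter grades standing in for the `noisy_letter_grade` column
toy_grades = ["D", "B", "F", "C", "A", "B"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_grades)

# `classes_` lists the original values in sorted order;
# position i holds the value that is encoded as integer i
mapping = {str(cls): i for i, cls in enumerate(toy_le.classes_)}
print(mapping)

# `inverse_transform` converts integer codes back to the original values
decoded = toy_le.inverse_transform(encoded)
print(list(decoded))
```

The same inspection should work on `grade_le` and `notes_le` from the cells above. Keep in mind that `transform` raises an error for values the encoder did not see during fitting, so the test set must only contain categories that also occur in the training set.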

In [215]:
train_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 707 entries, 0 to 706
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   exam_1  707 non-null    float64 
 1   exam_2  707 non-null    float64 
 2   exam_3  707 non-null    float64 
 3   notes   707 non-null    category
dtypes: category(1), float64(3)
memory usage: 17.6 KB


In [236]:
train_labels = train_labels.astype('object')
test_labels = test_labels.astype('object')

train_features["notes"] = train_features["notes"].astype(int)
test_features["notes"] = test_features["notes"].astype(int)

preprocessed_train_data = pd.concat([train_features, train_labels], axis=1)
preprocessed_train_data["stud_ID"] = df_train["stud_ID"]

preprocessed_test_data = pd.concat([test_features, test_labels], axis=1)
preprocessed_test_data["stud_ID"] = df_test["stud_ID"]

## Check for Near Duplicate and IID Issues on full dataset using Datalab

In [240]:
full_df = pd.concat([preprocessed_train_data, preprocessed_test_data], axis=0).reset_index(drop=True)
features_df = full_df.drop(["noisy_letter_grade", "stud_ID"], axis=1)

In [241]:
lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
lab.find_issues(features=features_df.to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
lab.report(show_summary_score=True, show_all_issues=True)

Finding near_duplicate issues ...
Finding non_iid issues ...

Audit complete. 106 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

    issue_type    score  num_issues
near_duplicate 0.588192         106
       non_iid 0.595350           0

(Note: A lower score indicates a more severe issue across all examples in the dataset.)

Dataset Information: num_examples: 940, num_classes: 5


------------------ near_duplicate issues -------------------

About this issue:
	A (near) duplicate issue refers to two or more examples in
    a dataset that are extremely similar to each other, relative
    to the rest of the dataset.  The examples flagged with this issue
    may be exactly duplicated, or lie atypically close together when
    represented as vectors (i.e. feature embeddings).
    

Number of examples with this issue: 106
Overall dataset quality in terms of this issue: 0.5882

Examples representing most severe instances of this issue:
    

We can see that we have many `near_duplicate` issues. Now let's drop from our training set the near (or exact) duplicates that are found between our training and test sets.

In [244]:
full_duplicate_results = lab.get_issues("near_duplicate")
full_duplicate_results.sort_values("near_duplicate_score").head()

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
732,True,0.0,"[874, 133, 149]",0.0
205,True,0.0,[796],0.0
593,True,0.0,[261],0.0
204,True,0.0,[861],0.0
541,True,0.0,[3],0.0


In [246]:
full_duplicate_issues = full_duplicate_results[full_duplicate_results["is_near_duplicate_issue"] == True].sort_values("near_duplicate_score")
full_duplicate_issues 

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
3,True,0.000000,[541],0.000000
261,True,0.000000,[593],0.000000
385,True,0.000000,[464],0.000000
464,True,0.000000,[385],0.000000
541,True,0.000000,[3],0.000000
...,...,...,...,...
272,True,0.105667,[760],0.000004
828,True,0.112253,[880],0.000004
880,True,0.112253,[828],0.000004
517,True,0.113335,[79],0.000004
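
One way to pick out the training rows that duplicate test rows is sketched below on a toy results table. It assumes, as in the concatenation above, that training rows come first in the combined dataset, so any index below the number of training rows belongs to the training set. The table contents and the names `toy_results` and `n_train` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `lab.get_issues("near_duplicate")` on a concatenated dataset.
# Assumption: rows 0-3 came from the training set and rows 4-5 from the test set.
n_train = 4
toy_results = pd.DataFrame(
    {
        "is_near_duplicate_issue": [True, True, True, False, True, False],
        "near_duplicate_sets": [[4], [2], [1], [], [0], []],
    }
)

flagged = toy_results[toy_results["is_near_duplicate_issue"]]

# Training rows whose near-duplicate set contains at least one test row
train_rows_to_drop = [
    idx
    for idx, dup_set in flagged["near_duplicate_sets"].items()
    if idx < n_train and any(j >= n_train for j in dup_set)
]
print(train_rows_to_drop)
```

Rows 1 and 2 duplicate each other but both live in the training set, so they are left alone here. With the real results, `n_train` would be `len(preprocessed_train_data)`, and the selected indices could be passed to `DataFrame.drop` on the training data.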



## Train model with original (noisy) training data

In [221]:
clf = XGBClassifier(tree_method="hist", enable_categorical=True)
clf.fit(train_features, train_labels)

Although curating clean test data does not directly help train a better ML model, more reliable model evaluation can improve our overall ML project. For instance, clean test data can enable better informed decisions regarding when to deploy a model and better model/hyperparameter selection.

## Compute out-of-sample predicted probabilities for test data

In [68]:
from sklearn.model_selection import cross_val_predict


num_crossval_folds = 5
test_pred_probs = cross_val_predict(
    clf,
    test_features,
    test_labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
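
The following sketch illustrates what "out-of-sample" means here, using a small synthetic dataset and `LogisticRegression` as a stand-in for the XGBoost model. Both are assumptions made only to keep the example quick to run. `cross_val_predict` fits a fresh copy of the model on each set of training folds and predicts only the held-out fold, so every row's probabilities come from a model that never saw that row.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Small synthetic dataset: 60 examples, 2 features, 3 classes (all hypothetical)
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 20)
X_toy = rng.normal(size=(60, 2)) + y_toy[:, None] * 3.0

toy_pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    cv=5,
    method="predict_proba",
)

# One row per example, one column per class
print(toy_pred_probs.shape)
```

Because the estimator passed in is cloned for every fold, the original object is not refit by this call.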

### Evaluate classification model with original (noisy) training data on noisy test data

In [69]:
preds = clf.predict(test_features)
acc_original = accuracy_score(test_labels, preds)
print(
    f"Accuracy of model fit to noisy training data, measured on clean test data: {round(acc_original*100,1)}%"
)

Accuracy of model fit to noisy training data, measured on clean test data: 68.2%


## Use cleanlab to find label issues in test data and then manually correct them
Based on the given labels, predicted probabilities, and KNN graph, cleanlab can quickly help us identify suspicious values in our grades table.

We use cleanlab’s Datalab class, which has several ways of loading the data. In this case, we’ll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a Datalab object, so that it can audit our dataset for various types of issues.

In [79]:
from cleanlab import Datalab

test_data = {"X": test_features.values, "y": test_labels}

test_lab = Datalab(test_data, label_name="y")
test_lab.find_issues(pred_probs=test_pred_probs)

Finding label issues ...



Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Error in underperforming_group: If a knn_graph is not provided, features must be provided to fit a new knn.
Failed to check for these issue types: [UnderperformingGroupIssueManager]

Audit complete. 94 issues found in the dataset.


In [72]:
test_lab.report()

Here is a summary of the different kinds of issues found in the data:

issue_type  num_issues
     label          94

Dataset Information: num_examples: 233, num_classes: 5


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.
    

Number of examples with this issue: 94
Overall dataset quality in terms of this issue: 0.6052

Examples representing most severe instances of this issue:
     is_label_issue  label_score  given_label  predicted_label
26             True     0.000102            4                1
6              True     0.000195            4                1
212            True     0.000366            4                0
177            True     0.000516            4                0
61             True     0.000551            3                4


The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the `get_issues` method.

In [76]:
test_label_issue_results = test_lab.get_issues("label")
test_label_issue_results.head()

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
0,False,0.682157,1,1
1,False,0.974964,2,2
2,False,0.969182,1,1
3,True,0.028468,3,4
4,False,0.880642,3,3


To review the most severe label issues, sort the DataFrame above by the `label_score` column (a lower score represents that the label is less likely to be correct).

Let’s review some of the most likely label errors:

In [78]:
test_sorted_label_issues = test_label_issue_results.sort_values("label_score").index

test_features.iloc[test_sorted_label_issues].assign(
    given_label=test_labels.iloc[test_sorted_label_issues],
    predicted_label=test_label_issue_results["predicted_label"].iloc[test_sorted_label_issues]
).head(5)

Unnamed: 0,exam_1,exam_2,exam_3,notes,given_label,predicted_label
26,72,90,98,5,4,1
6,72,91,91,5,4,1
212,79,77,95,2,4,0
177,90,100,89,2,4,0
61,92,0,94,0,3,4


The dataframe above shows the original label (given_label) for examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example.

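These examples appear to be mislabeled and should be carefully re-examined. For instance, the student in row 177 scored 90, 100 and 89 on the three exams yet was given label 4, which corresponds to an F if the label encoder assigned codes to the letter grades A, B, C, D, F in alphabetical order!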
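
Once a person has reviewed the flagged examples, the confirmed corrections can be applied explicitly rather than copying every predicted label. The sketch below shows one way to do this with a mapping from row index to verified label. The labels and corrections in it are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy encoded test labels (hypothetical values, indexed by row position)
toy_test_labels = pd.Series([4, 1, 4, 3, 0], name="noisy_letter_grade")

# Corrections confirmed by a human reviewer: row index -> verified label.
# These entries are made up for illustration.
verified_corrections = {0: 1, 2: 0}

# Work on a copy so the original noisy labels stay available for comparison
clean_test_labels = toy_test_labels.copy()
for row_idx, verified_label in verified_corrections.items():
    clean_test_labels.loc[row_idx] = verified_label

print(clean_test_labels.tolist())
```

Keeping the corrections in a small, explicit mapping also leaves a record of exactly which test labels were changed and why.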

In [41]:
test_features_to_change = test_features.iloc[test_sorted_label_issues].assign(
    given_label=test_labels.iloc[test_sorted_label_issues],
    predicted_label=test_label_issue_results["predicted_label"].iloc[test_sorted_label_issues]
).copy()

In [42]:
test_features_to_change.columns

Index(['exam_1', 'exam_2', 'exam_3', 'notes', 'given_label',
       'predicted_label'],
      dtype='object')

In [43]:
test_features_to_change

Unnamed: 0,exam_1,exam_2,exam_3,notes,given_label,predicted_label
202,0.578948,0.901414,0.433183,2,4,0
138,-0.269968,-0.776429,0.926049,5,2,1
253,0.578948,0.199061,0.221955,5,4,1
190,0.878565,0.082002,0.644411,2,4,0
32,0.828629,0.199061,-0.482140,5,3,1
...,...,...,...,...,...,...
147,0.529012,0.706316,0.292364,2,0,0
61,0.628884,0.511218,-0.130092,2,0,0
60,-1.019011,-3.000546,-0.834187,0,4,4
44,0.529012,0.511218,0.714821,1,0,0


In [44]:
test_features_to_change[["given_label", "predicted_label"]]

Unnamed: 0,given_label,predicted_label
202,4,0
138,2,1
253,4,1
190,4,0
32,3,1
...,...,...
147,0,0
61,0,0
60,4,4
44,0,0


In [45]:
test_features_to_change.loc[:, 'given_label'] = test_features_to_change.loc[:, 'predicted_label']
test_features_to_change = test_features_to_change.rename(columns={"given_label": "clean_label"}).drop("predicted_label", axis=1)
test_features_to_change

Unnamed: 0,exam_1,exam_2,exam_3,notes,clean_label
202,0.578948,0.901414,0.433183,2,0
138,-0.269968,-0.776429,0.926049,5,1
253,0.578948,0.199061,0.221955,5,1
190,0.878565,0.082002,0.644411,2,0
32,0.828629,0.199061,-0.482140,5,1
...,...,...,...,...,...
147,0.529012,0.706316,0.292364,2,0
61,0.628884,0.511218,-0.130092,2,0
60,-1.019011,-3.000546,-0.834187,0,4
44,0.529012,0.511218,0.714821,1,0


In [64]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index

train_features.iloc[sorted_outliers].head()

Unnamed: 0,exam_1,exam_2,exam_3,notes
397,0.122686,-3.063025,4.024837,0
654,-3.070488,0.147247,3.345416,0
743,2.669142,0.350429,-1.206705,1
283,0.607725,0.553611,4.364548,5
651,0.001427,2.544792,0.627732,1


In [65]:
outlier_results.sort_values("outlier_score")["is_outlier_issue"].value_counts()

is_outlier_issue
False    730
True      35
Name: count, dtype: int64

In [94]:
# outlier_issues = outlier_results.query("is_outlier_issue").index
outlier_issues = outlier_results.sort_values("outlier_score")[:55].index
no_outlier_issues_train_features = train_features.drop(outlier_issues, axis=0).reset_index(drop=True)
no_outlier_issues_train_labels = train_labels.drop(outlier_issues, axis=0).reset_index(drop=True)

In [None]:
duplicate_results = lab.get_issues("near_duplicate")
nd_issues = duplicate_results.sort_values("near_duplicate_score")[:200].index
no_nd_issues_train_features = train_features.drop(nd_issues, axis=0).reset_index(drop=True)
no_nd_issues_train_labels = train_labels.drop(nd_issues, axis=0).reset_index(drop=True)

### Hyperparameter Optimization for editing data issues

We have made some basic edits to improve test performance, so now we will parameterize each of these edits (e.g. what fraction of each issue to delete) to automatically find the combination of edits that achieves the best test performance.

We will use a basic hyperparameter-tuning library to optimize over these edit-variants + model re-training on the edited datasets with our objective being test performance.

In a real-world setting, this would ideally be done on cleaned validation data instead of test data, but we are simplifying the approach for this tutorial.

To parametrize our dataset edits, we define a `dict` below containing default settings that we found tend to work well:

In [235]:
default_edit_params = {
    "relabel_confidence_threshold": 0.95,
    "drop_label_issue": 0.5,
    "drop_near_duplicate": 0.2,
    "drop_outlier": 0.5,
}

In English, these choices mean:

- `relabel_confidence_threshold`: We relabel any datapoint that is flagged with a label issue when the model’s predicted label (a different class than the given label) has probability > 0.95.
- `drop_label_issue`: Among the remaining datapoints flagged with label issues, we drop the 50% with the most severe label scores. We do not drop any of the datapoints relabeled in the prior step.
- `drop_outlier`: We drop the 50% most severe outliers based on outlier score (amongst the set of flagged outliers).
- `drop_near_duplicate`: We drop the extra copies among the top 20% of near duplicates (based on near duplicate score). We never drop the original datapoint, so at least one copy always remains. To decide which datapoint is the original, we keep the one with the highest self-confidence score for its given label within each set of near duplicates.
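
The relabeling rule in the first bullet can be sketched as follows. This is a minimal illustration on hypothetical arrays, not the tutorial's final implementation. It assumes you have the given labels, a boolean label-issue flag per example, and out-of-sample predicted probabilities.

```python
import numpy as np

# Hypothetical inputs for 4 examples and 3 classes
toy_labels = np.array([0, 1, 2, 0])
toy_is_label_issue = np.array([True, True, False, True])
toy_pred_probs = np.array(
    [
        [0.02, 0.97, 0.01],  # flagged, confident in class 1 -> relabel
        [0.40, 0.35, 0.25],  # flagged, but not confident -> keep the given label
        [0.01, 0.01, 0.98],  # not flagged -> untouched
        [0.96, 0.03, 0.01],  # flagged, but the confident class is the given label
    ]
)
relabel_confidence_threshold = 0.95

predicted = toy_pred_probs.argmax(axis=1)
confidence = toy_pred_probs.max(axis=1)

# Relabel only flagged examples whose predicted class differs from the given
# label and whose predicted probability exceeds the threshold
relabel_mask = (
    toy_is_label_issue
    & (predicted != toy_labels)
    & (confidence > relabel_confidence_threshold)
)
new_labels = np.where(relabel_mask, predicted, toy_labels)
print(new_labels)
```

Flagged examples that are not relabeled remain candidates for the `drop_label_issue` step.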

In [None]:
param_grid = {
    'drop_label_issue': [0.4, 0.5, 0.6],
    'drop_near_duplicate': [0.1, 0.2, 0.3],
    'drop_outlier': [0.4, 0.5, 0.6],
    'relabel_confidence_threshold': [0.9, 0.95, 0.99]
}
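
A grid like this can be expanded into individual settings with the standard library, as in the sketch below. Each resulting dictionary would then drive one round of editing the training data, retraining the model, and scoring it. That loop is omitted here, and the name `toy_param_grid` is only a local copy for illustration.

```python
from itertools import product

toy_param_grid = {
    "drop_label_issue": [0.4, 0.5, 0.6],
    "drop_near_duplicate": [0.1, 0.2, 0.3],
    "drop_outlier": [0.4, 0.5, 0.6],
    "relabel_confidence_threshold": [0.9, 0.95, 0.99],
}

# Every combination of one value per parameter, as a list of dicts
names = list(toy_param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*toy_param_grid.values())
]

print(len(combinations))
print(combinations[0])
```

With three values for each of four parameters there are 81 combinations, so each one implies a separate model fit.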

In [237]:
lab.get_issues().columns

Index(['is_label_issue', 'label_score', 'is_outlier_issue', 'outlier_score',
       'is_near_duplicate_issue', 'near_duplicate_score', 'is_non_iid_issue',
       'non_iid_score', 'is_class_imbalance_issue', 'class_imbalance_score',
       'is_underperforming_group_issue', 'underperforming_group_score'],
      dtype='object')


In [238]:
issue_names = ['is_label_issue', 'is_outlier_issue', 'is_near_duplicate_issue']

In [242]:
lab.get_issues("near_duplicate").sort_values("near_duplicate_score")

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
374,True,0.000000,[452],0.000000
266,True,0.000000,[258],0.000000
556,True,0.000000,[716],0.000000
452,True,0.000000,[374],0.000000
277,True,0.000000,[311],0.000000
...,...,...,...,...
558,False,0.999539,[],1.365814
651,False,0.999932,[],1.706353
743,False,0.999983,[],1.957707
654,False,0.999994,[],2.145048


In [244]:
percentage = 0.2

filtered_df = lab.get_issues("near_duplicate")

In [246]:
# Calculate the number of rows to drop based on the percentage
num_rows = int(len(filtered_df) * percentage)
num_rows

153

In [248]:
# Sort by 'near_duplicate_score', then group rows by their 'near_duplicate_sets' column
sorted_df = filtered_df.sort_values(by="near_duplicate_score", ascending=True).reset_index(drop=True)
grouped_df = sorted_df.groupby("near_duplicate_sets")

In [249]:
grouped_df

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x2c0296490>
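
Creating the groupby object is lazy, and iterating over groups keyed on `near_duplicate_sets` may fail because each cell holds an array-like of indices, which is not hashable. A possible workaround, sketched below on a toy table, is to build a string key from each row's own index together with the indices in its set. The sketch then keeps the row with the highest label score in each group and treats the rest as extra copies. The values are hypothetical, and the approach assumes the near-duplicate sets are mutually consistent within a group.

```python
import pandas as pd

# Toy stand-in for Datalab near-duplicate results joined with label scores.
# All values are hypothetical.
toy_nd = pd.DataFrame(
    {
        "near_duplicate_sets": [[1], [0], [3, 4], [2, 4], [2, 3], []],
        "label_score": [0.9, 0.4, 0.2, 0.8, 0.5, 0.7],
    }
)

# Build a hashable key per row from its own index plus its near-duplicate set
toy_nd["group_key"] = [
    "-".join(str(i) for i in sorted({idx, *dup_set}))
    for idx, dup_set in toy_nd["near_duplicate_sets"].items()
]

# Within each group, keep the row with the highest label score and drop the rest
keep_idx = toy_nd.groupby("group_key")["label_score"].idxmax()
extra_copies = sorted(set(toy_nd.index) - set(keep_idx))
print(extra_copies)
```

Row 5 has no near duplicates, so it forms its own group and is always kept.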

In [None]:
from typing import List

import pandas as pd


def get_top_fraction_ids(
    lab_results: pd.DataFrame, bool_column_name: str, percentage: float, asc: bool = True
) -> List[int]:
    """
    Return the row indices of datapoints to drop, based on a specified fraction.

    Parameters:
    - lab_results (pd.DataFrame): Issue results from Datalab, e.g. `lab.get_issues()`.
    - bool_column_name (str): Name of the boolean issue column, e.g. "is_outlier_issue".
    - percentage (float): Fraction (between 0 and 1) of the flagged rows to select.
    - asc (bool, optional): If True, rows are sorted by score in ascending order (most severe first);
                            if False, in descending order. Default is True.

    Returns:
    - list: Index labels of the selected rows.
    """
    # Datalab names its columns "is_<issue>_issue" and "<issue>_score"
    issue_name = bool_column_name.removeprefix("is_").removesuffix("_issue")
    score_col_name = f"{issue_name}_score"

    # Keep only the rows flagged with this issue, sorted by score; the original index is preserved
    filtered_df = lab_results[lab_results[bool_column_name]]
    sorted_df = filtered_df.sort_values(by=score_col_name, ascending=asc)

    # For near duplicates, select rows per group so that at least one copy always remains
    if bool_column_name == "is_near_duplicate_issue":
        # Datalab has no cluster ID column, so build a hashable group key from each
        # row's own index together with the indices in its `near_duplicate_sets`
        group_keys = pd.Series(
            [
                "-".join(str(i) for i in sorted({idx, *dup_set}))
                for idx, dup_set in sorted_df["near_duplicate_sets"].items()
            ],
            index=sorted_df.index,
        )
        drop_indices = []
        for _, group_df in sorted_df.groupby(group_keys, sort=False):
            # int() rounds down, so fewer rows than the group size are selected
            # (small groups contribute no rows when the percentage is low)
            group_num_rows = int(len(group_df) * percentage)
            drop_indices.extend(group_df.index[:group_num_rows])
    else:
        # For other issue types, directly select the top fraction of rows based on the score
        num_rows = int(len(sorted_df) * percentage)
        drop_indices = sorted_df.index[:num_rows]

    return list(drop_indices)
