# Handling Mislabeled Tabular Data to Improve Your XGBoost Model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/examples/blob/master/find_tabular_errors/find_tabular_errors.ipynb)

This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.

At a high level we will:
- Establish a baseline XGBoost model accuracy on the original data.
- Use cleanlab's `find_label_issues()` to highlight hundreds of mislabeled data points. 
- Remove the data with automatically-flagged label issues from the dataset, and then retrain the exact same XGBoost model. **This simple step reduces the error in model predictions by 70%!** The raw difference in accuracy between the two XGBoost models is a whopping **23 percentage points**.
- Introduce a **no-code** solution to efficiently fix the label errors in the dataset, which **reduces the error in model predictions by 76%** relative to the same baseline XGBoost model!
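
The two headline numbers above measure different things: the 23-point gain is an absolute difference in accuracy, while the 70% figure is a relative reduction in the error rate. Here is a minimal sketch of that arithmetic, using the rounded accuracies reported later in this notebook (67.4% and 90.1%). The function name `relative_error_reduction` is ours and is not part of cleanlab.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the baseline error rate that was eliminated."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


acc_original = 0.674  # rounded baseline accuracy reported in this notebook
acc_clean = 0.901     # rounded accuracy after dropping flagged rows

absolute_gain = acc_clean - acc_original
relative_drop = relative_error_reduction(acc_original, acc_clean)

print(f"Absolute accuracy gain: {absolute_gain * 100:.1f} points")
print(f"Relative error reduction: {relative_drop * 100:.1f}%")
```

Because the inputs here are rounded, the result is close to, but may not exactly match, the value the notebook prints from unrounded accuracies.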

## Setup and Data Processing

Let’s take a look at our student grades tabular dataset. The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features, but 20% of the grade labels in this dataset are actually incorrect.

We have access to the true letter grade each student should’ve received, which we use for evaluating both the underlying accuracy of model predictions and how well cleanlab detects which data are mislabeled. These true grades are reserved solely for evaluation; they are never used as features or labels when training the model.

In your noisily-labeled datasets, there will typically be no such ground truth, and therefore addressing label issues is even more important to facilitate proper model evaluation.

In [None]:
!pip install cleanlab==2.2
!pip install xgboost==1.7

from cleanlab.filter import find_label_issues
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

df = pd.read_csv("https://s.cleanlab.ai/student-grades-demo.csv")
df_c = df.copy()

# Transform letter grades and notes to categorical numbers.
# Necessary for XGBoost and cleanlab.
df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])
df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])
df['notes'] = preprocessing.LabelEncoder().fit_transform(df["notes"])
df['notes'] = df['notes'].astype('category')
df.head()

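| | stud_ID | exam_1 | exam_2 | exam_3 | notes | letter_grade | noisy_letter_grade |
|---|---|---|---|---|---|---|---|
| 0 | f48f73 | 53 | 77 | 93 | 5 | 2 | 2 |
| 1 | 0bd4e7 | 81 | 64 | 80 | 2 | 1 | 1 |
| 2 | e1795d | 74 | 88 | 97 | 5 | 1 | 1 |
| 3 | cb9d7a | 61 | 94 | 78 | 5 | 2 | 2 |
| 4 | 9acca4 | 48 | 90 | 91 | 5 | 2 | 2 |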
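
In the table above, the grade columns hold integers rather than letters. scikit-learn's `LabelEncoder` assigns codes by sorting the distinct values, so the mapping can be reproduced by hand. Here is a minimal pure-Python sketch, assuming the grades present are A, B, C, D, and F (the sample list is made up for illustration and the helper name is ours):

```python
def fit_label_codes(values):
    """Map each distinct value to an integer code in sorted order,
    mirroring how sklearn's LabelEncoder orders its classes."""
    classes = sorted(set(values))
    return {value: code for code, value in enumerate(classes)}


# Hypothetical sample of letter grades, not taken from the dataset.
sample_grades = ["C", "B", "B", "C", "A", "F", "D"]

codes = fit_label_codes(sample_grades)
encoded = [codes[g] for g in sample_grades]
decode = {code: value for value, code in codes.items()}

print(codes)
print(encoded)
print([decode[c] for c in encoded])
```

Under that assumption, a `letter_grade` of 2 in the table corresponds to "C" and a value of 1 corresponds to "B".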


## Training XGBoost Classifier

Now that we’ve seen what can be achieved with cleanlab, let’s take a look at how we get there.

First, we need to obtain **out-of-sample** predicted probabilities for all of our data in order to provide the `find_label_issues()` method with the necessary input. To do this, we will use XGBoost, an implementation of gradient-boosting decision trees (GBDT), which are commonly used with tabular data. Specifically, getting the predicted probabilities can be achieved through the use of an `XGBClassifier` model with cross-validation, which can be implemented easily using the `cross_val_predict` function from scikit-learn.
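
To make "out-of-sample" concrete, here is a toy sketch of the idea in pure Python: the data is split into folds, and each row's probabilities come from a model fitted only on the other folds. The one-feature centroid classifier, the fold assignment by index, and the data are all hypothetical stand-ins. In the notebook itself, `cross_val_predict` takes care of the splitting and refitting for the real `XGBClassifier`.

```python
import math


def fit_centroids(xs, ys):
    """Toy 1-D classifier: store the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_proba(centroids, x, n_classes):
    """Softmax over negative squared distance to each class centroid."""
    scores = [-(x - centroids[c]) ** 2 if c in centroids else -math.inf
              for c in range(n_classes)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def out_of_sample_probs(xs, ys, n_classes, k=3):
    """K-fold loop: each row is scored by a model that never trained on it."""
    probs = [None] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        held_out = [i for i in range(len(xs)) if i % k == fold]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            probs[i] = predict_proba(model, xs[i], n_classes)
    return probs


# Hypothetical 1-D data: class 0 clusters near 0, class 1 near 10.
features = [0.1, 0.3, 0.2, 9.8, 10.1, 10.3, 0.0, 9.9, 0.4]
given_labels = [0, 0, 0, 1, 1, 1, 0, 1, 0]

pred_probs = out_of_sample_probs(features, given_labels, n_classes=2)
preds = [row.index(max(row)) for row in pred_probs]
print(preds)
```

Scoring each row with a model that has not seen it matters because a model evaluated on its own training rows can simply memorize a wrong label and report high confidence in it.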

If our tabular data consisted solely of numerical and boolean values, we could potentially utilize a simpler model such as a nearest-neighbor or logistic regression. However, our data includes a notes column, which we will treat as a categorical feature. Fortunately, XGBoost (>v1.6) is able to handle mixed data types (numerical and categorical) by setting the `enable_categorical` parameter to `True`, thereby simplifying the modeling process.

In [None]:
# Train model on noisy labels.
# The notes column was already converted to a categorical dtype above.
data = df.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)
labels = df['noisy_letter_grade']

# XGBoost's support for categorical data is experimental.
# Here we use default hyperparameters for simplicity.
# Get out-of-sample predicted probabilities and check model accuracy.
model = XGBClassifier(tree_method="hist", enable_categorical=True)
pred_probs = cross_val_predict(model, data, labels, method='predict_proba')
preds = np.argmax(pred_probs, axis=1)

acc_original = accuracy_score(labels, preds)
print(f"Accuracy with original data: {round(acc_original*100,1)}%")

Accuracy with original data: 67.4%


Using the default hyperparameters, our cross-validated XGBoost model demonstrates an accuracy of 67.4% when predicting the noisy labels. This level of performance on such a basic task is unsatisfactory. It appears that the presence of 20% label noise is significantly disrupting the model’s ability to accurately predict the labels.

## Find Label Issues

In just one line of code we get a list of possible label issues. It really is that easy! Five of the flagged examples are shown below.

Let’s take a look at a few of the label issues automatically identified in our dataset. In the second row shown (index 159), the student cheated on exam 1 and scored 0, 96, and 90, which should result in a ‘D’ yet was accidentally labeled as a ‘B’. In the fifth row shown (index 885), the student missed homework, resulting in a deduction of 10 points from the overall average. Exam scores of 97, 86, and 68 average to about 83.7, or about 73.7 after the deduction, which should result in a ‘C’ yet was accidentally labeled as an ‘A’.
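
The arithmetic in these two examples can be checked with a short sketch. The equal exam weighting follows the averages quoted above, but the 90/80/70/60 letter cutoffs are an assumption made for illustration, since the notebook does not state the dataset's actual grading rule. The helper `letter_grade` is hypothetical.

```python
def letter_grade(exam_scores, adjustment=0.0):
    """Average the exams, apply a point adjustment, map to a letter.

    The 90/80/70/60 cutoffs are an assumption made for illustration;
    the notebook does not state the dataset's actual grading rule.
    """
    overall = sum(exam_scores) / len(exam_scores) + adjustment
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if overall >= cutoff:
            return overall, letter
    return overall, "F"


# Student who cheated on exam 1 (already recorded as 0 points).
cheated = letter_grade([0, 96, 90])
# Student who missed homework frequently: 10-point deduction.
missed_hw = letter_grade([97, 86, 68], adjustment=-10)

print(f"{cheated[0]:.1f} -> {cheated[1]}")
print(f"{missed_hw[0]:.1f} -> {missed_hw[1]}")
```

Under these assumed cutoffs, both results agree with the true grades described above ('D' and 'C') rather than with the noisy labels.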

**Note: `find_label_issues` is able to determine that the given label is incorrect, without ever seeing the ground truth label `letter_grade`.**

In [None]:
# Returns list of indices of label issues, sorted by self_confidence.
issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence')
# Filter original data to show some issues.
issues_df = df_c.iloc[issue_idx]
# Show a few good examples.
issues_df.iloc[13:18]

| | stud_ID | exam_1 | exam_2 | exam_3 | notes | letter_grade | noisy_letter_grade |
|---|---|---|---|---|---|---|---|
| 23 | 5eef2c | 90 | 83 | 51 | | C | A |
| 159 | b3a1a5 | 0 | 96 | 90 | cheated on exam, gets 0pts | D | B |
| 301 | 4591b4 | 66 | 72 | 83 | missed homework frequently -10 | D | B |
| 71 | 38a6ec | 88 | 67 | 74 | | C | A |
| 885 | f00c02 | 97 | 86 | 68 | missed homework frequently -10 | C | A |
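
For intuition about how labels can be flagged from predicted probabilities alone, here is a simplified sketch of the confident-learning idea. It is not cleanlab's implementation, and `find_label_issues` does considerably more. The sketch computes each example's self-confidence (the predicted probability of its given label), sets a per-class threshold equal to the average predicted probability of that class among examples carrying that label, and flags rows where a different class clears its threshold. All names and data below are hypothetical, and the sketch assumes every class appears at least once in the labels.

```python
def self_confidence(pred_probs, labels):
    """Model's predicted probability for each example's given label."""
    return [row[y] for row, y in zip(pred_probs, labels)]


def class_thresholds(pred_probs, labels, n_classes):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    thresholds = []
    for k in range(n_classes):
        vals = [row[k] for row, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(vals) / len(vals))
    return thresholds


def flag_issues(pred_probs, labels, n_classes):
    """Flag rows whose most likely confidently-predicted class differs
    from the given label; return indices, least self-confident first."""
    thresholds = class_thresholds(pred_probs, labels, n_classes)
    conf = self_confidence(pred_probs, labels)
    flagged = []
    for i, (row, y) in enumerate(zip(pred_probs, labels)):
        confident = [k for k in range(n_classes) if row[k] >= thresholds[k]]
        if confident and max(confident, key=lambda k: row[k]) != y:
            flagged.append(i)
    return sorted(flagged, key=lambda i: conf[i])


# Hypothetical out-of-sample probabilities for 6 examples, 2 classes.
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.2, 0.8], [0.15, 0.85], [0.7, 0.3]]
toy_labels = [0, 0, 1, 1, 0, 1]  # rows 4 and 5 look mislabeled

toy_issue_idx = flag_issues(toy_probs, toy_labels, n_classes=2)
print(toy_issue_idx)
```

Using per-class thresholds instead of a single global cutoff lets the rule adapt to classes the model is systematically more or less confident about.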


## How'd We Do?

Let's go a step further and see how cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the label errors identified by cleanlab and the true label errors, we see that cleanlab was able to identify 80% of the label errors correctly (based on predictions from a model that is only 67% accurate). 

In [None]:
# Computing percentage of true errors identified. 
true_error_idx = df[df.letter_grade != df.noisy_letter_grade].index.values
cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)
print(f"Percentage of errors found: {round(cl_acc*100,1)}%")

Percentage of errors found: 79.8%
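
The percentage above is a recall: the share of truly mislabeled rows that were flagged. A complementary number is precision, the share of flagged rows that are truly mislabeled. Here is a minimal sketch of both using plain sets. The indices are hypothetical and chosen only to make the arithmetic easy to follow, and `detection_scores` is our own helper.

```python
def detection_scores(true_errors, flagged):
    """Recall: share of true errors that were flagged.
    Precision: share of flagged rows that are true errors."""
    true_errors, flagged = set(true_errors), set(flagged)
    hits = true_errors & flagged
    recall = len(hits) / len(true_errors)
    precision = len(hits) / len(flagged)
    return precision, recall


# Hypothetical indices, chosen only to make the arithmetic easy to follow.
toy_true_errors = [2, 5, 7, 11, 13]
toy_flagged = [2, 5, 7, 11, 20, 21]

precision, recall = detection_scores(toy_true_errors, toy_flagged)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In the notebook, the same function could be called with `true_error_idx` and `issue_idx`, though we have not run that here.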


## Retraining for a More Robust Model

Now that we have the indices of potential label errors let’s remove them from our data, retrain our model, and see what performance improvement we can gain.

Keep in mind our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, which only achieved a cross-validation accuracy of 67%.

Let’s use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`.

In [None]:
# Remove the label errors found by cleanlab.
data = df.drop(issue_idx)
labels = data['noisy_letter_grade']
data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)

# Train a more robust classifier with less erroneous data.
model = XGBClassifier(tree_method="hist", enable_categorical=True)
pred_probs = cross_val_predict(model, data, labels, method='predict_proba')
preds = np.argmax(pred_probs, axis=1)

acc_clean = accuracy_score(labels, preds)
print(f"Accuracy with original data: {round(acc_original*100, 1)}%")
print(f"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%")

# Compute reduction in error.
err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)
print(f"Reduction in error: {round(err*100,1)}%")

Accuracy with original data: 67.4%
Accuracy with errors found by cleanlab removed: 90.1%
Reduction in error: 69.7%


After removing the suspected label issues, our model's new cross-validation accuracy is now 90%, which means we **reduced the error-rate of the model by 70%** (the original model had 67% accuracy). 

**Note: throughout this entire process, we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! This improvement is strictly coming from increasing the quality of our data which leaves room for additional optimizations on the modeling side.**

## Fixing the Label Errors

Instead of just dropping the potential label issues, the smarter (yet more complex) way to increase our data quality would be to correct the label issues by hand. This simultaneously removes a noisy data point and adds an accurate one, but making such corrections manually is cumbersome.
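
The difference between the two strategies can be sketched with plain Python lists: dropping shrinks the dataset, while fixing keeps every row and only overwrites the label. The records, corrections, and helper names below are hypothetical and are not taken from the dataset or from any cleanlab API.

```python
def drop_flagged(rows, flagged):
    """Remove flagged rows entirely (the approach used earlier)."""
    flagged = set(flagged)
    return [row for i, row in enumerate(rows) if i not in flagged]


def fix_flagged(rows, corrections):
    """Keep every row, overwriting the label where a correction exists."""
    return [
        {**row, "label": corrections.get(i, row["label"])}
        for i, row in enumerate(rows)
    ]


# Hypothetical records: exam average and a possibly noisy label.
rows = [
    {"avg": 92.0, "label": "A"},
    {"avg": 62.0, "label": "B"},  # flagged: the scores suggest "D"
    {"avg": 85.0, "label": "B"},
    {"avg": 73.7, "label": "A"},  # flagged: the scores suggest "C"
]
corrections = {1: "D", 3: "C"}  # reviewed labels keyed by row position

dropped = drop_flagged(rows, corrections.keys())
fixed = fix_flagged(rows, corrections)
print(len(dropped), len(fixed))
print([row["label"] for row in fixed])
```

Both helpers return new lists and leave `rows` untouched, so the original labels remain available for comparison.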

[Cleanlab Studio](https://cleanlab.ai/studio) provides a user-friendly interface to make these changes without writing a single line of code. Simply upload your dataset and Studio computes everything we just did for you, so you can spend more time fixing the issues instead of just finding them.

Here, we use the auto-fix feature on this dataset and replace the Studio-found label issues with the automatically-suggested label. From data upload to data export, the whole process took only 5 minutes without having to know any ML.  


In [None]:
# Get the export produced by Cleanlab Studio
clean_df = pd.read_csv("https://s.cleanlab.ai/student-grades-demo-studio-export.csv")

# Same pre-processing as above.
clean_df['cleanlab_suggested_label'] = preprocessing.LabelEncoder().fit_transform(clean_df['cleanlab_suggested_label'])
clean_df['notes'] = preprocessing.LabelEncoder().fit_transform(clean_df["notes"])
clean_df['notes'] = clean_df['notes'].astype('category')

# Train a more robust classifier with less erroneous data.
labels = clean_df['cleanlab_suggested_label']
data = clean_df[['exam_1','exam_2','exam_3','notes']]
model = XGBClassifier(tree_method="hist", enable_categorical=True)
preds = cross_val_predict(model, data, labels, method='predict')

acc_studio = accuracy_score(labels, preds)
print(f"Accuracy with original data: {round(acc_original*100, 1)}%")
print(f"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%")
print(f"Accuracy with errors found by Studio fixed: {round(acc_studio*100, 1)}%")

# Compute total reduction in error.
tot_err = ((1-acc_original)-(1-acc_studio))/(1-acc_original)
print(f"Total reduction in error: {round(tot_err*100,1)}%")

Accuracy with original data: 67.4%
Accuracy with errors found by cleanlab removed: 90.1%
Accuracy with errors found by Studio fixed: 92.1%
Total reduction in error: 75.9%


## Conclusion

Cleanlab is an incredibly powerful and efficient tool for identifying and addressing label errors in your data that can be used to improve any ML model (not just XGBoost) for most types of data (not just tabular, but also images, text, audio, etc). By implementing just a few lines of open-source code, cleanlab can automatically detect and help you prioritize many potential issues within your data. With this insight, you'll be able to improve the quality of your data and ultimately achieve better model performance.

For the student grades dataset, we found that **simply dropping identified label errors and retraining the model resulted in a 70% reduction in prediction error** on our classification problem (with accuracy improving from 67% to 90%). Going one step further, we used Cleanlab Studio to automatically fix the incorrect labels, resulting in a 76% reduction in prediction error (with accuracy improving from 67% to 92%).

By using open-source libraries for data-centric AI like [cleanlab](https://github.com/cleanlab/cleanlab) to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.


## Next Steps

We would like to extend a special thanks to all of our open-source contributors. Your support and engagement have played a crucial role in the development and success of cleanlab. If you are interested in becoming a contributor to the cleanlab project and helping us build the standard open-source library for data-centric AI, please visit our [GitHub page](https://github.com/cleanlab/cleanlab) and [contributing guide](https://github.com/cleanlab/cleanlab/blob/master/CONTRIBUTING.md).

If you are interested in using cleanlab to improve your data-centric techniques and ML tasks, our comprehensive [tutorials](https://docs.cleanlab.ai/) provide a simple and efficient way to get started. In just 5 minutes, you can learn how to apply cleanlab to a variety of data types (text, tabular, image, audio, etc.) and ML tasks (classification, entity recognition, image/document tagging, etc.).

We would love to connect with you, too!
- Join our [Cleanlab Community Slack](https://cleanlab.ai/slack/)
- Follow us on [LinkedIn](https://www.linkedin.com/company/cleanlab/)
- Follow us on [Twitter](https://twitter.com/CleanlabAI)

Bonus: Learn how cleanlab can also help improve training data in [Kaggle](https://www.kaggle.com/code/ulytkch/cleanlab-data-centric-ai-example-0-7703-python/notebook) competitions.
