# Datalab

The `Datalab` class is a new cleanlab feature that helps you identify issues in machine learning datasets that may degrade model performance if left unaddressed.

This includes:

- Finding noisy labels
- Detecting outliers
- Spotting near duplicates


In this tutorial, we will walk through the process of using `Datalab` to identify issues in a (toy) dataset.

## Setup

### Installation



`Datalab` has additional dependencies that are not included in the standard installation of cleanlab.

To install everything necessary for this tutorial, run:


In [None]:
# !pip install scikit-learn matplotlib
# !pip install git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]

Then import the following:

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestNeighbors

from scripts.data_generation import create_data, plot_data
from cleanlab import Datalab

### Loading data

We'll load a toy dataset for this tutorial.
The dataset has two numerical features and a label column with three classes.

The true label for each example is determined by the sum of the features and is assigned to one of three bins.
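
The binning rule described above can be sketched in a few lines. This is only an illustration: the bin edges below are hypothetical, and the tutorial's `create_data` helper may use different thresholds.

```python
from bisect import bisect_right

# Hypothetical bin edges; the tutorial's data generator may use other values.
BIN_EDGES = [-1.0, 1.0]


def true_label(x1: float, x2: float) -> int:
    """Return the bin index (0, 1, or 2) for the sum of the two features."""
    return bisect_right(BIN_EDGES, x1 + x2)


points = [(-2.0, 0.5), (0.2, 0.3), (1.5, 0.7)]
labels = [true_label(x1, x2) for x1, x2 in points]
print(labels)  # [0, 1, 2]
```

Two edges split the real line into three intervals, which is why three classes result from a single scalar (the feature sum).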


A dataset is created and stored in a dictionary called `data_dict`. The following variables are extracted for convenience:

- `X_train`: A matrix of features for the training set.
- `noisy_labels`: A vector of noisy labels for the training set (represented as strings).
- `y_train_idx`: A vector of true labels for the training set (represented as integers).
- `noisy_labels_idx`: A vector of noisy labels for the training set (represented as integers).
- `X_out`: A matrix of features for the examples that are manually added as outliers.
  - This also contains a pair of near duplicate examples.
  - The examples are also in `X_train`.

Below, we also print the label accuracy for the training set (the proportion of examples whose noisy label matches the true label), then plot the training features with each set of labels.

Noisy labels are highlighted in red if they do not match the true label.

In [None]:
data_dict = create_data()

X_train, noisy_labels, y_train_idx, noisy_labels_idx, X_out = (
    data_dict[key]
    for key in [
        "X_train", "noisy_labels", "y_train_idx", "noisy_labels_idx", "X_out"
    ]
)

In [None]:
plot_data(X_train, y_train_idx, noisy_labels_idx, X_out)
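
The label accuracy mentioned earlier is simply the fraction of examples whose noisy label agrees with the true label. A minimal sketch on made-up toy labels (the helper name `label_accuracy` and the label values are assumptions for illustration, not part of the tutorial's scripts):

```python
def label_accuracy(given_labels, true_labels) -> float:
    """Fraction of examples whose given label equals the true label."""
    if len(given_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(g == t for g, t in zip(given_labels, true_labels))
    return matches / len(true_labels)


# Toy integer labels; positions 1 and 5 are mislabeled.
toy_true = [0, 0, 1, 1, 2, 2, 2, 1]
toy_noisy = [0, 1, 1, 1, 2, 0, 2, 1]

accuracy = label_accuracy(toy_noisy, toy_true)
print(f"Label accuracy: {accuracy:.2f}")  # Label accuracy: 0.75
```

With the tutorial's NumPy arrays, `np.mean(noisy_labels_idx == y_train_idx)` should give the same quantity.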

Typically, in real-world scenarios, you may not have information about the true labels or the distribution of the features a priori.

There may be exceptions, such as when you have extensive domain knowledge or have carefully verified each data point by hand. However, this is not a guarantee that the data is completely error-free. Some nuances of the data may not be immediately apparent and can be difficult to capture during the data collection process.

Hence, we'll assume that we are unaware of the true labels and the typical distribution of the features.

`Datalab` has several ways of loading the data.
In this case we'll wrap the training features and noisy labels in a dictionary so that we can pass it to `Datalab`.

In [None]:
data = {"X": X_train, "y": noisy_labels}

### Get out-of-sample predicted probabilities from a classifier

To perform certain issue checks, `Datalab` relies on out-of-sample predicted probabilities from a trained model.

For the purposes of this tutorial, we will use a simple logistic regression model and the `cross_val_predict()` function from scikit-learn to generate out-of-sample predicted probabilities for the training set.

This allows us to demonstrate how to use `Datalab` to detect label errors.

In [None]:
model = LogisticRegression()
pred_probs = cross_val_predict(
    estimator=model, X=data["X"], y=data["y"], cv=5, method="predict_proba"
)
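
To see what "out-of-sample" means here, the following sketch reproduces the idea by hand. It is a simplification under stated assumptions: folds are assigned by interleaving indices (scikit-learn uses its own splitting strategy), and the "model" is a stand-in that just predicts the class frequencies of its training labels.

```python
from collections import Counter


def kfold_indices(n_samples: int, n_folds: int):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    for fold in range(n_folds):
        test_idx = [i for i in range(n_samples) if i % n_folds == fold]
        train_idx = [i for i in range(n_samples) if i % n_folds != fold]
        yield train_idx, test_idx


def class_frequencies(train_labels, classes):
    """Stand-in 'model': predict the class frequencies seen during training."""
    counts = Counter(train_labels)
    return [counts[c] / len(train_labels) for c in classes]


labels = ["a", "b", "a", "c", "a", "b", "c", "a", "b", "a"]
classes = sorted(set(labels))  # column order of the probability rows
toy_pred_probs = [None] * len(labels)

for train_idx, test_idx in kfold_indices(len(labels), n_folds=5):
    probs = class_frequencies([labels[i] for i in train_idx], classes)
    for i in test_idx:
        # Example i was excluded from the labels used to compute `probs`.
        toy_pred_probs[i] = probs

print(toy_pred_probs[0])  # [0.5, 0.25, 0.25]
```

Every row is produced by a model that never saw that example's label, so a confident prediction that disagrees with the given label is evidence about the label rather than an artifact of memorization.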

## Instantiate a Datalab object

We'll instantiate a `Datalab` object and provide it with the data object and the name of the label column in that data object.

We'll use the `find_issues` method of `Datalab` to identify issues in the dataset.
This method accepts out-of-sample predicted probabilities and the feature data as input.

In [None]:
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, features=data["X"])

We can review the results of the issue checks using the `report` method of `Datalab`.
This method provides a comprehensive summary of each type of issue found in the dataset.

In [None]:
lab.report()

For more details on a particular issue, there are several methods that can be used to access the results of the issue checks.

For example, we can fetch summary statistics on a particular issue using the `get_summary` method.

In [None]:
lab.get_summary("label")

We can see results of an issue check for all examples using the `get_issues` method.

In [None]:
lab.get_issues("label").head()

Some intermediate results from a particular issue check can be accessed using the `get_info` method.

In [None]:
# Print each entry of the info dictionary, truncating long values for readability
max_len = 50
for key, value in lab.get_info("label").items():
    text = str(value)
    if len(text) > max_len:
        text = text[:max_len] + "..."
    print(f"{key}: {text}")

These are also directly accessible as attributes of the `Datalab` instance.

In [None]:
print("Summary:", lab.issue_summary, sep="\n", end="\n\n")

print("Issues:", lab.issues.head(), sep="\n", end="\n\n")

# print("Info:", lab.info)  # Too long to print

## Incremental issue search

We can call `find_issues` multiple times on a `Datalab` object to find issues incrementally.

This is done via the `issue_types` argument, which accepts a dictionary that maps each issue type to the keyword arguments for its check (an empty dictionary keeps the defaults).

Notice how the call to `find_issues` updates the output of the call to `report`.

In [None]:
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, issue_types={"outlier": {}})
lab.report()

In [None]:
# Run the next check
lab.find_issues(pred_probs=pred_probs, issue_types={"label": {}})
lab.report()

In [None]:
# Previous checks can be overwritten
lab.find_issues(pred_probs=pred_probs, features=data["X"], issue_types={"outlier": {}})
lab.report()

## Customizing hyperparameters in issue search

In this example, we'll use custom hyperparameters for the `"outlier"` and `"near_duplicate"` issue types.

We can also make the `report` show additional information about the issue checks by raising its `verbosity` argument.

In [None]:
lab = Datalab(data, label_name="y")
knn = NearestNeighbors(n_neighbors=3, metric="euclidean")
issue_types = {
    "outlier": {"ood_kwargs": {"params": {"knn": knn}}},
    "near_duplicate": {"metric": "euclidean"},
}

lab.find_issues(
    pred_probs=pred_probs,
    features=data["X"],
    issue_types=issue_types,
)
lab.report(k=10, verbosity=2)

Notice how the number of issues has changed after adjusting the hyperparameters for the check for outliers and near duplicates.
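
To build intuition for why the neighbor count and distance metric matter, here is a minimal sketch of one common kNN-based outlier heuristic: the average Euclidean distance from a point to its `k` nearest neighbors. This is an assumption chosen for illustration and is not necessarily the formula cleanlab uses; in particular, a larger value here means "more isolated", whereas `Datalab` reports quality scores.

```python
import math


def knn_outlier_scores(points, k=3):
    """Average Euclidean distance from each point to its k nearest neighbors.

    Larger values suggest a more isolated (outlier-like) point.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, sorted from nearest to farthest.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Four tightly clustered points and one far-away point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
toy_scores = knn_outlier_scores(toy_points, k=3)
most_isolated = max(range(len(toy_points)), key=toy_scores.__getitem__)
print(most_isolated)  # 4
```

Changing `k` or the metric changes which points count as "near", which in turn can change how many examples end up flagged.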

## Adding a custom IssueManager


All issue checks in `Datalab` are implemented as subclasses of an `IssueManager` class provided by cleanlab.

You may come up with your own issue checks that are not included in cleanlab.

To integrate these into `Datalab`, you must create a subclass of `IssueManager` and add it to a registry of issue managers that is used by `Datalab`.
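
The registry idea can be sketched with a plain class decorator. Everything in this block (`REGISTRY`, `register_check`, `BaseCheck`, `EvenIndexCheck`) is a hypothetical stand-in for illustration and does not reflect how cleanlab implements its own registry.

```python
REGISTRY = {}  # hypothetical registry: issue name -> class


def register_check(cls):
    """Class decorator that records `cls` under its `issue_name`."""
    name = getattr(cls, "issue_name", None)
    if not name:
        raise ValueError(f"{cls.__name__} must define a non-empty issue_name")
    REGISTRY[name] = cls
    return cls  # return the class unchanged so it can be used normally


class BaseCheck:
    issue_name = ""


@register_check
class EvenIndexCheck(BaseCheck):
    issue_name = "even_index"


# A caller can now look up the check by name instead of importing the class.
check_cls = REGISTRY["even_index"]
print(check_cls.__name__)  # EvenIndexCheck
```

Keying the registry on a unique name is what lets a string such as `"even_index"` select which check to run.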

Here, we'll create an arbitrary issue check that checks the divisibility of an example's index in the dataset by 13.

The necessary members to implement in the subclass are:

- A class variable called `issue_name` that acts as a unique identifier for the type of issue.
  - This is further used to derive other necessary attributes of the issue manager.
- A method called `find_issues` that:
  - Gives a quality score for each example in the dataset, where a higher score means the example is less likely to be an issue.
  - Tags each example as an issue or not.
  - Combines these in a dataframe that is assigned to an `issues` attribute of the issue manager.
  - Defines a summarized score for the entire dataset, which sets a `summary` attribute of the issue manager.

In [None]:
from cleanlab.experimental.datalab.issue_manager import IssueManager
from cleanlab.experimental.datalab.factory import register


def scoring_function(idx: int) -> float:
    """Score an index in [0, 1]; multiples of 13 score below 0.5."""
    if idx == 0:
        # Zero is excluded from the divisibility check and gets the highest score
        return 1.0
    rem = idx % 13
    inv_scale = idx // 13
    if rem == 0:
        return 0.5 * (1 - np.exp(-0.1 * (inv_scale - 1)))
    return 1 - 0.49 * (1 - np.exp(-(inv_scale**0.5))) * rem / 13


@register
class SuperstitionIssueManager(IssueManager):
    """A custom issue manager that keeps track of issue indices that
    are divisible by 13.
    """
    description: str = "Examples with indices that are divisible by 13 may be unlucky."  # Optional
    issue_name: str = "superstition"

    def find_issues(self, div=13, **_) -> None:
        ids = self.datalab.issues.index.to_series()
        issues_mask = ids.apply(lambda idx: idx % div == 0 and idx != 0)
        scores = ids.apply(scoring_function)
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": issues_mask,
                self.issue_score_key: scores,
            },
        )
        summary_score = 1 - sum(issues_mask) / len(issues_mask)
        self.summary = self.get_summary(score = summary_score)
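
As a quick sanity check of the tagging and summary logic, the following standalone sketch applies the same divisibility rule to a hypothetical dataset of 30 examples. It does not use `Datalab`; the dataset size and the helper name `is_unlucky` are assumptions for illustration.

```python
def is_unlucky(idx: int, div: int = 13) -> bool:
    """Flag nonzero indices that are divisible by `div`."""
    return idx != 0 and idx % div == 0


n_examples = 30  # hypothetical dataset size
mask = [is_unlucky(i) for i in range(n_examples)]
flagged = [i for i, flag in enumerate(mask) if flag]

# Same summary rule as the issue manager: the fraction of unflagged examples.
summary_score = 1 - sum(mask) / len(mask)

print(flagged)                  # [13, 26]
print(round(summary_score, 4))  # 0.9333
```

A summary score close to 1 therefore indicates that few examples were flagged.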



Once registered, this issue manager performs custom issue checks when `find_issues` is called on a `Datalab` instance.

As the `Datalab` instance already has results from the outlier and near duplicate checks, the custom issue check is performed separately.

In [None]:
lab.find_issues(issue_types={"superstition": {}})
lab.report()

## Save and load Datalab objects

We can save the Datalab object to a file using the `save` method.

A directory path needs to be provided as an argument to the `save` method.

The same directory path can be used to load the Datalab object using the `load` method.

In [None]:
path = "test-lab"
lab.save(path)

new_lab = Datalab.load(path)
new_lab.report()