# Miscellaneous workflows with Datalab

This notebook provides a comprehensive guide on using `Datalab` to perform various data quality checks and analyses. We cover multiple workflows to demonstrate the flexibility and power of `Datalab`, focusing on practical examples that address less commonly shown issue checks. Each section includes step-by-step instructions and code examples to help you implement these workflows on your own datasets.

## Find Null Values in a Dataset

In this section, we will demonstrate how to use `Datalab` to identify and visualize null values in a dataset.
This tutorial will guide you through loading a dataset, detecting null values, and creating a clear visualization of the results.

### 1. Load the Dataset

First, we will load the dataset into a Pandas DataFrame. For simplicity, we will use a dataset in TSV (tab-separated values) format.
Some care is needed when loading the dataset to ensure that the data is correctly parsed.


In [None]:
# Define the dataset as a multi-line string
dataset_tsv = """
Age	Gender	Location	Annual_Spending	Number_of_Transactions	Last_Purchase_Date
56.0	Other	Rural	4099.62	3	2024-01-03
NaN	Female	Rural	6421.16	5	NaT
46.0	Male	Suburban	5436.55	3	2024-02-26
32.0	Female	Rural	4046.66	3	2024-03-23
60.0	Female	Suburban	3467.67	6	2024-03-01
25.0	Female	Suburban	4757.37	4	2024-01-03
38.0	Female	Rural	4199.53	6	2024-01-03
56.0	Male	Suburban	4991.71	6	2024-04-03
NaN
NaN	Male	Rural	4655.82	1	NaT
40.0	Female	Rural	5584.02	7	2024-03-29
28.0	Female	Urban	3102.32	2	2024-04-07
28.0	Male	Rural	6637.99	11	2024-04-08
NaN	Male	Urban	9167.47	4	2024-01-02
NaN	Male	Rural	6790.46	3	NaT
NaN	Other	Rural	5327.96	8	2024-01-03
"""

# Import necessary libraries
from io import StringIO
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv(StringIO(dataset_tsv), sep='\t', parse_dates=["Last_Purchase_Date"])

# Display the original DataFrame
display(df)
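
Before running any checks, it can help to confirm that the `NaN` and `NaT` entries were parsed as missing values. The following is a minimal sketch using plain pandas on a small toy frame; the name `toy_df` and its values are hypothetical and only stand in for the `df` loaded above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing age and one missing date
toy_df = pd.DataFrame(
    {
        "Age": [56.0, np.nan, 46.0],
        "Last_Purchase_Date": pd.to_datetime(["2024-01-03", None, "2024-02-26"]),
    }
)

# Number of missing entries per column (NaN and NaT both count as missing)
null_counts = toy_df.isna().sum()

# Index labels of rows in which every column is missing
all_null_rows = toy_df.index[toy_df.isna().all(axis=1)].tolist()

print(null_counts)
print(all_null_rows)
```

The same two expressions can be applied to `df` directly to see how many entries are missing in each column.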

### 2. Encode Categorical Values

The `features` argument of `find_issues` generally expects a numerical array.
Therefore, any categorical values must be encoded numerically before the dataset is passed to `find_issues`.
However, some encoding strategies discard the original null values (for example, by assigning them a regular category code), which would hide them from the null check.

Here's a strategy to encode categorical columns while keeping the original DataFrame structure intact:

In [None]:
# Define a function to encode categorical columns
def encode_categorical_columns(df, columns, drop=True, inplace=False):
    if not inplace:
        df = df.copy()
    for column in columns:
        # Collect the unique non-null categories; NaN is left out of the mapping so it stays NaN
        categories = df[column].dropna().unique()
        
        # Create a mapping from categories to numbers
        category_to_number = {category: idx for idx, category in enumerate(categories)}
        
        # Apply the mapping to the column
        df[column + '_encoded'] = df[column].map(category_to_number)

    if drop:
        df = df.drop(columns=columns)

    return df

# Encode the categorical columns
columns_to_encode = ["Gender", "Location"]
encoded_df = encode_categorical_columns(df, columns=columns_to_encode)

# Display the encoded DataFrame
display(encoded_df)
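
To illustrate why the choice of encoding matters, the sketch below compares two ways of encoding a small, made-up `gender` series. `pd.factorize` replaces missing values with the sentinel code `-1` by default, so the nulls disappear from the encoded result, whereas a dictionary-based `map` (the approach used above) keeps them as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
gender = pd.Series(["Other", np.nan, "Male", "Other"])

# pd.factorize assigns the sentinel -1 to missing values by default,
# so the encoded result no longer contains any NaN
codes, uniques = pd.factorize(gender)

# Mapping through a dictionary leaves entries without a key (including NaN) as NaN
mapping = {category: idx for idx, category in enumerate(gender.dropna().unique())}
mapped = gender.map(mapping)

print(codes)
print(mapped)
```

Only the second encoding would let a later null check see the missing entry.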

### 3. Initialize Datalab

Next, we will initialize `Datalab` with the original DataFrame. This will allow us to use the `Datalab` methods to find all kinds of issues in our dataset.

In [None]:
# Import the Datalab class from cleanlab
from cleanlab import Datalab

# Initialize Datalab with the original DataFrame
lab = Datalab(data=df)

### 4. Detect Null Values
We will use the `find_issues` method of `Datalab` to detect null values in our dataset.

In [None]:
# Detect issues in the dataset, focusing on null values
lab.find_issues(features=encoded_df, issue_types={"null": {}})

# Display the identified issues
null_issues = lab.get_issues("null")
display(null_issues)


### 5. Sort the Dataset by Null Issues

To better understand the impact of null values, we will sort the original DataFrame by the `null_score` from the `null_issues` DataFrame.

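This score indicates the severity of null issues for each row. As with other `Datalab` issue scores, lower values indicate more severe issues, so sorting in ascending order places the rows with the most missing data first.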

In [None]:
# Sort the issues DataFrame by 'null_score' and get the sorted indices
sorted_indices = (
    null_issues
    .sort_values("null_score")
    .index
)

# Sort the original DataFrame based on the sorted indices from the issues DataFrame
sorted_df = df.loc[sorted_indices]
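
As a rough intuition for this ordering, the sketch below computes a row-wise completeness value (the fraction of non-null entries) on a toy frame with made-up values. This is an illustrative assumption about how such a score behaves, not a statement of how `Datalab` computes `null_score` internally.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 0 is complete, row 1 is half null, row 2 is entirely null
toy_df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 3.0, np.nan]})

# Fraction of non-null entries per row: 1.0 = complete, 0.0 = entirely null
completeness = 1 - toy_df.isna().mean(axis=1)

# Ascending sort puts the rows with the most missing data first
order = completeness.sort_values().index.tolist()

print(completeness.tolist())
print(order)
```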


### 6. (Optional) Visualize the Results

Finally, we will create a nicely formatted DataFrame that highlights the null values and the issues detected by `Datalab`.

We will use Pandas' styler to add custom styles for better visualization.

In [None]:
# Create a column of separators
separator = pd.DataFrame([''] * len(sorted_df), columns=['|'])

# Join the sorted DataFrame, separator, and issues DataFrame
combined_df = pd.concat([sorted_df, separator, null_issues], axis=1)

# Define functions to highlight null values and Datalab columns
def highlight_null_values(val):
    if pd.isnull(val):
        return 'background-color: yellow'
    return ''

def highlight_datalab_columns(column):
    return 'background-color: lightblue'

def highlight_is_null_issue(val):
    if val:
        return 'background-color: orange'
    return ''

# Apply styles to the combined DataFrame
styled_df = (
    combined_df
    .style.map(highlight_null_values) # Highlight null and NaT values
    .map(highlight_datalab_columns, subset=null_issues.columns) # Highlight columns provided by Datalab
    .map(highlight_is_null_issue, subset=['is_null_issue']) # Highlight rows with null issues
)

# Display the styled DataFrame
display(styled_df)

### 7. Next Steps

This section focused on identifying null values, but `Datalab` can detect a variety of other issues such as (near) duplicates, outliers, label issues and more. Explore our additional tutorials to learn more about how Datalab can enhance your data quality workflows.

If you want to learn more about the null issue type, you can read about it [here](issue_type_description.html#null-issue).

## Find Underperforming Groups in a Dataset

In this section, we will demonstrate how to use `Datalab` to identify underperforming groups in a dataset. This tutorial will guide you through generating a synthetic dataset, training a classifier, and identifying groups that are underperforming.


### 1. Generate a Synthetic Dataset

First, we will generate a synthetic dataset with blobs. This dataset will include some noisy labels in one of the blobs.

In [None]:
from sklearn.datasets import make_blobs
import numpy as np

# Generate synthetic data with blobs
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42, cluster_std=1.0, shuffle=False)

# Add noise to the labels
n_noisy_labels = 30
y[:n_noisy_labels] = np.random.randint(0, 2, n_noisy_labels)

### 2. Train a Classifier and Obtain Predicted Probabilities
Next, we will train a logistic regression classifier and obtain out-of-sample predicted probabilities for the dataset using cross-validation.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Obtain predicted probabilities using cross-validation
clf = LogisticRegression()
pred_probs = cross_val_predict(clf, X, y, cv=3, method="predict_proba")
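
A quick sanity check on the predicted probabilities can catch shape mismatches early. The sketch below repeats the cross-validation step on a separate toy dataset (the names `X_demo`, `y_demo`, and `probs` are hypothetical) and checks that there is one row per example, one column per class, and that each row sums to 1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data: 90 points in 3 blobs (hypothetical stand-in for the dataset above)
X_demo, y_demo = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)

# Each row of `probs` is predicted by a model that did not see that row during training
probs = cross_val_predict(
    LogisticRegression(), X_demo, y_demo, cv=3, method="predict_proba"
)

# One row per example, one column per class, rows sum to 1
print(probs.shape)
print(np.allclose(probs.sum(axis=1), 1.0))
```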


### 3. (Optional) Cluster the Data

To identify underperforming groups, clustering can be useful.
In this step, we optionally use KMeans clustering to find clusters within the data.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

# Function to use in GridSearchCV for silhouette score
def silhouette_scorer(estimator, X):
    cluster_labels = estimator.fit_predict(X)
    return silhouette_score(X, cluster_labels)

# Use GridSearchCV to determine the optimal number of clusters
param_grid = {"n_clusters": range(2, 10)}
grid_search = GridSearchCV(KMeans(random_state=0), param_grid, cv=3, scoring=silhouette_scorer)
grid_search.fit(X)

# Get the best estimator and predict clusters
best_kmeans = grid_search.best_estimator_
cluster_ids = best_kmeans.fit_predict(X)
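
If the cross-validated grid search feels heavy for a small dataset, a simpler alternative is to loop over candidate cluster counts and keep the one with the highest silhouette score on the full data. The sketch below is one such variant under that assumption; it is not what the cell above does, and the names `X_demo`, `scores`, and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (hypothetical stand-in for X)
X_demo, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# Silhouette score for each candidate number of clusters
scores = {}
for n_clusters in range(2, 10):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, labels)

# Keep the cluster count with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```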


### 4. Identify Underperforming Groups with Datalab

We will use `Datalab` to find underperforming groups in the dataset based on the predicted probabilities and optionally the cluster assignments.

In [None]:
from cleanlab import Datalab

# Initialize Datalab with the dataset
lab = Datalab(data={"X": X, "y": y}, label_name="y", task="classification")

# Find issues related to underperforming groups, optionally using cluster_ids
lab.find_issues(
    # features=X,  # Uncomment this line if 'cluster_ids' is not provided to allow Datalab to run clustering automatically.
    pred_probs=pred_probs,
    issue_types={
        "underperforming_group": {
            "threshold": 0.75,          # Set a custom threshold for identifying underperforming groups.
                                        # The default threshold is lower, optimized for higher precision (fewer false positives),
                                        # but for this toy example, a higher threshold increases sensitivity to underperforming groups.
            "cluster_ids": cluster_ids  # Optional: Provide cluster IDs if clustering is used.
                                        # If not provided, Datalab will automatically run clustering under the hood.
                                        # In that case, you need to provide the 'features' array as an additional argument.
            },
    },
)

# Collect the identified issues
underperforming_group_issues = lab.get_issues("underperforming_group").query("is_underperforming_group_issue")

# Display the issues along with given and predicted labels
display(underperforming_group_issues.join(pd.DataFrame({"given_label": y, "predicted_label": pred_probs.argmax(axis=1)})))
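
To build intuition for what "underperforming" means here, the sketch below compares per-cluster accuracy with overall accuracy on a few made-up labels. It is only an accuracy-based illustration with hypothetical arrays, not a reproduction of how `Datalab` scores groups internally.

```python
import numpy as np

# Hypothetical given labels, predicted labels, and cluster assignments
given = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 0, 1, 0, 0, 1])
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Accuracy over the whole dataset
overall_accuracy = float((given == predicted).mean())

# Accuracy within each cluster
per_cluster_accuracy = {
    int(c): float((given[clusters == c] == predicted[clusters == c]).mean())
    for c in np.unique(clusters)
}

# The cluster with the lowest accuracy is the candidate underperforming group
worst_cluster = min(per_cluster_accuracy, key=per_cluster_accuracy.get)

print(overall_accuracy, per_cluster_accuracy, worst_cluster)
```

A cluster whose accuracy falls well below the overall accuracy is the kind of group this check is meant to surface.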


### 5. (Optional) Visualize the Results

Finally, we will optionally visualize the dataset, highlighting the underperforming groups identified by `Datalab`.


In [None]:
import matplotlib.pyplot as plt

# Plot the original data points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="tab10")

# Highlight the underperforming group (if any issues are detected)
if not underperforming_group_issues.empty:    
    plt.scatter(
        X[underperforming_group_issues.index, 0], X[underperforming_group_issues.index, 1], 
        s=100, facecolors='none', edgecolors='r', alpha=0.3, label="Underperforming Group", linewidths=2.0
    )
else:
    print("No underperforming group issues detected.")

# Add title and legend
plt.title("Underperforming Groups in the Dataset")
plt.legend()
plt.show()


If you want to learn more about the underperforming group issue type, you can read about it [here](issue_type_description.html#underperforming-group-issue).

## Perform Data Valuation on a Dataset

In this section, we will show how to use `Datalab` to perform data valuation on a dataset. Data valuation estimates how much each data point contributes to a model's performance, which helps you identify the more and less valuable data points for your machine learning models.

We will use a text dataset for this example, but in principle, this can be applied to any dataset.

### 1. Load and Prepare the Dataset
We will use a subset of the 20 Newsgroups dataset, which is a collection of newsgroup documents suitable for text classification tasks.
For demonstration purposes, we'll classify documents from two categories: "alt.atheism" and "sci.space".

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Load the 20 Newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'], remove=('headers', 'footers', 'quotes'))

# Create a DataFrame with the text data and labels
df_text = pd.DataFrame({"Text": newsgroups_train.data, "Label": newsgroups_train.target})
df_text["Label"] = df_text["Label"].map({i: category for (i, category) in enumerate(newsgroups_train.target_names)})

# Display the first few samples
df_text.head()

### 2. Vectorize the Text Data
We will use a `TfidfVectorizer` to convert the text data into a numerical format suitable for machine learning models.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Transform the text data into a feature matrix
X_vectorized = vectorizer.fit_transform(df_text["Text"])

# Convert the sparse matrix to a dense matrix
X = X_vectorized.toarray()
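
TF-IDF matrices are typically very sparse, so converting them to a dense array can use considerably more memory on large corpora. The sketch below, using three made-up documents, shows how to inspect the shape and density of the sparse matrix before densifying it; the names `docs`, `X_sparse`, and `density` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny made-up documents
docs = ["the rocket reached orbit", "the debate about belief", "orbit of the moon"]

# Each row is a document, each column a vocabulary term
X_sparse = TfidfVectorizer().fit_transform(docs)
n_docs, n_terms = X_sparse.shape

# Fraction of entries that are actually stored (non-zero)
density = X_sparse.nnz / (n_docs * n_terms)

# The dense version stores every entry, including all zeros
X_dense = X_sparse.toarray()

print(X_sparse.shape, round(density, 3), X_dense.shape)
```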

### 3. Perform Data Valuation with Datalab

Next, we will initialize `Datalab` and perform data valuation to assess the value of each data point in the dataset.

In [None]:
from cleanlab import Datalab

# Initialize Datalab with the dataset
lab = Datalab(data=df_text, label_name="Label", task="classification")

# Perform data valuation
lab.find_issues(features=X, issue_types={"data_valuation": {}})

# Collect the identified issues
data_valuation_issues = lab.get_issues("data_valuation")

# Display the data valuation issues
display(data_valuation_issues)


### 4. (Optional) Visualize Data Valuation Scores
Finally, we will visualize the data valuation scores using a histogram to understand the distribution of scores across different labels.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Prepare the data for plotting a histogram
plot_data = (
    data_valuation_issues
    # Optionally, add a 'given_label' column to distinguish between labels in the histogram
    .join(pd.DataFrame({"given_label": df_text["Label"]}))
)

# Plot histograms of data valuation scores for each label
sns.histplot(
    data=plot_data,
    hue="given_label",  # Comment out if no labels should be used in the visualization
    x="data_valuation_score",
    bins=15,
    element="step",
    multiple="stack",  # Stack histograms for different labels
)

# Set y-axis to a logarithmic scale for better visualization of wide-ranging counts
plt.yscale("log")
plt.title("Data Valuation Scores by Label")
plt.xlabel("Data Valuation Score")
plt.ylabel("Count (log scale)")
plt.show()


If you want to learn more about the data valuation issue type, you can read about it [here](issue_type_description.html#data-valuation-issue).

## Accelerate Multiple Issue Checks with Pre-computed kNN Graphs

In this section, we will demonstrate how to use pre-computed k-nearest neighbors (kNN) graphs to accelerate the identification of multiple issues in your dataset using `Datalab`. Building the graph yourself lets you use a fast (possibly approximate) nearest-neighbor library, and the neighbor-based issue checks can then work from that graph.

While we use a toy dataset for demonstration, these steps can be applied to any dataset.

### 1. Load and Prepare Your Dataset

First, load your dataset. For this example, we'll generate a synthetic dataset, but you should replace this with your own dataset loading process.

In [None]:
import numpy as np
from sklearn.datasets import make_classification

# Replace this section with your own dataset loading
# For demonstration, we create a synthetic classification dataset
X, y = make_classification(
    n_samples=5000,
    n_features=5,
    n_informative=5,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    flip_y=0.02,
    class_sep=2.0,
    shuffle=False,
    random_state=None,
)

# Example: Add a near-duplicate example to the dataset (a copy of the previous point plus tiny noise)
X[-1] = X[-2] + np.random.rand(5) * 0.001


### 2. Compute kNN Graph
We will compute the kNN graph using FAISS, a library for efficient similarity search. This step involves creating a kNN graph that represents the nearest neighbors for each point in your dataset.

In [None]:
import faiss
import numpy as np

# Faiss uses single precision, so we need to convert the data type
X_faiss = np.float32(X)

# Normalize the vectors for inner product similarity (effectively cosine similarity)
faiss.normalize_L2(X_faiss)

# Build the index using FAISS
index = faiss.index_factory(X_faiss.shape[1], "HNSW32,Flat", faiss.METRIC_INNER_PRODUCT)

# Add the dataset to the index
index.add(X_faiss)

# Perform the search to find k-nearest neighbors
k = 10  # Number of neighbors to consider
D, I = index.search(X_faiss, k + 1)  # Include the point itself during search

# Remove the first column (self-distances)
D, I = D[:, 1:], I[:, 1:]

# Convert cosine similarity to cosine distance
np.clip(1 - D, a_min=0, a_max=None, out=D)

# Create the kNN graph
from scipy.sparse import csr_matrix

def create_knn_graph(distances: np.ndarray, indices: np.ndarray) -> csr_matrix:
    """
    Create a K-nearest neighbors (KNN) graph in CSR format from provided distances and indices.

    Parameters:
    distances (np.ndarray): 2D array of shape (n_samples, n_neighbors) containing distances to nearest neighbors.
    indices (np.ndarray): 2D array of shape (n_samples, n_neighbors) containing indices of nearest neighbors.

    Returns:
    scipy.sparse.csr_matrix: KNN graph in CSR format.
    """
    assert distances.shape == indices.shape, "distances and indices must have the same shape"
    
    n_samples, n_neighbors = distances.shape

    # Convert to 1D arrays for CSR matrix creation
    indices_1d = indices.ravel()
    distances_1d = distances.ravel()
    indptr = np.arange(0, n_samples * n_neighbors + 1, n_neighbors)

    # Create the CSR matrix
    return csr_matrix((distances_1d, indices_1d, indptr), shape=(n_samples, n_samples))

knn_graph = create_knn_graph(D, I)

# Ensure the kNN graph is sorted by row values
from sklearn.neighbors import sort_graph_by_row_values
sort_graph_by_row_values(knn_graph, copy=False, warn_when_not_sorted=False)
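
To make the CSR layout concrete, the sketch below builds the graph for three made-up points on a line (at positions 0, 1, and 3) with two neighbors each. It follows the same construction as `create_knn_graph` above: row `i` of the matrix stores the distances from point `i` to its neighbors, and `indptr` marks where each row's entries begin.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Neighbors of each point, nearest first (self excluded), and their distances
indices = np.array([[1, 2], [0, 2], [1, 0]])
distances = np.array([[1.0, 3.0], [1.0, 2.0], [2.0, 3.0]])

n_samples, n_neighbors = distances.shape

# Every row holds exactly n_neighbors entries, so row i starts at i * n_neighbors
indptr = np.arange(0, n_samples * n_neighbors + 1, n_neighbors)

toy_graph = csr_matrix(
    (distances.ravel(), indices.ravel(), indptr), shape=(n_samples, n_samples)
)

print(indptr)
print(toy_graph.toarray())
```

The diagonal stays empty because each point is excluded from its own neighbor list.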


### 3. Train a Classifier and Obtain Predicted Probabilities

Train a classifier on your dataset and obtain predicted probabilities for the dataset. This step is necessary to identify label-related issues.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Obtain predicted probabilities using cross-validation
clf = LogisticRegression()
pred_probs = cross_val_predict(clf, X, y, cv=3, method="predict_proba")


### 4. Identify Data Issues Using Datalab
Use the pre-computed kNN graph and predicted probabilities to find issues in the dataset using `Datalab`.

In [None]:
from cleanlab import Datalab

# Initialize Datalab with the dataset
lab = Datalab(data={"X": X, "y": y}, label_name="y", task="classification")

# Perform issue detection using the kNN graph and predicted probabilities, when possible
lab.find_issues(knn_graph=knn_graph, pred_probs=pred_probs, features=X)

# Collect the identified issues and a summary
issues = lab.get_issues()
issue_summary = lab.get_issue_summary()

# Display the issues and summary
display(issue_summary)
display(issues)


#### Explanation:

**Creating the kNN Graph:**

- Compute the kNN graph using FAISS or another library, ensuring the self-points (points referring to themselves) are omitted from the neighbors.
  - Some distance kernels or search algorithms (like those in FAISS) may return negative distances or suffer from numerical instability when comparing
    points that are extremely close to each other. This can lead to incorrect results when constructing the kNN graph.
  - **Note**: kNN graphs are generally poorly suited for detecting exact duplicates, especially when the number of exact duplicates exceeds the number of requested neighbors. The strengths of this data structure lie in the assumption that data points are similar but not identical, allowing efficient similarity searches and proximity-based analyses.
  - If you are comfortable with exploring non-public API functions in the library, you can use the following helper function to ensure that exact duplicate sets are correctly represented in the kNN graph. Please note, this function is not officially supported and is not part of the public API:

    ```python
    from cleanlab.internal.neighbor.knn_graph import correct_knn_graph

    knn_graph = correct_knn_graph(features=X_faiss, knn_graph=knn_graph)
    ```
- You may need to handle self-points yourself with third-party libraries.
- Construct the CSR (Compressed Sparse Row) matrix from the distances and indices arrays.
  - `Datalab` can automatically construct a kNN graph from a numerical `features` array if one is not provided, in an accurate and reliable manner.
- Sort the kNN graph by row values.

When using kNN graphs, it is important to understand their strengths and limitations to apply them effectively in your ML workflows.
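
If FAISS is not available, an exact kNN graph can also be built with scikit-learn. The sketch below is one possible alternative on random toy data (the names `X_demo` and `knn_graph_exact` are hypothetical). Calling `kneighbors_graph` without a query array treats the training points as queries and does not count a point as its own neighbor, so no self-points need to be removed afterwards.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random toy features (hypothetical stand-in for your own feature array)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

k = 10  # Number of neighbors per point

# Exact nearest-neighbor search with cosine distance
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X_demo)

# Sparse CSR matrix of shape (n_samples, n_samples) holding neighbor distances
knn_graph_exact = nn.kneighbors_graph(mode="distance")

print(knn_graph_exact.shape, knn_graph_exact.nnz)
```

Exact search scales less favorably than approximate indexes on large datasets, so this variant is mainly useful for small to medium data or for checking an approximate graph.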

## Detect if Data Is Non-IID

In this section, we'll show how to run a non-iid check of your data with `Datalab`.

For this demonstration, we'll work with a 2d dataset where the data points are not independent.

### 1. Load and Prepare the Dataset

For simplicity, we'll just work with numerical feature embeddings that represent the dataset.

In [None]:
def generate_data_dependent(num_samples):
    a1, a2, a3 = 0.6, 0.375, -0.975
    X = [np.random.normal(1, 1, 2) for _ in range(3)]
    X.extend(a1 * X[i-1] + a2 * X[i-2] + a3 * X[i-3] for i in range(3, num_samples))
    return np.array(X)

X = generate_data_dependent(50)

### 2. Detect Non-IID Issues Using Datalab


In [None]:
from cleanlab import Datalab

# Initialize Datalab with the dataset
lab = Datalab(data={"X": X})

# Check whether the dataset appears to be non-iid
lab.find_issues(features=X, issue_types={"non_iid": {}})

# Collect the identified issues
non_iid_issues = lab.get_issues("non_iid")

# Display the non-iid issues
display(non_iid_issues.head())


### 3. (Optional) Visualize the Results

Finally, we'll visualize the dataset and highlight the non-iid issues detected by `Datalab`.

Note that being non-iid is a property of the dataset as a whole; no individual data point can be non-iid on its own.

To be compatible with `Datalab`, the point with the lowest non-iid score is assigned the `is_non_iid_issue` flag if the entire dataset
is considered non-iid.

In [None]:
# Plot the non-iid scores
non_iid_issues["non_iid_score"].plot()

# Highlight the point assigned as a non-iid issue
idx = non_iid_issues.query("is_non_iid_issue").index
plt.scatter(idx, non_iid_issues.loc[idx,"non_iid_score"], color='red', label='Non-iid Issue', s=100)
plt.title("Non-iid Scores")
plt.xlabel("Sample Index")
plt.ylabel("Non-iid Score")
plt.legend()
plt.show()

# Visualize dataset ordering
plt.scatter(X[:, 0], X[:, 1], c=range(len(X)), cmap='coolwarm', s=100)
plt.title("Dataset with data-dependent ordering")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Add colorbar
plt.colorbar(label='Sample Index')
plt.show()

These plots help visualize the non-iid scores for each data point and the dataset ordering, highlighting potential dependencies and issues.

After detecting non-iid issues, you might be interested in quantifying the likelihood that your dataset is non-iid.

To check if your data is non-iid, Datalab computes a p-value. A low p-value (close to 0) indicates strong evidence against the null hypothesis that the data is iid, suggesting significant dependencies or variations in
distribution across the dataset.

In [None]:
print("p-value:", lab.get_info("non_iid")["p-value"])
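
One way to act on the p-value is to compare it against a significance level chosen in advance. The helper below is a hypothetical sketch and not part of `Datalab`; the default `alpha=0.05` is a conventional choice used here only for illustration.

```python
def interpret_p_value(p_value: float, alpha: float = 0.05) -> str:
    """Summarize a p-value for the null hypothesis that the data is iid."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < alpha:
        # Small p-value: reject the null hypothesis at level alpha
        return "evidence against iid"
    # Otherwise the test does not give strong evidence of dependence
    return "no strong evidence against iid"


print(interpret_p_value(0.001))
print(interpret_p_value(0.4))
```

Note that a large p-value does not prove the data is iid; it only means this test found no strong evidence of a violation.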

You may be interested in [this page to learn more about the non-iid issue type](issue_type_description.html#non-iid-issue).