# Improve Consensus Labels for Multiannotator Data

This 5-minute quickstart tutorial shows how to use Cleanlab Studio for classification data that has been labeled by multiple annotators (where each example has been labeled by at least one annotator, but not every annotator has labeled every example). Compared to existing crowdsourcing tools, Cleanlab Studio helps you better analyze such data by leveraging a trained classifier model in addition to the raw annotations. With a few lines of code, you can automatically compute:

- A **consensus label** for each example (i.e. *truth inference*) that aggregates the individual annotations (more accurately than algorithms from crowdsourcing like majority-vote, Dawid-Skene, or GLAD).
- A **quality score for each consensus label** which measures our confidence that this label is correct (via well-calibrated estimates that account for the number of annotators who labeled this example, the overall quality of each annotator, and the quality of our trained ML models).
- An analogous **label quality score** for each individual label chosen by one annotator for a particular example.
- An **overall quality score for each annotator** which measures our confidence in the overall correctness of labels obtained from this annotator.

**Overview of what we'll do in this tutorial:**

- Obtain initial consensus labels of multiannotator data using majority vote.
- Train a model on the initial consensus labels and obtain out-of-sample predicted class probabilities.
- Get improved consensus labels that more accurately reflect the ground truth.
- View other information about your multiannotator dataset, such as consensus and annotator quality scores, agreement between annotators, detailed label quality scores and more!

## 1. Install and import required dependencies

You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install cleanlab cleanlab-studio
```

In [None]:
# Package installation (hidden on docs website).
dependencies = ["cleanlab", "cleanlab_studio"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    dependencies_test = [dependency.split('>')[0] if '>' in dependency 
                         else dependency.split('<')[0] if '<' in dependency 
                         else dependency.split('=')[0] for dependency in dependencies]
    missing_dependencies = []
    for dependency in dependencies_test:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

# Suppress benign warnings: 
import warnings 
warnings.filterwarnings("ignore", "Lazy modules are a new feature.*")

Let’s import some of the packages needed throughout this tutorial.

In [None]:
import numpy as np
import pandas as pd

from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label
from cleanlab_studio import Studio

studioID = 'f85d63c6f5e1401db4846c279aa1efa7'   # replace with your own Cleanlab Studio API key
studio = Studio(studioID)

## 2. Create the data (can skip these details)

For this tutorial we will generate a toy dataset that has 50 annotators and 300 examples. There are three possible classes, `0`, `1` and `2`. 

Each annotator annotates approximately 10% of the examples. We also synthetically made the last 5 annotators in our toy dataset have much noisier labels than the rest of the annotators.

Solely for evaluating Cleanlab Studio's consensus labels against other consensus methods, we also generate the true labels for this example dataset. However, true labels are not required for any of the multiannotator functions used here (and they are usually not available in real applications).
To generate our multiannotator data, we define a `make_data()` function (you can skip these details).

In [None]:
from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels

SEED = 111 # set to None for non-reproducible randomness
np.random.seed(seed=SEED)

def make_data(
    means=[[3, 2], [7, 7], [0, 8]],
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],
    sizes=[150, 75, 75],
    num_annotators=50,
):
    
    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Compute p(true_label=k)
    py = np.bincount(true_labels_train) / float(len(true_labels_train))
    
    noise_matrix_better = generate_noise_matrix_from_trace(
        m,
        trace=0.8 * m,
        py=py,
        valid_noise_matrix=True,
        seed=SEED,
    )
    
    noise_matrix_worse = generate_noise_matrix_from_trace(
        m,
        trace=0.35 * m,
        py=py,
        valid_noise_matrix=True,
        seed=SEED,
    )

    # Generate our noisy labels using the noise_matrix for specified number of annotators.
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, noise_matrix_better)
                if i < num_annotators - 5
                else generate_noisy_labels(true_labels_train, noise_matrix_worse)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset
    # (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
    s.dropna(axis=1, how="all", inplace=True)
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators+1)]

    row_NA_check = pd.notna(s).any(axis=1)

    return {
        "X_train": X_train[row_NA_check],
        "true_labels_train": true_labels_train[row_NA_check],
        "multiannotator_labels": s[row_NA_check].reset_index(drop=True),
    }

In [None]:
data_dict = make_data()

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let's view the first few rows of the data used for this tutorial. Here are the labels selected by each annotator for the first few examples:

In [None]:
multiannotator_labels.head()

Here are the corresponding features for these examples:

In [None]:
X[:5]

`multiannotator_labels` contains the class labels that each annotator chose for each example; entries for examples that a particular annotator did not label are missing values (`pd.NA` here, since the columns use the nullable `Int64` dtype).
`X` contains the features for each example, which happen to be numeric in this tutorial but any feature modality can be used.

<div class="alert alert-info">
Bringing Your Own Data (BYOD)?

You can easily replace the above with your own multiannotator labels and features, then continue with the rest of the tutorial.
 
`multiannotator_labels` should be a numpy array or pandas DataFrame with each column representing an annotator and each row representing an example. Your labels should be represented as integer indices 0, 1, ..., num_classes - 1, where examples that are not annotated by a particular annotator are represented using `np.nan` or `pd.NA`. If you have string labels or other labels that do not fit the required format, you can convert them to the proper format using `cleanlab.internal.multiannotator_utils.format_multiannotator_labels`. 
    
Your features can be represented however you like as long as they are in a format accepted by Cleanlab Studio! 

</div>
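
If your annotations are stored as strings, the conversion described above can also be done by hand. The following is a minimal sketch in plain pandas, not the cleanlab helper mentioned above; the toy DataFrame `raw_labels` and the names `class_to_index` and `encoded_labels` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy annotations: one column per annotator, None = not labeled.
raw_labels = pd.DataFrame(
    {
        "A0001": ["cat", None, "dog"],
        "A0002": [None, "bird", "dog"],
        "A0003": ["cat", "bird", None],
    }
)

# Collect the class names that actually occur and give each an integer index.
class_names = sorted({v for v in raw_labels.to_numpy().ravel() if pd.notna(v)})
class_to_index = {name: idx for idx, name in enumerate(class_names)}

# Map every column to integer indices; missing annotations become pd.NA.
encoded_labels = raw_labels.apply(lambda col: col.map(class_to_index)).astype("Int64")

# Sanity check: every example needs at least one annotation.
if not encoded_labels.notna().any(axis=1).all():
    raise ValueError("Some examples have no annotations at all.")

print(class_to_index)
print(encoded_labels)
```

Keeping `class_to_index` around lets you translate consensus labels back to the original class names later.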


## 3. Get majority vote labels and compute out-of-sample predicted probabilities

Before training a machine learning model, we must first obtain consensus labels from the annotators that labeled the data. The simplest way to obtain an initial set of consensus labels is to take the majority vote among the annotations for each example.

In [None]:
majority_vote_label = get_majority_vote_label(multiannotator_labels)
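
To make the idea concrete, here is a minimal NumPy sketch of majority voting over sparse annotations. It is an illustration only: the function `simple_majority_vote` is hypothetical, and it breaks ties by picking the lowest class index, which may differ from how `get_majority_vote_label` resolves ties.

```python
import numpy as np

def simple_majority_vote(labels, num_classes):
    """Pick the most frequent label per example, ignoring missing annotations.

    labels: float array of shape (n_examples, n_annotators), np.nan = not labeled.
    Assumes every example has at least one annotation.
    """
    consensus = np.empty(labels.shape[0], dtype=int)
    for i, row in enumerate(labels):
        votes = row[~np.isnan(row)].astype(int)
        counts = np.bincount(votes, minlength=num_classes)
        consensus[i] = counts.argmax()  # ties go to the lowest class index
    return consensus

toy_annotations = np.array(
    [
        [0, 0, np.nan, 1],            # two votes for class 0, one for class 1
        [np.nan, 2, 2, np.nan],       # two votes for class 2
        [1, np.nan, np.nan, np.nan],  # a single annotation
    ]
)
toy_majority = simple_majority_vote(toy_annotations, num_classes=3)
print(toy_majority)
```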

Next, we will train a model on the consensus labels obtained using majority vote to compute out-of-sample predicted probabilities.

In [None]:
df_consensus = pd.DataFrame(X)
df_consensus['label'] = majority_vote_label
dataset_id = studio.upload_dataset(df_consensus, dataset_name='multiannotator_tutorial')
project_id = studio.create_project(dataset_id, 
                                   project_name='multiannotator_tutorial_itter', 
                                   modality='tabular', 
                                   model_type='fast', 
                                   label_column='label',)
cleanset_id = studio.get_latest_cleanset_id(project_id)
project_status = studio.poll_cleanset_status(cleanset_id)
if project_status:
    pred_probs = studio.experimental.download_pred_probs(cleanset_id)
    pred_probs = pred_probs.to_numpy()
else:
    print("There was an error generating your cleanset.")
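
"Out-of-sample" means that the predicted probabilities for each example come from a model that did not see that example during training. Cleanlab Studio handles this for you; the sketch below only illustrates the concept with K-fold cross-validation and a toy nearest-centroid classifier. The function `out_of_sample_pred_probs` and the demo data are assumptions made up for this illustration and say nothing about how Studio trains its models.

```python
import numpy as np

def out_of_sample_pred_probs(X, y, num_classes, num_folds=5, seed=0):
    """K-fold cross-validation with a toy nearest-centroid classifier.

    Assumes every class appears in the training part of every fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    pred_probs = np.zeros((len(X), num_classes))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        # Fit on the training rows only: one centroid per class.
        centroids = np.stack(
            [X[train][y[train] == k].mean(axis=0) for k in range(num_classes)]
        )
        # Predict for the held-out rows: softmax over negative squared distances.
        diffs = X[held_out][:, None, :] - centroids[None, :, :]
        logits = -(diffs ** 2).sum(axis=2)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        pred_probs[held_out] = probs / probs.sum(axis=1, keepdims=True)
    return pred_probs

# Demo on three well-separated Gaussian blobs (20 points per class).
demo_rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
y_demo = np.repeat(np.arange(3), 20)
X_demo = centers[y_demo] + demo_rng.normal(scale=0.5, size=(60, 2))
demo_pred_probs = out_of_sample_pred_probs(X_demo, y_demo, num_classes=3)
print(demo_pred_probs[:3].round(3))
```

With your own models, a library routine such as scikit-learn's `cross_val_predict(..., method="predict_proba")` serves the same purpose.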

## 4. Use Cleanlab Studio to get better consensus labels and other statistics

Using the annotators' labels and the out-of-sample predicted probabilities from the model, Cleanlab can help us obtain improved consensus labels for our data.

In [None]:
results = get_label_quality_multiannotator(multiannotator_labels, pred_probs, verbose=False)

Here, we use the `multiannotator.get_label_quality_multiannotator()` function, which returns a dictionary containing three items:


1. `label_quality` which gives us the improved consensus labels using information from each of the annotators and the model. The DataFrame also contains information about the number of annotations, annotator agreement and consensus quality score for each example.


In [None]:
results["label_quality"].head()

2. `detailed_label_quality` which gives us the label quality score for each label given by every annotator.

In [None]:
results["detailed_label_quality"].head()

3. `annotator_stats` which gives us the annotator quality score for each annotator, alongside other information such as the number of examples each annotator labeled, their agreement with the consensus labels, and the class they perform the worst at.

In [None]:
results["annotator_stats"].head(10)

The `annotator_stats` DataFrame is sorted by increasing `annotator_quality`, showing us the worst annotators first.

Notice that in the above table annotators with ids A0046 to A0050 have the worst annotator quality score, which is expected because we made the last 5 annotators systematically worse than the rest.
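
For intuition about the agreement statistic, here is a minimal sketch of how an annotator's agreement with the consensus could be computed by hand. The function `annotator_agreement` is hypothetical, and this raw agreement rate is not the annotator quality score, which also takes the trained model into account.

```python
import numpy as np

def annotator_agreement(labels, consensus):
    """Fraction of each annotator's labels that match the consensus label.

    labels: float array of shape (n_examples, n_annotators), np.nan = not labeled.
    Assumes every annotator labeled at least one example.
    """
    labeled = ~np.isnan(labels)
    matches = (labels == consensus[:, None]) & labeled
    return matches.sum(axis=0) / labeled.sum(axis=0)

toy_labels = np.array(
    [
        [0, 0, 1],
        [1, np.nan, 0],
        [2, 2, np.nan],
        [np.nan, 1, 0],
    ]
)
toy_consensus = np.array([0, 1, 2, 1])
agreement = annotator_agreement(toy_labels, toy_consensus)
print(agreement)  # the third annotator never agrees with the consensus
```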

### Comparing improved consensus labels

We can get the improved consensus labels from the `label_quality` DataFrame shown above.

In [None]:
improved_consensus_label = results["label_quality"]["consensus_label"].values

Since our toy dataset is synthetically generated by adding noise to each annotator's labels, we know the ground truth labels for each example. Hence we can compare the accuracy of the consensus labels obtained using majority vote, and the improved consensus labels obtained using cleanlab.

In [None]:
majority_vote_accuracy = np.mean(true_labels == majority_vote_label)
cleanlab_label_accuracy = np.mean(true_labels == improved_consensus_label)

print(f"Accuracy of majority vote labels = {majority_vote_accuracy}")
print(f"Accuracy of cleanlab consensus labels = {cleanlab_label_accuracy}")

We can see that the accuracy of the consensus labels improved as a result of using cleanlab, which computes consensus labels from the predictions of a trained model in addition to the annotators' labels.

### Inspecting consensus quality scores to find potential consensus label errors

We can get the consensus quality score from the `label_quality` DataFrame shown above.

In [None]:
consensus_quality_score = results["label_quality"]["consensus_quality_score"]

Besides obtaining improved consensus labels, cleanlab also computes a consensus quality score for each example. Lower scores indicate potential consensus label errors in the dataset.

Here, we extract the 15 examples with the lowest consensus quality scores and compute the accuracy of their consensus labels against the true labels. For comparison, we also compute the accuracy of the consensus labels for the remaining examples.

In [None]:
sorted_consensus_quality_score = consensus_quality_score.sort_values()
worst_quality = sorted_consensus_quality_score.index[:15]
better_quality = sorted_consensus_quality_score.index[15:]

worst_quality_accuracy = np.mean(true_labels[worst_quality] == improved_consensus_label[worst_quality])
better_quality_accuracy = np.mean(true_labels[better_quality] == improved_consensus_label[better_quality])

print(f"Accuracy of 15 worst quality examples = {worst_quality_accuracy}")
print(f"Accuracy of better quality examples = {better_quality_accuracy}")

We observe that the 15 worst-consensus-quality-score examples have a lower average accuracy compared to the rest of the examples. 

## 5. Retrain model using improved consensus labels

After obtaining the improved consensus labels, we can now use Studio to retrain a better version of our machine learning model using these newly obtained labels. 

In [None]:
df_consensus = pd.DataFrame(X)
df_consensus['label'] = improved_consensus_label
dataset_id = studio.upload_dataset(df_consensus, dataset_name='multiannotator_tutorial_improved')
project_id = studio.create_project(dataset_id, 
                                   project_name='multiannotator_tutorial_improved_itter', 
                                   modality='tabular', 
                                   model_type='fast', 
                                   label_column='label',)
cleanset_id = studio.get_latest_cleanset_id(project_id)
project_status = studio.poll_cleanset_status(cleanset_id)
if project_status:
    improved_pred_probs = studio.experimental.download_pred_probs(cleanset_id).to_numpy()
else:
    print("There was an error generating your cleanset.")

## Further model improvements 
You can also repeat this process: use the model's out-of-sample predicted probabilities to get better consensus labels, then retrain the model on the improved labels to get even better predicted probabilities!
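
The loop could be organized roughly as in the sketch below. Everything here is hypothetical scaffolding: `refine_consensus`, `train_model`, and `compute_consensus` are made-up names, and stopping early once the labels no longer change is an assumption of this sketch, not something prescribed by the tutorial. In practice `train_model` would wrap the Studio upload, project, and download steps shown earlier, and `compute_consensus` would wrap `get_label_quality_multiannotator`.

```python
import numpy as np

def refine_consensus(initial_labels, train_model, compute_consensus, max_rounds=3):
    """Alternate between training on the current labels and re-estimating consensus."""
    labels = np.asarray(initial_labels)
    for _ in range(max_rounds):
        pred_probs = train_model(labels)  # out-of-sample predicted probabilities
        new_labels = np.asarray(compute_consensus(pred_probs))
        if np.array_equal(new_labels, labels):
            break  # labels stopped changing, so further rounds would not help
        labels = new_labels
    return labels

# Dummy stand-ins so the sketch runs on its own.
fixed_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
final_labels = refine_consensus(
    initial_labels=[0, 0, 0],
    train_model=lambda labels: fixed_probs,
    compute_consensus=lambda probs: probs.argmax(axis=1),
)
print(final_labels)
```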

## How does our algorithm work?

All estimates above are produced via the CROWDLAB algorithm, described in the paper below, whose extensive benchmarks show that CROWDLAB can produce better estimates than popular methods like Dawid-Skene and GLAD:

[CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators](https://arxiv.org/abs/2210.06812)
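
To give a rough intuition for why a classifier can help, the sketch below blends annotator votes with predicted class probabilities. This is not CROWDLAB: the function `blend_votes_with_model` is hypothetical, and the fixed, equal weights are an assumption for illustration, whereas the paper describes how the relative trust in the model and in each annotator is estimated from the data.

```python
import numpy as np

def blend_votes_with_model(labels, pred_probs, model_weight=1.0, annotator_weight=1.0):
    """Weighted average of the model's probabilities and one-hot annotator votes.

    labels: float array of shape (n_examples, n_annotators), np.nan = not labeled.
    pred_probs: array of shape (n_examples, n_classes).
    """
    labels = np.asarray(labels, dtype=float)
    blended = model_weight * np.asarray(pred_probs, dtype=float)
    total_weight = np.full(labels.shape[0], model_weight, dtype=float)
    for j in range(labels.shape[1]):
        rows = np.flatnonzero(~np.isnan(labels[:, j]))
        # Each annotation adds weight to the class that annotator chose.
        blended[rows, labels[rows, j].astype(int)] += annotator_weight
        total_weight[rows] += annotator_weight
    return blended / total_weight[:, None]

toy_votes = np.array([[0, 1, np.nan], [1, np.nan, np.nan]])
toy_pred_probs = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
blended_probs = blend_votes_with_model(toy_votes, toy_pred_probs)
print(blended_probs.argmax(axis=1))
```

In the first toy example the two annotators disagree, and the model's prediction breaks the tie, which is something majority vote alone cannot do.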

In [None]:
# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

if majority_vote_accuracy >= cleanlab_label_accuracy:  # check cleanlab has improved prediction accuracy
    raise Exception("Cleanlab training failed to improve consensus label accuracy")

if worst_quality_accuracy > better_quality_accuracy: # check bad consensus quality score corresponds to bad consensus
    raise Exception("Cleanlab consensus quality score failed to detect bad consensus labels")
    
annotator_stats = results["annotator_stats"]
bad_annotator_idx = ["A0046", "A0047", "A0048", "A0049", "A0050"]
bad_annotator_mask = annotator_stats.index.isin(bad_annotator_idx)

avg_annotator_quality_bad = np.mean(annotator_stats[bad_annotator_mask]["annotator_quality"])
avg_annotator_quality_good = np.mean(annotator_stats[~bad_annotator_mask]["annotator_quality"])

if avg_annotator_quality_bad >= avg_annotator_quality_good: # check bad annotator get bad quality scores 
    raise Exception("Low quality annotators have higher quality scores than good quality annotators")