# Active Learning for Caltech-256 using Cleanlab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb)

This notebook demonstrates a practical approach to active learning for training classification models with cleanlab. In active learning, we aim to construct a labeled dataset by collecting the fewest labels that still allow us to train an accurate classifier model. Here we assume data labeling is done in **batches**, and between these data labeling rounds, we retrain our classifier to decide what examples (i.e. datapoints) to label next round. We consider labeling a batch of examples per round and are limited to a single annotation per example.

cleanlab provides an active learning score quantifying how desirable it is to collect an additional label for each unlabeled example.

This notebook demonstrates how to compute these scores for use in sequential active learning, showing how a classification model iteratively improves after labeling more examples over multiple rounds.
This notebook implements the following steps:

1. Establish some already labeled data. Use this data to train a classifier model and then obtain out-of-sample predicted probabilities for each labeled and unlabeled example.
2. Compute active learning scores for every example, which estimate our current confidence in knowing its true label.
3. Collect additional labels for the unlabeled examples with the lowest active learning scores. These are the most potentially informative examples whose true label we are least certain of. 
4. Repeat the steps above to collect as many labels as your budget permits.

The accuracy of the model trained on the resulting dataset will generally match that of the same model trained on a much larger set of randomly collected labels. In other words, this is a cost-effective way to train an accurate classifier.

In this example we use the Caltech-256 dataset combined with AutoGluon's `MultiModalPredictor`.

## Import dependencies and get data

In [2]:
import time
import numpy as np
import pandas as pd
from autogluon.multimodal import MultiModalPredictor
from autogluon.vision import ImageDataset
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from cleanlab.multiannotator import get_majority_vote_label, get_label_quality_multiannotator, get_active_learning_scores
from utils.model_training_autogluon import cross_val_predict_autogluon_classification_dataset


We load the following datafiles:

- `dataset` is a DataFrame that contains labels and image paths for each example

We then randomly split the dataset into train and test splits. The test data is used only to measure the accuracy of our model after each active learning round. The train data is further split into a labeled and an unlabeled part. We use the labels in `df_labeled` to train the model, while `df_unlabeled` simulates active learning by letting us artificially insert more labeled data between rounds.

We start with `num_labeled_per_class = 8`.
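As a quick illustration of the per-class sampling used to build the initial labeled pool, here is a minimal sketch on a toy DataFrame. The toy file names and the variables `toy`, `seed_labeled`, and `seed_unlabeled` are made up for this example; only the `groupby("label").sample(...)` pattern is taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the image dataset: 3 classes with 4 examples each (paths are made up).
toy = pd.DataFrame({
    "image": [f"img_{i}.jpg" for i in range(12)],
    "label": [0, 1, 2] * 4,
})

num_labeled_per_class = 2

# Sample the same number of rows from every class to form the initial labeled pool.
seed_labeled = toy.groupby("label").sample(n=num_labeled_per_class, random_state=123)

# Every row that was not sampled stays in the unlabeled pool.
seed_unlabeled = toy.drop(index=seed_labeled.index)

print(seed_labeled["label"].value_counts().to_dict())
print(len(seed_labeled), len(seed_unlabeled))
```

Sampling per class guarantees that every class appears in the first training set, which a plain random sample of the same size would not.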

In [3]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/ActiveLearning/Caltech256/256_ObjectCategories.zip' && unzip -o -q 256_ObjectCategories.zip

File ‘256_ObjectCategories.zip’ already there; not retrieving.



In [4]:
dataset = ImageDataset.from_folder('./256_ObjectCategories/')
dataset = dataset.replace(257, 256) # relabel class 257 as 256 so that class labels are contiguous (assumes no class is already labeled 256)

In [5]:
# get train-test split
X, y = np.arange(len(dataset) * 2).reshape((len(dataset), 2)), range(len(dataset))
_, _, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)

df_train = dataset.iloc[y_train]
df_test = dataset.iloc[y_test]
df_test = df_test.reset_index(drop=True)

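In [6]:
# split the DataFrame directly; this overwrites df_train and df_test from the previous cell
df_train, df_test = train_test_split(dataset, test_size=0.33, random_state=123)
df_test = df_test.reset_index(drop=True)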

In [7]:
def get_labeled(dataset, num_labeled_per_class=15):
    """Splits the provided dataset into two datasets: df_labeled contains num_labeled_per_class
    labeled examples for each class, and df_unlabeled contains the remaining rows."""

    # use a positional index so that index labels and row positions coincide
    dataset = dataset.reset_index(drop=True)
    df_labeled = dataset.groupby("label").sample(n=num_labeled_per_class, random_state=123)
    labeled_index = list(df_labeled.index)
    labeled_set = set(labeled_index)  # set membership keeps the filter below linear in dataset size
    unlabeled_index = [i for i in range(len(dataset)) if i not in labeled_set]
    df_unlabeled = dataset.iloc[unlabeled_index].reset_index(drop=True)
    df_labeled = df_labeled.reset_index(drop=True)
    return labeled_index, df_labeled, unlabeled_index, df_unlabeled

# split only the training data, so that test examples never enter the labeled or unlabeled pools
labeled_index, df_labeled, unlabeled_index, df_unlabeled = get_labeled(df_train, num_labeled_per_class=8)

## Train model to obtain predicted probabilities

First, we train our model on the labeled set obtained by `get_labeled` to get out-of-sample predicted class probabilities for both the labeled and unlabeled data.

The training function returns two sets of predicted probabilities: `pred_probs_labeled` holds the predicted probabilities for examples that already have a label (they correspond directly to the rows in `df_labeled`), whereas `pred_probs_unlabeled` holds the predicted probabilities for examples that do not have a label yet (they correspond directly to the rows in `df_unlabeled`). These predicted probabilities are later used to compute the active learning scores.

If working with your own dataset, consider modifying the `cross_val_predict_autogluon_classification_dataset` function so that it is better suited to your data.
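To make "out-of-sample" concrete, the sketch below computes cross-validated predicted probabilities with NumPy only. It is not the notebook's helper: the nearest-centroid classifier, the synthetic 2D data, and the names `centroid_pred_probs` and `out_of_sample_pred_probs` are assumptions chosen to keep the example small.

```python
import numpy as np

def centroid_pred_probs(X_train, y_train, X_eval, num_classes):
    """Toy classifier: softmax over negative squared distances to class centroids."""
    centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in range(num_classes)])
    sq_dists = ((X_eval[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dists
    logits -= logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def out_of_sample_pred_probs(X, y, num_classes, cv_n_folds=3, seed=0):
    """Predict each example with a model that was fit on the other folds only."""
    rng = np.random.default_rng(seed)
    # Stratified folds: deal the shuffled members of each class round-robin across folds,
    # so every training split contains examples of every class.
    fold_of = np.empty(len(X), dtype=int)
    for k in range(num_classes):
        members = rng.permutation(np.flatnonzero(y == k))
        fold_of[members] = np.arange(len(members)) % cv_n_folds
    pred_probs = np.zeros((len(X), num_classes))
    for fold in range(cv_n_folds):
        held_out = fold_of == fold
        pred_probs[held_out] = centroid_pred_probs(
            X[~held_out], y[~held_out], X[held_out], num_classes
        )
    return pred_probs

# Synthetic data: 3 well-separated blobs with 10 points each.
rng = np.random.default_rng(0)
num_classes = 3
y = np.repeat(np.arange(num_classes), 10)
X = rng.normal(size=(len(y), 2)) + 4.0 * y[:, None]

pred_probs = out_of_sample_pred_probs(X, y, num_classes)
print(pred_probs.shape)
```

Because no example is scored by a model that saw its label during fitting, the resulting probabilities are less prone to overconfidence than in-sample predictions.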

In [None]:
pred_probs_labeled, pred_probs_unlabeled, labels_labeled, images = cross_val_predict_autogluon_classification_dataset(
    df_labeled,  # train only on the currently labeled examples
    out_folder=None,
    cv_n_folds=3,
    df_predict=df_unlabeled,
    hypterparameters={},
    time_limit=180,
)

Global seed set to 123
No path specified. Models will be saved in: "AutogluonModels/ag-20230315_210329/"


----
Running Cross-Validation on Split: 0



## Obtain active learning scores

Next, we will get the active learning scores for each datapoint (both labeled and unlabeled) by using a combination of the annotators' agreement and model confidence. These scores represent how confident we are about an example's true label based on the currently obtained annotations; examples with the lowest scores are those for which additional labels should be collected (i.e. likely the most informative). These scores are estimated via an **Active CROWDLAB** algorithm developed by the Cleanlab team, and may sometimes prioritize an already-labeled example over an unlabeled example if the annotations for the labeled example are deemed unreliable (Active CROWDLAB appropriately estimates the value of collecting new annotations for unlabeled data vs already-labeled data).

Since we only have a single label for each datapoint, we only consider scores for previously unlabeled examples.
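For intuition about what a score on unlabeled examples captures, the sketch below uses a deliberately simplified stand-in: the model's largest predicted class probability (a "least confidence" criterion). This is not the Active CROWDLAB computation performed by `get_active_learning_scores`, and the probabilities are made up.

```python
import numpy as np

# Made-up predicted probabilities for 4 unlabeled examples over 3 classes.
toy_pred_probs_unlabeled = np.array([
    [0.98, 0.01, 0.01],  # model is confident
    [0.40, 0.35, 0.25],  # model is unsure
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # model is close to guessing
])

# Simplified confidence score: probability of the most likely class.
# Lower score = less certain = more useful to label next.
toy_scores = toy_pred_probs_unlabeled.max(axis=1)

# Rank examples from least to most confident.
ranking = np.argsort(toy_scores)
print(toy_scores)
print(ranking)
```

The example the model is closest to guessing on (row 3) ranks first, and the confidently predicted one (row 0) ranks last.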

In [None]:
# compute active learning scores
active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
    labels_labeled, pred_probs_labeled, pred_probs_unlabeled
)

In [None]:
# sample of active learning scores
active_learning_scores[:5]

## Get index to relabel

Lastly, we rank the examples by their active learning scores and obtain the indices of the examples with the lowest scores; these are the examples we are least confident about and want to collect labels for.

The code cell below shows how to obtain their respective indices to collect more labels.

In [None]:
def get_idx_to_label(active_learning_scores_unlabeled, batch_size_to_label):
    """Function to get indices of examples with the lowest active learning score to collect more labels for."""
    return np.argsort(active_learning_scores_unlabeled)[:batch_size_to_label]

In [None]:
batch_size_to_label = 100 # you can pick how many examples to collect more labels for at each round

# get next idx to label based on batch_size_to_label and magnitude of each example's active learning score
next_idx_to_label = get_idx_to_label(active_learning_scores_unlabeled, batch_size_to_label=batch_size_to_label)
next_idx_to_label[:5]
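For very large unlabeled pools, a full sort does more work than needed when only a small batch is selected. A possible variant, sketched here under the assumption that the scores are a 1D NumPy array, uses `np.argpartition` and sorts only the selected batch. The name `get_idx_to_label_partition` is hypothetical.

```python
import numpy as np

def get_idx_to_label_partition(scores, batch_size_to_label):
    """Hypothetical variant of get_idx_to_label that avoids sorting the whole pool."""
    batch_size_to_label = min(batch_size_to_label, len(scores))
    if batch_size_to_label == len(scores):
        return np.argsort(scores)
    # argpartition places the batch_size_to_label smallest scores first, in arbitrary order
    lowest = np.argpartition(scores, batch_size_to_label)[:batch_size_to_label]
    # sort only the selected batch so the result runs from lowest to highest score
    return lowest[np.argsort(scores[lowest])]

rng = np.random.default_rng(0)
toy_scores = rng.random(1000)

fast = get_idx_to_label_partition(toy_scores, 100)
full = np.argsort(toy_scores)[:100]
print(np.array_equal(fast, full))
```

When scores contain exact ties, the two approaches may pick different examples among the tied ones, which does not matter for the purposes of selecting a batch.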

## Improving model accuracy over 30 rounds of active learning (collecting new labels)

The code below shows a full demonstration of how we can repeatedly use the functions demonstrated above for multiple rounds in order to select which examples to collect new labels for, ask annotators to provide these new labels (simulated in this example by revealing the labels already present in the dataset), and use the newly collected labels to train an improved classification model.

This demonstration runs the active learning loop for 30 rounds, choosing 100 examples to collect labels for each round. Each round, we use the labeled examples to train a classifier (here we use AutoGluon's `MultiModalPredictor`) and obtain out-of-sample predicted probabilities, which are then used to compute the active learning scores for every example. We then synthetically collect new labels (this process is meant to simulate getting a new annotator to annotate a selection of examples) and repeat the active learning loop.

[Optional step] We also measure the model performance on a test set each round to demonstrate the improvement of the model.

In [None]:
def setup_next_iter_data(df_labeled, df_unlabeled, relabel_idx_unlabeled):
    """Updates inputs after additional labels have been collected in a single active learning round,
    this ensures that the inputs will be well formatted for the next round of active learning."""

    df_labeled = pd.concat([df_labeled,df_unlabeled.iloc[relabel_idx_unlabeled]], ignore_index=True)
    df_unlabeled = df_unlabeled.drop(relabel_idx_unlabeled)
    df_unlabeled = df_unlabeled.reset_index(drop=True)
    df_labeled = df_labeled.reset_index(drop=True)  
    return df_labeled, df_unlabeled
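A small usage sketch of this bookkeeping step on toy DataFrames follows. The function body is repeated so the block runs on its own, and the file names and the `picked` positions are made up.

```python
import numpy as np
import pandas as pd

def setup_next_iter_data(df_labeled, df_unlabeled, relabel_idx_unlabeled):
    # Same logic as the notebook's function, repeated so this block is self-contained.
    df_labeled = pd.concat([df_labeled, df_unlabeled.iloc[relabel_idx_unlabeled]], ignore_index=True)
    df_unlabeled = df_unlabeled.drop(relabel_idx_unlabeled).reset_index(drop=True)
    return df_labeled, df_unlabeled

toy_labeled = pd.DataFrame({"image": ["a.jpg", "b.jpg"], "label": [0, 1]})
toy_unlabeled = pd.DataFrame({"image": ["c.jpg", "d.jpg", "e.jpg", "f.jpg"], "label": [1, 0, 0, 1]})

# Positions (within the unlabeled pool) of the examples chosen for labeling this round.
picked = np.array([3, 1])
toy_labeled, toy_unlabeled = setup_next_iter_data(toy_labeled, toy_unlabeled, picked)

print(toy_labeled["image"].tolist())
print(toy_unlabeled["image"].tolist())
```

Note that `iloc` selects by position while `drop` removes by index label. The two agree here only because the unlabeled pool's index is reset to `0..n-1` after every round, so the indices returned by `np.argsort` on the scores line up with the rows.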

In [None]:
num_rounds = 30
batch_size_to_label = 100
hypterparameters = {}

In [None]:
model_accuacy_arr = np.full(num_rounds, np.nan)

for i in range(num_rounds):
    # train model to get out-of-sample predicted probabilites
    out_folder = None
    
    print('fitting xval model')
    pred_probs_labeled, pred_probs_unlabeled, labels_labeled, images = cross_val_predict_autogluon_classification_dataset(
        df_labeled,
        out_folder=out_folder,
        cv_n_folds=3,
        df_predict=df_unlabeled,
        hypterparameters=hypterparameters,
    )
    # train a model on the full set of labeled data to evaluate model accuracy for the current round,
    # this is an optional step for demonstration purposes, in practical applications 
    # you may not have ground truth labels
    print('fitting full model')
    predictor = MultiModalPredictor(label="label", path=None, problem_type="classification", warn_if_exist=False)
    # fit on all examples labeled so far
    predictor.fit(
        train_data=df_labeled,
        hyperparameters=hypterparameters,
    )
    # predicted labels for the test split
    pred_labels = predictor.predict(data=df_test)
    true_labels_test = np.array(df_test['label'].tolist())
    model_accuacy_arr[i] = np.mean(np.array(pred_labels) == true_labels_test)
    
    print('test round: ', i, 'accuracy: ', model_accuacy_arr[i])
        
    print('computing active learning scores')
    # compute active learning scores
    active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
        labels_labeled, pred_probs_labeled, pred_probs_unlabeled
    )
    
    print('getting idx to relabel')
    # get the indices of examples to collect more labels for
    relabel_idx_unlabeled = get_idx_to_label(
        active_learning_scores_unlabeled=active_learning_scores_unlabeled,
        batch_size_to_label=batch_size_to_label,
    )
    
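    # Random-sampling baseline: uncommenting the line below replaces the active learning
    # selection above with a random batch, which is useful only for comparison runs.
    # relabel_idx_unlabeled = np.random.choice(range(df_unlabeled.shape[0]), batch_size_to_label, replace=False)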
    print('setting up next iter')
    # format the data for the next round of active learning, ie. moving some unlabeled 
    # examples to the labeled pool because we are collecting labels for them
    df_labeled, df_unlabeled = setup_next_iter_data(df_labeled, df_unlabeled, relabel_idx_unlabeled)
    

## Evaluate results

In [None]:
print(f"Initial model test accuracy: {model_accuacy_arr[0]:.3}")
print(f"Final model test accuracy (after {num_rounds} rounds of active learning): {model_accuacy_arr[-1]:.3}")

In [None]:
np.save("model_acc_30_rounds_activelab", model_accuacy_arr)

In [None]:
plt.plot(model_accuacy_arr)
plt.xticks(range(num_rounds))
plt.xlabel("Round")
plt.ylabel("Model Accuracy")
# save before plt.show(), which may clear the current figure and leave an empty image on disk
plt.savefig('model_acc_30_rounds_activelab.png')
plt.show()

From the plot above, we can see that the model accuracy increases steadily with each additional round of collecting more labels and retraining the model.