# Active Learning with AutoGluon
Train accurate classifier models with minimal data labeling (and minimal code) via active learning and AutoML

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/examples/blob/master/active_learning_single_annotator/active_learning_single_annotator.ipynb)

This notebook demonstrates a practical approach to active learning for training an accurate image classifier with AutoGluon and cleanlab. We consider standard active learning settings with a pool of unlabeled examples, where we label a batch of examples at a time and collect **at most one label** per example. 

In **Active Learning**, we aim to construct a labeled dataset by collecting the fewest labels that still allow us to train an accurate classifier model. Here we assume data labeling is done in **batches**, and between these data labeling rounds, we retrain our classifier to decide which previously unlabeled examples (i.e. datapoints) to label in the next round.


This notebook demonstrates how to compute active learning scores for use in sequential active learning, showing how a classification model iteratively improves as more examples are labeled over multiple rounds, with the following steps:

1. Establish an initially labeled dataset, `df_labeled` to train the model on. This is a small subset of our training data, `df_train`. The rest of the training data is marked as `df_unlabeled`.
2. Train the model on the labeled data and get predictions for the unlabeled data, `pred_probs_unlabeled`.
3. Compute active learning scores for all unlabeled examples and select which samples to add to the dataset.
4. Add the selected samples from `df_unlabeled` into `df_labeled`.
5. Repeat the steps above to collect as many labels as your budget permits.

The accuracy of the model trained on the resulting dataset will generally match that of the same model trained on a much larger set of randomly collected labels, which makes this a cost-effective way to train an accurate classifier!
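
The steps above can be sketched end to end on a toy problem. The snippet below is a minimal illustration using only the standard library, not the notebook's actual pipeline: the "model" is a hypothetical 1-D threshold classifier, the score is simply the distance to that threshold (plain uncertainty sampling rather than ActiveLab), and `oracle` stands in for a human annotator.

```python
import random

def fit_threshold(labeled):
    """Toy 'model': midpoint between the largest class-0 point and the smallest class-1 point."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2

def oracle(x):
    """Hidden ground truth, queried only for the points we choose to label."""
    return int(x > 0.6)

def run_active_learning(pool, labeled, num_rounds=3, batch_size=2):
    """Pool-based loop: fit, score unlabeled points, label the least-confident batch, repeat."""
    labeled = list(labeled)
    unlabeled = list(pool)
    for _ in range(num_rounds):
        threshold = fit_threshold(labeled)                  # step 2: train on labeled data
        # step 3: points closest to the threshold are the ones the model is least sure about
        ranked = sorted(unlabeled, key=lambda x: abs(x - threshold))
        batch = ranked[:batch_size]
        labeled += [(x, oracle(x)) for x in batch]          # step 4: collect labels for the batch
        unlabeled = [x for x in unlabeled if x not in batch]
    return fit_threshold(labeled), labeled, unlabeled

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]   # unlabeled pool
seed_labels = [(0.0, 0), (1.0, 1)]          # step 1: tiny initial labeled set
threshold, labeled, unlabeled = run_active_learning(pool, seed_labels)
print(f"estimated threshold: {threshold:.3f} using {len(labeled)} labels")
```

Only 6 of the 200 pool points are ever labeled here; the rest of this notebook follows the same loop structure with a real image classifier and ActiveLab scores.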

## Import dependencies and data

In this example we use the [Caltech-256](https://data.caltech.edu/records/nyy15-4j048)[1] image classification dataset. Any dataset in the same format can be substituted instead.

In [7]:
import time
import numpy as np
import pandas as pd
from autogluon.multimodal import MultiModalPredictor
from gluoncv.auto.data.dataset import ImageClassificationDataset
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from cleanlab.multiannotator import get_label_quality_scores, get_active_learning_scores
from utils.model_training_autogluon import predict_autogluon_classification

In [8]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/ActiveLearning/Caltech256/256_ObjectCategories.zip' && unzip -o -q 256_ObjectCategories.zip

File ‘256_ObjectCategories.zip’ already there; not retrieving.



## Select initial labeled dataset
We load the following datafiles:

- `dataset` is a DataFrame that contains labels and file paths for each image (i.e. example) from Caltech-256

We then randomly split the dataset into train and test splits. The test data are used only to measure the accuracy of our model after each active learning round (you may not have this in your applications).

In [9]:
dataset = ImageClassificationDataset.from_folder('./256_ObjectCategories/')
dataset = dataset.replace(257, 256) # relabel class 257 as 256 (reindexing the class labels)

# Split data into train and test
df_train, df_test = train_test_split(dataset, test_size=0.33, random_state=123)

The train data will further be split into a labeled and unlabeled part, `df_labeled` and `df_unlabeled` respectively. 

We will use the labels in `df_labeled` to train the model while `df_unlabeled` will simulate active learning by allowing us to artificially insert more labeled data in between rounds.

We arbitrarily choose to start with `num_labeled_per_class = 8` labeled examples per class.

In [10]:
def get_labeled(dataset, num_labeled_per_class=8):
    """Splits the provided dataset into two datasets: df_labeled contains num_labeled_per_class
    labeled rows for each class, and df_unlabeled contains the rest of the rows in dataset."""

    df_labeled = dataset.groupby("label").sample(n=num_labeled_per_class, random_state=123)
    # Drop the sampled rows by index label (not by position): after train_test_split
    # the index of `dataset` is shuffled, so positions and index labels do not match.
    df_unlabeled = dataset.drop(index=df_labeled.index)
    df_unlabeled = df_unlabeled.reset_index(drop=True)
    df_labeled = df_labeled.reset_index(drop=True)
    return df_labeled, df_unlabeled

# Split the train data into labeled and unlabeled with 8 labeled per each class
df_labeled, df_unlabeled = get_labeled(df_train, num_labeled_per_class=8)
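
Whatever data structure you use, two properties are worth checking after a split like this: every class contributes exactly the requested number of labeled examples, and no example appears in both pools. The sketch below illustrates the same per-class sampling idea on a made-up list of `(path, label)` pairs using only the standard library; `split_labeled_unlabeled` and the file names are hypothetical and not part of this notebook's utilities.

```python
import random
from collections import Counter, defaultdict

def split_labeled_unlabeled(examples, num_labeled_per_class, seed=123):
    """Sample num_labeled_per_class examples per class; everything else is 'unlabeled'."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(examples):
        by_class[label].append(idx)
    labeled_idx = set()
    for indices in by_class.values():
        labeled_idx.update(rng.sample(indices, num_labeled_per_class))
    labeled = [ex for i, ex in enumerate(examples) if i in labeled_idx]
    unlabeled = [ex for i, ex in enumerate(examples) if i not in labeled_idx]
    return labeled, unlabeled

# Hypothetical toy data: 3 classes with 10 images each.
examples = [(f"img_{c}_{i}.jpg", c) for c in range(3) for i in range(10)]
labeled, unlabeled = split_labeled_unlabeled(examples, num_labeled_per_class=2)

per_class = Counter(label for _, label in labeled)
print(per_class)                                   # labeled examples per class
print(len(labeled), len(unlabeled))                # pool sizes
print(set(labeled).isdisjoint(unlabeled))          # pools must not overlap
```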

## Train model on labeled data & obtain predicted probabilities for unlabeled data

First, we train our model on the labeled data obtained from `get_labeled` and get the probabilities for the unlabeled data. The train function returns our `predictor` fitted to `df_labeled` and `pred_probs_unlabeled` which are the predicted probabilities for examples that do not have any annotator labels (they correspond directly with the rows in `df_unlabeled`). These predicted probabilities will later be used to compute the active learning score.

If working with your own model, you should consider modifying this `predict_autogluon_classification` function so that it is better fitted for training your specific model.

In [None]:
predictor, pred_probs_unlabeled = predict_autogluon_classification( df_labeled,
                                                                    out_folder=None,
                                                                    df_predict=df_unlabeled,
                                                                    time_limit=30)

Global seed set to 123
No path specified. Models will be saved in: "AutogluonModels/ag-20230317_201329/"
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                            | Params
----------------------------------------------------------------------
0 | model             | TimmAutoModelForImagePrediction | 87.0 M
1 | validation_metric | Accuracy                        | 0     
2 | loss_func         | CrossEntropyLoss                | 0     
----------------------------------------------------------------------
87.0 M    Trainable params
0         Non-trainable params
87.0 M    Total params
174.013   Total estimated model params size (MB)



Epoch 0, global step 6: 'val_accuracy' reached 0.00728 (best 0.00728), saving model to '/home/ubuntu/examples/active_learning_single_annotator/AutogluonModels/ag-20230317_201329/epoch=0-step=6.ckpt' as top 3



Time limit reached. Elapsed time is 0:00:30. Signaling Trainer to stop.



Epoch 0, global step 10: 'val_accuracy' reached 0.01699 (best 0.01699), saving model to '/home/ubuntu/examples/active_learning_single_annotator/AutogluonModels/ag-20230317_201329/epoch=0-step=10.ckpt' as top 3




INFO:automm:Start to fuse 2 checkpoints via the greedy soup algorithm.




INFO:automm:Models and intermediate outputs are saved to /home/ubuntu/examples/active_learning_single_annotator/AutogluonModels/ag-20230317_201329 



## Obtain active learning scores

Next, we will compute active learning scores that estimate the informativeness of labeling each datapoint. Since we will collect at most one annotation per example, we only care about active learning scores for the unlabeled data: `active_learning_scores_unlabeled`. 

These scores represent how confident we are about an example's true label based on the currently obtained annotations; examples with the lowest scores are those for which additional labels should be collected (i.e. likely the most informative). These scores are estimated via [ActiveLab](https://arxiv.org/abs/2301.11856), an algorithm developed by the Cleanlab team. 

In [None]:
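# Workaround: get_active_learning_scores comes from cleanlab's multiannotator module and is called
# here with pred_probs for the labeled data as well. We only use the scores for the unlabeled data,
# so we pass placeholder zeros for the labeled pred_probs.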
dummy_pred_probs = np.zeros((df_labeled.shape[0], pred_probs_unlabeled.shape[1]))

In [None]:
# compute active learning scores
_, active_learning_scores_unlabeled = get_active_learning_scores(
    df_labeled['label'].to_numpy(), dummy_pred_probs, pred_probs_unlabeled
)

In [None]:
# sample of active learning scores
active_learning_scores_unlabeled[:5]
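
To build intuition for what a low score means, the sketch below uses a much simpler stand-in score: the highest predicted class probability for each example (often called least-confidence scoring). This is an assumption made for illustration only; it is not what `get_active_learning_scores` computes, and the predicted probabilities are made up.

```python
def least_confidence_scores(pred_probs):
    """Score each example by its highest predicted class probability (lower = less confident)."""
    return [max(row) for row in pred_probs]

# Hypothetical predicted probabilities for 4 unlabeled examples and 3 classes.
pred_probs = [
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # uncertain prediction
    [0.34, 0.33, 0.33],   # nearly uniform: the model is essentially guessing
    [0.10, 0.80, 0.10],   # fairly confident prediction
]
scores = least_confidence_scores(pred_probs)

# The example with the lowest score is the one a label would likely help most.
most_informative = min(range(len(scores)), key=scores.__getitem__)
print(scores, most_informative)
```

The nearly uniform row receives the lowest score, so it would be the first candidate for labeling, mirroring how the lowest ActiveLab scores are prioritized in this notebook.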

## Get index to relabel

Lastly, we can rank the examples by their active learning scores and obtain the indices of the examples with the lowest scores; these are the **unlabeled** examples whose true label our current model is least confident about. We will want to prioritize these examples for labeling next.

The code cell below shows how to obtain their respective indices in order to collect labels for these examples.

In [None]:
def get_idx_to_label(active_learning_scores_unlabeled, batch_size_to_label):
    """Function to get indices of examples with the lowest active learning score to collect more labels for."""
    
    return np.argsort(active_learning_scores_unlabeled)[:batch_size_to_label]

In [None]:
batch_size_to_label = 100 # you can pick how many examples to collect more labels for at each round

# get next idx to label based on batch_size_to_label and magnitude of each example's active learning score
next_idx_to_label = get_idx_to_label(active_learning_scores_unlabeled, batch_size_to_label=batch_size_to_label)
next_idx_to_label[:5],active_learning_scores_unlabeled[next_idx_to_label[:5]]
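
A full sort is perfectly adequate at this scale. If the unlabeled pool were very large and the batch small, a partial selection could avoid sorting every score. The variant below is a hypothetical alternative (the name `get_idx_to_label_partial` is not part of this notebook) built on the standard library's `heapq.nsmallest`; it returns the same kind of result, indices ordered from lowest to highest score.

```python
import heapq

def get_idx_to_label_partial(scores, batch_size_to_label):
    """Indices of the batch_size_to_label lowest scores, ordered from lowest to highest score."""
    return heapq.nsmallest(batch_size_to_label, range(len(scores)), key=scores.__getitem__)

scores = [0.91, 0.34, 0.78, 0.40, 0.55]   # hypothetical active learning scores
next_idx = get_idx_to_label_partial(scores, batch_size_to_label=2)
print(next_idx)
```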

## Improving model accuracy over multiple rounds of active learning (collecting new labels)

The code below shows a full demonstration of how we can repeatedly use the functions demonstrated above for multiple rounds in order to select which examples to collect labels for and use the newly collected labels to train an improved classification model.


This demonstration runs the active learning loop for `num_rounds` rounds (set to 5 in the code below), selecting 100 unlabeled examples to label in each round. Each round, we use the labeled examples to train a classifier (here AutoGluon's `MultiModalPredictor`) and obtain predicted probabilities for the unlabeled data, which are then used to compute the active learning scores for every unlabeled example. We then synthetically collect new labels using `setup_next_iter_data` (this simulates getting annotations for the selected examples) and repeat the active learning loop.

[Optional step] We also measure the model performance on a test set each round to demonstrate the improvement of the model.

In [None]:
def setup_next_iter_data(df_labeled, df_unlabeled, relabel_idx_unlabeled):
    """Updates inputs after additional labels have been collected in a single active learning round,
    this ensures that the inputs will be well formatted for the next round of active learning."""

    df_labeled = pd.concat([df_labeled,df_unlabeled.iloc[relabel_idx_unlabeled]], ignore_index=True)
    df_unlabeled = df_unlabeled.drop(relabel_idx_unlabeled)
    df_unlabeled = df_unlabeled.reset_index(drop=True)
    df_labeled = df_labeled.reset_index(drop=True)  
    return df_labeled, df_unlabeled

In [None]:
num_rounds = 5
batch_size_to_label = 100

In [None]:
model_accuacy_arr = np.full(num_rounds, np.nan)

for i in range(num_rounds):
    # train model on the labeled data and get predicted probabilities for the unlabeled data
    print('fitting model')
    predictor, pred_probs_unlabeled = predict_autogluon_classification( df_labeled,
                                                                        out_folder=None,
                                                                        df_predict=df_unlabeled,
                                                                        time_limit=30)
    # evaluate the model fitted above on the held-out test set to track accuracy for the current round;
    # this is an optional step for demonstration purposes, in practical applications
    # you may not have ground truth labels
    print('predicting probabilities for test split')
    pred_labels = predictor.predict(data=df_test)
    true_labels_test = np.array(df_test['label'].tolist())
    model_accuacy_arr[i] = np.mean(pred_labels == true_labels_test)
    print('test round: ', i, 'accuracy: ', np.mean(pred_labels == true_labels_test))
        
    print('computing active learning scores')
    # compute active learning scores
    dummy_pred_probs = np.zeros((df_labeled.shape[0], pred_probs_unlabeled.shape[1]))
    _, active_learning_scores_unlabeled = get_active_learning_scores(
        df_labeled['label'].to_numpy(), dummy_pred_probs, pred_probs_unlabeled
    )
    
    print('getting idx to relabel')
    # get the indices of examples to collect more labels for
    relabel_idx_unlabeled = get_idx_to_label(
        active_learning_scores_unlabeled=active_learning_scores_unlabeled,
        batch_size_to_label=batch_size_to_label,
    )
    
    print('setting up next iter')
    # format the data for the next round of active learning, ie. moving some unlabeled 
    # examples to the labeled pool because we are collecting labels for them
    df_labeled, df_unlabeled = setup_next_iter_data(df_labeled, df_unlabeled, relabel_idx_unlabeled)
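
To check on your own data how much the scores help compared with labeling at random, one option is to rerun the same loop with a random selector in place of `get_idx_to_label`. The function below is a hypothetical baseline written for illustration (`get_idx_random` and its `seed` parameter are assumptions, not part of cleanlab or this notebook's utilities); it ignores the scores and only uses their length.

```python
import random

def get_idx_random(active_learning_scores_unlabeled, batch_size_to_label, seed=0):
    """Baseline selector: ignore the scores and pick a uniformly random batch of indices."""
    num_unlabeled = len(active_learning_scores_unlabeled)
    batch_size = min(batch_size_to_label, num_unlabeled)  # never ask for more than the pool holds
    return random.Random(seed).sample(range(num_unlabeled), batch_size)

# Hypothetical scores for a pool of 5 unlabeled examples.
idx = get_idx_random([0.2, 0.9, 0.5, 0.7, 0.1], batch_size_to_label=3)
print(idx)
```

Passing a different `seed` each round (for example the round number) avoids selecting the same positions every time. Comparing the two accuracy curves over the same number of rounds gives a direct view of the labeling budget saved.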

## Evaluate results

From the plot below, we can see that the model accuracy increases steadily with each additional round of collecting more labels and model training.

In [None]:
print(f"Initial model test accuracy: {model_accuacy_arr[0]:.3}")
print(f"Final model test accuracy (after {num_rounds} rounds of active learning): {model_accuacy_arr[-1]:.3}")

In [None]:
np.save("model_acc_30_rounds_activelab", model_accuacy_arr)

In [None]:
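plt.plot(model_accuacy_arr)
plt.xticks(range(num_rounds))
plt.xlabel("Round")
plt.ylabel("Model Accuracy")
# save before showing: after plt.show() the figure may already be closed,
# in which case savefig would write an empty image
plt.savefig('model_acc_30_rounds_activelab.png')
plt.show()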

[1] Griffin, G., Holub, A., & Perona, P. (2022). Caltech 256 (1.0) [Data set]. CaltechDATA. https://doi.org/10.22002/D1.20087