# Select the best data to label next with Active Learning

**Prerequisites**:
You need to have encord-active [installed](https://docs.encord.com/active/docs/installation).

This notebook shows you how to plug your model into Encord Active and use it to select the best data to label next in the MNIST sandbox project.

It follows four steps:
1. Download the MNIST sandbox project.
2. Train a model with labeled data from the project.
3. Run the Entropy acquisition function powered by the model to score project data.
4. Rank and sample unlabelled data to label next.
   1. \[Optional\] Visualize sampled data and scores.

**Note**: Because the MNIST dataset is fully labeled from the start, we train the model on only a subset of the project data and treat the complement of that subset as the pool of data to choose from when deciding what to label next.
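
The split described in the note can be pictured as partitioning the project's data identifiers into a training subset and its complement. The following is a minimal sketch using only the standard library. The helper `split_labeled_and_pool` and the `hash-…` identifiers are hypothetical and are not part of Encord Active; the notebook itself obtains its subset with `get_data_hashes_from_project`, whose selection strategy is not described here.

```python
import random

def split_labeled_and_pool(ids, subset_size, seed=0):
    """Return (labeled, pool): a random subset of `ids` and its complement.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    labeled = rng.sample(ids, subset_size)
    labeled_set = set(labeled)
    # Everything not chosen for training is treated as "unlabeled".
    pool = [i for i in ids if i not in labeled_set]
    return labeled, pool

all_ids = [f"hash-{i}" for i in range(10)]  # made-up identifiers
labeled, pool = split_labeled_and_pool(all_ids, subset_size=4)
```

The model only ever sees `labeled` during training, while candidates to label next are drawn from `pool`.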

## Download the MNIST sandbox project

In [None]:
from pathlib import Path
from encord_active.lib.project.project_file_structure import ProjectFileStructure
from encord_active.lib.project.sandbox_projects import fetch_prebuilt_project

project_name = "[open-source][test]-mnist-dataset"

# Choose where to store the project
project_path = Path.cwd() / project_name

# Download the project
fetch_prebuilt_project(project_name, project_path)

project_fs = ProjectFileStructure(project_path)

class_name = "digit"  # name of the classification to work with
subset_size = 5000  # number of data samples used to train the model
batch_size_to_label = 100  # number of data samples selected to label next

## Train a model with labeled data from the project

In [None]:
from typing import List
import numpy as np
from PIL import Image
from sklearn.ensemble import RandomForestClassifier
from encord_active.lib.common.active_learning import get_data, get_data_hashes_from_project
from encord_active.lib.metrics.acquisition_metrics.common import SKLearnClassificationModel

def transform_image_data(images: List[Image.Image]) -> List[np.ndarray]:
    # Flatten each image into a 1-D vector and scale pixel values to [0, 1]
    return [np.asarray(image).flatten() / 255 for image in images]

forest = RandomForestClassifier(n_estimators=500)

# Wrap the model so that it exposes the interface the acquisition function expects
w_model = SKLearnClassificationModel(forest)

# Read the project data and transform the images into flat feature vectors
data_hashes = get_data_hashes_from_project(project_fs, subset_size)
X, y = get_data(project_fs, data_hashes, class_name)
X = transform_image_data(X)

w_model._model.fit(X, y)

## Run the Entropy acquisition function powered by the model to score project data

In [None]:
from encord_active.lib.common.active_learning import get_metric_results
from encord_active.lib.metrics.acquisition_metrics.acquisition_functions import Entropy
from encord_active.lib.metrics.execute import execute_metrics

acq_func = Entropy(w_model)

# Run the acquisition function
execute_metrics([acq_func], data_dir=project_fs.project_dir, use_cache_only=True)

# Get the data scores
acq_func_results = get_metric_results(project_fs, acq_func)
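
To give an intuition for what an entropy-based score measures, here is a minimal sketch of Shannon entropy computed over a single predicted class distribution. It is an illustration under assumptions: the function `entropy` below is hypothetical, and Encord Active's `Entropy` implementation may differ in details such as the logarithm base or normalization.

```python
import math

def entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution.

    Zero-probability classes contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in probabilities if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]  # model is nearly sure of one class
uncertain = [0.25, 0.25, 0.25, 0.25]  # model cannot tell the classes apart

confident_score = entropy(confident)
uncertain_score = entropy(uncertain)
```

The uniform prediction yields the larger score, so samples the model is least sure about rank highest and are the most informative candidates to label.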

## Rank and sample unlabelled data to label next

In [None]:
from encord_active.lib.common.active_learning import get_n_best_ranked_data_samples

data_to_label_next, scores = get_n_best_ranked_data_samples(
    acq_func_results, 
    batch_size_to_label, 
    rank_by="desc", 
    exclude_data_hashes=data_hashes)
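
Conceptually, this step drops the samples already used for training, sorts the rest by score, and keeps the top entries. The sketch below shows that idea in plain Python. The function `n_best_ranked`, its parameters, and the example scores are hypothetical; this is not the implementation of `get_n_best_ranked_data_samples`.

```python
def n_best_ranked(scores_by_id, n, descending=True, exclude=()):
    """Return (ids, scores) of the n best-ranked samples not in `exclude`.

    Hypothetical helper for illustration only.
    """
    excluded = set(exclude)
    candidates = [(i, s) for i, s in scores_by_id.items() if i not in excluded]
    # Highest scores first when ranking in descending order.
    candidates.sort(key=lambda pair: pair[1], reverse=descending)
    top = candidates[:n]
    return [i for i, _ in top], [s for _, s in top]

example_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}  # made-up values
picked_ids, picked_scores = n_best_ranked(example_scores, n=2, exclude=["b"])
```

Here `"b"` has the highest score but is excluded because it stands in for data the model was trained on, so the selection falls to `"d"` and `"c"`.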

### [Optional] Visualize sampled data and scores

In [None]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
from encord_active.lib.common.active_learning import get_data_from_data_hashes

image_paths, _ = get_data_from_data_hashes(project_fs, data_to_label_next, class_name)

rows, cols = 10, 10
fig, axs = plt.subplots(rows, cols, figsize=(10, 8))

for i in range(rows):
    for j in range(cols):
        index = i * cols + j
        axs[i, j].imshow(mpimg.imread(image_paths[index]))
        axs[i, j].set_title(round(scores[index], 2))
        axs[i, j].axis('off')
        
plt.subplots_adjust(wspace=1.5, hspace=0.4)
plt.show()