<a href="https://colab.research.google.com/github/akanksha0911/Assignment-5-Continual-ML-and-Active-Learning/blob/main/Part2_activeLearning_with_lightly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **End-to-end demonstration of active learning with Lightly: Classification Task**

Ref: https://docs.lightly.ai/tutorials/platform/tutorial_active_learning.html#active-learning

I use logistic regression on top of the embeddings as the classifier. This is equivalent to treating the embedding model as a frozen pretrained backbone and fine-tuning only a single-layer classification head on the labeled data. Since the embeddings are already computed, they can be fed directly into the classification head.
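
To make the "frozen backbone plus linear head" idea concrete, here is a minimal sketch in plain Python. The 2-D `embeddings`, the labels, and the helper names `train_head` and `predict` are illustrative assumptions, not part of the notebook. The notebook itself uses scikit-learn's `LogisticRegression`. The point is only that the embeddings stay fixed while the weights and bias of one linear layer are trained.

```python
import math

# Toy stand-ins for precomputed embeddings (2-D here; real ones are much wider).
# They are never modified, which is what "frozen backbone" means in practice.
embeddings = [(0.0, 0.2), (0.3, 0.1), (0.2, 0.4), (1.0, 0.9), (0.8, 1.1), (1.2, 0.7)]
labels = [0, 0, 0, 1, 1, 1]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_head(xs, ys, lr=0.5, epochs=2000):
    """Fit a single linear layer (weights + bias) with batch gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            # Gradient of the logistic loss with respect to the logit.
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def predict(w, b, x) -> int:
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


weights, bias = train_head(embeddings, labels)
train_accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(embeddings, labels)
) / len(labels)
print(f"training accuracy of the linear head: {train_accuracy}")
```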

This workflow has the following structure:

1. Choose an initial subset of your dataset, e.g. using one of Lightly's selection strategies such as CORESET. Label this initial subset and train your model on it.

Next, the active learning loop starts:

2. Train a classifier on the labeled set.
3. Use the classifier to predict on the unlabeled set.
4. Calculate active learning scores from the predictions.
5. Use an active learning agent to choose the next samples to be labeled based on the scores.
6. Update the labeled set to include the newly chosen samples and remove them from the unlabeled set.
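
The loop described above can be sketched without any external service. Everything in the block below is a toy assumption: a 1-D pool, an `oracle` function standing in for the human annotator, and a threshold "classifier". It is not the Lightly API, only the train / predict / score / select / update cycle in miniature.

```python
import math


def oracle(x: float) -> int:
    """Stand-in for a human annotator (hypothetical labeling rule)."""
    return int(x >= 0.5)


def fit(labeled: dict) -> float:
    """Toy classifier: a threshold halfway between the two class means."""
    zeros = [x for x, y in labeled.items() if y == 0]
    ones = [x for x, y in labeled.items() if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict_proba(threshold: float, x: float) -> float:
    """Probability of class 1; close to 0.5 near the threshold."""
    return 1.0 / (1.0 + math.exp(-20.0 * (x - threshold)))


def uncertainty(p: float) -> float:
    """Least-confidence score: 0 when certain, 0.5 when undecided."""
    return 1.0 - max(p, 1.0 - p)


pool = [i / 40 for i in range(41)]       # unlabeled pool: 0.0, 0.025, ..., 1.0
initial = pool[::10]                     # evenly spread initial subset (5 points)
labeled = {x: oracle(x) for x in initial}
unlabeled = [x for x in pool if x not in labeled]

for iteration in range(3):
    threshold = fit(labeled)                                            # train
    scores = {x: uncertainty(predict_proba(threshold, x)) for x in unlabeled}  # predict and score
    chosen = sorted(unlabeled, key=scores.get, reverse=True)[:4]        # select most uncertain
    labeled.update({x: oracle(x) for x in chosen})                      # label the new samples
    unlabeled = [x for x in unlabeled if x not in chosen]               # remove them from the pool
    print(f"iteration {iteration}: {len(labeled)} labeled, {len(unlabeled)} unlabeled")
```

In this toy setup the selected points cluster around the decision boundary, which is the behaviour uncertainty-based selection is meant to produce.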

I have used the clothing-dataset-small.

## **Downloading the dataset:**

The dataset’s images are RGB images with a few hundred pixels in width and height. They show clothes from 10 different classes, like dresses, hats or t-shirts. The dataset is already split into a train, test, and validation set, and all images for one class are put into one folder.

In [None]:
!git clone https://github.com/alexeygrigorev/clothing-dataset-small.git

### **Creation of the dataset on the Lightly Platform with embeddings**


In [None]:

!pip install lightly

In [None]:
!lightly-magic input_dir="./clothing-dataset-small/train" trainer.max_epochs=0 token="9c33f696ce0cc1c2d0ab268254904105ae3669a2d75cdfa1"     new_dataset_name="active_learning_clothing_dataset" loader.num_workers=8


In [None]:
!pip install numpy
!pip install scikit-learn



## **Active learning**

In [None]:
import os
import csv
from typing import List, Dict, Tuple
import numpy as np
from sklearn.linear_model import LogisticRegression

from lightly.active_learning.agents.agent import ActiveLearningAgent
from lightly.active_learning.config.selection_config import SelectionConfig
from lightly.active_learning.scorers.classification import ScorerClassification
from lightly.api.api_workflow_client import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import SamplingMethod

Define the dataset for the classifier based on embeddings.csv.
The LogisticRegression classifier needs the embeddings as its input features.
We therefore define a class that builds such a dataset from embeddings.csv.
It also lets us select only a subset of the samples by passing their filenames.



In [None]:
class CSVEmbeddingDataset:
    def __init__(self, path_to_embeddings_csv: str):
        with open(path_to_embeddings_csv, 'r') as f:
            data = csv.reader(f)

            rows = list(data)
            header_row = rows[0]
            rows_without_header = rows[1:]

            index_filenames = header_row.index('filenames')
            filenames = [row[index_filenames] for row in rows_without_header]

            index_labels = header_row.index('labels')
            labels = [row[index_labels] for row in rows_without_header]

            embeddings = rows_without_header
            indexes_to_delete = sorted([index_filenames, index_labels], reverse=True)
            for embedding_row in embeddings:
                for index_to_delete in indexes_to_delete:
                    del embedding_row[index_to_delete]

        # create the dataset as a dictionary mapping from the filename to a tuple of the embedding and the label
        self.dataset: Dict[str, Tuple[np.ndarray, int]] = \
            dict([(filename, (np.array(embedding_row, dtype=float), int(label)))
                  for filename, embedding_row, label in zip(filenames, embeddings, labels)])

    def get_features(self, filenames: List[str]) -> np.ndarray:
        features_array = np.array([self.dataset[filename][0] for filename in filenames])
        return features_array

    def get_labels(self, filenames: List[str]) -> np.ndarray:
        labels = np.array([self.dataset[filename][1] for filename in filenames])
        return labels
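
To show the file layout the class above expects, here is a small self-contained sketch that parses an in-memory CSV the same way. The sample rows and the `embedding_0` ... `embedding_2` column names are made up for illustration. The class only relies on the `filenames` and `labels` columns and treats every other column as part of the embedding. `parse_embeddings` is a hypothetical helper, not part of Lightly.

```python
import csv
import io

# Hypothetical two-sample file with 3-dimensional embeddings.
example_csv = """filenames,embedding_0,embedding_1,embedding_2,labels
dress/img_001.jpg,0.12,-0.40,0.88,0
hat/img_002.jpg,-0.31,0.07,0.55,1
"""


def parse_embeddings(text: str) -> dict:
    """Map each filename to (embedding as a list of floats, integer label)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    i_name = header.index("filenames")
    i_label = header.index("labels")
    parsed = {}
    for row in body:
        # Every column that is neither the filename nor the label is a feature.
        embedding = [float(v) for i, v in enumerate(row) if i not in (i_name, i_label)]
        parsed[row[i_name]] = (embedding, int(row[i_label]))
    return parsed


parsed = parse_embeddings(example_csv)
for filename, (embedding, label) in parsed.items():
    print(filename, embedding, label)
```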

First we set the API token, the path to embeddings.csv, and the dataset ID. Here they are written directly into the notebook rather than read from environment variables.



In [None]:
token = "9c33f696ce0cc1c2d0ab268254904105ae3669a2d75cdfa1"
path_to_embeddings_csv = "/content/lightly_outputs/2022-11-12/06-12-31/embeddings.csv"
dataset_id = "636f3962bd19c4ff7601b735"

# We define the client to the Lightly Platform API
api_workflow_client = ApiWorkflowClient(token=token, dataset_id=dataset_id)
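
A minimal sketch of reading these three values from environment variables instead of writing them into the notebook. The variable names `LIGHTLY_TOKEN`, `LIGHTLY_DATASET_ID` and `LIGHTLY_EMBEDDINGS_CSV`, the placeholder defaults, and the `read_settings` helper are all assumptions chosen for illustration. Use whatever names you exported in your shell.

```python
import os


def read_settings(env=os.environ) -> dict:
    """Collect the notebook settings from a mapping such as os.environ.

    The variable names and placeholder defaults are hypothetical.
    """
    return {
        "token": env.get("LIGHTLY_TOKEN", "YOUR_TOKEN"),
        "dataset_id": env.get("LIGHTLY_DATASET_ID", "YOUR_DATASET_ID"),
        "path_to_embeddings_csv": env.get("LIGHTLY_EMBEDDINGS_CSV", "path/to/embeddings.csv"),
    }


settings = read_settings()
# Report only whether a real token was found; never print the token itself.
print("token configured:", settings["token"] != "YOUR_TOKEN")
```

The resulting values could then be passed on as `token=settings["token"]` and `dataset_id=settings["dataset_id"]`, which keeps the secret out of the saved notebook.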


We define the dataset, the classifier, and the active learning agent.



In [None]:
dataset = CSVEmbeddingDataset(path_to_embeddings_csv=path_to_embeddings_csv)
classifier = LogisticRegression(max_iter=1000)
agent = ActiveLearningAgent(api_workflow_client=api_workflow_client)

1. Choose an initial subset of your dataset.
We want to start with 200 samples and use the CORESET selection strategy for selecting them.



In [None]:
print("Starting the initial selection")
selection_config = SelectionConfig(n_samples=200, method=SamplingMethod.CORESET, name='initial-selection')
agent.query(selection_config=selection_config)
print(f"There are {len(agent.labeled_set)} samples in the labeled set.")

Starting the initial selection
There are 200 samples in the labeled set.
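
For intuition about diversity-based selection, here is a sketch of the generic greedy k-center (farthest-point) heuristic that coreset-style methods are commonly built on. It is an assumption that this resembles what happens behind `SamplingMethod.CORESET`. The actual selection runs on the Lightly Platform and may differ. The toy 2-D points and the `greedy_k_center` helper are illustrative only.

```python
import math


def greedy_k_center(points, n_samples, first_index=0):
    """Repeatedly pick the point farthest from everything already selected."""
    selected = [first_index]
    # Distance from each point to its nearest selected point so far.
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(selected) < n_samples:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        min_dist = [
            min(d, math.dist(p, points[next_index]))
            for d, p in zip(min_dist, points)
        ]
    return selected


# Two tight clusters plus one outlier (toy 2-D "embeddings").
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
selected = greedy_k_center(points, n_samples=3)
print(selected)
```

With three samples the heuristic takes one point from each region of the toy space instead of three near-duplicates from the same cluster.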


2. Train a classifier on the labeled set.



In [None]:
labeled_set_features = dataset.get_features(agent.labeled_set)
labeled_set_labels = dataset.get_labels(agent.labeled_set)
classifier.fit(X=labeled_set_features, y=labeled_set_labels)

LogisticRegression(max_iter=1000)

3. Use the classifier to predict on the query set.



In [None]:
query_set_features = dataset.get_features(agent.query_set)
predictions = classifier.predict_proba(X=query_set_features)

4. Calculate active learning scores from the prediction.



In [None]:
active_learning_scorer = ScorerClassification(model_output=predictions)
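
As a rough illustration of what "active learning scores" computed from class probabilities can look like, the sketch below derives three common uncertainty measures from one row of `predict_proba` output. The function name, the score names, and the exact normalisation are assumptions. The definitions used inside `ScorerClassification` may differ.

```python
import math


def uncertainty_scores(probabilities):
    """Return three common uncertainty scores for one row of class probabilities.

    All three are scaled so that higher means "less certain".
    """
    ranked = sorted(probabilities, reverse=True)
    least_confidence = 1.0 - ranked[0]
    margin = 1.0 - (ranked[0] - ranked[1])
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    normalized_entropy = entropy / math.log(len(probabilities))
    return {
        "least_confidence": least_confidence,
        "margin": margin,
        "entropy": normalized_entropy,
    }


confident = uncertainty_scores([0.9, 0.05, 0.05])
undecided = uncertainty_scores([0.34, 0.33, 0.33])
for name in confident:
    print(f"{name}: confident={confident[name]:.3f} undecided={undecided[name]:.3f}")
```

Samples with high scores are the ones the classifier is least sure about, which makes them natural candidates for labeling next.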

5. Use an active learning agent to choose the next samples to be labeled based on the active learning scores.
We want to select another 100 samples using the CORAL active learning strategy. Note that n_samples is the total size of the labeled set after the query, so we pass 300 rather than 100.



In [None]:
selection_config = SelectionConfig(n_samples=300, method=SamplingMethod.CORAL, name='al-iteration-1')
agent.query(selection_config=selection_config, al_scorer=active_learning_scorer)
print(f"There are {len(agent.labeled_set)} samples in the labeled set.")

There are 300 samples in the labeled set.


6. Update the labeled set to include the newly chosen samples and remove them from the unlabeled set.
This is already done internally inside the ActiveLearningAgent - no work for you :)



In [None]:
labeled_set_features = dataset.get_features(agent.labeled_set)
labeled_set_labels = dataset.get_labels(agent.labeled_set)
classifier.fit(X=labeled_set_features, y=labeled_set_labels)

# evaluate on unlabeled set
unlabeled_set_features = dataset.get_features(agent.unlabeled_set)
unlabeled_set_labels = dataset.get_labels(agent.unlabeled_set)
accuracy_on_unlabeled_set = classifier.score(X=unlabeled_set_features, y=unlabeled_set_labels)
print(f"accuracy on unlabeled set: {accuracy_on_unlabeled_set}")

accuracy on unlabeled set: 0.38078034682080925


We can now retrain the classifier on the newly chosen labeled set. Here I evaluated it on the unlabeled set; it could also be evaluated on embeddings of a test set generated beforehand. If we are not satisfied with the performance, we can run steps 2 to 5 again.
