# **DISTIL Usage Example: Iris**

Here, we show how to use DISTIL to perform active learning on tabular data, using the Iris dataset as the example. This notebook can be run as-is on Google Colab.

## Installations and Imports

In [None]:
# Get DISTIL
!git clone https://github.com/decile-team/distil.git
!pip install -r distil/requirements/requirements.txt

# Get Iris dataset, which is kept in our datasets repository
!git clone https://github.com/decile-team/datasets.git

import numpy as np
import pandas as pd
import sys
import torch

from sklearn.preprocessing import StandardScaler
from torch import nn
from torch.utils.data import TensorDataset

sys.path.append('distil/')
from distil.utils.utils import LabeledToUnlabeledDataset                        # Converts a PyTorch dataset with labels to one without labels
from distil.active_learning_strategies.core_set import CoreSet                   # Our choice of active learning strategy for this example
from distil.utils.models.simple_net import TwoLayerNet                          # Our choice of model for this example
from distil.utils.train_helper import data_train                                # The training loop used in between AL selections

## Preparing Iris

The data for Iris is prepared in this step. Here, we load the data, partition the train dataset into a labeled seed set and an unlabeled set, and build the PyTorch dataset objects used in training and by DISTIL. Note that although StandardScaler is imported above, the code below does not apply any normalization to the features.

In [None]:
data_path = 'datasets/iris.csv'
test_path = 'datasets/iris_test.csv'

# Iris is a very simple dataset. There are only 3 classes with 50 examples each.
nclasses = 3    

# Load the train csv with pandas. The last column holds the labels; all other columns are input features.
df = pd.read_csv(data_path)
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:, -1].to_numpy()

# Here, we define our initial labeled seed set. We simply take the first 10 points.
X_tr = X[:10]
y_tr = y[:10]

# Here, we define the unlabeled set as the remainder of the original train set.
X_unlabeled = X[10:]
y_unlabeled = y[10:]

# As before, load the test csv.
df_test = pd.read_csv(test_path)
X_test = df_test.iloc[:,:-1].to_numpy()
y_test = df_test.iloc[:, -1].to_numpy()

# Record the number of samples and the dimensionality of the samples
nSamps, dim = np.shape(X)

# Lastly, we create the PyTorch dataset objects. Here, the unlabeled dataset technically has labels;
# however, we will explicitly remove these labels when it is used by DISTIL's active learning strategy.
# It only contains the labels in this notebook for the sake of experimental design.
training_dataset = TensorDataset(torch.tensor(X_tr, dtype=torch.float), torch.tensor(y_tr, dtype=torch.long))
unlabeled_dataset = TensorDataset(torch.tensor(X_unlabeled, dtype=torch.float), torch.tensor(y_unlabeled, dtype=torch.long))
test_dataset = TensorDataset(torch.tensor(X_test, dtype=torch.float), torch.tensor(y_test, dtype=torch.long))
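
The cell above passes raw feature values to the network. If normalization is wanted, a common approach is to standardize each feature with statistics computed on the training features only, then reuse those statistics on the test features. Below is a minimal NumPy sketch on made-up values (not the Iris data). The helper names `standardize_fit` and `standardize_apply` are hypothetical; the arithmetic mirrors what scikit-learn's `StandardScaler` computes by default (per-column mean and population standard deviation).

```python
import numpy as np

def standardize_fit(X):
    """Return the per-column mean and standard deviation of X (hypothetical helper)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # avoid dividing by zero for constant columns
    return mean, std

def standardize_apply(X, mean, std):
    """Shift and scale X with previously computed statistics."""
    return (X - mean) / std

# Made-up feature values, for illustration only.
X_train_demo = np.array([[5.0, 3.0], [6.0, 3.5], [7.0, 4.0]])
X_test_demo = np.array([[6.0, 3.0]])

# Fit on the training features only, then apply the same statistics everywhere.
mean, std = standardize_fit(X_train_demo)
X_train_scaled = standardize_apply(X_train_demo, mean, std)
X_test_scaled = standardize_apply(X_test_demo, mean, std)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

In this notebook, the same idea would apply to `X` and `X_test` before the seed/unlabeled split, so that all three datasets share one scaling.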

## Preparing the Model

Here, we use DISTIL's two-layer network, which has a single hidden layer with ReLU activations. We specify the input dimension via the first argument, the number of output classes via the second argument, and the number of hidden units via the third argument. The weights of each linear layer are then initialized with Xavier uniform initialization, and the biases are set to 0.01.

In [None]:
def init_weights(m):
    """
    Used to initialize network weights
    """

    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = TwoLayerNet(dim, nclasses, dim*2)
net.apply(init_weights)
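
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass of a network with one hidden ReLU layer. It is an illustration, not DISTIL's `TwoLayerNet` code: the function names are hypothetical, the sizes assume the four standard Iris features (so 4 inputs, 8 hidden units, 3 classes), and the bound sqrt(6 / (fan_in + fan_out)) is the standard Xavier uniform formula for a gain of 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Sample a (fan_in, fan_out) matrix from U(-b, b) with b = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def two_layer_forward(x, W1, b1, W2, b2):
    """Linear -> ReLU -> Linear; returns one row of class scores (logits) per input row."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2

in_dim, hidden_dim, n_classes = 4, 8, 3   # assumed sizes: dim, dim*2, nclasses
W1, b1 = xavier_uniform(in_dim, hidden_dim), np.full(hidden_dim, 0.01)
W2, b2 = xavier_uniform(hidden_dim, n_classes), np.full(n_classes, 0.01)

x_demo = rng.normal(size=(5, in_dim))     # five made-up input rows
logits = two_layer_forward(x_demo, W1, b1, W2, b2)
print(logits.shape)  # (5, 3)
```

The predicted class for each row is the index of its largest logit, for example `logits.argmax(axis=1)`.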

## Defining the Active Learning Strategy

We now have all that we need to create the active learning strategy object. For this example, we use [CoreSet](https://arxiv.org/abs/1708.00489). The CoreSet strategy takes the current labeled dataset (training_dataset), the current unlabeled dataset (unlabeled_dataset, which has its labels stripped via the LabeledToUnlabeledDataset wrapper), the model (net), the number of classes (nclasses), and a dictionary of strategy arguments (strategy_args).
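
Conceptually, a label-stripping wrapper only needs to forward indexing to the wrapped dataset and return the features without the label. The class below is a hypothetical, framework-free sketch of that idea. It is not the source of DISTIL's `LabeledToUnlabeledDataset`, and it assumes that each item of the wrapped dataset is a `(features, label)` pair.

```python
class FeaturesOnly:
    """Hypothetical wrapper that hides the labels of a (features, label) dataset."""

    def __init__(self, wrapped_dataset):
        self.wrapped_dataset = wrapped_dataset

    def __len__(self):
        return len(self.wrapped_dataset)

    def __getitem__(self, index):
        # Unpack the pair and discard the label.
        features, _label = self.wrapped_dataset[index]
        return features

# A plain list of (features, label) pairs stands in for a labeled dataset.
labeled_demo = [([5.1, 3.5], 0), ([6.2, 2.9], 1), ([7.7, 3.0], 2)]
unlabeled_demo = FeaturesOnly(labeled_demo)
print(len(unlabeled_demo), unlabeled_demo[1])  # 3 [6.2, 2.9]
```

Because the wrapper never exposes the label, a selection strategy that receives it cannot accidentally use ground-truth labels from the unlabeled pool.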

In [None]:
# We specify the batch size that should be used when loading/handling data
# within the strategy.
strategy_args = {'batch_size' : 10}
strategy = CoreSet(training_dataset, LabeledToUnlabeledDataset(unlabeled_dataset), net, nclasses, strategy_args)
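
The CoreSet paper frames batch selection as a k-center problem and approximates it greedily: repeatedly pick the unlabeled point that is farthest from everything already labeled or chosen. Below is a minimal NumPy sketch of that greedy rule on made-up 2-D points. It is not DISTIL's implementation; `k_center_greedy` is a hypothetical name, and measuring Euclidean distance on raw features rather than on model embeddings is a simplifying assumption.

```python
import numpy as np

def k_center_greedy(labeled, unlabeled, budget):
    """Pick `budget` row indices of `unlabeled`, each time taking the point
    whose distance to its nearest already-covered point is largest."""
    # Distance from every unlabeled point to its nearest labeled point.
    diffs = unlabeled[:, None, :] - labeled[None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)

    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # The newly chosen point may now be the nearest center for other points.
        dist_to_new = np.linalg.norm(unlabeled - unlabeled[idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

labeled_demo = np.array([[0.0, 0.0]])
unlabeled_demo = np.array([[1.0, 0.0], [10.0, 0.0], [0.0, 6.0], [9.0, 0.0]])
print(k_center_greedy(labeled_demo, unlabeled_demo, budget=2))  # [1, 2]
```

In the example, the point at (10, 0) is chosen first because it is farthest from the labeled point. The point at (9, 0) is then well covered by it, so the second pick is (0, 6).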

## Perform the AL Loop

We can now begin the active learning loop. Here, we train the model with DISTIL's utility training loop. In each round, we select points using the CoreSet strategy, label them, add them to the train dataset, and retrain the model, repeating for a fixed number of rounds.

In [None]:
# Define the training loop arguments. Here, we specify that training should stop after 
# 150 epochs. Internally, the training class also stops at 0.95 training accuracy. This 
# can be changed by specifying a max_accuracy parameter here.
train_args = {'n_epoch': 150, 'lr': 0.001, 'batch_size': 5, 'optimizer': 'sgd'}
n_rounds = 11     # Number of rounds: the initial seed-set round plus 10 AL selection rounds
budget = 4        # Size of subset to label during each AL selection

# Keep track of the test accuracy obtained at each round to measure progress
acc = np.zeros(n_rounds)

# Create the training loop class.
dt = data_train(training_dataset, net, train_args)

# Do one round of training. To make accurate selections, the model must at least be trained on the seed set data.
clf = dt.train()

# Update the active learning strategy's stored model.
strategy.update_model(clf)

# Use the active learning strategy's predict() method to obtain model predictions on the test features.
# Obtain initial test accuracy using predictions.
y_pred = strategy.predict(LabeledToUnlabeledDataset(test_dataset)).cpu().numpy()
acc[0] = (1.0*(y_test == y_pred)).sum().item() / len(y_test)
print('Initial Testing accuracy:', round(acc[0], 3), flush=True)

# User-Controlled Loop
for rd in range(1, n_rounds):
    print('-------------------------------------------------')
    print('Round', rd) 
    print('-------------------------------------------------')

    # The main functionality of the active learning class: the select() function.
    # It retrieves the indices of points in the unlabeled set that should be labeled 
    # and added to the training set.
    idx = strategy.select(budget)
    print('New data points added -', len(idx))

    # Add the new points to the training set by modifying the underlying numpy arrays.
    # The selected features are concatenated to the training set features and then
    # removed from the unlabeled set.
    X_tr = np.concatenate((X_tr, X_unlabeled[idx]), axis=0)
    X_unlabeled = np.delete(X_unlabeled, idx, axis = 0)

    # Here, we concatenate the labels of the selected points to the labels of the training set.
    # This step is done by a human annotator in actual applications; here, it is done
    # automatically because we already know the labels.
    y_tr = np.concatenate((y_tr, y_unlabeled[idx]), axis = 0)
    y_unlabeled = np.delete(y_unlabeled, idx, axis = 0)
    print('Number of training points -',X_tr.shape[0])
    print('Number of labels -', y_tr.shape[0])
    print('Number of unlabeled points -', X_unlabeled.shape[0])

    # Update the PyTorch dataset objects.
    training_dataset = TensorDataset(torch.tensor(X_tr, dtype=torch.float), torch.tensor(y_tr, dtype=torch.long))
    unlabeled_dataset = TensorDataset(torch.tensor(X_unlabeled, dtype=torch.float), torch.tensor(y_unlabeled, dtype=torch.long))

    # Update the data used in the active learning strategy and the training loop
    strategy.update_data(training_dataset, LabeledToUnlabeledDataset(unlabeled_dataset))
    dt.update_data(training_dataset)

    # Retrain the model using the new labeled data. Update the active learning strategy using 
    # the newly trained model.
    clf = dt.train()
    strategy.update_model(clf)

    # Get the test accuracy as before.
    y_pred = strategy.predict(LabeledToUnlabeledDataset(test_dataset)).cpu().numpy()
    acc[rd] = round(1.0 * (y_test == y_pred).sum().item() / len(y_test), 3)
    print('Testing accuracy:', acc[rd], flush=True)

    # We add an additional condition here to stop once test accuracy exceeds 0.98. In practice,
    # you would typically want to stop once a target test accuracy is reached anyway.
    if acc[rd] > 0.98:
        print('Testing accuracy reached above 98%, stopping training!')
        break
        
print('Training Completed')
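
The index bookkeeping inside the loop can be checked in isolation. The sketch below uses small made-up arrays and a hard-coded `idx` in place of the indices returned by `strategy.select(budget)`. It also works out how many labeled points the settings above can produce (a seed set of 10 plus up to 10 selections of 4), assuming the early-stopping condition never triggers.

```python
import numpy as np

X_tr_demo = np.array([[0.0], [1.0]])
y_tr_demo = np.array([0, 1])
X_unl_demo = np.array([[2.0], [3.0], [4.0], [5.0]])
y_unl_demo = np.array([0, 1, 2, 0])

idx = [3, 0]  # stand-in for the indices returned by a selection strategy

# Copy the selected rows (and their labels) into the training set first,
# then delete them from the unlabeled pool.
X_tr_demo = np.concatenate((X_tr_demo, X_unl_demo[idx]), axis=0)
y_tr_demo = np.concatenate((y_tr_demo, y_unl_demo[idx]), axis=0)
X_unl_demo = np.delete(X_unl_demo, idx, axis=0)
y_unl_demo = np.delete(y_unl_demo, idx, axis=0)

print(X_tr_demo.ravel(), y_tr_demo)    # [0. 1. 5. 2.] [0 1 0 0]
print(X_unl_demo.ravel(), y_unl_demo)  # [3. 4.] [1 2]

# Upper bound on the labeled set size with the settings used above.
seed_size, n_rounds, budget = 10, 11, 4
max_labeled = seed_size + (n_rounds - 1) * budget
print(max_labeled)  # 50
```

The order of operations matters: the selected rows must be read from the unlabeled arrays before `np.delete` is applied, because deletion shifts the positions that `idx` refers to.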