# **DISTIL Usage Example: IJCNN1**

Here, we show how to use DISTIL to perform active learning on tabular data (IJCNN1). This notebook can be easily executed on Google Colab.

## Installations and Imports

In [None]:
# Get DISTIL
!git clone https://github.com/decile-team/distil.git
!pip install -r distil/requirements/requirements.txt

# Get IJCNN1 dataset, which is kept in our datasets repository
!git clone https://github.com/decile-team/datasets.git

import numpy as np
import sys
import torch

from sklearn.preprocessing import StandardScaler
from torch import nn
from torch.utils.data import TensorDataset

sys.path.append('distil/')
from distil.utils.utils import LabeledToUnlabeledDataset          # Converts a PyTorch dataset with labels to one without labels
from distil.active_learning_strategies.glister import GLISTER     # Our choice of active learning strategy for this example
from distil.utils.models.simple_net import TwoLayerNet            # Our choice of model for this example
from distil.utils.train_helper import data_train                  # The training loop used in between AL selections

## Preparing IJCNN1

The data for IJCNN1 is prepared in this step. Here, we load the data, normalize it, partition the train dataset into a labeled seed set and unlabeled set, and formulate PyTorch datasets. The main output of this step is to create the PyTorch dataset objects used in training and by DISTIL.
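
The loader in the next cell reads LIBSVM-style sparse text: each line holds a label followed by index:value pairs, where indices start at 1 and any feature not listed is zero. As a minimal sketch of that per-line logic (the helper name parse_libsvm_line and the sample line are illustrative and not part of DISTIL):

```python
def parse_libsvm_line(line, dim):
    """Parse one 'label idx:val idx:val ...' line into (label, dense_features)."""
    tokens = line.strip().split()
    label = int(float(tokens[0]))            # labels may be written as "1" or "-1.0"
    features = [0.0] * dim                   # features not listed default to zero
    for token in tokens[1:]:
        idx, val = token.split(":")
        features[int(idx) - 1] = float(val)  # LIBSVM indices start at 1
    return label, features


# Hypothetical sample line with 4 features, two of them non-zero
label, features = parse_libsvm_line("-1 1:0.5 3:-2.0", dim=4)
print(label, features)
```

The full loader below applies the same steps to every line of a file and stacks the results into numpy arrays.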

In [None]:
def libsvm_file_load(path, dim, save_data=False):
    """
    Used to load IJCNN1. Returns a tuple of numpy arrays representing
    the tabular features and their corresponding labels.
    """

    data = []
    target = []
    with open(path) as fp:
        for line in fp:
            temp = line.strip().split(" ")
            target.append(int(float(temp[0])))  # Class label; not assumed to lie in 0..K-1
            temp_data = [0] * dim

            # Remaining tokens are "index:value" pairs with 1-based indices
            for i in temp[1:]:
                ind, val = i.split(':')
                temp_data[int(ind) - 1] = float(val)
            data.append(temp_data)
    X_data = np.array(data, dtype=np.float32)
    Y_label = np.array(target)
    if save_data:
        # Save the numpy files to the folder where they come from
        data_np_path = path + '.data.npy'
        target_np_path = path + '.label.npy'
        np.save(data_np_path, X_data)
        np.save(target_np_path, Y_label)
    return (X_data, Y_label)

# Specify locations of IJCNN1's train/val/test data
trn_file = 'datasets/ijcnn1/ijcnn1.trn'
val_file = 'datasets/ijcnn1/ijcnn1.val'
tst_file = 'datasets/ijcnn1/ijcnn1.tst'

# IJCNN1 has 22 input features and a binary label
data_dims = 22
num_cls = 2

# Retrieve numpy arrays for each part of the dataset
x_trn, y_trn = libsvm_file_load(trn_file, dim=data_dims)
x_val, y_val = libsvm_file_load(val_file, dim=data_dims)
x_tst, y_tst = libsvm_file_load(tst_file, dim=data_dims)
    
# The class labels are in {-1, 1}. Map them to {0, 1}.
y_trn[y_trn < 0] = 0
y_val[y_val < 0] = 0
y_tst[y_tst < 0] = 0    

# Normalize the data according to mean/std taken from train dataset
sc = StandardScaler()
x_trn = sc.fit_transform(x_trn)
x_val = sc.transform(x_val)
x_tst = sc.transform(x_tst)

# Record the number of samples in train dataset
nSamps, dim = np.shape(x_trn)

# Randomly choose indices of the train dataset on which to split
np.random.seed(42)
start_idxs = np.random.choice(nSamps, size=32, replace=False)

# Split the features into labeled seed set features and unlabeled set features
X_tr = x_trn[start_idxs]
X_unlabeled = np.delete(x_trn, start_idxs, axis = 0)

# Split the labels into labeled seed set labels and unlabeled set labels. The latter are
# typically not known a priori; however, we use them here to automatically label points.
y_tr = y_trn[start_idxs]
y_unlabeled = np.delete(y_trn, start_idxs, axis = 0)

# Lastly, we create the PyTorch dataset objects. Here, the unlabeled dataset technically has labels;
# however, we will explicitly remove these labels when it is used by DISTIL's active learning strategy.
# It only contains the labels in this notebook for the sake of experimental design.
training_dataset = TensorDataset(torch.tensor(X_tr), torch.tensor(y_tr, dtype=torch.long))
unlabeled_dataset = TensorDataset(torch.tensor(X_unlabeled), torch.tensor(y_unlabeled, dtype=torch.long))
test_dataset = TensorDataset(torch.tensor(x_tst), torch.tensor(y_tst, dtype=torch.long))

## Preparing the Model

Here, we use DISTIL's two-layer network, which has a single hidden layer with ReLU activations. The first argument specifies the input dimension, the second the number of output classes, and the third the number of hidden units. We then initialize the weights of each linear layer with Xavier uniform initialization and set the biases to 0.01.
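
To make the architecture and initialization concrete, here is a rough NumPy sketch of a forward pass through such a network. It assumes the model is Linear, then ReLU, then Linear, which matches the description above but is not taken from DISTIL's source; the function names are illustrative.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def two_layer_forward(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear; returns one logit per class for each row of x."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
in_dim, hidden_units, n_classes = 22, 100, 2   # same sizes as in this notebook

w1 = xavier_uniform(in_dim, hidden_units, rng)
b1 = np.full(hidden_units, 0.01)               # constant bias, as in init_weights
w2 = xavier_uniform(hidden_units, n_classes, rng)
b2 = np.full(n_classes, 0.01)

logits = two_layer_forward(rng.normal(size=(5, in_dim)), w1, b1, w2, b2)
print(logits.shape)  # (5, 2)
```

The notebook itself relies on TwoLayerNet and torch.nn.init.xavier_uniform_ rather than this hand-written version.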

In [None]:
def init_weights(m):
    """
    Used to initialize network weights
    """

    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = TwoLayerNet(dim, num_cls, 100)
net.apply(init_weights)

## Defining the Active Learning Strategy

We now have everything we need to create the active learning strategy object. For this example, we use [GLISTER-ACTIVE](https://arxiv.org/abs/2012.10630). The GLISTER strategy takes the current labeled dataset (training_dataset), the current unlabeled dataset (unlabeled_dataset, with its labels stripped by the LabeledToUnlabeledDataset wrapper), the model (net), the number of classes (num_cls), a dictionary of strategy arguments (strategy_args), a validation dataset (set to None here), the type of regularization that GLISTER uses (typeOf), and the corresponding regularization coefficient (lam).
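
The label-stripping step can be pictured with a small stand-in class. This is only a sketch under the assumption that the wrapper returns the features of each item and discards the label; FeaturesOnly is a hypothetical name, and DISTIL's LabeledToUnlabeledDataset may be implemented differently.

```python
class FeaturesOnly:
    """Hypothetical stand-in: exposes only the features of a (features, label) dataset."""

    def __init__(self, labeled_dataset):
        self.labeled_dataset = labeled_dataset

    def __len__(self):
        return len(self.labeled_dataset)

    def __getitem__(self, i):
        features, _label = self.labeled_dataset[i]  # drop the label
        return features


labeled = [([0.1, 0.2], 1), ([0.3, 0.4], 0)]        # toy (features, label) pairs
unlabeled_view = FeaturesOnly(labeled)
print(len(unlabeled_view), unlabeled_view[0])
```

Wrapping the dataset this way keeps the labels available to the notebook for automatic labeling while hiding them from the selection strategy.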

In [None]:
# Define the strategy args for GLISTER, which needs the learning rate used in training.
# We also specify a specific batch size that should be used when loading/handling data 
# within the strategy.
strategy_args = {'batch_size': 100, 'lr': 0.001}
strategy = GLISTER(
    training_dataset,
    LabeledToUnlabeledDataset(unlabeled_dataset),
    net,
    num_cls,
    strategy_args,
    validation_dataset=None,
    typeOf='Diversity',
    lam=10,
)

## Perform the AL Loop

We can now begin the active learning loop. Training is handled by DISTIL's utility training class, data_train. In each round, we select points using GLISTER, label them, add them to the training set, and retrain the model. This repeats for a fixed number of rounds or until a target test accuracy is reached.
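
The bookkeeping in each round amounts to moving the selected rows from the unlabeled pool into the labeled set. A minimal NumPy sketch on toy data follows; move_to_labeled is an illustrative helper, not a DISTIL function, and the indices are assumed to be positions in the current pool.

```python
import numpy as np

def move_to_labeled(X_lab, y_lab, X_pool, y_pool, idx):
    """Move the rows at positions `idx` of the pool into the labeled set."""
    # Copy the selected rows first, then delete them from the pool
    X_lab = np.concatenate((X_lab, X_pool[idx]), axis=0)
    y_lab = np.concatenate((y_lab, y_pool[idx]), axis=0)
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx, axis=0)
    return X_lab, y_lab, X_pool, y_pool

# Toy data: 2 labeled points and a pool of 4, each with 3 features
X_lab, y_lab = np.zeros((2, 3)), np.array([0, 1])
X_pool, y_pool = np.arange(12, dtype=float).reshape(4, 3), np.array([1, 0, 0, 1])

X_lab, y_lab, X_pool, y_pool = move_to_labeled(X_lab, y_lab, X_pool, y_pool, idx=[0, 2])
print(X_lab.shape, X_pool.shape)  # (4, 3) (2, 3)
```

Because the pool shrinks after every round, indices returned in one round are only valid for the pool as it stood when the selection was made.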

In [None]:
# Define the training loop arguments. Here, we specify that training should stop after 
# 150 epochs. Internally, the training class also stops at 0.95 training accuracy. This 
# can be changed by specifying a max_accuracy parameter here.
train_args = {'n_epoch': 150, 'lr': 0.001}
n_rounds = 10     # Number of training rounds: the initial seed-set round plus up to 9 selection rounds
budget = 32       # The budget of AL selection; defines how many points should be retrieved from the unlabeled dataset.

# Keep track of the test accuracy obtained at each round to measure progress
acc = np.zeros(n_rounds)

# Create the training loop class.
dt = data_train(training_dataset, net, train_args)

# Do one round of training. To make accurate selections, the model must at least be trained on the seed set data.
clf = dt.train()

# Update the active learning strategy's stored model.
strategy.update_model(clf)

# Use the active learning strategy's predict() method to obtain model predictions on the test features.
# Obtain initial test accuracy using predictions.
y_pred = strategy.predict(LabeledToUnlabeledDataset(test_dataset)).cpu().numpy()
acc[0] = (1.0*(y_tst == y_pred)).sum().item() / len(y_tst)
print('Initial Testing accuracy:', round(acc[0], 3), flush=True)

# User-Controlled Loop
for rd in range(1, n_rounds):
    print('-------------------------------------------------')
    print('Round', rd) 
    print('-------------------------------------------------')

    # The main functionality of the active learning class: the select() function.
    # It retrieves the indices of points in the unlabeled set that should be labeled 
    # and added to the training set.
    idx = strategy.select(budget)
    print('New data points added -', len(idx))

    # Add the new points to the training set by modifying the underlying numpy arrays:
    # the selected features are concatenated to the training set features and then
    # removed from the unlabeled set.
    X_tr = np.concatenate((X_tr, X_unlabeled[idx]), axis=0)
    X_unlabeled = np.delete(X_unlabeled, idx, axis = 0)

    # Here, we concatenate the labels of the selected points to the labels of the training set.
    # In actual applications, this step is done by a human annotator; here, it is done
    # automatically because we already know the labels.
    y_tr = np.concatenate((y_tr, y_unlabeled[idx]), axis = 0)
    y_unlabeled = np.delete(y_unlabeled, idx, axis = 0)
    print('Number of training points -',X_tr.shape[0])
    print('Number of labels -', y_tr.shape[0])
    print('Number of unlabeled points -', X_unlabeled.shape[0])

    # Update the PyTorch dataset objects.
    training_dataset = TensorDataset(torch.tensor(X_tr), torch.tensor(y_tr, dtype=torch.long))
    unlabeled_dataset = TensorDataset(torch.tensor(X_unlabeled), torch.tensor(y_unlabeled, dtype=torch.long))

    # Update the data used in the active learning strategy and the training loop
    strategy.update_data(training_dataset, LabeledToUnlabeledDataset(unlabeled_dataset))
    dt.update_data(training_dataset)

    # Retrain the model using the new labeled data. Update the active learning strategy using 
    # the newly trained model.
    clf = dt.train()
    strategy.update_model(clf)

    # Get the test accuracy as before.
    y_pred = strategy.predict(LabeledToUnlabeledDataset(test_dataset)).cpu().numpy()
    acc[rd] = round(1.0 * (y_tst == y_pred).sum().item() / len(y_tst), 3)
    print('Testing accuracy:', acc[rd], flush=True)

    # We add an additional stopping condition: end the loop once test accuracy exceeds 0.98.
    # In practice, reaching a target accuracy is a natural point to stop labeling.
    if acc[rd] > 0.98:
        print('Testing accuracy reached above 98%, stopping training!')
        break
        
print('Training Completed')