# ***Subset selection:***
This notebook demonstrates the subset selection functions in the spear library. Subset selection picks a small subset of the unlabeled data (or, for supervised subset selection, of the data labeled by LFs) to be labeled, so that the resulting small labeled set (the L dataset) can be used to train the <b>JL algorithm</b> effectively. Choosing the subset well makes the best use of the labeling effort.

In [5]:
import sys
sys.path.append('../../')
import numpy as np

### **Random subset selection**
Here we select a random subset of instances to label. Given the number of instances available and the number of instances we intend to label, rand_subset returns a sorted numpy array of indices.

In [6]:
from spear.JL import rand_subset

indices = rand_subset(n_all = 20, n_instances = 5) #select 5 instances from a total of 20 instances
print("indices selected by rand_subset: ", indices)
print("return type of rand_subset: ", type(indices))

indices selected by rand_subset:  [ 0  4  8 11 16]
return type of rand_subset:  <class 'numpy.ndarray'>
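
For intuition, a random subset with the properties described above (sorted, no repeated index) can be sketched with plain NumPy. This only illustrates the idea and is not spear's implementation; the helper name `random_subset_sketch` and its `seed` parameter are assumptions made for this example.

```python
import numpy as np

def random_subset_sketch(n_all, n_instances, seed=None):
    """Hypothetical helper: pick n_instances distinct indices from range(n_all), sorted."""
    if not 0 < n_instances <= n_all:
        raise ValueError("n_instances must be in the range [1, n_all]")
    rng = np.random.default_rng(seed)
    # Sampling without replacement guarantees that no index is chosen twice.
    chosen = rng.choice(n_all, size=n_instances, replace=False)
    return np.sort(chosen)

sketch_indices = random_subset_sketch(n_all=20, n_instances=5, seed=0)
print("indices from the sketch: ", sketch_indices)
```

Fixing the seed makes the selection reproducible, which helps when the same subset has to be handed to annotators more than once.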


### **Unsupervised subset selection**
Here we select an unsupervised subset (for more on this, please refer [here](https://arxiv.org/abs/2008.09887)) of instances to label. Given a feature matrix of shape (num_instances, num_features) and the number of instances we intend to label, unsup_subset returns a sorted numpy array of indices. For the other arguments to unsup_subset (or to sup_subset_indices or sup_subset_save_files), please refer to the documentation.
<p>For this, let's first get some data (a feature matrix), say from sms_pickle_U.pkl (in the data_pipeline folder). For more on this pickle file, please refer to the other notebook, sms_cage_jl.ipynb.</p>

In [7]:
from spear.utils import get_data

U_path_pkl = 'data_pipeline/sms_pickle_U.pkl' #unlabelled data - don't have true labels
data_U = get_data(U_path_pkl, check_shapes=True)
x_U = data_U[0] #the feature matrix
print("x_U shape: ", x_U.shape)
print("x_U type: ", type(x_U))

x_U shape:  (4000, 1024)
x_U type:  <class 'numpy.ndarray'>


Now that we have the feature matrix, let's select the indices to label from it. The function below returns indices into the feature matrix. After labeling those instances through a trustworthy means, one can pass the labels as gold_labels to the PreLabels class when labeling the subset-selected data and forming a pickle file.

In [8]:
from spear.JL import unsup_subset

indices = unsup_subset(x_train = x_U, n_unsup = 20)
print("first 10 indices given by unsup_subset: ", indices[:10])
print("return type of unsup_subset: ", type(indices))

first 10 indices given by unsup_subset:  [ 298  307  991 1041 1067 1160 1490 2001 2068 2094]
return type of unsup_subset:  <class 'numpy.ndarray'>
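
Since the returned indices refer to rows of the feature matrix, the instances to hand to an annotator can be pulled out with ordinary NumPy indexing. The sketch below uses a small synthetic matrix and made-up indices in place of x_U and the output of unsup_subset; `x_demo` and `demo_indices` are names assumed for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 8))             # stand-in for the feature matrix x_U
demo_indices = np.array([3, 17, 42, 58, 90])   # stand-in for the returned indices

# Integer-array indexing returns the selected rows in the same order as the indices,
# so the i-th label collected later belongs to row demo_indices[i] of the matrix.
x_to_label = x_demo[demo_indices]

print("rows to label: ", x_to_label.shape)
```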


### **Supervised subset selection**
Here we select a supervised subset (for more on this, please refer [here](https://arxiv.org/abs/2008.09887)) of instances to label. We need
* the path to the json file that holds information about the classes
* the path to the pickle file generated from the feature matrix after labeling with LFs
* the number of instances we intend to label

<p>Given these, sup_subset_indices returns a sorted numpy array of indices. For this, let's use sms_json.json and sms_pickle_U.pkl (in the data_pipeline folder). For more on these json/pickle files, please refer to the other notebook, sms_cage_jl.ipynb.</p>

In [9]:
from spear.JL import sup_subset_indices

U_path_pkl = 'data_pipeline/sms_pickle_U.pkl' #unlabelled data - don't have true labels
path_json = 'data_pipeline/sms_json.json'
indices = sup_subset_indices(path_json = path_json, path_pkl = U_path_pkl, n_sup = 50, qc = 0.85)

print("first 10 indices given by sup_subset: ", indices[:10])
print("return type of sup_subset: ", type(indices))

first 10 indices given by sup_subset:  [ 25 114 129 294 322 544 561 590 627 797]
return type of sup_subset:  <class 'numpy.ndarray'>
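
Before handing instances to annotators, it can help to sanity-check the returned indices against what the text promises: a sorted array of distinct indices that are valid row positions. The check below is a sketch on toy values; `check_subset_indices` is a hypothetical helper, not a spear function.

```python
import numpy as np

def check_subset_indices(indices, n_total):
    """Hypothetical helper: True if indices are sorted, distinct and within [0, n_total)."""
    indices = np.asarray(indices)
    if indices.size == 0:
        return False
    in_range = bool(indices.min() >= 0 and indices.max() < n_total)
    # Strictly increasing implies both sorted and free of duplicates.
    strictly_increasing = bool(np.all(np.diff(indices) > 0))
    return in_range and strictly_increasing

good = check_subset_indices([25, 114, 129, 294], n_total=4000)
bad = check_subset_indices([25, 25, 129], n_total=4000)
print("good: ", good, " bad: ", bad)
```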


Instead of only getting indices into the already labeled data (labeled using LFs and stored in pickle format), we also provide the following utility, which splits the input pickle file according to the subset selection and saves the two parts as separate pickle files. Make sure that path_save_L and path_save_U are <b>EMPTY</b> pickle files. The function still returns the subset-selected indices.

In [10]:
from spear.JL import sup_subset_save_files

U_path_pkl = 'data_pipeline/sms_pickle_U.pkl' #unlabelled data - don't have true labels
path_json = 'data_pipeline/sms_json.json'
path_save_L = 'data_pipeline/sup_subset_L.pkl'
path_save_U = 'data_pipeline/sup_subset_U.pkl'

indices = sup_subset_save_files(path_json = path_json, path_pkl = U_path_pkl, path_save_L = path_save_L,
                             path_save_U = path_save_U, n_sup = 50, qc = 0.85)

print("first 10 indices given by sup_subset: ", indices[:10])
print("return type of sup_subset: ", type(indices))

first 10 indices given by sup_subset:  [ 25 114 129 294 322 544 561 590 627 797]
return type of sup_subset:  <class 'numpy.ndarray'>
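
Conceptually, such a split sends the selected rows to one part and the remaining rows to the other. The NumPy sketch below shows that idea on an in-memory array, assuming this is how the L and U parts relate to the indices; it does not read or write spear pickle files, and `split_by_indices` is a hypothetical helper, not part of spear.

```python
import numpy as np

def split_by_indices(array, indices):
    """Hypothetical helper: return (selected rows, remaining rows) of array."""
    mask = np.zeros(len(array), dtype=bool)
    mask[indices] = True
    # Boolean masking keeps the original row order in both parts.
    return array[mask], array[~mask]

features = np.arange(20).reshape(10, 2)   # toy feature matrix with 10 instances
selected = np.array([1, 4, 7])            # toy subset-selected indices

features_L, features_U = split_by_indices(features, selected)
print("L part shape: ", features_L.shape)
print("U part shape: ", features_U.shape)
```

Applying the same mask to every array that is aligned with the feature matrix keeps the rows of the two parts consistent with each other.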


### **Inserting true labels into pickle files**
After supervised subset selection, we have two pickle files, path_save_L and path_save_U. Say you have labeled the instances in path_save_L and want to insert those labels into the pickle file. Instead of generating the pickle through PreLabels again, you can use the following function to create a new pickle file that contains the true labels, starting from the path_save_L pickle file. This function has no return value. Make sure that path_save, the path of the pickle file to be created from the data in path_save_L and the true labels, is <b>EMPTY</b>.

In [11]:
from spear.JL import insert_true_labels

path_save_L = 'data_pipeline/sup_subset_L.pkl'
path_save = 'data_pipeline/sup_subset_labeled_L.pkl'
labels = np.random.randint(0,2,[50, 1])
# The labels above are random and used only for this demo. In practice, the user has to label the
# instances in path_save_L through a trustworthy means and pass those labels here.
#
# Note that labels has shape (num_instances, 1); for reference, the feature matrix (the first
# element in the pickle file) in path_save_L has shape (num_instances, num_features).
insert_true_labels(path = path_save_L, path_save = path_save, labels = labels)
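
Hand labels are often collected as a flat list, one label per instance. Since the labels array is expected to have shape (num_instances, 1), such a list needs reshaping first. A small sketch, with `flat_labels` as made-up example data:

```python
import numpy as np

flat_labels = [0, 1, 1, 0, 1]   # made-up labels, one per instance, in file order

# reshape(-1, 1) turns shape (num_instances,) into (num_instances, 1).
labels_column = np.asarray(flat_labels, dtype=int).reshape(-1, 1)

print("labels shape: ", labels_column.shape)
```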

A similar function, replace_in_pkl, is also available for making changes to a pickle file; its usage is demonstrated below. Note that replace_in_pkl does not edit the pickle file in place; it creates a new pickle file instead. Make sure that path_save, the path of the pickle file to be created from the data in path and the new numpy array, is <b>EMPTY</b>. This function has no return value either.
<p>It is highly advised to use the insert_true_labels function for inserting labels, since it also makes some other necessary changes.</p>

In [12]:
from spear.JL import replace_in_pkl

path = 'data_pipeline/sup_subset_labeled_L.pkl'
path_save = 'data_pipeline/sup_subset_altered_L.pkl'
np_array = np.random.randint(0,2,[50, 1]) #we are just replacing the labels we inserted before
index = 3
# index refers to the element we intend to replace. Refer to the documentation (specifically
# spear.utils.data_editor.get_data) to understand which numpy array an index value maps to
# (the contents of the pickle file are ordered starting from 0). index should be in the range [0, 8].

replace_in_pkl(path = path, path_save = path_save, np_array = np_array, index = index)
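
Several functions in this notebook ask for an <b>EMPTY</b> save path. One way to guard against overwriting earlier results is to check the path before calling them. The sketch below assumes that "empty" means the file either does not exist or has zero bytes, which is an interpretation rather than something the documentation cited here confirms; `is_empty_target` is a hypothetical helper, not a spear function.

```python
import os
import tempfile

def is_empty_target(path):
    """Hypothetical helper: True if path does not exist or is a zero-byte file."""
    return (not os.path.exists(path)) or os.path.getsize(path) == 0

# Demonstrate on throwaway files so that no real pickle file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    fresh_path = os.path.join(tmp_dir, "new.pkl")
    used_path = os.path.join(tmp_dir, "used.pkl")
    with open(used_path, "wb") as f:
        f.write(b"not empty")
    fresh_ok = is_empty_target(fresh_path)
    used_ok = is_empty_target(used_path)

print("fresh path is empty: ", fresh_ok)
print("used path is empty: ", used_ok)
```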