Merge pull request #24 from decile-team/main
Merge main into doc_plots
nab170130 committed May 6, 2021
2 parents ba274b2 + 77813ae commit 3d8767b
Showing 48 changed files with 1,596 additions and 419 deletions.
16 changes: 11 additions & 5 deletions README.md
@@ -40,8 +40,9 @@
- [Evaluation of Active Learning Strategies](#evaluation-of-active-learning-strategies)
- [Testing Individual Strategies and Running Examples](#testing-individual-strategies-and-running-examples)
- [Mailing List](#mailing-list)
- [Publications](#publications)
- [Acknowledgement](#acknowledgement)
- [Team](#team)
- [Publications](#publications)

## What is DISTIL?
<p align="center">
@@ -126,9 +127,11 @@ DISTIL makes it extremely easy to integrate your custom models with active learn
* Check the models included in DISTIL for examples!

* Data Handler
* Your DataHandler class should have a boolean attribute “select:
* Your DataHandler class should have a boolean attribute “select” with a default value of True:
* If True: Your __getitem__(self, index) method should return (input, index)
* If False: Your __getitem__(self, index) method should return (input, label, index)
* Your DataHandler class should have a boolean attribute “use_test_transform” with a default value of False.

* Check the DataHandler classes included in DISTIL for examples, or see the minimal sketch below!
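A minimal sketch of such a DataHandler (illustrative only; the class name, constructor arguments, and tensor conversion are assumptions, while the `select`/`use_test_transform` attributes and the `__getitem__` return signatures follow the requirements above):

```python
import torch
from torch.utils.data import Dataset

class CustomDataHandler(Dataset):
    """Hypothetical handler; only the attribute names and return signatures
    follow the requirements listed above."""

    def __init__(self, X, Y=None, select=True, use_test_transform=False):
        self.select = select                          # True while selecting from the unlabeled pool
        self.use_test_transform = use_test_transform  # whether to apply test-time transforms
        self.X = X
        self.Y = Y

    def __getitem__(self, index):
        x = torch.as_tensor(self.X[index], dtype=torch.float32)
        if self.select:
            return x, index                    # (input, index)
        return x, self.Y[index], index         # (input, label, index)

    def __len__(self):
        return len(self.X)
```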

To get a clearer idea about how to incorporate DISTIL with your own models, refer to [Getting Started With DISTIL & Active Learning Blog](https://decile-research.medium.com/getting-started-with-distil-active-learning-ba7fafdbe6f3)
@@ -142,7 +145,7 @@ To get a clearer idea about how to incorporate DISTIL with your own models, refe

## Active Learning Benchmarks using DISTIL
#### Experimentation Method
The models used below were first trained on n randomly selected points, where n is the budget of the experiment. For each set of new points added, the model was trained from scratch until the training accuracy crossed the max accuracy threshold. The test accuracy was then reported before the next selection round.
The models used below were first trained on an initial random set of points (equal to the budget). For each set of new points added, the model was trained from scratch until the training accuracy crossed the max accuracy threshold. The test accuracy was then reported before the next selection round. The results below are *preliminary*, each obtained from a single run. We are carrying out a more thorough benchmarking study with multiple runs and reported standard deviations, and we will link to a preprint containing those results.
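For concreteness, here is a hedged sketch of this evaluation loop; the helpers `train_until_threshold` and `evaluate` are hypothetical, and the `select`/`update_data` calls assume the strategy interface used elsewhere in this repository:

```python
import numpy as np

def active_learning_benchmark(strategy, X_lab, Y_lab, X_unlab, Y_unlab,
                              X_test, Y_test, budget, num_rounds):
    """Sketch of the protocol above: retrain from scratch each round, record the
    test accuracy, then query `budget` new points and move them to the labeled pool."""
    test_accuracies = []
    for _ in range(num_rounds):
        model = train_until_threshold(X_lab, Y_lab)               # hypothetical helper
        test_accuracies.append(evaluate(model, X_test, Y_test))   # hypothetical helper

        idxs = strategy.select(budget)                            # indices into X_unlab
        X_lab = np.concatenate([X_lab, X_unlab[idxs]])
        Y_lab = np.concatenate([Y_lab, Y_unlab[idxs]])
        X_unlab = np.delete(X_unlab, idxs, axis=0)
        Y_unlab = np.delete(Y_unlab, idxs, axis=0)
        strategy.update_data(X_lab, Y_lab, X_unlab)               # refresh the strategy's pools
    return test_accuracies
```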

#### CIFAR10
Model: Resnet18
@@ -216,6 +219,11 @@ To receive updates about DISTIL and to be a part of the community, join the Deci
```
https://groups.google.com/forum/#!forum/Decile_DISTIL_Dev/join
```
## Acknowledgement
This library takes inspiration from, builds upon, and uses pieces of code from several open source codebases. These include [Kuan-Hao Huang's deep active learning repository](https://github.com/ej0cl6/deep-active-learning), [Jordan Ash's Badge repository](https://github.com/JordanAsh/badge), and [Andreas Kirsch's and Joost van Amersfoort's BatchBALD repository](https://github.com/BlackHC/batchbald_redux). Also, DISTIL uses [Apricot](https://github.com/jmschrei/apricot) for submodular optimization.

## Team
DISTIL is created and maintained by Nathan Beck, [Durga Sivasubramanian](https://www.linkedin.com/in/durga-s-352831105), [Apurva Dani](https://apurvadani.github.io/index.html), [Rishabh Iyer](https://www.rishiyer.com), and [Ganesh Ramakrishnan](https://www.cse.iitb.ac.in/~ganesh/). We look forward to making DISTIL more community driven. Please use it and contribute to it for your active learning research, and feel free to use it for your commercial projects. We will add the major contributors here.

## Publications

@@ -239,5 +247,3 @@

[10] Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. "Deep Bayesian active learning with image data." International Conference on Machine Learning. PMLR, 2017.

## Acknowledgement
This library takes inspiration and also uses pieces of code from [Kuan-Hao Huang's deep active learning repository](https://github.com/ej0cl6/deep-active-learning), [Jordan Ash's Badge repository](https://github.com/JordanAsh/badge), and [Andreas Kirsch's and Joost van Amersfoort's BatchBALD repository](https://github.com/BlackHC/batchbald_redux). Also, DISTIL uses [Apricot](https://github.com/jmschrei/apricot) for submodular optimization.
88 changes: 45 additions & 43 deletions distil/active_learning_strategies/adversarial_bim.py
@@ -4,52 +4,54 @@
from .strategy import Strategy

class AdversarialBIM(Strategy):
def __init__(self, X, Y, unlabeled_x, net, handler, nclasses, args={}):
"""
"""
Implements the Adversarial BIM Strategy, which is motivated by the fact that the distance
computation from the decision boundary is often difficult and intractable for margin-based methods. This
technique avoids estimating the distance by using BIM (Basic Iterative Method)
:footcite:`tramer2017ensemble` to estimate how much adversarial perturbation is required to
cross the boundary. The smaller the required perturbation, the closer the point is to the boundary.
**Basic Iterative Method (BIM)**: Given a base input, the approach is to perturb each
feature in the direction of the gradient by magnitude :math:`\\epsilon`, where :math:`\\epsilon` is a
parameter that determines the perturbation size. For a model with loss
:math:`J(\\theta, x, y)`, where :math:`\\theta` represents the model parameters,
x is the model input, and y is the label of x, the adversarial sample is generated
iteratively as,
.. math::
\\begin{eqnarray}
x^*_0 & = &x,
x^*_i & = & clip_{x,e} (x^*_{i-1} + sign(\\nabla_{x^*_{i-1}} J(\\theta, x^*_{i-1} , y)))
\\end{eqnarray}
Parameters
----------
X: numpy array
Present training/labeled data
y: numpy array
Labels of present training data
unlabeled_x: numpy array
Data without labels
net: class
Pytorch Model class
handler: class
Data Handler, which can load data even without labels.
nclasses: int
Number of unique target variables
args: dict
Specify optional parameters
Implements the Adversarial BIM Strategy, which is motivated by the fact that the distance
computation from the decision boundary is often difficult and intractable for margin-based
methods. This technique avoids estimating the distance by using BIM (Basic Iterative Method)
:footcite:`tramer2017ensemble` to estimate how much adversarial perturbation is required
to cross the boundary. The smaller the required perturbation, the closer the point is to the
boundary.
**Basic Iterative Method (BIM)**: Given a base input, the approach is to perturb each
feature in the direction of the gradient by magnitude :math:`\\epsilon`, where :math:`\\epsilon` is a
parameter that determines the perturbation size. For a model with loss
:math:`J(\\theta, x, y)`, where :math:`\\theta` represents the model parameters,
x is the model input, and y is the label of x, the adversarial sample is generated
iteratively as,
.. math::
    x^*_0 = x,
    x^*_i = clip_{x,\\epsilon} (x^*_{i-1} + sign(\\nabla_{x^*_{i-1}} J(\\theta, x^*_{i-1}, y)))
Parameters
----------
X: numpy array
Present training/labeled data
y: numpy array
Labels of present training data
unlabeled_x: numpy array
Data without labels
net: class
Pytorch Model class
handler: class
Data Handler, which can load data even without labels.
nclasses: int
Number of unique target variables
args: dict
Specify optional parameters
`batch_size` - Batch size to be used inside strategy class (int, optional)
`eps` - Epsilon value for gradients (optional)
"""

def __init__(self, X, Y, unlabeled_x, net, handler, nclasses, args={}):
"""
Constructor method
"""

if 'eps' in args:
self.eps = args['eps']
else:
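As an illustration of the BIM-based distance estimate described in the docstring above, here is a sketch under the assumption of a PyTorch classifier; it is not the file's actual `cal_dis` implementation, and the default `eps` and `max_iter` values are arbitrary:

```python
import torch
import torch.nn.functional as F

def bim_distance(model, x, eps=0.05, max_iter=100):
    """Illustrative: perturb x with the sign of the loss gradient (BIM) until the
    predicted label flips; the accumulated perturbation norm serves as a proxy for
    the distance to the decision boundary. `eps` and `max_iter` are assumptions."""
    nx = x.unsqueeze(0).clone().requires_grad_(True)   # add a batch dimension
    eta = torch.zeros_like(nx)                         # accumulated perturbation

    out = model(nx + eta)
    initial_pred = out.argmax(dim=1)

    for _ in range(max_iter):
        if out.argmax(dim=1).item() != initial_pred.item():   # boundary crossed
            break
        loss = F.cross_entropy(out, initial_pred)
        model.zero_grad()
        if nx.grad is not None:
            nx.grad.zero_()
        loss.backward()
        eta = eta + eps * torch.sign(nx.grad)          # BIM step
        out = model(nx + eta)

    return (eta * eta).sum().item()                    # squared perturbation norm
```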
3 changes: 2 additions & 1 deletion distil/active_learning_strategies/adversarial_deepfool.py
@@ -8,7 +8,7 @@ class AdversarialDeepFool(Strategy):
Implements the Adversarial DeepFool Strategy :footcite:`ducoffe2018adversarial`, a DeepFool-based
Active Learning strategy that selects unlabeled samples with the smallest adversarial
perturbation. This technique is motivated by the fact that often the distance computation
from decision boundary is often difficult and intractable for margin-based methods. This
from decision boundary is difficult and intractable for margin-based methods. This
technique avoids estimating distance by using Deep-Fool :footcite:`Moosavi-Dezfooli_2016_CVPR`
like techniques to estimate how much adversarial perturbation is required to cross the boundary.
The smaller the required perturbation, the closer the point is to the boundary.
@@ -47,6 +47,7 @@ def __init__(self, X, Y, unlabeled_x, net, handler, nclasses, args={}):
super(AdversarialDeepFool, self).__init__(X, Y, unlabeled_x, net, handler, nclasses, args={})

def cal_dis(self, x):

nx = Variable(torch.unsqueeze(x, 0), requires_grad=True)
eta = Variable(torch.zeros(nx.shape))

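To illustrate how such perturbation distances turn into a selection rule, here is a sketch; the batching is simplified and only the `cal_dis` name comes from the code above:

```python
import torch

def select_closest_to_boundary(strategy, unlabeled_pool, budget):
    """Illustrative: score every unlabeled point by its estimated adversarial
    perturbation (cal_dis above) and return the `budget` points with the smallest
    scores, i.e. those presumed closest to the decision boundary."""
    distances = torch.tensor([float(strategy.cal_dis(x)) for x in unlabeled_pool])
    closest_first = torch.argsort(distances)           # ascending perturbation size
    return closest_first[:budget].tolist()
```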
32 changes: 17 additions & 15 deletions distil/active_learning_strategies/badge.py
@@ -57,22 +57,24 @@ class BADGE(Strategy):
hypothesised labels. The points to be labeled are then selected by applying
k-means++ on these loss gradients.
Parameters.
Parameters
----------
X: Numpy array
Features of the labeled set of points
Y: Numpy array
Labels of the labeled set of points
unlabeled_x: Numpy array
Features of the unlabeled set of points
net: class object
Model architecture used for training. Could be an instance of the models defined in `distil.utils.models` or something similar.
handler: class object
It should be a subclass of torch.utils.data.Dataset, i.e., have __getitem__ and __len__ methods implemented, so that it could be passed to a PyTorch DataLoader. Could be an instance of the handlers defined in `distil.utils.DataHandler` or something similar.
nclasses: int
No. of classes in the dataset
args: dictionary
This dictionary should have 'batch_size' as a key.
X: numpy array
Present training/labeled data
Y: numpy array
Labels of present training data
unlabeled_x: numpy array
Data without labels
net: class
Pytorch Model class
handler: class
Data Handler, which can load data even without labels.
nclasses: int
Number of unique target variables
args: dict
Specify optional parameters.
`batch_size`
Batch size to be used inside strategy class (int, optional)
"""

def __init__(self, X, Y, unlabeled_x, net, handler,nclasses, args):
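A hedged sketch of the two steps the BADGE docstring describes: hypothesized-label gradient embeddings for a linear last layer, followed by k-means++ seeding. It is illustrative, not the file's implementation:

```python
import numpy as np

def gradient_embeddings(probs, penultimate_features):
    """Last-layer gradient embedding under the hypothesized (predicted) label:
    g_i = (p_i - onehot(argmax p_i)) outer h_i, flattened. Assumes a linear last
    layer trained with cross-entropy."""
    n, _ = probs.shape
    hypothesized = probs.argmax(axis=1)
    residual = probs.copy()
    residual[np.arange(n), hypothesized] -= 1.0        # p - onehot(y_hat)
    return (residual[:, :, None] * penultimate_features[:, None, :]).reshape(n, -1)

def kmeanspp_indices(embeddings, k, seed=0):
    """Standard k-means++ seeding: each new point is sampled with probability
    proportional to its squared distance from the closest already-chosen point."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(embeddings)))]
    dist2 = ((embeddings - embeddings[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(rng.choice(len(embeddings), p=dist2 / dist2.sum()))
        chosen.append(nxt)
        dist2 = np.minimum(dist2, ((embeddings - embeddings[nxt]) ** 2).sum(axis=1))
    return chosen
```

The indices returned by the seeding step are the unlabeled points BADGE would query for labels.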
@@ -3,9 +3,11 @@

class BALDDropout(Strategy):
"""
Implementation of BALDDropout Strategy.
This class extends :class:`active_learning_strategies.strategy.Strategy`
to include entropy sampling technique to select data points for active learning.
Implements the Bayesian Active Learning by Disagreement (BALD) Strategy :footcite:`houlsby2011bayesian`,
which assumes a Bayesian setting and selects points that maximise the mutual information
between the predicted labels and the model parameters. This implementation is an adaptation for a
non-Bayesian setting, with the assumption that there is a dropout layer in the model.
Parameters
----------
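A sketch of the BALD acquisition score the new docstring describes, estimated with Monte-Carlo dropout; the number of passes and the model handling are assumptions:

```python
import torch

def bald_scores(model, x_unlabeled, n_drop=10):
    """Mutual information between predictions and parameters, estimated with
    Monte-Carlo dropout: H[mean_t p_t(y|x)] - mean_t H[p_t(y|x)]."""
    model.train()                                      # keep dropout layers active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x_unlabeled), dim=1)
                             for _ in range(n_drop)])  # (n_drop, N, C)
    mean_probs = probs.mean(dim=0)
    entropy_of_mean = -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=1)
    mean_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=2).mean(dim=0)
    return entropy_of_mean - mean_entropy              # higher = more informative

# Selection: indices of the `budget` highest-scoring unlabeled points, e.g.
# torch.topk(bald_scores(model, x_pool), k=budget).indices
```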
2 changes: 1 addition & 1 deletion distil/active_learning_strategies/core_set.py
@@ -13,7 +13,7 @@ class CoreSet(Strategy):
----------
X: numpy array
Present training/labeled data
y: numpy array
Y: numpy array
Labels of present training data
unlabeled_x: numpy array
Data without labels
2 changes: 1 addition & 1 deletion distil/active_learning_strategies/entropy_sampling.py
@@ -9,7 +9,7 @@ class EntropySampling(Strategy):
we use entropy and therefore select points which have maximum entropy.
Suppose the model has `nclasses` output nodes and each output node is denoted by :math:`z_j`. Thus,
:math:`j \\in [1,nclasses]`. Then for an output node :math:`z_i` from the model, the correponding
:math:`j \\in [1,nclasses]`. Then for an output node :math:`z_i` from the model, the corresponding
softmax would be
.. math::
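A minimal sketch of the entropy criterion described above; the model handling is simplified:

```python
import torch

def entropy_select(model, x_unlabeled, budget):
    """Pick the `budget` points whose predictive distribution has the highest entropy."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x_unlabeled), dim=1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return torch.topk(entropy, k=budget).indices.tolist()
```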
@@ -10,7 +10,7 @@ class EntropySamplingDropout(Strategy):
which have maximum entropy.
Suppose the model has `nclasses` output nodes and each output node is denoted by :math:`z_j`. Thus,
:math:`j \in [1,nclasses]`. Then for an output node :math:`z_i` from the model, the correponding
:math:`j \in [1,nclasses]`. Then for an output node :math:`z_i` from the model, the corresponding
softmax would be
.. math::
25 changes: 17 additions & 8 deletions distil/active_learning_strategies/fass.py
@@ -11,6 +11,17 @@ class FASS(Strategy):
'facility_location' , 'graph_cut', 'saturated_coverage', 'sum_redundancy', 'feature_based'
is applied to get the final set of points.
We select a subset :math:`F` of size :math:`\\beta` based on uncertainty sampling, such
that :math:`\\beta \\ge k`.
Then select a subset :math:`S` by solving
.. math::
\\max \\{f(S) \\text{ such that } |S| \\leq k, S \\subseteq F\\}
where :math:`k` is the `budget` and :math:`f` can be one of these functions -
'facility location' , 'graph cut', 'saturated coverage', 'sum redundancy', 'feature based'.
Parameters
----------
X: numpy array
@@ -26,16 +37,14 @@ class FASS(Strategy):
nclasses: int
Number of unique target variables
args: dict
Specify optional parameters
batch_size
Specify optional parameters - `batch_size`
Batch size to be used inside strategy class (int, optional)
submod: str
Choice of submodular function - 'facility_location' | 'graph_cut' | 'saturated_coverage' | 'sum_redundancy' | 'feature_based'
selection_type: str
Choice of selection strategy - 'PerClass' | 'Supervised'
submod: str
Choice of submodular function - 'facility_location' | 'graph_cut' | 'saturated_coverage' | 'sum_redundancy' | 'feature_based'
selection_type: str
Choice of selection strategy - 'PerClass' | 'Supervised'
"""

def __init__(self, X, Y, unlabeled_x, net, handler, nclasses, args={}):
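An illustrative sketch of the FASS pipeline described above: filter to the beta most uncertain points, then greedily maximize a facility-location function over that filtered set. A small self-contained greedy is used here in place of the apricot optimizer the library relies on:

```python
import numpy as np

def fass_select(probs, features, budget, beta):
    """probs: (N, C) softmax outputs; features: (N, d) embeddings of the unlabeled pool.
    Returns `budget` indices chosen as described above."""
    # Step 1: uncertainty filter, keep the beta points with the lowest top-class probability
    uncertainty = 1.0 - probs.max(axis=1)
    filtered = np.argsort(-uncertainty)[:beta]                 # candidate set F, |F| = beta

    # Step 2: greedy facility location over F: f(S) = sum_i max_{j in S} sim(i, j)
    X = features[filtered]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-sq_dists / (sq_dists.mean() + 1e-12))        # RBF similarities

    selected, cover = [], np.zeros(len(X))
    for _ in range(budget):
        gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
        if selected:
            gains[selected] = -np.inf                          # never re-pick a point
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return filtered[selected].tolist()
```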
15 changes: 14 additions & 1 deletion distil/active_learning_strategies/gradmatch_active.py
@@ -17,7 +17,20 @@ class GradMatchActive(Strategy):
hypothesized labels of the loss function and are matched to either the full gradient of these hypothesized
examples or a supplied validation gradient. The indices returned are the ones selected by this algorithm.
.. math::
Err(X_t, L, L_T, \\theta_t) = \\left |\\left| \\sum_{i \\in X_t} \\nabla_\\theta L_T^i (\\theta_t) - \\frac{k}{N} \\nabla_\\theta L(\\theta_t) \\right | \\right|
where,
- Each gradient is computed with respect to the last layer's parameters
- :math:`\\theta_t` are the model parameters at selection round :math:`t`
- :math:`X_t` is the queried set of points to label at selection round :math:`t`
- :math:`k` is the budget
- :math:`N` is the number of points contributing to the full gradient :math:`\\nabla_\\theta L(\\theta_t)`
- :math:`\\nabla_\\theta L(\\theta_t)` is either the complete hypothesized gradient or a validation gradient
- :math:`\\sum_{i \\in X_t} \\nabla_\\theta L_T^i (\\theta_t)` is the subset's hypothesized gradient with :math:`|X_t| = k`
Parameters
----------
X: Numpy array
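To make the objective concrete, here is a sketch that evaluates the gradient-matching error for a candidate subset; the per-point gradients are assumed to be last-layer hypothesized-label gradients (as in the BADGE sketch earlier), and this is not the file's solver:

```python
import numpy as np

def gradmatch_error(per_point_grads, subset_indices, full_grad, n_full, budget):
    """Err = || sum_{i in subset} g_i - (k / N) * G ||, matching the formula above.
    per_point_grads: (M, d) per-point hypothesized-label gradients of the candidate pool;
    full_grad: (d,) full hypothesized or validation gradient, accumulated over n_full points."""
    subset_sum = per_point_grads[subset_indices].sum(axis=0)
    return float(np.linalg.norm(subset_sum - (budget / n_full) * full_grad))
```

A matching algorithm (for example orthogonal matching pursuit, which the GradMatch paper uses) would then search for a size-`k` subset that keeps this error small.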
4 changes: 2 additions & 2 deletions distil/active_learning_strategies/least_confidence.py
@@ -9,7 +9,7 @@ class LeastConfidence(Strategy):
Suppose the model has `nclasses` output nodes denoted by :math:`\\overrightarrow{\\boldsymbol{z}}`
and each output node is denoted by :math:`z_j`. Thus, :math:`j \\in [1, nclasses]`.
Then for an output node :math:`z_i` from the model, the correponding softmax would be
Then for an output node :math:`z_i` from the model, the corresponding softmax would be
.. math::
\\sigma(z_i) = \\frac{e^{z_i}}{\\sum_j e^{z_j}}
Expand All @@ -18,7 +18,7 @@ class LeastConfidence(Strategy):
confidence as follows,
.. math::
arg\\min_{{S \\subseteq {\\mathcal U}, |S| \\leq k}}{(arg\\max_j{(\\sigma(\\overrightarrow{\\boldsymbol{z}}))})}
\\mbox{argmin}_{{S \\subseteq {\\mathcal U}, |S| \\leq k}}{\\sum_S(\\mbox{argmax}_j{(\\sigma(\\overrightarrow{\\boldsymbol{z}}))})}
where :math:`\\mathcal{U}` denotes the data without labels, i.e. `unlabeled_x`, and :math:`k` is the `budget`.
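A sketch of the least-confidence rule in the formula above; the model handling is simplified:

```python
import torch

def least_confidence_select(model, x_unlabeled, budget):
    """Pick the `budget` points whose highest class probability is smallest."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x_unlabeled), dim=1)
    max_conf = probs.max(dim=1).values
    return torch.argsort(max_conf)[:budget].tolist()   # least confident first
```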
4 changes: 2 additions & 2 deletions distil/active_learning_strategies/least_confidence_dropout.py
@@ -8,7 +8,7 @@ class LeastConfidenceDropout(Strategy):
Suppose the model has `nclasses` output nodes denoted by :math:`\\overrightarrow{\\boldsymbol{z}}`
and each output node is denoted by :math:`z_j`. Thus, :math:`j \\in [1, nclasses]`.
Then for an output node :math:`z_i` from the model, the correponding softmax would be
Then for an output node :math:`z_i` from the model, the corresponding softmax would be
.. math::
\\sigma(z_i) = \\frac{e^{z_i}}{\\sum_j e^{z_j}}
Expand All @@ -17,7 +17,7 @@ class LeastConfidenceDropout(Strategy):
confidence as follows,
.. math::
arg\\min_{{S \\subseteq {\\mathcal U}, |S| \\leq k}}{(arg\\max_j{(\\sigma(\\overrightarrow{\\boldsymbol{z}}))})}
\\mbox{argmin}_{{S \\subseteq {\\mathcal U}, |S| \\leq k}}{\\sum_S(\\mbox{argmax}_j{(\\sigma(\\overrightarrow{\\boldsymbol{z}}))})}
where :math:`\\mathcal{U}` denotes the data without labels, i.e. `unlabeled_x`, and :math:`k` is the `budget`.
6 changes: 3 additions & 3 deletions distil/active_learning_strategies/margin_sampling.py
@@ -10,20 +10,20 @@ class MarginSampling(Strategy):
Suppose the model has `nclasses` output nodes denoted by :math:`\\overrightarrow{\\boldsymbol{z}}`
and each output node is denoted by :math:`z_j`. Thus, :math:`j \\in [1, nclasses]`.
Then for an output node :math:`z_i` from the model, the correponding softmax would be
Then for an output node :math:`z_i` from the model, the corresponding softmax would be
.. math::
\\sigma(z_i) = \\frac{e^{z_i}}{\\sum_j e^{z_j}}
Let,
.. math::
m = arg\\max_j{(\\sigma(\\overrightarrow{\\boldsymbol{z}}))}
m = \\mbox{argmax}_j{(\\sigma(\\overrightarrow{\\boldsymbol{z}}))}
Then using softmax, Margin Sampling Strategy would pick `budget` no. of elements as follows,
.. math::
arg\\min_{{S \\subseteq {\\mathcal U}, |S| \\leq k}}{(arg\\max_j {(\\sigma(\\overrightarrow{\\boldsymbol{z}}))}) - (arg\\max_{j \\ne m} {(\\sigma(\\overrightarrow{\\boldsymbol{z}}))})}
\\mbox{argmin}_{{S \\subseteq {\\mathcal U}, |S| \\leq k}}{\\sum_S(\\mbox{argmax}_j {(\\sigma(\\overrightarrow{\\boldsymbol{z}}))}) - (\\mbox{argmax}_{j \\ne m} {(\\sigma(\\overrightarrow{\\boldsymbol{z}}))})}
where :math:`\\mathcal{U}` denotes the data without labels, i.e. `unlabeled_x`, and :math:`k` is the `budget`.
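A sketch of the margin criterion above, selecting the points with the smallest gap between the top two class probabilities; the model handling is simplified:

```python
import torch

def margin_select(model, x_unlabeled, budget):
    """Pick the `budget` points with the smallest (top-1 minus top-2) softmax margin."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x_unlabeled), dim=1)
    top2 = torch.topk(probs, k=2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]
    return torch.argsort(margin)[:budget].tolist()     # smallest margins first
```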
