Merge branch 'develop'
amaiya committed Jan 20, 2020
2 parents 0ce9f45 + 704034e commit cfaa18e
Showing 8 changed files with 162 additions and 16 deletions.
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,21 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour


## 0.8.2 (TBD)

### New:
- initial base `ktrain.Dataset` class for use as a `Sequence` wrapper to better support custom datasets/models (a sketch follows this entry)

### Changed:
- N/A

### Fixed:
- N/A
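
For illustration, a minimal sketch of what subclassing the new base class might look like. This is an editorial sketch, not part of the release: the `MyDataset` name and the in-memory array storage are hypothetical (see `ktrain/data.py` below for the actual base class):

```python
import numpy as np
from ktrain import Dataset  # exported from ktrain/__init__.py as of 0.8.2

class MyDataset(Dataset):
    """Hypothetical wrapper around in-memory numpy arrays."""
    def __init__(self, x, y, batch_size=32):
        super().__init__(batch_size=batch_size)  # base class stores batch_size
        self.x, self.y = x, y

    # required by keras.utils.Sequence: number of batches per epoch
    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    # required by keras.utils.Sequence: return one batch of (inputs, targets)
    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

    # required: used by Learner instances
    def nsamples(self):
        return len(self.x)

    # required: used by Learner instances
    def get_y(self):
        return self.y
```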




## 0.8.1 (2020-01-15)

### New:
6 changes: 3 additions & 3 deletions README.md
@@ -6,7 +6,7 @@

### News and Announcements
- **2020-01-14:**
-- ***ktrain*** **v0.8.x is released** and now includes a thin and easy-to-use wrapper to [HuggingFace Transformers](https://github.com/huggingface/transformers) for text classification. See [this tutorial notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/tutorials/tutorial-A3-hugging_face_transformers.ipynb) for more details.
+- ***ktrain*** **v0.8.x is released** and now includes a thin and easy-to-use wrapper to [HuggingFace Transformers](https://github.com/huggingface/transformers) for text classification. See [this tutorial notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb) for more details.
- As of v0.8.x, *ktrain* now uses **TensorFlow 2**. TensorFlow 1.x is no longer supported. If you're using Google Colab and `import tensorflow as tf; print(tf.__version__)` shows v1.15 is installed, you must install TensorFlow 2: `!pip3 install -q tensorflow_gpu==2.0`. Remember to import Keras modules like this: `from tensorflow.keras.layers import Dense`. (That is, don't do this: `from keras.layers import Dense`.)
- **Coming Soon**:
- better support for custom data formats and models
@@ -26,7 +26,7 @@
- **Sequence Labeling**: [Bidirectional LSTM-CRF](https://arxiv.org/abs/1603.01360) with optional pretrained word embeddings <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-06-sequence-tagging.ipynb)]</sup></sub>
- **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
- **Document Similarity with One-Class Learning**: given some documents of interest, find and score new documents that are semantically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
-- **Document Recommendation Engine**: given text from a sample document, recommend documents that are semantically similar to it from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
+- **Document Recommendation Engine**: given text from a sample document, recommend documents that are semantically related to it from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- `vision` data:
- **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/vision/dogs_vs_cats-ResNet50.ipynb)]</sup></sub>
- `graph` data:
@@ -48,7 +48,7 @@ Please see the following tutorial notebooks for a guide on how to use *ktrain* o
* Tutorial 7: [Graph Node Classification](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-07-graph-node_classification.ipynb) with Graph Neural Networks
* Tutorial A1: [Additional tricks](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A1-additional-tricks.ipynb), which covers topics such as previewing data augmentation schemes, inspecting intermediate output of Keras models for debugging, setting global weight decay, and use of built-in and custom callbacks.
* Tutorial A2: [Explaining Predictions and Misclassifications](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A2-explaining-predictions.ipynb)
-* Tutorial A3: [Text Classification with Hugging Face Transformers](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/tutorials/tutorial-A3-hugging_face_transformers.ipynb)
+* Tutorial A3: [Text Classification with Hugging Face Transformers](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb)


Some blog tutorials about *ktrain* are shown below:
4 changes: 3 additions & 1 deletion ktrain/__init__.py
@@ -5,10 +5,12 @@
 from .text.learner import BERTTextClassLearner, TransformerTextClassLearner
 from .text.ner.learner import NERLearner
 from .graph.learner import NodeClassLearner
+from .data import Dataset

 from . import utils as U

-__all__ = ['get_learner', 'get_predictor', 'load_predictor', 'release_gpu_memory']
+__all__ = ['get_learner', 'get_predictor', 'load_predictor', 'release_gpu_memory',
+           'Dataset']



42 changes: 40 additions & 2 deletions ktrain/core.py
@@ -1185,12 +1185,50 @@ def layer_output(self, layer_id, example_id=0, batch_id=0, use_val=False):
         return layer_out


+    #def view_top_losses(self, n=4, preproc=None, val_data=None):
+    #    """
+    #    Views observations with top losses in validation set.
+    #    Musta be overridden by Learner subclasses.
+    #    """
+    #    raise NotImplementedError('view_top_losses must be overriden by GenLearner subclass')
     def view_top_losses(self, n=4, preproc=None, val_data=None):
         """
         Views observations with top losses in validation set.
-        Musta be overridden by Learner subclasses.
+        Typically over-ridden by Learner subclasses.
+        Args:
+            n(int or tuple): a range to select in form of int or tuple
+                             e.g., n=8 is treated as n=(0,8)
+            preproc (Preprocessor): A TextPreprocessor or ImagePreprocessor.
+                                    For some data like text data, a preprocessor
+                                    is required to undo the pre-processing
+                                    to correctly view raw data.
+            val_data: optional val_data to use instead of self.val_data
+        Returns:
+            list of n tuples where first element is either
+            filepath or id of validation example and second element
+            is loss.
         """
-        raise NotImplementedError('view_top_losses must be overriden by GenLearner subclass')
+        val = self._check_val(val_data)
+
+
+        # get top losses and associated data
+        tups = self.top_losses(n=n, val_data=val, preproc=preproc)
+
+        # get multilabel status and class names
+        classes = preproc.get_classes() if preproc is not None else None
+        # iterate through losses
+        for tup in tups:
+
+            # get data
+            idx = tup[0]
+            loss = tup[1]
+            truth = tup[2]
+            pred = tup[3]
+
+            print('----------')
+            print("id:%s | loss:%s | true:%s | pred:%s\n" % (idx, round(loss, 2), truth, pred))
+        return
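
A hedged usage sketch of this new default implementation (assumes a trained `learner` returned by `ktrain.get_learner` and, for text data, the matching `preproc` object; both names are illustrative):

```python
# print the three highest-loss validation examples with true and predicted labels
learner.view_top_losses(n=3, preproc=preproc)
```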



84 changes: 84 additions & 0 deletions ktrain/data.py
@@ -0,0 +1,84 @@
from .imports import *


class Dataset(Sequence):
    def __init__(self, batch_size=32):
        self.batch_size = batch_size

    # required by keras.utils.Sequence instances
    def __len__(self):
        raise NotImplementedError

    # required by keras.utils.Sequence instances
    def __getitem__(self, idx):
        raise NotImplementedError

    # required: used by Learner instances
    def nsamples(self):
        raise NotImplementedError

    # required: used by Learner instances
    def get_y(self):
        raise NotImplementedError

    # optional: to modify dataset between epochs (e.g., shuffle)
    def on_epoch_end(self):
        pass

    # optional
    def ondisk(self):
        return False

    # optional: used only if invoking *_classifier functions
    def xshape(self):
        raise NotImplementedError

    # optional: used only if invoking *_classifier functions
    def nclasses(self):
        raise NotImplementedError


class MultiArrayDataset(Dataset):
    def __init__(self, x, y, batch_size=32):
        if type(x) != np.ndarray or type(y) != np.ndarray:
            raise ValueError('x and y must be numpy arrays')
        if len(x.shape) != 3:
            raise ValueError('x must have 3 dimensions')
        super().__init__(batch_size=batch_size)
        self.x, self.y = x, y
        self.indices = np.arange(self.x[0].shape[0])
        self.n_inputs = x.shape[0]

    def __len__(self):
        return math.ceil(self.x[0].shape[0] / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = []
        for i in range(self.n_inputs):
            batch_x.append(self.x[i][inds])
        batch_y = self.y[inds]
        return tuple(batch_x), batch_y

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

    def xshape(self):
        return self.x.shape

    def nsamples(self):
        if self.n_inputs == 1:
            return self.x.shape[0]
        else:
            return self.x.shape[1]

    def nclasses(self):
        return self.y.shape[1]

    def get_y(self):
        return self.y

    def ondisk(self):
        return False
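A quick usage sketch of `MultiArrayDataset` (hedged: the shapes are illustrative; per the constructor check above, `x` must be a single 3-D array stacking the model's inputs):

```python
import numpy as np
from ktrain.data import MultiArrayDataset

x = np.random.rand(2, 100, 10)               # two inputs, 100 samples, 10 features
y = np.eye(5)[np.random.randint(0, 5, 100)]  # one-hot labels for 5 classes

ds = MultiArrayDataset(x, y, batch_size=16)
batch_x, batch_y = ds[0]                      # tuple of two (16, 10) arrays, (16, 5) labels
print(len(ds), ds.nsamples(), ds.nclasses())  # 7 100 5
```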
21 changes: 14 additions & 7 deletions ktrain/utils.py
@@ -1,4 +1,5 @@
 from .imports import *
+from .data import Dataset


#------------------------------------------------------------------------------
@@ -156,7 +157,8 @@ def is_multilabel(data):
 def shape_from_data(data):
     err_msg = 'could not determine shape from %s' % (type(data))
     if is_iter(data):
-        if is_ner(data=data): return (len(data.x), data[0][0][0].shape[1]) # NERSequence
+        if isinstance(data, Dataset): return data.xshape()
+        elif is_ner(data=data): return (len(data.x), data[0][0][0].shape[1]) # NERSequence
         elif is_huggingface(data=data): # HF Transformer
             return (len(data.x), data[0][0][0].shape[1])
         elif is_nodeclass(data=data): # NodeSequence
@@ -180,16 +182,19 @@ def shape_from_data(data):


 def ondisk(data):
+    if hasattr(data, 'ondisk'): return data.ondisk()
+
     ondisk = is_iter(data) and \
-             (type(data).__name__ not in ['NumpyArrayIterator', 'NERSequence',
+             (type(data).__name__ not in ['ArrayDataset', 'NumpyArrayIterator', 'NERSequence',
                                           'NodeSequenceWrapper', 'TransformerSequence'])
     return ondisk


 def nsamples_from_data(data):
     err_msg = 'could not determine number of samples from %s' % (type(data))
     if is_iter(data):
-        if is_ner(data=data): return len(data.x) # NERSequence
+        if isinstance(data, Dataset): return data.nsamples()
+        elif is_ner(data=data): return len(data.x) # NERSequence
         elif is_huggingface(data=data): # HuggingFace Transformer
             return len(data.x)
         elif is_nodeclass(data=data): # NodeSequenceWrapper
@@ -212,7 +217,8 @@ def nsamples_from_data(data):

 def nclasses_from_data(data):
     if is_iter(data):
-        if is_ner(data=data): return len(data.p._label_vocab._id2token) # NERSequence
+        if isinstance(data, Dataset): return data.nclasses()
+        elif is_ner(data=data): return len(data.p._label_vocab._id2token) # NERSequence
         elif is_huggingface(data=data): # Hugging Face Transformer
             return data.y.shape[1]
         elif is_nodeclass(data=data): # NodeSequenceWrapper
@@ -233,7 +239,8 @@ def nclasses_from_data(data):

 def y_from_data(data):
     if is_iter(data):
-        if is_ner(data=data): return data.y # NERSequence
+        if isinstance(data, Dataset): return data.get_y()
+        elif is_ner(data=data): return data.y # NERSequence
         if is_huggingface(data=data): # Hugging Face Transformer
             return data.y
         elif is_nodeclass(data=data): # NodeSequenceWrapper
@@ -258,7 +265,7 @@ def is_iter(data, ignore=False):
     iter_classes = ["NumpyArrayIterator", "DirectoryIterator",
                     "DataFrameIterator", "Iterator", "Sequence",
                     "NERSequence", "NodeSequenceWrapper", "TransformerSequence"]
-    return data.__class__.__name__ in iter_classes
+    return data.__class__.__name__ in iter_classes or isinstance(data, Dataset)



@@ -271,7 +278,7 @@ def data_arg_check(train_data=None, val_data=None, train_required=False, val_req
     if train_data is not None and not is_iter(train_data, ndarray_only):
         if bad_data_tuple(train_data):
             err_msg = 'data must be tuple of numpy.ndarrays'
-            if not ndarray_only: err_msg += ' or an instance of Iterator'
+            if not ndarray_only: err_msg += ' or an instance of ktrain.Dataset'
             raise ValueError(err_msg)
     if val_data is not None and not is_iter(val_data, ndarray_only):
         if bad_data_tuple(val_data):
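Taken together, these changes mean any `ktrain.Dataset` instance is accepted wherever an iterator-style dataset is. A hedged sketch of the effect, reusing the hypothetical `MyDataset` from the CHANGELOG note above (`x_train`/`y_train` are your own numpy arrays):

```python
from ktrain import utils as U

ds = MyDataset(x_train, y_train)
print(U.is_iter(ds))                 # True, via the isinstance(data, Dataset) check
print(U.nsamples_from_data(ds))      # dispatches to ds.nsamples()
print(U.y_from_data(ds) is y_train)  # True: dispatches to ds.get_y()
```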
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
 __all__ = ['__version__']
-__version__ = '0.8.1'
+__version__ = '0.8.2'
4 changes: 2 additions & 2 deletions setup.py
@@ -18,7 +18,7 @@
     author = 'Arun S. Maiya',
     author_email = 'arun@maiya.net',
     url = 'https://github.com/amaiya/ktrain',
-    keywords = ['keras', 'deep learning', 'machine learning'],
+    keywords = ['tensorflow', 'keras', 'deep learning', 'machine learning'],
install_requires=[
'scikit-learn == 0.21.3',
'matplotlib >= 3.0.0',
@@ -45,7 +45,7 @@
         # 3 - Alpha
         # 4 - Beta
         # 5 - Production/Stable
-        'Development Status :: 3 - Alpha',
+        'Development Status :: 4 - Beta',

# Indicate who your project is intended for
'Intended Audience :: Developers',