Merge branch 'develop'
amaiya committed Mar 26, 2021
2 parents 4038cec + b1be655 commit 23baf7c
Showing 11 changed files with 134 additions and 30 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,18 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.26.2 (2021-03-26)

### New:
- N/A

### Changed:
- `NERPredictor.predict` now optionally accepts lists of sentences to make sequence-labeling predictions in batches (as all other `Predictor` instances already do).

### Fixed:
- N/A


## 0.26.1 (2021-03-11)

### New:
23 changes: 23 additions & 0 deletions FAQ.md
@@ -70,6 +70,10 @@

- [How do I make quantized predictions with `transformers` models?](#how-do-i-make-quantized-predictions-with-transformers-models)

- [How do I increase batch size for predictions?](#how-do-i-increase-batch-size-for-predictions)

- [How do I speed up predictions?](#how-do-i-increase-batch-size-for-predictions)



---
@@ -991,6 +995,25 @@ def reset_random_seeds(seed=2):
[[Back to Top](#frequently-asked-questions-about-ktrain)]


### How do I increase batch size for predictions?

Increasing the batch size used for inference can potentially speed up predictions on lists of examples.

The `get_predictor` and `load_predictor` functions both accept a `batch_size` argument that will be used when making predictions on lists of examples. The default is 32. The `batch_size` for `Predictor` instances can also be set manually:
```python
predictor = ktrain.load_predictor('/tmp/my_predictor')
predictor.batch_size = 128
predictor.predict(list_of_examples)
```

The `get_learner` function accepts an `eval_batch_size` argument that will be used by the `Learner` instance when evaluating a validation dataset (e.g., `learner.predict`).
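For example, here is a minimal sketch (the `model`, `trn`, and `val` variables are assumed to have been created beforehand, e.g., with `texts_from_array`):
```python
import ktrain

# `model`, `trn`, and `val` are assumed to exist already
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=6,        # batch size used for training
                             eval_batch_size=64)  # batch size used when evaluating validation data
learner.validate()  # evaluates the validation set using eval_batch_size
```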


[[Back to Top](#frequently-asked-questions-about-ktrain)]




### What kinds of applications have been built with *ktrain*?

Examples include:
1 change: 1 addition & 0 deletions README.md
@@ -352,6 +352,7 @@ The above should be all you need on Linux systems and cloud computing environments

**Some important things to note about installation:**
- If using **ktrain** with `tensorflow<=2.1`, you must also downgrade the **transformers** library to `transformers==3.1`.
- If `load_predictor` fails with the error "`AttributeError: 'str' object has no attribute 'decode'`", then downgrade **h5py**: `pip install h5py==2.10.0`
- As of v0.21.x, **ktrain** no longer installs TensorFlow 2 automatically. As indicated above, you should install TensorFlow 2 yourself before installing and using **ktrain**. On Google Colab, TensorFlow 2 should be already installed. You should be able to use **ktrain** with any version of [TensorFlow 2](https://www.tensorflow.org/install/pip?lang=python3). Note, however, that there is a bug in TensorFlow 2.2 and 2.3 that affects the *Learning-Rate-Finder* [that will not be fixed until TensorFlow 2.4](https://github.com/tensorflow/tensorflow/issues/41174#issuecomment-656330268). The bug causes the learning-rate-finder to complete all epochs even after loss has diverged (i.e., no automatic-stopping).
- If using **ktrain** on a local machine with a GPU (versus Google Colab, for example), you'll need to [install GPU support for TensorFlow 2](https://www.tensorflow.org/install/gpu).
- Since some **ktrain** dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues),
2 changes: 1 addition & 1 deletion examples/text/ktrain-ONNX-TFLite-examples.ipynb
@@ -65,7 +65,7 @@
"!pip install ktrain\n",
"\n",
"# load text data\n",
"categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']\n",
"categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)\n",
"test_b = fetch_20newsgroups(subset='test',categories=categories, shuffle=True)\n",
72 changes: 46 additions & 26 deletions ktrain/text/ner/predictor.py
@@ -26,45 +26,63 @@ def get_classes(self):
        return self.c


-    def predict(self, sentence, return_proba=False, merge_tokens=False, custom_tokenizer=None):
+    def predict(self, sentences, return_proba=False, merge_tokens=False, custom_tokenizer=None):
        """
        Makes predictions for a string-representation of a sentence
        Args:
-          sentence(str): sentence of text
+          sentences(list|str): either a single sentence as a string or a list of sentences
          return_proba(bool): If return_proba is True, returns probability distribution for each token
          merge_tokens(bool): If True, tokens will be merged together by the entity
                              to which they are associated:
                              ('Paul', 'B-PER'), ('Newman', 'I-PER') becomes ('Paul Newman', 'PER')
          custom_tokenizer(Callable): If specified, sentence will be tokenized based on custom tokenizer
        Returns:
-          list: list of tuples representing each token.
+          list: If sentences is a string-representation of a single sentence:
+                    a list containing a tuple for each token in the sentence
+                If sentences is a list of sentences:
+                    a list of lists, where each inner list represents a sentence and contains a tuple for each of its tokens
        """
-        if not isinstance(sentence, str):
-            raise ValueError('Param sentence must be a string-representation of a sentence')
+        is_array = not isinstance(sentences, str)
+        if not isinstance(sentences, (str, list)):
+            raise ValueError('Param sentences must be either a string-representation of a sentence or a list of sentence strings.')
        if return_proba and merge_tokens:
            raise ValueError('return_proba and merge_tokens are mutually exclusive with one another.')
-        lang = TU.detect_lang([sentence])
-        nerseq = self.preproc.preprocess([sentence], lang=lang, custom_tokenizer=custom_tokenizer)
-        if not nerseq.prepare_called:
-            nerseq.prepare()
-        nerseq.batch_size = self.batch_size
-        x_true, _ = nerseq[0]
-        lengths = nerseq.get_lengths(0)
-        y_pred = self.model.predict_on_batch(x_true)
-        y_labels = self.preproc.p.inverse_transform(y_pred, lengths)
-        y_labels = y_labels[0]
-        if return_proba:
-            try:
-                probs = np.max(y_pred, axis=2)[0]
-            except:
-                probs = y_pred[0].numpy().tolist()  # TODO: remove after confirmation (#316)
-            return list(zip(nerseq.x[0], y_labels, probs))
-        else:
-            result = list(zip(nerseq.x[0], y_labels))
-            if merge_tokens:
-                result = self.merge_tokens(result, lang)
-            return result
+        if isinstance(sentences, str): sentences = [sentences]
+        lang = TU.detect_lang(sentences)
+
+        # batchify
+        num_chunks = math.ceil(len(sentences)/self.batch_size)
+        batches = U.list2chunks(sentences, n=num_chunks)
+
+        # process batches
+        results = []
+        for batch in batches:
+            nerseq = self.preproc.preprocess(batch, lang=lang, custom_tokenizer=custom_tokenizer)
+            if not nerseq.prepare_called:
+                nerseq.prepare()
+            nerseq.batch_size = len(batch)
+            x_true, _ = nerseq[0]
+            lengths = nerseq.get_lengths(0)
+            y_pred = self.model.predict_on_batch(x_true)
+            y_labels = self.preproc.p.inverse_transform(y_pred, lengths)
+            if return_proba:
+                try:
+                    probs = np.max(y_pred, axis=2)
+                except:
+                    probs = y_pred[0].numpy().tolist()  # TODO: remove after confirmation (#316)
+                for x, y, prob in zip(nerseq.x, y_labels, probs):
+                    result = [(x[i], y[i], prob[i]) for i in range(len(x))]
+                    results.append(result)
+            else:
+                for x, y in zip(nerseq.x, y_labels):
+                    result = list(zip(x, y))
+                    if merge_tokens:
+                        result = self.merge_tokens(result, lang)
+                    results.append(result)
+        if not is_array: results = results[0]
+        return results



    def merge_tokens(self, annotated_sentence, lang):
@@ -105,5 +123,7 @@ def merge_tokens(self, annotated_sentence, lang):
            elif tag and current_token:  # prefix I
                current_token = current_token + sep + token
                current_tag = tag
+        if current_token and current_tag:
+            entities.append((current_token, current_tag))
        return entities
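As an aside, here is a small self-contained sketch (not part of the commit) of how the batchifying step above behaves; `list2chunks` below is a stand-in for `U.list2chunks`, which is assumed to split a list into `n` roughly equal chunks:
```python
import math

def list2chunks(a, n):
    # stand-in for ktrain.utils.list2chunks: split list `a` into `n` roughly equal chunks
    k, m = divmod(len(a), n)
    return [a[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n)]

sentences = ['s1', 's2', 's3', 's4', 's5']
batch_size = 2
num_chunks = math.ceil(len(sentences) / batch_size)  # 3 chunks for 5 sentences
print(list2chunks(sentences, num_chunks))  # [['s1', 's2'], ['s3', 's4'], ['s5']]
```
Each chunk holds at most `batch_size` sentences, so `predict_on_batch` is never handed more examples than the configured batch size.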

4 changes: 2 additions & 2 deletions ktrain/text/predictor.py
@@ -49,8 +49,8 @@ def predict(self, texts, return_proba=False):
        if U.is_huggingface(model=self.model):
            tseq = self.preproc.preprocess_test(texts, verbose=0)
            tseq.batch_size = self.batch_size
-            texts = tseq.to_tfdataset(train=False)
-            preds = self.model.predict(texts)
+            tfd = tseq.to_tfdataset(train=False)
+            preds = self.model.predict(tfd)
            if type(preds).__name__ == 'TFSequenceClassifierOutput':  # dep_fix: undocumented breaking change in transformers==4.0.0
                preds = preds.logits

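For context, a minimal sketch (not part of the commit) of the `transformers` behavior that the `dep_fix` comment above refers to: starting with `transformers==4.0`, TF models return structured output objects rather than raw logit arrays, so the logits must be unwrapped:
```python
# sketch assuming transformers>=4.0 and TensorFlow 2 are installed
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

inputs = tokenizer(['ktrain makes deep learning easier.'], return_tensors='tf')
preds = model(inputs)  # a TFSequenceClassifierOutput object, not a raw array
if type(preds).__name__ == 'TFSequenceClassifierOutput':
    preds = preds.logits  # unwrap the raw logits, as the code above does
```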
2 changes: 2 additions & 0 deletions ktrain/utils.py
@@ -704,6 +704,8 @@ def apply(self, df, train=True):
        for i, col in enumerate(new_lab_cols):
            df[col] = targets[:,i]
        df[new_lab_cols] = targets
+        print(new_lab_cols)
+        print(df[new_lab_cols].head())
        df[new_lab_cols] = df[new_lab_cols].astype('float32')

        return df
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
-__version__ = '0.26.1'
+__version__ = '0.26.2'
15 changes: 15 additions & 0 deletions tutorials/tutorial-01-introduction.ipynb
@@ -804,6 +804,21 @@
"predictor = ktrain.load_predictor('/tmp/mymnist')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument that is set to 32 by default. For instance, the `batch_size` used for inference and predictions can be increased with either of the following:\n",
"```python\n",
"# you can set the batch_size as an argument to load_predictor\n",
"predictor = ktrain.load_predictor('/tmp/mymnist', batch_size=64)\n",
"\n",
"# you can also set the batch_size used for predictions this way\n",
"predictor.batch_size = 64\n",
"```\n",
"Larger batch sizes can potentially speed predictions."
]
},
{
"cell_type": "code",
"execution_count": 17,
15 changes: 15 additions & 0 deletions tutorials/tutorial-04-text-classification.ipynb
@@ -474,6 +474,21 @@
"predictor.predict(['Groundhog Day is my favorite movie of all time!'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument that is set to 32 by default. The `batch_size` can also be set manually on the `Predictor` instance. That is, the `batch_size` used for inference and predictions can be increased with either of the following:\n",
"```python\n",
"# you can set the batch_size as an argument to load_predictor (or get_predictor)\n",
"predictor = ktrain.load_predictor('/tmp/my_moviereview_predictor', batch_size=128)\n",
"\n",
"# you can also set the batch_size used for predictions this way\n",
"predictor.batch_size = 128\n",
"```\n",
"Larger batch sizes can potentially speed predictions when `predictor.predict` is supplied with a list of examples."
]
},
{
"cell_type": "markdown",
"metadata": {},
16 changes: 16 additions & 0 deletions tutorials/tutorial-06-sequence-tagging.ipynb
@@ -530,6 +530,22 @@
"reloaded_predictor.predict('Paul Newman is my favorite American actor.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `predict` method also can accept a list of sentences. And, larger batch sizes can potentially speed predictions when `predictor.predict` is supplied with a list of examples.\n",
"\n",
"Both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument used for predictions, which is set to 32 by default. The `batch_size` can also be set manually on the `Predictor` instance. That is, the `batch_size` used for inference and predictions can be increased with either of the following:\n",
"```python\n",
"# you can set the batch_size as an argument to load_predictor (or get_predictor)\n",
"predictor = ktrain.load_predictor('/tmp/mypred', batch_size=128)\n",
"\n",
"# you can also set the batch_size used for predictions this way\n",
"predictor.batch_size = 128\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
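To illustrate the list input that this commit adds to `NERPredictor.predict`, here is a hypothetical usage sketch (the predictor path is illustrative):
```python
import ktrain

predictor = ktrain.load_predictor('/tmp/mypred')  # path assumed from the cell above

# a single sentence (as before) returns a list of (token, tag) tuples
predictor.predict('Paul Newman is my favorite American actor.')

# new in 0.26.2: a list of sentences is predicted in batches of predictor.batch_size
# and returns one list of tuples per sentence
sentences = ['Paul Newman is my favorite American actor.',
             'Albert Einstein was born in Germany.']
for result in predictor.predict(sentences, merge_tokens=True):
    print(result)  # e.g., [('Paul Newman', 'PER'), ...]
```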
