Merge branch 'develop'
amaiya committed Jul 13, 2020
2 parents aa33b7b + b4df378 commit aba86c6
Showing 9 changed files with 138 additions and 50 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,19 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.18.3 (2020-07-12)

### New:
- added `batch_size` argument to `ZeroShotClassifier.predict` that can be increased to speed up predictions.
This is especially useful if `len(topic_strings)` is large.
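  A minimal usage sketch (assumes `transformers` and PyTorch are installed; the import path is an assumption, adjust if your ktrain version exposes the class differently):
  ```python
  # Sketch only: a larger batch_size speeds up prediction when the topic list is long.
  from ktrain.text.zsl import ZeroShotClassifier  # import path assumed

  zsl = ZeroShotClassifier()  # defaults to the 'facebook/bart-large-mnli' NLI model
  doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'
  topics = ['politics', 'elections', 'sports', 'films', 'television']
  predictions = zsl.predict(doc, topic_strings=topics, include_labels=True, batch_size=64)
  ```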

### Changed
- N/A

### Fixed:
- fixed typo in `load_predictor` error message


## 0.18.2 (2020-07-08)

### New:
6 changes: 3 additions & 3 deletions FAQ.md
@@ -77,8 +77,8 @@ Here is how you can quickly get started using *ktrain*:
4. Make sure the notebook is setup to use a GPU: `Runtime --> Change runtime type` and select `GPU` in the menu.
5. Click on each cell in the notebook and execute it by pressing `SHIFT` and `ENTER` at the same time. The notebook shows you how to build a neural network that recognizes cats vs. dogs in photos.


- For more information on `ktrain`, see [the tutorials](https://github.com/amaiya/ktrain#tutorials).
Next, you can go through [the tutorials](https://github.com/amaiya/ktrain#tutorials) to learn more. If you have questions about a method or function,
type a question mark before the method and press ENTER in a Google Colab or Jupyter notebook to learn more. Example: `?learner.autofit`.

- For more information on Python, see [here](https://learnpythonthehardway.org/).

@@ -132,7 +132,7 @@ learner.fit_onecycle(2e-5, 1)

The `checkpoint_folder` argument (e.g., `learner.autofit(1e-4, 4, checkpoint_folder='/tmp/saved_weights')`) saves only the weights of the model after each epoch.
The weights of any epoch can be reloaded into the model using the `model.load_weights` method as you normally would in `tf.Keras`. You just need to first re-create
the model first. For instance, if training an NER model, it would work as follows:
the model. For instance, if training an NER model, it would work as follows:
```python
# recreate model from scratch
import ktrain
16 changes: 9 additions & 7 deletions README.md
@@ -10,6 +10,8 @@
- **2020-07-07:**
- ***ktrain*** **v0.18.x is released** and now includes support for TensorFlow 2.2.0. Due to various TensorFlow 2.2.0 bugs, TF 2.2.0 is only installed if Python 3.8 is being used.
Otherwise, TensorFlow 2.1.0 is always installed (i.e., on Python 3.6 and 3.7 systems).
- **2020-06-28:**
- Hamiz Ahmed published his Medium article: [Finetuning BERT using ktrain for Disaster Tweets Classification](https://medium.com/analytics-vidhya/finetuning-bert-using-ktrain-for-disaster-tweets-classification-18f64a50910b)
- **2020-06-26:**
- ***ktrain*** **v0.17.x is released** and includes support for **language translation**. See the [example language translation notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/language_translation_example.ipynb) for more information. <sub><sup>(This feature currently requires that PyTorch be installed.)</sup></sub>
```python
@@ -102,6 +104,8 @@ Some blog tutorials about *ktrain* are shown below:
> [**Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code**](https://towardsdatascience.com/build-an-open-domain-question-answering-system-with-bert-in-3-lines-of-code-da0131bc516b)
> [**Finetuning BERT using ktrain for Disaster Tweets Classification**](https://medium.com/analytics-vidhya/finetuning-bert-using-ktrain-for-disaster-tweets-classification-18f64a50910b) by Hamiz Ahmed



@@ -285,16 +289,14 @@ Using *ktrain* on **Google Colab**? See these Colab examples:

### Installation

*ktrain* currently uses [TensorFlow 2.1.0 or 2.2.0](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*.
TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems. TensorFlow 2.2.0 will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
On systems where Python 3.8 is the default (e.g., Ubuntu 20.04), we strongly recommend installing and using Python 3.6/3.7 and TensorFlow 2.1.0 with *ktrain* due to problems that currently exist
in versions of TensorFlow >= 2.2.0.

1. Make sure pip is up-to-date with: `pip3 install -U pip`
1. Make sure pip is up-to-date with: `pip3 install -U pip`

2. Install *ktrain*: `pip3 install ktrain`

**Some things to note:**
**Some important things to note about installation:**
- *ktrain* currently uses [TensorFlow 2.1.0 or 2.2.0](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*.
TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems. TensorFlow 2.2.0 will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
On systems where Python 3.8 is the default (e.g., Ubuntu 20.04), we recommend installing and using Python 3.6/3.7 and TensorFlow 2.1.0 with *ktrain* due to unresolved bugs in versions of TensorFlow >= 2.2.0.
- Since some *ktrain* dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues),
*ktrain* is temporarily using forked versions of some libraries. Specifically, *ktrain* uses forked versions of the `eli5` and `stellargraph` libraries. If not installed, *ktrain* will complain when a method or function needing
either of these libraries is invoked.
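A quick post-install check of the TensorFlow/Python pairing described above (a sketch, not part of this commit):
```python
# Sketch: confirm which Python/TensorFlow combination was installed.
# Expect TF 2.1.0 on Python 3.6/3.7 and TF 2.2.0 on Python 3.8.
import sys
import tensorflow as tf

print('Python %d.%d' % sys.version_info[:2], '| TensorFlow', tf.__version__)
```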
100 changes: 79 additions & 21 deletions examples/text/zero_shot_learning_with_nli.ipynb
@@ -11,7 +11,7 @@
"%matplotlib inline\n",
"import os\n",
"os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\" "
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"1\" "
]
},
{
@@ -63,11 +63,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.9829113483428955),\n",
" ('elections', 0.9880988001823425),\n",
" ('sports', 0.00030677582253701985),\n",
" ('films', 0.0008969294722191989),\n",
" ('television', 0.00045271270209923387)]"
"[('politics', 0.9791899),\n",
" ('elections', 0.98745817),\n",
" ('sports', 0.0005765463),\n",
" ('films', 0.0022924456),\n",
" ('television', 0.0010546101)]"
]
},
"execution_count": 4,
@@ -98,11 +98,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.0001159722960437648),\n",
" ('elections', 0.00015142698248382658),\n",
" ('sports', 0.00011554622324183583),\n",
" ('films', 0.035863082855939865),\n",
" ('television', 0.9755581617355347)]"
"[('politics', 0.00015667638),\n",
" ('elections', 0.00032881147),\n",
" ('sports', 0.00013884966),\n",
" ('films', 0.075576425),\n",
" ('television', 0.9813269)]"
]
},
"execution_count": 5,
@@ -130,11 +130,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.8382046818733215),\n",
" ('elections', 0.009549508802592754),\n",
" ('sports', 0.003681211732327938),\n",
" ('films', 0.045103102922439575),\n",
" ('television', 0.9293773174285889)]"
"[('politics', 0.8049428),\n",
" ('elections', 0.01889327),\n",
" ('sports', 0.0055048335),\n",
" ('films', 0.05876928),\n",
" ('television', 0.8776824)]"
]
},
"execution_count": 6,
@@ -169,11 +169,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.0003102553600911051),\n",
" ('elections', 0.00048395441262982786),\n",
" ('sports', 0.9848700761795044),\n",
" ('films', 0.9717175364494324),\n",
" ('television', 0.9505334496498108)]"
"[('politics', 0.0005349868),\n",
" ('elections', 0.0007852868),\n",
" ('sports', 0.98488265),\n",
" ('films', 0.9576993),\n",
" ('television', 0.94114333)]"
]
},
"execution_count": 7,
@@ -186,6 +186,64 @@
"zsl.predict(doc, topic_strings=topic_strings, include_labels=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prediction Time and Batch Size\n",
"\n",
"The `predict` method of `ZeroShotClassifier` generates a separate NLI prediction for each topic included in `topic_strings`. As `len(topic_strings)` increases, the prediction time will also increase. **You can speed up predictions by increasing the `batch_size`.** The default `batch_size` is currently set conservatively at 8:\n",
"\n",
"#### Predicting 800 topics takes ~8 seconds on a TITAN V GPU using `batch_size=4`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 14.9 s, sys: 20.7 ms, total: 15 s\n",
"Wall time: 7.5 s\n"
]
}
],
"source": [
"%%time\n",
"doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
"predictions = zsl.predict(doc, topic_strings=topic_strings*160, include_labels=True, batch_size=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Predicting 800 topics takes less than 2 seconds on a TITAN V GPU using `batch_size=64`"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.87 s, sys: 385 ms, total: 2.26 s\n",
"Wall time: 1.68 s\n"
]
}
],
"source": [
"%%time\n",
"doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
"predictions = zsl.predict(doc, topic_strings=topic_strings*160, include_labels=True, batch_size=64)"
]
},
{
"cell_type": "code",
"execution_count": null,
2 changes: 1 addition & 1 deletion ktrain/core.py
@@ -1377,7 +1377,7 @@ def load_predictor(fpath, batch_size=U.DEFAULT_BS):
#warnings.warn('could not load .preproc file as %s - attempting to load as %s' % (os.path.join(fpath, U.PREPROC_NAME), preproc_name))
with open(preproc_name, 'rb') as f: preproc = pickle.load(f)
except:
raise Exception('Could not find a .preproc file in either the post v0.16.x loction (%s) or pre v0.16.x location (%s)' % (os.path.join(fpath. U.PREPROC_NAME), fpath+'.preproc'))
raise Exception('Could not find a .preproc file in either the post v0.16.x loction (%s) or pre v0.16.x location (%s)' % (os.path.join(fpath, U.PREPROC_NAME), fpath+'.preproc'))

# load the model
model = _load_model(fpath, preproc=preproc)
43 changes: 27 additions & 16 deletions ktrain/text/zsl/core.py
@@ -27,34 +27,45 @@ def __init__(self, model_name='facebook/bart-large-mnli', device=None):
self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to(self.torch_device)


def predict(self, doc, topic_strings=[], include_labels=False):
def predict(self, doc, topic_strings=[], include_labels=False, batch_size=8):
"""
zero-shot topic classification
Args:
doc(str): text of document
topic_strings(list): a list of strings representing topics of your choice
Example:
topic_strings=['political science', 'sports', 'science']
NOTE: len(topic_strings) is treated as batch_size.
If the number of topics is greater than a reasonable batch_size
for your system, you should break up the topic_strings into
chunks and invoke predict separately on each chunk.
include_labels(bool): If True, will return topic labels along with topic probabilities
batch_size(int): batch_size to use. default:8
Increase this value to speed up predictions - especially
if len(topic_strings) is large.
Returns:
inferred probabilities
"""
if topic_strings is None or len(topic_strings) == 0:
raise ValueError('topic_strings must be a list of strings')
true_probs = []
for topic_string in topic_strings:
premise = doc
hypothesis = 'This text is about %s.' % (topic_string)
input_ids = self.tokenizer.encode(premise, hypothesis, return_tensors='pt').to(self.torch_device)
logits = self.model(input_ids)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
# reference: https://joeddav.github.io/blog/2020/05/29/ZSL.html
if batch_size > len(topic_strings): batch_size = len(topic_strings)
topic_chunks = list(U.list2chunks(topic_strings, n=math.ceil(len(topic_strings)/batch_size)))
if len(topic_strings) >= 100 and batch_size==8:
warnings.warn('TIP: Try increasing batch_size to speedup ZeroShotClassifier predictions')
result = []
for topics in topic_chunks:
pairs = []
for topic_string in topics:
premise = doc
hypothesis = 'This text is about %s.' % (topic_string)
pairs.append( (premise, hypothesis) )
batch = self.tokenizer.batch_encode_plus(pairs, return_tensors='pt', padding='longest').to(self.torch_device)
logits = self.model(batch['input_ids'], attention_mask=batch['attention_mask'])[0]
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item()
true_probs.append(true_prob)
if include_labels:
true_probs = list(zip(topic_strings, true_probs))
return true_probs
true_probs = list(probs[:,1].cpu().detach().numpy())
if include_labels:
true_probs = list(zip(topics, true_probs))
result.extend(true_probs)
return result
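For context, a standalone sketch of the entailment-probability step used above; the logit values are made up:
```python
# Sketch of the step above: keep only the contradiction/entailment logits,
# softmax over them, and read off the entailment probability as the topic score.
import torch

logits = torch.tensor([[1.2, 0.3, 2.5]])         # [contradiction, neutral, entailment] for one pair
entail_contradiction_logits = logits[:, [0, 2]]  # drop the "neutral" dimension
probs = entail_contradiction_logits.softmax(dim=1)
print(round(probs[:, 1].item(), 3))              # probability the topic applies: ~0.786
```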

4 changes: 4 additions & 0 deletions ktrain/utils.py
@@ -488,3 +488,7 @@ def get_random_colors(n, name='hsv', hex_format=True):
return np.array(result)


def list2chunks(a, n):
k, m = divmod(len(a), n)
return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
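For illustration, `list2chunks` splits a list into `n` consecutive, nearly equal chunks; `ZeroShotClassifier.predict` uses it to group `topic_strings` into batches. A minimal sketch:
```python
# Split five topics into two consecutive, nearly equal chunks.
from ktrain.utils import list2chunks

topics = ['politics', 'elections', 'sports', 'films', 'television']
print(list(list2chunks(topics, n=2)))
# [['politics', 'elections', 'sports'], ['films', 'television']]
```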

2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
__version__ = '0.18.2'
__version__ = '0.18.3'
2 changes: 1 addition & 1 deletion tutorials/tutorial-04-text-classification.ipynb
@@ -330,7 +330,7 @@
"\n",
"### Making Predictions\n",
"\n",
"Let's predict the sntiment of new movie reviews (or comments in this case) using our trained model.\n",
"Let's predict the sentiment of new movie reviews (or comments in this case) using our trained model.\n",
"\n",
"The ```preproc``` object (returned by ```texts_from_folder```) is important here, as it is used to preprocess data in a way our model expects."
]
