Merge branch 'develop'
amaiya committed Jul 13, 2020
2 parents aa33b7b + b4df378 commit aba86c6
Showing 9 changed files with 138 additions and 50 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,19 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.18.3 (2020-07-12)

### New:
- added `batch_size` argument to `ZeroShotClassifier.predict` that can be increased to speed up predictions.
This is especially useful if `len(topic_strings)` is large.
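  A minimal usage sketch (assumes `transformers` and PyTorch are installed; the import path is an assumption, adjust if your ktrain version exposes the class differently):
  ```python
  # Sketch only: a larger batch_size speeds up prediction when the topic list is long.
  from ktrain.text.zsl import ZeroShotClassifier  # import path assumed

  zsl = ZeroShotClassifier()  # defaults to the 'facebook/bart-large-mnli' NLI model
  doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'
  topics = ['politics', 'elections', 'sports', 'films', 'television']
  predictions = zsl.predict(doc, topic_strings=topics, include_labels=True, batch_size=64)
  ```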

### Changed
- N/A

### Fixed:
- fixed typo in `load_predictor` error message


## 0.18.2 (2020-07-08)

### New:
6 changes: 3 additions & 3 deletions FAQ.md
@@ -77,8 +77,8 @@ Here is how you can quickly get started using *ktrain*:
4. Make sure the notebook is setup to use a GPU: `Runtime --> Change runtime type` and select `GPU` in the menu.
5. Click on each cell in the notebook and execute it by pressing `SHIFT` and `ENTER` at the same time. The notebook shows you how to build a neural network that recognizes cats vs. dogs in photos.


- For more information on `ktrain`, see [the tutorials](https://github.com/amaiya/ktrain#tutorials).
Next, you can go through [the tutorials](https://github.com/amaiya/ktrain#tutorials) to learn more. If you have questions about a method or function,
type a question mark before the method and press ENTER in a Google Colab or Jupyter notebook to learn more. Example: `?learner.autofit`.

- For more information on Python, see [here](https://learnpythonthehardway.org/).

@@ -132,7 +132,7 @@ learner.fit_onecycle(2e-5, 1)

The `checkpoint_folder` argument (e.g., `learner.autofit(1e-4, 4, checkpoint_folder='/tmp/saved_weights')`) saves only the weights of the model after each epoch.
The weights of any epoch can be reloaded into the model using the `model.load_weights` method as you normally would in `tf.Keras`. You just need to first re-create
the model first. For instance, if training an NER model, it would work as follows:
the model. For instance, if training an NER model, it would work as follows:
```python
# recreate model from scratch
import ktrain
16 changes: 9 additions & 7 deletions README.md
@@ -10,6 +10,8 @@
- **2020-07-07:**
- ***ktrain*** **v0.18.x is released** and now includes support for TensorFlow 2.2.0. Due to various TensorFlow 2.2.0 bugs, TF 2.2.0 is only installed if Python 3.8 is being used.
Otherwise, TensorFlow 2.1.0 is always installed (i.e., on Python 3.6 and 3.7 systems).
- **2020-06-28:**
- Hamiz Ahmed published his Medium article: [Finetuning BERT using ktrain for Disaster Tweets Classification](https://medium.com/analytics-vidhya/finetuning-bert-using-ktrain-for-disaster-tweets-classification-18f64a50910b)
- **2020-06-26:**
- ***ktrain*** **v0.17.x is released** and includes support for **language translation**. See the [example language translation notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/language_translation_example.ipynb) for more information. <sub><sup>(This feature currently requires that PyTorch be installed.)</sup></sub>
```python
@@ -102,6 +104,8 @@ Some blog tutorials about *ktrain* are shown below:
> [**Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code**](https://towardsdatascience.com/build-an-open-domain-question-answering-system-with-bert-in-3-lines-of-code-da0131bc516b)
> [**Finetuning BERT using ktrain for Disaster Tweets Classification**](https://medium.com/analytics-vidhya/finetuning-bert-using-ktrain-for-disaster-tweets-classification-18f64a50910b) by Hamiz Ahmed



@@ -285,16 +289,14 @@ Using *ktrain* on **Google Colab**? See these Colab examples:

### Installation

*ktrain* currently uses [TensorFlow 2.1.0 or 2.2.0](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*.
TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems. TensorFlow 2.2.0 will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
On systems where Python 3.8 is the default (e.g., Ubuntu 20.04), we strongly recommend installing and using Python 3.6/3.7 and TensorFlow 2.1.0 with *ktrain* due to problems that currently exist
in versions of TensorFlow >= 2.2.0.

1. Make sure pip is up-to-date with: `pip3 install -U pip`
1. Make sure pip is up-to-date with: `pip3 install -U pip`

2. Install *ktrain*: `pip3 install ktrain`

**Some things to note:**
**Some important things to note about installation:**
- *ktrain* currently uses [TensorFlow 2.1.0 or 2.2.0](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*.
TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems. TensorFlow 2.2.0 will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
On systems where Python 3.8 is the default (e.g., Ubuntu 20.04), we recommend installing and using Python 3.6/3.7 and TensorFlow 2.1.0 with *ktrain* due to unresolved bugs in versions of TensorFlow >= 2.2.0.
- Since some *ktrain* dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues),
*ktrain* is temporarily using forked versions of some libraries. Specifically, *ktrain* uses forked versions of the `eli5` and `stellargraph` libraries. If not installed, *ktrain* will complain when a method or function needing
either of these libraries is invoked.
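A quick post-install check of the TensorFlow/Python pairing described above (a sketch, not part of this commit):
```python
# Sketch: confirm which Python/TensorFlow combination was installed.
# Expect TF 2.1.0 on Python 3.6/3.7 and TF 2.2.0 on Python 3.8.
import sys
import tensorflow as tf

print('Python %d.%d' % sys.version_info[:2], '| TensorFlow', tf.__version__)
```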
100 changes: 79 additions & 21 deletions examples/text/zero_shot_learning_with_nli.ipynb
@@ -11,7 +11,7 @@
"%matplotlib inline\n",
"import os\n",
"os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\" "
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"1\" "
]
},
{
@@ -63,11 +63,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.9829113483428955),\n",
" ('elections', 0.9880988001823425),\n",
" ('sports', 0.00030677582253701985),\n",
" ('films', 0.0008969294722191989),\n",
" ('television', 0.00045271270209923387)]"
"[('politics', 0.9791899),\n",
" ('elections', 0.98745817),\n",
" ('sports', 0.0005765463),\n",
" ('films', 0.0022924456),\n",
" ('television', 0.0010546101)]"
]
},
"execution_count": 4,
@@ -98,11 +98,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.0001159722960437648),\n",
" ('elections', 0.00015142698248382658),\n",
" ('sports', 0.00011554622324183583),\n",
" ('films', 0.035863082855939865),\n",
" ('television', 0.9755581617355347)]"
"[('politics', 0.00015667638),\n",
" ('elections', 0.00032881147),\n",
" ('sports', 0.00013884966),\n",
" ('films', 0.075576425),\n",
" ('television', 0.9813269)]"
]
},
"execution_count": 5,
@@ -130,11 +130,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.8382046818733215),\n",
" ('elections', 0.009549508802592754),\n",
" ('sports', 0.003681211732327938),\n",
" ('films', 0.045103102922439575),\n",
" ('television', 0.9293773174285889)]"
"[('politics', 0.8049428),\n",
" ('elections', 0.01889327),\n",
" ('sports', 0.0055048335),\n",
" ('films', 0.05876928),\n",
" ('television', 0.8776824)]"
]
},
"execution_count": 6,
@@ -169,11 +169,11 @@
{
"data": {
"text/plain": [
"[('politics', 0.0003102553600911051),\n",
" ('elections', 0.00048395441262982786),\n",
" ('sports', 0.9848700761795044),\n",
" ('films', 0.9717175364494324),\n",
" ('television', 0.9505334496498108)]"
"[('politics', 0.0005349868),\n",
" ('elections', 0.0007852868),\n",
" ('sports', 0.98488265),\n",
" ('films', 0.9576993),\n",
" ('television', 0.94114333)]"
]
},
"execution_count": 7,
@@ -186,6 +186,64 @@
"zsl.predict(doc, topic_strings=topic_strings, include_labels=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prediction Time and Batch Size\n",
"\n",
"The `predict` method of `ZeroShotClassifier` generates a separate NLI prediction for each topic included in `topic_strings`. As `len(topic_strings)` increases, the prediction time will also increase. **You can speed up predictions by increasing the `batch_size`.** The default `batch_size` is currently set conservatively at 8:\n",
"\n",
"#### Predicting 800 topics takes ~8 seconds on a TITAN V GPU using `batch_size=4`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 14.9 s, sys: 20.7 ms, total: 15 s\n",
"Wall time: 7.5 s\n"
]
}
],
"source": [
"%%time\n",
"doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
"predictions = zsl.predict(doc, topic_strings=topic_strings*160, include_labels=True, batch_size=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Predicting 800 topics takes less than 2 seconds on a TITAN V GPU using `batch_size=64`"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.87 s, sys: 385 ms, total: 2.26 s\n",
"Wall time: 1.68 s\n"
]
}
],
"source": [
"%%time\n",
"doc = 'I am extremely dissatisfied with the President and will definitely vote in 2020.'\n",
"predictions = zsl.predict(doc, topic_strings=topic_strings*160, include_labels=True, batch_size=64)"
]
},
{
"cell_type": "code",
"execution_count": null,
2 changes: 1 addition & 1 deletion ktrain/core.py
@@ -1377,7 +1377,7 @@ def load_predictor(fpath, batch_size=U.DEFAULT_BS):
#warnings.warn('could not load .preproc file as %s - attempting to load as %s' % (os.path.join(fpath, U.PREPROC_NAME), preproc_name))
with open(preproc_name, 'rb') as f: preproc = pickle.load(f)
except:
raise Exception('Could not find a .preproc file in either the post v0.16.x loction (%s) or pre v0.16.x location (%s)' % (os.path.join(fpath. U.PREPROC_NAME), fpath+'.preproc'))
raise Exception('Could not find a .preproc file in either the post v0.16.x loction (%s) or pre v0.16.x location (%s)' % (os.path.join(fpath, U.PREPROC_NAME), fpath+'.preproc'))

# load the model
model = _load_model(fpath, preproc=preproc)
43 changes: 27 additions & 16 deletions ktrain/text/zsl/core.py
@@ -27,34 +27,45 @@ def __init__(self, model_name='facebook/bart-large-mnli', device=None):
self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to(self.torch_device)


def predict(self, doc, topic_strings=[], include_labels=False):
def predict(self, doc, topic_strings=[], include_labels=False, batch_size=8):
"""
zero-shot topic classification
Args:
doc(str): text of document
topic_strings(list): a list of strings representing topics of your choice
Example:
topic_strings=['political science', 'sports', 'science']
NOTE: len(topic_strings) is treated as batch_size.
If the number of topics is greater than a reasonable batch_size
for your system, you should break up the topic_strings into
chunks and invoke predict separately on each chunk.
include_labels(bool): If True, will return topic labels along with topic probabilities
batch_size(int): batch_size to use. default:8
Increase this value to speed up predictions - especially
if len(topic_strings) is large.
Returns:
inferred probabilities
"""
if topic_strings is None or len(topic_strings) == 0:
raise ValueError('topic_strings must be a list of strings')
true_probs = []
for topic_string in topic_strings:
premise = doc
hypothesis = 'This text is about %s.' % (topic_string)
input_ids = self.tokenizer.encode(premise, hypothesis, return_tensors='pt').to(self.torch_device)
logits = self.model(input_ids)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
# reference: https://joeddav.github.io/blog/2020/05/29/ZSL.html
if batch_size > len(topic_strings): batch_size = len(topic_strings)
topic_chunks = list(U.list2chunks(topic_strings, n=math.ceil(len(topic_strings)/batch_size)))
if len(topic_strings) >= 100 and batch_size==8:
warnings.warn('TIP: Try increasing batch_size to speedup ZeroShotClassifier predictions')
result = []
for topics in topic_chunks:
pairs = []
for topic_string in topics:
premise = doc
hypothesis = 'This text is about %s.' % (topic_string)
pairs.append( (premise, hypothesis) )
batch = self.tokenizer.batch_encode_plus(pairs, return_tensors='pt', padding='longest').to(self.torch_device)
logits = self.model(batch['input_ids'], attention_mask=batch['attention_mask'])[0]
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item()
true_probs.append(true_prob)
if include_labels:
true_probs = list(zip(topic_strings, true_probs))
return true_probs
true_probs = list(probs[:,1].cpu().detach().numpy())
if include_labels:
true_probs = list(zip(topics, true_probs))
result.extend(true_probs)
return result
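For context, a standalone sketch of the entailment-probability step used above; the logit values are made up:
```python
# Sketch of the step above: keep only the contradiction/entailment logits,
# softmax over them, and read off the entailment probability as the topic score.
import torch

logits = torch.tensor([[1.2, 0.3, 2.5]])         # [contradiction, neutral, entailment] for one pair
entail_contradiction_logits = logits[:, [0, 2]]  # drop the "neutral" dimension
probs = entail_contradiction_logits.softmax(dim=1)
print(round(probs[:, 1].item(), 3))              # probability the topic applies: ~0.786
```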

4 changes: 4 additions & 0 deletions ktrain/utils.py
@@ -488,3 +488,7 @@ def get_random_colors(n, name='hsv', hex_format=True):
return np.array(result)


def list2chunks(a, n):
k, m = divmod(len(a), n)
return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
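For illustration, `list2chunks` splits a list into `n` consecutive, nearly equal chunks; `ZeroShotClassifier.predict` uses it to group `topic_strings` into batches. A minimal sketch:
```python
# Split five topics into two consecutive, nearly equal chunks.
from ktrain.utils import list2chunks

topics = ['politics', 'elections', 'sports', 'films', 'television']
print(list(list2chunks(topics, n=2)))
# [['politics', 'elections', 'sports'], ['films', 'television']]
```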

2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
__version__ = '0.18.2'
__version__ = '0.18.3'
2 changes: 1 addition & 1 deletion tutorials/tutorial-04-text-classification.ipynb
@@ -330,7 +330,7 @@
"\n",
"### Making Predictions\n",
"\n",
"Let's predict the sntiment of new movie reviews (or comments in this case) using our trained model.\n",
"Let's predict the sentiment of new movie reviews (or comments in this case) using our trained model.\n",
"\n",
"The ```preproc``` object (returned by ```texts_from_folder```) is important here, as it is used to preprocess data in a way our model expects."
]
