Merge branch 'develop'
amaiya committed Mar 26, 2021
2 parents 4038cec + b1be655 commit 23baf7c
Showing 11 changed files with 134 additions and 30 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,18 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.26.2 (2021-03-26)

### New:
- N/A

### Changed:
- `NERPredictor.predict` now optionally accepts lists of sentences to make sequence-labeling predictions in batches (as all other `Predictor` instances already do).

### Fixed:
- N/A


## 0.26.1 (2021-03-11)

### New:
23 changes: 23 additions & 0 deletions FAQ.md
@@ -70,6 +70,10 @@

- [How do I make quantized predictions with `transformers` models?](#how-do-i-make-quantized-predictions-with-transformers-models)

- [How do I increase batch size for predictions?](#how-do-i-increase-batch-size-for-predictions)

- [How do I speed up predictions?](#how-do-i-increase-batch-size-for-predictions)



---
@@ -991,6 +995,25 @@ def reset_random_seeds(seed=2):
[[Back to Top](#frequently-asked-questions-about-ktrain)]


### How do I increase batch size for predictions?

Increasing the batch size used for inference can potentially speed up predictions on lists of examples.

The `get_predictor` and `load_predictor` functions both accept a `batch_size` argument that will be used when making predictions on lists of examples. The default is 32. The `batch_size` for `Predictor` instances can also be set manually:
```python
predictor = ktrain.load_predictor('/tmp/my_predictor')
predictor.batch_size = 128
predictor.predict(list_of_examples)
```

The `get_learner` function accepts an `eval_batch_size` argument that will be used by the `Learner` instance when evaluating a validation dataset (e.g., `learner.predict`).
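For example, here is a minimal sketch (the `model`, `trn`, and `val` variables are assumed to have been created beforehand, e.g., with `texts_from_array`):
```python
import ktrain

# `model`, `trn`, and `val` are assumed to exist already
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=6,        # batch size used for training
                             eval_batch_size=64)  # batch size used when evaluating validation data
learner.validate()  # evaluates the validation set using eval_batch_size
```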


[[Back to Top](#frequently-asked-questions-about-ktrain)]




### What kinds of applications have been built with *ktrain*?

Examples include:
1 change: 1 addition & 0 deletions README.md
@@ -352,6 +352,7 @@ The above should be all you need on Linux systems and cloud computing environments

**Some important things to note about installation:**
- If using **ktrain** with `tensorflow<=2.1`, you must also downgrade the **transformers** library to `transformers==3.1`.
- If `load_predictor` fails with the error "`AttributeError: 'str' object has no attribute 'decode'`", then downgrade **h5py**: `pip install h5py==2.10.0`
- As of v0.21.x, **ktrain** no longer installs TensorFlow 2 automatically. As indicated above, you should install TensorFlow 2 yourself before installing and using **ktrain**. On Google Colab, TensorFlow 2 should be already installed. You should be able to use **ktrain** with any version of [TensorFlow 2](https://www.tensorflow.org/install/pip?lang=python3). Note, however, that there is a bug in TensorFlow 2.2 and 2.3 that affects the *Learning-Rate-Finder* [that will not be fixed until TensorFlow 2.4](https://github.com/tensorflow/tensorflow/issues/41174#issuecomment-656330268). The bug causes the learning-rate-finder to complete all epochs even after loss has diverged (i.e., no automatic-stopping).
- If using **ktrain** on a local machine with a GPU (versus Google Colab, for example), you'll need to [install GPU support for TensorFlow 2](https://www.tensorflow.org/install/gpu).
- Since some **ktrain** dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues),
2 changes: 1 addition & 1 deletion examples/text/ktrain-ONNX-TFLite-examples.ipynb
@@ -65,7 +65,7 @@
"!pip install ktrain\n",
"\n",
"# load text data\n",
"categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']\n",
"categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)\n",
"test_b = fetch_20newsgroups(subset='test',categories=categories, shuffle=True)\n",
72 changes: 46 additions & 26 deletions ktrain/text/ner/predictor.py
@@ -26,45 +26,63 @@ def get_classes(self):
        return self.c


-    def predict(self, sentence, return_proba=False, merge_tokens=False, custom_tokenizer=None):
+    def predict(self, sentences, return_proba=False, merge_tokens=False, custom_tokenizer=None):
        """
        Makes predictions for a string-representation of a sentence
        Args:
-          sentence(str): sentence of text
+          sentences(list|str): either a single sentence as a string or a list of sentences
          return_proba(bool): If return_proba is True, returns probability distribution for each token
          merge_tokens(bool): If True, tokens will be merged together by the entity
                              to which they are associated:
                              ('Paul', 'B-PER'), ('Newman', 'I-PER') becomes ('Paul Newman', 'PER')
          custom_tokenizer(Callable): If specified, sentence will be tokenized based on custom tokenizer
        Returns:
-          list: list of tuples representing each token.
+          list: If sentences is a string-representation of a single sentence:
+                    a list containing a tuple for each token in the sentence
+                If sentences is a list of sentences:
+                    a list of lists, where each inner list represents a sentence and contains a tuple for each of its tokens
        """
-        if not isinstance(sentence, str):
-            raise ValueError('Param sentence must be a string-representation of a sentence')
+        is_array = not isinstance(sentences, str)
+        if not isinstance(sentences, (str, list)):
+            raise ValueError('Param sentences must be either a string-representation of a sentence or a list of sentence strings.')
        if return_proba and merge_tokens:
            raise ValueError('return_proba and merge_tokens are mutually exclusive with one another.')
-        lang = TU.detect_lang([sentence])
-        nerseq = self.preproc.preprocess([sentence], lang=lang, custom_tokenizer=custom_tokenizer)
-        if not nerseq.prepare_called:
-            nerseq.prepare()
-        nerseq.batch_size = self.batch_size
-        x_true, _ = nerseq[0]
-        lengths = nerseq.get_lengths(0)
-        y_pred = self.model.predict_on_batch(x_true)
-        y_labels = self.preproc.p.inverse_transform(y_pred, lengths)
-        y_labels = y_labels[0]
-        if return_proba:
-            try:
-                probs = np.max(y_pred, axis=2)[0]
-            except:
-                probs = y_pred[0].numpy().tolist()  # TODO: remove after confirmation (#316)
-            return list(zip(nerseq.x[0], y_labels, probs))
-        else:
-            result = list(zip(nerseq.x[0], y_labels))
-            if merge_tokens:
-                result = self.merge_tokens(result, lang)
-            return result
+        if isinstance(sentences, str): sentences = [sentences]
+        lang = TU.detect_lang(sentences)
+
+        # batchify
+        num_chunks = math.ceil(len(sentences)/self.batch_size)
+        batches = U.list2chunks(sentences, n=num_chunks)
+
+        # process batches
+        results = []
+        for batch in batches:
+            nerseq = self.preproc.preprocess(batch, lang=lang, custom_tokenizer=custom_tokenizer)
+            if not nerseq.prepare_called:
+                nerseq.prepare()
+            nerseq.batch_size = len(batch)
+            x_true, _ = nerseq[0]
+            lengths = nerseq.get_lengths(0)
+            y_pred = self.model.predict_on_batch(x_true)
+            y_labels = self.preproc.p.inverse_transform(y_pred, lengths)
+            if return_proba:
+                try:
+                    probs = np.max(y_pred, axis=2)
+                except:
+                    probs = y_pred[0].numpy().tolist()  # TODO: remove after confirmation (#316)
+                for x, y, prob in zip(nerseq.x, y_labels, probs):
+                    result = [(x[i], y[i], prob[i]) for i in range(len(x))]
+                    results.append(result)
+            else:
+                for x, y in zip(nerseq.x, y_labels):
+                    result = list(zip(x, y))
+                    if merge_tokens:
+                        result = self.merge_tokens(result, lang)
+                    results.append(result)
+        if not is_array: results = results[0]
+        return results



    def merge_tokens(self, annotated_sentence, lang):
@@ -105,5 +123,7 @@ def merge_tokens(self, annotated_sentence, lang):
            elif tag and current_token:  # prefix I
                current_token = current_token + sep + token
                current_tag = tag
+        if current_token and current_tag:
+            entities.append((current_token, current_tag))
        return entities
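As an aside, here is a small self-contained sketch (not part of the commit) of how the batchifying step above behaves; `list2chunks` below is a stand-in for `U.list2chunks`, which is assumed to split a list into `n` roughly equal chunks:
```python
import math

def list2chunks(a, n):
    # stand-in for ktrain.utils.list2chunks: split list `a` into `n` roughly equal chunks
    k, m = divmod(len(a), n)
    return [a[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n)]

sentences = ['s1', 's2', 's3', 's4', 's5']
batch_size = 2
num_chunks = math.ceil(len(sentences) / batch_size)  # 3 chunks for 5 sentences
print(list2chunks(sentences, num_chunks))  # [['s1', 's2'], ['s3', 's4'], ['s5']]
```
Each chunk holds at most `batch_size` sentences, so `predict_on_batch` is never handed more examples than the configured batch size.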

4 changes: 2 additions & 2 deletions ktrain/text/predictor.py
@@ -49,8 +49,8 @@ def predict(self, texts, return_proba=False):
        if U.is_huggingface(model=self.model):
            tseq = self.preproc.preprocess_test(texts, verbose=0)
            tseq.batch_size = self.batch_size
-            texts = tseq.to_tfdataset(train=False)
-            preds = self.model.predict(texts)
+            tfd = tseq.to_tfdataset(train=False)
+            preds = self.model.predict(tfd)
            if type(preds).__name__ == 'TFSequenceClassifierOutput':  # dep_fix: undocumented breaking change in transformers==4.0.0
                preds = preds.logits

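For context, a minimal sketch (not part of the commit) of the `transformers` behavior that the `dep_fix` comment above refers to: starting with `transformers==4.0`, TF models return structured output objects rather than raw logit arrays, so the logits must be unwrapped:
```python
# sketch assuming transformers>=4.0 and TensorFlow 2 are installed
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

inputs = tokenizer(['ktrain makes deep learning easier.'], return_tensors='tf')
preds = model(inputs)  # a TFSequenceClassifierOutput object, not a raw array
if type(preds).__name__ == 'TFSequenceClassifierOutput':
    preds = preds.logits  # unwrap the raw logits, as the code above does
```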
2 changes: 2 additions & 0 deletions ktrain/utils.py
@@ -704,6 +704,8 @@ def apply(self, df, train=True):
        for i, col in enumerate(new_lab_cols):
            df[col] = targets[:,i]
        df[new_lab_cols] = targets
+        print(new_lab_cols)
+        print(df[new_lab_cols].head())
        df[new_lab_cols] = df[new_lab_cols].astype('float32')

        return df
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
-__version__ = '0.26.1'
+__version__ = '0.26.2'
15 changes: 15 additions & 0 deletions tutorials/tutorial-01-introduction.ipynb
@@ -804,6 +804,21 @@
"predictor = ktrain.load_predictor('/tmp/mymnist')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument that is set to 32 by default. For instance, the `batch_size` used for inference and predictions can be increased with either of the following:\n",
"```python\n",
"# you can set the batch_size as an argument to load_predictor\n",
"predictor = ktrain.load_predictor('/tmp/mymnist', batch_size=64)\n",
"\n",
"# you can also set the batch_size used for predictions this way\n",
"predictor.batch_size = 64\n",
"```\n",
"Larger batch sizes can potentially speed predictions."
]
},
{
"cell_type": "code",
"execution_count": 17,
15 changes: 15 additions & 0 deletions tutorials/tutorial-04-text-classification.ipynb
@@ -474,6 +474,21 @@
"predictor.predict(['Groundhog Day is my favorite movie of all time!'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument that is set to 32 by default. The `batch_size` can also be set manually on the `Predictor` instance. That is, the `batch_size` used for inference and predictions can be increased with either of the following:\n",
"```python\n",
"# you can set the batch_size as an argument to load_predictor (or get_predictor)\n",
"predictor = ktrain.load_predictor('/tmp/my_moviereview_predictor', batch_size=128)\n",
"\n",
"# you can also set the batch_size used for predictions this way\n",
"predictor.batch_size = 128\n",
"```\n",
"Larger batch sizes can potentially speed predictions when `predictor.predict` is supplied with a list of examples."
]
},
{
"cell_type": "markdown",
"metadata": {},
16 changes: 16 additions & 0 deletions tutorials/tutorial-06-sequence-tagging.ipynb
@@ -530,6 +530,22 @@
"reloaded_predictor.predict('Paul Newman is my favorite American actor.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `predict` method also can accept a list of sentences. And, larger batch sizes can potentially speed predictions when `predictor.predict` is supplied with a list of examples.\n",
"\n",
"Both the `load_predictor` and `get_predictor` functions accept an optional `batch_size` argument used for predictions, which is set to 32 by default. The `batch_size` can also be set manually on the `Predictor` instance. That is, the `batch_size` used for inference and predictions can be increased with either of the following:\n",
"```python\n",
"# you can set the batch_size as an argument to load_predictor (or get_predictor)\n",
"predictor = ktrain.load_predictor('/tmp/mypred', batch_size=128)\n",
"\n",
"# you can also set the batch_size used for predictions this way\n",
"predictor.batch_size = 128\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
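To illustrate the list input that this commit adds to `NERPredictor.predict`, here is a hypothetical usage sketch (the predictor path is illustrative):
```python
import ktrain

predictor = ktrain.load_predictor('/tmp/mypred')  # path assumed from the cell above

# a single sentence (as before) returns a list of (token, tag) tuples
predictor.predict('Paul Newman is my favorite American actor.')

# new in 0.26.2: a list of sentences is predicted in batches of predictor.batch_size
# and returns one list of tuples per sentence
sentences = ['Paul Newman is my favorite American actor.',
             'Albert Einstein was born in Germany.']
for result in predictor.predict(sentences, merge_tokens=True):
    print(result)  # e.g., [('Paul Newman', 'PER'), ...]
```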
