Merge branch 'develop'
amaiya committed Jan 20, 2020
2 parents 0ce9f45 + 704034e commit cfaa18e
Showing 8 changed files with 162 additions and 16 deletions.
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,21 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour


## 0.8.2 (TBD)

### New:
- initial base `ktrain.Dataset` class for use as a `Sequence` wrapper to better support custom datasets/models (a sketch follows this entry)

### Changed:
- N/A

### Fixed:
- N/A
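
For illustration, a minimal sketch of what subclassing the new base class might look like. This is an editorial sketch, not part of the release: the `MyDataset` name and the in-memory array storage are hypothetical (see `ktrain/data.py` below for the actual base class):

```python
import numpy as np
from ktrain import Dataset  # exported from ktrain/__init__.py as of 0.8.2

class MyDataset(Dataset):
    """Hypothetical wrapper around in-memory numpy arrays."""
    def __init__(self, x, y, batch_size=32):
        super().__init__(batch_size=batch_size)  # base class stores batch_size
        self.x, self.y = x, y

    # required by keras.utils.Sequence: number of batches per epoch
    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    # required by keras.utils.Sequence: return one batch of (inputs, targets)
    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

    # required: used by Learner instances
    def nsamples(self):
        return len(self.x)

    # required: used by Learner instances
    def get_y(self):
        return self.y
```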




## 0.8.1 (2020-01-15)

### New:
6 changes: 3 additions & 3 deletions README.md
@@ -6,7 +6,7 @@

### News and Announcements
- **2020-01-14:**
-- ***ktrain*** **v0.8.x is released** and now includes a thin and easy-to-use wrapper to [HuggingFace Transformers](https://github.com/huggingface/transformers) for text classification. See [this tutorial notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/tutorials/tutorial-A3-hugging_face_transformers.ipynb) for more details.
+- ***ktrain*** **v0.8.x is released** and now includes a thin and easy-to-use wrapper to [HuggingFace Transformers](https://github.com/huggingface/transformers) for text classification. See [this tutorial notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb) for more details.
- As of v0.8.x, *ktrain* now uses **TensorFlow 2**. TensorFlow 1.x is no longer supported. If you're using Google Colab and `import tensorflow as tf; print(tf.__version__)` shows v1.15 is installed, you must install TensorFlow 2: `!pip3 install -q tensorflow_gpu==2.0`. Remember to import Keras modules like this: `from tensorflow.keras.layers import Dense`. (That is, don't do this: `from keras.layers import Dense`.)
- **Coming Soon**:
- better support for custom data formats and models
@@ -26,7 +26,7 @@
- **Sequence Labeling**: [Bidirectional LSTM-CRF](https://arxiv.org/abs/1603.01360) with optional pretrained word embeddings <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-06-sequence-tagging.ipynb)]</sup></sub>
- **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
- **Document Similarity with One-Class Learning**: given some documents of interest, find and score new documents that are semantically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
-- **Document Recommendation Engine**: given text from a sample document, recommend documents that are semantically similar to it from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
+- **Document Recommendation Engine**: given text from a sample document, recommend documents that are semantically related to it from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- `vision` data:
- **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/vision/dogs_vs_cats-ResNet50.ipynb)]</sup></sub>
- `graph` data:
@@ -48,7 +48,7 @@ Please see the following tutorial notebooks for a guide on how to use *ktrain* o
* Tutorial 7: [Graph Node Classification](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-07-graph-node_classification.ipynb) with Graph Neural Networks
* Tutorial A1: [Additional tricks](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A1-additional-tricks.ipynb), which covers topics such as previewing data augmentation schemes, inspecting intermediate output of Keras models for debugging, setting global weight decay, and use of built-in and custom callbacks.
* Tutorial A2: [Explaining Predictions and Misclassifications](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A2-explaining-predictions.ipynb)
-* Tutorial A3: [Text Classification with Hugging Face Transformers](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/tutorials/tutorial-A3-hugging_face_transformers.ipynb)
+* Tutorial A3: [Text Classification with Hugging Face Transformers](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb)


Some blog tutorials about *ktrain* are shown below:
4 changes: 3 additions & 1 deletion ktrain/__init__.py
@@ -5,10 +5,12 @@
 from .text.learner import BERTTextClassLearner, TransformerTextClassLearner
 from .text.ner.learner import NERLearner
 from .graph.learner import NodeClassLearner
+from .data import Dataset

 from . import utils as U

-__all__ = ['get_learner', 'get_predictor', 'load_predictor', 'release_gpu_memory']
+__all__ = ['get_learner', 'get_predictor', 'load_predictor', 'release_gpu_memory',
+           'Dataset']



42 changes: 40 additions & 2 deletions ktrain/core.py
@@ -1185,12 +1185,50 @@ def layer_output(self, layer_id, example_id=0, batch_id=0, use_val=False):
         return layer_out


+    #def view_top_losses(self, n=4, preproc=None, val_data=None):
+    #    """
+    #    Views observations with top losses in validation set.
+    #    Musta be overridden by Learner subclasses.
+    #    """
+    #    raise NotImplementedError('view_top_losses must be overriden by GenLearner subclass')
     def view_top_losses(self, n=4, preproc=None, val_data=None):
         """
         Views observations with top losses in validation set.
-        Musta be overridden by Learner subclasses.
+        Typically over-ridden by Learner subclasses.
+        Args:
+            n(int or tuple): a range to select in form of int or tuple
+                             e.g., n=8 is treated as n=(0,8)
+            preproc (Preprocessor): A TextPreprocessor or ImagePreprocessor.
+                                    For some data like text data, a preprocessor
+                                    is required to undo the pre-processing
+                                    to correctly view raw data.
+            val_data: optional val_data to use instead of self.val_data
+        Returns:
+            list of n tuples where first element is either
+            filepath or id of validation example and second element
+            is loss.
         """
-        raise NotImplementedError('view_top_losses must be overriden by GenLearner subclass')
+        val = self._check_val(val_data)
+
+
+        # get top losses and associated data
+        tups = self.top_losses(n=n, val_data=val, preproc=preproc)
+
+        # get multilabel status and class names
+        classes = preproc.get_classes() if preproc is not None else None
+        # iterate through losses
+        for tup in tups:
+
+            # get data
+            idx = tup[0]
+            loss = tup[1]
+            truth = tup[2]
+            pred = tup[3]
+
+            print('----------')
+            print("id:%s | loss:%s | true:%s | pred:%s\n" % (idx, round(loss, 2), truth, pred))
+        return
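
A hedged usage sketch of this new default implementation (assumes a trained `learner` returned by `ktrain.get_learner` and, for text data, the matching `preproc` object; both names are illustrative):

```python
# print the three highest-loss validation examples with true and predicted labels
learner.view_top_losses(n=3, preproc=preproc)
```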



84 changes: 84 additions & 0 deletions ktrain/data.py
@@ -0,0 +1,84 @@
from .imports import *


class Dataset(Sequence):
    def __init__(self, batch_size=32):
        self.batch_size = batch_size

    # required by keras.utils.Sequence instances
    def __len__(self):
        raise NotImplementedError

    # required by keras.utils.Sequence instances
    def __getitem__(self, idx):
        raise NotImplementedError

    # required: used by Learner instances
    def nsamples(self):
        raise NotImplementedError

    # required: used by Learner instances
    def get_y(self):
        raise NotImplementedError

    # optional: to modify dataset between epochs (e.g., shuffle)
    def on_epoch_end(self):
        pass

    # optional
    def ondisk(self):
        return False

    # optional: used only if invoking *_classifier functions
    def xshape(self):
        raise NotImplementedError

    # optional: used only if invoking *_classifier functions
    def nclasses(self):
        raise NotImplementedError


class MultiArrayDataset(Dataset):
    def __init__(self, x, y, batch_size=32):
        if type(x) != np.ndarray or type(y) != np.ndarray:
            raise ValueError('x and y must be numpy arrays')
        if len(x.shape) != 3:
            raise ValueError('x must have 3 dimensions')
        super().__init__(batch_size=batch_size)
        self.x, self.y = x, y
        self.indices = np.arange(self.x[0].shape[0])
        self.n_inputs = x.shape[0]

    def __len__(self):
        return math.ceil(self.x[0].shape[0] / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = []
        for i in range(self.n_inputs):
            batch_x.append(self.x[i][inds])
        batch_y = self.y[inds]
        return tuple(batch_x), batch_y

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

    def xshape(self):
        return self.x.shape

    def nsamples(self):
        if self.n_inputs == 1:
            return self.x.shape[0]
        else:
            return self.x.shape[1]

    def nclasses(self):
        return self.y.shape[1]

    def get_y(self):
        return self.y

    def ondisk(self):
        return False
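A quick usage sketch of `MultiArrayDataset` (hedged: the shapes are illustrative; per the constructor check above, `x` must be a single 3-D array stacking the model's inputs):

```python
import numpy as np
from ktrain.data import MultiArrayDataset

x = np.random.rand(2, 100, 10)               # two inputs, 100 samples, 10 features
y = np.eye(5)[np.random.randint(0, 5, 100)]  # one-hot labels for 5 classes

ds = MultiArrayDataset(x, y, batch_size=16)
batch_x, batch_y = ds[0]                      # tuple of two (16, 10) arrays, (16, 5) labels
print(len(ds), ds.nsamples(), ds.nclasses())  # 7 100 5
```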
21 changes: 14 additions & 7 deletions ktrain/utils.py
@@ -1,4 +1,5 @@
 from .imports import *
+from .data import Dataset


#------------------------------------------------------------------------------
@@ -156,7 +157,8 @@ def is_multilabel(data):
 def shape_from_data(data):
     err_msg = 'could not determine shape from %s' % (type(data))
     if is_iter(data):
-        if is_ner(data=data): return (len(data.x), data[0][0][0].shape[1]) # NERSequence
+        if isinstance(data, Dataset): return data.xshape()
+        elif is_ner(data=data): return (len(data.x), data[0][0][0].shape[1]) # NERSequence
         elif is_huggingface(data=data): # HF Transformer
             return (len(data.x), data[0][0][0].shape[1])
         elif is_nodeclass(data=data): # NodeSequence
@@ -180,16 +182,19 @@ def shape_from_data(data):


 def ondisk(data):
+    if hasattr(data, 'ondisk'): return data.ondisk()
+
     ondisk = is_iter(data) and \
-             (type(data).__name__ not in ['NumpyArrayIterator', 'NERSequence',
+             (type(data).__name__ not in ['ArrayDataset', 'NumpyArrayIterator', 'NERSequence',
                                           'NodeSequenceWrapper', 'TransformerSequence'])
     return ondisk


 def nsamples_from_data(data):
     err_msg = 'could not determine number of samples from %s' % (type(data))
     if is_iter(data):
-        if is_ner(data=data): return len(data.x) # NERSequence
+        if isinstance(data, Dataset): return data.nsamples()
+        elif is_ner(data=data): return len(data.x) # NERSequence
         elif is_huggingface(data=data): # HuggingFace Transformer
             return len(data.x)
         elif is_nodeclass(data=data): # NodeSequenceWrapper
@@ -212,7 +217,8 @@ def nsamples_from_data(data):

 def nclasses_from_data(data):
     if is_iter(data):
-        if is_ner(data=data): return len(data.p._label_vocab._id2token) # NERSequence
+        if isinstance(data, Dataset): return data.nclasses()
+        elif is_ner(data=data): return len(data.p._label_vocab._id2token) # NERSequence
         elif is_huggingface(data=data): # Hugging Face Transformer
             return data.y.shape[1]
         elif is_nodeclass(data=data): # NodeSequenceWrapper
@@ -233,7 +239,8 @@ def nclasses_from_data(data):

 def y_from_data(data):
     if is_iter(data):
-        if is_ner(data=data): return data.y # NERSequence
+        if isinstance(data, Dataset): return data.get_y()
+        elif is_ner(data=data): return data.y # NERSequence
         if is_huggingface(data=data): # Hugging Face Transformer
             return data.y
         elif is_nodeclass(data=data): # NodeSequenceWrapper
@@ -258,7 +265,7 @@ def is_iter(data, ignore=False):
     iter_classes = ["NumpyArrayIterator", "DirectoryIterator",
                     "DataFrameIterator", "Iterator", "Sequence",
                     "NERSequence", "NodeSequenceWrapper", "TransformerSequence"]
-    return data.__class__.__name__ in iter_classes
+    return data.__class__.__name__ in iter_classes or isinstance(data, Dataset)



@@ -271,7 +278,7 @@ def data_arg_check(train_data=None, val_data=None, train_required=False, val_req
     if train_data is not None and not is_iter(train_data, ndarray_only):
         if bad_data_tuple(train_data):
             err_msg = 'data must be tuple of numpy.ndarrays'
-            if not ndarray_only: err_msg += ' or an instance of Iterator'
+            if not ndarray_only: err_msg += ' or an instance of ktrain.Dataset'
             raise ValueError(err_msg)
     if val_data is not None and not is_iter(val_data, ndarray_only):
         if bad_data_tuple(val_data):
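Taken together, these changes mean any `ktrain.Dataset` instance is accepted wherever an iterator-style dataset is. A hedged sketch of the effect, reusing the hypothetical `MyDataset` from the CHANGELOG note above (`x_train`/`y_train` are your own numpy arrays):

```python
from ktrain import utils as U

ds = MyDataset(x_train, y_train)
print(U.is_iter(ds))                 # True, via the isinstance(data, Dataset) check
print(U.nsamples_from_data(ds))      # dispatches to ds.nsamples()
print(U.y_from_data(ds) is y_train)  # True: dispatches to ds.get_y()
```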
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
 __all__ = ['__version__']
-__version__ = '0.8.1'
+__version__ = '0.8.2'
4 changes: 2 additions & 2 deletions setup.py
@@ -18,7 +18,7 @@
     author = 'Arun S. Maiya',
     author_email = 'arun@maiya.net',
     url = 'https://github.com/amaiya/ktrain',
-    keywords = ['keras', 'deep learning', 'machine learning'],
+    keywords = ['tensorflow', 'keras', 'deep learning', 'machine learning'],
install_requires=[
'scikit-learn == 0.21.3',
'matplotlib >= 3.0.0',
@@ -45,7 +45,7 @@
         # 3 - Alpha
         # 4 - Beta
         # 5 - Production/Stable
-        'Development Status :: 3 - Alpha',
+        'Development Status :: 4 - Beta',

# Indicate who your project is intended for
'Intended Audience :: Developers',