Merge branch 'develop'

amaiya · Nov 12, 2019 · 75feba3
2 parents e3578cf + 5d5f6bf
Showing 16 changed files with 7,093 additions and 34 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,23 @@ Most recent releases are shown at the top. Each release shows:
 - **Changed**: Additional parameters, changes to inputs or outputs, etc
 - **Fixed**: Bug fixes that don't change documented behaviour
 
+## 0.6.0 (2019-11-12)
+
+### New:
+- support for learning from unlabeled or partially-labeled text data
+  - unsupervised topic modeling with LDA
+  - one-class text classification to score documents based on similarity to a set of positive examples
+  - document recommendation engine
+
+### Changed:
+- N/A
+
+
+### Fixed:
+- Removed dangling reference to external 'stellargraph' dependency from `_load_model`, so that we rely solely on
+  local version of stellargraph
+
+
 ## 0.5.2 (2019-10-20)
 
 ### New:

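Review note: the changelog entry above describes scoring documents by similarity to a set of positive examples, and recommending similar documents. The sketch below illustrates that general idea only. It swaps in plain bag-of-words cosine similarity as a stand-in; it is not ktrain's implementation, which this diff does not show, and the function names (`bag_of_words`, `cosine`, `recommend`, `seed_score`) and the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, corpus, n=2):
    """Return the n (score, document) pairs most similar to the query."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def seed_score(doc, seed_docs):
    """Score a document by its mean similarity to a set of positive examples."""
    doc_vec = bag_of_words(doc)
    total = sum(cosine(doc_vec, bag_of_words(seed)) for seed in seed_docs)
    return total / len(seed_docs)


# Hypothetical toy corpus, made up for illustration.
corpus = [
    "the rocket launch was delayed by weather",
    "the goalie made a great save in the hockey game",
    "nasa plans a new rocket launch to orbit",
]

top = recommend("rocket launch to orbit", corpus, n=2)
space_seeds = [corpus[0], corpus[2]]
on_topic = seed_score("a rocket to orbit", space_seeds)
off_topic = seed_score("a hockey game", space_seeds)
```

Raw word counts let common words such as "the" inflate similarity, which is one reason a real system would likely use a richer document representation (for example, topic vectors) before measuring similarity.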
diff --git a/README.md b/README.md
@@ -1,23 +1,30 @@
 ### News and Announcements
+- **2019-11-12:**  
+  - *ktrain* v0.6.x is released and includes pre-canned support for [learning from unlabeled or partially labeled text data](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-05-learning_from_unlabeled_text_data.ipynb).
+- **2019-10-16:**  
+  - *ktrain* v0.5.x is released and includes pre-canned support for [node classification in graphs](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/graphs/hateful_twitter_users-GraphSAGE.ipynb).
 - **Coming Soon**:
   - better support for custom data formats and models
   - support for using *ktrain* with `tf.keras`
-- **2019-10-16:**  
-  - *ktrain* v0.5.x is released and includes pre-canned support for [node classification in graphs](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/graphs/hateful_twitter_users-GraphSAGE.ipynb).
 ----
 
 
 # ktrain
-
-*ktrain* is a lightweight wrapper for the deep learning library [Keras](https://keras.io/) to help build, train, and deploy neural networks.  With only a few lines of code, ktrain allows you to easily and quickly:
+*ktrain* is a lightweight wrapper for the deep learning library [Keras](https://keras.io/) (and other libraries) to help build, train, and deploy neural networks.  With only a few lines of code, ktrain allows you to easily and quickly:
 
 - estimate an optimal learning rate for your model given your data using a Learning Rate Finder
 - utilize learning rate schedules such as the [triangular policy](https://arxiv.org/abs/1506.01186), the [1cycle policy](https://arxiv.org/abs/1803.09820), and [SGDR](https://arxiv.org/abs/1608.03983) to effectively minimize loss and improve generalization
-- employ fast and easy-to-use pre-canned models for:
-  - **text classification** (e.g., [BERT](https://arxiv.org/abs/1810.04805), [NBSVM](https://www.aclweb.org/anthology/P12-2018), [fastText](https://arxiv.org/abs/1607.01759), GRUs with [pretrained word vectors](https://fasttext.cc/docs/en/english-vectors.html))
-  - **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf))
-  - **text sequence labeling** (e.g., [Bidirectional LSTM-CRF](https://arxiv.org/abs/1603.01360) with optional pretrained word embeddings)
-  - **graph node classification** (e.g., [GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf))
+- employ fast and easy-to-use pre-canned models for  `text`, `vision`, and `graph` data:
+  - `text` data:
+     - **Text Classification**: [BERT](https://arxiv.org/abs/1810.04805), [NBSVM](https://www.aclweb.org/anthology/P12-2018), [fastText](https://arxiv.org/abs/1607.01759), GRUs with [pretrained word vectors](https://fasttext.cc/docs/en/english-vectors.html), and other models <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/IMDb-BERT.ipynb)]</sup></sub>
+     - **Sequence Labeling**:  [Bidirectional LSTM-CRF](https://arxiv.org/abs/1603.01360) with optional pretrained word embeddings <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-06-sequence-tagging.ipynb)]</sup></sub>
+     - **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
+     - **Document Similarity with One-Class Learning**:  given some documents of interest, find and score new documents that are semantically similar to them using [One Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
+     - **Document Recommendation Engine**:  given text from a sample document, recommend documents that are semantically similar to it from a larger corpus  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
+  - `vision` data:
+    - **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/vision/dogs_vs_cats-ResNet50.ipynb)]</sup></sub>
+  - `graph` data:
+    - **graph node classification** (e.g., [GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/graphs/pubmed-GraphSAGE.ipynb)]</sup></sub>
 - perform multilingual text classification (e.g., [Chinese Sentiment Analysis with BERT](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ChineseHotelReviews-BERT.ipynb), [Arabic Sentiment Analysis with NBSVM](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-nbsvm.ipynb))
 - load and preprocess text and image data from a variety of formats 
 - inspect data points that were misclassified and [provide explanations](https://eli5.readthedocs.io/en/latest/) to help improve your model
@@ -30,10 +37,11 @@ Please see the following tutorial notebooks for a guide on how to use *ktrain* o
 * Tutorial 2:  [Tuning Learning Rates](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-02-tuning-learning-rates.ipynb)
 * Tutorial 3: [Image Classification](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-03-image-classification.ipynb)
 * Tutorial 4: [Text Classification](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-04-text-classification.ipynb)
-* Tutorial 5: [Explaining Predictions and Misclassifications](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-05-explaining-predictions.ipynb)
+* Tutorial 5: [Learning from Unlabeled Text Data](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-05-learning_from_unlabeled_text_data.ipynb)
 * Tutorial 6: [Text Sequence Tagging](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-06-sequence-tagging.ipynb) for Named Entity Recognition
 * Tutorial 7: [Graph Node Classification](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-07-graph-node_classification.ipynb) with Graph Neural Networks
 * Tutorial A1: [Additional tricks](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-A1-additional-tricks.ipynb), which covers topics such as previewing data augmentation schemes, inspecting intermediate output of Keras models for debugging, setting global weight decay, and use of built-in and custom callbacks.
+* Tutorial A2: [Explaining Predictions and Misclassifications](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorial-A2-explaining-predictions.ipynb)
 
 
 Some blog tutorials about *ktrain* are shown below:

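Review note: the README text in this diff links to the triangular learning rate policy. As a rough illustration of what that schedule computes, here is a minimal sketch that follows the formula in the linked paper (arXiv:1506.01186). The function name `triangular_lr` and the example values are assumptions for illustration and are not taken from ktrain's code.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate at a given iteration under the triangular policy.

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# One full cycle (8 iterations) with hypothetical bounds.
schedule = [triangular_lr(i, base_lr=0.001, max_lr=0.006, step_size=4)
            for i in range(9)]
```

With these values the schedule starts at the lower bound, peaks at iteration 4, and returns to the lower bound at iteration 8.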
diff --git a/examples/README.md b/examples/README.md
@@ -1,13 +1,23 @@
 # Example Notebooks
 
 This directory contains various example notebooks using *ktrain*.  The directory currently has three folders:
-- **text**:  text classification examples using various models and datasets
-- **vision**:  image classification examples using various models and datasets
-- **graphs**:  node classification in graphs or networks
+- `text`:  
+  - [text classification](#textclass): examples using various text classification models and datasets
+  - [text sequence labeling](#seqlab):  sequence tagging models
+  - [topic modeling](#lda):  unsupervised learning from unlabeled text data
+  - [document similarity with one-class learning](#docsim): given a sample of interesting documents, find and score new documents that are semantically similar to them using one-class text classification
+  - [document recommender system](#docrec):  given text from a sample document, recommend documents that are semantically similar to it from a larger corpus 
+- `vision`:  
+  - [image classification](#imageclass):  image classification examples using various models and datasets
+- `graphs`: 
+  - [node classification](#nodeclass): node classification in graphs or networks
 
-## Text Classification Datasets
 
-### [IMDb](https://ai.stanford.edu/~amaas/data/sentiment/):  Binary Classification
+## Text Data
+
+### <a name="textclass"></a>Text Classification Datasets
+
+#### [IMDb](https://ai.stanford.edu/~amaas/data/sentiment/):  Binary Classification
 
 IMDb is a dataset containing 50K movie reviews labeled as positive or negative.  The corpus is split evenly between training and validation.
 The dataset is in the form of folders of text files.
@@ -16,7 +26,7 @@ The dataset is in the form of folders of images.
 - [IMDb-BERT.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  BERT text classification to predict sentiment of movie reviews.
 
 
-### [Chinese Sentiment Analysis](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000):  Binary Classification
+#### [Chinese Sentiment Analysis](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000):  Binary Classification
 
 This dataset consists of roughly 6000 hotel reviews in Chinese.  The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using *ktrain* with non-English text.
 
@@ -27,7 +37,7 @@ This dataset consists of roughly 6000 hotel reviews in Chinese.  The objective i
 - [ChineseHotelReviews-BERT.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  BERT text classification to predict sentiment of Chinese-language hotel reviews.
 
 
-### [Arabic Sentiment Analysis](https://github.com/elnagara/HARD-Arabic-Dataset):  Binary Classification
+#### [Arabic Sentiment Analysis](https://github.com/elnagara/HARD-Arabic-Dataset):  Binary Classification
 
 This dataset contains hotel reviews in Arabic.  The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using *ktrain* with non-English text.
 
@@ -36,7 +46,7 @@ This dataset consists contains hotel reviews in Arabic.  The objective is to pre
 - [ArabicHotelReviews-BERT.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  BERT text classification to predict sentiment of Arabic-language hotel reviews.
 
 
-### [20 News Groups](http://qwone.com/~jason/20Newsgroups/): Multiclass Classification
+#### [20 News Groups](http://qwone.com/~jason/20Newsgroups/): Multiclass Classification
 This is a small sample of the 20newsgroups dataset based on considering 4 newsgroups similar to what was done in the
 [Working with Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) scikit-learn tutorial. 
 Data are in the form of arrays fetched via scikit-learn library.
@@ -45,60 +55,81 @@ These examples show the results of training on a relatively small training set.
 - [20newsgroups-BERT.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  BERT text classification in a multiclass setting.
 
 
-### [Toxic Comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge):  Multi-Label Text Classification
+#### [Toxic Comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge):  Multi-Label Text Classification
 In multi-label classification, a single document can belong to multiple classes.  The objective here is
 to categorize each text comment into one or more categories of toxic online behavior.
 Dataset is in the form of a CSV file.
 - [toxic_comments-fasttext.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  A fasttext-like model applied in a multi-label setting.
 - [toxic_comments-bigru.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  A bidirectional GRU using pretrained GloVe vectors. This example shows how to use pretrained word vectors with *ktrain*.
 
 
-## Text Sequence Tagging Datasets
-### [CoNLL2003 NER Task](https://github.com/amaiya/ktrain/tree/master/ktrain/tests/conll2003):  Named Entity Recognition
+### <a name="seqlab"></a> Sequence Labeling Datasets
+#### [CoNLL2003 NER Task](https://github.com/amaiya/ktrain/tree/master/ktrain/tests/conll2003):  Named Entity Recognition
 The objective of the CoNLL2003 task is to classify sequences of words as belonging to one of several categories of concepts such as Persons or Locations. See the [original paper](https://www.aclweb.org/anthology/W03-0419) for more information on the format.
 
 - [CoNLL2003-BiLSTM_CRF.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  A simple and fast Bidirectional LSTM-CRF model with randomly initialized word embeddings.
 
 
-## Image Classification Datasets
+### <a name="lda"></a> Topic Modeling
+
+#### [20 News Groups](http://qwone.com/~jason/20Newsgroups/): unsupervised learning on 20newsgroups corpus
+- [20newsgroups-topic_modeling.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  Discover latent topics and themes in the 20 newsgroups corpus 
+
+
+### <a name="docsim"></a> One-Class Text Classification
+
+#### [20 News Groups](http://qwone.com/~jason/20Newsgroups/): select a set of positive examples from 20newsgroups dataset
+- [20newsgroups-document_similarity_scorer.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  Given a selected seed set of documents from the 20newsgroups corpus, find and score new documents that are semantically similar to them.
+
+
+### <a name="docrec"></a> Text Recommender System
+
+#### [20 News Groups](http://qwone.com/~jason/20Newsgroups/): recommend posts from 20newsgroups
+- [20newsgroups-recommendation_engine.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  given text from a sample document, recommend documents that are semantically similar to it from a larger corpus
+
+
+## Vision Data
+
+### <a name="imageclass"></a> Image Classification Datasets
 
-### [Dogs vs. Cats](https://www.kaggle.com/c/dogs-vs-cats):  Binary Classification
+#### [Dogs vs. Cats](https://www.kaggle.com/c/dogs-vs-cats):  Binary Classification
 - [dogs_vs_cats-ResNet50.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/vision):  ResNet50 pretrained on ImageNet.  
 
-### [MNIST](http://yann.lecun.com/exdb/mnist/):  Multiclass Classification
+#### [MNIST](http://yann.lecun.com/exdb/mnist/):  Multiclass Classification
 - [mnist-WRN22.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/vision):  A randomly-initialized Wide Residual Network applied to MNIST
 
-### [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html):  Multiclass Classification
+#### [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html):  Multiclass Classification
 - [cifar10-WRN22.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/vision):  A randomly-initialized Wide Residual Network applied to CIFAR10
 
 
-### [Pets](https://www.robots.ox.ac.uk/~vgg/data/pets/): Multiclass Classification
+#### [Pets](https://www.robots.ox.ac.uk/~vgg/data/pets/): Multiclass Classification
 - [pets-ResNet50.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/vision):  Categorizing dogs and cats by breed using a pretrained ResNet50. Uses the `images_from_fname` function, as class labels are embedded in the file names of images.
 
 
-### [Planet](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space): Multilabel Classification
+#### [Planet](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space): Multilabel Classification
 The Kaggle Planet dataset consists of satellite images, each of which is assigned to multiple categories.
 Image labels are in the form of a CSV containing paths to images.
 - [planet-ResNet50.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/vision):  Using a pretrained ResNet50 model for multi-label classification.
 
 
+## Graph Data
 
-## Graph Datasets
+### <a name="nodeclass"></a> Graph Node Classification Datasets
 
-### [PubMed-Diabetes](https://linqs-data.soe.ucsc.edu/public/Pubmed-Diabetes.tgz):  Node Classification
+#### [PubMed-Diabetes](https://linqs-data.soe.ucsc.edu/public/Pubmed-Diabetes.tgz):  Node Classification
 
 In the PubMed graph, each node represents a paper pertaining to one of three topics:  *Diabetes Mellitus - Experimental*, *Diabetes Mellitus - Type 1*, and *Diabetes Mellitus - Type 2*.  Links represent citations between papers.  The attributes or features assigned to each node are in the form of a vector of words in each paper and their corresponding TF-IDF scores.
 
 - [pubmed-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model for transductive and inductive inference.
 
-### [Cora Citation Graph](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz):  Node Classification
+#### [Cora Citation Graph](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz):  Node Classification
 
 In the Cora citation graph, each node represents a paper pertaining to one of several topic categories.  Links represent citations between papers.  The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.
 
 - [cora-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model for transductive inference on validation and test set of nodes in graph.
 
 
-### [Hateful Twitter Users](https://www.kaggle.com/manoelribeiro/hateful-users-on-twitter):  Node Classification
+#### [Hateful Twitter Users](https://www.kaggle.com/manoelribeiro/hateful-users-on-twitter):  Node Classification
 Dataset of Twitter users and their attributes.  A small portion of the user accounts are annotated as `hateful` or `normal`.  The goal is to predict hateful accounts based on user features and graph structure.
 
 - [hateful_twitter_users-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model to predict hateful Twitter users using transductive inference.
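Review note: examples/README.md now relies on in-page links such as `(#lda)` resolving to `<a name="..."></a>` targets. A small check like the one below can flag links with no matching target and target names defined more than once. This is a sketch that assumes the README's own link and anchor patterns; the helper name `check_anchors` and the sample string are made up for illustration and are not the file's actual content.

```python
import re


def check_anchors(markdown):
    """Return (missing, duplicated) anchor names for in-page links.

    missing:    names linked as [text](#name) with no <a name="name"></a>
    duplicated: names defined by more than one <a name="..."></a> tag
    """
    linked = set(re.findall(r"\]\(#([^)]+)\)", markdown))
    defined = re.findall(r'<a name="([^"]+)"></a>', markdown)
    missing = sorted(linked - set(defined))
    duplicated = sorted({name for name in defined if defined.count(name) > 1})
    return missing, duplicated


# Hypothetical fragment with three typical mistakes: a link with no
# target, a target defined twice, and a target name that includes '#'.
sample = """
- [topic modeling](#lda)
- [document similarity](#docsim)
- [node classification](#nodeclass)

### <a name="docsim"></a> First Section
### <a name="docsim"></a> Second Section
### <a name="#nodeclass"></a> Third Section
"""

missing, duplicated = check_anchors(sample)
```

Here `missing` lists `lda` and `nodeclass` (a target named `#nodeclass` does not match a link to `#nodeclass`), and `duplicated` lists `docsim`.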