Skip to content

Commit

Permalink
Merge branch 'develop'
Browse files Browse the repository at this point in the history
  • Loading branch information
amaiya committed Apr 9, 2020
2 parents 138100a + a1d1ec8 commit 1e03338
Show file tree
Hide file tree
Showing 36 changed files with 1,389 additions and 293 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,24 @@ Most recent releases are shown at the top. Each release shows:
- **Fixed**: Bug fixes that don't change documented behaviour


## 0.13.0 (2020-04-08)

### New:
- support for link prediction with graph neural networks
- `bigru` method now selects pretrained word vectors based on detected language

### Changed
- instead of throwing error, default to English if `detect_lang` could not detect language from batch of texts
- `layers` argument moved to `TransformerEmbedding` constructor
- enforce specific version of TensorFlow due to undocumented breaking changes in newer TF versions
- `AdamWeightDecay` optimizer is now used to support global weight decay. Used when user
excplictly sets a weight decay


### Fixed:
- force re-instantiation of `TransformerEmbedding` object with `sequence_tagger` function is re-invoked


## 0.12.3 (2020-04-02)

### New:
Expand Down
26 changes: 7 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@


### News and Announcements
- **2020-04-08:**
- ***ktrain*** **v0.13.x is released** and includes support for **link prediction** using graph neural networks. See [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/graphs/cora_link_prediction-GraphSAGE.ipynb) on citation prediction.
- **2020-03-31:**
- ***ktrain*** **v0.12.x is released** and now includes BERT embeddings (i.e., BERT, DistilBert, and Albert) that can be used for downstream tasks like building sequence-taggers (i.e., NER)
for any language such as English, Chinese, Russian, Arabic, Dutch, etc. See [this English NER example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/CoNLL2003-BiLSTM.ipynb) or the [Dutch NER notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb) for examples on how to use this feature.
Expand All @@ -24,21 +26,6 @@ learner.fit(0.01, 1, cycle_len=5)
```
- **2020-03-18:**
- ***ktrain*** **v0.11.x is released** and includes various fixes and enhancements to sequence-tagging including abilty to easily use non-English pretrained word embeddings covering 157 languages (e.g., [Dutch NER](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb))
- **2020-03-03:**
- ***ktrain*** **v0.10.x is released** and now includes [ready-to-use NER for English, Chinese, and Russian](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/shallownlp-examples.ipynb) with no training required.
- **Also in v0.10.x:** Ability to train [community-uploaded Hugging Face transformer models](https://huggingface.co/models) like [SciBERT](https://arxiv.org/abs/1903.10676) and [BioBERT](https://arxiv.org/abs/1901.08746):
```python
# text classification with SciBERT
import ktrain
from ktrain import text
MODEL_NAME = 'allenai/scibert_scivocab_uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=label_list)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(3e-5, 1)
```
----

### Overview
Expand All @@ -52,16 +39,17 @@ learner.fit_onecycle(3e-5, 1)
- **Text Classification**: [BERT](https://arxiv.org/abs/1810.04805), [DistilBERT](https://arxiv.org/abs/1910.01108), [NBSVM](https://www.aclweb.org/anthology/P12-2018), [fastText](https://arxiv.org/abs/1607.01759), and other models <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/IMDb-BERT.ipynb)]</sup></sub>
- **Text Regression**: [BERT](https://arxiv.org/abs/1810.04805), [DistilBERT](https://arxiv.org/abs/1910.01108), Embedding-based linear text regression, [fastText](https://arxiv.org/abs/1607.01759), and other models <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_regression_example.ipynb)]</sup></sub>
- **Sequence Labeling (NER)**: Bidirectional LSTM with optional [CRF layer](https://arxiv.org/abs/1603.01360) and various embedding schemes such as pretrained [BERT](https://huggingface.co/transformers/pretrained_models.html) and [fasttext](https://fasttext.cc/docs/en/crawl-vectors.html) word embeddings and character embeddings <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb)]</sup></sub>
- **Ready-to-Use NER models for English, Chinese, and Russian** with no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/shallownlp-examples.ipynb)]</sup></sub>
- **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
- **Document Similarity with One-Class Learning**: given some documents of interest, find and score new documents that are semantically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
- **Document Recommendation Engine**: given text from a sample document, recommend documents that are semantically-related to it from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- `vision` data:
- **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/vision/dogs_vs_cats-ResNet50.ipynb)]</sup></sub>
- `graph` data:
- **graph node classification** with graph neural networks (e.g., [GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/graphs/pubmed-GraphSAGE.ipynb)]</sup></sub>
- perform multilingual text classification (e.g., [Chinese Sentiment Analysis with BERT](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ChineseHotelReviews-BERT.ipynb), [Arabic Sentiment Analysis with NBSVM](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-nbsvm.ipynb))
- Easily train NER models for any language (e.g., [Dutch NER](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb) )
- [Ready-to-Use NER for English, Chinese, and Russian](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/shallownlp-examples.ipynb) (no training required)
- **node classification** with graph neural networks ([GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/graphs/pubmed_node_classification-GraphSAGE.ipynb)]</sup></sub>
- **link prediction** with graph neural networks ([GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/graphs/cora_link_prediction-GraphSAGE.ipynb)]</sup></sub>
- build text classifiers for any language (e.g., [Chinese Sentiment Analysis with BERT](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ChineseHotelReviews-BERT.ipynb), [Arabic Sentiment Analysis with NBSVM](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-nbsvm.ipynb))
- easily train NER models for any language (e.g., [Dutch NER](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb) )
- load and preprocess text and image data from a variety of formats
- inspect data points that were misclassified and [provide explanations](https://eli5.readthedocs.io/en/latest/) to help improve your model
- leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data
Expand Down
17 changes: 15 additions & 2 deletions examples/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ This directory contains various example notebooks using *ktrain*. The directory
- [image classification](#imageclass): models for image datasetsimage classification examples using various models and datasets
- `graphs`:
- [node classification](#-graph-node-classification-datasets): node classification in graphs or networks
- [link prediction](#-graph-link-prediction-datasets): link prediction in graphs or networks


## Text Data
Expand Down Expand Up @@ -152,17 +153,29 @@ Image labels are in the form of a CSV containing paths to images.

In the PubMed graph, each node represents a paper pertaining to one of three topics: *Diabetes Mellitus - Experimental*, *Diabetes Mellitus - Type 1*, and *Diabetes Mellitus - Type 2*. Links represent citations between papers. The attributes or features assigned to each node are in the form of a vector of words in each paper and their corresponding TF-IDF scores.

- [pubmed-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model for transductive and inductive inference.
- [pubmed_node_classification-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model for transductive and inductive inference.

#### [Cora Citation Graph](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz): Node Classification

In the Cora citation graph, each node represents a paper pertaining to one of several topic categories. Links represent citations between papers. The attributes or features assigned to each node is in the form of a multi-hot-encoded vector of words in each paper.

- [cora-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model for transductive inference on validation and test set of nodes in graph.
- [cora_node_classification-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model for transductive inference on validation and test set of nodes in graph.


#### [Hateful Twitter Users](https://www.kaggle.com/manoelribeiro/hateful-users-on-twitter): Node Classification
Dataset of Twitter users and their attributes. A small portion of the user accounts are annotated as `hateful` or `normal`. The goal is to predict hateful accounts based on user features and graph structure.

- [hateful_twitter_users-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model to predict hateful Twitter users using transductive inference.


### <a name="#linkpred"></a> Graph Link Prediction Datasets


#### [Cora Citation Graph](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz): Node Classification

In the Cora citation graph, each node represents a paper. Links represent citations between papers. The attributes or features assigned to each node is in the form of a multi-hot-encoded vector of words in each paper.

- [cora_link_prediction-GraphSAGE.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/graphs): GraphSAGE model to predict missing links in the citation network.



430 changes: 430 additions & 0 deletions examples/graphs/cora_link_prediction-GraphSAGE.ipynb

Large diffs are not rendered by default.

File renamed without changes.
File renamed without changes.
4 changes: 3 additions & 1 deletion ktrain/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
from .vision.learner import ImageClassLearner
from .text.learner import BERTTextClassLearner, TransformerTextClassLearner
from .text.ner.learner import NERLearner
from .graph.learner import NodeClassLearner
from .graph.learner import NodeClassLearner, LinkPredLearner
from .data import Dataset

from . import utils as U
Expand Down Expand Up @@ -99,6 +99,8 @@ def get_learner(model, train_data=None, val_data=None,
learner = ImageClassLearner
elif U.is_nodeclass(data=train_data):
learner = NodeClassLearner
elif U.is_nodeclass(data=train_data):
learner = LinkPredLearner
elif U.is_huggingface(data=train_data):
learner = TransformerTextClassLearner
else:
Expand Down

0 comments on commit 1e03338

Please sign in to comment.