Merge branch 'develop'

amaiya · May 14, 2020 · 0130a22
2 parents 7fffa14 + 02c3455
Showing 9 changed files with 522 additions and 22 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,17 @@ Most recent releases are shown at the top. Each release shows:
 - **Fixed**: Bug fixes that don't change documented behaviour
 
 
+## 0.15.1 (2020-05-14)
+
+### New:
+- N/A
+
+### Changed
+- Changed `Transformer.preprocess*` methods to accept sentence pairs for sentence pair classification
+
+### Fixed:
+- N/A
+
 ## 0.15.0 (2020-05-13)
 
 ### New:
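
The 0.15.1 entry above says the `Transformer.preprocess*` methods now accept sentence pairs. A minimal sketch of preparing such input follows, assuming pairs are passed as a list of two-element tuples (the changelog does not spell out the format). The helper `to_sentence_pairs`, the column names `sentence1`/`sentence2`/`label`, the model name, and the class names are all hypothetical, and the ktrain calls are left as comments because they need ktrain and a pretrained model to run.

```python
from typing import Iterable, List, Mapping, Tuple


def to_sentence_pairs(
    rows: Iterable[Mapping[str, object]],
    first_key: str = "sentence1",   # hypothetical column name
    second_key: str = "sentence2",  # hypothetical column name
) -> List[Tuple[str, str]]:
    """Turn row dictionaries into a list of (sentence_a, sentence_b) tuples."""
    pairs = []
    for row in rows:
        # Coerce to str and trim whitespace so every pair is two clean strings.
        pairs.append((str(row[first_key]).strip(), str(row[second_key]).strip()))
    return pairs


# Tiny made-up sample; label 1 means "paraphrase", 0 means "not a paraphrase".
rows = [
    {"sentence1": "He ate lunch. ", "sentence2": "He had a meal.", "label": 1},
    {"sentence1": "It is raining.", "sentence2": "The car is red.", "label": 0},
]
x_train = to_sentence_pairs(rows)
y_train = [row["label"] for row in rows]

# Assumed hand-off to ktrain (not executed here; requires ktrain >= 0.15.1):
#   from ktrain import text
#   t = text.Transformer("bert-base-uncased", maxlen=128,
#                        class_names=["not_paraphrase", "paraphrase"])
#   trn = t.preprocess_train(x_train, y_train)
```

Keeping each pair as a single tuple keeps the two sentences aligned with their label when the data is shuffled or split.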

diff --git a/README.md b/README.md
@@ -10,7 +10,8 @@
 - **2020-05-13:**  
   - ***ktrain*** **v0.15.x is released** and includes support for:
     - **image regression**:  See the [example notebook on age prediction from photos](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/vision/utk_faces_age_prediction-resnet50.ipynb).
-    - **`tf.data.Datasets`**:  See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/vision/mnist-tf_workflow.ipynb) on using `tf.data.Datasets` in *ktrain* for custom models and data formats.
+    - **tf.data.Datasets**:  See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/vision/mnist-tf_workflow.ipynb) on using `tf.data.Datasets` in *ktrain* for custom models and data formats.
+    - **sentence pair classification**:  See this [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/MRPC-BERT.ipynb) on using BERT for paraphrase detection.<sub><sup>(Sentence pair classification is included in v0.15.1, but not v0.15.0.)</sup></sub>
 - **2020-04-15:**  
   - ***ktrain*** **v0.14.x is released** and now includes support for **open-domain question-answering**.  See the [example QA notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)
 - **2020-04-09:**  
@@ -36,6 +37,7 @@ ts.summarize(some_long_document)
      - **Text Regression**: [BERT](https://arxiv.org/abs/1810.04805), [DistilBERT](https://arxiv.org/abs/1910.01108), Embedding-based linear text regression, [fastText](https://arxiv.org/abs/1607.01759), and other models <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_regression_example.ipynb)]</sup></sub>
      - **Sequence Labeling (NER)**:  Bidirectional LSTM with optional [CRF layer](https://arxiv.org/abs/1603.01360) and various embedding schemes such as pretrained [BERT](https://huggingface.co/transformers/pretrained_models.html) and [fasttext](https://fasttext.cc/docs/en/crawl-vectors.html) word embeddings and character embeddings <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/CoNLL2002_Dutch-BiLSTM.ipynb)]</sup></sub>
      - **Ready-to-Use NER models for English, Chinese, and Russian** with no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/shallownlp-examples.ipynb)]</sup></sub>
+     - **Sentence Pair Classification**  for tasks like paraphrase detection <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/MRPC-BERT.ipynb)]</sup></sub>
      - **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
      - **Document Similarity with One-Class Learning**:  given some documents of interest, find and score new documents that are semantically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
      - **Document Recommendation Engine**:  given text from a sample document, recommend documents that are thematically-related to it from a larger corpus  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
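
For readers new to sentence pair classification with BERT, the sketch below illustrates how two sentences are commonly packed into one input: `[CLS] A [SEP] B [SEP]`, with segment ids 0 for the first sentence and 1 for the second. It is a toy illustration only, not ktrain's implementation: the function `pack_pair` is hypothetical, whitespace splitting stands in for a real WordPiece tokenizer, and trimming the longer sentence first is just one possible truncation strategy.

```python
from typing import List, Tuple


def pack_pair(sent_a: str, sent_b: str, max_len: int = 16) -> Tuple[List[str], List[int]]:
    """Pack two sentences into one BERT-style token sequence with segment ids."""
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP] tokens")
    # Toy tokenizer: lowercase and split on whitespace (real models use WordPiece).
    a, b = sent_a.lower().split(), sent_b.lower().split()
    # Reserve three slots for the special tokens, then drop tokens from the
    # end of the longer sentence until the pair fits.
    budget = max_len - 3
    while len(a) + len(b) > budget:
        (a if len(a) >= len(b) else b).pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair("He ate lunch", "He had a meal")
```

The segment ids are what let the model tell the two sentences apart, which is why a pair has to be passed as a pair rather than as one concatenated string.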

diff --git a/examples/README.md b/examples/README.md
@@ -5,6 +5,7 @@ This directory contains various example notebooks using *ktrain*.  The directory
   - [text classification](#textclass): examples using various text classification models and datasets
   - [text regression](#textregression): example for predicting continuous value purely from text
   - [text sequence labeling](#seqlab):  sequence tagging models
+  - [sentence pair classification](#sentpair):  classifying pairs of sentences for tasks such as paraphrase or sarcasm detection
   - [topic modeling](#lda):  unsupervised learning from unlabeled text data
   - [document similarity with one-class learning](#docsim): given a sample of interesting documents, find and score new documents that are semantically similar to them using One-Class text classification
   - [document recommender system](#docrec):  given text from a sample document, recommend documents that are semantically similar to it from a larger corpus 
@@ -93,6 +94,13 @@ The objective of the CoNLL2003 task is to classify sequences of words as belongi
 - [CoNLL2002_Dutch-BiLSTM.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  A Bidirectional LSTM model that uses pretrained BERT embeddings along with pretrained fasttext word embeddings - both for Dutch.
 
 
+### <a name="sentpair"></a> Sentence Pair Classification
+
+#### [Microsoft Research Paraphrase Corpus (MRPC)](https://www.microsoft.com/en-us/download/details.aspx?id=52398):  Paraphrase Detection
+
+- [MRPC-BERT.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  Using BERT for sentence pair classification on the MRPC dataset
+
+
 ### <a name="lda"></a> Topic Modeling
 
 #### [20 News Groups](http://qwone.com/~jason/20Newsgroups/): unsupervised learning on 20newsgroups corpus