Merge branch 'develop'

amaiya · Jun 25, 2020 · 2cdc795
2 parents 3228f33 + b0a41b6

Showing 18 changed files with 639 additions and 45 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,26 @@ Most recent releases are shown at the top. Each release shows:
 - **Fixed**: Bug fixes that don't change documented behaviour
 
 
+## 0.17.0 (2020-06-24)
+
+### New:
+- support for language translation using pretrained `MarianMT` models
+- added `core.evaluate` as alias to `core.validate`
+- `Learner.estimate_lr` method will return numerical estimates of learning rate using two different methods.
+   Should only be called **after** running `Learner.lr_find`.
+
+### Changed
+- `text.zsl.ZeroShotClassifier` changed to use `AutoModel*` and `AutoTokenizer` in order to load any `mnli` model
+- remove external modules from `ktrain.__init__.py` so that they do not appear when pressing TAB in notebook
+- added `Transformer.save_tokenizer` and `Transformer.get_tokenizer` methods to facilitate training on machines
+  with no internet
+
+### Fixed:
+- explicitly call `plt.show()` in `LRFinder.plot_loss` to resolve issues with plot not displaying in certain cases (PR #170)
+- suppress warning about text regression when making text regression predictions
+- allow `xnli` models for `zsl` module
+
+
 ## 0.16.3 (2020-06-10)
 
 ### New:

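The changelog entry for `Learner.estimate_lr` does not say which two methods it uses. As a rough illustration only, the sketch below applies two heuristics commonly used with learning-rate range tests: the rate at which the loss falls fastest, and one tenth of the rate at which the loss is lowest. The function name `sketch_lr_estimates` and the sample numbers are hypothetical and are not part of *ktrain*.

```python
import math


def sketch_lr_estimates(lrs, losses):
    """Return two learning-rate estimates from a learning-rate sweep.

    `lrs` and `losses` are equal-length sequences recorded during the sweep.
    This is an illustrative stand-in, not ktrain's implementation.
    """
    if len(lrs) != len(losses) or len(lrs) < 2:
        raise ValueError("need at least two (lr, loss) pairs of equal length")

    # Heuristic 1: the learning rate where the loss drops fastest,
    # measured as the slope of loss against log10(lr).
    steepest_idx, steepest_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < steepest_slope:
            steepest_idx, steepest_slope = i, slope
    lr_steepest = lrs[steepest_idx] if steepest_idx is not None else None

    # Heuristic 2: one tenth of the learning rate with the lowest loss.
    min_idx = min(range(len(losses)), key=losses.__getitem__)
    lr_min_over_10 = lrs[min_idx] / 10

    return lr_steepest, lr_min_over_10


# Made-up sweep values, for illustration only
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [2.30, 2.25, 1.60, 0.90, 3.50]
print(sketch_lr_estimates(lrs, losses))
```

Either heuristic only makes sense once a sweep has been recorded, which matches the changelog's note that the method should be called after `Learner.lr_find`.
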
diff --git a/FAQ.md b/FAQ.md
@@ -4,6 +4,8 @@
 
 - [How do I obtain the word or sentence embeddings after fine-tuning a Transformer-based text classifier?](#how-do-i-obtain-the-word-or-sentence-embeddings-after-fine-tuning-a-transformer-based-text-classifier)
 
+- [How do I use ktrain without an internet connection?](#how-do-i-use-ktrain-without-an-internet-connection)
+
 - [How do I train using multiple GPUs?](#how-do-i-train-using-multiple-gpus)
 
 - [How do I train a model using mixed precision?](#how-do-i-train-a-model-using-mixed-precision)
@@ -24,6 +26,9 @@
 
 - [Running `predictor.explain` for text classification is slow.  How can I speed it up?](#running-predictorexplain-for-text-classification-is-slow--how-can-i-speed-it-up)
 
+- [Why does `texts_from_csv` throw an error on Google Cloud Storage?](#why-does-texts_from_csv-throw-an-error-on-google-cloud-storage)
+
+
 - [What kinds of applications have been built with *ktrain*?](#what-kinds-of-applications-have-been-built-with-ktrain)
 
 
@@ -110,6 +115,39 @@ See also [this post](https://github.com/huggingface/transformers/issues/1950) on
 [[Back to Top](#frequently-asked-questions-about-ktrain)]
 
 
+### How do I use ktrain without an internet connection?
+
+When using pretrained models or pretrained word embeddings in *ktrain*, files are automatically downloaded.  For instance,
+pretrained models and tokenizers from the `transformers` library are downloaded to `<home_directory>/.cache/torch/transformers`
+by default.  Other data like pretrained word vectors are downloaded to the `<home_directory>/ktrain_data` folder.
+
+In some settings, it is necessary to either train models or make predictions in environments with no internet 
+access (e.g., behind a firewall, air-gapped networks).  Typically, it is sufficient to copy the above folders
+to the machine without internet access. 
+
+However, due to a current bug in the `transformers` library, files from `<home_directory>/.cache/torch/transformers` are
+not loaded when there is no internet access.  To get around this, you can download the model files from [here](https://huggingface.co/models) and point
+*ktrain* to the folder.  There are typically three files you need, and it is important that the downloaded files are renamed
+to `tf_model.h5`, `config.json`, and `vocab.txt`.
+
+Here is an example of how to run `SimpleQA` for [open-domain question-answering](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb) without internet access:
+
+1. On a machine with public internet access, go to the Hugging Face model repository: [https://huggingface.co/models](https://huggingface.co/models)
+2. Select the model you want and click "List all files in model".  For `SimpleQA`, you will need `bert-large-uncased-whole-word-masking-finetuned-squad` and `bert-base-uncased`
+3. Download the `tf_model.h5`, `config.json`, and `vocab.txt` files into a folder.  It is important that these downloaded files are renamed specifically to the three aforementioned file names.
+4. Copy these folders to the machine without public internet access
+5. When invoking `SimpleQA`, provide these folders containing the downloaded files as arguments to the `bert_squad_model` and `bert_emb_model` parameters:
+```python
+qa = text.SimpleQA(INDEXDIR,
+                    bert_squad_model='/path/to/bert/squad/model/folder',
+                    bert_emb_model='/path/to/bert-base-uncased/folder')
+```
+
+You can use similar steps for other models that use the `transformers` library, such as text classification with the `ktrain.text.Transformer` class.
+
+
+
+[[Back to Top](#frequently-asked-questions-about-ktrain)]
 
 ### How do I train using multiple GPUs?
 
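Because the offline FAQ entry above depends on the three files carrying exact names, it may help to check each copied folder before pointing *ktrain* at it. The helper below is a hypothetical, standard-library-only sketch. `missing_model_files` is not a *ktrain* function, and the folder paths are the placeholders from the FAQ example.

```python
from pathlib import Path

# File names the FAQ entry says the downloaded files must be renamed to.
REQUIRED_FILES = ("tf_model.h5", "config.json", "vocab.txt")


def missing_model_files(folder):
    """Return the required file names that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # Placeholder paths from the FAQ example; replace with real folders.
    for folder in ("/path/to/bert/squad/model/folder",
                   "/path/to/bert-base-uncased/folder"):
        missing = missing_model_files(folder)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{folder}: {status}")
```

Running this on the offline machine before calling `SimpleQA` makes a misnamed or forgotten file show up as a clear message rather than a failed download attempt.
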
@@ -212,7 +250,10 @@ In this toy example, we are supplying the text data to classify in the URL as a
 [[Back to Top](#frequently-asked-questions-about-ktrain)]
 
 ### How do I use custom metrics with *ktrain*?
-You can use custom callbacks:
+
+The `Transformer.get_classifier`, `text.text_classifier`, and `vision.image_classifier` methods/functions all accept a `metrics` argument.
+
+You can also use custom Keras callbacks:
 
 ```python
 # define a custom callback for ROC-AUC
@@ -287,7 +328,7 @@ See [this tutorial](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/maste
 
 ### Why am I seeing an ERROR when installing *ktrain* on Google Colab?
 
-These errors (e.g., `has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible`) are related to TensorFlow and can be usually be safely ignored and shouldn't affect operation of *ktrain*.
+These errors (e.g., `has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible`) are related to TensorFlow and can usually be safely ignored and shouldn't affect operation of *ktrain*.
 
 [[Back to Top](#frequently-asked-questions-about-ktrain)]
 
@@ -300,6 +341,20 @@ If you pass `n_samples=500` to `explain`, results are returned in ~5 seconds on
 smaller sample sizes (e.g., 500, 1000) may be sufficient for your use case.
 
 
+[[Back to Top](#frequently-asked-questions-about-ktrain)]
+
+
+### Why does `texts_from_csv` throw an error on Google Cloud Storage?
+
+The error is probably happening because *ktrain* tries to auto-detect the character encoding using `open(train_filepath, 'rb')` which may be problematic with Google Cloud Storage. 
+One solution is to explicitly provide the `encoding` to `texts_from_csv` as an argument so this step is skipped (default is *None*, which activates auto-detect).
+
+Alternatively, you can read the data in yourself as a *pandas* DataFrame using one of [these methods](https://stackoverflow.com/a/50201179/13550699). For instance, *pandas* evidently supports GCS, so you can simply do this: `df = pd.read_csv('gs://bucket/your_path.csv')`
+
+
+Then, using *ktrain*, you can use `ktrain.text.texts_from_df` (or `ktrain.text.texts_from_array`) to load and preprocess your data.
+
+
 [[Back to Top](#frequently-asked-questions-about-ktrain)]
 
 ### What kinds of applications have been built with *ktrain*?
@@ -309,6 +364,7 @@ Examples include:
 - **medical informatics:**  analyzing doctors' written analyses of patients and medical imagery
 - **finance:**  financial crime analytics, mining stock-related news stories
 - **insurance:** detecting fraud in insurance claims
+- **customer relationship management (CRM):** making sense of feedback from customers and/or patients
 - **social science:** making sense of text-based responses in surveys and emotion-classification from text data
 - **linguistics:** detecting sarcasm in the news
 - **education:** analysis of attitudes towards educational institutions in social media

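For the `texts_from_csv` entry in the FAQ diff above, the underlying idea is that no auto-detection step is needed when the encoding is already known. The sketch below illustrates that with the standard library only: it decodes CSV bytes using an explicitly supplied encoding and splits out a text column and a label column. The function name, column names, and sample rows are all hypothetical. How the bytes are fetched from a bucket is left out.

```python
import csv
import io


def texts_and_labels_from_csv_bytes(raw, text_column, label_column, encoding="utf-8"):
    """Decode CSV bytes with a known encoding and split out two columns."""
    reader = csv.DictReader(io.StringIO(raw.decode(encoding), newline=""))
    texts, labels = [], []
    for row in reader:
        texts.append(row[text_column])
        labels.append(row[label_column])
    return texts, labels


# Made-up sample data encoded as latin-1, standing in for a downloaded file
raw = "text,label\n¡Excelente película!,pos\nMuy aburrida,neg\n".encode("latin-1")
texts, labels = texts_and_labels_from_csv_bytes(raw, "text", "label", encoding="latin-1")
print(texts, labels)
```

Lists like these are the kind of in-memory input the FAQ suggests handing to `ktrain.text.texts_from_array`. Check that function's documentation for the exact arguments it expects.
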
diff --git a/README.md b/README.md
@@ -7,6 +7,20 @@
 
 
 ### News and Announcements
+- **2020-06-26:**  
+  - ***ktrain*** **v0.17.x is released** and includes support for **language translation**. See the [example language translation notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/language_translation_example.ipynb) for more information.  <sub><sup>(This feature currently requires that PyTorch be installed.)</sup></sub>
+```python
+# Example: Translating Chinese to German
+
+# NOTE: Language Translation uses PyTorch instead of TensorFlow
+from ktrain import text 
+translator = text.Translator(model_name='Helsinki-NLP/opus-mt-ZH-de')
+src_text = '''大流行对世界经济造成了严重破坏。但是，截至2020年6月，美国股票市场持续上涨。'''
+print(translator.translate(src_text))
+# output:
+# Die Pandemie hat eine ernste Zerstörung der Weltwirtschaft verursacht.
+# Aber bis Juni 2020 stieg der US-Markt weiter an.
+```
 - **2020-06-03:**  
   - ***ktrain*** **v0.16.x is released** and includes support for **Zero-Shot Learning**, where documents can be classified into user-provided topics **without** any training examples. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/zero_shot_learning_with_nli.ipynb).  <sub><sup>(This feature currently requires that PyTorch be installed.)</sup></sub>
 ```python
@@ -24,19 +38,11 @@ zsl.predict(doc, topic_strings=topic_strings, include_labels=True)
 #  ('films', 0.0008969294722191989),
 #  ('television', 0.00045271270209923387)]
 ```
-- **2020-05-13:**  
-  - ***ktrain*** **v0.15.x is released** and includes support for:
-    - **image regression**:  See the [example notebook on age prediction from photos](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/vision/utk_faces_age_prediction-resnet50.ipynb).
-    - **sentence pair classification**:  See this [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/MRPC-BERT.ipynb) on using BERT for paraphrase detection.<sub><sup>(Sentence pair classification included in v0.15.1, but not v0.15.0.)</sup></sub>
-    - **tf.data.Datasets**:  See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/vision/mnist-tf_workflow.ipynb) on using `tf.data.Datasets` in *ktrain* for custom models and data formats.
-- **2020-04-15:**  
-  - ***ktrain*** **v0.14.x is released** and now includes support for **open-domain question-answering**.  See the [example QA notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)
-
 ----
 
 ### Overview
 
-*ktrain* is a lightweight wrapper for the deep learning library [TensorFlow Keras](https://www.tensorflow.org/guide/keras/overview) (and other libraries) to help build, train, and deploy neural networks and other machine learning models.  It is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows you to easily and quickly:
+*ktrain* is a lightweight wrapper for the deep learning library [TensorFlow Keras](https://www.tensorflow.org/guide/keras/overview) (and other libraries) to help build, train, and deploy neural networks and other machine learning models.  Inspired by ML framework extensions like *fastai* and *ludwig*, it is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows you to easily and quickly:
 
 - employ fast, accurate, and easy-to-use pre-canned models for  `text`, `vision`, and `graph` data:
   - `text` data:
@@ -51,6 +57,7 @@ zsl.predict(doc, topic_strings=topic_strings, include_labels=True)
      - **Text Summarization**:  summarize long documents with a pretrained BART model - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization_with_bart.ipynb)]</sup></sub>
      - **Open-Domain Question-Answering**:  ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
      - **Zero-Shot Learning**:  classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
+     - **Language Translation**:  translate text from one language to another <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/language_translation_example.ipynb)]</sup></sub>
   - `vision` data:
     - **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)) <sub><sup>[[example notebook](https://colab.research.google.com/drive/1WipQJUPL7zqyvLT10yekxf_HNMXDDtyR)]</sup></sub>
     - **image regression** for predicting numerical targets from photos (e.g., age prediction) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/vision/utk_faces_age_prediction-resnet50.ipynb)]</sup></sub>

diff --git a/examples/README.md b/examples/README.md
@@ -13,6 +13,7 @@ This directory contains various example notebooks using *ktrain*.  The directory
   - [Text Summarization](#bart):  an example of text summarization using a pretrained BART model
   - [Open-Domain Question-Answering](#textqa):  ask questions to a large text corpus and receive exact candidate answers
   - [Zero-Shot Learning](#zsl):  classify documents by user-supplied topics **without** any training examples
+  - [Language Translation](#translation): an example of language translation using pretrained MarianMT models
 - `vision`:  
   - [image classification](#imageclass):  image classification examples using various models and datasets
   - [image regression](#imageregression): example of predicting numerical values purely from images/photos
@@ -135,6 +136,7 @@ The objective of the CoNLL2003 task is to classify sequences of words as belongi
 ### <a name="bart"></a>Text Summarization with pretrained BART: [text_summarization_with_bart.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="textqa"></a>Open-Domain Question-Answering: [question_answering_with_bert.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="zsl"></a>Zero-Shot Learning: [zero_shot_learning_with_nli.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
+### <a name="translation"></a>Language Translation: [language_translation_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 
 
 ## Vision Data