Merge branch 'develop'

amaiya · Aug 16, 2020 · 6d2145e · 6d2145e
2 parents dba89d8 + 7e7e2ca
Showing 14 changed files with 138 additions and 28 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,23 @@ Most recent releases are shown at the top. Each release shows:
 - **Changed**: Additional parameters, changes to inputs or outputs, etc
 - **Fixed**: Bug fixes that don't change documented behaviour
 
+## 0.19.7 (2020-08-16)
+
+### New:
+- N/A
+
+### Changed
+- added `class_weight` parameter to `lr_find` for imbalanced datasets
+- removed pins for `cchardet` and `scikit-learn` from `setup.py`
+- added version check for `eli5` fork
+- removed `scipy` pin from `setup.py`
+- Allow TensorFlow 2.3 for Python 3.8
+- Request manual installation of `shap` in `TabularPredictor.explain` instead of inclusion in `setup.py`
+
+### Fixed:
+- N/A
+
+
 ## 0.19.6 (2020-08-12)
 
 ### New:

diff --git a/FAQ.md b/FAQ.md
@@ -7,6 +7,8 @@
 - [What kinds of applications have been built with *ktrain*?](#what-kinds-of-applications-have-been-built-with-ktrain)
 
 ## Installation/Deployment Issues
+- [How do I install ktrain on a Windows machine?](#how-do-i-install-ktrain-on-a-windows-machine)
+
 - [How do I use ktrain without an internet connection?](#how-do-i-use-ktrain-without-an-internet-connection)
 
 - [Why am I seeing an ERROR when installing *ktrain* on Google Colab?](#why-am-i-seeing-an-error-when-installing-ktrain-on-google-colab)
@@ -84,6 +86,8 @@ Here is how you can quickly get started using *ktrain*:
 4. Make sure the notebook is set up to use a GPU: `Runtime --> Change runtime type` and select `GPU` in the menu.
 5. Click on each cell in the notebook and execute it by pressing `SHIFT` and `ENTER` at the same time. The notebook shows you how to build a neural network that recognizes cats vs. dogs in photos.
 
+If you're on a Windows laptop, you can also try out *ktrain* locally (instead of using Google Colab) by [following these instructions](#how-do-i-install-ktrain-on-a-windows-machine).
+
 Next, you can go through [the tutorials](https://github.com/amaiya/ktrain#tutorials) to learn more.  If you have questions about a method or function, 
 type a question mark before the method and press ENTER in a Google Colab or Jupyter notebook to learn more.  Example: `?learner.autofit`.
 
@@ -202,6 +206,65 @@ See also [this post](https://github.com/huggingface/transformers/issues/1950) on
 Note that, once a `transformers` model is trained and saved (e.g., using `predictor.save` or `learner.save_model` or `learner.model.save_pretrained`), it 
 can be reloaded into other libraries that support `transformers` (e.g., `sentence-transformers`).
 
+[[Back to Top](#frequently-asked-questions-about-ktrain)]
+
+
+
+### How do I install ktrain on a Windows machine?
+
+Here are detailed instructions for getting started with *ktrain* and TensorFlow on a Windows 10 computer.
+
+#### Installation on Windows
+
+1. Download and install the [Miniconda Python distribution](https://docs.conda.io/en/latest/miniconda.html).  You will most likely want the **Python 3.8 Miniconda3 Windows 64-bit**.
+2. Download and install the [Microsoft Visual C++ Redistributable](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads).
+3. Click on **Anaconda Powershell Prompt** in the Start Menu.
+4. Create a conda environment for *ktrain*: `conda create -n kt python=3.7; conda activate kt`
+5. Type: `pip install -U pip setuptools_scm jupyter` (if this fails with an error, run it a second time or add the `--user` option)
+6. Type: `pip install ktrain`
+
+If your machine has a GPU (which is needed for larger models), you'll need to perform [GPU setup for TensorFlow](https://www.tensorflow.org/install/gpu).
+
+#### Resolving Problems
+- If you experience a **Kernel Error** when running `jupyter notebook`, follow the [instructions here](https://stackoverflow.com/a/60611014)
+  and copy the two files in `C:\Users\<your_user_name>\Miniconda3\envs\kt\Lib\site-packages\pywin32_system32` to `C:\Windows\System32`.
+- If you experience SSL certificate problems with either `pip` or `conda`, run `conda config --set ssl_verify false` and
+replace all `pip` commands above with `pip --trusted-host pypi.org --trusted-host files.pythonhosted.org`.
+- STEP 4 above selects Python 3.7 with `python=3.7`. If you remove it, Python 3.8 is used by default. We recommend using Python 3.6 or Python 3.7 for the time being due
+  to yet-to-be-resolved bugs in the current version of TensorFlow.
+
+
+#### Running an Example
+Once installed, you can fire up Jupyter notebook (type `jupyter notebook` at the command prompt) and test out *ktrain* with something like this:
+```python
+# download Cats vs. Dogs image classification dataset
+!curl -k --output C:/temp/cats_and_dogs_filtered.zip --url https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip 
+import zipfile
+
+local_zip = 'C:/temp/cats_and_dogs_filtered.zip'
+# extract the downloaded archive into C:/temp
+with zipfile.ZipFile(local_zip, 'r') as zip_ref:
+    zip_ref.extractall('C:/temp')
+
+# train model
+import ktrain
+from ktrain import vision as vis
+(trn, val, preproc) = vis.images_from_folder(
+                                              datadir='C:/temp/cats_and_dogs_filtered',
+                                              data_aug = vis.get_data_aug(horizontal_flip=True),
+                                              train_test_names=['train', 'validation'])
+learner = ktrain.get_learner(model=vis.image_classifier('pretrained_mobilenet', trn, val, freeze_layers=15), 
+                             train_data=trn, val_data=val, workers=4, batch_size=64)
+learner.fit_onecycle(1e-4, 1)
+
+# make prediction
+predictor = ktrain.get_predictor(learner.model, preproc)
+predictor.predict_filename('C:/temp/cats_and_dogs_filtered/validation/cats/cat.2000.jpg')
+```
+
+
+
+
 [[Back to Top](#frequently-asked-questions-about-ktrain)]
 
 
@@ -499,7 +562,7 @@ if using `focal_loss` with a `transformers` model like DistilBert.
 *ktrain* is just a lightweight wrapper around `tf.keras`, so this would be done in the exact same way as you would in Keras.
 More specifically, you can simply recompile your model with the loss function or optimizer you want by invoking `model.compile`.
 
-For example, here is how to use **focal loss** with a DistilBert model:
+For example, here is how to use [focal loss](https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/) with a DistilBert model:
 
 ```python
 import tensorflow as tf

diff --git a/README.md b/README.md
@@ -61,8 +61,8 @@ zsl.predict(doc, topic_strings=topic_strings, include_labels=True)
      - **Ready-to-Use NER models for English, Chinese, and Russian** with no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/shallownlp-examples.ipynb)]</sup></sub>
      - **Sentence Pair Classification**  for tasks like paraphrase detection <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/MRPC-BERT.ipynb)]</sup></sub>
      - **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
-     - **Document Similarity with One-Class Learning**:  given some documents of interest, find and score new documents that are semantically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
-     - **Document Recommendation Engine**:  given text from a sample document, recommend documents that are thematically-related to it from a larger corpus  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
+     - **Document Similarity with One-Class Learning**:  given some documents of interest, find and score new documents that are thematically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
+     - **Document Recommendation Engines and Semantic Searches**:  given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
      - **Text Summarization**:  summarize long documents with a pretrained BART model - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization_with_bart.ipynb)]</sup></sub>
      - **Open-Domain Question-Answering**:  ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
      - **Zero-Shot Learning**:  classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
@@ -318,15 +318,17 @@ Using *ktrain* on **Google Colab**?  See these Colab examples:
 
 ### Installation
 
-1. Make sure pip is up-to-date with: `pip3 install -U pip`
+1. Make sure pip is up-to-date with: `pip install -U pip`
 
-2. Install *ktrain*: `pip3 install ktrain`
+2. Install *ktrain*: `pip install ktrain`
+
+The above should be all you need on Linux systems and cloud computing environments like Google Colab and AWS EC2.  If you are using *ktrain* on a **Windows computer**, you can follow these 
+[more detailed instructions](https://github.com/amaiya/ktrain/blob/master/FAQ.md#how-do-i-install-ktrain-on-a-windows-machine) that include some extra steps.
 
 **Some important things to note about installation:**
 - If using *ktrain* on a local machine with a GPU (versus Google Colab, for example), you'll need to [install GPU support for TensorFlow 2](https://www.tensorflow.org/install/gpu).
-- *ktrain* currently uses [TensorFlow 2.1.0 or 2.2.0](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*. 
-TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems.  TensorFlow 2.2.0 will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
- On systems where Python 3.8 is the default (e.g., Ubuntu 20.04), we recommend installing and using Python 3.6/3.7 and TensorFlow 2.1.0 with *ktrain* due to unresolved bugs in versions of TensorFlow >= 2.2.0.
+- *ktrain* currently uses [TensorFlow 2](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*. 
+TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems (due to some TensorFlow bugs that will not be fixed by Google until TensorFlow 2.4).  TensorFlow `>=2.2.0` will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
 - Since some *ktrain* dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues), 
   *ktrain* is temporarily using forked versions of some libraries. Specifically, *ktrain* uses forked versions of the `eli5` and `stellargraph` libraries.  If not installed, *ktrain* will complain  when a method or function needing 
   either of these libraries is invoked.

diff --git a/examples/tabular/tabular_classification_and_regression_example.ipynb b/examples/tabular/tabular_classification_and_regression_example.ipynb
@@ -852,7 +852,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The plot above is generated using the [shap](https://github.com/slundberg/shap) library.  The features in red are causing our model to increase the prediction for the **Survived** class, while features in blue cause our model to *decrease* the prediction for **Survived** (or *increase* the prediction for **Not_Survived**).  \n",
+    "The plot above is generated using the [shap](https://github.com/slundberg/shap) library. You can install it with either `pip install shap` or, for *conda* users, `conda install -c conda-forge shap`.  The features in red are causing our model to increase the prediction for the **Survived** class, while features in blue cause our model to *decrease* the prediction for **Survived** (or *increase* the prediction for **Not_Survived**).  \n",
     "\n",
     "From the plot, we see that the predicted softmax probability for `Survived` is **50%**, which is a comparatively much less confident classification than other classifications. Why is this?\n",
     "\n",

diff --git a/ktrain/core.py b/ktrain/core.py
@@ -474,7 +474,7 @@ def reset_weights(self, verbose=1):
 
 
 
-    def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None, 
+    def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None, class_weight=None,
                 stop_factor=4, show_plot=False, suggest=False, restore_weights_only=False, verbose=1):
         """
         Plots loss as learning rate is increased.  Highest learning rate 
@@ -500,6 +500,8 @@ def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
                                Default is None. Set max_epochs to an integer
                                (e.g., 5) if lr_find is taking too long
                                and running for more epochs than desired.
+            class_weight(dict): dict mapping class indices to weights, passed to
+                                model.fit as class_weight for imbalanced datasets.
             stop_factor(int): factor used to determine threshold that loss 
                               must exceed to stop training simulation.
                               Increase this if loss is erratic and lr_find
@@ -554,6 +556,7 @@ def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
                                 use_gen=use_gen,
                                 start_lr=start_lr, lr_mult=lr_mult, 
                                 max_epochs=max_epochs,
+                                class_weight=class_weight,
                                 workers=self.workers, 
                                 use_multiprocessing=self.use_multiprocessing, 
                                 batch_size=self.batch_size,
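
The new `class_weight` argument expects a dict that maps class indices to weights, as `model.fit` does. Below is a minimal sketch of one common way to build such a dict, using the "balanced" heuristic `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the toy label list are assumptions made for illustration and are not part of this commit.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_id: weight} where weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# hypothetical, imbalanced label vector: 8 negatives and 2 positives
y_train = [0] * 8 + [1] * 2
class_weight = balanced_class_weights(y_train)
print(class_weight)

# With a ktrain learner that has already been created (not shown here),
# the dict would then be passed through to lr_find:
#   learner.lr_find(class_weight=class_weight, show_plot=True)
```

The rarer class receives the larger weight, so its examples contribute more to the loss during the learning-rate simulation.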

diff --git a/ktrain/imports.py b/ktrain/imports.py
@@ -245,4 +245,5 @@
                    'pip3 install allennlp'
 
 
-
+# ELI5
+KTRAIN_ELI5_TAG = '0.10.1-1'
diff --git a/ktrain/lroptimize/lrfinder.py b/ktrain/lroptimize/lrfinder.py
@@ -58,7 +58,7 @@ def on_batch_end(self, batch, logs):
             return
 
 
-    def find(self, train_data, steps_per_epoch, use_gen=False,
+    def find(self, train_data, steps_per_epoch, use_gen=False, class_weight=None,
              start_lr=1e-7, lr_mult=1.01, max_epochs=None, 
              batch_size=U.DEFAULT_BS, workers=1, use_multiprocessing=False, verbose=1):
         """
@@ -113,13 +113,14 @@ def find(self, train_data, steps_per_epoch, use_gen=False,
             # *_generator methods are deprecated from TF 2.1.0
             fit_fn = self.model.fit
             fit_fn(train_data, steps_per_epoch=steps_per_epoch, 
-                   epochs=epochs, 
+                   epochs=epochs, class_weight=class_weight,
                    workers=workers, use_multiprocessing=use_multiprocessing,
                    verbose=verbose,
                    callbacks=[callback])
         else:
             self.model.fit(train_data[0], train_data[1],
-                            batch_size=batch_size, epochs=epochs, verbose=verbose,
+                            batch_size=batch_size, epochs=epochs, class_weight=class_weight, 
+                            verbose=verbose,
                             callbacks=[callback])
 
 

diff --git a/ktrain/tabular/predictor.py b/ktrain/tabular/predictor.py
@@ -73,7 +73,13 @@ def explain(self, test_df, row_index=None, row_num=None, class_id=None, backgrou
           background_size(int): size of background data (SHAP parameter)
           nsamples(int): number of samples (SHAP parameter)
         """
-        import shap
+        try:
+            import shap
+        except ImportError:
+            msg = 'TabularPredictor.explain requires the shap library. Please install it with: pip install shap. '+\
+                    'Conda users should use this command instead: conda install -c conda-forge shap'
+            warnings.warn(msg)
+            return
 
         classification, multilabel = U.is_classifier(self.model)
         if classification and class_id is None:
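
The change above turns `shap` into an optional dependency that is imported only when `explain` is called. The same pattern can be sketched generically as follows. The helper name `import_optional` and the placeholder module name are hypothetical and used only for illustration; they do not appear in ktrain.

```python
import importlib
import warnings


def import_optional(module_name, install_hint):
    """Return the imported module, or None (after warning) if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        warnings.warn('%s is required for this feature. %s' % (module_name, install_hint))
        return None


# hypothetical module name that is not expected to be installed anywhere
missing = import_optional('not_a_real_module_xyz',
                          'Install with: pip install not_a_real_module_xyz')

# a standard-library module, which is always importable
present = import_optional('json', '')
```

Returning `None` instead of raising lets the caller skip the feature and keep the rest of the session usable, which mirrors the early `return` in the diff.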

diff --git a/ktrain/text/eda.py b/ktrain/text/eda.py
@@ -515,6 +515,12 @@ def train_scorer(self, topic_ids=[], doc_ids=[], n_neighbors=20):
         Trains a scorer that can score documents based on similarity to a
         seed set of documents represented by topic_ids and doc_ids.
 
+        NOTE: The score method currently uses LocalOutlierFactor, which
+        means you should not try to score documents that were used in training. Only
+        new, unseen documents should be scored for similarity.
+        REFERENCE: 
+        https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor
+
         Args:
             topic_ids(list of ints): list of topic IDs where each id is in the range
                                      of range(self.n_topics).  Documents associated
@@ -547,6 +553,10 @@ def score(self, texts=None, doc_topics=None):
         classifiers are more strict than traditional binary classifiers.
         Documents with negative scores closer to zero are good candidates for
         inclusion in a training set for binary classification (e.g., active labeling).
+
+        NOTE: The score method currently uses LocalOutlierFactor, which
+        means you should not try to score documents that were used in training. Only
+        new, unseen documents should be scored for similarity.
  
         Args:
             texts(list of str): list of document texts.  Mutually-exclusive with <doc_topics>

diff --git a/ktrain/text/predictor.py b/ktrain/text/predictor.py
@@ -111,6 +111,12 @@ def explain(self, doc, truncate_len=512, all_targets=False, n_samples=2500):
                   'Install with: pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1'
             warnings.warn(msg)
             return
+        if not hasattr(eli5, 'KTRAIN_ELI5_TAG') or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
+            msg = 'ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. '+\
+                  'Uninstall the current version and install/re-install the fork with: pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1'
+            warnings.warn(msg)
+            return
+
 
         prediction = [self.predict(doc)] if not all_targets else None
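
The new check treats the `eli5` fork as usable only when it exposes a `KTRAIN_ELI5_TAG` attribute equal to the tag that ktrain expects. A minimal sketch of that logic, using stand-in objects rather than the real `eli5` package, might look like this. The function name `fork_is_current`, the stand-in objects, and the older tag value `'0.10.1-0'` are assumptions made for illustration.

```python
from types import SimpleNamespace

EXPECTED_TAG = '0.10.1-1'  # value assigned to KTRAIN_ELI5_TAG in ktrain/imports.py above


def fork_is_current(module, expected_tag=EXPECTED_TAG):
    """True only if the module carries the marker attribute with the expected value."""
    return getattr(module, 'KTRAIN_ELI5_TAG', None) == expected_tag


# stand-ins for three possible installations (hypothetical objects, not real packages)
upstream = SimpleNamespace()                                # no marker attribute at all
old_fork = SimpleNamespace(KTRAIN_ELI5_TAG='0.10.1-0')      # hypothetical out-of-date tag
current_fork = SimpleNamespace(KTRAIN_ELI5_TAG='0.10.1-1')  # matches the expected tag

for name, mod in [('upstream', upstream), ('old_fork', old_fork), ('current_fork', current_fork)]:
    print(name, fork_is_current(mod))
```

Comparing a tag value rather than only testing for the attribute's presence is what allows an out-of-date fork to be detected as well as a missing one.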
 

diff --git a/ktrain/version.py b/ktrain/version.py
@@ -1,2 +1,2 @@
 __all__ = ['__version__']
-__version__ = '0.19.6'
+__version__ = '0.19.7'
diff --git a/ktrain/vision/predictor.py b/ktrain/vision/predictor.py
@@ -45,9 +45,10 @@ def explain(self, img_fpath):
             warnings.warn(msg)
             return
 
-        if not hasattr(eli5, 'KTRAIN'):
+        #if not hasattr(eli5, 'KTRAIN'):
+        if not hasattr(eli5, 'KTRAIN_ELI5_TAG') or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
             warnings.warn("Since eli5 does not yet support tf.keras, ktrain uses a forked version of eli5.  " +\
-                           "We do not detect this forked version, so predictor.explain will not work.  " +\
+                           "We do not detect this forked version (or it is out-of-date), so predictor.explain may not work.  " +\
                            "It will work if you uninstall the current version of eli5 and install "+\
                            "the forked version:  " +\
                            "pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1")