Merge branch 'develop'
amaiya committed Aug 16, 2020
2 parents dba89d8 + 7e7e2ca commit 6d2145e
Showing 14 changed files with 138 additions and 28 deletions.
17 changes: 17 additions & 0 deletions CHANGELOG.md
Expand Up @@ -6,6 +6,23 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.19.7 (2020-08-16)

### New:
- N/A

### Changed
- added `class_weight` parameter to `lr_find` for imbalanced datasets
- removed pins for `cchardet` and `scikitlearn` from `setup.py`
- added version check for `eli5` fork
- removed `scipy` pin from `setup.py`
- Allow TensorFlow 2.3 for Python 3.8
- Request manual installation of `shap` in `TabularPredictor.explain` instead of inclusion in `setup.py`

### Fixed:
- N/A
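
As an illustration of the `class_weight` change listed above, here is a minimal sketch of building a weight dictionary for an imbalanced dataset. The helper name `balanced_class_weights` is hypothetical (it is not part of ktrain), and the weighting rule, `n_samples / (n_classes * count)`, is one common heuristic rather than something this release prescribes.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class_id: weight}, where weight = n_samples / (n_classes * count).

    Hypothetical helper for illustration; rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy imbalanced label set: 90 examples of class 0, 10 examples of class 1.
labels = [0] * 90 + [1] * 10
class_weight = balanced_class_weights(labels)
print(class_weight)

# With an existing ktrain learner (not constructed here), the dictionary
# would be passed through the new parameter, for example:
#   learner.lr_find(class_weight=class_weight, show_plot=True)
```

The resulting dictionary has the same form that `model.fit` accepts for its `class_weight` argument, which is where `lr_find` forwards it.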


## 0.19.6 (2020-08-12)

### New:
Expand Down
65 changes: 64 additions & 1 deletion FAQ.md
Expand Up @@ -7,6 +7,8 @@
- [What kinds of applications have been built with *ktrain*?](#what-kinds-of-applications-have-been-built-with-ktrain)

## Installation/Deployment Issues
- [How do I install ktrain on a Windows machine?](#how-do-i-install-ktrain-on-a-windows-machine)

- [How do I use ktrain without an internet connection?](#how-do-i-use-ktrain-without-an-internet-connection)

- [Why am I seeing an ERROR when installing *ktrain* on Google Colab?](#why-am-i-seeing-an-error-when-installing-ktrain-on-google-colab)
Expand Down Expand Up @@ -84,6 +86,8 @@ Here is how you can quickly get started using *ktrain*:
4. Make sure the notebook is set up to use a GPU: `Runtime --> Change runtime type` and select `GPU` in the menu.
5. Click on each cell in the notebook and execute it by pressing `SHIFT` and `ENTER` at the same time. The notebook shows you how to build a neural network that recognizes cats vs. dogs in photos.

If you're on a Windows laptop, you can also try out *ktrain* locally (instead of using Google Colab) by [following these instructions](#how-do-i-install-ktrain-on-a-windows-machine).

Next, you can go through [the tutorials](https://github.com/amaiya/ktrain#tutorials) to learn more. If you have questions about a method or function,
type a question mark before the method and press ENTER in a Google Colab or Jupyter notebook to learn more. Example: `?learner.autofit`.

Expand Down Expand Up @@ -202,6 +206,65 @@ See also [this post](https://github.com/huggingface/transformers/issues/1950) on
Note that, once a `transformers` model is trained and saved (e.g., using `predictor.save` or `learner.save_model` or `learner.model.save_pretrained`), it
can be reloaded into other libraries that support `transformers` (e.g., `sentence-transformers`).

[[Back to Top](#frequently-asked-questions-about-ktrain)]



### How do I install ktrain on a Windows machine?

Here are detailed instructions for getting started with *ktrain* and TensorFlow on a Windows 10 computer.

#### Installation on Windows

1. Download and Install the [Miniconda Python distribution](https://docs.conda.io/en/latest/miniconda.html). You will most likely want the **Python 3.8 Miniconda3 Windows 64-bit**.
2. Download and Install the [Microsoft Visual C++ Redistributable](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads).
3. Click on **Anaconda Powershell Prompt** in the Start Menu.
4. Create a conda environment for *ktrain*: `conda create -n kt python=3.7; conda activate kt`
5. Type: `pip install -U pip setuptools_scm jupyter` (if you see an error, run the command a second time or add the `--user` option)
6. Type: `pip install ktrain`

If your machine has a GPU (which is needed for larger models), you'll need to perform [GPU setup for TensorFlow](https://www.tensorflow.org/install/gpu).

#### Resolving Problems
- If you experience a **Kernel Error** when running `jupyter notebook`, follow the [instructions here](https://stackoverflow.com/a/60611014)
and copy the two files in `C:\Users\<your_user_name>\Miniconda3\envs\kt\Lib\site-packages\pywin32_system32` to `C:\Windows\System32`.
- If you experience SSL certificate problems with either `pip` or `conda`, run `conda config --set ssl_verify false` and
replace all `pip` commands above with `pip --trusted-host pypi.org --trusted-host files.pythonhosted.org`.
- Step 4 above selects Python 3.7 with `python=3.7`; if that option is removed, Python 3.8 is the default. We recommend using Python 3.6 or Python 3.7 for the time being due
to yet-to-be-resolved bugs in the current version of TensorFlow.
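
The Python version recommendation in the last item can be checked from inside the activated environment. This is a minimal sketch that uses only the standard library; the helper name `python_version_ok` is hypothetical and not part of *ktrain*.

```python
import sys

# Versions recommended in the note above (Python 3.6 or 3.7).
RECOMMENDED = {(3, 6), (3, 7)}

def python_version_ok(version_info=None):
    """Return True if the (major, minor) version is one of the recommended ones."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in RECOMMENDED

if not python_version_ok():
    # Only a notice: ktrain may still install on other versions.
    print("Python %d.%d detected; the FAQ recommends 3.6 or 3.7." % sys.version_info[:2])
```

Running this in the `kt` environment created in step 4 should print nothing if the `python=3.7` option was kept.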


#### Running an Example
Once installed, you can start Jupyter Notebook (type `jupyter notebook` at the command prompt) and test out *ktrain* with something like this:
```python
# download Cats vs. Dogs image classification dataset
!curl -k --output C:/temp/cats_and_dogs_filtered.zip --url https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
import os
import zipfile
local_zip = 'C:/temp/cats_and_dogs_filtered.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('C:/temp')
zip_ref.close()

# train model
import ktrain
from ktrain import vision as vis
(trn, val, preproc) = vis.images_from_folder(
    datadir='C:/temp/cats_and_dogs_filtered',
    data_aug=vis.get_data_aug(horizontal_flip=True),
    train_test_names=['train', 'validation'])
learner = ktrain.get_learner(model=vis.image_classifier('pretrained_mobilenet', trn, val, freeze_layers=15),
                             train_data=trn, val_data=val, workers=4, batch_size=64)
learner.fit_onecycle(1e-4, 1)

# make prediction
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict_filename('C:/temp/cats_and_dogs_filtered/validation/cats/cat.2000.jpg')
```
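
The extraction step in the example above can also be written with a context manager, so the archive is closed even if extraction fails. This is a small sketch rather than part of the FAQ's example; `extract_zip` is a hypothetical helper, and the demo builds a throwaway archive in a temporary directory instead of downloading the real dataset.

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(dest_dir)
        return zip_ref.namelist()

# Self-contained demo: create a tiny archive, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('train/cats/cat.0.txt', 'placeholder')
    names = extract_zip(demo_zip, os.path.join(tmp, 'out'))
    extracted_ok = os.path.exists(os.path.join(tmp, 'out', 'train', 'cats', 'cat.0.txt'))
```

For the dataset in the example, the call would be `extract_zip('C:/temp/cats_and_dogs_filtered.zip', 'C:/temp')`.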




[[Back to Top](#frequently-asked-questions-about-ktrain)]


Expand Down Expand Up @@ -499,7 +562,7 @@ if using `focal_loss` with a `transformers` model like DistilBert.
*ktrain* is just a lightweight wrapper around `tf.keras`, so this would be done in the exact same way as you would in Keras.
More specifically, you can simply recompile your model with the loss function or optimizer you want by invoking `model.compile`.

For example, here is how to use **focal loss** with a DistilBert model:
For example, here is how to use [focal loss](https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/) with a DistilBert model:

```python
import tensorflow as tf
Expand Down
16 changes: 9 additions & 7 deletions README.md
Expand Up @@ -61,8 +61,8 @@ zsl.predict(doc, topic_strings=topic_strings, include_labels=True)
- **Ready-to-Use NER models for English, Chinese, and Russian** with no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/shallownlp-examples.ipynb)]</sup></sub>
- **Sentence Pair Classification** for tasks like paraphrase detection <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/MRPC-BERT.ipynb)]</sup></sub>
- **Unsupervised Topic Modeling** with [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-topic_modeling.ipynb)]</sup></sub>
- **Document Similarity with One-Class Learning**: given some documents of interest, find and score new documents that are semantically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
- **Document Recommendation Engine**: given text from a sample document, recommend documents that are thematically-related to it from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- **Document Similarity with One-Class Learning**: given some documents of interest, find and score new documents that are thematically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
- **Document Recommendation Engines and Semantic Searches**: given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- **Text Summarization**: summarize long documents with a pretrained BART model - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization_with_bart.ipynb)]</sup></sub>
- **Open-Domain Question-Answering**: ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Zero-Shot Learning**: classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
Expand Down Expand Up @@ -318,15 +318,17 @@ Using *ktrain* on **Google Colab**? See these Colab examples:

### Installation

1. Make sure pip is up-to-date with: `pip3 install -U pip`
1. Make sure pip is up-to-date with: `pip install -U pip`

2. Install *ktrain*: `pip3 install ktrain`
2. Install *ktrain*: `pip install ktrain`

The above should be all you need on Linux systems and cloud computing environments like Google Colab and AWS EC2. If you are using *ktrain* on a **Windows computer**, you can follow these
[more detailed instructions](https://github.com/amaiya/ktrain/blob/master/FAQ.md#how-do-i-install-ktrain-on-a-windows-machine) that include some extra steps.

**Some important things to note about installation:**
- If using *ktrain* on a local machine with a GPU (versus Google Colab, for example), you'll need to [install GPU support for TensorFlow 2](https://www.tensorflow.org/install/gpu).
- *ktrain* currently uses [TensorFlow 2.1.0 or 2.2.0](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*.
TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems. TensorFlow 2.2.0 will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
On systems where Python 3.8 is the default (e.g., Ubuntu 20.04), we recommend installing and using Python 3.6/3.7 and TensorFlow 2.1.0 with *ktrain* due to unresolved bugs in versions of TensorFlow >= 2.2.0.
- *ktrain* currently uses [TensorFlow 2](https://www.tensorflow.org/install/pip?lang=python3), which will be installed automatically when installing *ktrain*.
TensorFlow 2.1.0 will be installed as a dependency on Python 3.6 and 3.7 systems (due to some TensorFlow bugs that will not be fixed by Google until TensorFlow 2.4). TensorFlow `>=2.2.0` will be installed only if using Python 3.8 (as TF 2.1.0 does not support Python 3.8).
- Since some *ktrain* dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues),
*ktrain* is temporarily using forked versions of some libraries. Specifically, *ktrain* uses forked versions of the `eli5` and `stellargraph` libraries. If not installed, *ktrain* will complain when a method or function needing
either of these libraries is invoked.
Expand Down
Expand Up @@ -852,7 +852,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above is generated using the [shap](https://github.com/slundberg/shap) library. The features in red are causing our model to increase the prediction for the **Survived** class, while features in blue cause our model to *decrease* the prediction for **Survived** (or *increase* the prediction for **Not_Survived**). \n",
"The plot above is generated using the [shap](https://github.com/slundberg/shap) library. You can install it with either `pip install shap` or, for *conda* users, `conda install -c conda-forge shap`. The features in red are causing our model to increase the prediction for the **Survived** class, while features in blue cause our model to *decrease* the prediction for **Survived** (or *increase* the prediction for **Not_Survived**). \n",
"\n",
"From the plot, we see that the predicted softmax probability for `Survived` is **50%**, which is a comparatively much less confident classification than other classifications. Why is this?\n",
"\n",
Expand Down
5 changes: 4 additions & 1 deletion ktrain/core.py
Expand Up @@ -474,7 +474,7 @@ def reset_weights(self, verbose=1):



def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None, class_weight=None,
stop_factor=4, show_plot=False, suggest=False, restore_weights_only=False, verbose=1):
"""
Plots loss as learning rate is increased. Highest learning rate
Expand All @@ -500,6 +500,8 @@ def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
Default is None. Set max_epochs to an integer
(e.g., 5) if lr_find is taking too long
and running for more epochs than desired.
class_weight(dict): class_weight parameter passed to model.fit
for imbalanced datasets.
stop_factor(int): factor used to determine threshold that loss
must exceed to stop training simulation.
Increase this if loss is erratic and lr_find
Expand Down Expand Up @@ -554,6 +556,7 @@ def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
use_gen=use_gen,
start_lr=start_lr, lr_mult=lr_mult,
max_epochs=max_epochs,
class_weight=class_weight,
workers=self.workers,
use_multiprocessing=self.use_multiprocessing,
batch_size=self.batch_size,
Expand Down
3 changes: 2 additions & 1 deletion ktrain/imports.py
Expand Up @@ -245,4 +245,5 @@
'pip3 install allennlp'



# ELI5
KTRAIN_ELI5_TAG = '0.10.1-1'
7 changes: 4 additions & 3 deletions ktrain/lroptimize/lrfinder.py
Expand Up @@ -58,7 +58,7 @@ def on_batch_end(self, batch, logs):
return


def find(self, train_data, steps_per_epoch, use_gen=False,
def find(self, train_data, steps_per_epoch, use_gen=False, class_weight=None,
start_lr=1e-7, lr_mult=1.01, max_epochs=None,
batch_size=U.DEFAULT_BS, workers=1, use_multiprocessing=False, verbose=1):
"""
Expand Down Expand Up @@ -113,13 +113,14 @@ def find(self, train_data, steps_per_epoch, use_gen=False,
# *_generator methods are deprecated from TF 2.1.0
fit_fn = self.model.fit
fit_fn(train_data, steps_per_epoch=steps_per_epoch,
epochs=epochs,
epochs=epochs, class_weight=class_weight,
workers=workers, use_multiprocessing=use_multiprocessing,
verbose=verbose,
callbacks=[callback])
else:
self.model.fit(train_data[0], train_data[1],
batch_size=batch_size, epochs=epochs, verbose=verbose,
batch_size=batch_size, epochs=epochs, class_weight=class_weight,
verbose=verbose,
callbacks=[callback])


Expand Down
8 changes: 7 additions & 1 deletion ktrain/tabular/predictor.py
Expand Up @@ -73,7 +73,13 @@ def explain(self, test_df, row_index=None, row_num=None, class_id=None, backgrou
background_size(int): size of background data (SHAP parameter)
nsamples(int): number of samples (SHAP parameter)
"""
import shap
try:
import shap
except ImportError:
msg = 'TabularPredictor.explain requires shap library. Please install with: pip install shap. '+\
'Conda users should use this command instead: conda install -c conda-forge shap'
warnings.warn(msg)
return

classification, multilabel = U.is_classifier(self.model)
if classification and class_id is None:
Expand Down
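
The hunk above turns `shap` into an optional dependency that is imported only when `explain` is called. The same pattern can be sketched generically as follows; `optional_import` is a hypothetical helper for illustration, not a ktrain function, and it assumes only the standard library.

```python
import importlib
import warnings

def optional_import(name, install_hint):
    """Import module `name`; on failure, warn with an install hint and return None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        warnings.warn('%s is required for this feature. %s' % (name, install_hint))
        return None

# Usage: a caller returns early when the optional dependency is missing.
shap = optional_import('shap', 'Install with: pip install shap '
                               '(conda users: conda install -c conda-forge shap)')
if shap is None:
    print('shap not available; skipping explanation')
```

Returning `None` instead of raising mirrors the diff's behavior of warning and returning, so the rest of the library stays usable without the extra package.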
10 changes: 10 additions & 0 deletions ktrain/text/eda.py
Expand Up @@ -515,6 +515,12 @@ def train_scorer(self, topic_ids=[], doc_ids=[], n_neighbors=20):
Trains a scorer that can score documents based on similarity to a
seed set of documents represented by topic_ids and doc_ids.
NOTE: The score method currently uses LocalOutlierFactor, which
means you should not try to score documents that were used in training. Only
new, unseen documents should be scored for similarity.
REFERENCE:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor
Args:
topic_ids(list of ints): list of topic IDs where each id is in the range
of range(self.n_topics). Documents associated
Expand Down Expand Up @@ -547,6 +553,10 @@ def score(self, texts=None, doc_topics=None):
classifiers are more strict than traditional binary classifiers.
Documents with negative scores closer to zero are good candidates for
inclusion in a training set for binary classification (e.g., active labeling).
NOTE: The score method currently uses LocalOutlierFactor, which
means you should not try to score documents that were used in training. Only
new, unseen documents should be scored for similarity.
Args:
texts(list of str): list of document texts. Mutually-exclusive with <doc_topics>
Expand Down
6 changes: 6 additions & 0 deletions ktrain/text/predictor.py
Expand Up @@ -111,6 +111,12 @@ def explain(self, doc, truncate_len=512, all_targets=False, n_samples=2500):
'Install with: pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1'
warnings.warn(msg)
return
if not hasattr(eli5, 'KTRAIN_ELI5_TAG') or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
msg = 'ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. '+\
'Uninstall the current version and install/re-install the fork with: pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1'
warnings.warn(msg)
return


prediction = [self.predict(doc)] if not all_targets else None

Expand Down
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
__version__ = '0.19.6'
__version__ = '0.19.7'
5 changes: 3 additions & 2 deletions ktrain/vision/predictor.py
Expand Up @@ -45,9 +45,10 @@ def explain(self, img_fpath):
warnings.warn(msg)
return

if not hasattr(eli5, 'KTRAIN'):
#if not hasattr(eli5, 'KTRAIN'):
if not hasattr(eli5, 'KTRAIN_ELI5_TAG') or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
warnings.warn("Since eli5 does not yet support tf.keras, ktrain uses a forked version of eli5. " +\
"We do not detect this forked version, so predictor.explain will not work. " +\
"We do not detect this forked version (or it is out-of-date), so predictor.explain may not work. " +\
"It will work if you uninstall the current version of eli5 and install "+\
"the forked version: " +\
"pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1")
Expand Down
