Merge branch 'develop'
amaiya committed Jul 16, 2021
2 parents ee9e601 + 90e47b0 commit e68c0af
Showing 18 changed files with 109 additions and 107 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,19 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.26.5 (2021-07-15)

### New:
- N/A

### Changed
- added `query` parameter to `SimpleQA.ask` so that an alternative query can be used to retrieve contexts from corpus
- added `chardet` as dependency for `stellargraph`
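
As a rough illustration of the `query` change listed above, the sketch below mirrors the fallback expression that appears in this commit's diff (`query if query is not None else question`). The helper name `retrieval_text` is hypothetical and not part of ktrain; it only shows which string would drive context retrieval.

```python
def retrieval_text(question, query=None):
    """Return the text used to search the corpus for contexts.

    Hypothetical stand-in for the fallback in SimpleQA.ask: an explicit
    query overrides the question for retrieval only.
    """
    return query if query is not None else question


# Without a query, the question itself drives retrieval.
print(retrieval_text("What is ktrain?"))
# With a query, retrieval uses the query instead of the question.
print(retrieval_text("What is it used for?", query="ktrain"))
```

Note that the check is `is not None` rather than truthiness, so an empty string passed as `query` would still replace the question for retrieval.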

### Fixed:
- fixed issue with `TopicModel.build` when `threshold=None`
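
The `threshold=None` fix can be sketched in isolation. The toy `predict` below is an assumption made for illustration: it mimics a function that returns a `(doc_topics, bool_array)` pair only when a threshold is given, which is the shape the patched `build` code in this commit guards against. Plain lists stand in for the NumPy arrays used in ktrain, and the scoring rule is invented.

```python
def predict(texts, threshold=None):
    """Toy stand-in: score each text, optionally filtering by threshold."""
    # Fake "highest topic probability": longer texts score higher, capped at 1.0.
    scores = [min(len(t) / 10.0, 1.0) for t in texts]
    if threshold is None:
        return scores  # single return value
    keep = [s >= threshold for s in scores]
    kept_scores = [s for s, k in zip(scores, keep) if k]
    return kept_scores, keep  # two return values


def build(texts, threshold=None):
    """Handle both return shapes, as the patched code does."""
    if threshold is not None:
        doc_topics, bool_array = predict(texts, threshold=threshold)
    else:
        doc_topics = predict(texts)
        bool_array = [True] * len(texts)  # nothing is filtered out
    return doc_topics, bool_array
```

Unpacking two values unconditionally would fail in the `threshold=None` case in this sketch, which is why the branch builds an all-`True` mask instead.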


## 0.26.4 (2021-06-23)

### New:
65 changes: 4 additions & 61 deletions README.md
@@ -10,65 +10,8 @@


### News and Announcements
- **2021-03-10:**
- ***ktrain*** **v0.26.x is released** and now supports `transformers>=4.0.0`.
Note that `transformers>=4.0.0` included a complete reorganization of the module's structure. This means that, if you saved a **transformers**-based `Predictor` (e.g., DistilBERT) in an older version of **ktrain** and **transformers**, you will need to either generate a new `tf_model.preproc` file or manually edit the existing `tf_model.preproc` file before loading the `predictor` in the latest versions of **ktrain** and **transformers**.
For instance, suppose you trained a DistilBERT model and saved the resultant predictor using an older version of **ktrain** with: `predictor.save('/tmp/my_predictor/')`. After upgrading to the newest version of **ktrain**, you will find that `ktrain.load_predictor('/tmp/my_predictor')` will throw an error unless you follow one of the two approaches below:

**Approach 1: Manually edit `tf_model.preproc` file:**
Open `tf_model.preproc` with an editor like **vim** and edit it to replace old module locations with new module locations (example changes for a DistilBERT model shown below):
```python
# change transformers.configuration_distilbert to transformers.models.distilbert.configuration_distilbert
# change transformers.modeling_tf_auto to transformers.models.auto.modeling_tf_auto
# change transformers.tokenization_auto to transformers.models.auto.tokenization_auto
```
The above was confirmed to work using the **vim** editor on Linux.

**Approach 2: Re-generate `tf_model.preproc` file**:
```python
# Step 1: Re-create a Preprocessor instance
# NOTES:
# 1. If training set is large, you can use a sample containing at least one example for each class
# 2. Labels must be in same format as you originally used
# 3. If original training set is not easily accessible, set preproc.preprocess_train_called=True
# below instead of invoking preproc.preprocess_train(x_train, y_train)

preproc = text.Transformer(MODEL_NAME, maxlen=500, class_names=class_names)
trn = preproc.preprocess_train(x_train, y_train)

# Step 2: load the transformers model from predictor folder
from transformers import *
model = TFAutoModelForSequenceClassification.from_pretrained('/tmp/my_predictor/')

# Step 3: re-create/re-save Predictor
predictor = ktrain.get_predictor(model, preproc)
predictor.save('/tmp/my_new_predictor')
```
- If you're using PyTorch 1.8 or above with **ktrain**, you will need to upgrade to `ktrain>=0.26.0`. If you're using `ktrain<0.26.0`, then you will have to downgrade PyTorch with: `pip install torch==1.7.1`.
- **2020-11-08:**
- ***ktrain*** **v0.25.x is released** and includes out-of-the-box support for text extraction via the [textract](https://pypi.org/project/textract/) package. This, for example,
can be used in the `SimpleQA.index_from_folder` method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files. See the [Question-Answering example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/question_answering_with_bert.ipynb) for more information.
```python
# End-to-End Question-Answering in ktrain

# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True) # this slows indexing but speeds up answer retrieval

# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What is ktrain?', batch_size=8)

# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
# "ktrain is a low-code platform for machine learning"
```
- **2020-11-04**
- ***ktrain*** **v0.24.x is released** and now includes built-in support for exporting models to [ONNX](https://onnx.ai/) and [TensorFlow Lite](https://www.tensorflow.org/lite). See the [example notebook](https://github.com/amaiya/ktrain/blob/develop/examples/text/ktrain-ONNX-TFLite-examples.ipynb) for more information.
- **2021-07-15**
- **ktrain** was used to train machine learning models for [CoronaCentral.ai](https://coronacentral.ai/), a machine-learning-enhanced search engine for COVID publications at Stanford University. The CoronaCentral document classifier, **CoronaBERT**, is [available on the Hugging Face model hub](https://huggingface.co/jakelever/coronabert). CoronaCentral.ai was developed by Jake Lever and Russ Altman and funded by the Chan Zuckerberg Biohub. Check out [their paper](https://www.biorxiv.org/content/10.1101/2020.12.21.423860v1).
----

### Overview
@@ -360,8 +303,8 @@ The above should be all you need on Linux systems and cloud computing environmen
- Since some **ktrain** dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues),
**ktrain** is temporarily using forked versions of some libraries. Specifically, **ktrain** uses forked versions of the `eli5` and `stellargraph` libraries. If not installed, **ktrain** will complain when a method or function needing either of these libraries is invoked. To install these forked versions, you can do the following:
```
pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082
pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip
pip install https://github.com/amaiya/stellargraph/archive/refs/heads/no_tf_dep_082.zip
```

This code was tested on Ubuntu 18.04 LTS using TensorFlow 2.3.1 and Python 3.6.9.
2 changes: 1 addition & 1 deletion docs/imports.html
@@ -266,7 +266,7 @@ <h1 class="title">Module <code>ktrain.imports</code></h1>

SG_ERRMSG = &#39;ktrain currently uses a forked version of stellargraph v0.8.2. &#39;+\
&#39;Please install with: &#39;+\
&#39;pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082&#39;
&#39;pip install https://github.com/amaiya/stellargraph/archive/refs/heads/no_tf_dep_082.zip&#39;

ALLENNLP_ERRMSG = &#39;To use ELMo embedings, please install allenlp:\n&#39; +\
&#39;pip install allennlp&#39;
35 changes: 30 additions & 5 deletions docs/text/eda.html
@@ -297,7 +297,12 @@ <h1 class="title">Module <code>ktrain.text.eda</code></h1>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

@@ -1109,7 +1114,12 @@ <h2 id="args">Args</h2>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

@@ -1652,7 +1662,12 @@ <h2 id="args">Args</h2>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

@@ -2949,7 +2964,12 @@ <h2 id="args">Args</h2>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

@@ -3492,7 +3512,12 @@ <h2 id="args">Args</h2>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

14 changes: 12 additions & 2 deletions docs/text/index.html
@@ -2551,7 +2551,12 @@ <h2 id="args">Args</h2>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

@@ -3094,7 +3099,12 @@ <h2 id="args">Args</h2>
threshold (float): If not None, documents with whose highest topic probability
is less than threshold are filtered out.
&#34;&#34;&#34;
doc_topics, bool_array = self.predict(texts, threshold=threshold)
if threshold is not None:
doc_topics, bool_array = self.predict(texts, threshold=threshold)
else:
doc_topics = self.predict(texts)
bool_array = np.array([True] * len(texts))

self.doc_topics = doc_topics
self.bool_array = bool_array

12 changes: 6 additions & 6 deletions docs/text/predictor.html
@@ -146,12 +146,12 @@ <h1 class="title">Module <code>ktrain.text.predictor</code></h1>
from eli5.lime import TextExplainer
except:
msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. &#39;+\
&#39;Install with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
&#39;Install with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
warnings.warn(msg)
return
if not hasattr(eli5, &#39;KTRAIN_ELI5_TAG&#39;) or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. &#39;+\
&#39;Uninstall the current version and install/re-install the fork with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
&#39;Uninstall the current version and install/re-install the fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
warnings.warn(msg)
return

@@ -389,12 +389,12 @@ <h2 class="section-title" id="header-classes">Classes</h2>
from eli5.lime import TextExplainer
except:
msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. &#39;+\
&#39;Install with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
&#39;Install with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
warnings.warn(msg)
return
if not hasattr(eli5, &#39;KTRAIN_ELI5_TAG&#39;) or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. &#39;+\
&#39;Uninstall the current version and install/re-install the fork with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
&#39;Uninstall the current version and install/re-install the fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
warnings.warn(msg)
return

@@ -548,12 +548,12 @@ <h2 id="args">Args</h2>
from eli5.lime import TextExplainer
except:
msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. &#39;+\
&#39;Install with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
&#39;Install with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
warnings.warn(msg)
return
if not hasattr(eli5, &#39;KTRAIN_ELI5_TAG&#39;) or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. &#39;+\
&#39;Uninstall the current version and install/re-install the fork with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
&#39;Uninstall the current version and install/re-install the fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
warnings.warn(msg)
return

24 changes: 14 additions & 10 deletions docs/text/qa/core.html
@@ -216,14 +216,15 @@ <h1 class="title">Module <code>ktrain.text.qa.core</code></h1>



def ask(self, question, batch_size=8, n_docs_considered=10, n_answers=50,
def ask(self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50,
rerank_threshold=0.015, include_np=False):
&#34;&#34;&#34;
```
submit question to obtain candidate answers

Args:
question(str): question in the form of a string
query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
batch_size(int): number of question-context pairs fed to model at each iteration
Default:8
Increase for faster answer-retrieval.
@@ -252,9 +253,9 @@ <h1 class="title">Module <code>ktrain.text.qa.core</code></h1>
paragraphs = []
refs = []
#doc_results = self.search(question, limit=n_docs_considered)
doc_results = self.search(_process_question(question, include_np=include_np), limit=n_docs_considered)
doc_results = self.search(_process_question(query if query is not None else question, include_np=include_np), limit=n_docs_considered)
if not doc_results:
warnings.warn(&#39;No documents matched words in question&#39;)
warnings.warn(&#39;No documents matched words in question (or query if supplied)&#39;)
return []
# extract paragraphs as contexts
contexts = []
@@ -755,14 +756,15 @@ <h2 class="section-title" id="header-classes">Classes</h2>



def ask(self, question, batch_size=8, n_docs_considered=10, n_answers=50,
def ask(self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50,
rerank_threshold=0.015, include_np=False):
&#34;&#34;&#34;
```
submit question to obtain candidate answers

Args:
question(str): question in the form of a string
query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
batch_size(int): number of question-context pairs fed to model at each iteration
Default:8
Increase for faster answer-retrieval.
@@ -791,9 +793,9 @@ <h2 class="section-title" id="header-classes">Classes</h2>
paragraphs = []
refs = []
#doc_results = self.search(question, limit=n_docs_considered)
doc_results = self.search(_process_question(question, include_np=include_np), limit=n_docs_considered)
doc_results = self.search(_process_question(query if query is not None else question, include_np=include_np), limit=n_docs_considered)
if not doc_results:
warnings.warn(&#39;No documents matched words in question&#39;)
warnings.warn(&#39;No documents matched words in question (or query if supplied)&#39;)
return []
# extract paragraphs as contexts
contexts = []
@@ -904,13 +906,14 @@ <h3>Subclasses</h3>
<h3>Methods</h3>
<dl>
<dt id="ktrain.text.qa.core.QA.ask"><code class="name flex">
<span>def <span class="ident">ask</span></span>(<span>self, question, batch_size=8, n_docs_considered=10, n_answers=50, rerank_threshold=0.015, include_np=False)</span>
<span>def <span class="ident">ask</span></span>(<span>self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50, rerank_threshold=0.015, include_np=False)</span>
</code></dt>
<dd>
<div class="desc"><pre><code>submit question to obtain candidate answers

Args:
question(str): question in the form of a string
query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
batch_size(int): number of question-context pairs fed to model at each iteration
Default:8
Increase for faster answer-retrieval.
@@ -938,14 +941,15 @@ <h3>Methods</h3>
<summary>
<span>Expand source code</span>
</summary>
<pre><code class="python">def ask(self, question, batch_size=8, n_docs_considered=10, n_answers=50,
<pre><code class="python">def ask(self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50,
rerank_threshold=0.015, include_np=False):
&#34;&#34;&#34;
```
submit question to obtain candidate answers

Args:
question(str): question in the form of a string
query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
batch_size(int): number of question-context pairs fed to model at each iteration
Default:8
Increase for faster answer-retrieval.
@@ -974,9 +978,9 @@ <h3>Methods</h3>
paragraphs = []
refs = []
#doc_results = self.search(question, limit=n_docs_considered)
doc_results = self.search(_process_question(question, include_np=include_np), limit=n_docs_considered)
doc_results = self.search(_process_question(query if query is not None else question, include_np=include_np), limit=n_docs_considered)
if not doc_results:
warnings.warn(&#39;No documents matched words in question&#39;)
warnings.warn(&#39;No documents matched words in question (or query if supplied)&#39;)
return []
# extract paragraphs as contexts
contexts = []