Merge branch 'develop'

amaiya · Jul 16, 2021 · e68c0af
2 parents ee9e601 + 90e47b0
commit e68c0af
Showing 18 changed files with 109 additions and 107 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,19 @@ Most recent releases are shown at the top. Each release shows:
 - **Changed**: Additional parameters, changes to inputs or outputs, etc
 - **Fixed**: Bug fixes that don't change documented behaviour
 
+## 0.26.5 (2021-07-15)
+
+### New:
+- N/A
+
+### Changed
+- added `query` parameter to `SimpleQA.ask` so that an alternative query can be used to retrieve contexts from corpus
+- added `chardet` as dependency for `stellargraph`
+
+### Fixed:
+- fixed issue with `TopicModel.build` when `threshold=None`
+
+
 ## 0.26.4 (2021-06-23)
 
 ### New:

diff --git a/README.md b/README.md
@@ -10,65 +10,8 @@
 
 
 ### News and Announcements
-- **2021-03-10:**
-  - ***ktrain*** **v0.26.x is released** and now supports `transformers>=4.0.0`.  
-Note that, `transformers>=4.0.0` included a complete reogranization of the module's structure. This means that, if you saved a **transformers**-based `Predictor` (e.g., DistilBERT) in an older version of **ktrain** and **transformers**, you will need to either generate a new `tf_model.preproc` file or manually edit the existing `tf_model.preproc` file before loading the `predictor` in the latest versions of **ktrain** and **transformers**.  
-For instance, suppose you trained a DistilBERT model and saved the resultant predictor using an older version of **ktrain** with: `predictor.save('/tmp/my_predictor/')`.  After upgrading to the newest version of **ktrain**,  you will find that `ktrain.load_predictor('/tmp/my_predictor`) will throw an error unless you follow one of the two approaches below:  
-
-    **Approach 1: Manually edit `tf_model.preproc` file:**  
-    Open `tf_model.preproc` with an editor like **vim** and edit it to replace old module locations with new module locations (example changes for a DistilBERT model shown below):  
-    ```python
-    # change transformers.configuration_distilbert to transformers.models.distilbert.configuration_distilbert
-    # change transformers.modeling_tf_auto to transformers.models.auto.modeling_tf_auto
-    # change transformers.tokenization_auto to transformers.models.auto.tokenization_auto  
-    ```
-    The above was confirmed to work using the **vim** editor on Linux.   
-
-    **Approach 2: Re-generate `tf_model.preproc` file**:  
-	```python
-	# Step 1: Re-create a Preprocessor instance
-	# NOTES:
-	# 1. If training set is large, you can use a sample containing at least one example for each class
-	# 2. Labels must be in same format as you originally used
-	# 3. If original training set is not easily accessible, set preproc.preprocess_train_called=True 
-	#    below instead of invoking preproc.preprocess_train(x_train, y_train)
-
-	preproc = text.Transformer(MODEL_NAME, maxlen=500, class_names=class_names)
-	trn = preproc.preprocess_train(x_train, y_train)
-
-	# Step 2: load the transformers model from predictor folder
-	from transformers import *
-	model = TFAutoModelForSequenceClassification.from_pretrained('/tmp/my_predictor/')
-
-	# Step 3: re-create/re-save Predictor
-	predictor = ktrain.get_predictor(model, preproc)
-	predictor.save('/tmp/my_new_predictor')
-	```
-  - If you're using PyTorch 1.8 or above with **ktrain**, you will need to upgrade to `ktrain>=0.26.0`. If you're using `ktrain<0.26.0`, then you will have to downgrade PyTorch with: `pip install torch==1.7.1`.
-- **2020-11-08:**
-  - ***ktrain*** **v0.25.x is released** and includes out-of-the-box support for text extraction via the [textract](https://pypi.org/project/textract/) package . This, for example,
-can be used in the `SimpleQA.index_from_folder` method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files.   See the [Question-Answering example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/question_answering_with_bert.ipynb) for more information.
-```python
-# End-to-End Question-Answering in ktrain
-
-# index documents of different types into a built-in search engine
-from ktrain import text
-INDEXDIR = '/tmp/myindex'
-text.SimpleQA.initialize_index(INDEXDIR)
-corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
-text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
-                              multisegment=True, procs=4, # these args speed up indexing
-                              breakup_docs=True)          # this slows indexing but speeds up answer retrieval
-
-# ask questions (setting higher batch size can further speed up answer retrieval)
-qa = text.SimpleQA(INDEXDIR)
-answers = qa.ask('What is ktrain?', batch_size=8)
-
-# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
-#   "ktrain is a low-code platform for machine learning"
-```
-- **2020-11-04**
-  - ***ktrain*** **v0.24.x is released** and now includes built-in support for exporting models to [ONNX](https://onnx.ai/) and  [TensorFlow Lite](https://www.tensorflow.org/lite).    See the [example notebook](https://github.com/amaiya/ktrain/blob/develop/examples/text/ktrain-ONNX-TFLite-examples.ipynb) for more information.
+- **2021-07-15**
+  - **ktrain** was used to train machine learning models for [CoronaCentral.ai](https://coronacentral.ai/), a machine-learning-enhanced search engine for COVID publications at Stanford University. The CoronaCentral document classifier, **CoronaBERT**, is [available on the Hugging Face model hub](https://huggingface.co/jakelever/coronabert).  CoronaCentral.ai was developed by Jake Lever and Russ Altman and funded by the Chan Zuckerberg Biohub. Check out [their paper](https://www.biorxiv.org/content/10.1101/2020.12.21.423860v1).
 ----
 
 ### Overview
@@ -360,8 +303,8 @@ The above should be all you need on Linux systems and cloud computing environmen
 - Since some **ktrain** dependencies have not yet been migrated to `tf.keras` in TensorFlow 2 (or may have other issues), 
   **ktrain** is temporarily using forked versions of some libraries. Specifically, **ktrain** uses forked versions of the `eli5` and `stellargraph` libraries.  If not installed, **ktrain** will complain  when a method or function needing either of these libraries is invoked.  To install these forked versions, you can do the following:
 ```
-pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
-pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082
+pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip
+pip install https://github.com/amaiya/stellargraph/archive/refs/heads/no_tf_dep_082.zip
 ```
 
 This code was tested on Ubuntu 18.04 LTS using TensorFlow 2.3.1 and Python 3.6.9.

diff --git a/docs/imports.html b/docs/imports.html
@@ -266,7 +266,7 @@ <h1 class="title">Module <code>ktrain.imports</code></h1>
 
 SG_ERRMSG = &#39;ktrain currently uses a forked version of stellargraph v0.8.2. &#39;+\
             &#39;Please install with: &#39;+\
-            &#39;pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082&#39;
+            &#39;pip install https://github.com/amaiya/stellargraph/archive/refs/heads/no_tf_dep_082.zip&#39;
 
 ALLENNLP_ERRMSG  = &#39;To use ELMo embedings, please install allenlp:\n&#39; +\
                    &#39;pip install allennlp&#39;

diff --git a/docs/text/eda.html b/docs/text/eda.html
@@ -297,7 +297,12 @@ <h1 class="title">Module <code>ktrain.text.eda</code></h1>
             threshold (float): If not None, documents with whose highest topic probability
                                is less than threshold are filtered out.
         &#34;&#34;&#34;
-        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        if threshold is not None:
+            doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        else:
+            doc_topics = self.predict(texts)
+            bool_array = np.array([True] * len(texts))
+
         self.doc_topics = doc_topics
         self.bool_array = bool_array
 
@@ -1109,7 +1114,12 @@ <h2 id="args">Args</h2>
             threshold (float): If not None, documents with whose highest topic probability
                                is less than threshold are filtered out.
         &#34;&#34;&#34;
-        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        if threshold is not None:
+            doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        else:
+            doc_topics = self.predict(texts)
+            bool_array = np.array([True] * len(texts))
+
         self.doc_topics = doc_topics
         self.bool_array = bool_array
 
@@ -1652,7 +1662,12 @@ <h2 id="args">Args</h2>
         threshold (float): If not None, documents with whose highest topic probability
                            is less than threshold are filtered out.
     &#34;&#34;&#34;
-    doc_topics, bool_array = self.predict(texts, threshold=threshold)
+    if threshold is not None:
+        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+    else:
+        doc_topics = self.predict(texts)
+        bool_array = np.array([True] * len(texts))
+
     self.doc_topics = doc_topics
     self.bool_array = bool_array
 
@@ -2949,7 +2964,12 @@ <h2 id="args">Args</h2>
             threshold (float): If not None, documents with whose highest topic probability
                                is less than threshold are filtered out.
         &#34;&#34;&#34;
-        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        if threshold is not None:
+            doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        else:
+            doc_topics = self.predict(texts)
+            bool_array = np.array([True] * len(texts))
+
         self.doc_topics = doc_topics
         self.bool_array = bool_array
 
@@ -3492,7 +3512,12 @@ <h2 id="args">Args</h2>
         threshold (float): If not None, documents with whose highest topic probability
                            is less than threshold are filtered out.
     &#34;&#34;&#34;
-    doc_topics, bool_array = self.predict(texts, threshold=threshold)
+    if threshold is not None:
+        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+    else:
+        doc_topics = self.predict(texts)
+        bool_array = np.array([True] * len(texts))
+
     self.doc_topics = doc_topics
     self.bool_array = bool_array
 

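Note on the `threshold=None` fix in the hunks above: the change suggests that `predict` returns a single value when no threshold is given and a pair when one is, so the old unconditional two-way unpacking failed for `threshold=None`. The sketch below reproduces that pattern with a toy class. `ToyTopicModel` and its fake probabilities are hypothetical stand-ins for illustration only, not ktrain's `TopicModel`.

```python
import numpy as np


class ToyTopicModel:
    """Hypothetical stand-in used only to illustrate the branching fix."""

    def predict(self, texts, threshold=None):
        # Fake topic probabilities: one row per text, two topics (assumption).
        probs = np.array(
            [[0.9, 0.1] if "a" in t else [0.55, 0.45] for t in texts]
        )
        if threshold is None:
            return probs  # single return value when not filtering
        keep = probs.max(axis=1) >= threshold
        return probs[keep], keep  # pair when filtering

    def build(self, texts, threshold=None):
        # Same shape as the patched code: unpack two values only if filtering.
        if threshold is not None:
            doc_topics, bool_array = self.predict(texts, threshold=threshold)
        else:
            doc_topics = self.predict(texts)
            bool_array = np.array([True] * len(texts))
        self.doc_topics = doc_topics
        self.bool_array = bool_array


model = ToyTopicModel()
model.build(["alpha", "xyz"])                 # keeps every document
model.build(["alpha", "xyz"], threshold=0.6)  # filters low-confidence documents
```

With `threshold=None`, `bool_array` is all `True`, so downstream code that masks documents with it keeps the whole corpus.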
diff --git a/docs/text/index.html b/docs/text/index.html
@@ -2551,7 +2551,12 @@ <h2 id="args">Args</h2>
             threshold (float): If not None, documents with whose highest topic probability
                                is less than threshold are filtered out.
         &#34;&#34;&#34;
-        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        if threshold is not None:
+            doc_topics, bool_array = self.predict(texts, threshold=threshold)
+        else:
+            doc_topics = self.predict(texts)
+            bool_array = np.array([True] * len(texts))
+
         self.doc_topics = doc_topics
         self.bool_array = bool_array
 
@@ -3094,7 +3099,12 @@ <h2 id="args">Args</h2>
         threshold (float): If not None, documents with whose highest topic probability
                            is less than threshold are filtered out.
     &#34;&#34;&#34;
-    doc_topics, bool_array = self.predict(texts, threshold=threshold)
+    if threshold is not None:
+        doc_topics, bool_array = self.predict(texts, threshold=threshold)
+    else:
+        doc_topics = self.predict(texts)
+        bool_array = np.array([True] * len(texts))
+
     self.doc_topics = doc_topics
     self.bool_array = bool_array
 

diff --git a/docs/text/predictor.html b/docs/text/predictor.html
@@ -146,12 +146,12 @@ <h1 class="title">Module <code>ktrain.text.predictor</code></h1>
             from eli5.lime import TextExplainer
         except:
             msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. &#39;+\
-                  &#39;Install with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
+                  &#39;Install with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
             warnings.warn(msg)
             return
         if not hasattr(eli5, &#39;KTRAIN_ELI5_TAG&#39;) or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
             msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. &#39;+\
-                  &#39;Uninstall the current version and install/re-install the fork with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
+                  &#39;Uninstall the current version and install/re-install the fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
             warnings.warn(msg)
             return
 
@@ -389,12 +389,12 @@ <h2 class="section-title" id="header-classes">Classes</h2>
             from eli5.lime import TextExplainer
         except:
             msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. &#39;+\
-                  &#39;Install with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
+                  &#39;Install with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
             warnings.warn(msg)
             return
         if not hasattr(eli5, &#39;KTRAIN_ELI5_TAG&#39;) or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
             msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. &#39;+\
-                  &#39;Uninstall the current version and install/re-install the fork with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
+                  &#39;Uninstall the current version and install/re-install the fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
             warnings.warn(msg)
             return
 
@@ -548,12 +548,12 @@ <h2 id="args">Args</h2>
         from eli5.lime import TextExplainer
     except:
         msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. &#39;+\
-              &#39;Install with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
+              &#39;Install with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
         warnings.warn(msg)
         return
     if not hasattr(eli5, &#39;KTRAIN_ELI5_TAG&#39;) or eli5.KTRAIN_ELI5_TAG != KTRAIN_ELI5_TAG:
         msg = &#39;ktrain requires a forked version of eli5 to support tf.keras. It is either missing or not up-to-date. &#39;+\
-              &#39;Uninstall the current version and install/re-install the fork with: pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1&#39;
+              &#39;Uninstall the current version and install/re-install the fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip&#39;
         warnings.warn(msg)
         return
 

diff --git a/docs/text/qa/core.html b/docs/text/qa/core.html
@@ -216,14 +216,15 @@ <h1 class="title">Module <code>ktrain.text.qa.core</code></h1>
 
 
 
-    def ask(self, question, batch_size=8, n_docs_considered=10, n_answers=50, 
+    def ask(self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50, 
             rerank_threshold=0.015, include_np=False):
         &#34;&#34;&#34;
         ```
         submit question to obtain candidate answers
 
         Args:
           question(str): question in the form of a string
+          query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
           batch_size(int):  number of question-context pairs fed to model at each iteration
                             Default:8
                             Increase for faster answer-retrieval.
@@ -252,9 +253,9 @@ <h1 class="title">Module <code>ktrain.text.qa.core</code></h1>
         paragraphs = []
         refs = []
         #doc_results = self.search(question, limit=n_docs_considered)
-        doc_results = self.search(_process_question(question, include_np=include_np), limit=n_docs_considered)
+        doc_results = self.search(_process_question(query if query is not None else question, include_np=include_np), limit=n_docs_considered)
         if not doc_results: 
-            warnings.warn(&#39;No documents matched words in question&#39;)
+            warnings.warn(&#39;No documents matched words in question (or query if supplied)&#39;)
             return []
         # extract paragraphs as contexts
         contexts = []
@@ -755,14 +756,15 @@ <h2 class="section-title" id="header-classes">Classes</h2>
 
 
 
-    def ask(self, question, batch_size=8, n_docs_considered=10, n_answers=50, 
+    def ask(self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50, 
             rerank_threshold=0.015, include_np=False):
         &#34;&#34;&#34;
         ```
         submit question to obtain candidate answers
 
         Args:
           question(str): question in the form of a string
+          query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
           batch_size(int):  number of question-context pairs fed to model at each iteration
                             Default:8
                             Increase for faster answer-retrieval.
@@ -791,9 +793,9 @@ <h2 class="section-title" id="header-classes">Classes</h2>
         paragraphs = []
         refs = []
         #doc_results = self.search(question, limit=n_docs_considered)
-        doc_results = self.search(_process_question(question, include_np=include_np), limit=n_docs_considered)
+        doc_results = self.search(_process_question(query if query is not None else question, include_np=include_np), limit=n_docs_considered)
         if not doc_results: 
-            warnings.warn(&#39;No documents matched words in question&#39;)
+            warnings.warn(&#39;No documents matched words in question (or query if supplied)&#39;)
             return []
         # extract paragraphs as contexts
         contexts = []
@@ -904,13 +906,14 @@ <h3>Subclasses</h3>
 <h3>Methods</h3>
 <dl>
 <dt id="ktrain.text.qa.core.QA.ask"><code class="name flex">
-<span>def <span class="ident">ask</span></span>(<span>self, question, batch_size=8, n_docs_considered=10, n_answers=50, rerank_threshold=0.015, include_np=False)</span>
+<span>def <span class="ident">ask</span></span>(<span>self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50, rerank_threshold=0.015, include_np=False)</span>
 </code></dt>
 <dd>
 <div class="desc"><pre><code>submit question to obtain candidate answers
 
 Args:
   question(str): question in the form of a string
+  query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
   batch_size(int):  number of question-context pairs fed to model at each iteration
                     Default:8
                     Increase for faster answer-retrieval.
@@ -938,14 +941,15 @@ <h3>Methods</h3>
 <summary>
 <span>Expand source code</span>
 </summary>
-<pre><code class="python">def ask(self, question, batch_size=8, n_docs_considered=10, n_answers=50, 
+<pre><code class="python">def ask(self, question, query=None, batch_size=8, n_docs_considered=10, n_answers=50, 
         rerank_threshold=0.015, include_np=False):
     &#34;&#34;&#34;
     ```
     submit question to obtain candidate answers
 
     Args:
       question(str): question in the form of a string
+      query(str): Optional. If not None, words in query will be used to retrieve contexts instead of words in question
       batch_size(int):  number of question-context pairs fed to model at each iteration
                         Default:8
                         Increase for faster answer-retrieval.
@@ -974,9 +978,9 @@ <h3>Methods</h3>
     paragraphs = []
     refs = []
     #doc_results = self.search(question, limit=n_docs_considered)
-    doc_results = self.search(_process_question(question, include_np=include_np), limit=n_docs_considered)
+    doc_results = self.search(_process_question(query if query is not None else question, include_np=include_np), limit=n_docs_considered)
     if not doc_results: 
-        warnings.warn(&#39;No documents matched words in question&#39;)
+        warnings.warn(&#39;No documents matched words in question (or query if supplied)&#39;)
         return []
     # extract paragraphs as contexts
     contexts = []
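
Note on the `query` fallback in the hunks above: the patched `ask` passes `query if query is not None else question` to the retrieval step. A minimal sketch of that selection logic follows. `retrieval_text` is a hypothetical helper name introduced here for illustration and is not part of ktrain.

```python
def retrieval_text(question, query=None):
    """Pick the text used to retrieve contexts (hypothetical helper).

    Mirrors the expression in the patch: the explicit `is not None` check
    means an empty-string query is still used rather than silently
    replaced by the question.
    """
    return query if query is not None else question


# No query supplied: the question itself drives retrieval.
default_text = retrieval_text("What is ktrain?")

# Query supplied: retrieval uses the query, while the question is
# presumably still what gets answered against the retrieved contexts.
custom_text = retrieval_text("What is ktrain?", query="ktrain library")
```

Based on the new signature shown in the diff, a caller would pass the alternative text as `ask(question, query=...)`.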