Commit

Merge branch 'develop'

amaiya committed Nov 8, 2020
2 parents 4c718bb + 017d16e commit e97ba9c
Showing 8 changed files with 91 additions and 23 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,18 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.25.0 (2020-11-08)

### New:
- The `SimpleQA.index_from_folder` method now supports text extraction from many file types including PDFs, MS Word documents, and MS PowerPoint files.

### Changed:
- The default in `SimpleQA.index_from_list` and `SimpleQA.index_from_folder` has been changed to `breakup_docs=True`.

### Fixed:
- N/A


## 0.24.2 (2020-11-07)

### New:
5 changes: 4 additions & 1 deletion FAQ.md
@@ -752,9 +752,12 @@ predictor = ktrain.load_predictor('/path/to/folder')
### How do I use ktrain with documents in PDF, DOC, or PPT formats?

If you have documents in formats like `.pdf`, `.docx`, or `.pptx` and want to use them in a training set or with various **ktrain** features
like question-answering or zero-shot-learning, they will need to be converted to plain text format first (i.e., `.txt` files). You can use the
like zero-shot-learning or text summarization, they will need to be converted to plain text format first (i.e., `.txt` files). You can use the
`ktrain.text.textutils.extract_copy` function to automatically do this. Alternatively, you can use other tools like [Apache Tika](https://tika.apache.org/) to do the conversion.
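
For example, a minimal sketch of the conversion step (the folder paths below are placeholders, and `textract` must be installed since `extract_copy` uses it under the hood):

```python
from ktrain.text import textutils

# extract plain text from the .pdf, .docx, and .pptx files in corpus_path
# and write the results as .txt files to output_path
textutils.extract_copy(corpus_path='/path/to/original/documents',
                       output_path='/path/to/txt/output')
```

The extracted `.txt` files can then be used for training or supplied to features like zero-shot-learning and text summarization.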

With respect to Question-Answering, the `SimpleQA.index_from_folder` method includes a `use_text_extraction` argument. When set to `True`, question-answering can be performed on document sets
composed of many different file types. More information on this is included in the [question-answering example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb).

[[Back to Top](#frequently-asked-questions-about-ktrain)]


25 changes: 14 additions & 11 deletions README.md
@@ -10,30 +10,32 @@


### News and Announcements
- **2020-11-04**
- ***ktrain*** **v0.24.x is released** and now includes built-in support for exporting models to [ONNX](https://onnx.ai/) and [TensorFlow Lite](https://www.tensorflow.org/lite). See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/ktrain-ONNX-TFLite-examples.ipynb) for more information.
- **2020-10-16:**
- ***ktrain*** **v0.23.x is released** with updates for compatibility with upcoming release of TensorFlow 2.4.
- **2020-10-06:**
- ***ktrain*** **v0.22.x is released** and includes enhancements to **end-to-end question-answering** such as significantly faster answer-retrieval. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb) for more information.
- **2020-11-08:**
  - ***ktrain*** **v0.25.x is released** and includes out-of-the-box support for text extraction via the [textract](https://pypi.org/project/textract/) package. This, for example,
can be used in the `SimpleQA.index_from_folder` method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files. See the [Question-Answering example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/question_answering_with_bert.ipynb) for more information.
```python
# End-to-End Question-Answering in ktrain

# index some documents into a built-in search engine
# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs), # docs is a list of strings
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True) # this slows indexing but speeds up answer retrieval

# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What causes computer images to be too dark?', batch_size=8)
answers = qa.ask('What is ktrain?', batch_size=8)

# top answer snippet:
# "if your viewer does not do gamma correction , then linear images will look too dark"
# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
# "ktrain is a low-code platform for machine learning"
```
- **2020-11-04:**
- ***ktrain*** **v0.24.x is released** and now includes built-in support for exporting models to [ONNX](https://onnx.ai/) and [TensorFlow Lite](https://www.tensorflow.org/lite). See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/ktrain-ONNX-TFLite-examples.ipynb) for more information.
- **2020-10-16:**
- ***ktrain*** **v0.23.x is released** with updates for compatibility with upcoming release of TensorFlow 2.4.
----

### Overview
@@ -52,6 +54,7 @@ answers = qa.ask('What causes computer images to be too dark?', batch_size=8)
- **Document Recommendation Engines and Semantic Searches**: given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- **Text Summarization**: summarize long documents with a pretrained BART model - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization_with_bart.ipynb)]</sup></sub>
- **End-to-End Question-Answering**: ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Easy-to-Use Built-In Search Engine**: perform keyword searches on large collections of documents <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Zero-Shot Learning**: classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
- **Language Translation**: translate text from one language to another <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/language_translation_example.ipynb)]</sup></sub>
- `vision` data:
26 changes: 25 additions & 1 deletion examples/text/question_answering_with_bert.ipynb
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For documents sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g.,, `.txt` files). If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can convert them to `.txt` files with tools like [Apache Tika](https://tika.apache.org/) or [textract](https://textract.readthedocs.io/en/stable/). You can also use the `ktrain.text.textutils.extract_copy` function, that will automatically use `textract` to extract plain text from your documents and copy them to a different directory.\n",
"For documents sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g.,, `.txt` files) by default. If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can supply the `use_text_extraction=True` argument to `index_from_folder`, which will use the [textract](https://textract.readthedocs.io/en/stable/) package to extract text from different file types and index this text into the search engine for answer rerieval. You can also manually convert them to `.txt` files with the `ktrain.text.textutils.extract_copy` or tools like [Apache Tika](https://tika.apache.org/) or [textract](https://textract.readthedocs.io/en/stable/). \n",
"\n",
"#### Speeding Up Indexing\n",
"By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`) with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`). These values can be changed to speedup indexing as arguments to `index_from_list` or `index_from_folder`. See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/batch.html) for more information on these parameters and how to use them to speedup indexing. In this case, we've used `multisegment=True` and `procs=4`.\n",
@@ -459,6 +459,30 @@
"```\n",
"See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/querylang.html) for more information on query syntax.\n",
"\n",
"### The `index_from_folder` method\n",
"\n",
"Earlier, we mentioned the `index_from_folder` method could be used to index documents of different file types (e.g., `.pdf`, `.docx`, `.ppt`, etc.). Here is a brief code example:\n",
"\n",
"```python\n",
"# index documents of different types into a built-in search engine\n",
"from ktrain import text\n",
"INDEXDIR = '/tmp/myindex'\n",
"text.SimpleQA.initialize_index(INDEXDIR)\n",
"corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files\n",
"text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction\n",
" multisegment=True, procs=4, # these args speed up indexing\n",
" breakup_docs=True) # speeds up answer retrieval\n",
"\n",
"# ask questions (setting higher batch size can further speed up answer retrieval)\n",
"qa = text.SimpleQA(INDEXDIR)\n",
"answers = qa.ask('What is ktrain?', batch_size=8)\n",
"\n",
"# top answer snippet extracted from https://arxiv.org/abs/2004.10703:\n",
"# \"ktrain is a low-code platform for machine learning\"\n",
"\n",
"\n",
"```\n",
"\n",
"\n",
"### Connecting the QA System to an Existing Search Engine\n",
"\n",
33 changes: 27 additions & 6 deletions ktrain/text/qa/core.py
@@ -365,7 +365,7 @@ def initialize_index(cls, index_dir):
return ix

@classmethod
def index_from_list(cls, docs, index_dir, commit_every=1024, breakup_docs=False,
def index_from_list(cls, docs, index_dir, commit_every=1024, breakup_docs=True,
procs=1, limitmb=256, multisegment=False, min_words=20, references=None):
"""
index documents from list.
@@ -433,8 +433,8 @@ def index_from_list(cls, docs, index_dir, commit_every=1024, breakup_docs=False,


@classmethod
def index_from_folder(cls, folder_path, index_dir, commit_every=1024, breakup_docs=False, min_words=20,
encoding='utf-8', procs=1, limitmb=256, multisegment=False, verbose=1):
def index_from_folder(cls, folder_path, index_dir, use_text_extraction=False, commit_every=1024, breakup_docs=True,
min_words=20, encoding='utf-8', procs=1, limitmb=256, multisegment=False, verbose=1):
"""
index all plain text documents within a folder.
The procs, limitmb, and especially multisegment arguments can be used to
@@ -444,6 +444,9 @@
Args:
folder_path(str): path to folder containing plain text documents (e.g., .txt files)
index_dir(str): path to index directory (see initialize_index)
use_text_extraction(bool): If True, the `textract` package will be used to index text from various
file types including PDF, MS Word, and MS PowerPoint (in addition to plain text files).
If False, only plain text files will be indexed.
commit_every(int): commit after adding this many documents
breakup_docs(bool): break up documents into smaller paragraphs and treat those as the documents.
This can potentially improve the speed at which answers are returned by the ask method
@@ -457,15 +460,33 @@ def index_from_folder(cls, folder_path, index_dir, commit_every=1024, breakup_d
verbose(bool): verbosity
"""
        if use_text_extraction:
            try:
                import textract
            except ImportError:
                raise Exception('use_text_extraction=True requires textract: pip install textract')


        if not os.path.isdir(folder_path): raise ValueError('folder_path is not a valid folder')
        if folder_path[-1] != os.sep: folder_path += os.sep
        ix = index.open_dir(index_dir)
        writer = ix.writer(procs=procs, limitmb=limitmb, multisegment=multisegment)
        for idx, fpath in enumerate(TU.extract_filenames(folder_path)):
            if not TU.is_txt(fpath): continue
            reference = "%s" % (fpath.join(fpath.split(folder_path)[1:]))
            with open(fpath, 'r', encoding=encoding) as f:
                doc = f.read()
            if TU.is_txt(fpath):
                with open(fpath, 'r', encoding=encoding) as f:
                    doc = f.read()
            else:
                if use_text_extraction:
                    try:
                        doc = textract.process(fpath)
                        doc = doc.decode('utf-8', 'ignore')
                    except:
                        if verbose:
                            warnings.warn('Could not extract text from %s' % (fpath))
                        continue
                else:
                    continue

            if breakup_docs:
                small_docs = TU.paragraph_tokenize(doc, join_sentences=True, lang='en')
9 changes: 7 additions & 2 deletions ktrain/text/textutils.py
@@ -67,8 +67,13 @@ def extract_copy(corpus_path, output_path):
def get_mimetype(filepath):
    return mimetypes.guess_type(filepath)[0]

def is_txt(filepath):
    return mimetypes.guess_type(filepath)[0] == 'text/plain'
def is_txt(filepath, strict=False):
    if strict:
        return mimetypes.guess_type(filepath)[0] == 'text/plain'
    else:
        mtype = get_mimetype(filepath)
        return mtype is not None and mtype.split('/')[0] == 'text'

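# A quick illustration of the relaxed check (a sketch with hypothetical filenames):
#   is_txt('report.csv')               # True: mimetypes yields 'text/csv', whose 'text' prefix passes
#   is_txt('report.csv', strict=True)  # False: strict mode accepts only 'text/plain'
#   is_txt('notes.txt')                # True in either mode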

def is_pdf(filepath):
    return mimetypes.guess_type(filepath)[0] == 'application/pdf'
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
__version__ = '0.24.2'
__version__ = '0.25.0'
2 changes: 1 addition & 1 deletion setup.py
@@ -49,7 +49,7 @@
#'stellargraph>=0.8.2', # forked version used by graph module
#'allennlp', # required for Elmo embeddings since TF2 TF_HUB does not work
#'textblob', # used by textutils.extract_noun_phrases
#'textract', # used by textutils.extract_copy
#'textract', # used by textutils.extract_copy and text.qa.core.SimpleQA
],
classifiers=[ # Optional
# How mature is this project? Common values are
