Commit

Merge branch 'develop'

amaiya committed Nov 8, 2020
2 parents 4c718bb + 017d16e commit e97ba9c
Showing 8 changed files with 91 additions and 23 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,18 @@ Most recent releases are shown at the top. Each release shows:
- **Changed**: Additional parameters, changes to inputs or outputs, etc
- **Fixed**: Bug fixes that don't change documented behaviour

## 0.25.0 (2020-11-08)

### New:
- The `SimpleQA.index_from_folder` method now supports text extraction from many file types including PDFs, MS Word documents, and MS PowerPoint files.

### Changed:
- The default in `SimpleQA.index_from_list` and `SimpleQA.index_from_folder` has been changed to `breakup_docs=True`.

### Fixed:
- N/A


## 0.24.2 (2020-11-07)

### New:
5 changes: 4 additions & 1 deletion FAQ.md
@@ -752,9 +752,12 @@ predictor = ktrain.load_predictor('/path/to/folder')
### How do I use ktrain with documents in PDF, DOC, or PPT formats?

If you have documents in formats like `.pdf`, `.docx`, or `.pptx` and want to use them in a training set or with various **ktrain** features
like question-answering or zero-shot-learning, they will need to be converted to plain text format first (i.e., `.txt` files). You can use the
like zero-shot-learning or text summarization, they will need to be converted to plain text format first (i.e., `.txt` files). You can use the
`ktrain.text.textutils.extract_copy` function to automatically do this. Alternatively, you can use other tools like [Apache Tika](https://tika.apache.org/) to do the conversion.
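
For example, a minimal sketch of the conversion step (the folder paths below are placeholders, and `textract` must be installed since `extract_copy` uses it under the hood):

```python
from ktrain.text import textutils

# extract plain text from the .pdf, .docx, and .pptx files in corpus_path
# and write the results as .txt files to output_path
textutils.extract_copy(corpus_path='/path/to/original/documents',
                       output_path='/path/to/txt/output')
```

The extracted `.txt` files can then be used for training or supplied to features like zero-shot-learning and text summarization.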

With respect to Question-Answering, the `SimpleQA.index_from_folder` method includes a `use_text_extraction` argument. When set to `True`, question-answering can be performed on document sets
composed of many different file types. More information on this is included in the [question-answering example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb).

[[Back to Top](#frequently-asked-questions-about-ktrain)]


25 changes: 14 additions & 11 deletions README.md
@@ -10,30 +10,32 @@


### News and Announcements
- **2020-11-04**
- ***ktrain*** **v0.24.x is released** and now includes built-in support for exporting models to [ONNX](https://onnx.ai/) and [TensorFlow Lite](https://www.tensorflow.org/lite). See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/ktrain-ONNX-TFLite-examples.ipynb) for more information.
- **2020-10-16:**
- ***ktrain*** **v0.23.x is released** with updates for compatibility with upcoming release of TensorFlow 2.4.
- **2020-10-06:**
- ***ktrain*** **v0.22.x is released** and includes enhancements to **end-to-end question-answering** such as significantly faster answer-retrieval. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb) for more information.
- **2020-11-08:**
  - ***ktrain*** **v0.25.x is released** and includes out-of-the-box support for text extraction via the [textract](https://pypi.org/project/textract/) package. This, for example,
can be used in the `SimpleQA.index_from_folder` method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files. See the [Question-Answering example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/question_answering_with_bert.ipynb) for more information.
```python
# End-to-End Question-Answering in ktrain

# index some documents into a built-in search engine
# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs), # docs is a list of strings
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True) # this slows indexing but speeds up answer retrieval

# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What causes computer images to be too dark?', batch_size=8)
answers = qa.ask('What is ktrain?', batch_size=8)

# top answer snippet:
# "if your viewer does not do gamma correction , then linear images will look too dark"
# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
# "ktrain is a low-code platform for machine learning"
```
- **2020-11-04:**
- ***ktrain*** **v0.24.x is released** and now includes built-in support for exporting models to [ONNX](https://onnx.ai/) and [TensorFlow Lite](https://www.tensorflow.org/lite). See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/ktrain-ONNX-TFLite-examples.ipynb) for more information.
- **2020-10-16:**
- ***ktrain*** **v0.23.x is released** with updates for compatibility with upcoming release of TensorFlow 2.4.
----

### Overview
@@ -52,6 +54,7 @@ answers = qa.ask('What causes computer images to be too dark?', batch_size=8)
- **Document Recommendation Engines and Semantic Searches**: given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- **Text Summarization**: summarize long documents with a pretrained BART model - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization_with_bart.ipynb)]</sup></sub>
- **End-to-End Question-Answering**: ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Easy-to-Use Built-In Search Engine**: perform keyword searches on large collections of documents <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Zero-Shot Learning**: classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
- **Language Translation**: translate text from one language to another <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/language_translation_example.ipynb)]</sup></sub>
- `vision` data:
26 changes: 25 additions & 1 deletion examples/text/question_answering_with_bert.ipynb
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For documents sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g.,, `.txt` files). If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can convert them to `.txt` files with tools like [Apache Tika](https://tika.apache.org/) or [textract](https://textract.readthedocs.io/en/stable/). You can also use the `ktrain.text.textutils.extract_copy` function, that will automatically use `textract` to extract plain text from your documents and copy them to a different directory.\n",
"For documents sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g.,, `.txt` files) by default. If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can supply the `use_text_extraction=True` argument to `index_from_folder`, which will use the [textract](https://textract.readthedocs.io/en/stable/) package to extract text from different file types and index this text into the search engine for answer rerieval. You can also manually convert them to `.txt` files with the `ktrain.text.textutils.extract_copy` or tools like [Apache Tika](https://tika.apache.org/) or [textract](https://textract.readthedocs.io/en/stable/). \n",
"\n",
"#### Speeding Up Indexing\n",
"By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`) with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`). These values can be changed to speedup indexing as arguments to `index_from_list` or `index_from_folder`. See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/batch.html) for more information on these parameters and how to use them to speedup indexing. In this case, we've used `multisegment=True` and `procs=4`.\n",
@@ -459,6 +459,30 @@
"```\n",
"See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/querylang.html) for more information on query syntax.\n",
"\n",
"### The `index_from_folder` method\n",
"\n",
"Earlier, we mentioned the `index_from_folder` method could be used to index documents of different file types (e.g., `.pdf`, `.docx`, `.ppt`, etc.). Here is a brief code example:\n",
"\n",
"```python\n",
"# index documents of different types into a built-in search engine\n",
"from ktrain import text\n",
"INDEXDIR = '/tmp/myindex'\n",
"text.SimpleQA.initialize_index(INDEXDIR)\n",
"corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files\n",
"text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction\n",
" multisegment=True, procs=4, # these args speed up indexing\n",
" breakup_docs=True) # speeds up answer retrieval\n",
"\n",
"# ask questions (setting higher batch size can further speed up answer retrieval)\n",
"qa = text.SimpleQA(INDEXDIR)\n",
"answers = qa.ask('What is ktrain?', batch_size=8)\n",
"\n",
"# top answer snippet extracted from https://arxiv.org/abs/2004.10703:\n",
"# \"ktrain is a low-code platform for machine learning\"\n",
"\n",
"\n",
"```\n",
"\n",
"\n",
"### Connecting the QA System to an Existing Search Engine\n",
"\n",
33 changes: 27 additions & 6 deletions ktrain/text/qa/core.py
@@ -365,7 +365,7 @@ def initialize_index(cls, index_dir):
return ix

@classmethod
def index_from_list(cls, docs, index_dir, commit_every=1024, breakup_docs=False,
def index_from_list(cls, docs, index_dir, commit_every=1024, breakup_docs=True,
procs=1, limitmb=256, multisegment=False, min_words=20, references=None):
"""
index documents from list.
@@ -433,8 +433,8 @@ def index_from_list(cls, docs, index_dir, commit_every=1024, breakup_docs=False,


@classmethod
def index_from_folder(cls, folder_path, index_dir, commit_every=1024, breakup_docs=False, min_words=20,
encoding='utf-8', procs=1, limitmb=256, multisegment=False, verbose=1):
def index_from_folder(cls, folder_path, index_dir, use_text_extraction=False, commit_every=1024, breakup_docs=True,
min_words=20, encoding='utf-8', procs=1, limitmb=256, multisegment=False, verbose=1):
"""
index all plain text documents within a folder.
The procs, limitmb, and especially multisegment arguments can be used to
@@ -444,6 +444,9 @@
Args:
folder_path(str): path to folder containing plain text documents (e.g., .txt files)
index_dir(str): path to index directory (see initialize_index)
use_text_extraction(bool): If True, the `textract` package will be used to index text from various
file types including PDF, MS Word, and MS PowerPoint (in addition to plain text files).
If False, only plain text files will be indexed.
commit_every(int): commit after adding this many documents
breakup_docs(bool): break up documents into smaller paragraphs and treat those as the documents.
This can potentially improve the speed at which answers are returned by the ask method
@@ -457,15 +460,33 @@ def index_from_folder(cls, folder_path, index_dir, commit_every=1024, breakup_d
verbose(bool): verbosity
"""
        if use_text_extraction:
            try:
                import textract
            except ImportError:
                raise Exception('use_text_extraction=True requires textract: pip install textract')


        if not os.path.isdir(folder_path): raise ValueError('folder_path is not a valid folder')
        if folder_path[-1] != os.sep: folder_path += os.sep
        ix = index.open_dir(index_dir)
        writer = ix.writer(procs=procs, limitmb=limitmb, multisegment=multisegment)
        for idx, fpath in enumerate(TU.extract_filenames(folder_path)):
            if not TU.is_txt(fpath): continue
            reference = "%s" % (fpath.join(fpath.split(folder_path)[1:]))
            with open(fpath, 'r', encoding=encoding) as f:
                doc = f.read()
            if TU.is_txt(fpath):
                with open(fpath, 'r', encoding=encoding) as f:
                    doc = f.read()
            else:
                if use_text_extraction:
                    try:
                        doc = textract.process(fpath)
                        doc = doc.decode('utf-8', 'ignore')
                    except:
                        if verbose:
                            warnings.warn('Could not extract text from %s' % (fpath))
                        continue
                else:
                    continue

            if breakup_docs:
                small_docs = TU.paragraph_tokenize(doc, join_sentences=True, lang='en')
9 changes: 7 additions & 2 deletions ktrain/text/textutils.py
@@ -67,8 +67,13 @@ def extract_copy(corpus_path, output_path):
def get_mimetype(filepath):
    return mimetypes.guess_type(filepath)[0]

def is_txt(filepath):
    return mimetypes.guess_type(filepath)[0] == 'text/plain'
def is_txt(filepath, strict=False):
    if strict:
        return mimetypes.guess_type(filepath)[0] == 'text/plain'
    else:
        mtype = get_mimetype(filepath)
        return mtype is not None and mtype.split('/')[0] == 'text'

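# A quick illustration of the relaxed check (a sketch with hypothetical filenames):
#   is_txt('report.csv')               # True: mimetypes yields 'text/csv', whose 'text' prefix passes
#   is_txt('report.csv', strict=True)  # False: strict mode accepts only 'text/plain'
#   is_txt('notes.txt')                # True in either mode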

def is_pdf(filepath):
    return mimetypes.guess_type(filepath)[0] == 'application/pdf'
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
__version__ = '0.24.2'
__version__ = '0.25.0'
2 changes: 1 addition & 1 deletion setup.py
@@ -49,7 +49,7 @@
#'stellargraph>=0.8.2', # forked version used by graph module
#'allennlp', # required for Elmo embeddings since TF2 TF_HUB does not work
#'textblob', # used by textutils.extract_noun_phrases
#'textract', # used by textutils.extract_copy
#'textract', # used by textutils.extract_copy and text.qa.core.SimpleQA
],
classifiers=[ # Optional
# How mature is this project? Common values are
