Merge branch 'develop'

amaiya · Oct 13, 2021 · 72a0299
2 parents a882d96 + eace7d6
commit 72a0299
Showing 20 changed files with 1,567 additions and 47 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,19 @@ Most recent releases are shown at the top. Each release shows:
 - **Changed**: Additional parameters, changes to inputs or outputs, etc
 - **Fixed**: Bug fixes that don't change documented behaviour
 
+## 0.28.0 (2021-10-13)
+
+### New:
+- `text.AnswerExtractor` is a universal information extractor powered by a Question-Answering module and capable of extracting user-specified information from texts.
+- `text.TextExtractor` is a text extraction pipeline (e.g., for converting PDFs to plain text)
+
+### Changed
+- changed transformers pin to `transformers>=4.0.0,<=4.10.3`
+
+### Fixed:
+- N/A
+
+
 ## 0.27.3 (2021-09-03)
 
 ### New:
@@ -19,7 +32,6 @@ Most recent releases are shown at the top. Each release shows:
 - change API call to support newest `causalnlp`
 
 
-
 ## 0.27.2 (2021-07-28)
 
 ### New:

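Note on the new `transformers>=4.0.0,<=4.10.3` pin in the changelog entry above: the sketch below is not part of this commit. It only illustrates, under simplifying assumptions, which release numbers fall inside that range. The helper names `parse_version` and `satisfies_pin` are hypothetical, and the parser handles plain dotted numeric versions only (no pre-release or `.dev` suffixes); a real installer applies the full version-specifier rules.

```python
# Hypothetical helpers, for illustration only: check a dotted numeric
# version string against the range transformers>=4.0.0,<=4.10.3.

def parse_version(text):
    """Turn a dotted version string such as '4.10.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


LOWER = parse_version("4.0.0")   # inclusive lower bound from the pin
UPPER = parse_version("4.10.3")  # inclusive upper bound from the pin


def satisfies_pin(version):
    """Return True if `version` lies within the pinned range (inclusive)."""
    return LOWER <= parse_version(version) <= UPPER
```

Comparing tuples of integers rather than raw strings matters here: as strings, `"4.9.2"` would sort after `"4.10.3"`, whereas `(4, 9, 2)` correctly sorts before `(4, 10, 3)`.
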
diff --git a/FAQ.md b/FAQ.md
@@ -806,7 +806,7 @@ You can safely ignore the error, if it arises from downloading Hugging Face **tr
 
 If you have documents in formats like `.pdf`, `.docx`, or `.pptx` and want to use them in a training set or with various **ktrain** features 
 like zero-shot-learning or text summarization, they will need to be converted to plain text format first (i.e., `.txt` files).  You can use the
-`ktrain.text.textutils.extract_copy` function to automatically do this. Alternatively, you can use other tools like [Apache Tika](https://tika.apache.org/) to do the conversion.
+`ktrain.text.textutils.extract_copy` function to automatically do this.  As of v0.28.x of ktrain, the [TextExtractor](https://nbviewer.org/github/amaiya/ktrain/blob/develop/examples/text/text_extraction_example.ipynb) can also be used for conversion.  Alternatively, you can use other tools like [Apache Tika](https://tika.apache.org/) to do the conversion.
 
 With respect to Question-Answering, the `SimpleQA.index_from_folder` method includes a `use_text_extraction` argument.  When set to `True`, question-answering can be performed on document sets 
 comprised of many different file types. More information on this is included in the [question-answering example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb).

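Note on the FAQ text above: before converting anything, it can help to see which files in a folder are already plain text and which still need conversion. The sketch below is not part of this commit and does not call **ktrain**; it uses only the Python standard library. The function name `split_by_format` and the set of extensions (taken from the formats the FAQ mentions) are assumptions made for illustration.

```python
# Hypothetical helper, for illustration only: sort the files in a folder
# into those that are already plain text and those that need conversion.
from pathlib import Path

# Formats named in the FAQ as needing conversion to plain text first.
NEEDS_CONVERSION = {".pdf", ".docx", ".pptx"}


def split_by_format(folder):
    """Return (plain_text_files, files_to_convert) found under `folder`."""
    plain, to_convert = [], []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        suffix = path.suffix.lower()  # treat .PDF and .pdf alike
        if suffix == ".txt":
            plain.append(path)
        elif suffix in NEEDS_CONVERSION:
            to_convert.append(path)
        # any other file type is ignored in this sketch
    return plain, to_convert
```

The second list is what would be handed to a converter such as `extract_copy`, the `TextExtractor`, or Apache Tika.
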
diff --git a/README.md b/README.md
@@ -10,6 +10,31 @@
 
 
 ### News and Announcements
+- **2021-10-15**
+  - **ktrain v0.28.x** is released and now includes the `AnswerExtractor`, which allows you to extract any information of interest from documents by simply phrasing it in the form of a question. A short example is shown here, but see the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/qa_information_extraction.ipynb) for more information.
+```python
+# QA-Based Information Extraction
+
+# DataFrame BEFORE
+df.head()
+#     Text
+#0    Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.
+#1    His speciality is medical risk assessments, and he is 30 years old.
+#2    A total of nine studies including 356 patients were included in this study.
+
+# AnswerExtractor will create two new columns:  'Risk Factors' and 'Sample Size'
+from ktrain.text import AnswerExtractor
+ae = AnswerExtractor()
+df = ae.extract(df.Text.values, df, [('What are the risk factors?', 'Risk Factors'),
+                                     ('How many individuals in sample?', 'Sample Size')])
+
+# DataFrame AFTER
+df[['Risk Factors', 'Sample Size']].head()
+#     Risk Factors                                       Sample Size
+#0    sex (male), age (≥60), and severe pneumonia        None
+#1    None                                               None
+#2    None                                               356
+```
 - **2021-07-20**
   - **ktrain v0.27.x** is released and now supports causal inference using [meta-learners](https://arxiv.org/abs/1706.03461). See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/tabular/causal_inference_example.ipynb) for more information.
 - **2021-07-15**
@@ -35,6 +60,8 @@
      - **Easy-to-Use Built-In Search Engine**:  perform keyword searches on large collections of documents <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
      - **Zero-Shot Learning**:  classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
      - **Language Translation**:  translate text from one language to another <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/language_translation_example.ipynb)]</sup></sub>
+     - **Text Extraction**:  extract text from PDFs, Word documents, etc. <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_extraction_example.ipynb)]</sup></sub>
+     - **Universal Information Extraction**:  extract any kind of information from documents by simply phrasing it in the form of a question <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/qa_information_extraction.ipynb)]</sup></sub>
   - `vision` data:
     - **image classification** (e.g., [ResNet](https://arxiv.org/abs/1512.03385), [Wide ResNet](https://arxiv.org/abs/1605.07146), [Inception](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)) <sub><sup>[[example notebook](https://colab.research.google.com/drive/1WipQJUPL7zqyvLT10yekxf_HNMXDDtyR)]</sup></sub>
     - **image regression** for predicting numerical targets from photos (e.g., age prediction) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/vision/utk_faces_age_prediction-resnet50.ipynb)]</sup></sub>
@@ -314,6 +341,8 @@ pip install torch
 pip install shap
 # for ktrain.tabular.causal_inference_model
 pip install causalnlp
+# for ktrain.text.TextExtractor
+pip install textract
 ```
 If the above libraries are not installed, **ktrain** will complain when a method or function needing any of them is invoked.
 Notice that **ktrain** is using forked versions of the `eli5` and `stellargraph` libraries above in order to support TensorFlow2.

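Note on the optional installs in the README hunk above: since **ktrain** only complains about a missing optional library when the corresponding feature is invoked, it may be convenient to check up front. The sketch below is not part of this commit. The function name `missing_dependencies` is hypothetical, and it assumes that each pip package name listed here is also its import name.

```python
# Hypothetical helper, for illustration only: report which optional
# libraries are not importable in the current environment.
import importlib.util

# Import name -> ktrain feature that needs it (pairs taken from the README).
OPTIONAL_DEPENDENCIES = {
    "causalnlp": "ktrain.tabular.causal_inference_model",
    "textract": "ktrain.text.TextExtractor",
}


def missing_dependencies(requirements=OPTIONAL_DEPENDENCIES):
    """Return the subset of `requirements` whose module cannot be found."""
    return {
        name: feature
        for name, feature in requirements.items()
        if importlib.util.find_spec(name) is None  # None means not installed
    }


if __name__ == "__main__":
    for name, feature in missing_dependencies().items():
        print(f"{feature} needs: pip install {name}")
```

`importlib.util.find_spec` looks a top-level module up without importing it, so the check stays cheap even for heavy libraries.
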
diff --git a/examples/README.md b/examples/README.md
@@ -14,6 +14,8 @@ This directory contains various example notebooks using *ktrain*.  The directory
   - [Open-Domain Question-Answering](#textqa):  ask questions to a large text corpus and receive exact candidate answers
   - [Zero-Shot Learning](#zsl):  classify documents by user-supplied topics **without** any training examples
   - [Language Translation](#translation): an example of language translation using pretrained MarianMT models
+  - [Text Extraction](#textextraction): extract text from PDFs, Word documents, etc.
+  - [Universal Information Extraction](#extraction): an example of using a Question-Answering model for information extraction
 - `vision`:  
   - [image classification](#imageclass):  image classification examples using various models and datasets
   - [image regression](#imageregression): example of predicting numerical values purely from images/photos
@@ -142,6 +144,8 @@ The objective of the CoNLL2003 task is to classify sequences of words as belongi
 ### <a name="textqa"></a>Open-Domain Question-Answering: [question_answering_with_bert.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="zsl"></a>Zero-Shot Learning: [zero_shot_learning_with_nli.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="translation"></a>Language Translation: [language_translation_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
+### <a name="textextraction"></a>Text Extraction: [text_extraction_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
+### <a name="extraction"></a>Universal Information Extraction: [qa_information_extraction.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 
 
 ## Vision Data