In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# QA-Based Information Extraction


The latest version of ktrain (v0.28.0), an open-source machine learning library, now includes a “universal” information extractor, which uses a Question-Answering model to extract any information of interest from documents.

Suppose you have a table (e.g., an Excel spreadsheet) that looks like the DataFrame below. (In this example, each document is a single sentence, but each row can potentially be an entire report with many paragraphs.)

In [2]:
data = [
'Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .',
'There is a risk of Donald Trump running again in 2024.',
"""This risk was consistent across patients stratified by history of CVD, risk factors 
but no CVD, and neither CVD nor risk factors.""",
"""Risk factors associated with subsequent death include older age, hypertension, diabetes, 
ischemic heart disease, obesity and chronic lung disease; however, sometimes 
there are no obvious risk factors .""",
'Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.',
'His speciality is medical risk assessments and is 30 years old.',
"""Results: A total of nine studies including 356 patients were included in this study, 
the mean age was 52.4 years and 221 (62.1%) were male."""]
import pandas as pd
pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(data, columns=['Text'])
df.head(10)

,Text
0,"Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) ."
1,There is a risk of Donald Trump running again in 2024.
2,"This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors."
3,"Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors ."
4,"Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia."
5,His speciality is medical risk assessments and is 30 years old.
6,"Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male."


Let's pretend your boss wants you to extract both the reported risk factors from each document and the sample sizes for the reported studies.  This can easily be accomplished with the `AnswerExtractor` in **ktrain**, a kind of universal information extractor based on a Question-Answering model.

In [3]:
from ktrain.text import AnswerExtractor
ae = AnswerExtractor()
df = ae.extract(df.Text.values, df, [('What are the risk factors?', 'Risk Factors'), 
                                     ('How many individuals in sample?', 'Sample Size')])
df.head(10)

,Text,Risk Factors,Sample Size
0,"Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .","sex, obesity, genetic factors and mechanical factors",
1,There is a risk of Donald Trump running again in 2024.,,
2,"This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors.",and neither cvd nor risk factors,
3,"Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors .","older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease",
4,"Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.","sex (male), age (≥60), and severe pneumonia",
5,His speciality is medical risk assessments and is 30 years old.,,
6,"Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male.",,356.0


As you can see, all that's required is that you phrase the type of information you want to extract as a question (e.g., *What are the risk factors?*) and provide a label (e.g., *Risk Factors*).  The above command returns a new DataFrame with additional columns containing the information of interest.

If there are false positives (or false negatives), you can adjust the `min_conf` parameter (i.e., the minimum confidence threshold) until you’re happy (default is `min_conf=6`).  If `return_conf=True`, then columns showing the confidence score of each extraction are also included.

In [4]:
del df['Risk Factors']
del df['Sample Size']
df = ae.extract(df.Text.values, df, [('What are the risk factors?', 'Risk Factors'), 
                                     ('How many individuals in sample?', 'Sample Size')], return_conf=True)
df.head(10)

,Text,Risk Factors,Risk Factors CONF,Sample Size,Sample Size CONF
0,"Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .","sex, obesity, genetic factors and mechanical factors",13.61,,5.47
1,There is a risk of Donald Trump running again in 2024.,,2.91,,-1.95
2,"This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors.",and neither cvd nor risk factors,7.61,,-3.74
3,"Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors .","older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease",15.07,,-10000.0
4,"Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.","sex (male), age (≥60), and severe pneumonia",12.65,,3.69
5,His speciality is medical risk assessments and is 30 years old.,,1.95,,2.41
6,"Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male.",,-0.8,356.0,12.05
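
The confidence scores above suggest how a stricter threshold would behave. Row 2's answer (*and neither cvd nor risk factors*) is a false positive with a confidence of 7.61, well below the scores of the genuine extractions. The sketch below is plain Python and only mimics the effect of a minimum-confidence cutoff on the scores shown in the table; it is not ktrain's internal implementation. The helper name `apply_min_conf` is hypothetical, and the "at least `min_conf`" comparison is an assumption.

```python
# Illustrative only: mimics the effect of a minimum-confidence threshold
# on already-scored answers. This is NOT ktrain's internal implementation.

# (answer, confidence) pairs for "What are the risk factors?", copied from
# the "Risk Factors" and "Risk Factors CONF" columns in the output above.
# None means no answer was returned for that row.
scored_answers = [
    ("sex, obesity, genetic factors and mechanical factors", 13.61),
    (None, 2.91),
    ("and neither cvd nor risk factors", 7.61),
    ("older age, hypertension, diabetes, ischemic heart disease, "
     "obesity and chronic lung disease", 15.07),
    ("sex (male), age (≥60), and severe pneumonia", 12.65),
    (None, 1.95),
    (None, -0.8),
]


def apply_min_conf(pairs, min_conf):
    """Hypothetical helper: keep an answer only if its confidence is
    at least min_conf (assumed rule); otherwise replace it with None."""
    return [
        answer if answer is not None and conf >= min_conf else None
        for answer, conf in pairs
    ]


# Try a stricter threshold than the default.
stricter = apply_min_conf(scored_answers, min_conf=8)
for row, answer in enumerate(stricter):
    print(row, answer)
```

With a threshold of 8, the spurious answer in row 2 is dropped while the extractions in rows 0, 3, and 4 are kept. In ktrain itself, the corresponding step would be to pass a higher `min_conf` value to `ae.extract` and inspect the resulting DataFrame.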
