# Question Answering
- This notebook extracts exact answers to queries from long natural-language documents (CARTA contracts).

- It uses a pretrained BERT-based question answering model from the Hugging Face library, fine-tuned on the SQuAD dataset.

- If supervised training data is available, the model can be fine-tuned further using techniques similar to those in train_NER.ipynb.

### Example: 

context = "New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland."

questions = "How many people live in New Zealand?", "What's the largest city?"

Answers = 4.9 million, Auckland
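
To make "extracts exact answers" concrete: a SQuAD-style BERT model scores every context token as a possible answer start and as a possible answer end, and the answer is the token span with the highest combined score. The sketch below shows only that span-selection step in plain Python. The scores are made up for illustration, and `best_span` and `max_answer_len` are hypothetical names, not part of this notebook's `bert_qa` module.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_answer_len: int = 15) -> Tuple[int, int]:
    """Return the (start, end) token indices that maximize
    start_scores[start] + end_scores[end], with start <= end and
    at most max_answer_len tokens in the span."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # The end token may not precede the start token or exceed the length cap.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = ["a", "population", "of", "4.9", "million", "."]
# Made-up scores for illustration; a real model emits one start score
# and one end score per token.
start_scores = [0.1, 0.3, 0.2, 4.0, 0.5, 0.0]
end_scores = [0.0, 0.2, 0.1, 1.0, 3.5, 0.3]

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])
print(answer)
```

Because the answer is always a span copied from the context, the model cannot produce text that does not appear in the document.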

### Documents:

- Online Demo of BERT-based QA model: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad

- Complete Results: https://instabase.atlassian.net/browse/INSIGHTS-1514

- Slides: https://docs.google.com/presentation/d/1aB6dWmaLYSilZQsnVwa2mH2Gu1jVCRV3FZ-YtqxsNqY/edit?usp=sharing

- Final Intern presentation: https://docs.google.com/presentation/d/10mXA7K5sa_nAkqx2onsIfrH3TPj2Ni4LfCOxDhN5XBI/edit?usp=sharing

In [2]:
import os
import importlib
import logging
import time
importlib.reload(logging)
import framework
importlib.reload(framework)
import bert_qa
importlib.reload(bert_qa)
import infer_bert_qa
importlib.reload(infer_bert_qa)
import bert_utils
importlib.reload(bert_utils)
import pandas as pd
from framework import DataCuration, FeatureEngineering
from bert_qa import TaskQA, FeatureEngineeringQA, BERTQA

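# Define some constants and configurations
logging.getLogger().setLevel(logging.INFO)
# Read the access token from the environment instead of hardcoding it in the notebook.
# IB_ACCESS_TOKEN is a hypothetical variable name; use whatever your setup defines.
ACCESS_TOKEN = os.environ.get('IB_ACCESS_TOKEN', '')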

Using TensorFlow backend.


## Specify Task
- Specify the task configuration and create a task object

In [3]:
DATASET = 'carta' # supports w2 and resume
TASK_CONFIG = {
    'task': 'qa'
}

task = TaskQA(TASK_CONFIG)

## Curate dataset
- Specify paths for the dataset and the goldens (if available). Paths can be local or on Instabase drives (controlled by *is_local*).
- Also specify settings such as file extensions and the column name to use as the index.
- Goldens are currently supported in csv format, and the dataset in ibocr/ibdoc format.
- Set *convert2txt* to extract and store raw texts.
- This block creates a DataCuration() object, which maps goldens to the dataset and removes any mismatches, producing a 1:1 mapping.
- The object can be queried with *data.golden*, *data.dataset*, or *data.dataset.texts*, as needed.
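
The 1:1 mapping described above can be pictured as keeping only the identifiers that appear on both sides. This is a minimal sketch of that idea, not the actual DataCuration implementation: `align` is a hypothetical helper, the paths and golden values are invented, and the identifier rule mirrors the `identifier` entry used in DATASET_CONFIG.

```python
import os
from typing import Dict, List, Tuple


def identifier(path: str) -> str:
    # Same rule as the 'identifier' lambda in DATASET_CONFIG:
    # strip the directory and everything from '.ibocr' onwards.
    return os.path.basename(path).split('.ibocr')[0]


def align(dataset_paths: List[str],
          golden_rows: Dict[str, dict]) -> Tuple[Dict[str, str], Dict[str, dict]]:
    """Keep only the files that have both a document and a golden row."""
    dataset = {identifier(p): p for p in dataset_paths}
    common = sorted(set(dataset) & set(golden_rows))
    return ({k: dataset[k] for k in common},
            {k: golden_rows[k] for k in common})


# Hypothetical inputs: one document has no golden, one golden has no document.
paths = ['/data/out/annotated_AOI_2.pdf.ibocr',
         '/data/out/annotated_AOI_9.pdf.ibocr']
goldens = {'annotated_AOI_2.pdf': {'Cumulative dividends': False},
           'annotated_AOI_3.pdf': {'Cumulative dividends': False}}

matched_docs, matched_goldens = align(paths, goldens)
print(list(matched_docs))
```

Only `annotated_AOI_2.pdf` survives here, since it is the only identifier present in both the dataset and the goldens.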

### Files:
- Download the documents (ibocr files) from https://www.instabase.com/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/ and specify the local directory path.
- Alternatively, specify the Instabase drive path (/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/) and set *is_local* to False.
- Manually created goldens for the sample annotated files are available here: https://docs.google.com/spreadsheets/u/2/d/1kT7suSh_261tiOGnxFTF-YB-DXy9lxJf4om6kUD3asE/edit#gid=0

In [4]:
CARTA_DATA = [
   '/Users/ahsaasbajaj/Documents/Data/CARTA/Annotated Samples/out/s1_process_files'
]
CARTA_GOLDEN = [
   '/Users/ahsaasbajaj/Documents/Data/CARTA/Annotated Samples/golden/output.csv'
]

GOLDEN_CONFIG = {
    'path': CARTA_GOLDEN,
    'is_local': True,
    'index_field_name':'filename',
    'file_type': 'csv',
    'identifier': 'file'
}
DATASET_CONFIG = {
    'path': CARTA_DATA,
    'is_local': True, 
    'file_type': 'ibocr',
    'identifier': lambda path: os.path.basename(path).split('.ibocr')[0],
    'convert2txt': True
}

data = DataCuration(ACCESS_TOKEN, DATASET_CONFIG, GOLDEN_CONFIG)

INFO:root:Loading dataset from /Users/ahsaasbajaj/Documents/Data/CARTA/Annotated Samples/out/s1_process_files
INFO:root:4 files loaded
INFO:root:Converting IBOCR/IBDOC to raw texts
INFO:root:Loading goldens from /Users/ahsaasbajaj/Documents/Data/CARTA/Annotated Samples/golden/output.csv
INFO:root:Total files Goldens: (4, 9)
INFO:root:Total files found in the source with unique index: (4, 9)


### Print Data objects and Goldens

In [5]:
data.dataset

{'annotated_AOI_2.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x154d63438>,
 'annotated_AOI_3.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x155670828>,
 'annotated_AOI_4.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x1549e4b70>,
 'annotated_AOI_5.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x1551938d0>}

In [6]:
data.golden

| filename | Number of authorized shares / share class | Number of authorized shares / preferred share type | Cumulative dividends | Dividend rate | Original Issue Price | Liquidation preference / preferred share type | Seniority (Preferred share class) | Participation (Preferred share class) | Conversion price (Preferred share class) |
|---|---|---|---|---|---|---|---|---|---|
| annotated_AOI_2.pdf | Common Stock: 15,442,630 shares; Preferred Sto... | Series A Preferred Stock: 3,899,551 shares; Se... | False | Series A Preferred Stock: 6% per annum; Series... | Series A Preferred Stock: $1.649 per share; Se... | Series A Preferred Stock: $1.649 per share; Se... | Series A Preferred Stock: 1; Series A-1 Prefer... |  | Series A Preferred Stock: $1.649 per share; Se... |
| annotated_AOI_3.pdf | Common Stock: 13,000,000 shares; Preferred Sto... | Series Seed-1 Preferred Stock: 910,000 shares;... | False | Series Seed-1 preferred stock: $0.0264 per sha... | Series Seed-1 preferred stock: $0.65 per share... | Series Seed-1 preferred stock: $0.33 per share... | Series Seed-1 preferred stock: 1; Series Seed-... |  | Series Seed-1 preferred stock: $0.65 per share... |
| annotated_AOI_4.pdf | Common Stock: 16,000,000 shares; Preferred Sto... | Series Seed Preferred Stock: 1,820,119 shares;... | False | Series A Preferred Stock: $0.1044 per annum pe... | Series Seed Preferred Stock: $0.795 per share;... | Series Seed Preferred Stock: $0.795 per share;... | Series Seed Preferred Stock: 1; Series A Prefe... | Series A Preferred Stock: $2.6098 per share | Series Seed Preferred Stock: $0.795 per share;... |
| annotated_AOI_5.pdf | Common Stock: 18,527,000 shares; Preferred Sto... | Series Seed Preferred Stock: 2,575,871 shares;... | False | Not defined | Series Seed Preferred Stock: $ 1.0676 per shar... | Series Seed Preferred Stock: $ 1.0676 per shar... | Series Seed Preferred Stock: 1; Series A Prefe... |  | Series Seed Preferred Stock: $ 1.0676 per shar... |


### Specify queries (for question answering)

In [7]:
open_queries = [ 
                "Who is incorporating the company?",
                "How many shares are being created?",
                "What are the number of authorized shares?",
                "What are the Preferred stocks?",
                "What are the Non-cumulative dividends?",
                "What are the Common stocks?",
                "What is the Dividend rate per annum per preferred share type?",
                "What is the original issue price per share?",
                "What is the seniority of preferred share?",
                "What is the liquidation preference?",
                "What is the conversion price"
                ]

closed_queries = [ 
                "The company is incorporated by",
                "The number of shares being created are",
                "The common stocks are",
                "The Preferred stocks are",
                "The Non-cumulative dividends are",
                "The Dividend rate per annum per preferred share type are",
                "The number of authorized shares are",
                "The Original Issue Price per share is",
                "The Liquidation preference is"
                ]

## Modeling (BERT Inference for question answering)
- Specify the model type and load the fine-tuned model for inference
- Specify the queries to run
- The model is a pretrained BERT QA model that was fine-tuned on standard datasets

### Specify TRAINING_ARGS
- Specify the model class so that the back-end Hugging Face libraries load it correctly
- Specify the path where the answers to the queries are written
- GPU use is also supported for the deep learning libraries

### Model Files:
Set *model_file_or_path* in TRAINING_ARGS to bert-large-uncased-whole-word-masking-finetuned-squad, and Hugging Face downloads the checkpoint automatically at runtime.

Alternatively, follow the steps below:
- Download the files (config.json, modelcard.json, pytorch_model.bin, vocab.txt) from https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad#list-files
- Rename the downloaded files so that they exactly match the names listed at the link above
- Set *model_file_or_path* in TRAINING_ARGS to the local directory containing these files
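
Before pointing *model_file_or_path* at a local directory, it can help to confirm that the four files named above are present under their exact names. The check below is a small sketch: `missing_model_files` is a hypothetical helper, not part of `bert_qa` or the Hugging Face libraries, and it runs here against a temporary directory instead of a real download.

```python
import tempfile
from pathlib import Path
from typing import List

# File names taken from the download step above.
REQUIRED_FILES = ("config.json", "modelcard.json", "pytorch_model.bin", "vocab.txt")


def missing_model_files(model_dir: str) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo on a throwaway directory that holds only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "vocab.txt"):
        (Path(tmp) / name).touch()
    missing = missing_model_files(tmp)

print(missing)
```

An empty list means every expected file was found. A non-empty list usually points to a file that was not downloaded or was saved under a different name.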


Note: The code below takes a long time to run because the documents are very large and the CARTA_DATA path contains several of them.

To run a small sample (one document, one query), see *run_QA_demo.ipynb*.
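
One likely contributor to the run time is that BERT checkpoints accept at most 512 tokens per input, so a long contract cannot be scored in a single pass. A common approach is to split the text into overlapping windows and run every query against every window. The sketch below illustrates that splitting step only. Whether `bert_qa` works this way is an assumption, and `sliding_windows`, `window`, and `overlap` are hypothetical names with illustrative defaults.

```python
from typing import List


def sliding_windows(tokens: List[str],
                    window: int = 384,
                    overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows of at most `window` tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows


# A stand-in document of 1000 tokens.
doc = [f"tok{i}" for i in range(1000)]
chunks = sliding_windows(doc)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps an answer that straddles a window boundary intact in at least one window. The cost is that the number of model calls grows with document length times the number of queries.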

In [None]:
NUM_FILES = len(data.dataset.keys())
stime = time.time()

DATA_ARGS = {
    'task': task,
    'dataset': data
}

queries = open_queries
TRAINING_ARGS = {
    # Hugging Face model name, or the path to a local directory holding the files downloaded from
    # https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad
    'model_file_or_path': 'bert-large-uncased-whole-word-masking-finetuned-squad',
    'gpu': False,
    'output_dir': '../outputs/bert_qa'
}

model = BERTQA(DATA_ARGS, TRAINING_ARGS)
output = model.predict(queries)

etime = time.time()
logging.info('Total time for {} files and {} queries each is {} seconds'.format(NUM_FILES, len(queries), (etime - stime)))

### Complete Results generated by this script are available here: https://instabase.atlassian.net/browse/INSIGHTS-1514