# Question Answering
- This notebook extracts exact answers to queries from long natural-language documents (CARTA contracts).

- It uses a pretrained BERT-based question-answering model from the Hugging Face library, which was fine-tuned on the SQuAD dataset.

- If supervised training data is available, the model can be fine-tuned further using techniques similar to those in train_NER.ipynb.


### Example: 

context = "New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland."

questions = "How many people live in New Zealand?", "What's the largest city?"

Answers = 4.9 million, Auckland

- Online Demo of BERT-based QA model: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad
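
To make "extracting an exact answer" concrete: a SQuAD-style BERT model scores every token of the context as a possible answer start and as a possible answer end, and the answer is the span with the best combined score. The sketch below shows only that selection step in plain Python. The function name `best_span`, the `max_answer_len` parameter, and the toy scores are assumptions made for illustration; they are not real model output and not part of this notebook's modules.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices that maximize
    start_scores[start] + end_scores[end], with start <= end and
    at most max_answer_len tokens in the span."""
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy tokens and made-up scores (illustrative only, not real model output).
tokens = ["a", "population", "of", "4.9", "million", "."]
start_scores = [0.1, 0.3, 0.0, 5.2, 1.0, -2.0]
end_scores = [-1.0, 0.2, 0.0, 1.5, 6.1, 0.3]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Requiring `start <= end` matters: taking the highest start score and the highest end score independently can produce an end that precedes the start.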

### Note:
- This notebook is almost the same as *run_QA.ipynb*, but here the data path contains just one document and we run a single query for faster execution.

- We also load local model files downloaded from the link above instead of having Hugging Face download the model at runtime.

Final Intern presentation: https://docs.google.com/presentation/d/10mXA7K5sa_nAkqx2onsIfrH3TPj2Ni4LfCOxDhN5XBI/edit?usp=sharing


In [1]:
import os
import importlib
import logging
import time
importlib.reload(logging)
import framework
importlib.reload(framework)
import bert_qa
importlib.reload(bert_qa)
import infer_bert_qa
importlib.reload(infer_bert_qa)
import bert_utils
importlib.reload(bert_utils)
import pandas as pd
from framework import DataCuration, FeatureEngineering
from bert_qa import TaskQA, FeatureEngineeringQA, BERTQA

# Define some constants and configurations
logging.getLogger().setLevel(logging.INFO)
ACCESS_TOKEN = 'WUpGevbWC9lsnTW8quNUtmWRdAEM89'

Using TensorFlow backend.


## Specify Task
- Specify the task configuration and create a task object

In [2]:
DATASET = 'carta' # supports w2 and resume
TASK_CONFIG = {
    'task': 'qa'
}

task = TaskQA(TASK_CONFIG)

## Curate dataset
- Specify the dataset paths. Paths can be local or on Instabase drives (set *is_local* accordingly).
- Limit the data path to a single document (for the purposes of this demo).
- Also specify configuration such as file extensions and the column names to use as the index.
- Currently supports CSV format for goldens and ibocr/ibdoc for the dataset.
- Set *convert2txt* to extract and store raw texts.
- The resulting object can be queried via *data.golden*, *data.dataset*, or *data.dataset.texts*, as required.

### Files:
- Download the documents (ibocr files) from https://www.instabase.com/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/ and specify the local directory path. Set *is_local* to True.
- Alternatively, specify the Instabase drive path (/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/) and set *is_local* to False.

In [3]:
CARTA_DATA = [
   '/Users/ahsaasbajaj/Documents/Data/QA_model/data'
]

DATASET_CONFIG = {
    'path': CARTA_DATA,
    'is_local': True, 
    'file_type': 'ibocr',
    'identifier': lambda path: os.path.basename(path).split('.ibocr')[0],
    'convert2txt': True
}

CARTA_GOLDEN = None
GOLDEN_CONFIG = None

data = DataCuration(ACCESS_TOKEN, DATASET_CONFIG, GOLDEN_CONFIG)

INFO:root:Loading dataset from /Users/ahsaasbajaj/Documents/Data/QA_model/data
INFO:root:1 files loaded
INFO:root:Converting IBOCR/IBDOC to raw texts


### Print Data objects

In [4]:
data.dataset

{'annotated_AOI_4.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x156066668>}
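
The dictionary key shown above comes from the `identifier` function in DATASET_CONFIG, which strips the directory and the `.ibocr` suffix from each file path. A minimal sketch of that behaviour follows; the example path is hypothetical, and the assumption that the file on disk is named `annotated_AOI_4.pdf.ibocr` is inferred from the key, not confirmed by the notebook.

```python
import os

# Same expression as the 'identifier' entry of DATASET_CONFIG in this notebook.
identifier = lambda path: os.path.basename(path).split('.ibocr')[0]

# Hypothetical path, used only for illustration.
example_path = '/tmp/carta/annotated_AOI_4.pdf.ibocr'
doc_id = identifier(example_path)
print(doc_id)
```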

## Modeling (BERT Inference for question answering)
- Specify the model type and load the fine-tuned model for inference
- Specify the queries to run
- The model is a pretrained BERT QA model that was fine-tuned on the SQuAD dataset
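
BERT models of this kind accept at most 512 tokens per input, so a long contract cannot be passed to the model in one piece. A common approach is to split the document into overlapping windows and run the model on each one. The sketch below illustrates the idea on a whitespace-tokenized toy sentence. The function name `sliding_windows` and the parameters `max_len` and `stride` (here: the step between window starts) are assumptions for illustration; whether `bert_qa` splits documents exactly this way is not shown in this notebook.

```python
def sliding_windows(tokens, max_len, stride):
    """Split tokens into windows of at most max_len tokens.
    Consecutive windows start `stride` tokens apart, so they overlap
    by max_len - stride tokens. Returns a list of (start_index, window)."""
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows


# Toy sentence, tokenized on whitespace (real models use subword tokens).
tokens = "the company is authorized to issue two classes of stock".split()
windows = sliding_windows(tokens, max_len=4, stride=2)
for start, window in windows:
    print(start, window)
```

The overlap keeps an answer that straddles a window boundary fully inside at least one window, provided the answer is no longer than the overlap.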

### Specify TRAINING_ARGS
- Specify the model class so that the Hugging Face back-end libraries can load it appropriately
- Specify the path where the answers to the queries are written
- GPU use is also supported for the deep learning libraries

### Specify query (can be a single string or a list of queries) for question answering
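
One simple way for an API to accept either a single string or a list of queries is to normalize the input to a list first. The helper below is a hypothetical illustration of that pattern; `normalize_queries` is not a function of `bert_qa`, and how `BERTQA.predict` handles its argument internally is not shown in this notebook.

```python
def normalize_queries(query):
    """Return a list of query strings from a single string or an iterable of strings."""
    if isinstance(query, str):
        return [query]
    return list(query)


single = normalize_queries("What are the Preferred stocks?")
several = normalize_queries(("How many people live in New Zealand?",
                             "What's the largest city?"))
print(single)
print(several)
```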


### Model Files:
Set *model_file_or_path* in TRAINING_ARGS to bert-large-uncased-whole-word-masking-finetuned-squad to have Hugging Face download the checkpoint automatically at runtime.

Alternatively, follow the steps below:
- Download the files config.json, modelcard.json, pytorch_model.bin, and vocab.txt from https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad#list-files
- Rename the downloaded files to the exact names listed at the link above
- Specify the local directory containing these files in *model_file_or_path* of TRAINING_ARGS
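
Before pointing *model_file_or_path* at a local directory, it can help to confirm that the four files listed above are present under their exact names. The check below is a minimal sketch using only the standard library; the helper name `missing_model_files` is hypothetical, and the demonstration runs on a throwaway temporary directory, not on a real model directory.

```python
import os
import tempfile

# File names listed in the download step above.
REQUIRED_MODEL_FILES = ("config.json", "modelcard.json",
                        "pytorch_model.bin", "vocab.txt")


def missing_model_files(model_dir, required=REQUIRED_MODEL_FILES):
    """Return the required file names that are not present in model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


# Demonstration on a temporary directory that holds only two of the four files.
with tempfile.TemporaryDirectory() as model_dir:
    for name in ("config.json", "modelcard.json"):
        open(os.path.join(model_dir, name), "w").close()
    missing = missing_model_files(model_dir)

print(missing)
```

An empty list means all four files were found; any names returned are the ones that still need to be downloaded or renamed.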

#### This block of code gets the answer to *query* for the document in *CARTA_DATA*


In [6]:
query = "What are the Preferred stocks?"

NUM_FILES = len(data.dataset.keys())
stime = time.time()

DATA_ARGS = {
    'task': task,
    'dataset': data
}

TRAINING_ARGS = {
    'model_file_or_path': "/Users/ahsaasbajaj/Documents/Data/QA_model/model",
    'gpu': False,
    'output_dir': '../outputs/bert_qa'
}

model = BERTQA(DATA_ARGS, TRAINING_ARGS)
output = model.predict(query)

etime = time.time()
logging.info('Total time {} seconds'.format(etime - stime))

INFO:root: Total number of Files: 1
INFO:root:File name: annotated_AOI_4.pdf
convert squad examples to features: 100%|██████████| 1/1 [00:03<00:00,  3.28s/it]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2267.19it/s]
INFO:root:Total time 164.56057405471802 seconds
What are the Preferred stocks? : Series A Preferred Stock and Series A-1 Preferred Stock


### Print filename, questions and corresponding answers generated by the model

In [7]:
output = output.set_index('filename')
filenames = output.index.to_list()

for filename in filenames:
    print("filename: ", filename)

    for col in output.columns.to_list():
        print("query: ", col)
        answer = output.loc[filename, col]
        print("answer: ", answer)
    

filename:  annotated_AOI_4.pdf
query:  What are the Preferred stocks?
answer:  Series A Preferred Stock and Series A-1 Preferred Stock
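
The printing loop above implies a table with one row per file and one column per query. Assuming that layout, the same information can be collected into a nested dictionary for downstream use. The DataFrame below is a toy stand-in built from the values printed above, not the actual object returned by `model.predict`.

```python
import pandas as pd

# Toy table in the layout the printing loop expects (assumed layout).
output = pd.DataFrame({
    "filename": ["annotated_AOI_4.pdf"],
    "What are the Preferred stocks?": [
        "Series A Preferred Stock and Series A-1 Preferred Stock"
    ],
})

# One entry per file, mapping each query to its answer.
answers = output.set_index("filename").to_dict(orient="index")
print(answers["annotated_AOI_4.pdf"]["What are the Preferred stocks?"])
```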
