# Document Retrieval
- This notebook retrieves 'search results' for queries over long natural-language documents (CARTA contracts).

- Each long document is split into multiple segments; each segment is a candidate for retrieval.

- It works like a web search engine, returning the context surrounding the query terms. The algorithm matches each segment against the query and returns a ranked list of results (segments).

- Uses a text-matching algorithm, BM25: https://en.wikipedia.org/wiki/Okapi_BM25
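
For reference, the block below sketches the Okapi BM25 score of a segment $D$ for a query $Q$ with terms $q_1, \dots, q_n$, in the form given on the Wikipedia page linked above. The symbols are not defined elsewhere in this notebook: $f(q_i, D)$ is the count of $q_i$ in $D$, $|D|$ is the segment length in tokens, $\text{avgdl}$ is the mean segment length, $N$ is the number of segments, $n(q_i)$ is the number of segments containing $q_i$, and $k_1$ and $b$ are free parameters (commonly $k_1 \in [1.2, 2.0]$ and $b = 0.75$). Library implementations may use a slightly different IDF variant.

```latex
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

\text{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```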


Final Intern presentation: https://docs.google.com/presentation/d/10mXA7K5sa_nAkqx2onsIfrH3TPj2Ni4LfCOxDhN5XBI/edit?usp=sharing


In [2]:
import os
import importlib
import logging
import time

# Reloading resets logging state left over from earlier runs in the same kernel
importlib.reload(logging)

# Project modules are reloaded so that local edits are picked up without a kernel restart
import framework
importlib.reload(framework)
import bert_qa
importlib.reload(bert_qa)
import infer_bert_qa
importlib.reload(infer_bert_qa)
import bert_utils
importlib.reload(bert_utils)

import pandas as pd
from framework import DataCuration, FeatureEngineering, StringProcessing
from retrieval import TaskRetrieval, FeatureEngineeringRetrieval, Retrieval

# Define some constants and configurations
logging.getLogger().setLevel(logging.INFO)
ACCESS_TOKEN = 'WUpGevbWC9lsnTW8quNUtmWRdAEM89'

Using TensorFlow backend.


## Specify Task
- Specify the task configuration and create a task object

In [3]:
DATASET = 'carta'  # 'w2' and 'resume' are also supported
TASK_CONFIG = {
    'task': 'retrieval'
}

task = TaskRetrieval(TASK_CONFIG)

## Curate dataset
- Specify the dataset paths. Paths can be local or on an Instabase drive (set *is_local* accordingly).
- Limit the data path to a single document (for the purposes of this demo notebook).
- Also specify configuration such as file extensions and the column names to use as index.
- Currently supports CSV format for goldens and ibocr/ibdoc for the dataset.
- Use *convert2txt* to extract and store raw text.
- The resulting object can be queried via *data.golden*, *data.dataset*, or *data.dataset.texts*, as required.

### Files:
- Download the documents (ibocr files) from https://www.instabase.com/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/ and specify the local directory path. Set *is_local* to True.
- Alternatively, specify the Instabase drive path (/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/) and set *is_local* to False.

In [4]:
CARTA_DATA = [
    # '/Users/ahsaasbajaj/Documents/Data/QA_model/data'
    '/ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/'
]

DATASET_CONFIG = {
    'path': CARTA_DATA,
    'is_local': False, 
    'file_type': 'ibocr',
    'identifier': lambda path: os.path.basename(path).split('.ibocr')[0],
    'convert2txt': True
}

CARTA_GOLDEN = None
GOLDEN_CONFIG = None

data = DataCuration(ACCESS_TOKEN, DATASET_CONFIG, GOLDEN_CONFIG)

INFO:root:Loading dataset from /ib_solutions/solutions/fs/Instabase%20Drive/poc/carta/Annotated%20Sample/out/s1_process_files/
INFO:root:5 files loaded
INFO:root:Converting IBOCR/IBDOC to raw texts


### Print data objects

In [5]:
data.dataset

{'annotated_AOI_2.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x15a56fa90>,
 'annotated_AOI_3.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x15cb4fef0>,
 'annotated_AOI_4.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x15a56f400>,
 'annotated_AOI_5.pdf': <instabase.ocr.client.libs.ibocr.ParsedIBOCR at 0x15d39f080>}
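
The keys shown above appear to come from the `identifier` function in DATASET_CONFIG, which strips the directory and the `.ibocr` suffix from each file path. A minimal sketch of that mapping follows; the example path is hypothetical and assumes the files on the drive are named `<document>.pdf.ibocr`.

```python
import os

# Same expression as the 'identifier' entry in DATASET_CONFIG
identifier = lambda path: os.path.basename(path).split('.ibocr')[0]

# Hypothetical path, assumed to follow the '<document>.pdf.ibocr' naming pattern
example_path = '/some/dir/s1_process_files/annotated_AOI_4.pdf.ibocr'
key = identifier(example_path)
print(key)
```

The resulting key is what later cells pass as `filename`.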

## Feature Engineering (Generate Labeled Data)
- Specify DATA_ARGS, which includes the task and data objects created earlier
- List the fields of interest (for extraction, classification) in DATA_ARGS
- Split the long input document into multiple segments, each a candidate for retrieval
- Also generate maps that track the page number and original document index of each segment
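
The splitting step can be pictured with the sketch below. It is not the implementation of `fe.split_doc`, whose internals are not shown in this notebook; the function `split_pages`, the word-based meaning of `split_size`, and the shape of the two maps are all assumptions made for illustration.

```python
def split_pages(pages, split_size=100):
    """Split each page's text into segments of at most `split_size` words.

    Hypothetical helper. Returns the segments plus two lookup tables keyed
    by segment index: the word offset of the segment within its page, and
    the 1-based page number the segment came from.
    """
    corpus, seg_to_offset, seg_to_page = [], {}, {}
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for start in range(0, len(words), split_size):
            seg_id = len(corpus)
            corpus.append(" ".join(words[start:start + split_size]))
            seg_to_offset[seg_id] = start
            seg_to_page[seg_id] = page_number
    return corpus, seg_to_offset, seg_to_page


# Toy input: two "pages" of text
pages = ["one two three four five", "six seven"]
corpus, seg_to_offset, seg_to_page = split_pages(pages, split_size=2)
print(corpus)
print(seg_to_page)
```

Keeping the maps separate from the segment text means a ranked segment can be traced back to its page without re-parsing the document.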

In [6]:
filename = 'annotated_AOI_4.pdf'
print('filename to split and query: ', filename)

query = "Preferred Stocks"

NUM_FILES = len(data.dataset.keys())
stime = time.time()

DATA_ARGS = {
    'task': task,
    'dataset': data
}

TRAINING_ARGS = {
    'model_file_or_path': "BM25Okapi"
}

fe = FeatureEngineeringRetrieval(DATA_ARGS)

corpus, doc_to_id_map, doc_to_pageNumber_map = fe.split_doc(filename=filename, split_size=100)  # list of document segments


INFO:root:Total pages in file: 20
filename to split and query:  annotated_AOI_4.pdf
INFO:root:Total Corpus Size 247 docs


## Modeling (BM25 Retrieval)
- Specify the model type
- Specify the queries to run
- This model uses BM25 text matching to rank segments with respect to the query
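
To make the ranking step concrete, here is a small pure-Python sketch of Okapi BM25 over a toy corpus. It is a stand-in, not the `Retrieval` class used in this notebook: the functions `bm25_scores` and `top_k`, the lowercase whitespace tokenization, and the defaults `k1=1.5`, `b=0.75` are assumptions, and the IDF follows the Wikipedia form rather than any particular library's variant.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every tokenized segment against the query with Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of segments containing each term
    df = Counter(term for doc in tokenized_corpus for term in set(doc))
    scores = []
    for doc in tokenized_corpus:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k(query, corpus, k=5):
    """Return the k highest-scoring (segment, score) pairs for the query."""
    tokenized = [segment.lower().split() for segment in corpus]
    scores = bm25_scores(query.lower().split(), tokenized)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [(corpus[i], scores[i]) for i in ranked[:k]]


toy_corpus = [
    "holders of Series A Preferred Stock",
    "the board of directors shall meet annually",
    "Preferred Stock converts into Common Stock",
]
results = top_k("Preferred Stock", toy_corpus, k=2)
for segment, score in results:
    print("{0:0.2f}  {1}".format(score, segment))
```

The third toy segment ranks first because "stock" occurs twice in it, while the second segment shares no terms with the query and scores zero.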

### Specify TRAINING_ARGS
- Name the model class so that the back-end libraries can select the appropriate implementation


#### Specify the query (a single string or a list of queries) for retrieval
#### This block of code retrieves the top-ranked segments for *query* from the document in *CARTA_DATA*

In [8]:
tokenized_corpus = fe.tokenize_corpus(corpus)

model = Retrieval(DATA_ARGS, TRAINING_ARGS)
model.train(corpus, tokenized_corpus, doc_to_id_map, doc_to_pageNumber_map)
output, scores, top_pageNumbers = model.predict(query, len_results=5)

etime = time.time()
logging.info('Total time {} seconds'.format(etime - stime))

INFO:root:Scores available for 247 docs
INFO:root:Total time 1.4929239749908447 seconds


### Print Predictions

In [10]:
for idx, (doc, score, page_num) in enumerate(zip(output, scores, top_pageNumbers), start=1):
    print('Ranked Item {0} Matching score: {1:0.2f}, Page Number: {2}'.format(idx, score, page_num))
    clean_output = " ".join(doc.split())  # collapse runs of whitespace and newlines
    print(clean_output, '\n')

Ranked Item 1 Matching score: 1.64, Page Number: 4
Prices Certificate" the "Original Issue Price" shall mean $0.795 per share for holders of Series Seed Preferred Stock, $1.3049 per share forholders of Series A Preferred Stock and $2.3199 for holders of Series A-1 Preferred Stock (each asadjusted for any stock dividends, stocksplits, stock combinations, recapitalizations orsimilar events withrespect tosuch shares) 

Ranked Item 2 Matching score: 1.64, Page Number: 8
The "Conversion Price" for each series of Preferred Stock shall initially mean the Original Issue Price for such series of Preferred Stock. 

Ranked Item 3 Matching score: 1.61, Page Number: 6
If any such holder shall be deemed to have converted such shares of Preferred Stock into Common Stock pursuant to this paragraph, then such holder shall not be entitled to receive any distribution that would otherwise be made to holders of the Series Seed Preferred Stock, Series A Preferred Stock or Series A-1 Preferred Stock (as appl