# Post-processing CMS1500 forms using the Textract Geofinder Library



In document processing workflows, we often need post-processing techniques to extract consistently formatted entities and improve the accuracy of the data passed from documents to downstream systems. **Textractor** is a Python package created to work seamlessly with [Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html), a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or building a complex distributed document processing pipeline, Textractor makes Textract easier to use and offers various post-processing capabilities.

The amazon-textract-* packages are listed below, with a link to each:

- [amazon-textract-caller](https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller) (to simplify calling Amazon Textract without additional dependencies)
- [amazon-textract-response-parser](https://pypi.org/project/amazon-textract-response-parser/) (to parse the JSON response returned by Textract APIs)
- [amazon-textract-overlayer](https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer) (to draw bounding boxes around the document entities on the document image)
- [amazon-textract-prettyprinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) (to convert the Amazon Textract response to CSV, text, markdown, ...)
- [amazon-textract-geofinder](https://github.com/aws-samples/amazon-textract-textractor/tree/master/tpipelinegeofinder) (to extract specific information from a document with methods that help navigate the document using geometry and relations, e.g. hierarchical key/value pairs)




## Installation
You only need to run the cell below once for installation. 
Amazon Textract Geofinder is the main package used in this use case. 


[Amazon Textract Geofinder](https://pypi.org/project/amazon-textract-geofinder/): an Amazon Textract package that makes data easier to access through geometric information and helps extract specific entities.

<b>Use cases include:</b>

   -  Give context to key/value pairs from the Amazon Textract AnalyzeDocument API for FORMS
   -  Find values in specific areas


In [None]:
!python -m pip install amazon-textract-helper amazon-textract-geofinder

## Notebook setup

In [1]:
from textractgeofinder.ocrdb import AreaSelection
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, SelectionElement
from textractprettyprinter.t_pretty_print import get_forms_string
from textractcaller import call_textract
from textractcaller.t_call import Textract_Features
import trp.trp2 as t2

### Other helper libraries for Textract response parsing

- <b>call_textract( )</b> from the [Textract-Caller](https://github.com/aws-samples/amazon-textract-textractor/tree/c689441c0562afb4976d4f248559e59289a33777/caller) library makes it easy to call the AnalyzeDocument API and get back its JSON response.

- The [Textract-PrettyPrinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) library provides functions to format the output received from Textract in more easily consumable formats such as CSV.

In [2]:
# Path to the sample image/file
image_filename = './doc-samples/CMS1500.png'

# Call AnalyzeDocument with the FORMS feature; j holds the JSON response
j = call_textract(input_document=image_filename, features=[Textract_Features.FORMS])
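
To make the shape of the FORMS response more concrete, here is a minimal sketch of how key/value pairs can be read from the raw `Blocks` list. The `sample_response` dictionary is a hand-written, heavily trimmed stand-in (real responses also carry geometry, confidence scores, and `PAGE`/`LINE` blocks), and the helper names `related_ids`, `block_text`, and `extract_key_values` are hypothetical. The helper libraries used in this notebook do this work for you.

```python
# Hand-written, trimmed stand-in for an AnalyzeDocument (FORMS) response.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "CITY"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Any"},
        {"Id": "w3", "BlockType": "WORD", "Text": "City"},
        {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v2"]},
                           {"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["s1"]}]},
        {"Id": "w4", "BlockType": "WORD", "Text": "Single"},
        {"Id": "s1", "BlockType": "SELECTION_ELEMENT",
         "SelectionStatus": "NOT_SELECTED"},
    ]
}


def related_ids(block, rel_type):
    """Return the ids referenced by one relationship type of a block."""
    return [block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]]


def block_text(block, by_id):
    """Join the words (or checkbox status) that are children of a block."""
    parts = []
    for child_id in related_ids(block, "CHILD"):
        child = by_id[child_id]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(response):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            value_text = " ".join(block_text(by_id[value_id], by_id)
                                  for value_id in related_ids(block, "VALUE"))
            pairs[block_text(block, by_id)] = value_text
    return pairs


form_pairs = extract_key_values(sample_response)
print(form_pairs)
```

Note that a flat mapping like this loses context: a checkbox key such as `Single` says nothing about which form item it belongs to, which is the gap the hierarchical keys below are meant to close.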

In [3]:
# Load the response JSON into a TDocument object.
t_document = t2.TDocumentSchema().load(j)
# Page dimensions used as the coordinate scale for geometric lookups.
doc_height = 1000
doc_width = 1000
# Build a TGeoFinder object from the same response JSON.
geofinder_doc = TGeoFinder(j, doc_height=doc_height, doc_width=doc_width)
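
Textract reports each bounding box as ratios of the page size (values between 0 and 1), while the cells in this notebook work with coordinates on the `doc_width` by `doc_height` grid (for example `x=doc_width` for the right page edge). The sketch below shows one way such a conversion can be done. It is an assumption about the general idea, not the library's actual implementation, and `ScaledBox` and `scale_bounding_box` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScaledBox:
    """Bounding box expressed on a doc_width x doc_height grid."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def scale_bounding_box(bbox, doc_width, doc_height):
    """Convert a Textract-style ratio bounding box to grid coordinates."""
    left = bbox["Left"] * doc_width
    top = bbox["Top"] * doc_height
    return ScaledBox(
        xmin=left,
        ymin=top,
        xmax=left + bbox["Width"] * doc_width,
        ymax=top + bbox["Height"] * doc_height,
    )


# Made-up ratio box, scaled to the same 1000 x 1000 grid used above.
example_box = scale_bounding_box(
    {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125},
    doc_width=1000,
    doc_height=1000,
)
print(example_box)
```

With a fixed grid like this, positions can be compared across scans of different resolutions, since every page is mapped to the same coordinate range.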

## Use case 1 : Hierarchical Key-Value mapping


Here, we define <b>set_hierarchy_kv</b>, a helper function that adds "virtual" hierarchical keys to give context to the leaf key-value pairs.

In [4]:
def set_hierarchy_kv(list_kv: list[KeyValue], t_document: t2.TDocument, page_block: t2.TBlock, prefix="BORROWER"):
    for x in list_kv:
        t_document.add_virtual_key_for_existing_key(key_name=f"{prefix}_{x.key.text}",
                                                    existing_key=t_document.get_block_by_id(x.key.id),
                                                    page_block=page_block)

We then find the relevant phrases in the document to delimit the area containing the key-value pairs related to the patient information. We use this area to add new key-value pairs, prefixed with their "hierarchical" parent key, to the Amazon Textract response JSON.

In [5]:

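# Use geometric information in the form to map keys to Item #6 (Patient Relationship to Insured)

# Header phrase of Item #6: its bottom edge is the top boundary of the area
patient_dob = geofinder_doc.find_phrase_on_page("6. PATIENT RELATIONSHIP TO INSURED")[0]
# Header phrase of Item #8: its top edge is the bottom boundary of the area
patient_relationship = geofinder_doc.find_phrase_on_page("8. PATIENT STATUS",min_textdistance=0.99)[0]

# The area spans the full page width between the two header phrases
top_left = t2.TPoint(y=patient_dob.ymax, x=0)
lower_right = t2.TPoint(y=patient_relationship.ymin, x=doc_width)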


form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1))
set_hierarchy_kv(list_kv=form_fields,
                 t_document=t_document,
                 prefix='6_PT_RELATIONSHIP',
                 page_block=t_document.pages[0])
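
The area selection above can be pictured as a rectangle test: keep only the fields whose boxes fall between the two header phrases. The following is a simplified sketch with made-up coordinates on a 1000 x 1000 grid. `FieldBox` and `fields_in_area` are hypothetical names, and the full-containment rule is an assumption, so the library may decide membership differently.

```python
from dataclasses import dataclass


@dataclass
class FieldBox:
    """Hypothetical stand-in for a form field with scaled coordinates."""
    key: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def fields_in_area(fields, top_left, lower_right):
    """Keep fields whose box lies fully inside the rectangle.

    top_left and lower_right are (x, y) tuples; y grows downwards.
    """
    left, top = top_left
    right, bottom = lower_right
    return [
        field for field in fields
        if field.xmin >= left and field.xmax <= right
        and field.ymin >= top and field.ymax <= bottom
    ]


# Made-up positions: bottom of the Item #6 header and top of the Item #8 header.
item6_header_ymax = 200
item8_header_ymin = 260

sample_fields = [
    FieldBox("Self", 420, 215, 450, 230),
    FieldBox("Spouse", 470, 215, 510, 230),
    FieldBox("Child", 530, 215, 560, 230),
    FieldBox("Other", 580, 215, 610, 230),
    FieldBox("Single", 430, 275, 470, 290),  # below Item #8 header, excluded
]

band = fields_in_area(
    sample_fields,
    top_left=(0, item6_header_ymax),
    lower_right=(1000, item8_header_ymin),
)
band_keys = [field.key for field in band]
print(band_keys)
```

Anchoring the rectangle to phrases found on the page, rather than to fixed coordinates, keeps the selection tolerant of small shifts between scans of the same form.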


All the keys now have a context, which makes it possible to parse the response in downstream processes. 

For example, the keys under Item #6 now carry the hierarchical prefix `6_PT_RELATIONSHIP`: `6_PT_RELATIONSHIP_spouse` shows `SELECTED`, and the rest of the keys show `NOT_SELECTED` in their values.
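
As an illustration of what a downstream process could do with the prefixed keys, here is a minimal sketch that picks the selected option under a given prefix. The `sample_pairs` dictionary is hand-written for illustration (the exact spelling and casing of real keys follow whatever Textract detected on the form), and `selected_options` is a hypothetical helper.

```python
def selected_options(form_pairs, prefix):
    """Return the option names under `prefix` whose value is SELECTED."""
    marker = f"{prefix}_"
    return [
        key[len(marker):]
        for key, value in form_pairs.items()
        if key.startswith(marker) and value == "SELECTED"
    ]


# Hand-written sample of key/value pairs after adding the virtual keys.
sample_pairs = {
    "6_PT_RELATIONSHIP_Self": "NOT_SELECTED",
    "6_PT_RELATIONSHIP_Spouse": "SELECTED",
    "6_PT_RELATIONSHIP_Child": "NOT_SELECTED",
    "6_PT_RELATIONSHIP_Other": "NOT_SELECTED",
    "Single": "NOT_SELECTED",
    "CITY": "Any City",
}

relationship = selected_options(sample_pairs, "6_PT_RELATIONSHIP")
print(relationship)
```

Without the prefix, a checkbox key such as `Other` could belong to several items on the form, so the prefix is what makes this lookup unambiguous.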

In [6]:
print(get_forms_string(t2.TDocumentSchema().dump(t_document)))

|----------------------------------------------------------------------------------------------|----------------------------------------------------------|
| Key                                                                                          | Value                                                    |
| Single                                                                                       | NOT_SELECTED                                             |
| 7. INSURED'S ADDRESS (No., Street)                                                           | 123 Any Street                                           |
| CITY                                                                                         | Any City                                                 |
| 32. NAME AND ADDRESS OF FACILITY WHERE SERVICES WERE RENDERED (If other than home or office) | Mateo Jackson PhD 9876 Healthcare Ave Any Town, CA 92126 |
| Employed                                                      