## Parser for CMS-1500 Form
This notebook will walk you through sample code to parse the CMS-1500 Form.  
The Form CMS-1500 is the standard paper claim form to bill Medicare Fee-For-Service Contractors when a paper claim is allowed. In addition to billing Medicare, the Form CMS-1500 may be suitable for billing various government and some private insurers.

In [1]:
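!python -m pip install amazon-textract-caller --upgrade
!python -m pip install amazon-textract-response-parser --upgrade
# textractprettyprinter is imported in the next cell, so install it up front
!python -m pip install amazon-textract-prettyprinter --upgrade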

In [None]:
import boto3, json
from textractcaller.t_call import call_textract, Textract_Features, Query, QueriesConfig, Adapter, AdaptersConfig

import pandas as pd
import trp
from trp import Document
import trp.trp2 as t2
from trp.trp2 import TDocument, TDocumentSchema, TBlock, TGeometry, TBoundingBox, TPoint
from trp.t_pipeline import order_blocks_by_geo_x_y
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_forms_string, convert_table_to_kv_dict, convert_table_to_list

textract = boto3.client('textract')

textract_json = call_textract(input_document="samples/CMS1500-sample.png", features = [Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client=textract)
print(json.dumps(textract_json, indent=2))


### Analyzing the CMS1500 Textract JSON Response: Order of elements
When you analyze the structured JSON output, you will notice that the blocks in the response are not in reading order. To fix this, we will use the `order_blocks_by_geo_x_y` function.
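
To make the idea of geometric ordering concrete, here is a minimal sketch that sorts a few simplified blocks top-to-bottom and then left-to-right. It is not the implementation of `order_blocks_by_geo_x_y`; the sample blocks, the `reading_order` helper, and the `row_tolerance` parameter are assumptions made for illustration. Only the `Geometry.BoundingBox` fields `Top` and `Left` follow the Textract block layout.

```python
# Hypothetical, simplified blocks: only the fields needed for ordering.
blocks = [
    {"Text": "CITY", "Geometry": {"BoundingBox": {"Top": 0.30, "Left": 0.05}}},
    {"Text": "1. MEDICARE", "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.05}}},
    {"Text": "1a. INSURED'S I.D. NUMBER", "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.60}}},
]


def reading_order(blocks, row_tolerance=0.01):
    """Sort top-to-bottom, then left-to-right within the same row."""
    def sort_key(block):
        box = block["Geometry"]["BoundingBox"]
        # Bucket Top values so small vertical jitter keeps words on one row.
        return (round(box["Top"] / row_tolerance), box["Left"])
    return sorted(blocks, key=sort_key)


ordered_texts = [b["Text"] for b in reading_order(blocks)]
print(ordered_texts)
```

The two blocks that share the same `Top` value stay on one row and are ordered by `Left`, while the block further down the page comes last.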

In [3]:
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo_x_y(t_doc)
print(get_forms_string(TDocumentSchema().dump(ordered_doc)))

trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))



|-----------------------------------------------------------------|----------------------------------------------------------|
| Key                                                             | Value                                                    |
| PICA                                                            | NOT_SELECTED                                             |
| PICA                                                            | NOT_SELECTED                                             |
| MEDICARE (Medicare#)                                            | NOT_SELECTED                                             |
| MEDICAID (Medicaid#)                                            | NOT_SELECTED                                             |
| TRICARE (ID#/DoD#)                                              | NOT_SELECTED                                             |
| CHAMPVA (Member ID#)                                            | NOT_SELECTED                               

### Analyzing the CMS1500 Textract JSON Response: Complexity
The CMS-1500 form is a complex form with many identical keys, which makes it difficult to differentiate between them. For example, `PATIENT DETAILS` and `INSURED DETAILS` both contain the fields `CITY` and `ZIP CODE`, which we would like to map back to their respective sections.
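
The following sketch illustrates why identical keys are a problem and how a section prefix resolves it. The key/value data is made up for illustration; the prefixes `5` and `7` are assumed here to stand for the patient and insured address boxes.

```python
# Hypothetical flat key/value pairs, as a form with two address sections might yield.
flat_pairs = [
    ("CITY", "Springfield"),   # patient section
    ("ZIP CODE", "12345"),
    ("CITY", "Shelbyville"),   # insured section
    ("ZIP CODE", "67890"),
]

# A plain dict keeps only the last value seen for each key.
collapsed = dict(flat_pairs)

# Prefixing each key with its section keeps all four values distinct.
sections = ["5", "5", "7", "7"]  # assumed box numbers for patient / insured address
prefixed = {
    f"{section}_{key}": value
    for section, (key, value) in zip(sections, flat_pairs)
}

print(collapsed)
print(prefixed)
```

Without a prefix, the patient's `CITY` is silently overwritten by the insured's `CITY`; with a prefix, both survive under distinct keys.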

**Utility Functions:**  
We will now walk through utility functions that use our code repositories for Textract GeoFinder, Textract Pretty Printer, and Textract Response Parser to parse hierarchical key-value pairs that are adjacent to each other or that fall within an area of the page.

In [None]:
!python -m pip install amazon-textract-geofinder --upgrade
!python -m pip install amazon-textract-prettyprinter --upgrade

In [4]:
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, AreaSelection, SelectionElement
from enum import Enum, auto
from typing import List


def set_hierarchy_kv(list_kv: List[KeyValue], t_document: TDocument, page_block: TBlock, prefix: str = "DEFAULT"):
    # Register a virtual key "<prefix>_<original key>" for every key/value pair in list_kv
    for x in list_kv:
        # print(f"{x.key.original_text} updated to {prefix}_{x.key.original_text}")
        t_document.add_virtual_key_for_existing_key(key_name=f"{prefix}_{x.key.original_text}",
                                                    existing_key=t_document.get_block_by_id(x.key.id),
                                                    page_block=page_block)

def set_adjacent_hkv(geofinder_doc: TGeoFinder, t_document: TDocument, phrase: str, number_of_keys:int=1, direction: str = 'RIGHT', prefix = None):
    list_phrase_tword = geofinder_doc.find_phrase_on_page(phrase)
    for phrase_tword in list_phrase_tword:
        # print(phrase_tword)
        if direction == 'RIGHT':
            form_fields = geofinder_doc.get_form_fields_to_the_right(word = phrase_tword, xmax = 1000, number_of_keys = number_of_keys)
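        elif direction == 'BELOW':
            form_fields = geofinder_doc.get_form_fields_below(word = phrase_tword, ymax = 1000, number_of_keys = number_of_keys)
        else:
            # Fail fast instead of hitting an unbound `form_fields` below
            raise ValueError(f"Unsupported direction: {direction!r}; expected 'RIGHT' or 'BELOW'")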
        prefix = phrase if prefix is None else prefix
        # print(f"set_adjacent_hkv, phrasess: {phrase_tword}, form_fields:{form_fields}")
        set_hierarchy_kv(list_kv=form_fields, t_document=t_document, prefix=prefix, page_block=t_document.pages[0])

class Area_Constraint(Enum):
    WIDTH_PAGE_MIN = auto()
    WIDTH_PAGE_MAX = auto()
    HEIGHT_PAGE_MIN = auto()
    HEIGHT_PAGE_MAX = auto()
    INCLUDE_TOP_LEFT_PHRASE = auto()
    INCLUDE_LOWER_RIGHT_PHRASE = auto()
        
def set_area_hkv(geofinder_doc: TGeoFinder, t_document: TDocument, top_left_phrase: str, lower_right_phrase: str, area_constraint: List[Area_Constraint]=None, prefix: str=None):
    # area_constraint defaults to None (no constraints) to avoid a shared mutable default list
    top_left_phrase_tword = geofinder_doc.find_phrase_on_page(top_left_phrase)[0]
    lower_right_phrase_tword = geofinder_doc.find_phrase_on_page(lower_right_phrase)[0]

    top_left_coord = dict()
    lower_right_coord = dict()
    if area_constraint:
        if Area_Constraint.WIDTH_PAGE_MIN in area_constraint:
            top_left_coord["x"] = 0
        if Area_Constraint.HEIGHT_PAGE_MIN in area_constraint:
            top_left_coord["y"] = 0
        if Area_Constraint.WIDTH_PAGE_MAX in area_constraint:
            lower_right_coord["x"] = geofinder_doc.doc_width
        if Area_Constraint.HEIGHT_PAGE_MAX in area_constraint:
            lower_right_coord["y"] = geofinder_doc.doc_height
        if Area_Constraint.INCLUDE_TOP_LEFT_PHRASE in area_constraint:
            if "x" not in top_left_coord:
                top_left_coord["x"] = top_left_phrase_tword.xmin
            if "y" not in top_left_coord:
                top_left_coord["y"] = top_left_phrase_tword.ymin
        if Area_Constraint.INCLUDE_LOWER_RIGHT_PHRASE in area_constraint:
            if "x" not in lower_right_coord:
                lower_right_coord["x"] = lower_right_phrase_tword.xmax
            if "y" not in lower_right_coord:
                lower_right_coord["y"] = lower_right_phrase_tword.ymax

    top_left_coord.setdefault("x", top_left_phrase_tword.xmax)
    top_left_coord.setdefault("y", top_left_phrase_tword.ymax)
    lower_right_coord.setdefault("x", lower_right_phrase_tword.xmin)
    lower_right_coord.setdefault("y", lower_right_phrase_tword.ymin)

    top_left = TPoint(y=top_left_coord["y"], x=top_left_coord["x"])
    lower_right = TPoint(y=lower_right_coord["y"], x=lower_right_coord["x"])

    form_fields = geofinder_doc.get_form_fields_in_area(
                    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number = 1))
    prefix = top_left_phrase if prefix is None else prefix
    # print(f"set_area_hkv, phrases: {top_left_phrase_tword}, {lower_right_phrase_tword}, form_fields:{form_fields}")
    set_hierarchy_kv(list_kv=form_fields, t_document=t_document, prefix=prefix, page_block=t_document.pages[0])

def get_cell_with_text(geofinder_doc: TGeoFinder, t_document: TDocument, phrase: str):
    # Return the id of the table cell that contains the phrase (last match on the page wins)
    list_phrase_tword = geofinder_doc.find_phrase_on_page(phrase)
    table_cells = []
    for phrase_tword in list_phrase_tword:
        cells = geofinder_doc.get_cells_with_text(word = phrase_tword, number_of_cells = 1)
        if cells:
            table_cells = cells
    if not table_cells:
        raise ValueError(f"No table cell found containing phrase: {phrase!r}")

    return table_cells[0].id

def convert_table_to_key_value(geofinder_doc: TGeoFinder, t_document: TDocument, phrase: str):
    # get_cell_with_text returns a single cell id
    cell_id = get_cell_with_text(geofinder_doc=geofinder_doc, t_document=t_document, phrase=phrase)
    table_kv_dict = dict()
    # Use the t_document argument rather than the notebook-level t_doc variable
    trp_doc = trp.Document(TDocumentSchema().dump(t_document))
    for page in trp_doc.pages:
        for table in page.tables:
            for row in table.rows:
                for cell in row.cells:
                    if cell.id == cell_id:
                        table_kv_dict = convert_table_to_kv_dict(table, ignore_table_summary=True)
                        print(json.dumps(table_kv_dict, indent=2))
    return table_kv_dict
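
To make the effect of the `Area_Constraint` flags concrete, the sketch below mirrors the coordinate logic of `set_area_hkv` using plain tuples instead of GeoFinder words. The `Box` tuple, the `Constraint` enum, and the `resolve_area` helper are hypothetical stand-ins introduced for illustration, and the coordinates are made up.

```python
from collections import namedtuple
from enum import Enum, auto

# Stand-in for a GeoFinder word: only the bounding-box corners are modeled.
Box = namedtuple("Box", "xmin ymin xmax ymax")


class Constraint(Enum):  # hypothetical mirror of Area_Constraint
    WIDTH_PAGE_MIN = auto()
    WIDTH_PAGE_MAX = auto()
    HEIGHT_PAGE_MIN = auto()
    HEIGHT_PAGE_MAX = auto()
    INCLUDE_TOP_LEFT_PHRASE = auto()
    INCLUDE_LOWER_RIGHT_PHRASE = auto()


def resolve_area(top_left, lower_right, constraints=(), page_w=1000, page_h=1000):
    """Return ((x1, y1), (x2, y2)) with the same precedence as set_area_hkv."""
    c = set(constraints)
    include_tl = Constraint.INCLUDE_TOP_LEFT_PHRASE in c
    include_lr = Constraint.INCLUDE_LOWER_RIGHT_PHRASE in c

    # Page-edge flags win; otherwise the phrase is included or excluded.
    x1 = 0 if Constraint.WIDTH_PAGE_MIN in c else (top_left.xmin if include_tl else top_left.xmax)
    y1 = 0 if Constraint.HEIGHT_PAGE_MIN in c else (top_left.ymin if include_tl else top_left.ymax)
    x2 = page_w if Constraint.WIDTH_PAGE_MAX in c else (lower_right.xmax if include_lr else lower_right.xmin)
    y2 = page_h if Constraint.HEIGHT_PAGE_MAX in c else (lower_right.ymax if include_lr else lower_right.ymin)
    return (x1, y1), (x2, y2)


# Made-up coordinates for two anchor phrases on a 1000 x 1000 page.
top_left_word = Box(xmin=100, ymin=200, xmax=250, ymax=220)
lower_right_word = Box(xmin=600, ymin=400, xmax=800, ymax=420)

default_area = resolve_area(top_left_word, lower_right_word)
widened_area = resolve_area(
    top_left_word,
    lower_right_word,
    [Constraint.WIDTH_PAGE_MIN, Constraint.INCLUDE_TOP_LEFT_PHRASE],
)
print(default_area)
print(widened_area)
```

With no constraints, the area starts just past the top-left phrase and stops just before the lower-right phrase. Adding `WIDTH_PAGE_MIN` and `INCLUDE_TOP_LEFT_PHRASE` stretches the area to the left page edge and up to the top of the first phrase.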


### Writing an opinionated parser for CMS-1500
We will now write an opinionated function, `parse_cms1500`, that applies the appropriate utility function defined above to each field and returns the document with the prefixed keys added.

In [5]:
def parse_cms1500(textract_json):
    t_document = TDocumentSchema().load(textract_json)
    doc_height = 1000
    doc_width = 1000
    geofinder_doc = TGeoFinder(textract_json, doc_height=doc_height, doc_width=doc_width)

    area_constraint = [Area_Constraint.WIDTH_PAGE_MIN, Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="1 MEDICARE", lower_right_phrase="4. INSURED'S NAME", area_constraint=area_constraint, prefix="1")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="PATIENTS'S BIRTH DATE", lower_right_phrase="7. INSURED'S ADDRESS", area_constraint=area_constraint, prefix="3")

    area_constraint = [Area_Constraint.WIDTH_PAGE_MIN]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="5. PATIENT'S ADDRESS", lower_right_phrase="10. IS PATIENT'S CONDITION RELATED TO:", area_constraint=area_constraint, prefix="5")

    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="6. PATIENT RELATIONSHIP TO INSURED", number_of_keys=4, direction="BELOW", prefix="6")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="7. INSURED'S ADDRESS", lower_right_phrase="11. INSURED'S POLICY GROUP OR FECA", area_constraint=area_constraint, prefix="7")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.INCLUDE_LOWER_RIGHT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="9. OTHER INSURED'S NAME", lower_right_phrase="READ BACK OF FORM BEFORE COMPLETNG", area_constraint=area_constraint, prefix="9")


    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="a. EMPLOYMENT? (Current or Previous)", number_of_keys=2, direction="BELOW", prefix="10a.EMPLOYMENT")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="b. AUTO ACCIDENT?", lower_right_phrase="c. INSURANCE PLAN NAME OR PROGRAM NAME", area_constraint=area_constraint, prefix="10b.AUTOACCIDENT")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="c. OTHER ACCIDENT?", lower_right_phrase="d. IS THERE ANOTHER HEALTH BENEFIT PLAN?", area_constraint=area_constraint, prefix="10c.OTHERACCIDENT")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="a. INSURED'S DATE OF BIRTH", lower_right_phrase="b. OTHER CLAIM ID", area_constraint=area_constraint, prefix="11a.INSURED_DOB_SEX")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="b. OTHER CLAIM ID", lower_right_phrase="d. IS THERE ANOTHER HEALTH BENEFIT PLAN?", area_constraint=area_constraint, prefix="11")

    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="d. IS THERE ANOTHER HEALTH BENEFIT PLAN?", number_of_keys=2, direction="BELOW", prefix="11d")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="12. PATIENT'S OR AUTHORIZED PERSON'S SIGNATURE", lower_right_phrase="16. DATES", area_constraint=area_constraint, prefix="12")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="13. INSURED'S OR AUTHORIZED PERSON'S", lower_right_phrase="16. DATES", area_constraint=area_constraint, prefix="13")

    area_constraint = [Area_Constraint.WIDTH_PAGE_MIN, Area_Constraint.INCLUDE_LOWER_RIGHT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="15. OTHER DATE", lower_right_phrase="REFERRING PROVIDER OR OTHER", area_constraint=area_constraint, prefix="14")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="15. OTHER DATE", lower_right_phrase="18. HOSPITALIZATION DATES", area_constraint=area_constraint, prefix="15")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="16. DATES", lower_right_phrase="18. HOSPITALIZATION DATES", area_constraint=area_constraint, prefix="16.DATES_WORK")


    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="18. HOSPITALIZATION DATES", lower_right_phrase="20. OUTSIDELAB?", area_constraint=area_constraint, prefix="18")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="20. OUTSIDELAB?", lower_right_phrase="22. RESUBMISSION?", area_constraint=area_constraint, prefix="20.OUTSIDE")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="21. DIAGNOSIS OR NATURE", lower_right_phrase="E DIAGNOSIS PONTER", area_constraint=area_constraint, prefix="21")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="22. RESUBMISSION?", lower_right_phrase="23. PRIOR AUTHORIZATION", area_constraint=area_constraint, prefix="22.RESUB")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="25. FEDERAL TAX", lower_right_phrase="32. SERVICE FACILITY LOCATION", area_constraint=area_constraint, prefix="25")

    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="26. PATIENT'S ACCOUNT NO", lower_right_phrase="33. BILLING PROVIDER INFO", prefix="27")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MIN, Area_Constraint.HEIGHT_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="31 SIGNATURE", lower_right_phrase="32. SERVICE FACILITY LOCATION INFORMATION", area_constraint=area_constraint, prefix="31")

    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="32. SERVICE FACILITY LOCATION INFORMATION", number_of_keys=2, direction="BELOW", prefix="32")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX, Area_Constraint.HEIGHT_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="33. BILLING PROVIDER INFO", lower_right_phrase="APPROVED", area_constraint=area_constraint, prefix="33")

    return order_blocks_by_geo_x_y(t_document)
 

### Calling the post-processing parser
Let's call the CMS-1500 parser function and analyze the response.

In [7]:
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo_x_y(t_doc)
trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))

final_t_document = parse_cms1500(textract_json)

print(get_forms_string(TDocumentSchema().dump(final_t_document)))


|-------------------------------------------------------------------|----------------------------------------------------------|
| Key                                                               | Value                                                    |
| PICA                                                              | NOT_SELECTED                                             |
| PICA                                                              | NOT_SELECTED                                             |
| 1_MEDICARE (Medicare#)                                            | NOT_SELECTED                                             |
| MEDICARE (Medicare#)                                              | NOT_SELECTED                                             |
| 1_MEDICAID (Medicaid#)                                            | NOT_SELECTED                                             |
| MEDICAID (Medicaid#)                                              | NOT_SELECTED               
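
Once the prefixed keys exist, one possible next step is to group them by form box. The sketch below works on a few key/value pairs copied from the output above; the `group_by_prefix` helper and the `KNOWN_PREFIXES` list are hypothetical, and the list holds only a subset of the prefixes passed in `parse_cms1500`.

```python
from collections import defaultdict

# A few key/value pairs copied from the printed output.
pairs = [
    ("PICA", "NOT_SELECTED"),
    ("1_MEDICARE (Medicare#)", "NOT_SELECTED"),
    ("MEDICARE (Medicare#)", "NOT_SELECTED"),
    ("1_MEDICAID (Medicaid#)", "NOT_SELECTED"),
    ("MEDICAID (Medicaid#)", "NOT_SELECTED"),
]

# Subset of the prefixes used in parse_cms1500 (illustrative, not exhaustive).
KNOWN_PREFIXES = ["1", "5", "7", "10a.EMPLOYMENT", "11a.INSURED_DOB_SEX"]


def group_by_prefix(pairs, prefixes):
    """Group "<prefix>_<key>" entries by prefix; unprefixed keys are skipped."""
    grouped = defaultdict(dict)
    # Longest first, so a longer prefix is preferred if two ever match one key.
    ordered = sorted(prefixes, key=len, reverse=True)
    for key, value in pairs:
        for prefix in ordered:
            marker = f"{prefix}_"
            if key.startswith(marker):
                grouped[prefix][key[len(marker):]] = value
                break
    return dict(grouped)


by_box = group_by_prefix(pairs, KNOWN_PREFIXES)
print(by_box)
```

Matching against an explicit list of prefixes avoids guessing where the prefix ends, which matters because some prefixes, such as `11a.INSURED_DOB_SEX`, contain underscores themselves.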