## Parser for UB04 or CMS-1450 Form
This notebook will walk you through sample code to parse the UB04 or CMS-1450 Form.  
The CMS-1450 form (aka UB-04) is used by institutional providers to bill a Medicare fiscal intermediary when a provider qualifies for a waiver from the Administrative Simplification Compliance Act requirement for electronic submission of claims.


In [43]:
!python -m pip install amazon-textract-caller --upgrade
!python -m pip install amazon-textract-response-parser --upgrade

In [None]:
import boto3, json
from textractcaller.t_call import call_textract, Textract_Features, Query, QueriesConfig, Adapter, AdaptersConfig

import pandas as pd
import trp
from trp import Document
import trp.trp2 as t2
from trp.trp2 import TDocument, TDocumentSchema, TBlock, TGeometry, TBoundingBox, TPoint
from trp.t_pipeline import order_blocks_by_geo_x_y
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_forms_string, convert_table_to_kv_dict, convert_table_to_list



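# Replace the profile name with one from your own AWS configuration.
session = boto3.Session(profile_name='kmascar+training-Admin')
# Create the client from the session so that the selected profile is actually used.
textract = session.client('textract')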

textract_json = call_textract(input_document="samples/ub-04-Form-sample.png", features = [Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client=textract)
print(json.dumps(textract_json, indent=2))


### Analyzing the UB04 Textract JSON Response: Order of elements
When you analyze the structured JSON output, you will notice that the blocks in the response are not in reading order. To order them correctly, we will use the `order_blocks_by_geo_x_y` function.

In [56]:
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo_x_y(t_doc)
print(get_forms_string(TDocumentSchema().dump(ordered_doc)))

trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))



|----------------------------------|---------------|
| Key                              | Value         |
| 1                                |               |
| 2                                |               |
| 3a PAT. CNTL #                   |               |
| 4 TYPE OF BILL                   |               |
| b. MED. REC. #                   |               |
| 7                                |               |
| 5 FED. TAX NO.                   |               |
| FROM                             |               |
| THROUGH                          |               |
| a                                |               |
| a                                |               |
| b                                |               |
| b                                |               |
| C                                |               |
| d                                |               |
| e                                |               |
| 29 ACDT STATE                    |          
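
To make "reading order" concrete, here is a minimal sketch of a geometric sort: group blocks into rows by their top edge, then sort each row left to right. This is not the implementation of `order_blocks_by_geo_x_y`; the function `sort_reading_order`, the `row_tolerance` parameter, the plain `top`/`left` keys and the coordinates are all assumptions made for illustration.

```python
from typing import Dict, List


def sort_reading_order(blocks: List[Dict], row_tolerance: float = 0.01) -> List[Dict]:
    """Sort blocks top-to-bottom, then left-to-right within each row (illustrative only)."""
    # Sort by top edge first so rows can be grouped in a single pass.
    by_top = sorted(blocks, key=lambda b: b["top"])
    rows: List[List[Dict]] = []
    for block in by_top:
        # A block joins the current row if its top edge is close to the row's first block.
        if rows and abs(block["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    ordered: List[Dict] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b["left"]))
    return ordered


# Made-up coordinates, expressed as ratios of the page size.
blocks = [
    {"text": "4 TYPE OF BILL", "top": 0.050, "left": 0.80},
    {"text": "8 PATIENT NAME", "top": 0.120, "left": 0.02},
    {"text": "3a PAT. CNTL #", "top": 0.052, "left": 0.55},
    {"text": "1", "top": 0.048, "left": 0.02},
]
ordered_texts = [b["text"] for b in sort_reading_order(blocks)]
print(ordered_texts)
```

The tolerance matters on scanned forms, where labels on the same printed line rarely share exactly the same top coordinate.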

### Analyzing the UB04 Textract JSON Response: Complexity
The UB04 form is complex: many of its keys are identical, which makes it difficult to tell them apart. For example, `8 PATIENT NAME` and `9 PATIENT ADDRESS` both contain fields `a`, `b`, `c`, which we would like to map back to their respective sections.
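
The following sketch shows why identical keys are a problem once the form is flattened into a dictionary, and how a section prefix avoids the collision. The `(section, key, value)` tuples and their values are hypothetical; only the `<prefix>_<key>` naming mirrors what the utility functions in this notebook produce.

```python
from typing import Dict, List, Tuple

# Hypothetical flattened fields: (section the key sits next to, key, value).
fields: List[Tuple[str, str, str]] = [
    ("8 PATIENT NAME", "a", "value-1"),
    ("8 PATIENT NAME", "b", "value-2"),
    ("9 PATIENT ADDRESS", "a", "value-3"),
    ("9 PATIENT ADDRESS", "b", "value-4"),
]

# Keyed by the bare label, later entries silently overwrite earlier ones.
flat: Dict[str, str] = {key: value for _, key, value in fields}

# Keyed by "<section>_<label>", every field keeps its own entry.
prefixed: Dict[str, str] = {f"{section}_{key}": value for section, key, value in fields}

print(len(flat), len(prefixed))
```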

**Utility Functions:**  
We will now walk through utility functions that use our code repositories for Textract GeoFinder, Textract Pretty Printer and Textract Response Parser to parse hierarchical key values that are adjacent to each other or lie within a given area.

In [None]:
!python -m pip install amazon-textract-geofinder --upgrade
!python -m pip install amazon-textract-prettyprinter --upgrade

In [57]:
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, AreaSelection, SelectionElement
from enum import Enum, auto
from typing import List


def set_hierarchy_kv(list_kv: list[KeyValue], t_document: TDocument, page_block: TBlock, prefix: str = "DEFAULT"):
    for x in list_kv:
        # print(f"{x.key.original_text} updated to {prefix}_{x.key.original_text}")
        t_document.add_virtual_key_for_existing_key(key_name=f"{prefix}_{x.key.original_text}",
                                                    existing_key=t_document.get_block_by_id(x.key.id),
                                                    page_block=page_block)

def set_adjacent_hkv(geofinder_doc: TGeoFinder, t_document: TDocument, phrase: str, number_of_keys:int=1, direction: str = 'RIGHT', prefix = None):
    list_phrase_tword = geofinder_doc.find_phrase_on_page(phrase)
    for phrase_tword in list_phrase_tword:
        # print(phrase_tword)
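        if direction == 'RIGHT':
            form_fields = geofinder_doc.get_form_fields_to_the_right(word = phrase_tword, xmax = 1000, number_of_keys = number_of_keys)
        elif direction == 'BELOW':
            form_fields = geofinder_doc.get_form_fields_below(word = phrase_tword, ymax = 1000, number_of_keys = number_of_keys)
        else:
            # Fail early instead of hitting an undefined form_fields below.
            raise ValueError(f"Unsupported direction: {direction!r}; expected 'RIGHT' or 'BELOW'")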
        prefix = phrase if prefix is None else prefix
        # print(f"set_adjacent_hkv, phrasess: {phrase_tword}, form_fields:{form_fields}")
        set_hierarchy_kv(list_kv=form_fields, t_document=t_document, prefix=prefix, page_block=t_document.pages[0])

class Area_Constraint(Enum):
    WIDTH_PAGE_MIN = auto()
    WIDTH_PAGE_MAX = auto()
    HEIGHT_PAGE_MIN = auto()
    HEIGHT_PAGE_MAX = auto()
    INCLUDE_TOP_LEFT_PHRASE = auto()
    INCLUDE_LOWER_RIGHT_PHRASE = auto()
        
def set_area_hkv(geofinder_doc: TGeoFinder, t_document: TDocument, top_left_phrase: str, lower_right_phrase: str, area_constraint: List[Area_Constraint]=list(), prefix: str=None):
    top_left_phrase_tword = geofinder_doc.find_phrase_on_page(top_left_phrase)[0]
    lower_right_phrase_tword = geofinder_doc.find_phrase_on_page(lower_right_phrase)[0]

    top_left_coord = dict()
    lower_right_coord = dict()
    if area_constraint:
        if Area_Constraint.WIDTH_PAGE_MIN in area_constraint:
            top_left_coord["x"] = 0
        if Area_Constraint.HEIGHT_PAGE_MIN in area_constraint:
            top_left_coord["y"] = 0
        if Area_Constraint.WIDTH_PAGE_MAX in area_constraint:
            lower_right_coord["x"] = geofinder_doc.doc_width
        if Area_Constraint.HEIGHT_PAGE_MAX in area_constraint:
            lower_right_coord["y"] = geofinder_doc.doc_height
        if Area_Constraint.INCLUDE_TOP_LEFT_PHRASE in area_constraint:
            if "x" not in top_left_coord:
                top_left_coord["x"] = top_left_phrase_tword.xmin
            if "y" not in top_left_coord:
                top_left_coord["y"] = top_left_phrase_tword.ymin
        if Area_Constraint.INCLUDE_LOWER_RIGHT_PHRASE in area_constraint:
            if "x" not in lower_right_coord:
                lower_right_coord["x"] = lower_right_phrase_tword.xmax
            if "y" not in lower_right_coord:
                lower_right_coord["y"] = lower_right_phrase_tword.ymax

    top_left_coord.setdefault("x", top_left_phrase_tword.xmax)
    top_left_coord.setdefault("y", top_left_phrase_tword.ymax)
    lower_right_coord.setdefault("x", lower_right_phrase_tword.xmin)
    lower_right_coord.setdefault("y", lower_right_phrase_tword.ymin)

    top_left = TPoint(y=top_left_coord["y"], x=top_left_coord["x"])
    lower_right = TPoint(y=lower_right_coord["y"], x=lower_right_coord["x"])

    form_fields = geofinder_doc.get_form_fields_in_area(
                    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number = 1))
    prefix = top_left_phrase if prefix is None else prefix
    # print(f"set_area_hkv, phrases: {top_left_phrase_tword}, {lower_right_phrase_tword}, form_fields:{form_fields}")
    set_hierarchy_kv(list_kv=form_fields, t_document=t_document, prefix=prefix, page_block=t_document.pages[0])

def get_cell_with_text(geofinder_doc: TGeoFinder, t_document: TDocument, phrase: str):
    """Return the id of the table cell that contains `phrase`."""
    table_cells = []
    for phrase_tword in geofinder_doc.find_phrase_on_page(phrase):
        # If the phrase occurs more than once, the last occurrence wins.
        table_cells = geofinder_doc.get_cells_with_text(word = phrase_tword, number_of_cells = 1)
    if not table_cells:
        raise ValueError(f"No table cell found for phrase: {phrase!r}")
    return table_cells[0].id

def convert_table_to_key_value(geofinder_doc: TGeoFinder, t_document: TDocument, phrase: str):
    # get_cell_with_text returns a single cell id (a string), so compare with ==.
    cell_id = get_cell_with_text(geofinder_doc=geofinder_doc, t_document=t_document, phrase=phrase)
    table_kv_dict = dict()
    # Use the document passed in rather than the notebook-level t_doc.
    trp_doc = trp.Document(TDocumentSchema().dump(t_document))
    for page in trp_doc.pages:
        for table in page.tables:
            for row in table.rows:
                for cell in row.cells:
                    if cell.id == cell_id:
                        # Convert the whole table that contains the matching cell.
                        table_kv_dict = convert_table_to_kv_dict(table, ignore_table_summary=True)
                        print(json.dumps(table_kv_dict, indent=2))
    return table_kv_dict
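
The area logic in `set_area_hkv` can be easier to follow with plain numbers. The sketch below mimics how the two anchor phrases define a rectangle and how `INCLUDE_TOP_LEFT_PHRASE` and `WIDTH_PAGE_MAX` widen it. The `Box` class, the `fields_in_area` function and all coordinates are hypothetical, and it assumes a field counts as inside only when its whole box lies within the area; `get_form_fields_in_area` may apply a different rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def fields_in_area(fields: List[Box], top_left: Box, lower_right: Box,
                   include_top_left: bool = False, full_width: bool = False,
                   page_width: int = 1000) -> List[str]:
    # By default the area starts at the lower-right corner of the top-left phrase...
    x0 = top_left.xmin if include_top_left else top_left.xmax
    y0 = top_left.ymin if include_top_left else top_left.ymax
    # ...and ends at the upper-left corner of the lower-right phrase.
    x1 = page_width if full_width else lower_right.xmin
    y1 = lower_right.ymin
    return [f.text for f in fields
            if f.xmin >= x0 and f.ymin >= y0 and f.xmax <= x1 and f.ymax <= y1]


# Made-up coordinates on a 1000 x 1000 page.
anchor_a = Box("9 PATIENT ADDRESS", 400, 100, 520, 115)
anchor_b = Box("29 ACDT", 600, 200, 650, 215)
fields = [
    Box("a", 530, 100, 545, 115),  # on the same line as the first anchor
    Box("b", 410, 120, 425, 135),  # below the first anchor
    Box("c", 700, 120, 715, 135),  # right of the second anchor's left edge
    Box("e", 20, 150, 100, 165),   # left of the first anchor
    Box("d", 540, 130, 560, 145),  # strictly between the two anchors
]

strict = fields_in_area(fields, anchor_a, anchor_b)
widened = fields_in_area(fields, anchor_a, anchor_b, include_top_left=True, full_width=True)
print(strict, widened)
```

With no constraints only the field strictly between the anchors is selected; including the top-left phrase and extending to the page width also picks up fields on the anchor's own line and to its right.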


### Writing an opinionated parser for UB04
We will now write an opinionated function, `parse_ub04`, that applies the appropriate utility function defined above to each field and returns the resulting document.

In [59]:
def parse_ub04(textract_json):
    t_document = TDocumentSchema().load(textract_json)
    doc_height = 1000
    doc_width = 1000
    geofinder_doc = TGeoFinder(textract_json, doc_height=doc_height, doc_width=doc_width)

    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="3a PAT CNTL", number_of_keys=1, direction="BELOW", prefix="3")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="6 STATEMENT COVERS PERIOD", number_of_keys=2, direction="BELOW")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="8 PATIENT NAME")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="8 PATIENT NAME", direction="BELOW")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="9 PATIENT ADDRESS")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="9 PATIENT ADDRESS", direction="BELOW")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="9 PATIENT ADDRESS", lower_right_phrase="29 ACDT", area_constraint=area_constraint)

    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="31 CODE", prefix="31")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="32 CODE", prefix="32")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="33 CODE", prefix="33")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="34 CODE", prefix="34")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="35 CODE", number_of_keys=2, prefix="35")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="36 CODE", number_of_keys=2, prefix="36")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="39 CODE", prefix="39")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="40 CODE", prefix="40")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="41 CODE", prefix="41")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="56 NPI 57", number_of_keys=2, direction="BELOW", prefix="57")
    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="74 PRINCIPAL PROCEDURE", number_of_keys=2, direction="BELOW", prefix="74 PRINCIPAL")

    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="74 PRINCIPAL PROCEDURE", lower_right_phrase="77 OPERATING", area_constraint=None, prefix="74ab")

    area_constraint = [Area_Constraint.WIDTH_PAGE_MIN]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="76 ATTENDING", lower_right_phrase="78 OTHER", area_constraint=area_constraint, prefix="74cde")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="76 ATTENDING", lower_right_phrase="77 OPERATING", area_constraint=area_constraint, prefix="76")
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="77 OPERATING", lower_right_phrase="78 OTHER", area_constraint=area_constraint, prefix="77")
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="78 OTHER", lower_right_phrase="79 OTHER", area_constraint=area_constraint, prefix="78")

    area_constraint = [Area_Constraint.INCLUDE_TOP_LEFT_PHRASE, Area_Constraint.WIDTH_PAGE_MAX, Area_Constraint.HEIGHT_PAGE_MAX]
    set_area_hkv(geofinder_doc=geofinder_doc, t_document=t_document, top_left_phrase="79 OTHER", lower_right_phrase="LAST", area_constraint=area_constraint, prefix="79")

    set_adjacent_hkv(geofinder_doc=geofinder_doc, t_document=t_document, phrase="81 CC", number_of_keys=3, direction="BELOW")


    convert_table_to_key_value(geofinder_doc=geofinder_doc, t_document=t_document, phrase="REV CD")
    convert_table_to_key_value(geofinder_doc=geofinder_doc, t_document=t_document, phrase="66 DX")

    return order_blocks_by_geo_x_y(t_document)
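
Once keys carry a prefix, it can be convenient to regroup them by section. The sketch below does this for a plain dictionary of `<prefix>_<key>` entries. The helper `group_by_prefix` and the sample keys and values are hypothetical and are not part of any Textract library; they only assume the underscore naming used by `set_hierarchy_kv`.

```python
from collections import defaultdict
from typing import Dict


def group_by_prefix(flat: Dict[str, str], separator: str = "_") -> Dict[str, Dict[str, str]]:
    """Group "<prefix>_<label>" keys into one nested dict per prefix."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for key, value in flat.items():
        prefix, sep, label = key.partition(separator)
        if not sep:
            # Keys without a separator were never prefixed; keep them under "".
            prefix, label = "", key
        grouped[prefix][label.strip()] = value.strip()
    return dict(grouped)


# Hypothetical key/value pairs, as they might look after prefixing.
sample = {
    "8 PATIENT NAME_a": "v1",
    "9 PATIENT ADDRESS_a": "v2",
    "9 PATIENT ADDRESS_b ": " v3",
    "4 TYPE OF BILL": "v4",
}
nested = group_by_prefix(sample)
print(nested)
```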


### Calling the post-processing parser
Let's call the UB04 parser function and analyze the response.

In [60]:
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo_x_y(t_doc)
trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))

final_t_document = parse_ub04(textract_json)

print(get_forms_string(TDocumentSchema().dump(final_t_document)))


get_cells_with_text: found keys: [TWord(text='42 rev. cd. ', original_text='42 REV. CD. ', text_type='cell', confidence=92.431640625, id='98468515-9121-4f8f-9e28-eacc209de55f', xmin=17, ymin=259, xmax=72, ymax=274, page_number=1, doc_width=1000, doc_height=1000, child_relationships='', reference=None, resolver=None)]
[
  {
    "0": "",
    "42 REV. CD. ": "",
    "43 DESCRIPTION ": "",
    "44 HCPCS RATE HIPPS CODE ": "",
    "45 SERV. DATE ": "",
    "46 SERV. UNITS ": "",
    "47 TOTAL CHARGES ": "",
    "": "",
    "48 NON-COVERED CHARGES ": "",
    "49 ": ""
  },
  {
    "0": "",
    "42 REV. CD. ": "",
    "43 DESCRIPTION ": "",
    "44 HCPCS RATE HIPPS CODE ": "",
    "45 SERV. DATE ": "",
    "46 SERV. UNITS ": "",
    "47 TOTAL CHARGES ": "",
    "": "",
    "48 NON-COVERED CHARGES ": "",
    "49 ": ""
  },
  {
    "0": "",
    "42 REV. CD. ": "",
    "43 DESCRIPTION ": "",
    "44 HCPCS RATE HIPPS CODE ": "",
    "45 SERV. DATE ": "",
    "46 SERV. UNITS ": "",
    "47 TOTAL C