# Mortgage Document Extraction - _continued_

In this notebook, we will train an Amazon Comprehend custom entity recognizer so that we can detect and extract entities from the HOA (homeowners association) document. We will use the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) to extract plain text from the document and the [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) data science library to prepare training data. We will also need the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) and the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). We will perform two types of entity recognition with Amazon Comprehend:

- [Default entity recognition](#step1)
- [Custom entity recognition](#step2)

---

## Setup Notebook


In [None]:
import boto3
import botocore
import sagemaker
import time
import os
import json
import datetime
import io
import uuid
import pandas as pd
import numpy as np
import multiprocessing as mp
from pathlib import Path
from pytz import timezone

# Notebook display helpers (Image here is IPython's display class)
from IPython.display import Image, display, HTML, JSON, IFrame
# PIL's Image is aliased to PImage so it does not clash with IPython's Image
from PIL import Image as PImage, ImageDraw, ImageFont

# Amazon Textract helper libraries
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string
from trp import Document


# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)


---
# Default Entity Recognition with Amazon Comprehend <a id="step1"></a>

Amazon Comprehend can detect a predefined list of default entities using its pre-trained model. Check out the [documentation](https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html) for a full list of default entities. In this section, we will use Amazon Comprehend's default entity recognizer to find the default entities present in the document.

In [None]:
documentName = "docs/hoa_statement.pdf"
display(IFrame(documentName, 500, 600));

We will now extract the (UTF-8) string text from the document above and use the Amazon Comprehend [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectEntities.html) API to detect the default entities.

In [None]:
response_1 = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/hoa_statement.pdf') 
lines_1 = get_string(textract_json=response_1, output_type=[Textract_Pretty_Print.LINES])
text_1 = lines_1.replace("\n", " ")
text_1

In [None]:
entities_default = comprehend.detect_entities(LanguageCode="en", Text=text_1)

df_default_entities = pd.DataFrame(entities_default["Entities"], columns = ['Text', 'Type', 'Score'])
# drop=True renumbers the rows without adding the old index as an extra column
df_default_entities = df_default_entities.drop_duplicates(subset=['Text']).reset_index(drop=True)

df_default_entities

The output above shows the default entities that Amazon Comprehend detected in the document's text. However, we are interested in specific values such as the property address (currently reported under the default entity type LOCATION) or the HOA due amount (currently reported under the default entity type QUANTITY). To extract those, we need to train an Amazon Comprehend custom entity recognizer, which we will do in the following section.
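
To see why the default types are too coarse for this task, here is a minimal sketch that groups detected entities by type. The `sample_entities` list is made-up data shaped like the `Entities` list in a DetectEntities response, and `group_by_type` is a hypothetical helper, not part of any AWS library.

```python
# Made-up sample shaped like the "Entities" list in a DetectEntities response.
sample_entities = [
    {"Text": "123 Example Street", "Type": "LOCATION", "Score": 0.98},
    {"Text": "Springfield", "Type": "LOCATION", "Score": 0.95},
    {"Text": "$250.00", "Type": "QUANTITY", "Score": 0.97},
    {"Text": "12 months", "Type": "QUANTITY", "Score": 0.91},
    {"Text": "January 1", "Type": "DATE", "Score": 0.99},
]


def group_by_type(entities, wanted_types):
    """Return {type: [text, ...]} for the requested default entity types."""
    grouped = {entity_type: [] for entity_type in wanted_types}
    for entity in entities:
        if entity["Type"] in grouped:
            grouped[entity["Type"]].append(entity["Text"])
    return grouped


candidates = group_by_type(sample_entities, ["LOCATION", "QUANTITY"])
for entity_type, texts in candidates.items():
    print(f"{entity_type}: {texts}")
```

Each default type can return several candidates, and nothing in the response says which LOCATION is the property address or which QUANTITY is the amount due. That ambiguity is what the custom recognizer is meant to remove.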

---
# Custom Entity Recognition with Amazon Comprehend <a id="step2"></a>

## Data preparation

There are two ways to train an Amazon Comprehend custom entity recognizer:

- [Annotations](https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation.html)
- [Entity lists](https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html)

The annotation method may provide better accuracy, but preparing annotation data is more involved. For this hands-on exercise, we will train a custom entity recognizer using [entity lists](https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html). To train with entity lists, we first need a list of entities with sample values in CSV format. We also need the documents' text in a separate plain-text file with one document per line, which means every line in the training file is a full document.
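
As a rough illustration of the two input files, the sketch below builds an entity list CSV with `Text` and `Type` columns and a one-document-per-line corpus. The sample values and the helper names `build_entity_list_csv` and `build_corpus` are assumptions for illustration only; the real files are provided with this notebook.

```python
import csv
import io

# Hypothetical sample values, not taken from the provided training files.
entity_rows = [
    ("123 Example Street, Springfield", "PROPERTY_ADDRESS"),
    ("$250.00", "HOA_DUE_AMOUNT"),
]
documents = [
    "HOA statement\nProperty: 123 Example Street, Springfield\nAmount due: $250.00",
    "Second notice\nAmount due: $250.00",
]


def build_entity_list_csv(rows):
    """Serialize (text, type) pairs as CSV with a Text,Type header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    writer.writerows(rows)
    return buffer.getvalue()


def build_corpus(docs):
    """Collapse whitespace inside each document so it fits on a single line."""
    return "\n".join(" ".join(doc.split()) for doc in docs) + "\n"


entity_list_csv = build_entity_list_csv(entity_rows)
corpus = build_corpus(documents)
print(entity_list_csv)
print(corpus)
```

The `csv` module quotes the address because it contains a comma, so the value still occupies a single `Text` field.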

Let's take a look at our sample document.

In [None]:
documentName = "docs/hoa_statement.pdf"
display(IFrame(documentName, 500, 600));

We would like to extract two entities from this document:

- The property address (`PROPERTY_ADDRESS`)
- The total HOA due amount (`HOA_DUE_AMOUNT`)

Since we are going to use an entity list with the above two entities, we need the sample document's content as UTF-8 encoded plain text. We can get it by extracting the text from the document file(s) with Amazon Textract.

In [None]:
response = call_textract(input_document=f's3://{data_bucket}/idp-mortgage/textract/hoa_statement.pdf') 
lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
text = lines.replace("\n", " ")
text

The custom entity recognizer needs at least 200 document samples and 250 entity samples for each entity. For this hands-on exercise, we have provided the entity list CSV file and the document file, in which each line is an entire document (one document per line).
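
A quick sanity check of the training data against these minimums could look like the following sketch. The function name `check_training_data` is hypothetical, and the default thresholds simply repeat the numbers quoted above, so adjust them if the service limits differ for your account or region.

```python
from collections import Counter


def check_training_data(documents, entity_rows,
                        min_documents=200, min_samples_per_entity=250):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Blank lines do not count as documents.
    non_empty = [doc for doc in documents if doc.strip()]
    if len(non_empty) < min_documents:
        problems.append(
            f"only {len(non_empty)} documents, need at least {min_documents}")
    # Count how many sample values each entity type has.
    counts = Counter(entity_type for _, entity_type in entity_rows)
    for entity_type, count in sorted(counts.items()):
        if count < min_samples_per_entity:
            problems.append(
                f"{entity_type} has {count} samples, "
                f"need at least {min_samples_per_entity}")
    return problems


# Tiny made-up data set, checked against deliberately small thresholds.
documents = ["doc one", "doc two", ""]
entity_rows = [
    ("a", "PROPERTY_ADDRESS"),
    ("b", "PROPERTY_ADDRESS"),
    ("c", "HOA_DUE_AMOUNT"),
]
problems = check_training_data(documents, entity_rows,
                               min_documents=2, min_samples_per_entity=2)
print(problems)
```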


---

## Training the custom entity recognizer

Let's take a look at the entity list CSV file.

In [None]:
entities_df = pd.read_csv('./data/entity_list.csv', dtype={'Text': object})
entities = entities_df["Type"].unique().tolist()
print(f'Custom entities : {entities}')
print(f'\nTotal Custom entities: {entities_df["Type"].nunique()}')
print("\n\nTotal Sample per entity:")
entities_df['Type'].value_counts()

Notice that we have two entities in the entity list CSV file: 'PROPERTY_ADDRESS' and 'HOA_DUE_AMOUNT'. We also have about 300 samples per entity. With this, we are ready to train the custom entity recognizer. Let's upload the document file and the entity list CSV file to S3.

In [None]:
!aws s3 cp ./data/entity_list.csv s3://{data_bucket}/idp-mortgage/comprehend/entity_list.csv
!aws s3 cp ./data/entity_training_corpus.txt s3://{data_bucket}/idp-mortgage/comprehend/entity_training_corpus.txt

We will initialize a few variables and start the entity recognizer training. Run the code cells below.

In [None]:
entities_uri = f's3://{data_bucket}/idp-mortgage/comprehend/entity_list.csv'
training_data_uri = f's3://{data_bucket}/idp-mortgage/comprehend/entity_training_corpus.txt'

print(f'Entity List CSV File: {entities_uri}')
print(f'Training Data File: {training_data_uri}')

In [None]:
# Settings for the custom entity recognizer
account_id = boto3.client('sts').get_caller_identity().get('Account')
# Epoch-seconds suffix; strftime("%s") is platform-specific and `id` shadows a builtin
run_id = str(int(time.time()))

entity_recognizer_name = 'mortgage-custom-ner-hoa'
entity_recognizer_version = 'v1'
entity_recognizer_arn = ''
create_response = None
# One {'Type': ...} entry per custom entity found in the entity list
EntityTypes = [{'Type': e} for e in entities]

In [None]:
try:
    create_response = comprehend.create_entity_recognizer(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'EntityTypes': EntityTypes,
            'Documents': {
                'S3Uri': training_data_uri
            },
            'EntityList': {
                'S3Uri': entities_uri
            }
        },
        DataAccessRoleArn=role,
        RecognizerName=entity_recognizer_name,
        VersionName=entity_recognizer_version,
        LanguageCode='en'
    )
    
    entity_recognizer_arn = create_response['EntityRecognizerArn']
    
    print(f"Comprehend Custom entity recognizer created with ARN: {entity_recognizer_arn}")
except Exception as error:

    print(error)

Note that the training may take about 20 minutes. The status of the training can be checked using the code below. You can also view the status of the training job from the Amazon Comprehend console.

In [None]:
%%time
# Poll until training finishes; this may take about 20 minutes
from IPython.display import clear_output
import time
from datetime import datetime

jobArn = create_response['EntityRecognizerArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    
    describe_custom_recognizer = comprehend.describe_entity_recognizer(
        EntityRecognizerArn = jobArn
    )
    status = describe_custom_recognizer["EntityRecognizerProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document entity recognizer: {status}")
    
    # Stop polling on any terminal status, successful or not
    if status in ("TRAINED", "TRAINED_WITH_WARNING", "STOPPED", "IN_ERROR"):
        break
    time.sleep(10)
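
The polling pattern above can be factored into a small reusable helper. The sketch below is one possible shape, not part of boto3: `wait_for_status` is a hypothetical function that takes any zero-argument callable returning a status string, so it can be exercised here with a stand-in instead of a real Amazon Comprehend call.

```python
import time


def wait_for_status(fetch_status, terminal_statuses,
                    poll_seconds=10, timeout_seconds=3 * 60 * 60):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Stand-in for a describe call: yields a made-up sequence of statuses.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINED"])
final_status = wait_for_status(lambda: next(fake_statuses),
                               {"TRAINED", "IN_ERROR"}, poll_seconds=0)
print(final_status)
```

In this notebook, `fetch_status` could be a lambda wrapping `comprehend.describe_entity_recognizer(EntityRecognizerArn=jobArn)["EntityRecognizerProperties"]["Status"]`. Raising on timeout makes a stalled job visible instead of letting the loop end silently.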

---
## Deploy Custom Entity Recognizer Endpoint

Our model has now been trained and can be deployed. In the next code cell, we will deploy the trained custom entity recognizer.

In [None]:
# Create the Amazon Comprehend endpoint
model_arn = entity_recognizer_arn
ep_name = 'mortgage-hoa-ner-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,
        DataAccessRoleArn=role
    )
    ER_ENDPOINT_ARN = endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ER_ENDPOINT_ARN}')
    %store ER_ENDPOINT_ARN
except botocore.exceptions.ClientError as error:
    # Only ClientError carries a .response dict with the service error code
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ER_ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:entity-recognizer-endpoint/{ep_name}'
        print(f'The entity recognizer endpoint ARN is: "{ER_ENDPOINT_ARN}"')
        %store ER_ENDPOINT_ARN
    else:
        print(error)
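
The fallback branch above rebuilds the endpoint ARN from the region, account ID, and endpoint name. As a small sketch of that string format, the hypothetical helpers below build and parse such an ARN. The `partition` parameter is an assumption added for illustration (the cell above hardcodes `aws`), and the account ID is a placeholder.

```python
def endpoint_arn(region, account_id, endpoint_name, partition="aws"):
    """Build an entity recognizer endpoint ARN in the format used above."""
    return (f"arn:{partition}:comprehend:{region}:{account_id}:"
            f"entity-recognizer-endpoint/{endpoint_name}")


def endpoint_name_from_arn(arn):
    """Extract the endpoint name, rejecting ARNs of any other resource type."""
    parts = arn.split(":", 5)
    if len(parts) != 6:
        raise ValueError(f"not an ARN: {arn!r}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "entity-recognizer-endpoint" or not name:
        raise ValueError(f"not an entity recognizer endpoint ARN: {arn!r}")
    return name


# Placeholder account ID, not a real account.
arn = endpoint_arn("us-east-1", "123456789012", "mortgage-hoa-ner-endpoint")
print(arn)
print(endpoint_name_from_arn(arn))
```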

Note that the endpoint creation may take about 20 minutes. The status of the deployment can be checked using the code below. You can also view the status of the endpoint from the Amazon Comprehend console.

In [None]:
%%time
# Poll until the endpoint is ready; this may take about 20 minutes
from IPython.display import clear_output
import time
from datetime import datetime

# ER_ENDPOINT_ARN is set whether the endpoint was just created or already existed
ep_arn = ER_ENDPOINT_ARN

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    
    describe_endpoint_resp = comprehend.describe_endpoint(
        EndpointArn=ep_arn
    )
    status = describe_endpoint_resp["EndpointProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom entity recognizer endpoint: {status}")
    
    if status in ("IN_SERVICE", "FAILED"):
        break
        
    time.sleep(10)

---

## Detect Entities using the Endpoint

We will now try to detect our two custom entities, 'PROPERTY_ADDRESS' and 'HOA_DUE_AMOUNT', in our sample HOA letter. We will define a function that calls the Amazon Comprehend DetectEntities API with the text extracted by Amazon Textract and the endpoint ARN. The expected output is the detected entities, their values, and their corresponding confidence scores.

In [None]:
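def get_entities(text):
    """Detect custom entities in text with the deployed endpoint and return them as a DataFrame."""
    try:
        # The endpoint ARN routes the request to our custom recognizer
        entities_custom = comprehend.detect_entities(LanguageCode="en", Text=text, EndpointArn=ER_ENDPOINT_ARN)
        df_custom = pd.DataFrame(entities_custom["Entities"], columns=['Text', 'Type', 'Score'])
        # Keep one row per distinct value and renumber the rows
        df_custom = df_custom.drop_duplicates(subset=['Text']).reset_index(drop=True)
        return df_custom
    except Exception as e:
        # On failure the error is printed and the function returns None
        print(e)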

In [None]:
resp = get_entities(text)
resp
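
Downstream code usually wants a single value per entity rather than a table. One way to get there, sketched below on made-up data shaped like the `Entities` list from DetectEntities, is to keep the highest-scoring candidate for each entity type. The helper `best_entity_per_type` and its `min_score` threshold are assumptions for illustration, not part of the notebook's libraries.

```python
# Made-up sample shaped like the "Entities" list from the custom endpoint.
sample_entities = [
    {"Text": "123 Example Street, Springfield", "Type": "PROPERTY_ADDRESS", "Score": 0.99},
    {"Text": "123 Example Street", "Type": "PROPERTY_ADDRESS", "Score": 0.80},
    {"Text": "$250.00", "Type": "HOA_DUE_AMOUNT", "Score": 0.95},
    {"Text": "$25.00", "Type": "HOA_DUE_AMOUNT", "Score": 0.40},
]


def best_entity_per_type(entities, min_score=0.5):
    """Return {type: text} using the highest-scoring entity of each type."""
    best = {}
    for entity in entities:
        if entity["Score"] < min_score:
            continue  # ignore low-confidence detections
        current = best.get(entity["Type"])
        if current is None or entity["Score"] > current["Score"]:
            best[entity["Type"]] = entity
    return {entity_type: entity["Text"] for entity_type, entity in best.items()}


extracted = best_entity_per_type(sample_entities)
print(extracted)
```

With the DataFrame returned by `get_entities`, the same helper could be fed `resp.to_dict("records")`.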

---

# Conclusion

In this notebook, we saw how to train an Amazon Comprehend custom entity recognizer to detect custom entities in documents containing dense text. We used entity lists to train the model and then deployed the trained model to an endpoint. Finally, we used the endpoint to detect our custom entities in the text that Amazon Textract extracted from our sample HOA letter.