# Mortgage Document Classification

---

## Setup Notebook

In this notebook, we train an Amazon Comprehend custom classifier and deploy it behind an endpoint. We then use the endpoint to test document classification. Let's start by installing and importing the required libraries.

In [None]:
!python -m pip install -q amazon-textract-response-parser --upgrade --force-reinstall
!python -m pip install -q amazon-textract-caller --upgrade --force-reinstall
!python -m pip install -q amazon-textract-prettyprinter --upgrade --force-reinstall

In [None]:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string
from trp import Document

In [None]:
import boto3
import botocore
import sagemaker
import os
import io
import datetime
import pandas as pd
from PIL import Image as PILImage  # aliased so that IPython's Image (imported below) does not shadow it
from pathlib import Path
import multiprocessing as mp
from sagemaker import get_execution_role
from IPython.display import Image, display, HTML, JSON

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)

---

## Amazon Comprehend Custom Classification

At the beginning of the document processing stage, it may not be obvious which documents are present in the mortgage packet. Using an Amazon Comprehend custom classifier, we first classify each document into its respective class. Once we know which documents are present in the packet, we can run validations such as checking for missing required documents, and we can route specific document types, such as ID documents, to type-specific extraction. The figure below illustrates this process.

<p align="center">
  <img src="./images/classification.png" alt="cfn1" width="800px"/>
</p>
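
Once each document in a packet has a class, the missing-document validation mentioned above could look roughly like the following. This is a minimal sketch: the `REQUIRED_DOCUMENTS` set and the `missing_documents` helper are hypothetical names chosen for illustration, and the actual list of required documents depends on your own process.

```python
# Hypothetical set of document classes that every packet must contain.
REQUIRED_DOCUMENTS = frozenset({"PAYSTUB", "W2", "ID"})

def missing_documents(classified_labels, required=REQUIRED_DOCUMENTS):
    """Return the required classes that were not found in the packet."""
    # Normalize case so that 'w2' and 'W2' are treated as the same class.
    present = {label.upper() for label in classified_labels}
    return sorted(required - present)

# Labels that a classifier might have assigned to the documents in one packet.
packet_labels = ["w2", "paystub", "1099-DIV"]
missing = missing_documents(packet_labels)
print(missing)
```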



---
## Data preparation

To train an Amazon Comprehend [custom classifier](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html), we need the training data as UTF-8 encoded plain text. We are going to train the classifier model in "[multi-class](https://docs.aws.amazon.com/comprehend/latest/dg/prep-classifier-data-multi-class.html)" mode by providing it with a CSV file of labeled data.

The CSV file must contain data in the following format:

```
label, document
```

Here, `label` is the document class, for example `PAYSTUB`, `W2`, or `1099-DIV`, and `document` is the plain text extracted from each document. Since our documents are either image files or PDF files, we will use the Amazon Textract `DetectDocumentText` API to extract the text from these documents and then prepare a CSV file.

For example, in the code cell below we take the W2 document and extract its plain text to create a comma-separated row, where the first field is the document label, i.e. `w2`, and the second field is the text of the document extracted by Amazon Textract.

In [None]:
response = call_textract(input_document='docs/W2.jpg')
# use the pretty printer to get all the lines of text
lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
# first field is the label, second field is the extracted text
row = ['w2', lines]

row

The output of the code cell above corresponds to a single CSV row for a W2 document. Since Comprehend classification model training requires a minimum of 10 sample documents per class, you would repeat this process for the remaining W2 sample documents to generate a CSV file with 10 rows. You would then repeat the process for every other document class to generate the corresponding CSV data.
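
As a rough illustration of that repetition, the sketch below writes several `(label, text)` pairs to CSV with Python's standard `csv` module. The helper names and the sample texts are hypothetical stand-ins for real Textract output, and collapsing each document's text onto a single line is an assumption made here to keep one sample per physical line; it is not something the lab's provided file is known to do.

```python
import csv
import io

def to_csv_row(label, text):
    """Return one [label, document] row with the text collapsed to a single line."""
    # Assumption: replacing newlines and repeated spaces keeps one sample per line.
    return [label, " ".join(text.split())]

def write_training_csv(samples, out_file):
    """Write (label, text) pairs as CSV rows, without a header line."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_MINIMAL)
    for label, text in samples:
        writer.writerow(to_csv_row(label, text))

# Hypothetical sample texts standing in for text extracted by Amazon Textract.
samples = [
    ("w2", "Form W-2 Wage and Tax Statement\nEmployee's social security number"),
    ("paystub", "Pay Period: 01/01 - 01/15\nNet Pay, 1,234.56"),
]
buffer = io.StringIO()
write_training_csv(samples, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Using `csv.writer` rather than manual string joining means that commas and quotes inside the document text are escaped automatically.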

For the purposes of this lab, we have provided a CSV file that is ready to be used to train the model. It contains 10 or more samples for each of the document types that we want the model to recognize.

In [None]:
comprehend_df = pd.read_csv('data/comp_class_training_data.csv', sep=",", names=['label','document'])
comprehend_df
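
Before training, it can be worth confirming that every class meets the 10-sample minimum. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the in-memory CSV is made-up data so that the example runs on its own.

```python
import csv
import io
from collections import Counter

MIN_SAMPLES_PER_CLASS = 10  # minimum number of samples per class stated above

def count_labels(csv_file):
    """Count how many rows each label (first CSV column) has."""
    reader = csv.reader(csv_file)
    return Counter(row[0].strip() for row in reader if row)

def underrepresented(counts, minimum=MIN_SAMPLES_PER_CLASS):
    """Return the labels that have fewer than `minimum` samples."""
    return sorted(label for label, n in counts.items() if n < minimum)

# Made-up data: ten 'w2' rows and only three 'paystub' rows.
sample_csv = io.StringIO("w2,text a\n" * 10 + "paystub,text b\n" * 3)
counts = count_labels(sample_csv)
short = underrepresented(counts)
print(counts, short)
```

To check the lab's file instead, pass `open('data/comp_class_training_data.csv', newline='')` to `count_labels`.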

Let's upload this training data CSV file to our S3 bucket.

In [None]:
# Upload data to S3 bucket:
!aws s3 cp data/comp_class_training_data.csv s3://{data_bucket}/idp-mortgage/comprehend/comp_class_training_data.csv

---
## Create Comprehend classifier training job

We will now initiate an Amazon Comprehend custom classifier training job with the training data. Note that the training job may take up to 30 minutes to complete.

In [None]:
# Create a document classifier.
# `account_id`, `region`, and `role` are already defined in the setup cell above.
document_classifier_name = 'Mortgage-Demo-Doc-Classifier'
document_classifier_version = 'v1'
document_classifier_arn = ''

try:
    create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/idp-mortgage/comprehend/comp_class_training_data.csv'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )
    
    document_classifier_arn = create_response['DocumentClassifierArn']
    
    print(f"Comprehend Custom Classifier created with ARN: {document_classifier_arn}")
except botocore.exceptions.ClientError as error:  # only ClientError carries a `.response` dict
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'A classifier with the name "{document_classifier_name}" already exists.')
        document_classifier_arn = f'arn:aws:comprehend:{region}:{account_id}:document-classifier/{document_classifier_name}/version/{document_classifier_version}'
        print(f'The classifier ARN is: "{document_classifier_arn}"')
    else:
        print(error)

### Monitor training status of the training job

Run the code cell below to monitor the status of the training job.

In [None]:
%%time
# Poll until training finishes; as noted above, this can take up to 30 minutes
from IPython.display import clear_output
import time

# Use the ARN resolved in the previous cell so that this also works when the classifier already existed
jobArn = document_classifier_arn

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_custom_classifier = comprehend.describe_document_classifier(
        DocumentClassifierArn = jobArn
    )
    status = describe_custom_classifier["DocumentClassifierProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document classifier: {status}")
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)
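
The same polling pattern is reused for the endpoint deployment later in this notebook, so it could be factored into a small helper. The sketch below is one possible shape for it: `wait_for_status` is a hypothetical name, and the status source is passed in as a callable so that the example can run against a fake sequence of statuses instead of a real AWS call.

```python
import time

def wait_for_status(get_status, terminal_states, timeout_s=3 * 60 * 60,
                    interval_s=60, sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still in status {status!r} after {timeout_s} seconds")
        sleep(interval_s)

# Fake status source standing in for comprehend.describe_document_classifier(...).
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_states={"TRAINED", "IN_ERROR"},
    interval_s=0,  # no delay needed for the fake source
)
print(final_status)
```

With the real client, `get_status` would be a function that calls `comprehend.describe_document_classifier` and returns `["DocumentClassifierProperties"]["Status"]`.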
    

---
## Deploy model to endpoint

Once the classifier training job is complete, we can deploy the trained model to an Amazon Comprehend [real-time endpoint](https://docs.aws.amazon.com/comprehend/latest/dg/manage-endpoints.html). We can then call this endpoint with the text of a document to identify which class the document belongs to.

Run the following code cell to deploy an endpoint with the trained model.

In [None]:
#create comprehend endpoint
model_arn = document_classifier_arn
ep_name = 'mortgage-doc-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ENDPOINT_ARN=endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ENDPOINT_ARN}')    
except botocore.exceptions.ClientError as error:  # only ClientError carries a `.response` dict
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:document-classifier-endpoint/{ep_name}'
        print(f'The classifier endpoint ARN is: "{ENDPOINT_ARN}"')
        %store ENDPOINT_ARN
    else:
        print(error)
    

### Monitor the status of endpoint deployment

Run the code cell below to monitor the status of the endpoint deployment.

In [None]:
%%time
# Poll until the endpoint deployment completes; this can take up to 10 minutes
from IPython.display import clear_output
import time

# Use the ARN resolved in the previous cell so that this also works when the endpoint already existed
ep_arn = ENDPOINT_ARN

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    # `datetime` is the module imported during setup; importing the class here would shadow it
    now = datetime.datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_endpoint_resp = comprehend.describe_endpoint(
        EndpointArn=ep_arn
    )
    status = describe_endpoint_resp["EndpointProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Endpoint status: {status}")
    
    if status == "IN_SERVICE" or status == "FAILED":
        break
        
    time.sleep(10)
    

---
## Classifying Documents

Once our model is deployed behind an endpoint, it's time to test it. We will pick a document from our `/docs` directory, call the endpoint, and analyze the Comprehend classifier output.

We must first extract the text from this document using Amazon Textract and then pass the extracted text to the Comprehend classifier.

In [None]:
#extract text using Amazon Textract
response = call_textract(input_document='./docs/1099-DIV.jpg')
text = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
print(text)

In [None]:
# Classify the document using the extracted text
classification_response = comprehend.classify_document(Text=text, EndpointArn=ENDPOINT_ARN)
classification_response

In the response, the class with the highest score should be `1099-DIV`, which indicates that the classifier has identified the document correctly.
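
To pick the predicted class programmatically, the response can be reduced to its highest-scoring entry. This is a minimal sketch that assumes the response carries a `Classes` list of `Name` and `Score` pairs; `top_class` is a hypothetical helper, and the scores in `sample_response` are made up for illustration rather than taken from a real run.

```python
def top_class(response):
    """Return (name, score) of the highest-scoring class in a classification response."""
    classes = response.get("Classes", [])
    if not classes:
        raise ValueError("response contains no classes")
    best = max(classes, key=lambda c: c["Score"])
    return best["Name"], best["Score"]

# Made-up response in the shape returned by classify_document.
sample_response = {
    "Classes": [
        {"Name": "1099-DIV", "Score": 0.97},
        {"Name": "W2", "Score": 0.02},
        {"Name": "PAYSTUB", "Score": 0.01},
    ]
}
name, score = top_class(sample_response)
print(f"{name} ({score:.2f})")
```

In the notebook, `top_class(classification_response)` would apply the same logic to the real response.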

---
## Conclusion

In this notebook, we saw how to train an Amazon Comprehend custom classifier with sample documents and deploy the trained model behind an endpoint. We then extracted the text from a document that we wanted to classify and passed that plain text to the Comprehend classifier endpoint, which returned JSON output listing the classes along with a confidence score for each class being the correct one for this document.