# Intelligent Document Processing

Documents contain valuable information and come in various shapes and forms. In most cases, you process these documents manually, which is time consuming, error prone, and costly. Not only do you want this information extracted quickly, you also want to automate business processes that presently rely on manual input and intervention across various file types and formats.

To help you overcome these challenges, AWS Machine Learning (ML) now gives you options for extracting information from complex content in any document format, such as insurance claims, mortgages, healthcare claims, and legal contracts.

The diagram below shows an architecture for an intelligent document processing workflow. It starts with a data capture stage that securely stores and aggregates documents of different types (PDF, PNG, JPEG, and TIFF), formats, and layouts. Next, the documents are classified, text and key insights are extracted from them, and the documents are further enriched (for example, by identifying entities or redacting content). Finally, the verification and review stage involves manual review of the documents for quality and accuracy, followed by consumption of the documents and extracted information by downstream databases and applications.

In this workshop, we will explore the various aspects of this workflow such as the document classification, text and insights extraction, enrichments, and human review.

![Arch](./images/idp.png)



# Document Classification
In this lab, we will walk you through hands-on document classification using an Amazon Comprehend custom classifier. We will first use Amazon Textract to extract the text from our documents, label that text, and then use the labeled data to train our Amazon Comprehend custom classifier. We will then classify our documents with the trained classifier, either through an asynchronous classification job or, optionally, through an Amazon Comprehend real-time endpoint.


<p align="center">
  <img src="./images/Insurance_doc_classify.png" alt="IDP Classify" width="900px"/>
</p>


- [Step 1: Create Amazon Comprehend Classification Training Job](#step1)
- [Step 2: Create Amazon Comprehend real time endpoint](#step2)
- [Step 3: Classify Documents using the real-time endpoint](#step3)



---

# Step 1: Create Amazon Comprehend Classification Training Job <a id="step1"></a>

In this step, we will import some necessary libraries that will be used throughout this notebook.

We will then use a prepared dataset, of the appropriate filetype (.csv) and structure - one column containing the raw text of a document, and the other column containing the label of that document.

For detailed steps on preparing data for an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html), see this [notebook](https://github.com/aws-samples/aws-ai-intelligent-document-processing/blob/main/industry/mortgage/01-document-classification.ipynb) from our mortgage blog series. Note: While the documents may be different, the process is identical from a code perspective.
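As a rough illustration of the two-column structure described above, the sketch below serializes a few labeled documents to CSV. It assumes the label goes in the first column, the document text in the second, and that the file has no header row; check the Amazon Comprehend documentation for the exact layout your classifier mode expects. The helper name `to_comprehend_csv` and the sample texts are hypothetical.

```python
import csv
import io


def to_comprehend_csv(rows):
    """Serialize (label, text) pairs as header-less CSV, one document per row.

    Assumption: label in column 1, text in column 2, no header row.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for label, text in rows:
        # Collapse internal newlines so each document stays on a single CSV row.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


# Hypothetical sample documents; the labels mirror classes used in this lab.
sample_rows = [
    ("INVOICE_RECEIPT", "Invoice number 1001\nAmount due: 250.00"),
    ("LICENSE", "Driver license, class C"),
]
csv_text = to_comprehend_csv(sample_rows)
print(csv_text)
```

Text that contains commas is quoted by the `csv` module, so a standard CSV reader still sees exactly two columns per row.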

In [3]:
!python -m pip install amazon-textract-response-parser --upgrade
!python -m pip install amazon-textract-caller --upgrade
!python -m pip install amazon-textract-prettyprinter --upgrade


In [31]:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string
from trp import Document

If the import statements above fail, restart the notebook kernel by clicking the circular arrow button at the top of the notebook.

In [30]:
import boto3
import botocore
import sagemaker
import os
import io
import datetime
import pandas as pd
from PIL import Image as PILImage  # aliased: IPython.display.Image below would otherwise shadow it
from pathlib import Path
import multiprocessing as mp
from IPython.display import Image, display, HTML, JSON

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)

SageMaker role is: arn:aws:iam::710096454740:role/service-role/AmazonSageMaker-ExecutionRole-20220504T135260
Default SageMaker Bucket: s3://sagemaker-us-east-1-710096454740


We will use the pre-prepared dataset and upload it to Amazon S3. The dataset is in `CSV` format and is named `comprehend_train_data.csv`. Note that you can have more than one `CSV` file in an S3 bucket for training a Comprehend custom classifier. If you have more than one file, you can specify only the bucket/prefix in the call to train the custom classifier. Amazon Comprehend will automatically use all the files under the bucket/prefix for training purposes.
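The difference between pointing at a single file and pointing at a whole prefix comes down to the `S3Uri` string you pass. A minimal sketch, where `training_s3_uri` is a hypothetical helper and `my-bucket` is a placeholder bucket name:

```python
def training_s3_uri(bucket, prefix, filename=None):
    """Build the S3Uri for training data: one object, or everything under a prefix."""
    prefix = prefix.strip("/")
    if filename is None:
        # No filename: return the prefix itself so all files under it are used.
        return f"s3://{bucket}/{prefix}/"
    return f"s3://{bucket}/{prefix}/{filename}"


# "my-bucket" is a placeholder; in this notebook the bucket is `data_bucket`.
single_file_uri = training_s3_uri("my-bucket", "idp/comprehend", "comprehend_train_data.csv")
whole_prefix_uri = training_s3_uri("my-bucket", "idp/comprehend")
print(single_file_uri)
print(whole_prefix_uri)
```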

The following code cells upload the training data to the S3 bucket and create a Comprehend custom classifier. You can also create a custom classifier manually; see the subsequent sections for instructions.

In [6]:
# Upload Comprehend training data to S3
key='idp/comprehend/comprehend_train_data.csv'
s3.upload_file(Filename='./dataset/comprehend_train_data.csv', 
               Bucket=data_bucket, 
               Key=key)


---

Once we have a labeled dataset ready, we will create and train an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) with the dataset.

### Create Amazon Comprehend custom classification Training Job

<div class="alert alert-block alert-warning"> <b style="font-size: 24px">💡 NOTE:</b> <p style="font-size: 20px">Executing the model training code block below starts a training job that can take 40 to 60 minutes or longer to complete. To save time, you can skip to the "Import an existing classification model" section to import a pre-trained Comprehend classifier model.</p> </div>

We will use Amazon Comprehend's Custom Classification to train our own model for classifying the documents. We will use the Amazon Comprehend `CreateDocumentClassifier` API to create a classifier, which trains a custom model using the labeled CSV file we uploaded above. The training data contains text that was extracted using Amazon Textract and then labeled.

In [7]:
# Create a document classifier
account_id = boto3.client('sts').get_caller_identity().get('Account')
run_id = str(int(datetime.datetime.now().timestamp()))  # epoch seconds; portable, and avoids shadowing the built-in id()

document_classifier_name = 'insurance-doc-classifier-idp'
document_classifier_version = 'v1'
document_classifier_arn = ''
response = None

try:
    create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/{key}'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )
    
    document_classifier_arn = create_response['DocumentClassifierArn']
    
    print(f"Comprehend Custom Classifier created with ARN: {document_classifier_arn}")
except botocore.exceptions.ClientError as error:  # only ClientError carries a .response dict
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'A classifier with the name "{document_classifier_name}" already exists.')
        document_classifier_arn = f'arn:aws:comprehend:{region}:{account_id}:document-classifier/{document_classifier_name}/version/{document_classifier_version}'
        print(f'The classifier ARN is: "{document_classifier_arn}"')
    else:
        print(error)

Comprehend Custom Classifier created with ARN: arn:aws:comprehend:us-east-1:710096454740:document-classifier/insurance-doc-classifier-idp/version/v1


In [8]:
%store document_classifier_arn


Stored 'document_classifier_arn' (str)
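The fallback branch above rebuilds the classifier ARN from its parts when the classifier already exists. A small sketch of that string layout, following the ARN patterns that appear in this notebook's code and output; the function names are hypothetical, the `aws` partition is assumed, and the account ID is a placeholder:

```python
def classifier_arn(region, account_id, name, version=None):
    """Compose a Comprehend document classifier ARN (assumes the 'aws' partition)."""
    arn = f"arn:aws:comprehend:{region}:{account_id}:document-classifier/{name}"
    if version:
        # Versioned classifiers append /version/<name> to the base ARN.
        arn += f"/version/{version}"
    return arn


def endpoint_arn(region, account_id, endpoint_name):
    """Compose a Comprehend classifier endpoint ARN (assumes the 'aws' partition)."""
    return f"arn:aws:comprehend:{region}:{account_id}:document-classifier-endpoint/{endpoint_name}"


# 123456789012 is a placeholder account ID.
example_classifier = classifier_arn("us-east-1", "123456789012", "insurance-doc-classifier-idp", "v1")
example_endpoint = endpoint_arn("us-east-1", "123456789012", "insurance-custom-classifier-endpoint")
print(example_classifier)
print(example_endpoint)
```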


Check status of the Comprehend Custom Classification Job

In [9]:
%%time
# Poll until training completes; this can take 40 to 60 minutes or longer
from IPython.display import clear_output
import time
from datetime import datetime

jobArn = document_classifier_arn  # set in both the create and the already-exists paths above

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_custom_classifier = comprehend.describe_document_classifier(
        DocumentClassifierArn = jobArn
    )
    status = describe_custom_classifier["DocumentClassifierProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document classifier: {status}")
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)
    

22:22:58 : Custom document classifier: TRAINED
CPU times: user 1.61 s, sys: 191 ms, total: 1.8 s
Wall time: 1h 45min 10s
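The same poll-until-done pattern is used several times in this notebook (training, classification job, endpoint creation). One way to factor it out is sketched below; `wait_for_status` is a hypothetical helper, and the status source in the demo is a stand-in rather than a real Comprehend call.

```python
import time


def wait_for_status(get_status, terminal_states, timeout_s=3 * 60 * 60,
                    interval_s=60, sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a stand-in status source and a no-op sleep so it finishes instantly.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_states={"TRAINED", "IN_ERROR"},
    sleep=lambda seconds: None,
)
print(final_status)
```

In the notebook, `get_status` could wrap the `describe_document_classifier` call used above and return `["DocumentClassifierProperties"]["Status"]` from its response.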



Alternatively, to create a Comprehend custom classifier job manually using the console, go to the [Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#classification)
  
- On the left menu click "Custom Classification"
- In the "Classifier models" section, click on "Create new model"
- In Model Setting for Model name, enter a name 
- In Data Specification; select "Using Single-label" mode and for Data format select CSV file
- For Training dataset browse to your data-bucket created above and select the file `comprehend_train_data.csv`
- For IAM role select "Create an IAM role" and specify a prefix (this will create a new IAM Role for Comprehend)
- Click create

This job can take ~30 minutes to complete. Once the training job is completed, move on to the next step.

---
# Import an existing classification model

You can import a trained classification model from a different AWS account using the Amazon Comprehend `ImportModel` API.

In [None]:
import_response = comprehend.import_model(
    SourceModelArn='arn:aws:comprehend:us-east-1:710096454740:document-classifier/insurance-doc-classifier-idp/version/v1',
    ModelName='insurance-doc-classifier-idp',
    VersionName='v1'
)
document_classifier_arn = import_response['ModelArn']
document_classifier_arn

---
# Step 2: Classify Documents using the custom classifier <a id="step3"></a>

In this final step we will use the Comprehend classifier model that we just trained/imported to classify a group of un-identified documents. We will use Comprehend [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) API to run an asynchronous job that will classify our documents. Note that an asynchronous classification job is capable of reading PDF, JPG, PNG and TIFF files since it can use Amazon Textract behind the scenes to extract the text from the documents.

In this notebook, we supply the input to the asynchronous classification job as UTF-8 encoded plain-text files, so we will first extract the text from our sample documents with Textract and upload it to S3. We will use `ONE_DOC_PER_FILE` mode, which signifies that each plain-text file is a single document. The other mode is `ONE_DOC_PER_LINE`, which means every line in the plain-text file is a document; it is best suited for small documents such as product reviews or customer service chat transcripts. For more on this, see the [documentation](https://docs.aws.amazon.com/comprehend/latest/dg/how-class-run.html).
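To make the two input layouts concrete, here is a minimal sketch that writes the same documents both ways to a temporary directory. The helper names and sample texts are hypothetical; only the layout (one file per document versus one line per document) reflects the modes described above.

```python
import tempfile
from pathlib import Path

# Hypothetical short documents keyed by name.
docs = {
    "review-1": "Great service.\nWould recommend.",
    "review-2": "Claim took too long.",
}


def write_one_doc_per_file(docs, out_dir):
    """ONE_DOC_PER_FILE layout: each document gets its own UTF-8 .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in docs.items():
        path = out_dir / f"{name}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def write_one_doc_per_line(docs, out_file):
    """ONE_DOC_PER_LINE layout: one file, one document per line."""
    # Inner newlines are collapsed so a document never spans two lines.
    lines = [" ".join(text.split()) for text in docs.values()]
    out_file = Path(out_file)
    out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    per_file_paths = write_one_doc_per_file(docs, Path(tmp) / "per_file")
    per_line_path = write_one_doc_per_line(docs, Path(tmp) / "per_line.txt")
    per_file_count = len(per_file_paths)
    per_line_rows = per_line_path.read_text(encoding="utf-8").splitlines()

print(per_file_count, per_line_rows)
```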

In [33]:
# Convert images to .txt
import os
import shutil
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

listfiles = os.listdir('./dataset/document_samples')
os.makedirs('./dataset/document_samples/txt', exist_ok=True)

for imagefile in listfiles:
    if imagefile != '.ipynb_checkpoints':
        # use amazon-textract-caller to call Textract DetectDocumentText
        response = call_textract(input_document=f'./dataset/document_samples/{imagefile}')
        # use the pretty printer to get all the lines in the document
        lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
        print(f"Writing plaintext file for {imagefile}")
        with open(f'./dataset/document_samples/txt/{imagefile}.txt', "w") as text_file:
            text_file.write(lines)
        

!rm -rf ./dataset/document_samples/txt/.ipynb_checkpoints
!aws s3 sync ./dataset/document_samples/txt s3://{data_bucket}/idp/comprehend/sampledocs/txt
    
#delete converted files
shutil.rmtree('./dataset/document_samples/txt')

upload: dataset/document_samples/txt/CMS1500.png.txt to s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/txt/CMS1500.png.txt
upload: dataset/document_samples/txt/dr-note-sample.png.txt to s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/txt/dr-note-sample.png.txt
upload: dataset/document_samples/txt/drivers_license.png.txt to s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/txt/drivers_license.png.txt
upload: dataset/document_samples/txt/insurance_invoice.png.txt to s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/txt/insurance_invoice.png.txt
upload: dataset/document_samples/txt/discharge-summary.png.txt to s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/txt/discharge-summary.png.txt
upload: dataset/document_samples/txt/insurance_card.png.txt to s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/txt/insurance_card.png.txt


### Start the async Comprehend classification job

Using the `StartDocumentClassificationJob` API.

In [36]:
import time

jobname = f'insurance-classification-job-{time.time()}'
print(f'Starting Comprehend Classification job {jobname} with model {document_classifier_arn}')

classify_response = comprehend.start_document_classification_job(
    JobName=jobname,
    DocumentClassifierArn=document_classifier_arn,
    InputDataConfig={
        'S3Uri': f's3://{data_bucket}/idp/comprehend/sampledocs/txt',
        'InputFormat': 'ONE_DOC_PER_FILE'
    },
    OutputDataConfig={
        'S3Uri': f's3://{data_bucket}/idp/comprehend/sampledocs/classifieroutput'
    },
    DataAccessRoleArn=role
)
classify_response

Starting Comprehend Classification job insurance-classification-job-1664815530.9945142...


{'JobId': 'c8b7b3b71d8cf04d35c7f510a2f17513',
 'JobArn': 'arn:aws:comprehend:us-east-1:710096454740:document-classification-job/c8b7b3b71d8cf04d35c7f510a2f17513',
 'JobStatus': 'SUBMITTED',
 'ResponseMetadata': {'RequestId': 'c776f7f6-abf0-4639-bf7d-6c010be2bc4e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c776f7f6-abf0-4639-bf7d-6c010be2bc4e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '182',
   'date': 'Mon, 03 Oct 2022 16:45:30 GMT'},
  'RetryAttempts': 0}}

### Check status of the classification job

If the job completes then download the output predictions file.

In [58]:
%%time
# Poll until the classification job completes or fails
from IPython.display import clear_output
import time
from datetime import datetime
import tarfile
import os

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_job = comprehend.describe_document_classification_job(
        JobId=classify_response['JobId']
    )
    status = describe_job["DocumentClassificationJobProperties"]["JobStatus"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document classifier Job: {status}")
    
    if status == "COMPLETED" or status == "FAILED":
        if status == "COMPLETED":
            classify_output_file = describe_job["DocumentClassificationJobProperties"]["OutputDataConfig"]["S3Uri"]
            print(f'Output generated - {classify_output_file}')
            !mkdir -p classifyoutput
            !aws s3 cp {classify_output_file} ./classifyoutput
            
            opfile = os.path.basename(classify_output_file)
            # open file
            file = tarfile.open(f'./classifyoutput/{opfile}')
            # extracting file
            file.extractall('./classifyoutput')
            file.close()
            
            with open('./classifyoutput/predictions.jsonl') as f:
                classification_predictions = f.readlines()
        else:
            print("Classification job failed")
            print(describe_job)
        break
        
    time.sleep(10)

17:37:03 : Custom document classifier Job: COMPLETED
Output generated - s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/classifieroutput/710096454740-CLN-c8b7b3b71d8cf04d35c7f510a2f17513/output/output.tar.gz
download: s3://sagemaker-us-east-1-710096454740/idp/comprehend/sampledocs/classifieroutput/710096454740-CLN-c8b7b3b71d8cf04d35c7f510a2f17513/output/output.tar.gz to classifyoutput/output.tar.gz
CPU times: user 40.2 ms, sys: 16.7 ms, total: 56.9 ms
Wall time: 1.35 s


### Review the classification output file

The output file contains JSON Lines, one JSON object per line. Each inference result contains the name of the file, the predicted classes, and the score for each class. The class with the highest score is the class the document belongs to.
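Picking the highest-scoring class from each line can be sketched as follows. The sample lines are hand-written to mirror the `File` and `Classes` fields read by this notebook's code, using scores from the run shown in this section; the real `predictions.jsonl` may carry additional fields, and `top_class` is a hypothetical helper.

```python
import json

# Hand-written lines mirroring the fields used in this notebook.
sample_lines = [
    '{"File": "insurance_card.png.txt", "Classes": ['
    '{"Name": "INSURANCE_ID", "Score": 0.9958}, '
    '{"Name": "CMS1500", "Score": 0.0011}, '
    '{"Name": "LICENSE", "Score": 0.001}]}',
    '{"File": "drivers_license.png.txt", "Classes": ['
    '{"Name": "LICENSE", "Score": 0.9969}, '
    '{"Name": "INSURANCE_ID", "Score": 0.0008}, '
    '{"Name": "CMS1500", "Score": 0.0006}]}',
]


def top_class(jsonl_line):
    """Return (file name, best class name, best score) for one JSON line."""
    record = json.loads(jsonl_line)
    best = max(record["Classes"], key=lambda c: c["Score"])
    return record["File"], best["Name"], best["Score"]


top_predictions = [top_class(line) for line in sample_lines]
for file_name, name, score in top_predictions:
    print(f"{file_name}: {name} ({score})")
```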

In [66]:
import json

for predictions in classification_predictions:
    pred = json.loads(predictions)
    print(f"File: {pred['File']}")
    for classification in pred['Classes']:
        print(f"\t - Class: {classification['Name']}, Score: {classification['Score']}")
    print("\n")

File: dr-note-sample.png.txt
	 - Class: DISCHARGE_SUMMARY, Score: 0.9986
	 - Class: MEDICAL_TRANSCRIPTION, Score: 0.0008
	 - Class: INSURANCE_ID, Score: 0.0002


File: insurance_card.png.txt
	 - Class: INSURANCE_ID, Score: 0.9958
	 - Class: CMS1500, Score: 0.0011
	 - Class: LICENSE, Score: 0.001


File: CMS1500.png.txt
	 - Class: CMS1500, Score: 0.991
	 - Class: PASSPORT, Score: 0.0026
	 - Class: INSURANCE_ID, Score: 0.0025


File: drivers_license.png.txt
	 - Class: LICENSE, Score: 0.9969
	 - Class: INSURANCE_ID, Score: 0.0008
	 - Class: CMS1500, Score: 0.0006


File: discharge-summary.png.txt
	 - Class: DISCHARGE_SUMMARY, Score: 0.9958
	 - Class: INSURANCE_ID, Score: 0.0033
	 - Class: INVOICE_RECEIPT, Score: 0.0003


File: insurance_invoice.png.txt
	 - Class: INVOICE_RECEIPT, Score: 0.9997
	 - Class: INSURANCE_ID, Score: 0.0001
	 - Class: CMS1500, Score: 0.0




---
# Step 2 (_optional_): Create Amazon Comprehend real time endpoint <a id="step2"></a>

<div class="alert alert-block alert-warning">
    <b>⚠️ Note:</b> Creation of a real-time endpoint can take up to 15 minutes.
</div>


Once our Comprehend custom classifier is fully trained (i.e. status = `TRAINED`), you can also create a real-time endpoint and use it to classify documents in real time. The following code cells use the `comprehend` Boto3 client to create an endpoint, but you can also create one manually via the console. Instructions on how to do that can be found in the subsequent section.

In [11]:
#create comprehend endpoint
model_arn = document_classifier_arn
ep_name = 'insurance-custom-classifier-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ENDPOINT_ARN=endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ENDPOINT_ARN}')    
except botocore.exceptions.ClientError as error:  # only ClientError carries a .response dict
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:document-classifier-endpoint/{ep_name}'
        print(f'The classifier endpoint ARN is: "{ENDPOINT_ARN}"')
        %store ENDPOINT_ARN
    else:
        print(error)
    

Endpoint created with ARN: arn:aws:comprehend:us-east-1:710096454740:document-classifier-endpoint/insurance-custom-classifier-endpoint


In [12]:
%store ENDPOINT_ARN

Stored 'ENDPOINT_ARN' (str)


In [None]:
display(endpoint_response)


Alternatively, use the steps below to create a Comprehend endpoint using the AWS console.

- Go to [Comprehend on AWS Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints) and click on Endpoints in the left menu.
- Click on "Create endpoint"
- Give an Endpoint name; for Custom model type select Custom classification; for version select no version or the latest version of the model.
- For Classifier model select from the drop down menu
- For Inference Unit select 1
- Check "Acknowledge"
- Click "Create endpoint"

[It may take ~15 minutes](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints) for the endpoint to get created. The code cell below checks the creation status.


In [None]:
%%time
# Poll until the endpoint is in service or creation fails; this can take ~15 minutes
from IPython.display import clear_output
import time
from datetime import datetime

ep_arn = ENDPOINT_ARN  # set in both the create and the already-exists paths above

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_endpoint_resp = comprehend.describe_endpoint(
        EndpointArn=ep_arn
    )
    status = describe_endpoint_resp["EndpointProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom classifier endpoint: {status}")
    
    if status == "IN_SERVICE" or status == "FAILED":
        break
        
    time.sleep(10)
    

---
# Step 3 (_optional_): Classify Documents using the real-time endpoint <a id="step3"></a>

<div class="alert alert-block alert-warning">
    <b>⚠️ Note:</b> Execute this section only if you have created a real-time endpoint with the Amazon Comprehend custom classifier model.
</div>

Once the endpoint has been created, we will take the mix of sample documents under the `./dataset/document_samples/` directory and classify each one into a class the model was trained on, such as `CMS1500`, `LICENSE`, or `INSURANCE_ID`.

In [None]:
root = "./dataset/document_samples/"
files = []

for file in os.listdir(root):
    if not file.startswith('.'):
        files.append(f'./dataset/document_samples/{file}')

files_df = pd.DataFrame(files, columns=["Document"])
files_df

Let's view one of the documents

In [None]:
file = files_df.sample().iloc[0]['Document']
display(Image(filename=file, width=400, height=500))

Next, we extract text from the sample test documents using Amazon Textract. We will use `call_textract` from the `amazon-textract-caller` library, which accepts the path of a local document and calls the Textract `DetectDocumentText` API for us. Once the text is extracted, we will call Amazon Comprehend on each document to classify it.

#### Extract the text from all the sample documents in the list

We will create a small utility function that receives the path of a document, extracts its text with Textract, and returns the extracted text.

In [None]:
def extract_text(doc):
    response = call_textract(input_document=doc)
    page_string = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
    return page_string

In [None]:
files_df['DocText'] = files_df.apply(lambda row : extract_text(row['Document']), axis = 1)
files_df

We now have the extracted text in the dataframe for each of our documents. The next step is to use the Amazon Comprehend real-time endpoint to classify them. We will create a small utility function that performs the classification using the endpoint and returns the document type. Note that Comprehend returns all the document classes, each with a confidence score, as an array of key-value pairs (Name-Score). We will pick only the document class with the highest confidence score from the endpoint's response.

In [None]:
import time 
from datetime import datetime

def classify_doc(document):
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print(f'{current_time} : Processing')
    time.sleep(10)                 #to avoid Comprehend API throttling
    try:
        response = comprehend.classify_document(
            Text= document,
            EndpointArn=ENDPOINT_ARN
        )
        print(response)
        response_df = pd.DataFrame(response['Classes'])
        result = response_df.iloc[response_df['Score'].idxmax()] # choose the class with highest score        
        return result.Name                                       # return the corresponding class name
    except Exception as e:
        print(e)
        return 'error'
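The function above pauses for a fixed 10 seconds before every call to avoid throttling. As an alternative, a retry with exponential backoff waits only when a call is actually rejected. The sketch below is illustrative and is not part of the original workshop code: `ThrottledError` is a hypothetical stand-in exception, and `flaky_classify` simulates a call that is throttled twice before succeeding.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a service call."""


def call_with_backoff(func, max_attempts=5, base_delay_s=1.0,
                      sleep=time.sleep, retry_on=(ThrottledError,)):
    """Call func(), retrying on throttling with delays of 1x, 2x, 4x, ... base_delay_s."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)


# Demo: a simulated call that is throttled twice, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_classify():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("slow down")
    return "INSURANCE_ID"


# Record the delays instead of actually sleeping so the demo runs instantly.
result = call_with_backoff(flaky_classify, sleep=delays.append)
print(result, delays)
```

To use this with a real client, `retry_on` would need to be the exception type your client raises for throttling, which is an assumption to verify against the SDK you use.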

Let's now run the inference on our data.

In [None]:
# import time
files_df['DocType'] = files_df.apply(lambda row : classify_doc(row['DocText']), axis = 1)
files_df

We have now classified all these documents into their respective classes. Let's review to check if the classifier did a correct job of classifying.

In [None]:
document = files_df.iloc[0]['Document']
document_type = files_df.iloc[0]['DocType']

display(HTML(f'<h2>Document Category: "<u style="color:#00E676;">{document_type}</u>"</h2>'))
display(Image(filename=document, width=400, height=500))

---
# Conclusion

As we can see from above, our classifier was able to correctly classify the test document!

In this notebook, we learned how to train an Amazon Comprehend custom classifier using our pre-prepared dataset, which was constructed from sample documents by extracting their text with Amazon Textract and labeling the data in a CSV file. We then trained an Amazon Comprehend custom classifier with the extracted text and created an Amazon Comprehend classifier real-time endpoint to perform classification of documents.

We also learned how to import an existing trained model from a different account and run asynchronous inference with the imported model without having to create a real-time endpoint.

In the next notebook, we will look at a few methods to perform extraction of key insights from our documents using Amazon Textract.