# Guidance for Document Processing Using Amazon Bedrock Data Automation

Amazon Bedrock Data Automation (BDA) is a fully managed capability of Amazon Bedrock that streamlines the generation of valuable insights from unstructured, multimodal content such as documents, images, audio, and videos. With Amazon Bedrock Data Automation, you can build automated intelligent document processing (IDP), media analysis, and Retrieval-Augmented Generation (RAG) workflows quickly and cost-effectively.

This workbook focuses on using BDA to extract insights from unstructured documents. The use case is processing a loan application. We will process a packet of documents relevant to loans: ID cards, bank statements, W2 tax forms, pay stubs, and checks.


The diagram below shows an architecture for an intelligent document processing workflow. It is taken from the solution 'Guidance for Multimodal Data Processing Using Amazon Bedrock Data Automation', published [here](https://aws.amazon.com/solutions/guidance/multimodal-data-processing-using-amazon-bedrock-data-automation/).


![Arch](./images/a_lending_flow_architecture.png)


1. Document Upload: The data science team uploads sample documents to an Amazon S3 bucket.

2. Blueprint Configuration: The data science team uses the provided blueprints and creates new custom blueprints for each document class: W2, Pay Slip, Drivers License, 1099, and Bank Statement. Each sample is processed and its fields are extracted with generative AI prompts (e.g. First Name, Last Name, Gross Pay, SS Number, License Number, Capital Gains, Closing Balance). The blueprints are managed and stored in Amazon Bedrock Data Automation.

3. Test and Refine Blueprints: The blueprints are tested and refined. Key normalizations, key transformations, and key validations are added. 

4. Blueprint Published: The blueprints are managed and stored in Amazon Bedrock Data Automation.

5. Amazon EventBridge triggers an AWS Lambda function when documents are uploaded to Amazon S3, using an "Object Created" event. This Lambda function then utilizes Amazon Bedrock's Data Automation feature to process the uploaded documents. 

6. The processing workflow in Amazon Bedrock Data Automation includes document splitting based on logical boundaries, with each split containing up to 20 pages. Each page is classified into a specific document type and matched to the appropriate blueprint. The corresponding blueprint is then invoked for each page, executing key normalizations, transformations, and validations. The entire process operates asynchronously, allowing efficient handling of multiple documents and large data volumes.

7. BDA stores the results in an Amazon S3 bucket for later processing and triggers Amazon EventBridge.

8. An AWS Lambda function is triggered by Amazon EventBridge to process the JSON results of Amazon Bedrock Data Automation. The processing results are sent to downstream processing systems.
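As a rough illustration of step 5, the Lambda function could translate the S3 "Object Created" event delivered by EventBridge into a BDA invocation. This is a minimal sketch, not the published solution's code: the `INPUT_PREFIX` and `OUTPUT_PREFIX` values and the function names are assumptions chosen to match the paths used later in this workbook, and a production function would probably also pass a project ARN.

```python
INPUT_PREFIX = "data_automation/input/"    # assumption: only this prefix holds uploads
OUTPUT_PREFIX = "data_automation/output/"  # assumption: results are written here


def build_invocation_args(event):
    """Map an S3 'Object Created' EventBridge event to BDA invoke arguments.

    Returns None for keys outside INPUT_PREFIX, so result files written back
    to the same bucket do not trigger another invocation.
    """
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]
    if not key.startswith(INPUT_PREFIX):
        return None
    return {
        "inputConfiguration": {"s3Uri": f"s3://{bucket}/{key}"},
        "outputConfiguration": {"s3Uri": f"s3://{bucket}/{OUTPUT_PREFIX}"},
    }


def handler(event, context, run_client=None):
    """Lambda entry point; run_client is injectable so the logic can be tested offline."""
    args = build_invocation_args(event)
    if args is None:
        return {"skipped": True}
    if run_client is None:
        import boto3  # imported lazily so the module also loads without boto3
        run_client = boto3.client("bedrock-data-automation-runtime")
    response = run_client.invoke_data_automation_async(**args)
    return {"invocationArn": response["invocationArn"]}
```

The prefix check matters when input and output share a bucket: without it, each result file BDA writes would raise another "Object Created" event and re-enter the function.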


In this workshop, we will explore various aspects of this workflow, such as creating blueprints, processing sample documents, and page classification. We will process these documents:

1. ID Card
2. Bank Statements
3. W2 Tax forms
4. Pay Stubs
5. Check
6. Homeowner Insurance Application

We will then process a single PDF document with a 'loan application package', i.e. all 6 documents in one. 

This workbook follows these steps:

1. Step 1: Setup notebook and instantiate boto3 clients
2. Step 2: Process a simple PDF file using the standard output
3. Step 3: Create a Project, and blueprint for processing a Homeowner Insurance Form
4. Step 4: Add blueprints to the Automation Project
5. Step 5: Use our custom Blueprint to process a Homeowner Insurance Form
6. Step 6: Document Splitting - Process a Multi-Page Document Package
7. Step 7: Display the results
8. Step 8: Cleanup

## Prerequisites:

Before starting the workshop, you will need to create an Amazon SageMaker Studio notebook instance (see https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html ). For the IAM role, choose either an existing IAM role in your account or create a new role. The role must have the necessary permissions to invoke the BDA, SageMaker, and S3 APIs.

These IAM policies can be assigned to the role: AmazonBedrockFullAccess, AmazonS3FullAccess, AmazonSageMakerFullAccess, IAMReadOnlyAccess

Note: The AdministratorAccess IAM policy can be used, if allowed by security policies at your organization. 




# Step 1: Setup notebook and instantiate boto3 clients

In this step, we will import some necessary libraries that will be used throughout this notebook. 
To use Amazon Bedrock Data Automation (BDA) with boto3, you'll need to ensure you have the latest version of the AWS SDK for Python (boto3) installed.

Note: At the time of the Public Preview launch, BDA is available in us-west-2 only.

In [1]:
!pip install --upgrade boto3



In [2]:
import boto3, json
from time import sleep
import IPython.display as display
import sagemaker

print(boto3.__version__)

s3 = boto3.client('s3', region_name='us-west-2')
client = boto3.client('bedrock-data-automation', region_name='us-west-2')
run_client = boto3.client('bedrock-data-automation-runtime', region_name='us-west-2')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
1.35.96


We will give a unique name to our project and blueprint.

In [3]:
project_name = 'my-bda-lending-workbook-v1'
blueprint_name = 'my-insurance-blueprint-v1'
bucket_name = sagemaker.Session().default_bucket()
print(bucket_name)


sagemaker-us-west-2-824467037051


# Step 2: Process a simple PDF file using the standard output



In this step, we will process a W2 Tax form using BDA Standard Output. Standard output is the default way of interacting with Amazon Bedrock Data Automation (BDA). If you pass a document to the BDA API with no established blueprint or project it returns the default standard output for that file type. 

https://docs.aws.amazon.com/bedrock/latest/userguide/bda-standard-output.html

Standard Output has three levels of granularity. We will use the default. 

1. Element level granularity (default) – This provides the text of the document in the output format of your choice, separated into different elements such as figures, tables, or paragraphs. The elements are returned in logical reading order based on the structure of the document.

2. Page level granularity – This is enabled by default. Page level granularity provides each page of the document in the text output format of your choice.

3. Word level granularity – Provides information about individual words without using broader context analysis. Provides you with each word and its location on the page.



In [4]:
# Upload a W2 Form

file_name = 'documents/lending_package_w2.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output/'
s3.upload_file(file_name, bucket_name, object_name)

display.IFrame("documents/lending_package_w2.pdf", width=1000, height=500)

We will now invoke the BDA API to process the document.

In [5]:
response = run_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{object_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},)
response

invoke_arn = response['invocationArn']

The BDA call is asynchronous. We will poll until the operation is complete. 

In [6]:
in_progress = True
while in_progress:
    progress = run_client.get_data_automation_status(invocationArn=invoke_arn)
    if progress['status'] == 'InProgress':
        print(progress['status'])
        sleep(10)
    else:
        break
        
print(progress['status'])

InProgress
InProgress
Success
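The polling loop above runs until the status changes, with no upper bound on how long it waits. A hedged variant with a timeout is sketched below. The helper name `wait_for_invocation`, the default intervals, and the assumption that `Created` and `InProgress` are the only non-terminal statuses are illustrative choices, not something this workbook confirms.

```python
import time

# Assumption: every status other than these two is terminal (success or error).
PENDING_STATUSES = {"Created", "InProgress"}


def wait_for_invocation(run_client, invocation_arn, poll_seconds=10,
                        timeout_seconds=600, sleep=time.sleep, clock=time.monotonic):
    """Poll get_data_automation_status until a terminal status or the timeout."""
    deadline = clock() + timeout_seconds
    while True:
        progress = run_client.get_data_automation_status(invocationArn=invocation_arn)
        if progress["status"] not in PENDING_STATUSES:
            return progress
        if clock() >= deadline:
            raise TimeoutError(
                f"{invocation_arn} still {progress['status']} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

In the notebook this would be called as `progress = wait_for_invocation(run_client, invoke_arn)`. The caller should still check that `progress['status']` is `'Success'` before reading results, since the helper also returns on error statuses.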


Once the status is 'Success', we can retrieve the results.

In [7]:
out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/standard_output/0/result.json"
s3.download_file(bucket_name, out_loc, 'result.json')
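The string splitting above can be wrapped in a small helper that parses the URI explicitly. This is only a sketch: `result_key` is a hypothetical name, and it assumes, as the cell above does, that the returned `s3Uri` ends in `/job_metadata.json` and that results live under `0/<kind>/<index>/result.json`.

```python
from urllib.parse import urlparse


def result_key(job_metadata_uri, kind="standard_output", index=0):
    """Return (bucket, key) of one result.json, given the job_metadata.json S3 URI."""
    parsed = urlparse(job_metadata_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {job_metadata_uri}")
    suffix = "/job_metadata.json"
    path = parsed.path[1:]  # drop only the single leading slash; keys may contain '//'
    if not path.endswith(suffix):
        raise ValueError(f"URI does not end in {suffix}: {job_metadata_uri}")
    base = path[: -len(suffix)]
    return parsed.netloc, f"{base}/0/{kind}/{index}/result.json"
```

For example, `bucket, key = result_key(progress['outputConfiguration']['s3Uri'])` followed by `s3.download_file(bucket, key, 'result.json')` would replace the two lines that build `out_loc`.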

We will display the JSON of the Standard Output. \
Note the document layout elements: pages and text, along with the sub-types: paragraphs and footer.

In [8]:
data = json.load(open('result.json'))
print(json.dumps(data, indent=2))

{
  "metadata": {
    "asset_id": "0",
    "logical_subdocument_id": "0",
    "semantic_modality": "DOCUMENT",
    "s3_bucket": "sagemaker-us-west-2-824467037051",
    "s3_key": "data_automation/input/documents/lending_package_w2.pdf",
    "number_of_pages": 1,
    "start_page_index": 0,
    "end_page_index": 0
  },
  "document": {
    "statistics": {
      "element_count": 6,
      "table_count": 2,
      "figure_count": 0
    }
  },
  "pages": [
    {
      "id": "b6561ddf-46fd-4a65-8a6f-3fc9779d88ef",
      "page_index": 0,
      "representation": {
        "markdown": "a Employee's social security number\n\n22222\n\n75395184613\nOMB No. 1545-0008\n\nb Employer identification number (EIN)\n\n4963147952\n\nc Employer's name, address, and ZIP code\n\nJohn Stiles\n 100 Main Street, Anytown, USA\n\nd Control number\n\n753951852\n\ne Employee's first name and initial\nLast name\nSuff.\n\nArnav\nDesai\nM\n\n123 Any Street, Any Town, USA\n\nf Employee's address and ZIP code\n\n1 Wages, tip

# Step 3: Create a Project, and blueprint for processing a Homeowner Insurance Form

Amazon Bedrock Data Automation (BDA) includes several sample blueprints to help you get started with custom output for documents and images. 

Next, we will create our own Blueprint for the Homeowners Insurance document. This is a common document in a residential loan application. We need just 4 fields from this document to process the loan application.

1. The insured's name
2. The insurance company name
3. The address of the insured property
4. The primary email address

In [9]:
# Display the Form

file_name = 'documents/homeowner_insurance_application_sample.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output/'
s3.upload_file(file_name, bucket_name, object_name)

display.IFrame("documents/homeowner_insurance_application_sample.pdf", width=1000, height=500)

In [10]:

response = client.create_blueprint(
    blueprintName=blueprint_name,
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": "This blueprint will process a homeowners insurance applicatation form",
        "documentClass": "default",
        "type": "object",
        "properties": {
            "Insured Name": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "Please extract the Insured's Name",
            },
            "Insurance Company": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "Please extract the insurance company name",
            },
            "Insured Address": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "Please extract the address of the insured property",
            },
            "Email Address": {
                "type": "string",
                "inferenceType": "extractive",
                "description": "Please extract the primary email address",
            },
        },
    })
)
response

{'ResponseMetadata': {'RequestId': '835635aa-7c38-4c0c-819e-215b8fe1796e',
  'HTTPStatusCode': 201,
  'HTTPHeaders': {'date': 'Fri, 10 Jan 2025 00:26:47 GMT',
   'content-type': 'application/json',
   'content-length': '1075',
   'connection': 'keep-alive',
   'x-amzn-requestid': '835635aa-7c38-4c0c-819e-215b8fe1796e'},
  'RetryAttempts': 0},
 'blueprint': {'blueprintArn': 'arn:aws:bedrock:us-west-2:824467037051:blueprint/350ea7969b3d',
  'schema': '{"$schema": "http://json-schema.org/draft-07/schema#", "description": "This blueprint will process a homeowners insurance applicatation form", "documentClass": "default", "type": "object", "properties": {"Insured Name": {"type": "string", "inferenceType": "extractive", "description": "Please extract the Insured\'s Name"}, "Insurance Company": {"type": "string", "inferenceType": "extractive", "description": "Please extract the insurance company name"}, "Insured Address": {"type": "string", "inferenceType": "extractive", "description": "Pleas

In [11]:
blueprint_arn = response['blueprint']['blueprintArn']
blueprint_arn

'arn:aws:bedrock:us-west-2:824467037051:blueprint/350ea7969b3d'
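Because every field in the schema above has the same shape (a string, extractive, with a prompt-like description), the schema can be generated from a plain mapping of field name to prompt. The sketch below assumes that shape holds for all fields. `build_blueprint_schema` is a hypothetical helper, not part of boto3.

```python
import json


def build_blueprint_schema(description, fields, document_class="default"):
    """Build a blueprint schema string from {field name: extraction prompt}."""
    properties = {
        name: {"type": "string", "inferenceType": "extractive", "description": prompt}
        for name, prompt in fields.items()
    }
    return json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": description,
        "documentClass": document_class,
        "type": "object",
        "properties": properties,
    })


schema = build_blueprint_schema(
    "This blueprint will process a homeowners insurance application form",
    {
        "Insured Name": "Please extract the Insured's Name",
        "Insurance Company": "Please extract the insurance company name",
        "Insured Address": "Please extract the address of the insured property",
        "Email Address": "Please extract the primary email address",
    },
)
```

The resulting string could be passed as the `schema` argument of `client.create_blueprint`. Adding a fifth field then means adding one line to the mapping.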

### Create Automation Project

Our project will need the blueprints required to process a loan application. In Step 4 we will add the Homeowner Insurance blueprint we just created, along with standard blueprints, so that the project covers these documents:
1. Drivers License ID Card
2. Bank Statements
3. W2 Tax forms (the W2 blueprint is added in Step 6)
4. Pay Stubs
5. A Check
6. A Homeowner Insurance Application

First, we will define the standard output configuration for Bedrock Data Automation.

Document Reference: https://docs.aws.amazon.com/bedrock/latest/userguide/bda-how-it-works.html

In [12]:
output_config =  {
  "document": {
    "extraction": {
      "granularity": {
        "types": [
          "PAGE",
          "ELEMENT"
        ]
      },
      "boundingBox": {
        "state": "ENABLED"
      }
    },
    "generativeField": {
      "state": "ENABLED"               
    },
    "outputFormat": {
      "textFormat": {
        "types": ['PLAIN_TEXT','MARKDOWN','HTML','CSV']
      },
      "additionalFileFormat": {
        "state": "DISABLED"
      }
    }
  },
  "image": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": [
          "TEXT_DETECTION"
        ]
      },
      "boundingBox": {
        "state": "ENABLED"
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": [
        "IMAGE_SUMMARY"
      ]
    }
  },
  "video": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": [
          "TEXT_DETECTION"
        ]
      },
      "boundingBox": {
        "state": "ENABLED"
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": [
        "VIDEO_SUMMARY",
        "SCENE_SUMMARY"
      ]
    }
  },
  "audio": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": [
          "TRANSCRIPT"
        ]
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["IAB"]
    }
  }
}

In [13]:


response = client.create_data_automation_project(
    projectName=project_name,
    projectDescription="Workbook to process Lending Applications",
    projectStage='LIVE',
    standardOutputConfiguration=output_config,
)

print(response)

project_arn = response['projectArn']
print(project_arn)

{'ResponseMetadata': {'RequestId': 'c8017a63-0dbf-480a-ab53-1c2499297cda', 'HTTPStatusCode': 201, 'HTTPHeaders': {'date': 'Fri, 10 Jan 2025 00:26:47 GMT', 'content-type': 'application/json', 'content-length': '137', 'connection': 'keep-alive', 'x-amzn-requestid': 'c8017a63-0dbf-480a-ab53-1c2499297cda'}, 'RetryAttempts': 0}, 'projectArn': 'arn:aws:bedrock:us-west-2:824467037051:data-automation-project/87411ead6264', 'projectStage': 'LIVE', 'status': 'IN_PROGRESS'}
arn:aws:bedrock:us-west-2:824467037051:data-automation-project/87411ead6264


# Step 4: Add blueprints to the Automation Project

Our project needs the blueprints required to process a loan application. We will add the standard blueprints for these documents:

1. Drivers License ID Card
2. Bank Statements
3. Pay Stubs
4. A Check

We will also add the Homeowner Insurance Application blueprint we created in Step 3.


In [14]:
print(blueprint_arn)

arn:aws:bedrock:us-west-2:824467037051:blueprint/350ea7969b3d


In [15]:
update_response = client.update_data_automation_project(
    projectArn=project_arn,
    standardOutputConfiguration=output_config,
    customOutputConfiguration={
        'blueprints': [
            {
                'blueprintArn': blueprint_arn,
                'blueprintStage': 'LIVE'
            },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license',
                  'blueprintStage': 'LIVE'
             },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-bank-check',
                  'blueprintStage': 'LIVE'
             },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-payslip',
                 'blueprintStage': 'LIVE'
             },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-bank-statement',
                  'blueprintStage': 'LIVE'
             },
        ]
    },
  )

# Read the ARN from the update call rather than the earlier create response
project_arn = update_response['projectArn']
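The public blueprint ARNs in the cell above differ only in their final name segment, so the list can be assembled from short names. This is a sketch under that assumption. The helper names are hypothetical, and the ARN pattern is inferred from the four ARNs used in this workbook rather than from a documented contract.

```python
def public_blueprint_arn(name, region="us-west-2"):
    """Build the ARN of an AWS-provided blueprint from its short name."""
    return f"arn:aws:bedrock:{region}:aws:blueprint/bedrock-data-automation-public-{name}"


def blueprint_items(custom_arns, public_names, stage="LIVE", region="us-west-2"):
    """Return the 'blueprints' list expected inside customOutputConfiguration."""
    arns = list(custom_arns) + [public_blueprint_arn(n, region) for n in public_names]
    return [{"blueprintArn": arn, "blueprintStage": stage} for arn in arns]


items = blueprint_items(
    custom_arns=[],  # the notebook would pass [blueprint_arn] here
    public_names=["us-driver-license", "us-bank-check", "payslip", "bank-statement"],
)
```

The result can be passed as `customOutputConfiguration={'blueprints': items}`.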


### List the Blueprints for our Automation Project

In [16]:
client.list_blueprints(projectFilter={'projectArn': project_arn})

{'ResponseMetadata': {'RequestId': '41c66aa1-c07a-4ad3-aea9-bee2b617411f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 10 Jan 2025 00:26:48 GMT',
   'content-type': 'application/json',
   'content-length': '632',
   'connection': 'keep-alive',
   'x-amzn-requestid': '41c66aa1-c07a-4ad3-aea9-bee2b617411f'},
  'RetryAttempts': 0},
 'blueprints': [{'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license',
   'blueprintStage': 'LIVE'},
  {'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-bank-check',
   'blueprintStage': 'LIVE'},
  {'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-payslip',
   'blueprintStage': 'LIVE'},
  {'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-bank-statement',
   'blueprintStage': 'LIVE'},
  {'blueprintArn': 'arn:aws:bedrock:us-west-2:824467037051:blueprint/350ea7969b3d',
   'bluepri

In [17]:
client.get_data_automation_project(projectArn=project_arn)

{'ResponseMetadata': {'RequestId': '94646961-945a-4203-9cda-6599cb3125de',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 10 Jan 2025 00:26:48 GMT',
   'content-type': 'application/json',
   'content-length': '1118',
   'connection': 'keep-alive',
   'x-amzn-requestid': '94646961-945a-4203-9cda-6599cb3125de'},
  'RetryAttempts': 0},
 'project': {'projectArn': 'arn:aws:bedrock:us-west-2:824467037051:data-automation-project/87411ead6264',
  'creationTime': datetime.datetime(2025, 1, 10, 0, 26, 47, 698000, tzinfo=tzlocal()),
  'lastModifiedTime': datetime.datetime(2025, 1, 10, 0, 26, 48, 212000, tzinfo=tzlocal()),
  'projectName': 'my-bda-lending-workbook-v1',
  'projectStage': 'LIVE',
  'standardOutputConfiguration': {'document': {'extraction': {'granularity': {'types': ['PAGE',
       'ELEMENT']},
     'boundingBox': {'state': 'ENABLED'}},
    'generativeField': {'state': 'ENABLED'},
    'outputFormat': {'textFormat': {'types': ['PLAIN_TEXT',
       'MARKDOWN',
       'HTML',


# Step 5: Use our custom Blueprint to process a Homeowner Insurance Form

In [18]:
# Upload the Form

file_name = 'documents/homeowner_insurance_application_sample.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output/'
s3.upload_file(file_name, bucket_name, object_name)

display.IFrame("documents/homeowner_insurance_application_sample.pdf", width=1000, height=500)

In [19]:
response = run_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{object_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
    blueprints=[{'blueprintArn': blueprint_arn, 'stage': 'LIVE'}])
response

invoke_arn = response['invocationArn']
invoke_arn

'arn:aws:bedrock:us-west-2:824467037051:data-automation-invocation/cdbb7bcd-ba8c-4077-86aa-d82714a7fc45'

In [20]:
in_progress = True
while in_progress:
    progress = run_client.get_data_automation_status(invocationArn=invoke_arn)
    if progress['status'] == 'InProgress':
        print(progress['status'])
        sleep(5)
    else:
        break
        
print(progress['status'])

InProgress
InProgress
InProgress
InProgress
InProgress
Success


### Get Results

Note that the four fields we requested in the blueprint have been returned.

In [21]:
out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/0/result.json"
out_loc

'data_automation/output//cdbb7bcd-ba8c-4077-86aa-d82714a7fc45/0/custom_output/0/result.json'

In [22]:
s3.download_file(bucket_name, out_loc, 'result.json')

In [23]:
data = json.load(open('result.json'))
print(json.dumps(data['inference_result'], indent=2))

#print(json.dumps(data, indent=2))

{
  "Insured Address": "42 Rainbow Sparkle Boulevard Unicornville, NV 12345",
  "Insurance Company": "Fake Insurance Co",
  "Insured Name": "Ziggy Starpixel",
  "Email Address": "rainbow.unicorn.987654@fakeemail.nowhere"
}
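Checking by eye that all requested fields came back works for one document, but a downstream system would likely want a programmatic check. A minimal sketch follows. `REQUIRED_FIELDS` and `missing_fields` are illustrative names, and treating an empty or whitespace-only string as "missing" is an assumption about how absent values are reported.

```python
REQUIRED_FIELDS = ("Insured Name", "Insurance Company", "Insured Address", "Email Address")


def missing_fields(inference_result, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or blank in the result."""
    missing = []
    for name in required:
        value = inference_result.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


sample = {
    "Insured Address": "42 Rainbow Sparkle Boulevard Unicornville, NV 12345",
    "Insurance Company": "Fake Insurance Co",
    "Insured Name": "Ziggy Starpixel",
    "Email Address": "rainbow.unicorn.987654@fakeemail.nowhere",
}
print(missing_fields(sample))
```

In the notebook, `missing_fields(data['inference_result'])` would give the list of fields that need manual follow-up.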


# Step 6: Document Splitting - Process a Multi-Page Document Package

In [24]:


print(f"Activating document splitting for project: {project_name}, {project_arn}")


project = client.get_data_automation_project(projectArn=project_arn)["project"]

# Update project configuration
update_response = client.update_data_automation_project(
    projectArn=project_arn,
    standardOutputConfiguration=project["standardOutputConfiguration"],
    customOutputConfiguration={
        'blueprints': [
            {
                'blueprintArn': blueprint_arn,
                'blueprintStage': 'LIVE'
            },
            {
                'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-w2-form',
                'blueprintStage': 'LIVE'
            },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license',
                  'blueprintStage': 'LIVE'
             },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-bank-check',
                  'blueprintStage': 'LIVE'
             },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-payslip',
                 'blueprintStage': 'LIVE'
             },
             {
                 'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-bank-statement',
                  'blueprintStage': 'LIVE'
             },
        ]
    },
    overrideConfiguration={'document': {'splitter': {'state': 'ENABLED'}}})

# Get updated project configuration
updated_project = client.get_data_automation_project(projectArn=project_arn)

print("\nUpdated override configuration of project:")
print(updated_project)


Activating document splitting for project: my-bda-lending-workbook-v1, arn:aws:bedrock:us-west-2:824467037051:data-automation-project/87411ead6264

Updated override configuration of project:
{'ResponseMetadata': {'RequestId': 'c3250850-c542-4ddb-935c-e5f951377f7b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 10 Jan 2025 00:27:15 GMT', 'content-type': 'application/json', 'content-length': '1188', 'connection': 'keep-alive', 'x-amzn-requestid': 'c3250850-c542-4ddb-935c-e5f951377f7b'}, 'RetryAttempts': 0}, 'project': {'projectArn': 'arn:aws:bedrock:us-west-2:824467037051:data-automation-project/87411ead6264', 'creationTime': datetime.datetime(2025, 1, 10, 0, 26, 47, 698000, tzinfo=tzlocal()), 'lastModifiedTime': datetime.datetime(2025, 1, 10, 0, 27, 15, 535000, tzinfo=tzlocal()), 'projectName': 'my-bda-lending-workbook-v1', 'projectStage': 'LIVE', 'standardOutputConfiguration': {'document': {'extraction': {'granularity': {'types': ['PAGE', 'ELEMENT']}, 'boundingBox': {'state': 'E
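Rather than scanning the printed response, the splitter setting can be read back directly. The sketch below assumes the `get_data_automation_project` response echoes `overrideConfiguration` in the same shape that was passed to the update call. The recorded output above is truncated before that key, so treat this as unverified. `splitter_state` is a hypothetical helper.

```python
def splitter_state(project_response):
    """Return the document splitter state from a project response, or None if absent."""
    project = project_response.get("project", {})
    return (
        project.get("overrideConfiguration", {})
        .get("document", {})
        .get("splitter", {})
        .get("state")
    )


# Shape assumed from the overrideConfiguration argument used in the update call.
example = {"project": {"overrideConfiguration": {"document": {"splitter": {"state": "ENABLED"}}}}}
print(splitter_state(example))
```

In the notebook, `splitter_state(updated_project)` would be expected to return `'ENABLED'` once the update has been applied.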

In [25]:
##
## Upload a package of documents to an S3
##
file_name = 'documents/lending_package.pdf'
object_name = f'data_automation/input/{file_name}'
output_name = 'data_automation/output/'
s3.upload_file(file_name, bucket_name, object_name)

display.IFrame("documents/lending_package.pdf", width=1000, height=500)

In [26]:
# Process the document package


response = run_client.invoke_data_automation_async(
    dataAutomationConfiguration = { "dataAutomationArn" : project_arn,"stage" : 'LIVE'},
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{object_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
)

response


invoke_arn = response['invocationArn']
invoke_arn


'arn:aws:bedrock:us-west-2:824467037051:data-automation-invocation/b5f8b62c-e803-4178-9d17-0018498fd538'

In [27]:
in_progress = True

while in_progress:
    progress = run_client.get_data_automation_status(invocationArn=invoke_arn)
    if progress['status'] == 'InProgress':
        print(progress['status'])
        sleep(10)
    else:
        break
        
print(progress['status'])

InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
Success


# Step 7: Display the results

The first page of the lending package has a pay stub

In [28]:
display.IFrame("documents/lending_package_pay_stub.pdf", width=1000, height=500)

Display Page 1 Results: Pay Stub \
Note that BDA has selected this blueprint:  bedrock-data-automation-public-payslip

In [29]:
##
## Display Page 1 Results: Pay Stub
##

out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/0/result.json"

s3.download_file(bucket_name, out_loc, 'result.json')

data = json.load(open('result.json'))
print(json.dumps(data['matched_blueprint'], indent=2))
print(json.dumps(data['inference_result'], indent=2))



{
  "arn": "arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-payslip",
  "name": "Payslip",
  "confidence": 0.9992805
}
{
  "YTDNetPay": "",
  "PayPeriodStartDate": "",
  "FederalTaxes": [
    {
      "YTD": 2111.2,
      "Period": 40.6,
      "ItemDescription": "Federal Income Tax"
    }
  ],
  "CurrentGrossPay": 452.43,
  "HolidayHourlyRate": 10,
  "CompanyAddress": {
    "State": "",
    "ZipCode": "10101",
    "City": "ANYTOWN",
    "Line1": "475 ANY AVENUE",
    "Line2": ""
  },
  "CityTaxes": [
    {
      "YTD": 308.88,
      "Period": 5.94,
      "ItemDescription": "NYC Income Tax"
    }
  ],
  "PayPeriodEndDate": "2008-07-18",
  "PayDate": "2008-07-25",
  "currency": "USD",
  "EmployeeAddress": {
    "State": "",
    "ZipCode": "12345",
    "City": "ANYTOWN",
    "Line1": "101 MAIN STREET",
    "Line2": ""
  },
  "YTDGrossPay": 23526.8,
  "is_gross_pay_valid": "",
  "StateFilingStatus": "",
  "YTDCityTax": 308.88,
  "EmployeeNumber": "",
  "RegularHourlyR

The second page of the lending package has a pay check

In [30]:
display.IFrame("documents/lending_package_check.pdf", width=1000, height=500)

Display Page 2 Results: Pay Check \
Note that BDA has selected this blueprint:  bedrock-data-automation-public-us-bank-check

In [31]:
##
## Display Page 2 Results: Pay Check
##

out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/1/result.json"
out_loc

s3.download_file(bucket_name, out_loc, 'result.json')

data = json.load(open('result.json'))
print(json.dumps(data['matched_blueprint'], indent=2))
print(json.dumps(data['inference_result'], indent=2))


{
  "arn": "arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-bank-check",
  "name": "US-Bank-Check",
  "confidence": 0.99888223
}
{
  "date": "",
  "dollar_amount": 291.9,
  "check_number": 1379,
  "account_holder_name": "ANY COMPANY CORP",
  "payee_name": "JOHN STILES",
  "bank_name": "",
  "memo": "",
  "routing_number_valid": true,
  "bank_routing_number": "122000496",
  "amount_in_words": "TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS",
  "is_signed": true
}


The third page of the lending package has a drivers license


In [32]:
display.IFrame("documents/lending_package_ID_Card.pdf", width=1000, height=500)

Display Page 3 Results: Drivers License \
Note that BDA has selected this blueprint:  bedrock-data-automation-public-us-driver-license

In [33]:
##
## Display Page 3 Results: Drivers License
##

out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/2/result.json"
out_loc

s3.download_file(bucket_name, out_loc, 'result.json')

data = json.load(open('result.json'))
print(json.dumps(data['matched_blueprint'], indent=2))
print(json.dumps(data['inference_result'], indent=2))


{
  "arn": "arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license",
  "name": "US-Driver-License",
  "confidence": 0.9998486
}
{
  "NAME_DETAILS": {
    "SUFFIX": "",
    "MIDDLE_NAME": "",
    "LAST_NAME": "GARCIA",
    "FIRST_NAME": "Mar\u00e1a"
  },
  "STATE_NAME": "MASSACHUSETTS",
  "ID_NUMBER": "736HDV7874JSB",
  "EXPIRATION_DATE": "2028-01-20",
  "ENDORSEMENTS": [
    "NONE"
  ],
  "PERSONAL_DETAILS": {
    "SEX": "F",
    "HAIR_COLOR": "",
    "HEIGHT": "4-6\"",
    "WEIGHT": "",
    "EYE_COLOR": "BLK"
  },
  "RESTRICTIONS": [
    "NONE"
  ],
  "ADDRESS_DETAILS": {
    "CITY": "BIGTOWN",
    "ZIP_CODE": "02801",
    "STATE": "MA",
    "STREET_ADDRESS": "100 MARKET STREET"
  },
  "CLASS": "D",
  "DATE_OF_BIRTH": "2001-03-18",
  "COUNTY": "",
  "DATE_OF_ISSUE": "2018-03-18"
}


The fourth page of the lending package has a bank statement


In [34]:
display.IFrame("documents/lending_package_account_statement.pdf", width=1000, height=500)

Display Page 4 Results: Bank Statement \
Note that BDA has selected this blueprint:  bedrock-data-automation-public-bank-statement

In [35]:
##
## Display Page 4 Results: Bank Statement
##

out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/3/result.json"
out_loc

s3.download_file(bucket_name, out_loc, 'result.json')

data = json.load(open('result.json'))
print(json.dumps(data['matched_blueprint'], indent=2))
print(json.dumps(data['inference_result'], indent=2))

{
  "arn": "arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-bank-statement",
  "name": "Bank-Statement",
  "confidence": 0.99911547
}
{
  "account_holder_address": "100 Main Street, Anytown, USA",
  "account_number": "333 008755555",
  "account_type": "",
  "account_summary": [
    {
      "summary_desc": "Your opening account balance as at 1 MAY 2021",
      "summary_amount": 50000
    },
    {
      "summary_desc": "Your closing account balance as at 31 MAY 2021",
      "summary_amount": 123084.85
    }
  ],
  "statement_end_date": "05/31/2021",
  "statement_start_date": "05/01/2021",
  "account_holder_name": "Jane Doe",
  "branch_transit_number": "",
  "bank_name": "",
  "transaction_details": [
    {
      "date": "",
      "balance": "",
      "description": "",
      "deposits": "",
      "withdrawals": ""
    }
  ]
}


The fifth page of the lending package has a W2 tax form


In [36]:
display.IFrame("documents/lending_package_w2.pdf", width=1000, height=500)

Display Page 5 Results: W2 Form \
Note that BDA has selected this blueprint:  bedrock-data-automation-public-w2-form

In [37]:
##
## Display Page 5 Results: W2 Form
##

out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/4/result.json"
out_loc

s3.download_file(bucket_name, out_loc, 'result.json')

data = json.load(open('result.json'))
print(json.dumps(data['matched_blueprint'], indent=2))
print(json.dumps(data['inference_result'], indent=2))

{
  "arn": "arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-w2-form",
  "name": "W2-Form",
  "confidence": 0.9997762
}
{
  "employer_info": {
    "employer_address": "100 Main Street, Anytown, USA",
    "control_number": "753951852",
    "employer_name": "John Stiles",
    "ein": "4963147952",
    "employer_zip_code": ""
  },
  "filing_info": {
    "omb_number": "1545-0008",
    "verification_code": ""
  },
  "codes": [
    {
      "amount": 500,
      "code": "A"
    },
    {
      "amount": 1500,
      "code": "C"
    },
    {
      "amount": 500,
      "code": "A"
    },
    {
      "amount": 1000,
      "code": "B"
    }
  ],
  "other": "NA",
  "federal_tax_info": {
    "federal_income_tax": 500,
    "allocated_tips": 150,
    "social_security_tax": 100,
    "medicare_tax": 5000
  },
  "state_taxes_table": [
    {
      "state_name": "Any Town",
      "local_wages_tips": 100,
      "employer_state_id_number": 7414568313,
      "state_wages_and_tips": 50,
    

The sixth page of the lending package has a Home Insurance form


In [38]:
display.IFrame("documents/homeowner_insurance_application_sample.pdf", width=1000, height=500)

Display Page 6 Results: Home Insurance Application \
Note that BDA has selected our custom blueprint

In [39]:
##
## Display Page 6 Results: Home Insurance
##

out_loc = progress['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_loc += "/0/custom_output/5/result.json"
out_loc

s3.download_file(bucket_name, out_loc, 'result.json')

data = json.load(open('result.json'))
print(json.dumps(data['matched_blueprint'], indent=2))
print(json.dumps(data['inference_result'], indent=2))

{
  "arn": "arn:aws:bedrock:us-west-2:824467037051:blueprint/350ea7969b3d",
  "name": "my-insurance-blueprint-v1",
  "confidence": 0.99264
}
{
  "Insured Address": "",
  "Insurance Company": "XYZ Insurance",
  "Insured Name": "Alejandro Rosalez",
  "Email Address": "alejandrorosalez@example.com"
}
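The six cells above repeat the same download-and-print pattern with a different segment index. A loop can collect the matched blueprint and confidence for every segment at once and flag low-confidence matches for review. This is a sketch: `summarize_segments` is a hypothetical helper, the `0.9` threshold is an arbitrary illustrative value, and `base_prefix` stands for the value `out_loc` holds before the `/0/custom_output/N/result.json` suffix is appended.

```python
import json


def summarize_segments(s3, bucket, base_prefix, segment_count, min_confidence=0.9):
    """Read each segment's custom output and summarize the blueprint match."""
    rows = []
    for index in range(segment_count):
        key = f"{base_prefix}/0/custom_output/{index}/result.json"
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        match = json.loads(body)["matched_blueprint"]
        rows.append({
            "segment": index,
            "blueprint": match["name"],
            "confidence": match["confidence"],
            # Flag matches below the (illustrative) threshold for manual review.
            "needs_review": match["confidence"] < min_confidence,
        })
    return rows
```

For the lending package this would be called with `segment_count=6`. The last row above (confidence 0.99264, but an empty "Insured Address") shows why a high classification confidence does not by itself guarantee complete extraction.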


# Step 8: Cleanup

This step is needed before we run through the workbook a second time. 

In [40]:
# Delete the project
response = client.delete_data_automation_project(projectArn=project_arn)

# Delete the blueprint
response = client.delete_blueprint(blueprintArn=blueprint_arn)
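The cell above removes the project and the blueprint but leaves the uploaded documents and BDA results in S3. If those should go too, a prefix-scoped delete is one option. The sketch below uses the standard `list_objects_v2` and `delete_objects` calls. `delete_prefix` is a hypothetical helper, and the `data_automation/` prefix is taken from the paths used in this workbook.

```python
def delete_prefix(s3, bucket, prefix):
    """Delete every object under prefix and return how many were removed."""
    deleted = 0
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3.list_objects_v2(**kwargs)
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each listing page holds at most 1000 keys, the delete_objects limit.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

In the notebook this would be `delete_prefix(s3, bucket_name, 'data_automation/')`. Keep the prefix narrow: the default SageMaker bucket may hold data from other notebooks.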