# Define custom extraction for documents using blueprints


## Introduction

In Intelligent Document Processing (IDP) and similar use cases, customers need direct control over the data extracted from unstructured documents. This control lets them integrate extracted insights directly into applications and workflows without the complexity of managing multiple models or stitching together outputs. One example is automating mortgage application processing in financial services. A mortgage packet can contain up to 20 different types of forms, such as W-2s, bank statements, and deed information, which makes the process difficult to automate with traditional technologies. With BDA custom output and blueprints, you can automate the classification and extraction of these documents, whether they are structured forms like W-2s or semi-structured documents like mortgage forms.

To help with this, Amazon Bedrock Data Automation (BDA) offers the `Custom Output` feature, which lets you define the target structure for the information you want to extract or generate from documents or images. This capability is particularly useful when working with complex or specialized data. You configure custom output in BDA by using `Blueprints`. Blueprints are artifacts that specify which fields to extract, the desired data format for each field (such as string, number, or boolean), and rules for data normalization and validation. Blueprints can be customized for specific document types like W-2s, pay stubs, or ID cards.

`Blueprints` are essentially lists of instructions and types that guide the extraction or generation of information based on your documents. This feature works in conjunction with BDA projects, enabling the processing of up to 40 document inputs and one image input. 


In this notebook we configure custom output to define extractions customized to our data schema requirements. 

## Prerequisites

In [None]:
%pip install "boto3>=1.37.4" pdf2image itables==2.2.4 PyPDF2==3.0.1 --upgrade -qq

In [None]:
%load_ext autoreload
%autoreload 2

## Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
import pprint
from IPython.display import JSON, display, IFrame, HTML
import sagemaker
import pandas as pd
from itables import show
import time
from utils import helper_functions
from utils import display_functions
from pathlib import Path
import os

session = sagemaker.Session()
default_bucket = session.default_bucket()
current_region = boto3.session.Session().region_name

sts = boto3.client('sts')
account_id = sts.get_caller_identity()['Account']

# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

## Prepare sample document

For this lab, we'll use a sample bank statement containing account holder details, banking information, and transaction data. We will use a catalog blueprint with custom output to extract and analyse the document content.

In [None]:
# Download the document
document_url = "s3://bedrock-data-automation-prod-assets-us-west-2/demo-assets/Document/BankStatement.jpg"
local_download_path = 'samples'

# Create full path of directories
os.makedirs(local_download_path, exist_ok=True)
local_file_name = 'BankStatement.jpg'
local_file_path = os.path.join(local_download_path, local_file_name)
(bucket, key) = helper_functions.get_bucket_and_key(document_url)
response = s3_client.download_file(bucket, key, local_file_path)

document_s3_uri = f'{bda_s3_input_location}/{local_file_name}'

target_s3_bucket, target_s3_key = helper_functions.get_bucket_and_key(document_s3_uri)
s3_client.upload_file(local_file_path, target_s3_bucket, target_s3_key)

print(f"Downloaded file to: {local_file_path}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")
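`helper_functions.get_bucket_and_key` lives in the notebook's `utils` package and is not shown here. A minimal sketch of what such a helper could look like, assuming it only splits an `s3://bucket/key` URI into its two parts (the name `split_s3_uri` and the example bucket are hypothetical):

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical stand-in for helper_functions.get_bucket_and_key.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The parsed path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://example-bucket/bda/input/BankStatement.jpg")
print(bucket, key)
```

Validating the scheme up front gives a clear error if a local path or HTTPS URL is passed by mistake.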

### View Sample Document

In [None]:
IFrame(local_file_path, width=600, height=400)

## Using catalog blueprint

Now that our sample document is available in S3, let's start using blueprints.

You can use blueprints to configure file processing business logic in Amazon Bedrock Data Automation (BDA). Each blueprint consists of a list of field names to extract, the desired data format for each field (e.g., string, number, boolean), and natural language context for data normalization and validation rules. 

BDA has ready-to-use blueprints (`Catalog Blueprints`) for a number of commonly used document types, such as W2, Paystub, or Receipt. Catalog blueprints are a great way to start if the document you want to extract from matches one of them. To extract from documents that are not matched by blueprints in the catalog, you can create your own blueprints. When creating a blueprint, you can let BDA generate it from a sample document and an optional prompt. You can also create the blueprint by adding individual fields or by defining a JSON schema for the blueprint.

In this notebook, we explore custom output using blueprints and data automation projects.
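This notebook only uses a catalog blueprint, but the JSON schema route for custom blueprints can be sketched as follows. The helper names, the field definitions, and the exact schema layout (`class`, `properties`, `inferenceType`, `instruction`) are assumptions for illustration; compare them with a schema returned by BDA before relying on them. `create_blueprint` is called on the boto3 `bedrock-data-automation` client passed in as `bda_client`.

```python
import json


def build_blueprint_schema(document_class: str, description: str, fields: dict) -> str:
    """Return a blueprint schema as a JSON string.

    `fields` maps a field name to (json_type, inference_type, instruction).
    The layout below is an assumed shape, not an authoritative specification.
    """
    properties = {
        name: {"type": json_type, "inferenceType": inference_type, "instruction": instruction}
        for name, (json_type, inference_type, instruction) in fields.items()
    }
    schema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": description,
        "class": document_class,
        "type": "object",
        "definitions": {},
        "properties": properties,
    }
    return json.dumps(schema)


def create_custom_blueprint(bda_client, name: str, schema_json: str) -> str:
    """Create a document blueprint from a schema string and return its ARN."""
    response = bda_client.create_blueprint(
        blueprintName=name,
        type="DOCUMENT",
        blueprintStage="LIVE",
        schema=schema_json,
    )
    return response["blueprint"]["blueprintArn"]


# Hypothetical fields for a bank statement; names and instructions are made up.
schema_json = build_blueprint_schema(
    document_class="Bank Statement",
    description="Monthly statement for a personal bank account",
    fields={
        "account_holder_name": ("string", "explicit", "Name of the account holder"),
        "ending_balance": ("number", "explicit", "Balance at the end of the statement period"),
    },
)
print(schema_json)
```

Keeping schema construction separate from the API call makes it possible to inspect or version the schema before anything is created in the account.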

### List blueprints in the catalog
Bedrock Data Automation provides sample blueprints (`catalog blueprints`) for common document types like W2s, pay stubs, and ID cards. These blueprints provide a starting point for extracting insights from these known document types.

Let's view the blueprints that BDA offers in the sample catalog.

In [None]:
import pandas as pd
from itables import show
list_blueprints_response = bda_client.list_blueprints(resourceOwner='SERVICE')
list_blueprints_response_json = JSON(list_blueprints_response, root="list_blueprints_response", expanded=True)
list_blueprints_response_table = pd.DataFrame(list_blueprints_response['blueprints'])[['blueprintName','blueprintArn']].style.hide(axis='index')
list_blueprints_response_html = list_blueprints_response_table.to_html()
list_blueprints_response_html = list_blueprints_response_html.replace('<tr>', '<tr onclick="handleRowClick(this.rowIndex)">')
list_blueprints_response_html = HTML(f"""
<div style="height:400px; overflow-y:auto;">
    {list_blueprints_response_html}
</div>
""")
display_functions.display_multiple([
    display_functions.get_view(list_blueprints_response_html),
    display_functions.get_view(list_blueprints_response_json)],["Table View", "Raw JSON"]
)
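`list_blueprints` returns results in pages, and the cell above reads only the first page. If the catalog does not fit in one page, a loop over `nextToken` collects the rest. A minimal sketch, assuming `bda_client` is the boto3 `bedrock-data-automation` client created during setup (the function name `list_all_blueprints` is hypothetical):

```python
def list_all_blueprints(bda_client, resource_owner: str = "SERVICE") -> list:
    """Collect blueprint summaries from every page of list_blueprints."""
    blueprints = []
    kwargs = {"resourceOwner": resource_owner}
    while True:
        page = bda_client.list_blueprints(**kwargs)
        blueprints.extend(page.get("blueprints", []))
        next_token = page.get("nextToken")
        if not next_token:
            # No token means this was the last page.
            return blueprints
        kwargs["nextToken"] = next_token
```

The returned list can be passed to `pd.DataFrame` in the same way as `list_blueprints_response['blueprints']`.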


### Invoke Blueprint Recommendation
Let's now use our sample document as input to invoke the BDA blueprint recommendation. BDA will use the provided input document to match a sample blueprint from the global catalog.

In [None]:
payload = {
    "inputDataConfiguration":{
        "s3Uri":f'{document_s3_uri}'
    },
    "dataAutomationProfileArn": f"arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1"
}
response = helper_functions.invoke_blueprint_recommendation_async(bda_client=bda_client,
                                      region_name=current_region, 
                                      payload=json.dumps(payload))

job_id = response['jobId']

### Wait for blueprint recommendation results

In [None]:
status_response = helper_functions.wait_for_completion(
            client=None,
            get_status_function=helper_functions.get_blueprint_recommendation,
            status_kwargs={
                'bda_client': bda_client,
                'job_id': job_id,
                'region_name': current_region,
                'credentials': boto3.Session().get_credentials().get_frozen_credentials(),
            },
            completion_states=['Completed'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)

### Identify Blueprint
BDA outputs a blueprint recommendation along with a prompt recommendation that is useful to create a custom blueprint, if needed.

For this example, we will fetch the blueprint that was recommended by BDA.

In [None]:
blueprint_recommendation = next((result for result in status_response['results'] if result['type'] == 'BLUEPRINT_RECOMMENDATION'),None)

In [None]:
recommended_blueprint_info = blueprint_recommendation['blueprintRecommendation']
recommended_blueprint_json = JSON(blueprint_recommendation['blueprintRecommendation'], root='blueprintRecommendation', expanded=True)
recommended_blueprint_table = pd.DataFrame([{
        'Document Class': recommended_blueprint_info['documentClass']['type'],
        'Blueprint Name': recommended_blueprint_info['matchedBlueprint']['name'],
        'Confidence': recommended_blueprint_info['matchedBlueprint']['confidence'],
        'Blueprint ARN': recommended_blueprint_info['matchedBlueprint']['blueprintArn']
    }]
)
display_functions.display_multiple(
    [display_functions.get_view(recommended_blueprint_table.style.hide(axis='index')), 
     display_functions.get_view(recommended_blueprint_json)], ["Table View", "Raw JSON"]
)


### View Schema

Now that we have identified the matching blueprint, we can view the blueprint schema. The blueprint schema defines the structure of fields for BDA's custom output extraction. There are two types of fields, explicit and implicit, located in the extraction table. Explicit extractions are used for information that is visible in the document. Implicit extractions are used for information that needs to be transformed from how it appears in the document.
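To see which fields fall into each category, the schema string can be grouped by field type. The sketch below assumes each top-level property carries an `inferenceType` key and that implicit fields are labeled `inferred`; both are assumptions to check against the actual schema. The sample schema and its field names are made up for illustration.

```python
import json
from collections import defaultdict


def fields_by_inference_type(schema_json: str) -> dict:
    """Group top-level blueprint fields by their inferenceType value."""
    schema = json.loads(schema_json)
    grouped = defaultdict(list)
    for name, spec in schema.get("properties", {}).items():
        # Fields without an inferenceType key are collected under "unspecified".
        grouped[spec.get("inferenceType", "unspecified")].append(name)
    return dict(grouped)


# Illustrative schema only; not the schema of any catalog blueprint.
sample_schema = json.dumps({
    "class": "Bank Statement",
    "type": "object",
    "properties": {
        "account_number": {"type": "string", "inferenceType": "explicit",
                           "instruction": "The account number printed on the statement"},
        "net_change": {"type": "number", "inferenceType": "inferred",
                       "instruction": "Ending balance minus beginning balance"},
    },
})
print(fields_by_inference_type(sample_schema))
```

In the notebook, the same function could be applied to `blueprint_recommendation['blueprintRecommendation']['schema']`, which is already a JSON string.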

In [None]:
JSON(json.loads(blueprint_recommendation['blueprintRecommendation']['schema']), root='Schema', expanded=False)

In [None]:
blueprint_arn = blueprint_recommendation['blueprintRecommendation']['matchedBlueprint']['blueprintArn']

### Invoke Data Automation with Catalog Blueprint
With a blueprint identified, we can proceed to invoke data automation. Note that in addition to the input and output configuration, we also provide the blueprint ARN when calling the `invoke_data_automation_async` operation.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f'{document_s3_uri}'
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    blueprints=[
        {
            'blueprintArn': blueprint_arn
        }
    ],
    dataAutomationProfileArn=f"arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1",
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}')

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` operation. This API takes the invocation ARN that we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, at which point the response also includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.
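The implementation of `helper_functions.wait_for_completion` is not shown in this notebook. A minimal sketch of the kind of polling loop it might run, assuming the status call returns a dict with a top-level `status` key (the function name `poll_until_done` and its `sleep` parameter are hypothetical):

```python
import time


def poll_until_done(get_status, status_kwargs, completion_states, error_states,
                    max_iterations=15, delay=30, sleep=time.sleep):
    """Call get_status(**status_kwargs) until the job reaches a final state.

    Returns the last response; raises TimeoutError if no final state is reached.
    """
    final_states = set(completion_states) | set(error_states)
    for attempt in range(1, max_iterations + 1):
        response = get_status(**status_kwargs)
        status = response["status"]
        print(f"Attempt {attempt}: status={status}")
        if status in final_states:
            return response
        # Wait before asking again so the service is not polled in a tight loop.
        sleep(delay)
    raise TimeoutError(f"Job not finished after {max_iterations} attempts")
```

Under these assumptions, the call would look like `poll_until_done(bda_runtime_client.get_data_automation_status, {'invocationArn': invocationArn}, ['Success'], ['ClientError', 'ServiceError'])`. The caller still has to check whether the returned status is a completion state or an error state.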

In [None]:
status_response = helper_functions.wait_for_completion(
            client=bda_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # get_data_automation_status reports failures under the errorType and errorMessage keys
    raise Exception(f'Invocation Job Error, error_type={status_response.get("errorType")}, error_message={status_response.get("errorMessage")}')

### View Job Metadata - Custom Output
Once the job is completed successfully, we can view the metadata associated with the BDA data automation job.


In [None]:
job_metadata = json.loads(helper_functions.read_s3_object(job_metadata_s3_location))

job_metadata_table = pd.DataFrame(job_metadata['output_metadata'][0]['segment_metadata']).style.hide(axis='index')
job_metadata_json = JSON(job_metadata, root="job_metadata", expanded=True)
# Display the widget
display_functions.display_multiple([display_functions.get_view(job_metadata_table),display_functions.get_view(job_metadata_json)], ["Table View", "Raw JSON"])


We can now use the output paths in the metadata to fetch the extracted results.

### Fetch BDA Output
Now that we have both the standard output and the custom output, we can retrieve the results from the output S3 bucket. The cell below reads the custom output for each segment whose custom output status is `MATCH`.

In [None]:
asset_id = 0
segments_metadata = next(item["segment_metadata"]
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
custom_outputs = [
    json.loads(helper_functions.read_s3_object(segment_metadata.get('custom_output_path')))
    if segment_metadata.get('custom_output_status') == 'MATCH' else None
    for segment_metadata in segments_metadata
]
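The selection logic in the cell above can be tried without an S3 round trip. The sketch below uses made-up metadata with the same keys the cell reads (`output_metadata`, `asset_id`, `segment_metadata`, `custom_output_status`, `custom_output_path`). The function name, the S3 path, and the `NO_MATCH` value are placeholders, not values taken from a real job.

```python
def matched_custom_output_paths(job_metadata: dict, asset_id: int = 0) -> list:
    """Return custom output S3 paths for segments whose status is MATCH."""
    for item in job_metadata["output_metadata"]:
        if item["asset_id"] == asset_id:
            return [
                segment["custom_output_path"]
                for segment in item["segment_metadata"]
                if segment.get("custom_output_status") == "MATCH"
            ]
    # No asset with the requested id was found.
    return []


# Made-up metadata for illustration only.
sample_metadata = {
    "output_metadata": [{
        "asset_id": 0,
        "segment_metadata": [
            {"custom_output_status": "MATCH",
             "custom_output_path": "s3://example-bucket/bda/output/job/0/custom_output/0/result.json"},
            {"custom_output_status": "NO_MATCH"},  # placeholder for an unmatched segment
        ],
    }]
}
print(matched_custom_output_paths(sample_metadata))
```

Unlike the list comprehension in the notebook, this variant drops unmatched segments instead of keeping a `None` entry for them, so the result no longer lines up index by index with the segments.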


In [None]:
custom_outputs_json = JSON(custom_outputs, root="custom_outputs", expanded=True)
custom_outputs_table = pd.DataFrame(helper_functions.get_summaries(custom_outputs))

display_functions.display_multiple(
    [display_functions.get_view(custom_outputs_table.style.hide(axis='index')),
     display_functions.get_view(custom_outputs_json)],["Table View", "Raw JSON"]
)


You should see from the summary of the custom output that BDA used the provided blueprint (that's why the confidence score is 1). Let's now explore the extracted results.

### Explore Document Insights extracted using blueprint

We can now explore the custom output received from processing documents using the blueprint we used for the Data Automation job.

In [None]:
views = []
titles = []
# Build one view per segment that produced custom output
for custom_output in custom_outputs:
    if custom_output is None:
        # Segments that did not match the blueprint have no custom output to display
        continue
    result = helper_functions.transform_custom_output(custom_output['inference_result'], custom_output['explainability_info'][0])
    document_image_uris = [document_s3_uri]
    views.append(display_functions.segment_view(
        document_image_uris=document_image_uris,
        inference_result=result))
    titles.append(custom_output.get('matched_blueprint', {}).get('name', None))

display_functions.display_multiple(views, titles)

## Summary and Next Steps

In this lab we saw how you can leverage custom output and blueprints to achieve a higher degree of granular control over the unstructured data extraction and transformation with BDA. Using custom output allows you to better address your specific use cases and optimize the performance of your applications. 

In a subsequent lab, you will explore how you can combine the capabilities of custom output, blueprints and data projects to 

## Clean Up
Let's delete the uploaded sample file from the S3 input directory, along with the local copy and the generated job output files.

In [None]:
import os

# Delete the uploaded sample file from S3
s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)

# Delete the local copy of the sample file
if os.path.exists(local_file_path):
    os.remove(local_file_path)

# Delete the BDA job output: drop the file name from the metadata URI to get the job prefix
bda_s3_job_location = job_metadata_s3_location.rsplit('/', 1)[0]
!aws s3 rm {bda_s3_job_location} --recursive