# Custom output and blueprints

## Introduction

In addition to `Standard Output`, Amazon Bedrock Data Automation (BDA) offers the `Custom Output` feature, which lets you further fine-tune extractions from documents and images. This capability is particularly useful when working with complex or specialized data.

You configure custom output in BDA using `Blueprints`. A blueprint is essentially a list of instructions that guides the extraction of information from your file, including transformation and adjustment of the output. This feature works in conjunction with BDA projects, enabling the processing of up to 40 document inputs and one image input.

Custom outputs provide users with greater control and flexibility in how they extract and structure information from their files, making it easier to tailor the results to their particular use cases.

## Blueprints

You can use blueprints to configure file processing business logic in Amazon Bedrock Data Automation (BDA). Each blueprint consists of a list of field names to extract, the desired data format for each field (e.g., string, number, boolean), and natural language context for data normalization and validation rules. 
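
As a rough illustration, the three ingredients described above (field name, data type, natural language instruction) could be written as a Python dictionary like the one below. The field names, the `class` value, and the exact key layout (`inferenceType`, `instruction`) are assumptions made for this sketch; compare against a blueprint generated by BDA for the authoritative structure.

```python
import json

# Hypothetical blueprint schema for illustration only. The field names and
# instructions are invented, and the key layout is an assumption.
example_blueprint_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Example invoice blueprint (hypothetical)",
    "class": "Example-Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "The invoice number printed on the document",
        },
        "total_amount": {
            "type": "number",
            "inferenceType": "explicit",
            "instruction": "The total amount due, without currency symbol",
        },
        "is_paid": {
            "type": "boolean",
            "inferenceType": "inferred",
            "instruction": "True if the document indicates the invoice is paid",
        },
    },
}

# The schema is serialized to a JSON string before being sent to the API.
schema_json = json.dumps(example_blueprint_schema, indent=2)
print(schema_json)
```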

BDA has ready-to-use blueprints (`Catalog Blueprints`) for a number of commonly used document types such as a W2, a paystub, or a receipt. Catalog blueprints are a great way to start if the document you want to extract from matches the blueprint. To extract from documents that are not matched by blueprints in the catalog, you can create your own blueprints. When creating a blueprint using the AWS Console, you have the option to let BDA generate the blueprint after you provide a sample document and an optional prompt. You can also create the blueprint by adding individual fields or by using a JSON editor to define the JSON for the blueprint.

In this notebook, we will explore custom output using blueprints and data automation projects.

## Prerequisites

In [None]:
%pip install "boto3>=1.35.76" PyPDF2 --upgrade -qq

## Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
import pprint
# Import the display helpers once; a second `import IPython.display as display` would shadow the display() function
from IPython.display import JSON, display, HTML, Markdown, IFrame
from sagemaker import get_execution_role

default_execution_role = get_execution_role()
region_name = 'us-west-2'
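# Initialize Bedrock Data Automation clients in the same region used by the helper calls below
bda_client = boto3.client('bedrock-data-automation', region_name=region_name)
bda_runtime_client = boto3.client('bedrock-data-automation-runtime', region_name=region_name)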
s3_client = boto3.client('s3')

bda_s3_input_location = 's3://genai.octankmarkets.com/input'
bda_s3_output_location = 's3://genai.octankmarkets.com/output'
print(f'Using default execution role {default_execution_role}')

## Configure IAM Permissions


The features being explored in the notebook require the following IAM Policies for the role being used. If you're running this notebook within SageMaker Studio in your own Account, update the default execution role for the SageMaker user profile to include the following IAM policies. 

```json
[
    {
        "Sid": "BDACreatePermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:CreateDataAutomationProject",
            "bedrock:CreateBlueprint"
        ],
        "Resource": "*"
    },
    {
        "Sid": "BDAOProjectsPermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:CreateDataAutomationProject",
            "bedrock:UpdateDataAutomationProject",
            "bedrock:GetDataAutomationProject",
            "bedrock:GetDataAutomationStatus",
            "bedrock:ListDataAutomationProjects",
            "bedrock:InvokeDataAutomationAsync"
        ],
        "Resource": "arn:aws:bedrock:::data-automation-project/*"
    },
    {
        "Sid": "BDABlueprintPermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:GetBlueprint",
            "bedrock:ListBlueprints",
            "bedrock:UpdateBlueprint",
            "bedrock:DeleteBlueprint"
        ],
        "Resource": "arn:aws:bedrock:::blueprint/*"
    }
]
```

## Project using catalog blueprint

Let's start with catalog blueprints. BDA offers sample blueprints for common document types such as a W2, a pay stub, or a bank statement. To begin, we'll use a sample bank statement.

### View Sample Document

In [None]:
IFrame("examples/BankStatement.pdf", width=900, height=800)

### Upload sample document to S3
For this part of the lab, we use the sample bank statement shown above to explore catalog blueprints in BDA.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [None]:
%load_ext autoreload
%autoreload 1
import sys
sys.path.append('..')
%aimport utils.helper_functions 
from utils.helper_functions import wait_for_job_to_complete, read_s3_object, create_sample_file, get_bucket_and_key, wait_for_completion, display_html

input_bucket, input_prefix = get_bucket_and_key(bda_s3_input_location)
local_file_name = 'examples/BankStatement.pdf'
s3_file_name = 'BankStatement.pdf'
s3_response = s3_client.upload_file(local_file_name, input_bucket,
                                    f'{input_prefix}/{s3_file_name}')
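
The `get_bucket_and_key` helper is imported from `utils.helper_functions` and its implementation is not shown in this notebook. A minimal sketch of how an S3 URI might be split into bucket and key is given below; the function name `split_s3_uri` is hypothetical and is not the notebook's helper.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a placeholder bucket name
bucket, prefix = split_s3_uri("s3://example-bucket/input")
print(bucket, prefix)
```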

### List blueprints in the Catalog
Let's view the blueprints that BDA offers in its catalog of sample blueprints.

In [None]:
import pandas as pd
list_blueprints_response = bda_client.list_blueprints(resourceOwner='SERVICE')
df = pd.DataFrame(list_blueprints_response['blueprints'])[['blueprintName','blueprintArn']]
HTML(f'''
    <div style="height:300px; overflow:auto;">
        {df.to_html()}
    </div>
''')

### Invoke Blueprint Recommendation
With our sample ready, we can have BDA recommend a blueprint for our sample document from the sample set.

In [None]:
from utils.helper_functions import invoke_blueprint_recommendation_async, get_blueprint_recommendation
import json
inputConfiguration = {
    "inputDataConfiguration":{
        "s3Uri":f'{bda_s3_input_location}/{s3_file_name}'
    }
}
response = invoke_blueprint_recommendation_async(bda_client=bda_client,
                                      region_name=region_name, 
                                      credentials = boto3.Session().get_credentials().get_frozen_credentials(),
                                      payload=json.dumps(inputConfiguration))

job_id = response['jobId']

### Wait for blueprint recommendation results

In [None]:
status_response = wait_for_completion(
            client=None,
            get_status_function=get_blueprint_recommendation,
            status_kwargs={
                'bda_client': bda_client,
                'job_id': job_id,
                'region_name': region_name,
                'credentials': boto3.Session().get_credentials().get_frozen_credentials(),
            },
            completion_states=['Completed'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
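
The `wait_for_completion` helper comes from `utils.helper_functions` and is not shown here. The polling pattern it appears to follow can be sketched as below. The function `poll_until` and its parameters are hypothetical stand-ins, and this sketch raises on error states, which may differ from how the notebook's helper behaves.

```python
import time


def poll_until(get_status, completion_states, error_states,
               status_key="status", max_iterations=15, delay=30,
               sleep=time.sleep):
    """Call get_status() repeatedly until a terminal state or the attempt limit."""
    for attempt in range(1, max_iterations + 1):
        response = get_status()
        status = response.get(status_key)
        print(f"Attempt {attempt}: status={status}")
        if status in completion_states:
            return response
        if status in error_states:
            raise RuntimeError(f"Job ended in error state: {status}")
        sleep(delay)
    raise TimeoutError(f"Job did not finish after {max_iterations} attempts")


# Usage with a fake status source standing in for a real BDA call
fake_statuses = iter(["Created", "InProgress", "Completed"])
result = poll_until(lambda: {"status": next(fake_statuses)},
                    completion_states=["Completed"],
                    error_states=["ClientError", "ServiceError"],
                    delay=0)
```

Passing the status function as a callable keeps the loop independent of any particular AWS client, which is why the same loop can serve both the recommendation job and the data automation job.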

### Identify Blueprint
BDA outputs a blueprint recommendation along with a prompt recommendation that is useful to create a custom blueprint, if needed.

For this example, we will fetch the blueprint that was recommended by BDA.

In [None]:
blueprint_recommendation = next((result for result in status_response['results'] if result['type'] == 'BLUEPRINT_RECOMMENDATION'),None)

In [None]:
JSON(blueprint_recommendation['blueprintRecommendation'], root='blueprintRecommendation', expanded=True)

### Blueprint Schema

Now that we have identified the matching blueprint, we can view the blueprint schema. The blueprint schema describes the data structure that contains fields, which in turn contain the information extracted by BDA custom output. There are two types of fields, explicit and implicit, located in the extraction table. Explicit extractions are used for clearly stated information that can be seen in the document. Implicit extractions are used for information that needs to be transformed from how it appears in the document.

In [None]:
JSON(json.loads(blueprint_recommendation['blueprintRecommendation']['schema']), root='Schema', expanded=False)
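
To see which fields are extracted as-is and which are transformed, the schema's fields can be grouped by their extraction type. The sketch below assumes each field under `properties` carries an `inferenceType` key; that key name, the `inferred` value, and the sample field names are assumptions, so adjust them to match the schema displayed above.

```python
from collections import defaultdict


def fields_by_inference_type(schema):
    """Group field names under 'properties' by their 'inferenceType' value."""
    groups = defaultdict(list)
    for name, spec in schema.get("properties", {}).items():
        groups[spec.get("inferenceType", "unspecified")].append(name)
    return dict(groups)


# Hypothetical schema fragment used only to exercise the function
sample_schema = {
    "properties": {
        "account_number": {"type": "string", "inferenceType": "explicit"},
        "statement_period_days": {"type": "number", "inferenceType": "inferred"},
    }
}
grouped = fields_by_inference_type(sample_schema)
print(grouped)
```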

In [None]:
blueprint_arn = blueprint_recommendation['blueprintRecommendation']['matchedBlueprint']['blueprintArn']

### Invoke Data Automation Async
Now that we have identified a blueprint, we can proceed to invoke data automation. Note that in addition to the input and output configuration, we also provide the blueprint ARN when calling the `invoke_data_automation_async` operation.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f'{bda_s3_input_location}/{s3_file_name}'
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    blueprints=[
        {
            'blueprintArn': blueprint_arn
        }
    ]
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}')

### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` operation. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, at which point the response also includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [None]:
status_response = wait_for_completion(
            client=bda_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # The status response reports failures under the camelCase keys errorType and errorMessage
    raise Exception(f'Invocation Job Error, error_type={status_response.get("errorType")}, error_message={status_response.get("errorMessage")}')

### Extract Custom output

In [None]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata, root='job_metadata', expanded=True)

### Explore Custom output using Blueprint 
We can now explore the custom output received from processing documents using the blueprint we used for the Data Automation job.

Note that standard output is always produced.

Let's break down the main sections of this JSON output from Bedrock Data Automation:

In [None]:
asset_id = 0
custom_output_path = next(item["segment_metadata"][0]["custom_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
custom_output = json.loads(read_s3_object(custom_output_path))
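
The `next(...)` lookup above raises `StopIteration` if no entry matches the asset ID. A more defensive variant is sketched below. It assumes the job metadata has the layout used in the cell above (`output_metadata`, `asset_id`, `segment_metadata`, `custom_output_path`); the function name and the sample S3 URI are hypothetical.

```python
def find_custom_output_path(job_metadata, asset_id=0, segment_index=0):
    """Return the custom output S3 URI for one asset segment, or None if absent."""
    for item in job_metadata.get("output_metadata", []):
        if item.get("asset_id") != asset_id:
            continue
        segments = item.get("segment_metadata", [])
        if segment_index < len(segments):
            return segments[segment_index].get("custom_output_path")
    return None


# Hypothetical metadata mirroring the keys accessed in the cell above
sample_job_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"custom_output_path": "s3://example-bucket/output/custom/result.json"}
            ],
        }
    ]
}
print(find_custom_output_path(sample_job_metadata))
print(find_custom_output_path(sample_job_metadata, asset_id=5))
```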

In [None]:
JSON(custom_output)

#### _matched_blueprint_

- Contains information about the template used for analysis
- Includes the ARN, name ("Bank-Statement"), and confidence score for the match.

The confidence score is the degree of certainty with which BDA has matched the provided document to a blueprint.

Note: Since we passed the blueprint ARN to `invoke_data_automation_async`, BDA uses the blueprint with that ARN, and hence the confidence score is 1. Later in this notebook, when using projects, we will see an example where BDA determines a blueprint for a document from a configured set of blueprints.

In [None]:
JSON(custom_output['matched_blueprint'], root='matched_blueprint')
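
When BDA selects the blueprint itself, the confidence score can be used to decide whether to trust the match. The check below is a sketch under the assumption that `matched_blueprint` exposes the score under a `confidence` key; the key names, the sample values, and the 0.8 threshold are illustrative, not values prescribed by BDA.

```python
def is_confident_match(matched_blueprint, threshold=0.8):
    """Return True when the blueprint match confidence meets the threshold."""
    confidence = matched_blueprint.get("confidence")
    return confidence is not None and confidence >= threshold


# Hypothetical matched_blueprint entries for illustration
explicit_match = {"name": "Bank-Statement", "confidence": 1}
weak_match = {"name": "Bank-Statement", "confidence": 0.42}
print(is_confident_match(explicit_match))
print(is_confident_match(weak_match))
```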

#### _document_class_

In [None]:
JSON(custom_output['document_class'], root='document_class')

#### _inference_result_

The inference result section contains the data BDA extracted from the document using the blueprint provided.

In [None]:
JSON(custom_output['inference_result'], root='inference_result')

#### _explainability_info_

---

## Custom output using custom blueprint

For documents and images that aren't in the catalog, you can create custom blueprints. In the following example, we will extract data from a sample medical claim form using a blueprint that we create.

### View Sample Document

In [None]:
IFrame("examples/sample1_cms-1500-P.pdf", width=900, height=800)

### Upload sample document to S3
For this lab, we use a CMS 1500 medical claim form with dummy data to explore the blueprint feature of BDA.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [None]:
input_bucket, input_prefix = get_bucket_and_key(bda_s3_input_location)
local_file_name = 'examples/sample1_cms-1500-P.pdf'
s3_file_name = 'sample1_cms-1500.pdf'
s3_response = s3_client.upload_file(local_file_name, input_bucket,
                                    f'{input_prefix}/{s3_file_name}')

### Invoke Blueprint Recommendation

Before we start creating our own blueprint, let's explore the blueprint recommendation for our sample document.

In [None]:
from utils.helper_functions import invoke_blueprint_recommendation_async, get_blueprint_recommendation
import json
inputConfiguration = {
    "inputDataConfiguration":{
        "s3Uri":f'{bda_s3_input_location}/{s3_file_name}'
    }
}
response = invoke_blueprint_recommendation_async(bda_client=bda_client,
                                      region_name=region_name, 
                                      credentials = boto3.Session().get_credentials().get_frozen_credentials(),
                                      payload=json.dumps(inputConfiguration))

job_id = response['jobId']

### Wait for blueprint recommendation results

In [None]:
status_response = wait_for_completion(
            client=None,
            get_status_function=get_blueprint_recommendation,
            status_kwargs={
                'bda_client': bda_client,
                'job_id': job_id,
                'region_name': region_name,
                'credentials': boto3.Session().get_credentials().get_frozen_credentials(),
            },
            completion_states=['Completed'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)

In [None]:
blueprint_recommendation = next((result for result in status_response['results'] if result['type'] == 'BLUEPRINT_RECOMMENDATION'),None)
JSON(blueprint_recommendation['blueprintRecommendation'],root='blueprintRecommendation',expanded=False)

Note that BDA identified the type of the sample file provided as `DOCUMENT` and the document class as 'Health Insurance Claim Form'. However, the `matchedBlueprint` section is missing, indicating that BDA did not find an existing blueprint in the BDA catalog of premade blueprints.
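
Because `matchedBlueprint` can be absent, code that reads the recommended ARN should not index into it directly. The helper below is a small sketch of a safe lookup. It assumes the response layout used earlier in this notebook (`blueprintRecommendation`, `matchedBlueprint`, `blueprintArn`); the function name and the sample values are hypothetical.

```python
def get_matched_blueprint_arn(recommendation):
    """Return the matched catalog blueprint ARN, or None when nothing matched."""
    if not recommendation:
        return None
    matched = recommendation.get("blueprintRecommendation", {}).get("matchedBlueprint")
    return matched.get("blueprintArn") if matched else None


# Hypothetical recommendation results: one with a match, one without
with_match = {
    "blueprintRecommendation": {
        "matchedBlueprint": {"blueprintArn": "arn:aws:bedrock:example:blueprint/example"}
    }
}
without_match = {"blueprintRecommendation": {"documentClass": "Health Insurance Claim Form"}}
print(get_matched_blueprint_arn(with_match))
print(get_matched_blueprint_arn(without_match))
```

A `None` result is the signal to fall back to a custom blueprint, as done in the rest of this section.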

Now let's create our first blueprint.

### Define Blueprint properties
To create a blueprint through the API, you provide a blueprint name, the blueprint type (`DOCUMENT` or `IMAGE`), the blueprint stage (`LIVE` or `DEVELOPMENT`), and the blueprint schema in JSON schema format. Here we also define a description for reference.

In [None]:
# create blueprint using Boto3
blueprint_name = 'medical-claim-form-cms1500'
blueprint_description = 'Blueprint for CMS 1500 Claim Form'
blueprint_type = 'DOCUMENT'
blueprint_stage = 'LIVE'

with open('assets/blueprint_schema.json') as f:
    blueprint_schema = json.load(f)
JSON(blueprint_schema)

### Create (or Update) Blueprint

We will use the `create_blueprint` operation (or `update_blueprint` for an existing blueprint) in the `boto3` API to create or update the blueprint. You could also create or update blueprints using the AWS console. Each blueprint that you create is an AWS resource with its own blueprint ID and ARN.

In [None]:
list_blueprints_response = bda_client.list_blueprints(
    blueprintStageFilter='ALL'
)
blueprint = next((blueprint for blueprint in
                  list_blueprints_response['blueprints']
                  if 'blueprintName' in blueprint and
                  blueprint['blueprintName'] == blueprint_name), None)


if not blueprint:
    # No blueprint with this name yet, so create it
    print(f'No existing blueprint with name={blueprint_name}, creating it')
    response = bda_client.create_blueprint(
        blueprintName=blueprint_name,
        type=blueprint_type,
        blueprintStage=blueprint_stage,
        schema=json.dumps(blueprint_schema)
    )
else:
    # Reuse the existing blueprint and refresh its stage and schema
    print(f'Found existing blueprint with name={blueprint_name}, updating Stage and Schema')
    response = bda_client.update_blueprint(
        blueprintArn=blueprint['blueprintArn'],
        blueprintStage=blueprint_stage,
        schema=json.dumps(blueprint_schema)
    )

blueprint_arn = response['blueprint']['blueprintArn']

### Invoke Data Automation Async
Now that our blueprint has been set up, we can proceed to invoke data automation. Note that in addition to the input and output configuration, we also provide the blueprint ARN when calling the `invoke_data_automation_async` operation.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f'{bda_s3_input_location}/{s3_file_name}'
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    blueprints=[
        {
            'blueprintArn': blueprint_arn
        }
    ]
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}') 

### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` operation. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, at which point the response also includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [None]:
status_response = wait_for_completion(
            client=bda_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # The status response reports failures under the camelCase keys errorType and errorMessage
    raise Exception(f'Invocation Job Error, error_type={status_response.get("errorType")}, error_message={status_response.get("errorMessage")}')

In [None]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata, root='job_metadata', expanded=True)

### Explore the Custom output with custom blueprint

In [None]:
asset_id=0
custom_output_path = next(item["segment_metadata"][0]["custom_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
custom_output = json.loads(read_s3_object(custom_output_path))

The structure of the custom output is the same as that of the output produced when using a catalog blueprint. However, the `inference_result` now contains data that maps to the blueprint schema we provided to BDA with the `InvokeDataAutomationAsync` operation.

In [None]:
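inference_result = custom_output['inference_result']
# Separate the nested procedures from the scalar fields; pop is a method call, not a subscript
medical_procedures = inference_result.pop('medical_procedures')

In [None]:
import pandas as pd
df = pd.DataFrame(inference_result, index=[0])
df.T.style.hide(axis='columns')

In [None]:
# medical_procedures was popped above; assuming it is a list of records, build one row per procedure
df2 = pd.DataFrame(medical_procedures)
df2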
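
The cells above separate `medical_procedures` by name. For a blueprint whose nested fields are not known in advance, the same split between scalar fields and list-valued fields can be made generically. The sketch below uses plain Python; the function name and the sample field names and values are hypothetical.

```python
def split_scalar_and_list_fields(inference_result):
    """Separate list-valued fields (tables) from scalar fields."""
    scalars, tables = {}, {}
    for key, value in inference_result.items():
        if isinstance(value, list):
            tables[key] = value
        else:
            scalars[key] = value
    return scalars, tables


# Hypothetical inference result for illustration
sample_result = {
    "patient_name": "Jane Doe",
    "total_charge": 120.0,
    "medical_procedures": [
        {"procedure_code": "00000", "charge": 70.0},
        {"procedure_code": "11111", "charge": 50.0},
    ],
}
scalars, tables = split_scalar_and_list_fields(sample_result)
print(scalars)
print(list(tables))
```

Each entry in `tables` can then be passed to `pd.DataFrame` on its own, while `scalars` fits a single-row frame.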