# Custom output and blueprints

## Introduction

In addition to `Standard Output`, Amazon Bedrock Data Automation (BDA) offers the `Custom Output` feature, which lets you fine-tune your extractions from documents and images. This capability is particularly useful when working with complex or specialized data.

You configure custom output in BDA using `Blueprints`. Blueprints are essentially lists of instructions that guide the extraction of information from your file, and they can also transform and adjust the output. This feature works in conjunction with BDA projects, enabling the processing of up to 40 document inputs and one image input.

Custom outputs provide users with greater control and flexibility in how they extract and structure information from their files, making it easier to tailor the results to their particular use cases.

## Blueprints

You can use blueprints to configure file processing business logic in Amazon Bedrock Data Automation (BDA). Each blueprint consists of a list of field names to extract, the desired data format for each field (e.g., string, number, boolean), and natural language context for data normalization and validation rules. 
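As a rough illustration of that structure, the sketch below builds a tiny blueprint schema as a Python dictionary. The keys `$schema`, `documentClass`, `description`, `type`, and `inferenceType` mirror those visible in the blueprint response printed later in this notebook. The document class, the two field names, and the top-level `properties` key are assumptions made for illustration, not taken from the CMS 1500 blueprint.

```python
import json

# Hypothetical two-field schema; the class and field names are illustrative only.
example_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "documentClass": "Example Invoice",
    "description": "A hypothetical invoice used to illustrate the blueprint layout.",
    "properties": {  # assumed top-level container for the fields
        "invoice_date": {
            "type": "string",
            "inferenceType": "extractive",
            # Natural language context doubles as a normalization rule.
            "description": "The invoice date in YYYY-MM-DD format",
        },
        "total_amount": {
            "type": "number",
            "inferenceType": "extractive",
            "description": "The total amount due, without a currency symbol",
        },
    },
}

# The blueprint API calls in this notebook pass the schema as a JSON string.
schema_json = json.dumps(example_schema)

for name, spec in example_schema["properties"].items():
    print(f"{name}: {spec['type']} - {spec['description']}")
```

Each field pairs a data type with a plain-language description, which is where formatting and validation hints such as the date format go.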

BDA has ready-to-use blueprints (`Catalog Blueprints`) for a number of commonly used document types such as a W2, a paystub, or a receipt. Catalog blueprints are a great way to start if the document you want to extract from matches the blueprint. To extract from documents that are not matched by blueprints in the catalog, you can create your own blueprints. When creating a blueprint using the AWS Console, you have the option to let BDA generate the blueprint after you provide a sample document and an optional prompt. You can also create the blueprint by adding individual fields or by using a JSON editor to define the JSON for the blueprint.

In this notebook, we explore custom output using blueprints and data automation projects.

## Prerequisites

In [1]:
pip install "boto3>=1.35.76" PyPDF2 --upgrade

Note: you may need to restart the kernel to use updated packages.


## Setup

Before we get to the part where we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [2]:
import boto3
import json
import pprint
from IPython.display import JSON
import IPython.display as display
from utils.helper_functions import (wait_for_job_to_complete, read_s3_object,
                                    create_sample_file, get_bucket_and_key,
                                    wait_for_completion, display_html)
from sagemaker import get_execution_role

default_execution_role = get_execution_role()
region_name = 'us-west-2'
# Initialize the Bedrock Data Automation clients in the chosen region
bda_client = boto3.client('bedrock-data-automation', region_name=region_name)
bda_runtime_client = boto3.client('bedrock-data-automation-runtime',
                                  region_name=region_name)
s3_client = boto3.client('s3')

bda_s3_input_location = 's3://genai.octankmarkets.com/input'
bda_s3_output_location = 's3://genai.octankmarkets.com/output'
print(f'Using default execution role {default_execution_role}')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Using default execution role arn:aws:iam::870569814485:role/service-role/AmazonSageMaker-ExecutionRole-20250102T175813


## Prepare sample document
For this lab, we use a CMS 1500 medical claim form with dummy data to explore the blueprint feature of BDA.

### Upload sample document to S3


Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [3]:
%load_ext autoreload
%autoreload 2

from utils.helper_functions import (wait_for_job_to_complete, read_s3_object,
                                    create_sample_file, get_bucket_and_key,
                                    wait_for_completion, display_html)

# Upload the local sample PDF to the BDA input location in S3
input_bucket, input_prefix = get_bucket_and_key(bda_s3_input_location)
local_file_name = 'examples/sample1_cms-1500-P.pdf'
s3_file_name = 'MonthlyTreasuryStatement.pdf'
s3_response = s3_client.upload_file(local_file_name, input_bucket,
                                    f'{input_prefix}/{s3_file_name}')
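The cell above relies on `get_bucket_and_key` from the notebook's `utils.helper_functions` module, whose source is not shown here. As a minimal sketch of what such a helper might do, the hypothetical function `split_s3_uri` below splits an `s3://` URI into a bucket name and a key prefix using only the standard library. It is an assumption about the helper's behavior, not its actual implementation.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into (bucket, key_or_prefix).

    Hypothetical stand-in for the notebook's get_bucket_and_key helper.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    # The URI path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://genai.octankmarkets.com/input")
print(bucket, prefix)
```

Keeping the bucket and prefix separate matters because `upload_file` takes the bucket and the object key as distinct arguments rather than a single URI.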

## Using Blueprints

### Configure IAM Permissions

The features being explored in the notebook require the following IAM Policies for the role being used. If you're running this notebook within SageMaker Studio in your own Account, update the default execution role for the SageMaker user profile to include the following IAM policies. 

```json
[
    {
        "Sid": "BDACreatePermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:CreateDataAutomationProject",
            "bedrock:CreateBlueprint"
        ],
        "Resource": "*"
    },
    {
        "Sid": "BDAOProjectsPermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:CreateDataAutomationProject",
            "bedrock:UpdateDataAutomationProject",
            "bedrock:GetDataAutomationProject",
            "bedrock:GetDataAutomationStatus",
            "bedrock:ListDataAutomationProjects",
            "bedrock:InvokeDataAutomationAsync"
        ],
        "Resource": "arn:aws:bedrock:::data-automation-project/*"
    },
    {
        "Sid": "BDABlueprintPermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:GetBlueprint",
            "bedrock:ListBlueprints",
            "bedrock:UpdateBlueprint",
            "bedrock:DeleteBlueprint"
        ],
        "Resource": "arn:aws:bedrock:::blueprint/*"
    }
]
```

### Create Blueprint

Let's start by creating our first blueprint. You can create a blueprint through the API by providing a name, type, stage, and a schema in JSON format.

In [4]:
# create blueprint using Boto3
blueprint_name = 'medical-claim-form-cms1500'
blueprint_description = 'Blueprint for CMS 1500 Claim Form'
blueprint_type = 'DOCUMENT'
blueprint_stage = 'LIVE'

with open('assets/blueprint_schema.json') as f:
    blueprint_schema = json.load(f)

In [5]:
# Look for an existing blueprint with the same name
list_blueprints_response = bda_client.list_blueprints(
    blueprintStageFilter='ALL'
)
blueprint = next((blueprint for blueprint in
                  list_blueprints_response['blueprints']
                  if 'blueprintName' in blueprint and
                  blueprint['blueprintName'] == blueprint_name), None)

if not blueprint:
    # No blueprint with this name yet, so create it
    print(f'No existing blueprint with name={blueprint_name}, creating it')
    response = bda_client.create_blueprint(
        blueprintName=blueprint_name,
        type=blueprint_type,
        blueprintStage=blueprint_stage,
        schema=json.dumps(blueprint_schema)
    )
else:
    # Reuse the existing blueprint and refresh its stage and schema
    print(f'Found existing blueprint with name={blueprint_name}, updating Stage and Schema')
    response = bda_client.update_blueprint(
        blueprintArn=blueprint['blueprintArn'],
        blueprintStage=blueprint_stage,
        schema=json.dumps(blueprint_schema)
    )

blueprint_arn = response['blueprint']['blueprintArn']

Found existing blueprint with name=medical-claim-form-cms1500, updating Stage and Schema
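The lookup in the cell above reads only the first page of `list_blueprints` results, so in an account with many blueprints it could miss an existing one. The sketch below shows one way to follow pagination. The helper name `find_blueprint_by_name` is hypothetical, and it assumes the response carries a `nextToken` key when more pages exist, so verify that against the API reference before relying on it.

```python
def find_blueprint_by_name(client, name, stage="ALL"):
    """Return the first blueprint summary whose name matches, or None.

    Assumes list_blueprints accepts and returns 'nextToken' for paging.
    """
    kwargs = {"blueprintStageFilter": stage}
    while True:
        page = client.list_blueprints(**kwargs)
        for candidate in page.get("blueprints", []):
            if candidate.get("blueprintName") == name:
                return candidate
        token = page.get("nextToken")
        if not token:
            # No further pages, so the blueprint does not exist.
            return None
        kwargs["nextToken"] = token
```

With the clients from the setup cell, a call such as `find_blueprint_by_name(bda_client, blueprint_name)` could stand in for the single-page `next(...)` expression.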


In [7]:
response

{'ResponseMetadata': {'RequestId': 'df77f40e-e166-4e6a-b7a2-6cffb1cf6bf8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Wed, 15 Jan 2025 12:33:10 GMT',
   'content-type': 'application/json',
   'content-length': '8351',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'df77f40e-e166-4e6a-b7a2-6cffb1cf6bf8'},
  'RetryAttempts': 0},
 'blueprint': {'blueprintArn': 'arn:aws:bedrock:us-west-2:870569814485:blueprint/f95942cc2085',
  'schema': '{"$schema": "http://json-schema.org/draft-07/schema#", "documentClass": "CMS 1500 Claim Form", "description": "A standard medical claim form used by healthcare providers in the US to bill health insurance companies for medical services.", "definitions": {"Procedure_Service_Supplies": {"properties": {"service_start_date": {"type": "string", "inferenceType": "extractive", "description": "The service start date from item 24A in YYYY-MM-DD format"}, "service_end_date": {"type": "string", "inferenceType": "extractive", "description": "The servic