# Using Multiple Blueprints with Projects

## Introduction

Data projects in Amazon Bedrock Data Automation (BDA) provide an easy way to group your standard and custom output configurations for processing files. You can create a BDA project and pass its ARN to the `InvokeDataAutomationAsync` API. BDA then processes the input file using the configuration settings defined in that project and generates output accordingly. You can use a single project resource for multiple file types.

You can also configure a project with blueprints for documents (or images) to define custom output. In this notebook, we explore using a project with blueprints to process documents. We will start by creating a project with a single blueprint and progress to adding multiple blueprints (preexisting and custom) to process a file that contains multiple documents.

## Using projects with custom output

You can configure custom output for documents by adding a new blueprint (or a preexisting blueprint from the BDA global catalog) to the BDA project. If your use case involves different kinds of documents, you can attach multiple blueprints to the project, one for each kind of document.

**Note: A project can have up to 40 document blueprints attached.**

When you attach multiple blueprints to a project, BDA automatically looks for the blueprint that best matches the input document. Once a matching blueprint is found, BDA generates custom output using that blueprint.
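
As a rough illustration of the limit noted above, the following sketch builds a `customOutputConfiguration` payload from a list of blueprint ARNs and rejects lists with more than 40 entries. The helper name `build_custom_output_configuration` and the constant `MAX_DOCUMENT_BLUEPRINTS` are hypothetical and not part of the BDA SDK; the payload shape mirrors the configuration used in the project creation cell of this notebook.

```python
MAX_DOCUMENT_BLUEPRINTS = 40  # limit stated in the note above


def build_custom_output_configuration(blueprint_arns, stage="LIVE"):
    """Return a customOutputConfiguration dict for the given blueprint ARNs."""
    # Drop duplicate ARNs while keeping the original order
    unique_arns = list(dict.fromkeys(blueprint_arns))
    if len(unique_arns) > MAX_DOCUMENT_BLUEPRINTS:
        raise ValueError(
            f"{len(unique_arns)} blueprints requested, "
            f"but a project accepts at most {MAX_DOCUMENT_BLUEPRINTS}"
        )
    return {
        "blueprints": [
            {"blueprintArn": arn, "blueprintStage": stage} for arn in unique_arns
        ]
    }
```

Checking the count before calling the service is a design choice for faster feedback; the service itself remains the authority on the limit.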

Let's go through the steps to create a project and attach a set of blueprints to process different document types.

## Prerequisites

In [2]:
pip install "boto3>=1.35.76" PyPDF2 --upgrade -qq

Note: you may need to restart the kernel to use updated packages.


## Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [82]:
%load_ext autoreload
%autoreload 1
import boto3
import json
import pprint
from IPython.display import JSON, display, HTML, Markdown, IFrame
import sagemaker
import pandas as pd
from itables import show
import time

import sys
sys.path.append('..')

%aimport utils.helper_functions
%aimport utils.display_functions
    
from utils.helper_functions import read_s3_object, download_document, get_bucket_and_key, get_summaries, transform_custom_output,  wait_for_completion, display_html, create_or_update_blueprint
from utils.display_functions import display_multiple, segment_view, get_view, display_collapsable


session = sagemaker.Session()
default_bucket = session.default_bucket()

region_name = 'us-west-2'
# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Configure IAM Permissions


The features explored in this notebook require the following IAM permissions for the role being used. If you're running this notebook within SageMaker Studio in your own account, update the default execution role for the SageMaker user profile to include the following IAM policy statements.

```json
[
    {
        "Sid": "BDACreatePermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:CreateDataAutomationProject",
            "bedrock:CreateBlueprint"
        ],
        "Resource": "*"
    },
    {
        "Sid": "BDAOProjectsPermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:CreateDataAutomationProject",
            "bedrock:UpdateDataAutomationProject",
            "bedrock:GetDataAutomationProject",
            "bedrock:GetDataAutomationStatus",
            "bedrock:ListDataAutomationProjects",
            "bedrock:InvokeDataAutomationAsync"
        ],
        "Resource": "arn:aws:bedrock:::data-automation-project/*"
    },
    {
        "Sid": "BDABlueprintPermissions",
        "Effect": "Allow",
        "Action": [
            "bedrock:GetBlueprint",
            "bedrock:ListBlueprints",
            "bedrock:UpdateBlueprint",
            "bedrock:DeleteBlueprint"
        ],
        "Resource": "arn:aws:bedrock:::blueprint/*"
    }
]
```
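
The list above contains only policy statements. A minimal sketch of wrapping such statements in a complete IAM policy document follows. The helper name `make_policy_document` is hypothetical, and the single statement shown is copied from the list above for brevity.

```python
import json


def make_policy_document(statements):
    """Wrap a list of IAM statements in a complete policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


# One statement copied from the list above, for illustration
blueprint_statement = {
    "Sid": "BDABlueprintPermissions",
    "Effect": "Allow",
    "Action": [
        "bedrock:GetBlueprint",
        "bedrock:ListBlueprints",
        "bedrock:UpdateBlueprint",
        "bedrock:DeleteBlueprint",
    ],
    "Resource": "arn:aws:bedrock:::blueprint/*",
}

policy_json = json.dumps(make_policy_document([blueprint_statement]), indent=4)
print(policy_json)
```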

### View Sample Document

In [4]:
local_file_name = 'data/documents/claims-pack.pdf'
IFrame("data/documents/claims-pack.pdf", width=900, height=800)

### Upload sample document to S3
For this lab, we use a CMS 1500 medical claim form with dummy data to explore the blueprint feature of BDA.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [5]:
input_bucket, input_prefix = get_bucket_and_key(bda_s3_input_location)
s3_file_name = 'claims-pack.pdf'
s3_response = s3_client.upload_file(local_file_name, input_bucket,
                                    f'{input_prefix}/{s3_file_name}')
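
The `get_bucket_and_key` helper comes from this repository's `utils` package, and its implementation is not shown here. As an assumption about what such a helper does, a minimal equivalent can be sketched with the standard library. The name `split_s3_uri` and the bucket name in the example are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://example-bucket/bda/input")
print(bucket, prefix)
```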

### Create custom blueprints for our documents (where needed)

We will use the `create_blueprint` operation (or `update_blueprint` to update an existing blueprint) in the `boto3` API to create or update each blueprint. You can also create or update blueprints using the AWS console. Each blueprint that you create is an AWS resource with its own blueprint ID and ARN.

In [6]:
# create blueprint using Boto3
blueprints = [
    {
        "name": 'claim-form',
        "description": 'Blueprint for Medical Claim form CMS 1500',
        "type": 'DOCUMENT',
        "stage": 'LIVE',
        "schema_path": 'data/blueprints/claims_form.json'
    },
    {
        "name": 'hospital-discharge-report',
        "description": 'Blueprint for Hospital discharge summary report',
        "type": 'DOCUMENT',
        "stage": 'LIVE',
        "schema_path": 'data/blueprints/discharge_summary.json'
    },
    {
        "name": 'medical-transcription',
        "description": 'Medical Transcription',
        "type": 'DOCUMENT',
        "stage": 'LIVE',
        "schema_path": 'data/blueprints/medical_transcription.json'
    },
    
]




In [7]:
blueprint_arns = []
for blueprint in blueprints:
    with open(blueprint['schema_path']) as f:
        blueprint_schema = json.load(f)
        blueprint_arn = create_or_update_blueprint(
            bda_client, 
            blueprint['name'], 
            blueprint['description'], 
            blueprint['type'],
            blueprint['stage'],
            blueprint_schema
        )
        blueprint_arns += [blueprint_arn]

Found existing blueprint with name=claim-form, updating Stage and Schema
Found existing blueprint with name=hospital-discharge-report, updating Stage and Schema
Found existing blueprint with name=medical-transcription, updating Stage and Schema
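
The `create_or_update_blueprint` helper is imported from `utils.helper_functions`, so its body is not visible in this notebook. The sketch below shows one way such an upsert could be written. The function name `upsert_blueprint` is hypothetical, and the request and response field names (`blueprintName`, `blueprintArn`, `blueprintStage`, `schema`, `type`) reflect my understanding of the `bedrock-data-automation` client and should be checked against the boto3 documentation.

```python
import json


def upsert_blueprint(bda_client, name, blueprint_type, stage, schema):
    """Create a blueprint, or update the existing one with the same name.

    Assumes `bda_client` is a boto3 'bedrock-data-automation' client.
    Only the first page of list_blueprints results is inspected.
    """
    existing = next(
        (bp for bp in bda_client.list_blueprints().get("blueprints", [])
         if bp.get("blueprintName") == name),
        None,
    )
    # The schema is passed to the API as a JSON string, not a dict
    schema_json = json.dumps(schema)
    if existing:
        response = bda_client.update_blueprint(
            blueprintArn=existing["blueprintArn"],
            blueprintStage=stage,
            schema=schema_json,
        )
    else:
        response = bda_client.create_blueprint(
            blueprintName=name,
            type=blueprint_type,
            blueprintStage=stage,
            schema=schema_json,
        )
    return response["blueprint"]["blueprintArn"]
```

Looking the blueprint up by name before creating it is what makes the cell above safe to re-run, as the "Found existing blueprint" messages in its output suggest.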


The `create_data_automation_project` API takes a project name, description, stage (LIVE / DEVELOPMENT), a standard output configuration, and a custom output configuration as input. The `update_data_automation_project` API accepts the same output configurations for an existing project, identified by its ARN. Because this notebook focuses on custom output, the key part is the custom output configuration, which lists the ARN and stage of each blueprint attached to the project.

### Create data project to process file with multiple document types
With the sample document uploaded to S3, we can now go ahead and create our data project. We check whether a project with the given name already exists; if not, we create a new project with the given configuration, otherwise we update the existing project with that configuration.

A few points to note - 
* We have added a set of blueprints to our data project to match the document types we expect to find in the claims pack.
* We pass `overrideConfiguration` to the API call with the document splitter enabled.

In [8]:
bda_project_name = 'document-custom-output-multiple'
bda_project_stage = 'LIVE'
standard_output_configuration = {
    'document': {
        'extraction': {
            'granularity': {
                'types': [
                    'DOCUMENT', 'PAGE'
                ]
            },
            'boundingBox': {
                'state': 'ENABLED'
            }
        },
        'generativeField': {
            'state': 'ENABLED'
        },
        'outputFormat': {
            'textFormat': {
                'types': [
                    'MARKDOWN'
                ]
            },
            'additionalFileFormat': {
                'state': 'ENABLED'
            }
        }
    }
}

custom_output_configuration = {
    "blueprints": [
        {
            'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-bank-statement',
            'blueprintStage': 'LIVE'
        },
        {
            'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license',
            'blueprintStage': 'LIVE'
        },
        {
            'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-invoice',
            'blueprintStage': 'LIVE'
        },
        {
            'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-prescription-label',
            'blueprintStage': 'LIVE'
        }
    ]
}
custom_output_configuration['blueprints'] += [
    {
        'blueprintArn': blueprint_arn,
        'blueprintStage': 'LIVE'
    } for blueprint_arn in blueprint_arns
]

override_configuration={'document': {'splitter': {'state': 'ENABLED'}}}

In [9]:
list_project_response = bda_client.list_data_automation_projects(
    projectStageFilter=bda_project_stage)

project = next((project for project in list_project_response['projects']
               if project['projectName'] == bda_project_name), None)

if not project:
    response = bda_client.create_data_automation_project(
        projectName=bda_project_name,
        projectDescription='A BDA Project for Standard Output from Document',
        projectStage=bda_project_stage,
        standardOutputConfiguration=standard_output_configuration,
        customOutputConfiguration=custom_output_configuration,
        overrideConfiguration=override_configuration
    )
else:
    response = bda_client.update_data_automation_project(
        projectArn=project['projectArn'],
        standardOutputConfiguration=standard_output_configuration,
        customOutputConfiguration=custom_output_configuration,
        overrideConfiguration=override_configuration
    )

project_arn = response['projectArn']
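
The lookup in the cell above inspects only the first page of results returned by `list_data_automation_projects`. In an account with many projects, the target project could sit on a later page. The following sketch follows `nextToken` until the project is found or the pages run out. The helper name `find_project_by_name` is hypothetical, and it assumes the response carries a `nextToken` field when more results are available.

```python
def find_project_by_name(bda_client, project_name, stage="LIVE"):
    """Return the project summary with the given name, or None if absent.

    Follows `nextToken` so projects beyond the first page are also checked.
    """
    kwargs = {"projectStageFilter": stage}
    while True:
        page = bda_client.list_data_automation_projects(**kwargs)
        for project in page.get("projects", []):
            if project.get("projectName") == project_name:
                return project
        next_token = page.get("nextToken")
        if not next_token:
            # No more pages and no match
            return None
        kwargs["nextToken"] = next_token
```

A call such as `find_project_by_name(bda_client, bda_project_name, bda_project_stage)` could stand in for the `next(...)` expression in the cell above.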

### Wait for create/update data project operation completion

In [10]:
status_response = wait_for_completion(
            client=bda_client,
            get_status_function=bda_client.get_data_automation_project,
            status_kwargs={'projectArn': project_arn},
            completion_states=['COMPLETED'],
            error_states=['FAILED'],
            status_path_in_response='project.status',
            max_iterations=15,
            delay=30
)

Operation completed successfully with status: COMPLETED
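
The `wait_for_completion` helper is imported from `utils.helper_functions`, and its implementation is not shown in this notebook. As an assumption about how such a helper might work, the sketch below polls a status function until the status reaches a completion or error state. The name `poll_until_done` is hypothetical; its parameters loosely mirror the arguments passed in the cell above.

```python
import time


def poll_until_done(get_status, status_kwargs, completion_states, error_states,
                    status_path="status", max_iterations=15, delay=30):
    """Call get_status(**status_kwargs) until a terminal state is reached."""
    for _ in range(max_iterations):
        response = get_status(**status_kwargs)
        # Walk a dotted path such as 'project.status' through the nested response
        status = response
        for key in status_path.split("."):
            status = status[key]
        if status in completion_states:
            return response
        if status in error_states:
            raise RuntimeError(f"Operation failed with status: {status}")
        time.sleep(delay)
    raise TimeoutError(f"No terminal status after {max_iterations} checks")
```

With `max_iterations=15` and `delay=30`, a loop like this waits roughly seven and a half minutes before giving up.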


### Invoke Data Automation Async
With the data project configured, we can now invoke data automation for our sample document. When we submit the document for processing, BDA scans the file, splits it into individual documents based on context, and matches each document against the list of blueprints provided.

In [11]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f'{bda_s3_input_location}/{s3_file_name}'
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationArn': project_arn,
        'stage': 'LIVE'
    }
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}')

Invoked data automation job with invocation arn arn:aws:bedrock:us-west-2:870569814485:data-automation-invocation/b2c9975b-2b87-40f8-b8c5-676127d488d3


### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, at which point the response includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [42]:
status_response = wait_for_completion(
            client=bda_runtime_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    raise Exception(f'Invocation Job Error, error_type={status_response.get("errorType")}, error_message={status_response.get("errorMessage")}')

Operation completed successfully with status: Success


### Retrieve Job Metadata
Let's retrieve the job metadata. As expected, the job metadata contains the S3 URIs of the standard output and the status of the custom output for each segment.

The custom output status is either `MATCH` or `NO_MATCH`. `MATCH` indicates that BDA found a matching blueprint for the segment; `NO_MATCH` indicates that BDA was unable to match the segment to any blueprint associated with the project.

In [103]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))

job_metadata_table = pd.DataFrame(job_metadata['output_metadata'][0]['segment_metadata'])
job_metadata_json = JSON(job_metadata, root="job_metadata", expanded=True)
# Display the widget
display_multiple([get_view(job_metadata_json), get_view(job_metadata_table)], ["Raw JSON", "Table View"])

Tab(children=(Output(), Output()), selected_index=0, titles=('Raw JSON', 'Table View'))
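
To get a quick overview of how many segments matched a blueprint, the statuses can be tallied directly from the job metadata. The sketch below uses a hand-written sample with the same shape as the metadata accessed above (`output_metadata`, `segment_metadata`, `custom_output_status`). The helper name `count_match_status` is hypothetical, and the sample values are illustrative, not real BDA output.

```python
from collections import Counter


def count_match_status(job_metadata):
    """Count segments per custom_output_status across all assets."""
    counts = Counter()
    for asset in job_metadata.get("output_metadata", []):
        for segment in asset.get("segment_metadata", []):
            counts[segment.get("custom_output_status", "UNKNOWN")] += 1
    return dict(counts)


# Hand-written sample shaped like the job metadata (not real output)
sample_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"custom_output_status": "MATCH"},
                {"custom_output_status": "MATCH"},
                {"custom_output_status": "NO_MATCH"},
            ],
        }
    ]
}
print(count_match_status(sample_metadata))
```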

### View Segments and Matched Blueprints
As we can see in the job metadata, BDA creates a segment section for each individual document that it has identified in the file. Each segment section has details on the matched blueprint and the results of the extraction. For each segment, BDA also outputs the page indices (one or more) from the original file.

We can now get the custom output corresponding to each segment and look at the insights that BDA custom output produces.

In [142]:
asset_id = 0
segments_metadata = next(item["segment_metadata"]
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)

standard_outputs = [
    json.loads(read_s3_object(segment_metadata.get('standard_output_path')))
    for segment_metadata in segments_metadata
]
# Segments without a matching blueprint have no custom output, so keep None as a placeholder
custom_outputs = [
    json.loads(read_s3_object(segment_metadata.get('custom_output_path')))
    if segment_metadata.get('custom_output_status') == 'MATCH' else None
    for segment_metadata in segments_metadata
]

### View Custom output

In [146]:
custom_outputs_json = JSON(custom_outputs, root="custom_outputs", expanded=False)
custom_outputs_table = pd.DataFrame(get_summaries(custom_outputs))

display_multiple([get_view(custom_outputs_json), get_view(custom_outputs_table.style.hide(axis='index'))], ["Raw JSON", "Table View"])

Tab(children=(Output(), Output()), selected_index=0, titles=('Raw JSON', 'Table View'))
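
The `get_summaries` helper used above is not shown in this notebook. As a rough idea of what a per-segment summary could look like, the sketch below lists the matched blueprint name and page indices for each segment. The helper name `summarize_segments` is hypothetical, the `split_document.page_indices` field name is an assumption about the custom output layout, and the sample data is hand-written rather than real BDA output.

```python
def summarize_segments(custom_outputs):
    """Return one row per segment: index, matched blueprint name, and pages."""
    rows = []
    for index, output in enumerate(custom_outputs):
        if output is None:
            # No blueprint matched this segment, so there is no custom output
            rows.append({"segment": index, "blueprint": None, "pages": []})
            continue
        rows.append({
            "segment": index,
            "blueprint": output.get("matched_blueprint", {}).get("name"),
            # 'split_document.page_indices' is an assumed field name
            "pages": output.get("split_document", {}).get("page_indices", []),
        })
    return rows


# Hand-written sample shaped like the custom outputs (not real BDA output)
sample_outputs = [
    {"matched_blueprint": {"name": "claim-form"},
     "split_document": {"page_indices": [0, 1]}},
    None,
]
for row in summarize_segments(sample_outputs):
    print(row)
```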

In [120]:
pip install PyPDF

Collecting PyPDF
  Downloading pypdf-5.2.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.2.0-py3-none-any.whl (298 kB)
Installing collected packages: PyPDF
Successfully installed PyPDF-5.2.0
Note: you may need to restart the kernel to use updated packages.


In [122]:
from PIL import Image
from pypdf import PdfWriter
import os

def images_to_pdf(image_paths, output_pdf):
    # Initialize PDF writer
    writer = PdfWriter()
    
    # Convert images to individual PDFs first
    pdf_paths = []
    
    for i, image_path in enumerate(image_paths):
        # Open image
        image = Image.open(image_path)
        # Convert to RGB if necessary
        if image.mode != 'RGB':
            image = image.convert('RGB')
        # Create temporary PDF name
        temp_pdf = f'temp_{i}.pdf'
        # Save as PDF
        image.save(temp_pdf, 'PDF')
        # Append to writer
        writer.append(temp_pdf)
        pdf_paths.append(temp_pdf)
    
    # Save final PDF
    with open(output_pdf, 'wb') as output:
        writer.write(output)
    
    # Clean up temporary PDFs
    for pdf in pdf_paths:
        os.remove(pdf)
        
# Example usage
image_list = ['data/documents/claim-form.png', 'data/documents/claim-form.png']
images_to_pdf(image_list, 'combined_output.pdf')


### Explore Document Insights using Standard and Custom output

In [97]:
views = []
titles = []
# Build one view per segment that matched a blueprint
for custom_output, standard_output in zip(custom_outputs, standard_outputs):
    if custom_output is None:
        # Segments with status NO_MATCH have no custom output to display
        continue
    # transform_custom_output is imported from utils.helper_functions in the Setup cell
    result = transform_custom_output(custom_output['inference_result'], custom_output['explainability_info'][0])
    document_image_uris = [page.get('asset_metadata', {}).get('rectified_image')
                           for page in standard_output.get('pages', [])]
    views += [segment_view(document_image_uris=document_image_uris,
                           inference_result=result)]
    titles += [custom_output.get('matched_blueprint', {}).get('name', None)]
display_multiple(views, titles)

Tab(children=(HBox(children=(VBox(children=(Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01…

We can now explore the custom output received from processing the document.

### Extract Standard output

In [100]:
JSON(standard_output)

<IPython.core.display.JSON object>

In [102]:
!aws s3 cp {standard_output.get('pages')[0]['asset_metadata']['rectified_image']} 'examples/rectified_claim_form.jpeg'

download: s3://sagemaker-us-west-2-870569814485/bda/output/b2c9975b-2b87-40f8-b8c5-676127d488d3/0/standard_output/0/assets/rectified_image_0.png to examples/rectified_claim_form.jpeg


## Clean up

In [None]:
!rm examples/rectified_claim_form.jpeg

## Conclusion
In this notebook, we configured a project with Blueprints for documents, to define custom output.

## Next Steps

When processing documents, you might want to use multiple blueprints for the different kinds of documents that are passed to your project. A project can have up to 40 document blueprints attached. BDA automatically matches each document to the appropriate blueprint configured in your project and generates custom output using that blueprint.

As a next step, we will configure a project with multiple blueprints for different kinds of documents and explore how this automatic matching works.

In [None]:
IFrame("examples/sample1_cms-1500-P.pdf", width=900, height=800)