# Bedrock Data Automation Projects

## Introduction

In the last section we explored the default way to invoke data automation with standard output. A more flexible way of setting up standard output is to use [Bedrock Data Automation projects](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-projects.html).

A project encapsulates your standard output (mandatory) and custom output (optional) configurations. Projects give you the flexibility to define your standard output with its own set of configurable options specific to your use case.

Once you create a project with standard output and (optionally) custom output configuration, you can call the `InvokeDataAutomationAsync` API with the project ARN. BDA then uses the configuration defined in the project as the basis to generate the output.

You can define configuration for more than one data type in a BDA project. BDA processes the input provided using the appropriate configuration associated with the type of the input. For example, an audio file sent to BDA using project name ABC will be processed using project ABC’s audio standard output configuration. A document sent to BDA using project name ABC will be processed using project ABC’s document standard output configuration.

In this notebook, we will explore the various ways through which we can leverage projects to define our standard output configuration.


## Prerequisites

In [None]:
pip install "boto3>=1.35.76" PyPDF2 --upgrade -qq

## Setup

Before we get to the part where we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
import pprint
from IPython.display import JSON, display, HTML, Markdown, IFrame
import random


region_name = 'us-west-2'
# Initialize the Bedrock Data Automation clients in the selected region
bda_client = boto3.client('bedrock-data-automation', region_name=region_name)
bda_runtime_client = boto3.client('bedrock-data-automation-runtime', region_name=region_name)
s3_client = boto3.client('s3')
bda_s3_input_location = 's3://genai.octankmarkets.com/input'
bda_s3_output_location = 's3://genai.octankmarkets.com/output'

### Prepare Sample Document
For this lab, we use a `Monthly Treasury Statement for the United States Government` for Fiscal Year 2025 through November 30, 2024. The document is prepared by the Bureau of the Fiscal Service, Department of the Treasury and provides detailed information on the government's financial activities. We will extract a subset of pages from the `PDF` document and use BDA to extract and analyse the document content.

### Download and store sample document
We use the document URL to download the document and store it in an S3 location.
Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [None]:
%load_ext autoreload
%autoreload 1
import sys
sys.path.append('..')
%aimport utils.helper_functions

from utils.helper_functions import read_s3_object, create_sample_file, get_bucket_and_key,wait_for_completion, display_html


# Download the sample PDF file, extract a subset of pages and upload the result to S3
input_bucket, input_prefix = get_bucket_and_key(bda_s3_input_location)
sample_data_url = "https://fiscaldata.treasury.gov/static-data/published-reports/mts/MonthlyTreasuryStatement_202411.pdf"
local_file_name = 'examples/MonthlyTreasuryStatement_202411.pdf'
create_sample_file(sample_data_url, 0, 9, local_file_name)
s3_file_name = 'MonthlyTreasuryStatement.pdf'
s3_response = s3_client.upload_file(local_file_name, input_bucket,
                                    f'{input_prefix}/{s3_file_name}')
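
The `get_bucket_and_key` helper comes from the workshop's `utils` module and is not shown in this notebook. As a rough illustration of what splitting an S3 URI involves, here is a minimal sketch using only the standard library. The function name `split_s3_uri` is hypothetical and is not the workshop helper itself.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The parsed path keeps its leading slash; strip it to get the key or prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://genai.octankmarkets.com/input")
print(bucket, prefix)
```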

### View Sample Document

In [None]:
display(IFrame("examples/MonthlyTreasuryStatement_202411.pdf", width=900, height=800))

## Using Data Project with Standard Output

Let's start by creating our first project. You can create a project with a name, description and a standard output configuration (for one or more modalities). You can set standard output options specific to your use case. For our example, we will define standard output for Document modality with the following options to explore output - 

**Response Granularity** - DOCUMENT, PAGE, LINE, WORD, ELEMENT

**Bounding Box**  - ENABLED

**Generative Field** - ENABLED

**Output Format** - Markdown, CSV, HTML

**Additional File Format** - ENABLED

All other options (including those for other modalities) take their default values.

### Configure IAM Permissions

The features being explored in the notebook require the following IAM Policies for the role being used. If you're running this notebook within SageMaker Studio in your own Account, update the default execution role for the SageMaker user profile to include the following IAM policies. 

```json
        {
            "Sid": "BDAPermissions",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateDataAutomationProject",
                "bedrock:CreateBlueprint",
                "bedrock:ListDataAutomationProjects",
                "bedrock:UpdateDataAutomationProject",
                "bedrock:GetDataAutomationProject",
                "bedrock:GetDataAutomationStatus",
                "bedrock:InvokeDataAutomationAsync",
                "bedrock:GetBlueprint",
                "bedrock:ListBlueprints",
                "bedrock:UpdateBlueprint",
                "bedrock:DeleteBlueprint"
            ],
            "Resource": "*"
        }
```

Note - The policy uses wildcard(s) for demo purposes. AWS recommends applying least privilege when defining IAM policies in your own AWS accounts.

### Define data project input

The `create_data_automation_project` API takes a project name, description, stage (LIVE / DEVELOPMENT) and the standard output configuration as input

In [None]:
bda_project_name = 'document-std-output'
bda_project_stage = 'LIVE'
standard_output_configuration = {
        'document': {
            'extraction': {
                'granularity': {
                    'types': [
                        'DOCUMENT', 'PAGE', 'LINE', 'WORD', 'ELEMENT'
                    ]
                },
                'boundingBox': {
                    'state': 'ENABLED'
                }
            },
            'generativeField': {
                'state': 'ENABLED'
            },
            'outputFormat': {
                'textFormat': {
                    'types': [
                        'MARKDOWN', 'CSV','HTML'
                    ]
                },
                'additionalFileFormat': {
                    'state': 'ENABLED'
                }
            }
        }
    }
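
Nested configuration dictionaries can be hard to review at a glance. The sketch below flattens a configuration of the same shape into dotted paths so that each option can be read on one line. The helper `flatten_config` is hypothetical and purely for inspection; it is not part of boto3 or the workshop utilities.

```python
def flatten_config(config: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into {'a.b.c': value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_config(value, path))
        else:
            flat[path] = value
    return flat


# Same shape as the standard_output_configuration defined in the cell above
example_configuration = {
    "document": {
        "extraction": {
            "granularity": {"types": ["DOCUMENT", "PAGE", "LINE", "WORD", "ELEMENT"]},
            "boundingBox": {"state": "ENABLED"},
        },
        "generativeField": {"state": "ENABLED"},
        "outputFormat": {
            "textFormat": {"types": ["MARKDOWN", "CSV", "HTML"]},
            "additionalFileFormat": {"state": "ENABLED"},
        },
    }
}

for path, value in flatten_config(example_configuration).items():
    print(f"{path} = {value}")
```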

### Create data automation project

Check whether a project with the given name already exists. If not, create a new project with the given configuration; otherwise, update the existing project with the given configuration and stage.

In [None]:
list_project_response = bda_client.list_data_automation_projects(
    projectStageFilter=bda_project_stage)

project = next((project for project in list_project_response['projects']
               if project['projectName'] == bda_project_name), None)

if not project:
    response = bda_client.create_data_automation_project(
        projectName=bda_project_name,
        projectDescription='A BDA Project for Standard Output from Document',
        projectStage=bda_project_stage,
        standardOutputConfiguration=standard_output_configuration)
else:
    response = bda_client.update_data_automation_project(
        projectArn=project['projectArn'],
        projectStage=bda_project_stage,
        standardOutputConfiguration=standard_output_configuration)

project_arn = response['projectArn']
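
The lookup above inspects only the first page of results. In an account with many projects the match could be on a later page, so a more defensive lookup might follow the pagination token. The sketch below assumes the list operation returns a `nextToken` field when more results are available and accepts it back as a `nextToken` argument; the helper name `find_project_by_name` is hypothetical.

```python
def find_project_by_name(client, project_name, stage):
    """Return the first project summary whose name matches, or None.

    Assumes the response carries 'nextToken' when more pages exist.
    """
    kwargs = {"projectStageFilter": stage}
    while True:
        page = client.list_data_automation_projects(**kwargs)
        for project in page.get("projects", []):
            if project["projectName"] == project_name:
                return project
        next_token = page.get("nextToken")
        if not next_token:
            return None
        # Ask for the next page on the following loop iteration.
        kwargs["nextToken"] = next_token


# Possible usage in this notebook:
# project = find_project_by_name(bda_client, bda_project_name, bda_project_stage)
```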

### Wait for create/update data project operation completion

In [None]:
status_response = wait_for_completion(
            client=bda_client,
            get_status_function=bda_client.get_data_automation_project,
            status_kwargs={'projectArn': project_arn},
            completion_states=['COMPLETED'],
            error_states=['FAILED'],
            status_path_in_response='project.status',
            max_iterations=15,
            delay=30
)
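
`wait_for_completion` is imported from the workshop's `utils` module and its implementation is not shown here. The following is a minimal sketch of a polling loop with similar parameters, to illustrate the idea. The names `poll_until_done` and `get_by_path` are hypothetical, and the real helper may behave differently (for example, this sketch raises on an error state rather than returning the response).

```python
import time


def get_by_path(data: dict, dotted_path: str):
    """Resolve a dotted path such as 'project.status' against nested dicts."""
    for key in dotted_path.split("."):
        data = data[key]
    return data


def poll_until_done(get_status_function, status_kwargs, completion_states,
                    error_states, status_path_in_response,
                    max_iterations=15, delay=30):
    """Call get_status_function until a completion or error state is seen."""
    for _ in range(max_iterations):
        response = get_status_function(**status_kwargs)
        status = get_by_path(response, status_path_in_response)
        if status in completion_states:
            return response
        if status in error_states:
            raise RuntimeError(f"Operation ended in error state: {status}")
        time.sleep(delay)  # wait before checking again
    raise TimeoutError(f"No terminal state after {max_iterations} checks")
```

With `max_iterations=15` and `delay=30`, a loop like this gives up after roughly seven and a half minutes.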

### Invoke Data Automation Async
Note that we now also provide the data automation configuration with our project ARN.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f'{bda_s3_input_location}/{s3_file_name}'
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationArn': project_arn,
        'stage': 'LIVE'
    }
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}') 

### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, along with the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [None]:
status_response = wait_for_completion(
            client=bda_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # GetDataAutomationStatus reports failures in the errorType and errorMessage fields
    raise Exception(f'Invocation Job Error, error_type={status_response.get("errorType")}, '
                    f'error_message={status_response.get("errorMessage")}')

### Retrieve Job Metadata

In [None]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata, root='job_metadata', expanded=True)

---

## Explore the Data Project Standard output
We can now explore the standard output received from processing documents with the settings configured in our Data Automation project.


The standard output for the `Document` modality always includes a `metadata` section and a `document` section in the JSON result. Because our project configuration sets the response granularity to include Page, Element, Line and Word in addition to Document, the results also include the `pages`, `elements`, `text_lines` and `text_words` sections. Let's move on to exploring the content of these sections.
 
Note that a section is only present when its granularity is enabled. For example, leaving `ELEMENT` out of the standard output configuration in the project would omit the `elements` section from the result.

In [None]:
asset_id=0
standard_output_path = next(item["segment_metadata"][0]["standard_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
standard_output = json.loads(read_s3_object(standard_output_path))
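
The cell above picks the first segment of a single asset. To see every standard output file a job produced, the same keys can be walked in a loop. This is a small sketch that relies only on the keys used above (`output_metadata`, `asset_id`, `segment_metadata`, `standard_output_path`); the helper name and the sample values are made up for illustration.

```python
def list_standard_output_paths(job_metadata: dict) -> list:
    """Return (asset_id, segment_index, standard_output_path) for every segment."""
    paths = []
    for item in job_metadata.get("output_metadata", []):
        for index, segment in enumerate(item.get("segment_metadata", [])):
            path = segment.get("standard_output_path")
            if path:
                paths.append((item.get("asset_id"), index, path))
    return paths


# Made-up sample with the same keys as the job metadata read above
sample_job_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"standard_output_path": "s3://example-bucket/output/result.json"}
            ],
        }
    ]
}
print(list_standard_output_paths(sample_job_metadata))
```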

In [None]:
JSON(standard_output)

### metadata section
The metadata section in the response provides an overview of the metadata associated with the document. This includes the S3 bucket and key for the input document. The metadata also contains the modality that was selected for your response, which is `DOCUMENT` for our sample input. Metadata also includes the number of pages processed as well as the start and end page index.

In [None]:
JSON(standard_output['metadata'], root='metadata', expanded=True)

### _document_ section
The BDA standard output will always contain the document section.     
The document section of the standard output provides document level granularity information. Document level granularity would include an analysis of information from the document providing key pieces of info.

The detail present within the document section depends on the level of granularity and other options enabled in the project standard output configuration.
- When `DOCUMENT` granularity is enabled, the document section contains the representation in the various text output formats (e.g. MARKDOWN, HTML etc.) configured in the project standard output configuration.
- With `ELEMENT` granularity, the document section includes statistics on the type and count of the various elements found in the document.
- When you enable `Generative Fields`, the document section includes a description of about 10 words and a summary of about 250 words.
- When `WORD` or `LINE` granularity is enabled, the document section includes a count of the words and lines in the document, respectively.

Below is an example of the document section using our standard output configuration that we earlier defined in our data automation project.

In [None]:
JSON(standard_output['document'], root='document', expanded=False)
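
To get a compact view of the options described above, the document section can be condensed into a few fields. The sketch below assumes the section uses the keys `description`, `summary`, `statistics` and `representation`; these names are assumptions based on the description above and should be checked against the actual output. The helper name and the sample values are made up.

```python
def summarize_document_section(document: dict) -> dict:
    """Condense a document section into a few quick-look fields."""
    summary_text = document.get("summary") or ""
    return {
        "description": document.get("description"),
        "summary_word_count": len(summary_text.split()),
        "statistics": document.get("statistics", {}),
        "formats": sorted(document.get("representation", {}).keys()),
    }


# Made-up sample; key names are assumptions, not confirmed output
sample_document = {
    "description": "Monthly statement of government receipts and outlays",
    "summary": "The statement reports receipts, outlays and the resulting deficit.",
    "statistics": {"word_count": 1200, "line_count": 300},
    "representation": {"markdown": "# Title", "html": "<h1>Title</h1>"},
}
print(summarize_document_section(sample_document))
```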

### _pages_ section
With Page level granularity (enabled by default), the text in each page is consolidated and listed in the pages section, with one item for each page. Each page entity in the standard output includes the page index. The individual page entities also include statistics on the actual content of the page, such as how many semantic elements, figures, words and lines it contains. The asset metadata represents the page bounds using the coordinates of the four corners.

Let's look at the content of a selected page in the result from our sample input.

In [None]:
num_pages = standard_output['metadata']['number_of_pages']
page = standard_output['pages'][2]
#JSON(page, root='pages[2]', expanded=True)

display_html(page['representation']['html'])

### _elements_ section
The elements section contains the various semantic elements extracted from the documents including - 
- TEXT
- TABLE
- FIGURE

Let's explore a selective set of content in each of these element types extracted from the sample input document.

#### a sample _TEXT_ element 
This is the entity used for text within a document. The entity contains the `representation` field, which holds the text in each of the configured output formats. The `sub_type` field provides more detailed information on what kind of text was detected. The `TEXT` element could include the following sub types - 

- TITLE
- HEADER
- PAGE_NUMBER
- SECTION_HEADER
- FOOTER
- PARAGRAPH

In [None]:
# Leave the JSON object as the last expression so the notebook renders it
JSON(standard_output['elements'][5], root='elements[5]', expanded=True)
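
The element indices used in these cells depend on the document being processed. A quick count of elements by `type` and `sub_type` can help locate elements of interest. The sketch below uses only those two field names; the helper name and the sample list are made up for illustration.

```python
from collections import Counter


def count_elements(elements: list) -> Counter:
    """Count elements by (type, sub_type); sub_type is None when absent."""
    return Counter((e.get("type"), e.get("sub_type")) for e in elements)


# Made-up sample in place of standard_output['elements']
sample_elements = [
    {"type": "TEXT", "sub_type": "TITLE"},
    {"type": "TEXT", "sub_type": "PARAGRAPH"},
    {"type": "TEXT", "sub_type": "PARAGRAPH"},
    {"type": "TABLE"},
    {"type": "FIGURE", "sub_type": "CHART"},
]

for (element_type, sub_type), count in count_elements(sample_elements).most_common():
    print(element_type, sub_type, count)
```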

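#### a sample _FIGURE_ element
This is the entity used for figures within a document. The `FIGURE` element could include the following sub types - 
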
- IMAGE
- CHART
- LOGO

**example output for an _IMAGE_ subtype**

In [None]:
JSON(standard_output['elements'][38],root='elements[38]', expanded=True)

**example output for a _CHART_ subtype**

In [None]:
JSON(standard_output['elements'][6],root='elements[6]', expanded=True)

#### a sample _TABLE_ element

In [None]:
table_element = standard_output['elements'][34]
JSON(table_element, root='table_element', expanded=False)

##### Representation of the table using HTML

In [None]:
display_html(table_element['representation']['html'] + '\n'.join(table_element['footers']))

#### Extracted Table contents in S3 assets

In [None]:
import pandas as pd
import io
extracted_csv = read_s3_object(s3_uri=table_element['csv_s3_uri'])
df = pd.read_csv(io.StringIO(extracted_csv)).fillna('')
df = df.iloc[:, 1:]
display(df.style.hide(axis='index'))

---

## Explore Standard output for a single page

Let's start by exploring the page at index 2, which is mostly text.

In [None]:
page = standard_output['pages'][2]

display(Markdown(page['representation']['markdown']))

### Consolidating content for a single page


In [None]:
elements = [element for element in standard_output['elements'] if ('page_indices' in element and 2 in element['page_indices'])]	
display(JSON(elements, expanded=True))

In [None]:
# Collect the markdown of every TEXT element whose page_indices list contains page index 2
elements = [element['representation']['markdown'] for element in standard_output['elements'] if (element['type']=='TEXT' and 'page_indices' in element and 2 in element['page_indices'])]	

for element in elements:
    display(Markdown(element))
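
The same filter can be generalized to group content for every page in one pass. The sketch below relies only on the fields used above (`type`, `page_indices` and `representation['markdown']`); the helper name `markdown_by_page` and the sample elements are made up for illustration.

```python
from collections import defaultdict


def markdown_by_page(elements: list, element_type: str = "TEXT") -> dict:
    """Map each page index to the joined markdown of matching elements."""
    pages = defaultdict(list)
    for element in elements:
        if element.get("type") != element_type:
            continue
        markdown = element.get("representation", {}).get("markdown")
        if markdown is None:
            continue
        # An element that spans pages is listed under each of its pages.
        for page_index in element.get("page_indices", []):
            pages[page_index].append(markdown)
    return {page: "\n\n".join(parts) for page, parts in sorted(pages.items())}


# Made-up sample in place of standard_output['elements']
sample_page_elements = [
    {"type": "TEXT", "page_indices": [0], "representation": {"markdown": "# Title"}},
    {"type": "TEXT", "page_indices": [2], "representation": {"markdown": "First paragraph."}},
    {"type": "TABLE", "page_indices": [2], "representation": {"markdown": "| a | b |"}},
    {"type": "TEXT", "page_indices": [2, 3], "representation": {"markdown": "Spanning paragraph."}},
]

for page_index, text in markdown_by_page(sample_page_elements).items():
    print(page_index, repr(text))
```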

## Clean Up
Let's delete the sample files that were downloaded locally and uploaded to S3.

In [None]:
## Delete S3 File

s3_client.delete_object(Bucket=input_bucket, Key=f'{input_prefix}/{s3_file_name}')

#Delete local file
import os
if os.path.exists(local_file_name):
    os.remove(local_file_name)	

## Conclusion

In this notebook we created an Amazon Bedrock Data Automation (BDA) project with a standard output configuration for the Document modality and invoked BDA with the project ARN. We then explored the document, page and element level standard output for the sample document.

## Next Steps
Projects store configuration information for each data type and can optionally include a custom output configuration alongside the standard output. Custom output is not covered in this notebook.