# Getting document insights with standard output

## Introduction

Amazon Bedrock Data Automation (BDA) lets you configure output based on your processing needs for a specific data type: images, documents, audio or video. BDA can generate standard output or custom output.

You can use standard outputs for all four modalities: images, documents, audio, and videos. BDA always provides a standard output response even if it's alongside a custom output response.

Standard outputs are modality-specific default structured insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. 

In this notebook we will explore the standard output for documents.

## Prerequisites

### Configure IAM permissions

The features being explored in the workshop require multiple IAM Policies for the role being used. If you're running this notebook within SageMaker Studio in your own AWS Account, update the default execution role for the SageMaker user profile to include the IAM policies described in [README.md](../README.md).

### Install required libraries

In [None]:
%pip install "boto3>=1.37.4" itables==2.2.4 PyPDF2==3.0.1 pdf2image==1.17.0 markdown==3.7 --upgrade -qq

In [None]:
%load_ext autoreload
%autoreload 2

### Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
import pprint
from IPython.display import JSON, display, IFrame, Markdown, HTML
import sagemaker
import pandas as pd
from itables import show
import time
import pdf2image
from utils import helper_functions
from utils import display_functions
from pathlib import Path
import os

session = sagemaker.Session()
default_bucket = session.default_bucket()
current_region = boto3.session.Session().region_name

sts = boto3.client('sts')
account_id = sts.get_caller_identity()['Account']

# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

## Prepare sample document
For this lab, we use a `Monthly Treasury Statement for the United States Government` for Fiscal Year 2025 through November 30, 2024. This document is prepared by the Bureau of the Fiscal Service, Department of the Treasury. It provides detailed information on the government's financial activities. We will first extract a subset of pages from this document and then use BDA to extract and analyze the document content.

### Download and store sample document
We use the document URL to download the document and store it in an S3 location.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [None]:
# Download the document
document_url = "https://fiscaldata.treasury.gov/static-data/published-reports/mts/MonthlyTreasuryStatement_202411.pdf"
local_download_path = 'samples'

# Create full path of directories
os.makedirs(local_download_path, exist_ok=True)
local_file_name = f"{local_download_path}/MonthlyTreasuryStatement_202411.pdf"
file_path_local = helper_functions.download_document(document_url, start_page_index=0, end_page_index=10, output_file_path=local_file_name)

# Upload the document to S3
file_name = Path(file_path_local).name
document_s3_uri = f'{bda_s3_input_location}/{file_name}'

target_s3_bucket, target_s3_key = helper_functions.get_bucket_and_key(document_s3_uri)
s3_client.upload_file(local_file_name, target_s3_bucket, target_s3_key)

print(f"Downloaded file to: {file_path_local}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")
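
The `helper_functions.get_bucket_and_key` helper used above is not shown in this notebook. As a rough sketch of what such a split might look like, the standard library's `urlparse` can separate the bucket from the key. The function name `split_s3_uri` and the example URI are assumptions for illustration only, not the workshop's implementation.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The parsed path keeps its leading slash; S3 keys do not start with one.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, used only to demonstrate the split.
bucket, key = split_s3_uri("s3://example-bucket/bda/input/sample.pdf")
print(bucket, key)
```

Rejecting anything that is not an `s3://` URI early keeps a malformed location from surfacing later as a confusing upload error.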

### View sample document

In [None]:
IFrame(local_file_name, width=800, height=800)

## Define standard output configuration

The standard output provides a comprehensive set of options to control the granularity, format, and additional metadata extracted from the input documents. This allows tailoring the output to the specific needs of the use case. 

Below is a summary of the options that you can set when using standard output with documents. For more details see the [bda-output-documents](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html) documentation.

<table width="100%">
    <tr>
        <th width="20%" align="left">Setting Type</th>
        <th width="20%" align="left">Option</th>
        <th width="40%" align="left">Description</th>
        <th width="10%" align="left">Default</th>
    </tr>
    <tr>
        <td><b>Response Granularity</b></td>
        <td>Page</td>
        <td>Each page of the document in the text output</td>
        <td align="left">✅</td>
    </tr>
    <tr>
        <td></td>
        <td>Element</td>
        <td>Text separated into elements such as figures, tables, or paragraphs</td>
        <td align="left">✅</td>
    </tr>
    <tr>
        <td></td>
        <td>Word</td>
        <td>Individual words and their page locations</td>
        <td></td>
    </tr>
    <tr>
        <td></td>
        <td>Line</td>
        <td>Lines of text with their page locations</td>
        <td></td>
    </tr>    
    <tr>
        <td><b>Output Settings</b></td>
        <td>JSON</td>
        <td>Produces JSON output file using configuration info</td>
        <td align="left">✅</td>
    </tr>
    <tr>
        <td></td>
        <td>JSON+files</td>
        <td>JSON output plus additional files (text, markdown, CSV)</td>
        <td></td>
    </tr>
    <tr>
        <td><b>Text Format</b></td>
        <td>Plaintext</td>
        <td>Text-only output without formatting</td>
        <td></td>
    </tr>
    <tr>
        <td></td>
        <td>Text with markdown</td>
        <td>Text with markdown elements</td>
        <td align="left">✅</td>
    </tr>
    <tr>
        <td></td>
        <td>Text with HTML</td>
        <td>Text with integrated HTML elements</td>
        <td></td>
    </tr>
    <tr>
        <td></td>
        <td>CSV</td>
        <td>CSV format for tables only</td>
        <td></td>
    </tr>
    <tr>
        <td><b>Bounding Boxes</b></td>
        <td>Enabled</td>
        <td>Outputs coordinates of four corners for document elements</td>
        <td></td>
    </tr>
    <tr>
        <td><b>Generative Fields</b></td>
        <td>Enabled</td>
        <td>• 10-word summary<br>• 250-word description<br>• Figure captions (when element granularity enabled)</td>
        <td></td>
    </tr>
</table>

We configure the standard output for our sample document to enable all possible settings in order to demonstrate the full range of capabilities.

Note: When creating a project, you must define your configuration settings for the type of file you intend to process (the modality you intend to use). In our case, we have defined the standard output configuration for the `Document` modality.

In [None]:
standard_output_config =  {
  "document": {
    "extraction": {
      "granularity": {"types": ["DOCUMENT","PAGE", "ELEMENT","LINE","WORD"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED"},
    "outputFormat": {
      "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
      "additionalFileFormat": {"state": "ENABLED"}
    }
  }
}
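
If you want to toggle individual options rather than edit the dictionary by hand, a small builder can assemble the same structure and reject misspelled values before they reach the API. This is a minimal sketch: the function name `build_document_config` is hypothetical, the allowed values are copied from the cell above, and `DISABLED` is assumed to be the counterpart of `ENABLED`.

```python
# Allowed values copied from the configuration cell above.
GRANULARITY_TYPES = {"DOCUMENT", "PAGE", "ELEMENT", "LINE", "WORD"}
TEXT_FORMAT_TYPES = {"PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"}


def _state(enabled: bool) -> dict:
    # Assumption: "DISABLED" is the off-state matching "ENABLED".
    return {"state": "ENABLED" if enabled else "DISABLED"}


def build_document_config(granularity, text_formats, bounding_box=False,
                          generative_fields=False, additional_files=False):
    """Assemble a standard output configuration for the document modality."""
    unknown = (set(granularity) - GRANULARITY_TYPES) | (set(text_formats) - TEXT_FORMAT_TYPES)
    if unknown:
        raise ValueError(f"Unsupported option(s): {sorted(unknown)}")
    return {
        "document": {
            "extraction": {
                "granularity": {"types": list(granularity)},
                "boundingBox": _state(bounding_box),
            },
            "generativeField": _state(generative_fields),
            "outputFormat": {
                "textFormat": {"types": list(text_formats)},
                "additionalFileFormat": _state(additional_files),
            },
        }
    }


# Reproduces the "everything enabled" configuration used in this notebook.
full_config = build_document_config(
    ["DOCUMENT", "PAGE", "ELEMENT", "LINE", "WORD"],
    ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"],
    bounding_box=True, generative_fields=True, additional_files=True,
)
print(full_config["document"]["extraction"]["granularity"]["types"])
```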

## Create a data automation project with standard output configuration

To begin processing files with BDA using our standard output configuration, we will first create a Data Automation Project. We use the project to store the standard output configurations used to process our sample document. To get an overview of all the available parameters for project creation, see the [create project documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-data-automation/client/create_data_automation_project.html).

The API creates a new project with a unique ARN. The project stores the output settings for future use. If a project is created with no parameters, the default settings will apply.

To continue to get the standard default output, set the parameter `DataAutomationProjectArn` to `arn:aws:bedrock:<region>:aws:data-automation-project/public-default`.
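
As a small illustration of the ARN format quoted above, the public default project ARN can be derived from the region. The helper name `public_default_project_arn` and the example region are assumptions made for this sketch.

```python
def public_default_project_arn(region: str) -> str:
    """Build the ARN of the public default project for a region.

    Follows the format given in the text above; note that the account
    field is the literal string 'aws', not an account ID.
    """
    if not region:
        raise ValueError("region must be a non-empty string")
    return f"arn:aws:bedrock:{region}:aws:data-automation-project/public-default"


# Example region chosen only for illustration.
print(public_default_project_arn("us-west-2"))
```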

We first check whether the project already exists. If it does, we delete it before calling the `create_data_automation_project` API.

In [None]:
project_name= "my_bda_project"

# delete project if it already exists
projects_existing = [project for project in bda_client.list_data_automation_projects()["projects"] if project["projectName"] == project_name]
if len(projects_existing) > 0:
    print(f"Deleting existing project: {projects_existing[0]}")
    bda_client.delete_data_automation_project(projectArn=projects_existing[0]["projectArn"])
    time.sleep(1)

print(f"\nCreating project: {project_name}...\n")
response = bda_client.create_data_automation_project(
    projectName=project_name,
    projectDescription="project to demonstrate the full range of standard output options",
    projectStage='LIVE',
    standardOutputConfiguration=standard_output_config
)
project_arn = response["projectArn"]
status_response = helper_functions.wait_for_completion(
    client=bda_client,
    get_status_function=bda_client.get_data_automation_project,
    status_kwargs={'projectArn': project_arn},
    completion_states=['COMPLETED'],
    error_states=['FAILED'],
    status_path_in_response='project.status',
    max_iterations=15,
    delay=30
)
JSON(status_response,expanded=False)
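
The existence check above inspects only the first page returned by `list_data_automation_projects`. In accounts with many projects, a paginated lookup may be safer. The sketch below assumes the response carries a `nextToken` whenever more pages exist; `find_projects_by_name` and the stub `fake_list_projects` are hypothetical names, and the stub stands in for the real client call so the example runs without AWS access.

```python
def find_projects_by_name(list_projects, project_name):
    """Collect every project summary whose name matches, across all pages.

    `list_projects` is any callable that accepts an optional `nextToken`
    keyword and returns a dict with "projects" and, optionally, "nextToken".
    """
    matches, next_token = [], None
    while True:
        kwargs = {"nextToken": next_token} if next_token else {}
        page = list_projects(**kwargs)
        matches += [p for p in page.get("projects", [])
                    if p["projectName"] == project_name]
        next_token = page.get("nextToken")
        if not next_token:
            return matches


# Stub with two pages of made-up project summaries (not real ARNs).
_PAGES = {
    None: {"projects": [{"projectName": "other", "projectArn": "arn-1"}],
           "nextToken": "page-2"},
    "page-2": {"projects": [{"projectName": "my_bda_project", "projectArn": "arn-2"}]},
}


def fake_list_projects(nextToken=None):
    return _PAGES[nextToken]


found = find_projects_by_name(fake_list_projects, "my_bda_project")
print([p["projectArn"] for p in found])
```

With the real client, the call would presumably be `find_projects_by_name(bda_client.list_data_automation_projects, project_name)`.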

## Invoke Data Automation Async
Now that you have a project set up, you can start processing documents using the `invoke_data_automation_async` operation. This operation triggers asynchronous processing of a file stored in an S3 bucket, using the data automation project we created earlier. It returns an invocation ARN (`invocationArn`) that can be used to monitor the progress of the processing task.

In [None]:
print(f"Invoking bda - input: {document_s3_uri}")
print(f"Invoking bda - output: {bda_s3_output_location}")

response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': document_s3_uri
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': project_arn,
        'stage': 'LIVE'
    },
    dataAutomationProfileArn=f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1'
)

invocationArn = response['invocationArn']

### Get data automation job status

We use the `get_data_automation_status` API to monitor the progress of the invocation job. This API takes the invocation ARN we retrieved from the response to the `invoke_data_automation_async` operation above and returns the current status of the job along with its details.

If the job is still in progress, it returns the current state (e.g., "Created", "InProgress"). If the job is complete, it returns "Success" along with the S3 location of the results. If there was an error, it returns "ClientError" or "ServiceError" with error details.

In [None]:
status_response = helper_functions.wait_for_completion(
            client=bda_runtime_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # The status response reports errors under camelCase keys.
    raise Exception(
        f"Invocation Job Error, error_type={status_response.get('errorType')}, "
        f"error_message={status_response.get('errorMessage')}"
    )
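
The `helper_functions.wait_for_completion` helper is not shown in this notebook. A minimal polling loop in the same spirit might look like the sketch below. The name `poll_until_done`, its parameters, and the stubbed status sequence are assumptions for illustration; the injectable `sleep` argument exists only so the example runs instantly.

```python
import time


def poll_until_done(get_status, completion_states, error_states,
                    max_iterations=15, delay=30, sleep=time.sleep):
    """Call `get_status` until it reports a terminal state or we give up."""
    for _ in range(max_iterations):
        response = get_status()
        status = response["status"]
        if status in completion_states or status in error_states:
            return response
        sleep(delay)  # wait before asking again
    raise TimeoutError(f"No terminal state after {max_iterations} checks")


# Stub that walks through a plausible sequence of job states.
_states = iter(["Created", "InProgress", "InProgress", "Success"])
result = poll_until_done(
    get_status=lambda: {"status": next(_states)},
    completion_states=["Success"],
    error_states=["ClientError", "ServiceError"],
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(result["status"])
```

Returning the response for error states as well, rather than raising inside the loop, leaves it to the caller to decide how to report the failure, which matches how the cell above inspects `status_response`.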

### Retrieve job metadata
BDA stores the output of the data automation job in the S3 output location that we passed to the `InvokeDataAutomationAsync` API call earlier. We can retrieve and explore the job metadata response that BDA writes to that location.

The `job metadata` contains the job details including job_id, the job_status and the identified modality for the job. It also contains the output information with an S3 location for the results.

In [None]:
job_metadata = json.loads(helper_functions.read_s3_object(job_metadata_s3_location))
pd.set_option('display.max_colwidth', None)


job_status = pd.DataFrame({
    'job_id': [job_metadata['job_id']],
    'job_status': [job_metadata['job_status']],
    'semantic_modality': [job_metadata['semantic_modality']]
}).T
job_metadata_table = pd.DataFrame(job_metadata['output_metadata'][0]['segment_metadata']).fillna('').T
job_metadata_table.index.name='Segment Index'
job_metadata_json = JSON(job_metadata, root="job_metadata", expanded=True)
# Display the widget
display_functions.display_multiple(
    [display_functions.get_view(job_status), display_functions.get_view(job_metadata_table), display_functions.get_view(job_metadata_json)], 
    ["Job Status", "Output Info", "Metadata (JSON)"])

## Bedrock Data Automation document response

This section focuses on the different response objects you receive from running the API operation `invoke_data_automation_async` on a document file. The results of the file processing are stored in the S3 bucket provided when calling the `invoke_data_automation_async` API. We can find the S3 uri of the results file in the `job_metadata.json` file. 

The structure of the output depends on the file modality. Our sample asset is of the `Document` modality, and based on our standard output configuration we should see the following main sections in the results:

* **metadata**
* **document**
* **pages**
* **elements**
* **text_lines**
* **text_words**

Let's download the results and start exploring the response objects.

### Downloading the standard output results from S3
Here we download the Standard output results in JSON format using the `standard_output_path` that we found in the `job_metadata.json` associated with our `invoke_data_automation_async` job.

In [None]:
asset_id = 0
standard_output_path = next(item["segment_metadata"][0]["standard_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
standard_output = json.loads(helper_functions.read_s3_object(standard_output_path))
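
The lookup above reads only the first segment of the matching asset. As a sketch of how the same structure could be traversed for every segment, the function below walks `output_metadata` and `segment_metadata` using the keys seen in this notebook. The function name `standard_output_paths` and the sample dictionary, including its job ID, status value, and S3 paths, are made up for illustration.

```python
def standard_output_paths(job_metadata: dict, asset_id: int) -> list:
    """Return the standard output path of every segment of one asset."""
    paths = []
    for item in job_metadata["output_metadata"]:
        if item["asset_id"] != asset_id:
            continue
        for segment in item["segment_metadata"]:
            # Skip segments that carry no standard output location.
            if "standard_output_path" in segment:
                paths.append(segment["standard_output_path"])
    return paths


# Hypothetical job metadata shaped like the fields used in this notebook.
sample_job_metadata = {
    "job_id": "example-job-id",
    "job_status": "EXAMPLE_STATUS",
    "semantic_modality": "DOCUMENT",
    "output_metadata": [
        {"asset_id": 0, "segment_metadata": [
            {"standard_output_path": "s3://example-bucket/placeholder/0/result.json"},
            {"standard_output_path": "s3://example-bucket/placeholder/1/result.json"},
        ]},
    ],
}

print(standard_output_paths(sample_job_metadata, asset_id=0))
```

An unknown `asset_id` yields an empty list here, whereas the `next(...)` expression above would raise `StopIteration`.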

### Explore standard output response objects

In the following sub-sections, we break down each section of the response object. For each one, we look at the raw response content as well as a formatted, user-friendly view, to understand the information it contains.

#### metadata
The metadata section in the response provides an overview of the metadata associated with the document. This includes the S3 bucket and key of the input document. The metadata also contains the modality that was selected for your response, the number of pages processed, and the start and end page index.

In [None]:
metadata = standard_output['metadata']
metadata_table = pd.DataFrame([metadata]).fillna('').T
metadata_json = JSON(metadata,root='metadata',expanded=True)
# Display the widget
display_functions.display_multiple(
    [display_functions.get_view(metadata_table), display_functions.get_view(metadata_json)], 
    ["Metadata (Table)", "Metadata (JSON)"])



#### document
The document section of the standard output provides document-level information: an analysis of the document as a whole that surfaces key pieces of information.

By default, document-level granularity includes statistics about the actual content of the document, such as the number of semantic elements, figures, words, and lines. Because our project also enables generative fields, this section should additionally contain the generated summary and description listed in the configuration table above.

In [None]:
document_df = pd.json_normalize(standard_output["document"]).T
document_json = JSON(standard_output["document"],root='document',expanded=True)
pd.set_option('display.max_colwidth', 200)
display_functions.display_multiple(
    [display_functions.get_view(document_df), display_functions.get_view(document_json)], 
    ["Document (Table)", "Document (JSON)"])

#### pages
With page-level granularity (enabled by default), the text of each page is consolidated and listed in the pages section, with one item per page. Each page entity in the standard output includes the page index and statistics about its content, such as the number of semantic elements, figures, words, and lines. The asset metadata represents the page bounds using the coordinates of the four corners.

Below, we look at a snippet of the output pertaining to a specific page.

In [None]:
pages = standard_output["pages"]
pages_json = JSON(pages,root='pages',expanded=False)

views=[display_functions.get_view(pages_json)]
titles=["Pages (JSON)"]

modal_id, frame_id = display_functions.display_modal()
for page_index, page in enumerate(pages):
    if page:
        views += [display_functions.get_page_view(page, modal_id, frame_id)]
        titles += [f'Page {page_index+1}']
display_functions.display_multiple(views, titles)

#### elements
The elements section contains the semantic elements extracted from the document, including text content, tables, and figures. Text and figure entities are further sub-classified, for example as TITLE or SECTION_TITLE for text, or as Chart for figures.

##### TEXT elements

Below we extract the TEXT elements found on a page and display them along with the rectified image of the page output by BDA. When you hover over the text in the right-side pane, you can see the TEXT element subtype as a tooltip. When you click on a TEXT element in the right-side pane, we use the bounding boxes provided by BDA to highlight it on the page image.

In [None]:
s3_uri = standard_output['pages'][2]['asset_metadata']['rectified_image']
elements = standard_output["elements"]
# Keep only the TEXT elements that appear on page index 2
page2_text = [item for item in elements 
                 if item['type'] == 'TEXT' and 2 in item['page_indices']]

page2_text.sort(key=lambda x: x['reading_order'])
display_functions.create_page_viewer(s3_uri, page2_text)
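
The filter-and-sort pattern above generalizes to any element type and page. The sketch below wraps it in a function and adds a per-type count. It relies only on the `type`, `page_indices`, and `reading_order` keys used in this notebook; the function name `elements_on_page` and the sample elements, including the `PARAGRAPH` subtype, are hypothetical and not real BDA output.

```python
from collections import Counter


def elements_on_page(elements, element_type, page_index):
    """Return elements of one type on one page, sorted by reading order."""
    selected = [e for e in elements
                if e["type"] == element_type and page_index in e["page_indices"]]
    return sorted(selected, key=lambda e: e["reading_order"])


# Made-up elements that mimic the keys used above.
sample_elements = [
    {"type": "TEXT", "sub_type": "PARAGRAPH", "page_indices": [2], "reading_order": 5},
    {"type": "TEXT", "sub_type": "TITLE", "page_indices": [2], "reading_order": 1},
    {"type": "TABLE", "page_indices": [2], "reading_order": 3},
    {"type": "TEXT", "sub_type": "PARAGRAPH", "page_indices": [0], "reading_order": 0},
]

page2_sample = elements_on_page(sample_elements, "TEXT", 2)
print([e["sub_type"] for e in page2_sample])

# Quick overview of how many elements of each type were extracted.
type_counts = Counter(e["type"] for e in sample_elements)
print(dict(type_counts))
```

Unlike the in-place `sort` in the cell above, `sorted` returns a new list, so the original element order stays available for other views.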

##### FIGURE elements
The FIGURE entity is used for figures in the document, such as graphs and charts. These figures are cropped, and the resulting images are written to the output S3 location we provided when calling the `invoke_data_automation_async` operation. Additionally, you receive a `sub_type` that indicates what kind of figure it is and a figure title containing the title text.

In [None]:
elements = standard_output["elements"]
# Keep only the FIGURE elements
figures = [item for item in elements 
                 if item['type'] == 'FIGURE']

figures.sort(key=lambda x: x['reading_order'])
figures_json = JSON(figures,root='figures',expanded=False)
figure_views=[display_functions.get_view(figures_json)]
figure_titles=["Figures (JSON)"]
modal_id, frame_id = display_functions.display_modal()

for figure_index, figure in enumerate(figures):
    if figure:
        figure_views += [display_functions.get_figure_view(figure, modal_id, frame_id)]
        figure_titles += [f'Figure {figure_index+1}']
display_functions.display_multiple(figure_views, figure_titles)

##### TABLE elements

In [None]:
# Keep only the TABLE elements
df_elements = pd.json_normalize(standard_output["elements"])
# Copy the slice so that inserting a column below does not act on a view
df_table = df_elements[df_elements["type"] == "TABLE"].copy()

embedded_images=df_table.apply( lambda row: helper_functions.create_image_html_column(row, "crop_images","500px"), axis=1)
df_table.insert(6, 'image', embedded_images)
cols = ["type","locations","image", 
        #'representation.text', 'representation.markdown', 
        'representation.html','title', 'summary', 'footers', 'headers', 'csv_s3_uri',
       'representation.csv']
# Display formatted dataframe
show(
    df_table[cols],
    columnDefs=[                
        {"width": "120px", "targets": [0,1]},   
        {"width": "340px", "targets": [2]},  
        {"width": "380px", "targets": [3]},
        {"width": "150px", "targets": [5,6,7,8]},        
        {"className": "dt-left", "targets": "_all"}
    ],
    # style="width:1200px",
    # autoWidth=True,
    classes="compact",
    showIndex=False,
    scrollY="400"    
)

In [None]:
JSON([el for el in standard_output["elements"] if el["type"] == "TABLE"][2], root="sample_table")
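
Because the CSV text format is enabled in our configuration, each table exposes a CSV representation (the `representation.csv` column above). As a sketch of how that string could be turned into rows for downstream use, the standard `csv` module handles quoted cells that contain commas. The function name `table_rows_from_csv` and the sample table, including its values, are invented for illustration.

```python
import csv
import io


def table_rows_from_csv(csv_text: str) -> list:
    """Parse a table's CSV representation into a list of rows."""
    return list(csv.reader(io.StringIO(csv_text)))


# Invented table element shaped like the `representation.csv` field above.
sample_table = {
    "type": "TABLE",
    "representation": {
        "csv": 'Category,Amount\n"Receipts, total",100\nOutlays,150\n'
    },
}

rows = table_rows_from_csv(sample_table["representation"]["csv"])
header, body = rows[0], rows[1:]
print(header)
print(body)
```

Splitting each line on commas would break the quoted `"Receipts, total"` cell into two, which is why the sketch uses a CSV parser instead.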

#### text_lines elements

In [None]:
JSON(standard_output["text_lines"][:10], root="text_lines")

In [None]:
df = pd.json_normalize(standard_output["text_lines"])
show(df, classes="compact")

#### text_words elements

In [None]:
JSON(standard_output["text_words"][3:4], root="text_words[3:4]", expanded=True)

## Conclusion

We explored the standard output of BDA for documents, which is configurable and gives us detailed insights into a document and its structure, such as headers, sections, paragraphs, tables, figures, and charts.

BDA not only detects these elements but also interprets them, e.g. by describing a figure or by extracting the values depicted in a chart into a structured table.

## Clean Up
Let's delete the uploaded sample file from the S3 input location, the local copy, and the generated job output files.

In [None]:
import os
from pathlib import Path
import shutil

# Delete S3 File
s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)

# Delete local file
if os.path.exists(file_path_local):
    os.remove(file_path_local)	

# Delete bda job output
# Drop the file name to get the job's output prefix (keeps the "s3://" scheme intact)
bda_s3_job_location = job_metadata_s3_location.rsplit("/", 1)[0]
!aws s3 rm {bda_s3_job_location} --recursive

## Summary and next steps

In this lab we saw how BDA's standard output provides a default set of commonly required information for documents, such as document summaries, text extraction, and metadata. 

In a subsequent lab, you will explore custom output and blueprints, and see how to ensure that the generated output adheres to a specific format or schema tailored to your downstream systems, such as a structured database.