# Processing documents using standard output

## Introduction

Amazon Bedrock Data Automation (BDA) lets you configure output based on your processing needs for a specific data type: documents, images, video or audio. BDA can generate standard output or custom output.

You can use standard outputs for all four modalities: documents, images, videos, and audio. BDA always provides a standard output response even if it's alongside a custom output response.

Standard outputs are modality-specific default insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. 

In this notebook, we explore the standard output for documents.

## Prerequisites

In [48]:
pip install "boto3>=1.35.76" PyPDF2 --upgrade -q

Note: you may need to restart the kernel to use updated packages.


## Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [49]:
import boto3
import json
import pprint
from IPython.display import JSON
import IPython.display as display
import sagemaker

session = sagemaker.Session()
default_bucket = session.default_bucket()

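region_name = 'us-west-2'
# Initialize the Bedrock Data Automation and S3 clients in the configured region
bda_client = boto3.client('bedrock-data-automation', region_name=region_name)
bda_runtime_client = boto3.client('bedrock-data-automation-runtime', region_name=region_name)
s3_client = boto3.client('s3', region_name=region_name)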

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

## Prepare Sample Document
For this lab, we use a `Monthly Treasury Statement for the United States Government` for Fiscal Year 2025 through November 30, 2024. The document is prepared by the Bureau of the Fiscal Service, Department of the Treasury and provides detailed information on the government's financial activities. We will extract a subset of pages from the `PDF` document and use BDA to extract and analyse the document content.

### Download and store sample document
We use the document URL to download the document and store it in an S3 location.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.

In [50]:
%load_ext autoreload
%autoreload 2

from utils.helper_functions import wait_for_job_to_complete, read_s3_object, create_sample_file, get_bucket_and_key

# Download the sample PDF file, extract a subset of pages, and upload the result to S3
input_bucket, input_prefix = get_bucket_and_key(bda_s3_input_location)
sample_data_url = "https://fiscaldata.treasury.gov/static-data/published-reports/mts/MonthlyTreasuryStatement_202411.pdf"
local_file_name = 'examples/MonthlyTreasuryStatement_202411.pdf'
create_sample_file(sample_data_url, 0, 9, local_file_name)
s3_file_name = 'MonthlyTreasuryStatement.pdf'
s3_response = s3_client.upload_file(local_file_name, input_bucket,
                                    f'{input_prefix}/{s3_file_name}')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
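
The helper `get_bucket_and_key` comes from `utils.helper_functions` and its implementation is not shown in this notebook. As a minimal sketch of what such a helper might do, the hypothetical function `split_s3_uri` below (an illustrative name, not the workshop's code) splits an S3 URI into a bucket and a key prefix using only the standard library.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The parsed path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# 'example-bucket' is a placeholder, not the bucket used in this notebook.
bucket, prefix = split_s3_uri("s3://example-bucket/bda/input")
print(bucket, prefix)
```

Rejecting anything that is not an `s3://` URI up front makes a misconfigured location fail early instead of surfacing later as an upload error.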


In [59]:
project_name= "my_bda_project"

# delete project if it already exists
projects_existing = [project for project in bda_client.list_data_automation_projects()["projects"] if project["projectName"] == project_name]
if len(projects_existing) >0:
    print(f"Deleting existing project: {projects_existing[0]}")
    bda_client.delete_data_automation_project(projectArn=projects_existing[0]["projectArn"])

# delete blueprint if it already exists
# blueprints_existing = [blueprint for blueprint in bda_client.list_blueprints()["blueprints"] if blueprint["blueprintName"] == blueprint_name]
# if len(blueprints_existing) >0:
#     print(f"Deleting existing blueprint: {blueprints_existing[0]}")
#     bda_client.delete_blueprint(blueprintArn=blueprints_existing[0]["blueprintArn"])

Deleting existing project: {'projectArn': 'arn:aws:bedrock:us-west-2:762233765926:data-automation-project/d20e2743b851', 'projectStage': 'LIVE', 'projectName': 'my_bda_project', 'creationTime': datetime.datetime(2025, 1, 17, 10, 9, 23, 761000, tzinfo=tzlocal())}
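
The lookup above reads only the first page of results from `list_data_automation_projects`. In an account with many projects, a sketch like the following could follow pagination before matching on the name. It assumes the response carries a `nextToken` field that can be passed back as a `nextToken` argument, which should be verified against the boto3 documentation; `iter_projects` and `find_project_arn` are hypothetical helper names.

```python
from typing import Any, Dict, Iterator, Optional


def iter_projects(client: Any) -> Iterator[Dict[str, Any]]:
    """Yield every project summary, following nextToken until it is absent."""
    kwargs: Dict[str, Any] = {}
    while True:
        page = client.list_data_automation_projects(**kwargs)
        yield from page.get("projects", [])
        token = page.get("nextToken")  # assumed pagination field
        if not token:
            break
        kwargs["nextToken"] = token


def find_project_arn(client: Any, name: str) -> Optional[str]:
    """Return the ARN of the first project with the given name, or None."""
    for project in iter_projects(client):
        if project["projectName"] == name:
            return project["projectArn"]
    return None
```

With the clients from the Setup cell, `find_project_arn(bda_client, project_name)` would return the ARN to delete, or `None` when no project with that name exists.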


In [60]:
standard_output_config = {
  "document": {
    "extraction": {
      "granularity": {"types": ["DOCUMENT", "PAGE", "ELEMENT", "WORD", "LINE"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED"},
    "outputFormat": {
      "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
      "additionalFileFormat": {"state": "ENABLED"}
    }
  },
  "image": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION", "TEXT_DETECTION"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED", 
      "types": ["IMAGE_SUMMARY", "IAB"]
    }
  },
  "video": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION", "TEXT_DETECTION", "TRANSCRIPT"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["VIDEO_SUMMARY", "SCENE_SUMMARY", "IAB"]
    }
  },
  "audio": {
    "extraction": {
      "category": {
        "state": "ENABLED", 
        "types": ["AUDIO_CONTENT_MODERATION", "CHAPTER_CONTENT_MODERATION", "TRANSCRIPT"]
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["AUDIO_SUMMARY", "CHAPTER_SUMMARY", "IAB"]
    }
  }
}

JSON(standard_output_config)

<IPython.core.display.JSON object>

In [61]:
response = bda_client.create_data_automation_project(
    projectName=project_name,
    projectDescription="project to get extended standard output",
    projectStage='LIVE',
    standardOutputConfiguration=standard_output_config,
    # customOutputConfiguration={
    #     'blueprints': [
    #         {'blueprintArn': 'string', 'blueprintVersion': 'string', 'blueprintStage': 'DEVELOPMENT'|'LIVE'},
    #     ]
    # },
    overrideConfiguration={'document': {'splitter': {'state': 'DISABLED'}}},
    # clientToken='33characters',    
)
project_arn = response["projectArn"]
JSON(response)

<IPython.core.display.JSON object>

In [62]:
project_arn

'arn:aws:bedrock:us-west-2:762233765926:data-automation-project/1b6e89c8f0fc'

### View Sample Document

In [57]:
display.IFrame("examples/MonthlyTreasuryStatement_202411.pdf", width=900, height=800)

## Standard output in Bedrock Data Automation

We start with the document data type. Here we use the sample Monthly Treasury Statement, which is a `pdf` file. Below is a summary of the response options that you can set when using standard output with documents.

For more details see [Documents](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html).

- **Response granularity**
This setting tells BDA what kind of response you want to receive from document text extraction. Each level of granularity gives you progressively more separated responses, with page providing all of the extracted text together, and word providing each word as a separate response. The available granularity levels are:

    - Page
    
    - Element
    
    - Word

    More details [Response Granularity](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html#document-granularity)

- **Output settings**
Output settings determine the structure of the results produced by BDA. The options for output settings are:

    - **JSON** - The result would be a JSON output file with the information from your configuration settings. This is the **default** for document analysis.
    
    - **JSON+files** - The result would include a JSON output along with files that correspond with different outputs. For example, this setting gives you a text file for the overall text extraction, a markdown file for the text with structural markdown, and CSV files for each table that's found in the text.

- **Text Format**
Text format determines the different kinds of texts that will be provided via various extraction operations. You can select any number of the following options for your text format.

    - **Plaintext** – This setting provides a text-only output with no formatting or other markdown elements noted.
    
    - **Text with markdown** – The **default** output setting for standard output. Provides text with markdown elements integrated.
    
    - **Text with HTML** – Provides text with HTML elements integrated in the response.
    
    - **CSV** – Provides a CSV structured output for tables within the document. This will only give a response for tables, and not other elements of the document

- **Bounding Boxes**

    - With the Bounding Boxes option enabled, BDA outputs `Bounding Boxes` for elements in the document in the form of the coordinates of the four corners of the box. This helps in creating a visual outline of the element in the document.


- **Generative Fields**
    
    - When `Generative Fields` are enabled, BDA generates a 10-word summary and a 250-word description of the document in the output. Additionally, with response granularity enabled at the element level, BDA also generates a descriptive caption for each figure detected in the document. Figures include things like charts, graphs, and images.


  Both these options are **disabled by default**.
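
The options above correspond to keys in the `document` part of the standard output configuration used earlier in this notebook. As a rough sketch, a hypothetical helper `build_document_config` (not part of BDA or boto3) could assemble that section. Its defaults mirror the defaults described above, and mapping **JSON+files** to `additionalFileFormat` is an assumption to verify against the BDA documentation.

```python
from typing import Any, Dict, Sequence


def build_document_config(
    granularity: Sequence[str] = ("PAGE", "ELEMENT"),
    text_formats: Sequence[str] = ("MARKDOWN",),
    bounding_boxes: bool = False,
    generative_fields: bool = False,
    additional_files: bool = False,
) -> Dict[str, Any]:
    """Build the 'document' section of a standard output configuration."""

    def state(enabled: bool) -> Dict[str, str]:
        return {"state": "ENABLED" if enabled else "DISABLED"}

    return {
        "document": {
            "extraction": {
                "granularity": {"types": list(granularity)},
                "boundingBox": state(bounding_boxes),
            },
            "generativeField": state(generative_fields),
            "outputFormat": {
                "textFormat": {"types": list(text_formats)},
                # Assumed to correspond to the JSON+files output setting.
                "additionalFileFormat": state(additional_files),
            },
        }
    }


# Enable everything for documents, similar to the configuration used earlier.
config = build_document_config(
    granularity=["DOCUMENT", "PAGE", "ELEMENT", "WORD", "LINE"],
    text_formats=["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"],
    bounding_boxes=True,
    generative_fields=True,
    additional_files=True,
)
```

Keeping the defaults in one function makes it easier to see which options a given project overrides.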


Now that we have looked at the available options, we can proceed to invoke document processing with the standard output configuration of the project created above.


In [63]:
print(project_arn)
#project_arn = "arn:aws:bedrock:us-west-2:762233765926:data-automation-project/f796a09e907a"

arn:aws:bedrock:us-west-2:762233765926:data-automation-project/1b6e89c8f0fc


### Invoke Data Automation Async

In [64]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f'{bda_s3_input_location}/{s3_file_name}'
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationArn': project_arn,
        'stage': 'LIVE'
    }
)

invocationArn = response['invocationArn']

### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, along with the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [65]:
status_response = wait_for_job_to_complete(invocationArn=invocationArn)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # GetDataAutomationStatus reports failures in the errorType and errorMessage fields
    raise Exception(
        f'Invocation Job Error, error_type={status_response.get("errorType")}, '
        f'error_message={status_response.get("errorMessage")}'
    )

Waiting for Job to Complete. Current status is InProgress
Waiting for Job to Complete. Current status is InProgress
Waiting for Job to Complete. Current status is InProgress
Waiting for Job to Complete. Current status is InProgress
Waiting for Job to Complete. Current status is InProgress
Waiting for Job to Complete. Current status is InProgress
Waiting for Job to Complete. Current status is InProgress
Invocation Job with id da1e39db-8417-4ece-92f2-cc66594c956d completed. Status is Success
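
The helper `wait_for_job_to_complete` is imported from `utils.helper_functions` and is not shown here. A minimal sketch of such a polling loop, under the status transitions described above, might look like the following. The function name `poll_until_done`, the interval, and the attempt limit are illustrative assumptions, and the status call is injected as a callable so the loop can be exercised without AWS access.

```python
import time
from typing import Any, Callable, Dict

# Final states named in the text above.
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status repeatedly until the job reaches a terminal status."""
    for _ in range(max_attempts):
        response = get_status()
        status = response["status"]
        if status in TERMINAL_STATUSES:
            return response
        print(f"Waiting for job to complete. Current status is {status}")
        sleep(interval_seconds)
    raise TimeoutError(f"Job did not finish after {max_attempts} status checks")
```

In this notebook it could be called as `poll_until_done(lambda: bda_runtime_client.get_data_automation_status(invocationArn=invocationArn))`. Bounding the number of attempts keeps a stuck job from blocking the notebook indefinitely.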


### Retrieve Job Metadata

In [66]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata,root='job_metadata',expanded=True)

<IPython.core.display.JSON object>

### Explore the default Standard output
We can now explore the standard output received from processing documents using Data Automation. 

The default output format is JSON, so the standard output for our sample document is in JSON format.

The standard output for the `Document` modality always includes a `metadata` section and a `document` section in the JSON result. Because page-level and element-level response granularity are enabled by default, the results also include `pages` and `elements` sections.

We go through the different components in the response in the following sections. 

In [67]:
asset_id=0
standard_output_path = next(item["segment_metadata"][0]["standard_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
standard_output = json.loads(read_s3_object(standard_output_path))
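
The cell above picks the first segment of a single asset. As a sketch of the same lookup generalized to all assets, the hypothetical helper below collects every `standard_output_path` from the job metadata, using only the keys that appear in the cell above. The sample dictionary is made up for illustration and is not real job output.

```python
from typing import Any, Dict, List


def standard_output_paths(job_metadata: Dict[str, Any]) -> Dict[int, List[str]]:
    """Map each asset_id to the standard output paths of its segments."""
    paths: Dict[int, List[str]] = {}
    for item in job_metadata.get("output_metadata", []):
        paths[item["asset_id"]] = [
            segment["standard_output_path"]
            for segment in item.get("segment_metadata", [])
            if "standard_output_path" in segment
        ]
    return paths


# Made-up metadata with the same shape as the keys used above.
sample_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"standard_output_path": "s3://example-bucket/out/0/standard_output/0/result.json"}
            ],
        }
    ]
}
print(standard_output_paths(sample_metadata))
```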


In [68]:
JSON(standard_output)

<IPython.core.display.JSON object>

In [33]:
from pathlib import Path
result_dir = Path(standard_output_path).parent
!mkdir -p result_dir
!aws s3 cp --recursive {str(result_dir).replace("s3:/","s3://")} "./result_dir"

download: s3://sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/0/standard_output/0/assets/27bc8eb1-9b12-4f9f-8372-a07eb668859b.csv to result_dir/assets/27bc8eb1-9b12-4f9f-8372-a07eb668859b.csv
download: s3://sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/0/standard_output/0/assets/3a94d79c-5d4e-4461-8dd5-375d4ca900fa.png to result_dir/assets/3a94d79c-5d4e-4461-8dd5-375d4ca900fa.png
download: s3://sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/0/standard_output/0/assets/38cc7567-eea5-4212-be1e-dbf48c2d354a.png to result_dir/assets/38cc7567-eea5-4212-be1e-dbf48c2d354a.png
download: s3://sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/0/standard_output/0/assets/4c575f3b-a39b-4ed2-bf0e-44a0d7e2b29d.png to result_dir/assets/4c575f3b-a39b-4ed2-bf0e-44a0d7e2b29d.png
download: s3://sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/

In [31]:
# copy s3 directory contained in variable result_dir from s3 to local folder using cp
result_dir

PosixPath('s3:/sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/0/standard_output/0')
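
As the output above shows, `pathlib` treats the URI as a file path and collapses `s3://` into `s3:/`, which is why the copy command needs a string replacement. A small sketch that computes the parent prefix with string handling instead is shown below; `s3_parent` is a hypothetical helper and the object path is a placeholder.

```python
import posixpath
from pathlib import PurePosixPath


def s3_parent(s3_uri: str) -> str:
    """Return the parent prefix of an S3 URI, keeping the 's3://' scheme intact."""
    scheme, separator, rest = s3_uri.partition("://")
    if scheme != "s3" or not separator or not rest:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # Treat 'bucket/key...' as a POSIX-style path and drop the last component.
    return "s3://" + posixpath.dirname(rest.rstrip("/"))


# Placeholder object path, for illustration only.
uri = "s3://example-bucket/bda/output/job-id/0/standard_output/0/result.json"
print(PurePosixPath(uri).parent)  # pathlib collapses the double slash after 's3:'
print(s3_parent(uri))             # the scheme is preserved
```

With this helper, the copy could be written as `!aws s3 cp --recursive {s3_parent(standard_output_path)} ./result_dir` without the `replace` workaround.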

In [30]:
!aws s3 cp --recursive s3://sagemaker-us-west-2-762233765926/bda/output/f4db56ee-741b-4e75-b3cb-1011922cf40d/0/standard_output/0 ./result_dir


### metadata
The metadata section in the response provides an overview of the metadata associated with the document. This includes the S3 bucket and key for the input document. The metadata also contains the modality that was selected for your response, the number of pages processed, and the start and end page index.

In [10]:
JSON(standard_output['metadata'],root='metadata',expanded=True)

<IPython.core.display.JSON object>

### document
The document section of the standard output provides document level granularity information. Document level granularity would include an analysis of information from the document providing key pieces of info.

By default the document level granularity includes statistics that contain information on the actual content of the document, such as how many semantic elements there are, how many figures, words, lines, etc. We will look at further information that would be presented in the document level granularity when we modify the standard output using projects.

In [11]:
JSON(standard_output['document'],root='document', expanded=True)

<IPython.core.display.JSON object>

### pages
With page-level granularity (enabled by default), the text in each page is consolidated and listed in the `pages` section, with one item for each page. Each page entity in the standard output includes the page index. Each page entity also includes statistics on its content, such as the number of semantic elements, figures, words, and lines. The asset metadata represents the page bounds using the coordinates of the four corners.

Below, we look at a snippet of the output pertaining to a specific page.

In [12]:
JSON(standard_output['pages'][8],root='pages[8]',expanded=False)

<IPython.core.display.JSON object>

In [13]:
from IPython.display import Markdown

pages_md = [page["representation"]["markdown"] for page in standard_output['pages']]
# Use the `display` module alias from the Setup cell so it is not shadowed by a function import
display.display(Markdown(pages_md[4]))

Table 1. Summary of Receipts, Outlays, and the Deficit/Surplus of the U.S. Government, Fiscal Years 2024 and 2025, by Month [$ millions]

| Period       | Receipts   | Outlays   | Deficit/Surplus (-)   |
|--------------|------------|-----------|-----------------------|
| FY 2024      |            |           |                       |
| October      | 403,434    | 469,997   | 66,564                |
| November     | 274,830    | 588,842   | 314,012               |
| December     | 429,311    | 558,665   | 129,354               |
| January      | 477,320    | 499,250   | 21,930                |
| February     | 271,126    | 567,401   | 296,275               |
| March        | 332,079    | 568,635   | 236,556               |
| April        | 776,198    | 566,669   | -209,529              |
| May          | 323,647    | 670,778   | 347,131               |
| June         | 466,255    | 537,220   | 70,965                |
| July         | 330,377    | 574,119   | 243,741               |
| August       | 306,540    | 686,620   | 380,080               |
| September    | 526,988    | 462,290   | -64,698               |
| Year-to-Date | 4,918,104  | 6,750,485 | 1,832,381             |
| FY 2025      |            |           |                       |
| October      | 326,770    | 584,221   | 257,450               |
| November     | 301,754    | 668,517   | 366,763               |
| Year-to-Date | 628,525    | 1,252,738 | 624,213               |

Note: Details may not add to totals due to rounding.

Table 2. Summary of Budget and Off-Budget Results and Financing of the U.S. Government, November 2024 and Other Periods [$ millions]

| Classification                            | This Month   | Current Fiscal Year to Date   | Budget Estimates Full Fiscal Year 1   | Comparable Prior Period Year to Date (2024)   | Budget Estimates Next Fiscal Year (2026) 1   |
|-------------------------------------------|--------------|-------------------------------|---------------------------------------|-----------------------------------------------|----------------------------------------------|
| Total On-Budget and Off-Budget Results:   |              |                               |                                       |                                               |                                              |
| Total Receipts                            | 301,754      | 628,525                       | 5,561,646                             | 678,264                                       | 6,011,381                                    |
| On-Budget Receipts                        | 207,036      | 446,196                       | 4,255,251                             | 508,841                                       | 4,644,964                                    |
| Off-Budget Receipts                       | 94,718       | 182,329                       | 1,306,395                             | 169,423                                       | 1,366,417                                    |
| Total Outlays                             | 668,517      | 1,252,738                     | 7,439,295                             | 1,058,839                                     | 7,612,734                                    |
| On-Budget Outlays                         | 546,637      | 1,021,999                     | 6,035,465                             | 842,116                                       | 6,124,968                                    |
| Off-Budget Outlays                        | 121,880      | 230,739                       | 1,403,830                             | 216,723                                       | 1,487,766                                    |
| Total Surplus (+) or Deficit (-)          | -366,763     | -624,213                      | -1,877,649                            | -380,576                                      | -1,601,353                                   |
| On-Budget Surplus (+) or Deficit (-)      | -339,601     | -575,803                      | -1,780,214                            | -333,276                                      | -1,480,004                                   |
| Off-Budget Surplus (+) or Deficit (-)     | -27,162      | -48,410                       | -97,435                               | -47,300                                       | -121,349                                     |
| Total On-Budget and Off-Budget Financing  | 366,763      | 624,213                       | 1,877,649                             | 380,576                                       | 1,601,353                                    |
| Means of Financing:                       |              |                               |                                       |                                               |                                              |
| Borrowing from the Public                 | 221,295      | 483,466                       | 1,901,128                             | 487,108                                       | 1,695,149                                    |
| Reduction of Operating Cash, Increase (-) | 164,163      | 128,847                       | ......                                | -101,962                                      | ......                                       |
| By Other Means                            | -18,695      | 11,900                        | -23,479                               | -4,570                                        | -93,796                                      |

1 These estimates are based on the FY 2025 Mid-Session Review, released by the Office of Management and Budget on July 19, 2024.

Note: Details may not add to totals due to rounding. No Transactions

5
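
Rather than relying on list positions, the markdown for each page can be keyed by its page index. The sketch below assumes each page entity exposes a `page_index` field, which is an assumption about the output schema, and falls back to the list position when that field is missing. The `representation` and `markdown` keys are the ones used in the cell above, and the sample data is made up.

```python
from typing import Any, Dict


def markdown_by_page(standard_output: Dict[str, Any]) -> Dict[int, str]:
    """Map each page index to the markdown representation of that page."""
    result: Dict[int, str] = {}
    for position, page in enumerate(standard_output.get("pages", [])):
        # 'page_index' is an assumed field name; use the list position otherwise.
        index = page.get("page_index", position)
        result[index] = page.get("representation", {}).get("markdown", "")
    return result


# Made-up output with the same shape as the keys used above.
sample_output = {
    "pages": [
        {"page_index": 0, "representation": {"markdown": "# Cover page"}},
        {"representation": {"markdown": "Table 1. Summary"}},
    ]
}
print(markdown_by_page(sample_output))
```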

### element
The `elements` section contains the semantic elements extracted from the document, including text content, tables, and figures. The text and figure entities are further sub-classified, for example TITLE or SECTION_TITLE for text, or Chart for figures.

#### TEXT element

In [14]:
JSON(standard_output['elements'][5],root='elements[5]', expanded=True)

<IPython.core.display.JSON object>

#### FIGURE element

In [29]:
#save json to file
with open('standard_output.json', 'w') as f:
    json.dump(standard_output, f)
#standard_output

In [23]:
# Show only the FIGURE elements from the standard output
JSON([el for el in standard_output["elements"] if el["type"] == "FIGURE"], expanded=True)

<IPython.core.display.JSON object>

#### TABLE element

In [16]:
JSON(standard_output['elements'][27],root='elements[27]', expanded=True)

<IPython.core.display.JSON object>
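
The cells above select elements by fixed positions such as `elements[5]` and `elements[27]`, which only hold for this particular document. A small sketch that summarizes and filters elements by their `type` field instead is shown below. The helper names are hypothetical and the sample data is made up.

```python
from collections import Counter
from typing import Any, Dict, List


def count_element_types(standard_output: Dict[str, Any]) -> Counter:
    """Count how many elements of each type the standard output contains."""
    return Counter(
        element.get("type", "UNKNOWN")
        for element in standard_output.get("elements", [])
    )


def elements_of_type(standard_output: Dict[str, Any], element_type: str) -> List[Dict[str, Any]]:
    """Return all elements whose 'type' equals element_type."""
    return [
        element
        for element in standard_output.get("elements", [])
        if element.get("type") == element_type
    ]


# Made-up elements, for illustration only.
sample_elements = {
    "elements": [
        {"type": "TEXT"},
        {"type": "FIGURE"},
        {"type": "TABLE"},
        {"type": "TEXT"},
    ]
}
print(count_element_types(sample_elements))
```

For example, `elements_of_type(standard_output, "TABLE")[0]` would return the first table regardless of where it sits in the list.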

## Clean Up
Let's delete the sample files that were downloaded locally and uploaded to S3.

In [17]:
import os

# Delete the sample file from S3
s3_client.delete_object(Bucket=input_bucket, Key=f'{input_prefix}/{s3_file_name}')

# Delete the local copy of the sample file
if os.path.exists(local_file_name):
    os.remove(local_file_name)

## Conclusion

In this notebook, we interacted with Amazon Bedrock Data Automation (BDA) by passing a sample document to the BDA API using a project that configures standard output only, with no blueprint. We then explored the standard output for the sample document.

## Next Steps
Standard output can be modified using projects, which store configuration information for each data type. In the next part of the workshop, we will explore Bedrock Data Automation projects.