# Getting document insights with standard output

# Introduction

Amazon Bedrock Data Automation (BDA) lets you configure output based on your processing needs for a specific data type: images, documents, audio or video. BDA can generate standard output or custom output.

You can use standard outputs for all four modalities: images, documents, audio, and videos. BDA always returns a standard output response, even when it also returns a custom output response.

Standard outputs are modality-specific default structured insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. 

In this notebook we will explore the standard output for documents.

### Prerequisites

In [None]:
%pip install "boto3>=1.35.76" itables==2.2.4 PyPDF2==3.0.1 --upgrade -q

In [None]:
%load_ext autoreload
%autoreload 2

### Setup

Before we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
import pprint
from IPython.display import JSON, display, IFrame
import sagemaker
import pandas as pd
from itables import show
import time

session = sagemaker.Session()
default_bucket = session.default_bucket()

region_name = 'us-west-2'
# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

# Prepare sample document
For this lab, we use a `Monthly Treasury Statement for the United States Government` for Fiscal Year 2025 through November 30, 2024. The document is prepared by the Bureau of the Fiscal Service, Department of the Treasury and provides detailed information on the government's financial activities. We will extract a subset of pages from the `PDF` document and use BDA to extract and analyse the document content.
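
Selecting a subset of pages comes down to clamping a start and an end index to the length of the document. The following is a minimal sketch of that step, assuming zero-based indices with an inclusive end; `select_page_indices` is a hypothetical helper written for illustration and is not part of BDA or PyPDF2.

```python
from typing import List, Optional


def select_page_indices(total_pages: int,
                        start: Optional[int] = None,
                        end: Optional[int] = None) -> List[int]:
    """Return zero-based page indices from start to end, both inclusive."""
    if total_pages <= 0:
        return []
    # Missing bounds default to the first and last page of the document
    first = 0 if start is None else max(start, 0)
    last = total_pages - 1 if end is None else min(end, total_pages - 1)
    # range() excludes its stop value, so add 1 to keep the last page
    return list(range(first, last + 1))


print(select_page_indices(10, 2, 4))
```

Comparing against `None` rather than relying on truthiness keeps an explicit end index of `0` meaningful (it selects only the first page).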

### Download and store sample document
We use the document URL to download the document and store it in an S3 location.

Note - We will configure BDA to use the sample input from this S3 location, so we need to ensure that BDA has `s3:GetObject` access to this S3 location. If you are running the notebook in your own AWS Account, ensure that the SageMaker Execution role configured for this JupyterLab app has the right IAM permissions.
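
To make the permission requirement concrete, the sketch below builds an IAM policy document that allows `s3:GetObject` on objects under an input prefix. It is only an illustration: `build_read_policy` is a hypothetical helper, the bucket name is a placeholder, and your account may need additional permissions that are not shown here.

```python
import json


def build_read_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing s3:GetObject under a prefix."""
    prefix = prefix.strip("/")
    # Object ARNs take the form arn:aws:s3:::bucket/key
    resource = (f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix
                else f"arn:aws:s3:::{bucket}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [resource],
            }
        ],
    }


# "my-example-bucket" is a placeholder; substitute your own bucket name
print(json.dumps(build_read_policy("my-example-bucket", "bda/input"), indent=2))
```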

In [None]:
from pathlib import Path

document_url = "https://fiscaldata.treasury.gov/static-data/published-reports/mts/MonthlyTreasuryStatement_202411.pdf"

# Download the document
local_file_name = "./treasury1.pdf"
file_path_local = download_document(document_url, output_file_path=local_file_name)

# Upload the document to S3
file_name = Path(file_path_local).name
document_s3_uri = f'{bda_s3_input_location}/{file_name}'

target_s3_bucket, target_s3_key = get_bucket_and_key(document_s3_uri)
s3_client.upload_file(local_file_name, target_s3_bucket, target_s3_key)

print(f"Downloaded file to: {file_path_local}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")

In [None]:
import os
import time
import boto3
from urllib.parse import urlparse
import requests
import io
from PyPDF2 import PdfReader, PdfWriter
from botocore.exceptions import ClientError
from IPython.display import HTML
from IPython.display import display
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
import json


bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')

def get_bucket_and_key(s3_uri):
    parsed_uri = urlparse(s3_uri)
    bucket_name = parsed_uri.netloc
    object_key = parsed_uri.path.lstrip('/')
    return (bucket_name, object_key)

def wait_for_job_to_complete(invocationArn):
    get_status_response = bda_runtime_client.get_data_automation_status(
         invocationArn=invocationArn)
    status = get_status_response['status']
    job_id = invocationArn.split('/')[-1]
    max_iterations = 60
    iteration_count = 0
    while status not in ['Success', 'ServiceError', 'ClientError']:
        print(f'Waiting for Job to Complete. Current status is {status}')
        time.sleep(10)
        iteration_count += 1
        if iteration_count >= max_iterations:
            print(f"Maximum number of iterations ({max_iterations}) reached. Breaking the loop.")
            break
        get_status_response = bda_runtime_client.get_data_automation_status(
         invocationArn=invocationArn)
        status = get_status_response['status']
    if iteration_count >= max_iterations:
        raise Exception("Job did not complete within the expected time frame.")
    else:
        print(f"Invocation Job with id {job_id} completed. Status is {status}")
    return get_status_response


def read_s3_object(s3_uri):
    # Parse the S3 URI
    parsed_uri = urlparse(s3_uri)
    bucket_name = parsed_uri.netloc
    object_key = parsed_uri.path.lstrip('/')
    # Create an S3 client
    s3_client = boto3.client('s3')
    try:
        # Get the object from S3
        response = s3_client.get_object(Bucket=bucket_name, Key=object_key)
        
        # Read the content of the object
        content = response['Body'].read().decode('utf-8')
        return content
    except Exception as e:
        print(f"Error reading S3 object: {e}")
        return None

def download_document(url, start_page_index=None, end_page_index=None, output_file_path=None):

    if not output_file_path:
        filename = os.path.basename(url)
        output_file_path = filename
        
    # Download the PDF and fail early on HTTP errors
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    pdf_content = io.BytesIO(response.content)
    
    # Create a PDF reader object
    pdf_reader = PdfReader(pdf_content)
    
    # Create a PDF writer object
    pdf_writer = PdfWriter()
    
    # Clamp the requested range to the document; both indices are 0-based and inclusive
    last_page_index = len(pdf_reader.pages) - 1
    start_page_index = 0 if start_page_index is None else max(start_page_index, 0)
    end_page_index = last_page_index if end_page_index is None else min(end_page_index, last_page_index)

    # range() excludes its stop value, so add 1 to include end_page_index
    pages_to_extract = list(range(start_page_index, end_page_index + 1))
    
    # Add the specified pages to the writer
    for page_num in pages_to_extract:
        page = pdf_reader.pages[page_num]
        pdf_writer.add_page(page)

    print(output_file_path)
    # Save the extracted pages to a new PDF
    with open(output_file_path, "wb") as output_file:
        pdf_writer.write(output_file)
    return output_file_path


import boto3
from botocore.config import Config
from urllib.parse import urlparse
from typing import Optional
import pandas as pd

def generate_presigned_url(s3_uri: str, expiration: int = 3600) -> Optional[str]:
    """
    Generate a presigned URL for an S3 object with retry logic.
    
    Args:
        s3_uri (str): S3 URI in format 's3://bucket-name/key'
        expiration (int): URL expiration time in seconds
        
    Returns:
        Optional[str]: Presigned URL or None if generation fails
    """
    try:
        parsed = urlparse(s3_uri)
        bucket = parsed.netloc
        key = parsed.path.lstrip('/')
        
        config = Config(
            signature_version='s3v4',
            retries={'max_attempts': 3}
        )
        s3_client = boto3.client('s3', config=config)
        
        return s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
    except Exception as e:
        print(f"Error generating presigned URL for {s3_uri}: {e}")
        return None

def create_image_html_column(row: pd.Series, image_col: str, width: str = '300px') -> str:
    """
    Create HTML embedded image from S3 URI using presigned URL for a DataFrame row.
    
    Args:
        row (pd.Series): DataFrame row
        image_col (str): Name of column containing S3 URI
        width (str): Fixed width for image
        
    Returns:
        str: HTML string for embedded image
    """
    s3_uri = row[image_col]
    if isinstance(s3_uri, list):
        s3_uri = s3_uri[0]
    if pd.isna(s3_uri):
        return ''
    
    presigned_url = generate_presigned_url(s3_uri)
    if presigned_url:
        return f'<img src="{presigned_url}" style="width: {width}; object-fit: contain;">'
    return ''


# Example usage:
"""
# Add embedded images column (create_image_html_column is applied row by row)
df['embedded_images'] = df.apply(create_image_html_column, axis=1, image_col='crop_images', width='300px')

# For Jupyter notebook display:
from IPython.display import HTML
HTML(df['embedded_images'].iloc[0])
"""



def wait_for_completion(
    client,
    get_status_function,
    status_kwargs,
    status_path_in_response,
    completion_states,
    error_states,
    max_iterations=60,
    delay=10
):
    for _ in range(max_iterations):
        try:
            response = get_status_function(**status_kwargs)
            status = get_nested_value(response, status_path_in_response)

            if status in completion_states:
                print(f"Operation completed successfully with status: {status}")
                return response

            if status in error_states:
                raise Exception(f"Operation failed with status: {status}")

            print(f"Current status: {status}. Waiting...")
            time.sleep(delay)

        except ClientError as e:
            raise Exception(f"Error checking status: {str(e)}")

    raise Exception(f"Operation timed out after {max_iterations} iterations")


def get_nested_value(data, path):
    """
    Retrieve a value from a nested dictionary using a dot-separated path.

    :param data: The dictionary to search
    :param path: A string representing the path to the value, e.g., "Job.Status"
    :return: The value at the specified path, or None if not found
    """
    keys = path.split('.')
    for key in keys:
        if isinstance(data, dict) and key in data:
            data = data[key]
        else:
            return None
    return data


def display_html(data, root='root', expanded=True, bg_color='#f0f0f0'):
    html = f"""
        <div class="custom-json-output" style="background-color: {bg_color}; padding: 10px; border-radius: 5px;">
            <button class="toggle-btn" style="margin-bottom: 10px;">{'Collapse' if expanded else 'Expand'}</button>
            <pre class="json-content" style="display: {'block' if expanded else 'none'};">{data}</pre>
        </div>
        <script>
        (function() {{
            var toggleBtn = document.currentScript.previousElementSibling.querySelector('.toggle-btn');
            var jsonContent = document.currentScript.previousElementSibling.querySelector('.json-content');
            toggleBtn.addEventListener('click', function() {{
                if (jsonContent.style.display === 'none') {{
                    jsonContent.style.display = 'block';
                    toggleBtn.textContent = 'Collapse';
                }} else {{
                    jsonContent.style.display = 'none';
                    toggleBtn.textContent = 'Expand';
                }}
            }});
        }})();
        </script>
        """
    display(HTML(html))

def send_request(region, url, method, credentials, payload=None, service='bedrock'):
    host = url.split("/")[2]
    request = AWSRequest(
            method,
            url,
            data=payload,
            headers={'Host': host, 'Content-Type':'application/json'}
    )    
    SigV4Auth(credentials, service, region).add_auth(request)
    response = requests.request(method, url, headers=dict(request.headers), data=payload, timeout=50)
    response.raise_for_status()
    content = response.content.decode("utf-8")
    data = json.loads(content)
    return data

def invoke_blueprint_recommendation_async(bda_client,region_name, payload):
    credentials = boto3.Session().get_credentials().get_frozen_credentials()
    url = f"{bda_client.meta.endpoint_url}/invokeBlueprintRecommendationAsync"
    print(f'Sending request to {url}')
    result = send_request(
        region = region_name,
        url = url,
        method = "POST", 
        credentials = credentials,
        payload=payload
    )
    return result


def get_blueprint_recommendation(bda_client, region_name, credentials, job_id):
    url = f"{bda_client.meta.endpoint_url}/getBlueprintRecommendation/{job_id}/"
    result = send_request(
        region = region_name,
        url = url,
        method = "POST",
        credentials = credentials        
    )
    return result

def create_or_update_blueprint(bda_client, blueprint_name, blueprint_description, blueprint_type, blueprint_stage, blueprint_schema):
    list_blueprints_response = bda_client.list_blueprints(
        blueprintStageFilter='ALL'
    )
    blueprint = next((blueprint for blueprint in
                      list_blueprints_response['blueprints']
                      if 'blueprintName' in blueprint and
                      blueprint['blueprintName'] == blueprint_name), None)

    if not blueprint:
        print(f'No existing blueprint found with name={blueprint_name}, creating custom blueprint')
        response = bda_client.create_blueprint(
            blueprintName=blueprint_name,
            type=blueprint_type,
            blueprintStage=blueprint_stage,
            schema=json.dumps(blueprint_schema)
        )
    else:
        print(f'Found existing blueprint with name={blueprint_name}, updating Stage and Schema')
        response = bda_client.update_blueprint(
            blueprintArn=blueprint['blueprintArn'],
            blueprintStage=blueprint_stage,
            schema=json.dumps(blueprint_schema)
        )

    return response['blueprint']['blueprintArn']


def transform_custom_output(input_json, explainability_info):
    result = {
        "forms": {},
        "tables": {}
    }
    
    def add_confidence(value, confidence_info):
        # For simple key-value pairs
        if isinstance(confidence_info, dict) and 'confidence' in confidence_info:
            return {
                "value": value,
                "confidence": confidence_info['confidence']
            }
        return value
    
    def process_list_item(item, confidence_info):
        # For handling nested dictionaries within lists
        processed_item = {}
        for key, value in item.items():
            if isinstance(confidence_info, dict) and key in confidence_info:
                processed_item[key] = add_confidence(value, confidence_info[key])
        return processed_item

    # Iterate through the input JSON
    for key, value in input_json.items():
        confidence_data = explainability_info.get(key, {})
        
        if isinstance(value, list):
            # Handle lists (tables)
            processed_list = []
            for idx, item in enumerate(value):
                if isinstance(item, dict):
                    # Process each item in the list using its corresponding confidence info
                    conf_info = confidence_data[idx] if isinstance(confidence_data, list) else confidence_data
                    processed_list.append(process_list_item(item, conf_info))
            result["tables"][key] = processed_list
        else:
            # Handle simple key-value pairs (forms)
            result["forms"][key] = add_confidence(value, confidence_data)
            
    return result

def get_summaries(custom_outputs):
    custom_output_summaries = []
    for custom_output in custom_outputs:
        custom_output_summary = {}
        if custom_output:
            custom_output_summary = {
                'page_indices': custom_output.get('split_document', {}).get('page_indices', None),
                'matched_blueprint_name': custom_output.get('matched_blueprint', {}).get('name', None),
                'confidence': custom_output.get('matched_blueprint', {}).get('confidence', None),
                'document_class_type': custom_output.get('document_class', {}).get('type', None),
                #'matched_blueprint_arn': custom_output.get('matched_blueprint', {}).get('arn', None)
            }
        else:
            custom_output_summary = {}
        custom_output_summaries += [custom_output_summary]
    return custom_output_summaries


In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, JSON
from PIL import Image
import io
import pandas as pd
  


def load_image(image_path):
    # Open the image
    img = Image.open(image_path)
    
    # Convert to JPEG if it's not already
    if img.format != 'JPEG':
        # Create a byte stream
        buf = io.BytesIO() 
        # Save as JPEG to the byte stream
        img.save(buf, format='JPEG')
        # Get the byte value
        image_bytes = buf.getvalue()
    else:
        # If already JPEG, read directly
        with open(image_path, 'rb') as file:
            image_bytes = file.read()
    
    return image_bytes

def onclick_function():
    return """
        <script>
            function handleClick(event) {
                var row = event.target;
                if (!row) return;  // Click wasn't on a row

                // Get the bbox data from the row
                var bbox = row.getAttribute('data-bbox');
                if (!bbox) return;  // No bbox data found
                row.style.backgroundColor = '#ffe0e0';
                
                // Parse the bbox string back to array
                //bbox = JSON.parse(bbox);
                row.style.backgroundColor = '#fff0f0';

                // Send custom event to Python
                var event = new CustomEvent('bbox_click', { detail: bbox });
                document.dispatchEvent(event);
                row.style.backgroundColor = '#ffe0e0';
                
                
                // First, reset all rows to default background
                var rows = document.getElementsByClassName('kc-item');
                for(var i = 0; i < rows.length; i++) {
                    rows[i].style.backgroundColor = '#f8f8f8';
                }
                
                // Then highlight only the clicked row
                row.style.backgroundColor = '#e0e0e0';
            }
        </script>
    """

def create_form_view(forms_data):
    """Create a formatted view for key-value pairs with nested dictionary support"""
    html_content = """
    <style>
        .kv-container {
            display: flex;
            flex-direction: column;
            gap: 4px;
            margin: 4px;
            width: 100%;
        }    
        .kv-box {
            border: 0px solid #e0e0e0;
            border-radius: 4px;
            padding: 4px;
            margin: 0;
            background-color: #f8f9fa;
            width: auto;
        }
        .kv-item {
            display: flex;
            justify-content: space-between;
            align-items: center;
            margin-bottom: 2px;
        }
        .kc-item {
            background-color: #fff;
            display: flex;
            justify-content: space-between;
            align-items: center;
            margin-bottom: 2px;
        }
        .key {
            font-weight: 600; 
            padding: 1px 4px;
            font-size: 0.85em; 
            color: #333;
        }
        .value {
            background-color: #fff;
            padding: 1px 4px;
            border-radius: 4px;
            font-size: 0.85em;
            color: #666;
            margin-top: 1px;
        }
        .confidence {
            padding: 1px 4px;
            border-radius: 4px;
            font-size: 0.85em;
            color: #2196F3;        
        }
        .nested-container {
            margin-left: 8px;
            margin-top: 4px;
            border-left: 2px solid #e0e0e0;
            padding-left: 4px;
        }
        .parent-key {
            color: #6a1b9a;
            font-size: 0.9em;
            font-weight: 600;
            margin-bottom: 2px;
        }
    </style>       
    """

    html_content += onclick_function()
    html_content += '<div class="kv-container">'

    def render_nested_dict(data, level=0):
        nested_html = ""
        if isinstance(data, dict):
            for key, value in data.items():
                if isinstance(value, dict):
                    confidence = value.get('confidence', 0) * 100
                    if 'value' in value:
                        # Handle standard key-value pair with confidence
                        nested_html += f"""
                            <div class='kv-box'>
                                <div class='kv-item'>
                                    <div class='key'>{key}</div>
                                </div>
                                <div class='kc-item' onclick=handleClick(event) data-bbox='(10,40,110,200)'>
                                    <div class="value" >{value['value']}</div>
                                    <div class='confidence'>{confidence:.1f}%</div>
                                </div>
                            </div>
                        """
                    else:
                        # Handle nested dictionary
                        nested_html += f"""
                            <div class='kv-box'>
                                <div class='kv-item'>
                                    <div class='key'>{key}</div>
                                </div>
                                <div class="nested-container">
                                    {render_nested_dict(value, level + 1)}
                                </div>
                            </div>
                        """
                else:
                    # Handle direct key-value pairs without confidence
                    nested_html += f"""
                        <div class='kv-box'>
                            <div class='kv-item'>
                                <div class='key'>{key}</div>
                            </div>
                            <div class="value">{value}</div>
                        </div>
                    """
        return nested_html

    html_content += render_nested_dict(forms_data)
    html_content += "</div>"
    
    return HTML(html_content)


def create_table_view(tables_data):
    """Create a formatted view for tables"""
    html_content = """
    <style>
        .table-container {
            margin: 20px;
        }
        .table-view {
            width: 100%;
            border-collapse: collapse;
            background-color: white;
        }
        .table-view th {
            background-color: #f8f9fa;
            padding: 12px;
            text-align: left;
            font-size: 0.85em;
            border: 1px solid #dee2e6;
        }
        .table-view td {
            padding: 12px;
            border: 1px solid #dee2e6;
            font-size: 0.8em;
        }
    </style>
    """
    
    for table_name, table_data in tables_data.items():
        if table_data:
            df = pd.DataFrame(table_data)
            html_content += f"""
            <div class="table-container">
                <h3>{table_name}</h3>
                {df.to_html(classes='table-view', index=False)}
            </div>
            """
    
    return HTML(html_content)


def get_view(data, display_function=None):
    out = widgets.Output()
    with out:
        if callable(display_function):
            display_function(data)
        else:
            display(data)
    return out


def segment_view(document_image_uris, inference_result):
    # Create the layout with top alignment
    main_hbox_layout = widgets.Layout(
        width='100%',
        display='flex',
        flex_flow='row nowrap',
        align_items='stretch',
        margin='0'
    )
    image_widget = widgets.Image(
        value=b'',
        format='png',
        width='auto',
        height='auto'
    )
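    # Accept either a single image path or a list of paths (the first one is shown)
    document_image_uri = document_image_uris[0] if isinstance(document_image_uris, (list, tuple)) else document_image_uris
    image_widget.value = load_image(image_path=document_image_uri)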
    image_container = widgets.VBox(
        children=[image_widget],
        layout=widgets.Layout(
            border='1px solid #888',
            padding='1px',
            margin='2px',
            width='60%',
            flex='0 0 60%',
            min_width='300px',
            height='auto',
            display='flex',
            align_items='stretch',
            justify_content='center'
        )
    )
    
    
    # Create tabs for different views
    tab = widgets.Tab(
        layout=widgets.Layout(
            width='40%',
            flex='0 0 40%',
            min_width='300px',
            height='auto'
        )
    )
    form_view = widgets.Output()
    table_view = widgets.Output()
    
    with form_view:
        display(create_form_view(inference_result['forms']))
        
    with table_view:
        display(create_table_view(inference_result['tables']))
    
    tab.children = [form_view, table_view]
    tab.set_title(0, 'Key Value Pairs')
    tab.set_title(1, 'Tables')

    
    # Add custom CSS for scrollable container
    custom_style = """
    <style>
        .scrollable-vbox {
            max-height: 1000px;
            overflow-y: auto;
            overflow-x: hidden;
        }
        .main-container {
            display: flex;
            height: 1000px;  /* Match with max-height above */
        }
        .jupyter-widgets-output-area .p-TabBar-tab {
            min-width: fit-content !important;
            max-width: fit-content !important;
            padding: 6px 10px !important;
        }
    </style>
    """
    display(HTML(custom_style))
    
    # Create the main layout
    main_layout = widgets.HBox(
        children=[image_container, tab],
        layout=main_hbox_layout
    )

    
    # Apply the main-container CSS class to the outer HBox
    main_layout.add_class('main-container')
    return main_layout


def display_collapsable(data, title):
    accordion = widgets.Accordion(children=[widgets.Output()])
    accordion.set_title(0, title)
    
    with accordion.children[0]:
        display(data)
    
    return accordion
    

def display_multiple(views, view_titles = None):
    main_tab = widgets.Tab()
    for i, view in enumerate(views):
        main_tab.children = (*main_tab.children, view)
        tab_title = view_titles[i] if view_titles and view_titles[i] else f'Document {i}'
        main_tab.set_title(i, title=tab_title)
    display(main_tab)

In [None]:
import utils

### View Sample Document

In [None]:
IFrame(local_file_name, width=600, height=400)

# Define standard output types and granularity

We can configure various types of insights and their granularity for the standard output using `standard_output_config`.
Below is a summary of the options that you can set when using standard output with documents.

For more details see the [bda-output-documents](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html) documentation.

- **[Response Granularity](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html#document-granularity)**
This setting tells BDA what kind of response you want from document text extraction. Each level of granularity gives you progressively more separated responses: page-level returns all of the text extracted from a page together, while word-level returns each word as a separate response. The available granularity levels are:
  - Page
  - Element    
  - Word

- **Output settings**
Output settings determine the structure of the results produced by BDA. The options for output settings are:
    - **JSON** - The result would be a JSON output file with the information from your configuration settings. This is the **default** for document analysis.    
    - **JSON+files** - The result includes a JSON output along with files that correspond to the different outputs. For example, this setting gives you a text file for the overall text extraction, a markdown file for the text with structural markdown, and CSV files for each table that's found in the text.
- **Text Format**
Text format determines the different kinds of texts that will be provided via various extraction operations. You can select any number of the following options for your text format.

    - **Plaintext** – This setting provides a text-only output with no formatting or other markdown elements noted.
    - **Text with markdown** – The **default** output setting for standard output. Provides text with markdown elements integrated.    
    - **Text with HTML** – Provides text with HTML elements integrated in the response.    
    - **CSV** – Provides a CSV structured output for tables within the document. This only gives a response for tables, not for other elements of the document.

- **Bounding Boxes**

    - When the Bounding Boxes option is enabled, BDA outputs `Bounding Boxes` for elements in the document as the coordinates of the four corners of each box. This helps in drawing a visual outline of the element in the document.


- **Generative Fields**
    
    - When `Generative Fields` are enabled, BDA generates a 10-word summary and a 250-word description of the document in the output. Additionally, with Response Granularity enabled at the element level, BDA also generates a descriptive caption for each figure detected in the document. Figures include things like charts, graphs, and images.


  Both these options are **disabled by default**.
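
As a rough illustration of how these options map onto a configuration dictionary, the sketch below assembles only the `document` part of a standard output config. The key names mirror the `standard_output_config` used in this notebook; `build_document_config` and `_state` are hypothetical helpers, the default granularity of `PAGE` is just a choice made for this example, and the sketch assumes that `additionalFileFormat` corresponds to the JSON+files output setting and that `DISABLED` is the counterpart of `ENABLED`.

```python
from typing import Iterable


def _state(enabled: bool) -> dict:
    """Translate a boolean into the state dictionary used by the config."""
    return {"state": "ENABLED" if enabled else "DISABLED"}


def build_document_config(granularity: Iterable[str] = ("PAGE",),
                          text_formats: Iterable[str] = ("MARKDOWN",),
                          bounding_boxes: bool = False,
                          generative_fields: bool = False,
                          additional_files: bool = False) -> dict:
    """Assemble the document section of a standard output configuration."""
    return {
        "document": {
            "extraction": {
                "granularity": {"types": list(granularity)},
                "boundingBox": _state(bounding_boxes),
            },
            "generativeField": _state(generative_fields),
            "outputFormat": {
                "textFormat": {"types": list(text_formats)},
                # Assumed to correspond to the JSON+files output setting
                "additionalFileFormat": _state(additional_files),
            },
        }
    }


# Example: page- and word-level text as plain text and CSV, with bounding boxes
config = build_document_config(["PAGE", "WORD"], ["PLAIN_TEXT", "CSV"],
                               bounding_boxes=True)
print(config["document"]["extraction"])
```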


Now that we have looked at the default options, let's create a config that activates all the different types, so that we can see what the output looks like. We also include the image, audio, and video settings for illustrative purposes.

In [None]:
standard_output_config =  {
  "document": {
    "extraction": {
      "granularity": {"types": ["DOCUMENT","PAGE", "ELEMENT","LINE","WORD"]},
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {"state": "ENABLED"},
    "outputFormat": {
      "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
      "additionalFileFormat": {"state": "ENABLED"}
    }
  },
  "image": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION","TEXT_DETECTION"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["IMAGE_SUMMARY","IAB"]
    }
  },
  "video": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION","TEXT_DETECTION","TRANSCRIPT"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["VIDEO_SUMMARY", "SCENE_SUMMARY","IAB"]
    }
  },
  "audio": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ['AUDIO_CONTENT_MODERATION','CHAPTER_CONTENT_MODERATION','TRANSCRIPT']
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ['AUDIO_SUMMARY','CHAPTER_SUMMARY','IAB']
    }
  }
}

# JSON(standard_output_config["document"], expanded=True)
JSON(standard_output_config, expanded=False)

# Create project with standard output config

To use the standard output configuration, we create a project and pass it the config defined above. For an overview of all the available parameters for project creation, see the [create project documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-data-automation/client/create_data_automation_project.html).
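
When the notebook is re-run, a project with the same name may already exist, so it helps to look it up before creating a new one. The sketch below shows that lookup on a plain list of dictionaries; `find_project_arn` is a hypothetical helper, and the sample entries are hand-written placeholders shaped like the `projects` list returned by `list_data_automation_projects()`.

```python
from typing import Iterable, Optional


def find_project_arn(projects: Iterable[dict], name: str) -> Optional[str]:
    """Return the ARN of the first project whose projectName matches, else None."""
    for project in projects:
        if project.get("projectName") == name:
            return project.get("projectArn")
    return None


# Placeholder entries for illustration; real ARNs come from the API response
sample_projects = [
    {"projectName": "other_project", "projectArn": "arn:placeholder:other"},
    {"projectName": "my_bda_project", "projectArn": "arn:placeholder:mine"},
]
print(find_project_arn(sample_projects, "my_bda_project"))
```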

In [None]:
project_name= "my_bda_project"

# delete project if it already exists
projects_existing = [project for project in bda_client.list_data_automation_projects()["projects"] if project["projectName"] == project_name]
if len(projects_existing) > 0:
    print(f"Deleting existing project: {projects_existing[0]}")
    bda_client.delete_data_automation_project(projectArn=projects_existing[0]["projectArn"])
    time.sleep(1)

In [None]:

response = bda_client.create_data_automation_project(
    projectName=project_name,
    projectDescription="project to get our extended standard output",
    projectStage='LIVE',
    standardOutputConfiguration=standard_output_config    
)
project_arn = response["projectArn"]
time.sleep(1)
JSON(response)

# Invoke data automation async

In [None]:

response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        # Sample document uploaded earlier in this notebook
        's3Uri': f's3://{target_s3_bucket}/{target_s3_key}'
    },
    outputConfiguration={
        # Output location defined in the Setup section
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationArn': project_arn,
        'stage': 'LIVE'
    }
)

invocationArn = response['invocationArn']

### Get data automation job status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN returned by the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, at which point the response also includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, along with error details.

In [None]:
status_response = wait_for_job_to_complete(invocationArn=invocationArn)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # The status response uses camelCase keys (errorType, errorMessage)
    raise Exception(f'Invocation Job Error, error_type={status_response.get("errorType")}, error_message={status_response.get("errorMessage")}')

### Retrieve job metadata

Let's retrieve and explore the job metadata response.
It contains a `standard_output_path` field that points to where the results have been saved.

In [None]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata,root='job_metadata',expanded=True)
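The `read_s3_object` helper used above is defined elsewhere in the notebook. For readers who want to see what such a helper might involve, here is a minimal sketch under stated assumptions: `split_s3_uri` and `read_s3_text` are hypothetical names, and the bucket and key in the example are made up.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_s3_text(s3_client, s3_uri):
    """Download an object with a boto3 S3 client and decode it as UTF-8 text."""
    bucket, key = split_s3_uri(s3_uri)
    return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")


# Illustrative URI only; not a real location
bucket, key = split_s3_uri("s3://example-bucket/bda/output/job_metadata.json")
print(bucket, key)
```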

# Explore standard output results

We can now explore the standard output received from processing documents using Data Automation. 

Based on the standard output configuration we used above, the output can contain the following fields:
* **metadata**
* **document**
* **pages**
* **elements**
* **text_lines**
* **text_words**

We will review each of these fields in the sections below.

First, let's download and parse the standard output JSON file whose path we received in the job metadata.

In [None]:
asset_id=0
standard_output_path = next(item["segment_metadata"][0]["standard_output_path"] 
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)
standard_output = json.loads(read_s3_object(standard_output_path))
JSON(standard_output)
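The `next(...)` expression above raises a bare `StopIteration` if the asset is missing and only looks at the first segment. A slightly more defensive variant could look like the sketch below. `standard_output_paths` is a hypothetical helper, and the sample metadata is a made-up dict that mirrors only the keys used in the cell above.

```python
def standard_output_paths(job_metadata, asset_id=0):
    """Return every standard_output_path recorded for the given asset."""
    for item in job_metadata.get("output_metadata", []):
        if item.get("asset_id") == asset_id:
            return [
                segment["standard_output_path"]
                for segment in item.get("segment_metadata", [])
                if "standard_output_path" in segment
            ]
    raise KeyError(f"asset_id {asset_id} not found in job metadata")


# Made-up sample mirroring the keys accessed above
sample_job_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"standard_output_path": "s3://example-bucket/bda/output/result.json"}
            ],
        }
    ]
}

paths = standard_output_paths(sample_job_metadata)
print(paths)
```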

### metadata
The metadata section of the response gives an overview of the metadata associated with the document. This includes the S3 bucket and key of the input document, the modality that was selected for the response, the number of pages processed, and the start and end page indexes.

In [None]:
JSON(standard_output['metadata'],root='metadata',expanded=True)

### document
The document section of the standard output provides document-level information, that is, an analysis of the document as a whole that surfaces its key pieces of information.

By default, document-level granularity includes statistics about the actual content of the document, such as the number of semantic elements, figures, words, and lines. We will look at the further information presented at document level when we modify the standard output using projects.

In [None]:
df_document = pd.json_normalize(standard_output["document"])

df = df_document.T
pd.set_option('display.max_colwidth', 200)
df

### pages
With page-level granularity (enabled by default), the text of each page is consolidated and listed in the pages section, with one item per page. Each page entity in the standard output includes the page index, along with statistics about the actual content, such as the number of semantic elements, figures, words, and lines. The asset metadata represents the page bounds using the coordinates of the four corners.

Below, we look at a snippet of the output pertaining to a specific page.

In [None]:
df_pages = pd.json_normalize(standard_output["pages"])
pd.reset_option('display.max_colwidth')  
df_pages.loc[3].T
# show(df_pages.loc[3].T, classes="compact")

In [None]:
JSON(standard_output['pages'][8],root='pages[8]',expanded=False)

In [None]:
from IPython.display import Markdown, display

pages_md = [page["representation"]["markdown"] for page in standard_output['pages']]
display(Markdown(pages_md[4]))
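The per-page markdown can also be stitched back into a single document, for example to pass the whole text to a downstream step. This is a minimal sketch: `pages_to_markdown` is a hypothetical helper, the separator is an arbitrary choice, and the sample pages are made up. It relies only on the `representation.markdown` key used in the cell above.

```python
def pages_to_markdown(pages, separator="\n\n---\n\n"):
    """Join the markdown representation of every page into one string."""
    return separator.join(
        page.get("representation", {}).get("markdown", "") for page in pages
    )


# Made-up sample pages
sample_pages = [
    {"representation": {"markdown": "# Page one"}},
    {"representation": {"markdown": "Some text on page two."}},
]

full_markdown = pages_to_markdown(sample_pages)
print(full_markdown)
```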

### elements
The elements section contains the semantic elements extracted from the document, including text content, tables, and figures. The text and figure entities are further sub-classified, for example TITLE/SECTION_TITLE for text or Chart for figures.
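A quick way to get a feel for the elements list is to count entries per type and sub-type. The sketch below assumes each element carries a `type` key (as used in the cells that follow) and an optional `sub_type` key. The `sub_type` key name and the sample values are assumptions for illustration, and `count_elements` is a hypothetical helper.

```python
from collections import Counter


def count_elements(elements):
    """Count elements per (type, sub_type) pair; sub_type is None when absent."""
    return Counter((el.get("type"), el.get("sub_type")) for el in elements)


# Made-up sample; key names and values are illustrative assumptions
sample_elements = [
    {"type": "TEXT", "sub_type": "TITLE"},
    {"type": "TEXT", "sub_type": "SECTION_TITLE"},
    {"type": "TEXT", "sub_type": "SECTION_TITLE"},
    {"type": "FIGURE", "sub_type": "CHART"},
    {"type": "TABLE"},
]

counts = count_elements(sample_elements)
for (el_type, sub_type), n in sorted(counts.items(), key=str):
    print(el_type, sub_type, n)
```

On the real output, `count_elements(standard_output["elements"])` would show at a glance how many of each kind were extracted.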

#### TEXT elements

In [None]:
# Filter dataframe for text elements
df_elements = pd.json_normalize(standard_output["elements"])
df_text = df_elements[df_elements["type"] == "TEXT"]

# Display formatted dataframe
show(
    df_text.iloc[:50, 2:8],
    columnDefs=[
        {"width": "280px", "targets": [4, 5]},
        {"width": "150px", "targets": [3]},
        {"className": "dt-left", "targets": "_all"}
    ],
    style="width:1200px",
    autoWidth=False,
    classes = "compact",
    showIndex=False
)

In [None]:
JSON(standard_output['elements'][5],root='elements[5]')

#### FIGURE elements

In [None]:

# Filter dataframe for figure elements
df_elements = pd.json_normalize(standard_output["elements"])
# Work on an explicit copy so insert() does not operate on a filtered slice
df_figure = df_elements[df_elements["type"] == "FIGURE"].copy()

embedded_images = df_figure.apply(lambda row: create_image_html_column(row, "crop_images", "200px"), axis=1)
df_figure.insert(6, 'image', embedded_images)

# Display formatted dataframe
show(
    df_figure.iloc[:, 2:9],
    columnDefs=[                
        {"width": "120px", "targets": [0,1,3]},          
        {"width": "220px", "targets": [2,4]},
        {"width": "280px", "targets": [5]},        
        {"width": "480px", "targets": [6]},        
        {"className": "dt-left", "targets": "_all"}
    ],
    style="width:1200px",
    # autoWidth=False,
    classes="compact",
    showIndex=False,
    # column_filters="header"
)

In [None]:
JSON([el for el in standard_output["elements"]if el["type"]=="FIGURE"])

#### TABLE elements

In [None]:

# Filter dataframe for table elements
df_elements = pd.json_normalize(standard_output["elements"])
# Work on an explicit copy so insert() does not operate on a filtered slice
df_table = df_elements[df_elements["type"] == "TABLE"].copy()

embedded_images = df_table.apply(lambda row: create_image_html_column(row, "crop_images", "500px"), axis=1)
df_table.insert(6, 'image', embedded_images)
cols = ["type","locations","image", 
        #'representation.text', 'representation.markdown', 
        'representation.html','title', 'summary', 'footers', 'headers', 'csv_s3_uri',
       'representation.csv']
# Display formatted dataframe
show(
    df_table[cols],
    columnDefs=[                
        {"width": "120px", "targets": [0,1]},   
        {"width": "340px", "targets": [2]},  
        {"width": "380px", "targets": [3]},
        {"width": "150px", "targets": [5,6,7,8]},        
        {"className": "dt-left", "targets": "_all"}
    ],
    # style="width:1200px",
    # autoWidth=True,
    classes="compact",
    showIndex=False,
    scrollY="400"    
)

In [None]:
JSON([el for el in standard_output["elements"]if el["type"]=="TABLE"][2], root="sample_table")
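When the `CSV` text format is enabled, each table element carries a CSV rendering (the `representation.csv` column above). The sketch below parses that string into rows using only the standard library. `table_rows` is a hypothetical helper and the sample table content is made up.

```python
import csv
import io


def table_rows(element):
    """Parse the CSV representation of a TABLE element into a list of rows."""
    csv_text = element.get("representation", {}).get("csv") or ""
    return list(csv.reader(io.StringIO(csv_text, newline="")))


# Made-up table element; values are illustrative only
sample_table = {
    "type": "TABLE",
    "representation": {"csv": 'Category,Amount\n"Receipts, total",100\nOutlays,250\n'},
}

rows = table_rows(sample_table)
header, body = rows[0], rows[1:]
print(header, body)
```

Using a CSV parser instead of splitting on commas keeps quoted cells that contain commas intact.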

### text_lines elements

In [None]:
JSON(standard_output["text_lines"][:10], root="text_lines")

In [None]:
df = pd.json_normalize(standard_output["text_lines"])
show(df, classes="compact")

### text_words elements

In [None]:
JSON(standard_output["text_words"][3:4], root="text_words[3:4]", expanded=True)

## Conclusion

We explored the BDA standard output for documents, which is configurable and gives us detailed insights into a document and its structure, such as headers, sections, paragraphs, tables, figures, and charts.

BDA not only detects these elements but also interprets them, for example by describing a figure or by extracting the values depicted in a chart into a structured table. This structured output is very powerful.

## Clean Up
Let's delete the uploaded sample file from the S3 input directory, along with the generated job output files.

In [None]:
import os
from pathlib import Path

# Delete S3 File
s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)

# Delete local file
if os.path.exists(local_file_name):
    os.remove(local_file_name)

# Delete bda job output
bda_s3_job_location = str(Path(job_metadata_s3_location).parent).replace("s3:/","s3://")
!aws s3 rm {bda_s3_job_location} --recursive

## Next Steps


In [None]:
import boto3
import json
import pprint
from IPython.display import JSON, display, HTML, Markdown, IFrame
from sagemaker import get_execution_role

default_execution_role = get_execution_role()
region_name = 'us-west-2'
# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = 's3://genai.octankmarkets.com/input'
bda_s3_output_location = 's3://genai.octankmarkets.com/output'
print(f'Using default execution role {default_execution_role}')

