# Extract Information using Amazon Bedrock Data Automation

> This notebook should work well with the **`conda_python3`** kernel in SageMaker Notebook instances.

---

In this notebook, we demonstrate an alternative intelligent document processing (IDP) approach: using Amazon Bedrock Data Automation as a managed service.

---

Bedrock Data Automation (BDA) lets you configure output based on your processing needs for a specific data type: documents, images, video or audio. BDA can generate standard output or custom output. Below are some key concepts for understanding how BDA works:

* **Standard output** – Sending a file to BDA with no other information returns the default standard output, which consists of commonly required information that's based on the data type. Examples include audio transcriptions, scene summaries for video, and document summaries. For more information, see [Standard output for documents in Bedrock Data Automation](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-output-documents.html).

* **Custom output** – For documents and images only. Choose custom output to define exactly what information you want to extract using a blueprint. A blueprint consists of a list of expected fields that you want retrieved from a document or an image. Each field represents a piece of information that needs to be extracted to meet your specific use case. For more information, see [Custom output and blueprints](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-custom-output-idp.html).

For our use case, we will set up a custom blueprint with the attributes schema, and then run a data automation job. Note that here, OCR and information extraction are performed in one go.

# 1. PREPARATIONS

In [2]:
import json
import os

import boto3
import botocore

We will need to set up two `boto3` clients:

In [None]:
bda_client = boto3.client("bedrock-data-automation", region_name="us-west-2")
bda_runtime_client = boto3.client(
    "bedrock-data-automation-runtime", region_name="us-west-2"
)

## Define folders

We will select the same PDF document we used in `01_process_pdf.ipynb`. Check out the notebook for the code to upload your document(s) to S3.

Since Amazon Bedrock expects input files to be uploaded to S3, we will read them from a prepopulated S3 bucket. The name of the bucket is `s3://information-extraction-workshop-<ACCOUNT_ID>`. Replace the `<ACCOUNT_ID>` suffix with the AWS account ID you're using, or simply visit the [S3 console](https://console.aws.amazon.com/s3/home) and copy the bucket name.

You can check copies of the input files in the `demo-files` folder for reference.

In [15]:
input_path = "demo-files"
output_path = "processed-files"
s3_bucket = "information-extraction-workshop-<ACCOUNT_ID>"  # replace the suffix <ACCOUNT_ID> with the aws account ID
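
If you would rather not paste the account ID by hand, it can be looked up from the active credentials. The sketch below assumes the bucket follows the `information-extraction-workshop-<ACCOUNT_ID>` naming described above; `workshop_bucket_name` is a hypothetical helper, not part of the workshop code.

```python
def workshop_bucket_name(sts_client, prefix="information-extraction-workshop"):
    """Build the workshop bucket name from the caller's AWS account ID.

    `sts_client` is expected to behave like boto3.client("sts").
    The default prefix is taken from the bucket naming used in this notebook.
    """
    # get_caller_identity returns a dict that includes the "Account" string.
    account_id = sts_client.get_caller_identity()["Account"]
    return f"{prefix}-{account_id}"


# Usage in the notebook (requires valid AWS credentials):
# import boto3
# s3_bucket = workshop_bucket_name(boto3.client("sts"))
```

This only removes the manual edit; it does not check that the bucket actually exists.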

## List input documents

List objects on the S3 bucket, iterate over the objects and print their keys (file names).

In [None]:
s3_client = boto3.client("s3")
response = s3_client.list_objects_v2(Bucket=s3_bucket)
file_objects = response["Contents"]

for obj in file_objects:
    print(obj["Key"])

s3_file_key = file_objects[0]["Key"]
s3_file_key
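
The listing cell above reads a single `list_objects_v2` response, which raises a `KeyError` on an empty bucket and returns at most one page of results. A slightly more defensive variant, sketched here with the hypothetical helper `list_keys`, uses the boto3 paginator and tolerates pages without a `Contents` entry.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return all object keys in `bucket` that start with `prefix`.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage in the notebook:
# keys = list_keys(s3_client, s3_bucket)
# s3_file_key = keys[0] if keys else None
```

Passing a `prefix` restricts the listing to one folder, which also keeps earlier job outputs written to the same bucket out of the result.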

![screenshots/2302.13971v1.png](screenshots/2302.13971v1.png)

---

# 2. PROCESSING

### Define Blueprint properties

To create a blueprint through the API, you define a blueprint name, a description, the blueprint type (`DOCUMENT` or `IMAGE`), and the blueprint stage (`LIVE` or `DEVELOPMENT`), along with the blueprint schema in JSON Schema format.

Feel free to modify and extend the properties, that is, the attributes to be extracted from the document! Note that you need to specify `type`, `inferenceType`, and `description` for each field. See [BDA documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/idp-cases-extraction.html) for the possible values.

In [21]:
blueprint_name = "test-brand-new-blueprint"
blueprint_description = "blueprint-description"
blueprint_type = "DOCUMENT"
blueprint_stage = "LIVE"

blueprint_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "documentClass": "custom-document-type",
    "description": blueprint_description,
    "definitions": {},
    "properties": {
        "summary": {
            "type": "string",
            "inferenceType": "generative",
            "description": "document summary",
        },
        "title": {
            "type": "string",
            "inferenceType": "generative",
            "description": "title of the document",
        },
        "language": {
            "type": "string",
            "inferenceType": "generative",
            "description": "language of the document",
        },
    },
}
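
Since every field needs `type`, `inferenceType`, and `description`, it can help to check the schema locally before sending it to BDA. The following is a minimal sketch; `find_incomplete_fields` is a hypothetical helper that only checks for the presence of these keys and does not validate their values.

```python
REQUIRED_FIELD_KEYS = ("type", "inferenceType", "description")


def find_incomplete_fields(schema):
    """Map each property name to the required keys it is missing.

    Returns an empty dict when every property defines all required keys.
    """
    problems = {}
    for name, spec in schema.get("properties", {}).items():
        missing = [key for key in REQUIRED_FIELD_KEYS if key not in spec]
        if missing:
            problems[name] = missing
    return problems


# Usage in the notebook:
# problems = find_incomplete_fields(blueprint_schema)
# if problems:
#     raise ValueError(f"Incomplete blueprint fields: {problems}")
```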

We will use the `create_blueprint` operation in the `boto3` API to create the blueprint, or `update_blueprint` if a blueprint with the same name already exists. You could also create or update blueprints using the AWS console. Each blueprint that you create is an AWS resource with its own blueprint ID and ARN.

In [None]:
list_blueprints_response = bda_client.list_blueprints(blueprintStageFilter="ALL")

blueprint = next(
    (
        blueprint
        for blueprint in list_blueprints_response["blueprints"]
        if "blueprintName" in blueprint and blueprint["blueprintName"] == blueprint_name
    ),
    None,
)

if not blueprint:
    response = bda_client.create_blueprint(
        blueprintName=blueprint_name,
        type=blueprint_type,
        blueprintStage=blueprint_stage,
        schema=json.dumps(blueprint_schema),
    )
    print(f"Creating new blueprint with name={blueprint_name}")
else:
    response = bda_client.update_blueprint(
        blueprintArn=blueprint["blueprintArn"],
        blueprintStage=blueprint_stage,
        schema=json.dumps(blueprint_schema),
    )
    print(
        f"Found existing blueprint with name={blueprint_name}, updating Stage and Schema"
    )

blueprint_arn = response["blueprint"]["blueprintArn"]
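
The lookup above inspects only the first `list_blueprints` response. In an account with many blueprints the match may sit on a later page. The sketch below assumes that `list_blueprints` accepts and returns a `nextToken` for pagination; please confirm this against the boto3 reference for your SDK version. `find_blueprint_by_name` is a hypothetical helper.

```python
def find_blueprint_by_name(bda_client, name, stage_filter="ALL"):
    """Return the first blueprint summary whose name matches, or None.

    `bda_client` is expected to behave like
    boto3.client("bedrock-data-automation").
    Assumption: pagination uses a `nextToken` request/response field.
    """
    kwargs = {"blueprintStageFilter": stage_filter}
    while True:
        page = bda_client.list_blueprints(**kwargs)
        for summary in page.get("blueprints", []):
            if summary.get("blueprintName") == name:
                return summary
        token = page.get("nextToken")
        if not token:
            # No more pages and no match found.
            return None
        kwargs["nextToken"] = token


# Usage in the notebook:
# blueprint = find_blueprint_by_name(bda_client, blueprint_name)
```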

### Invoke data automation job

Now that our blueprint has been set up, we can proceed to invoke data automation. Note that in addition to the input and output configuration, we also provide the blueprint ARN when calling the `invoke_data_automation_async` operation.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={"s3Uri": f"s3://{s3_bucket}/{s3_file_key}"},
    outputConfiguration={"s3Uri": f"s3://{s3_bucket}/bda-outputs"},
    blueprints=[{"blueprintArn": blueprint_arn}],
)

invocationArn = response["invocationArn"]
print(f"Invoked data automation job with invocation arn {invocationArn}")

### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` operation. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, at which point the response also includes the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [None]:
status_response = bda_runtime_client.get_data_automation_status(
    invocationArn=invocationArn
)
print(status_response)

job_metadata_s3_location = status_response["outputConfiguration"]["s3Uri"]
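
Because the job runs asynchronously, a single status call made right after invocation will usually return `Created` or `InProgress`, and the output location is only meaningful once the job has succeeded. One way to handle this is to poll until a terminal status is reached. The sketch below is one possible approach; `wait_for_job`, its default intervals, and the injectable `sleep` parameter are illustrative assumptions.

```python
import time

# Terminal statuses as described above.
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}


def wait_for_job(runtime_client, invocation_arn, poll_seconds=5,
                 timeout_seconds=600, sleep=time.sleep):
    """Poll get_data_automation_status until the job reaches a terminal status.

    Returns the final status response. Raises TimeoutError if the job is
    still running after roughly `timeout_seconds` of waiting.
    """
    waited = 0
    while True:
        response = runtime_client.get_data_automation_status(
            invocationArn=invocation_arn
        )
        if response["status"] in TERMINAL_STATUSES:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Job still {response['status']} after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in the notebook:
# status_response = wait_for_job(bda_runtime_client, invocationArn)
# if status_response["status"] != "Success":
#     raise RuntimeError(f"Job failed: {status_response}")
# job_metadata_s3_location = status_response["outputConfiguration"]["s3Uri"]
```

The helper returns on failure statuses as well, so the caller decides how to report `ServiceError` or `ClientError`.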

---

# 3. CHECK OUTPUTS

To explore the extracted attributes, we first need to identify the S3 location of the JSON file with the custom outputs. This is provided under the `"custom_output_path"` key in the job metadata loaded below:

In [None]:
s3_uri_parts = job_metadata_s3_location.removeprefix("s3://").split("/", 1)
response = s3_client.get_object(Bucket=s3_uri_parts[0], Key=s3_uri_parts[1])

job_metadata = json.loads(response["Body"].read().decode("utf-8"))
print(job_metadata)

custom_output_path = job_metadata["output_metadata"][0]["segment_metadata"][0][
    "custom_output_path"
]

The structure of the custom output is the same as that of the output produced when using a catalog blueprint. However, the `inference_result` now contains data that maps to the blueprint schema we provided to BDA with the `InvokeDataAutomationAsync` operation.

In [None]:
s3_uri_parts = custom_output_path.removeprefix("s3://").split("/", 1)
response = s3_client.get_object(Bucket=s3_uri_parts[0], Key=s3_uri_parts[1])

custom_outputs = json.loads(response["Body"].read().decode("utf-8"))
print(custom_outputs["inference_result"])
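
The two cells above repeat the same steps: split an `s3://` URI into bucket and key, fetch the object, and parse it as JSON. These steps can be factored into small helpers, sketched below. `split_s3_uri` and `read_s3_json` are hypothetical names, not part of the workshop code.

```python
import json


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # partition splits on the first "/" only, so the key keeps its slashes.
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def read_s3_json(s3_client, uri):
    """Download the object at `uri` and parse it as UTF-8 encoded JSON.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    bucket, key = split_s3_uri(uri)
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body.decode("utf-8"))


# Usage in the notebook:
# job_metadata = read_s3_json(s3_client, job_metadata_s3_location)
# custom_outputs = read_s3_json(s3_client, custom_output_path)
```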

Congrats! You ran the IDP pipeline using BDA as a managed service. How does the result compare to the custom extracted outputs from the previous notebooks?