Analysis Document State Machine

The Analysis Document State Machine runs the Amazon Textract service to extract tabular data from a document (PDF format) and indexes the extracted metadata into an Amazon OpenSearch cluster.

__

Execution input

The state machine execution input is similar to that of the Analysis Main State Machine, with additional fields generated by the Prepare analysis state.

{
    "input": {
        ...,
        "document": {
            "enabled": true,
            "prefix": "IMAGE_PROXIES_PREFIX",
            "numPages": 68
        },
        "request": {
            "timestamp": 1637743896177
        }
    }
}

Field	Description	Comments
input.document.enabled	Indicates that document analysis is required	Must be true
input.document.prefix	Location of the image proxies (PNG files) generated by the Ingest Document State Machine	Must exist
input.document.numPages	Total number of pages extracted by the Ingest Document State Machine	Must exist
input.request.timestamp	Request timestamp	If present, the timestamp (DATETIME) is included in the path used to store the raw analysis results
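The constraints in the table above can be checked before any analysis starts. The following is a minimal sketch, not the project's actual code: the function name `validate_document_input` and the shape of the returned dictionary are assumptions made for illustration.

```python
def validate_document_input(event):
    """Check the fields listed in the table and return the document settings.

    Hypothetical helper: raises ValueError when a required field is missing.
    """
    data = event.get("input", {})
    document = data.get("document", {})

    # input.document.enabled must be true
    if document.get("enabled") is not True:
        raise ValueError("input.document.enabled must be true")

    # input.document.prefix and input.document.numPages must exist
    prefix = document.get("prefix")
    if not prefix:
        raise ValueError("input.document.prefix must exist")
    num_pages = document.get("numPages")
    if num_pages is None:
        raise ValueError("input.document.numPages must exist")

    # input.request.timestamp is optional
    timestamp = data.get("request", {}).get("timestamp")

    return {"prefix": prefix, "numPages": num_pages, "timestamp": timestamp}
```

Calling the helper with the sample execution input shown earlier would return the prefix, the page count, and the timestamp; an input with `enabled` set to false would raise a `ValueError`.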

__

State: Analyze document

A state in which a Lambda function uses the Amazon Textract AnalyzeDocument API to extract tabular metadata from all pages of a PDF document. The raw JSON results are stored at s3://PROXY_BUCKET/UUID/FILE_BASENAME/raw/DATETIME/textract/XXX.json.
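As a rough illustration of this state, the sketch below calls AnalyzeDocument once per page image with the `TABLES` feature type. It assumes that each page is analyzed from its PNG image proxy in the proxy bucket; the function name `analyze_pages` and the `page_keys` parameter are hypothetical, and storing the raw results to S3 is left out.

```python
def analyze_pages(textract_client, bucket, page_keys):
    """Run AnalyzeDocument on each page image and collect the raw responses.

    textract_client: a Textract client, e.g. boto3.client("textract")
    bucket:          name of the bucket holding the page images
    page_keys:       object keys of the page images (hypothetical input)
    """
    results = {}
    for key in page_keys:
        # One synchronous AnalyzeDocument call per page image.
        results[key] = textract_client.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES"],
        )
    return results
```

Passing the client in as a parameter keeps the function easy to exercise with a stub client in tests, without calling the real service.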

__

State: More pages?

A Choice state that checks the $.status field. If it is set to COMPLETED, indicating that all pages have been processed, the state machine transitions to the next state, Index analysis results. Otherwise, it returns to the Analyze document state to process the rest of the document.
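The loop between the Analyze document state and this Choice state can be emulated in a few lines. This is only a sketch of the control flow: the `cursor` field, the `IN_PROGRESS` status value, and the batch size are assumptions for illustration, and only the `status` field and the `COMPLETED` value come from the description above.

```python
def analyze_batch(state, batch_size=10):
    """Pretend to analyze up to batch_size pages and update the status."""
    start = state.get("cursor", 0)  # hypothetical progress field
    end = min(start + batch_size, state["numPages"])
    status = "COMPLETED" if end >= state["numPages"] else "IN_PROGRESS"
    return {**state, "cursor": end, "status": status}


def next_state(state):
    """Mirror the Choice state: route on the status field."""
    if state.get("status") == "COMPLETED":
        return "Index analysis results"
    return "Analyze document"


def run(num_pages, batch_size=10):
    """Drive the loop and return the sequence of states visited."""
    state = {"numPages": num_pages}
    visited = []
    current = "Analyze document"
    while current == "Analyze document":
        visited.append(current)
        state = analyze_batch(state, batch_size)
        current = next_state(state)
    visited.append(current)
    return visited
```

With these assumptions, `run(68)` visits Analyze document until every page has been covered and then ends at Index analysis results.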

__

State: Index analysis results

A state in which a Lambda function downloads and parses the tabular data and indexes it into the Amazon OpenSearch cluster under the textract index.
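To give an idea of what parsing the tabular data involves, the sketch below rebuilds tables from the `Blocks` list of an AnalyzeDocument response by following the `CHILD` relationships from `TABLE` blocks to `CELL` blocks to `WORD` blocks. The function name `extract_tables` is hypothetical, merged cells and selection elements are ignored, and the shape of the document actually indexed into OpenSearch is not shown because it is not described here.

```python
def extract_tables(blocks):
    """Return each table as a list of rows, each row a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if any.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell):
        # Join the text of the WORD blocks that belong to the cell.
        children = [by_id[i] for i in child_ids(cell) if i in by_id]
        return " ".join(c["Text"] for c in children if c.get("BlockType") == "WORD")

    tables = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        cells = [
            by_id[i]
            for i in child_ids(block)
            if by_id.get(i, {}).get("BlockType") == "CELL"
        ]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        rows = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            rows[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        tables.append(rows)
    return tables
```

Each returned table is a plain list of rows, which could then be serialized and sent to the cluster by whatever indexing code the function uses.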

__

AWS Lambda function (analysis-document)

The analysis-document Lambda function implements the states of the Analysis Document State Machine. The following AWS X-Ray trace diagram illustrates the AWS resources this Lambda function communicates with.

__

IAM Role Permission

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:ListBucket",
            "Resource": "PROXY_BUCKET",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "PROXY_BUCKET/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem"
            ],
            "Resource": [
                "SERVICE_TOKEN_TABLE"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "textract:AnalyzeDocument",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "es:ESHttpGet",
                "es:ESHttpHead",
                "es:ESHttpPost",
                "es:ESHttpPut",
                "es:ESHttpDelete"
            ],
            "Resource": "OPENSEARCH_CLUSTER",
            "Effect": "Allow"
        }
    ]
}

__

Back to Analysis Main State Machine | Back to Table of contents
