Analysis Document State Machine

The Analysis Document State Machine uses the Amazon Textract service to extract tabular data from a document (PDF format) and indexes the metadata into an Amazon OpenSearch cluster.

Analysis Document state machine

__

Execution input

The state machine execution input is similar to that of the Analysis Main State Machine, with additional fields generated by the Prepare analysis state.

{
    "input": {
        ...,
        "document": {
            "enabled": true,
            "prefix": "IMAGE_PROXIES_PREFIX",
            "numPages": 68
        },
        "request": {
            "timestamp": 1637743896177
        }
    }
}
| Field | Description | Comments |
| :-- | :-- | :-- |
| input.document.enabled | indicates document analysis is required | Must be true |
| input.document.prefix | location of the image proxies (PNG files) generated from the Ingest Document State Machine | Must exist |
| input.document.numPages | total number of pages extracted from the Ingest Document State Machine | Must exist |
| input.request.timestamp | request timestamp | If present, the timestamp (DATETIME) is concatenated to the path to store the raw analysis results |
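The field requirements above can be checked before the state machine starts. The following is a minimal sketch, not the project's own code; the function name `validate_document_input` is hypothetical and only the fields listed in the table are inspected.

```python
def validate_document_input(event):
    """Return a list of problems found in the execution input (empty if valid).

    Hypothetical helper: it only mirrors the "Comments" column of the table.
    """
    document = event.get("input", {}).get("document", {})
    problems = []
    if document.get("enabled") is not True:
        problems.append("input.document.enabled must be true")
    if not document.get("prefix"):
        problems.append("input.document.prefix must exist")
    if "numPages" not in document:
        problems.append("input.document.numPages must exist")
    return problems


example = {
    "input": {
        "document": {
            "enabled": True,
            "prefix": "IMAGE_PROXIES_PREFIX",
            "numPages": 68,
        },
        "request": {"timestamp": 1637743896177},
    }
}
print(validate_document_input(example))  # prints []
```

The optional input.request.timestamp field is deliberately not validated, since the table only says what happens when it is present.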

__

State: Analyze document

A state where a Lambda function uses the Amazon Textract AnalyzeDocument API to extract tabular metadata from all pages of the PDF document. The raw JSON results are stored at s3://PROXY_BUCKET/UUID/FILE_BASENAME/raw/DATETIME/textract/XXX.json.
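As an illustration of this state, the sketch below builds the output key and calls AnalyzeDocument for one page image. It assumes a boto3-style Textract client is passed in (for example `boto3.client("textract")`), so no AWS import is needed here. The helper names and the `DATETIME` string format are assumptions; the source does not specify how the timestamp is rendered.

```python
from datetime import datetime, timezone


def raw_output_key(uuid, basename, name, timestamp_ms=None):
    """Build UUID/FILE_BASENAME/raw/DATETIME/textract/NAME.json (hypothetical helper)."""
    parts = [uuid, basename, "raw"]
    if timestamp_ms is not None:
        # Assumed DATETIME format; the actual format is not given in this README.
        moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
        parts.append(moment.strftime("%Y%m%dT%H%M%S"))
    parts += ["textract", f"{name}.json"]
    return "/".join(parts)


def analyze_page(textract, bucket, key):
    """Run AnalyzeDocument on one page image stored in S3, asking for tables only."""
    return textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
```

Passing the client in as an argument keeps the sketch testable with a stub and leaves credentials and region configuration to the caller.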

__

State: More pages?

A Choice state that checks the $.status field. If it is set to COMPLETED, indicating that all pages have been processed, the state machine transitions to the next state, Index analysis results. Otherwise, it returns to the Analyze document state to process the remaining pages.
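The loop formed by the Analyze document and More pages? states can be simulated as follows. This is a sketch under assumptions: `analyze_batch`, the `cursor` field, the batch size, and the `IN_PROGRESS` status value are all hypothetical stand-ins. Only the `COMPLETED` check and the two state names come from the description above.

```python
def next_state(state):
    """Mirror the Choice state: COMPLETED moves on, anything else loops back."""
    if state.get("status") == "COMPLETED":
        return "Index analysis results"
    return "Analyze document"


def analyze_batch(state, batch_size=10):
    """Hypothetical stand-in for Analyze document: process one batch of pages."""
    cursor = min(state.get("cursor", 0) + batch_size, state["numPages"])
    status = "COMPLETED" if cursor >= state["numPages"] else "IN_PROGRESS"
    return {**state, "cursor": cursor, "status": status}


def run(num_pages):
    """Drive the loop until the Choice state leaves Analyze document."""
    state = {"numPages": num_pages}
    current = "Analyze document"
    iterations = 0
    while current == "Analyze document":
        state = analyze_batch(state)
        iterations += 1
        current = next_state(state)
    return current, iterations


print(run(68))  # prints ('Index analysis results', 7)
```

With 68 pages and an assumed batch of 10 pages per invocation, the Analyze document state runs seven times before the Choice state moves on.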

__

State: Index analysis results

A state where a Lambda function downloads and parses the tabular data and indexes it into the Amazon OpenSearch cluster under the textract index.
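To illustrate the parsing step, the sketch below turns the `Blocks` list of a raw AnalyzeDocument result into rows of cell text. It relies on the documented Textract block shape (`TABLE`, `CELL`, and `WORD` blocks linked by `CHILD` relationships). The function name `extract_tables` is hypothetical, and how the project actually shapes the documents it indexes is not described here.

```python
def extract_tables(blocks):
    """Return each TABLE block as a list of rows, each row a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def children(block):
        # Follow CHILD relationships to the referenced blocks.
        for relationship in block.get("Relationships") or []:
            if relationship["Type"] == "CHILD":
                for child_id in relationship["Ids"]:
                    if child_id in by_id:
                        yield by_id[child_id]

    tables = []
    for table in blocks:
        if table["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell in children(table):
            if cell["BlockType"] != "CELL":
                continue
            words = [w["Text"] for w in children(cell) if w["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based in Textract results.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        row_count = max((row for row, _ in cells), default=0)
        col_count = max((col for _, col in cells), default=0)
        tables.append(
            [
                [cells.get((row, col), "") for col in range(1, col_count + 1)]
                for row in range(1, row_count + 1)
            ]
        )
    return tables
```

Each returned table is a plain list of lists, which can then be serialized into whatever document structure is indexed under the textract index.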

__

AWS Lambda function (analysis-document)

The analysis-document Lambda function implements the different states of the Analysis Document state machine. The following AWS X-Ray trace diagram illustrates the AWS resources this Lambda function communicates with.

Analysis Document Lambda function

__

IAM Role Permission

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:ListBucket",
            "Resource": "PROXY_BUCKET",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "PROXY_BUCKET/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem"
            ],
            "Resource": [
                "SERVICE_TOKEN_TABLE"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "textract:AnalyzeDocument",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "es:ESHttpGet",
                "es:ESHttpHead",
                "es:ESHttpPost",
                "es:ESHttpPut",
                "es:ESHttpDelete"
            ],
            "Resource": "OPENSEARCH_CLUSTER",
            "Effect": "Allow"
        }
    ]
}

__

Back to Analysis Main State Machine | Back to Table of contents