The Analysis Document State Machine uses Amazon Textract to extract tabular data from a document (PDF format) and indexes the resulting metadata into an Amazon OpenSearch cluster.
__
The state execution input is similar to that of the Analysis Main State Machine, with additional fields generated by the Prepare analysis state.

    {
      "input": {
        ...,
        "document": {
          "enabled": true,
          "prefix": "IMAGE_PROXIES_PREFIX",
          "numPages": 68
        },
        "request": {
          "timestamp": 1637743896177
        }
      }
    }

Field | Description | Comments |
---|---|---|
input.document.enabled | Indicates that document analysis is required | Must be true |
input.document.prefix | Location of the image proxies (PNG files) generated by the Ingest Document State Machine | Must exist |
input.document.numPages | Total number of pages extracted by the Ingest Document State Machine | Must exist |
input.request.timestamp | Request timestamp | If present, the timestamp (DATETIME) is appended to the path where the raw analysis results are stored |
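The field requirements in the table can be illustrated with a small check. This is a minimal sketch only, not the solution's own code: the function name `validate_document_input` is hypothetical, and treating `numPages` as an integer is an assumption based on the sample input.

```python
def validate_document_input(event):
    """Return a list of problems found in the execution input (empty if valid).

    Hypothetical helper: it mirrors the "Comments" column of the table above.
    """
    problems = []
    document = event.get("input", {}).get("document", {})

    # input.document.enabled must be true
    if document.get("enabled") is not True:
        problems.append("input.document.enabled must be true")

    # input.document.prefix must exist (assumed here to be a non-empty string)
    if not document.get("prefix"):
        problems.append("input.document.prefix must exist")

    # input.document.numPages must exist (assumed here to be an integer)
    num_pages = document.get("numPages")
    if isinstance(num_pages, bool) or not isinstance(num_pages, int):
        problems.append("input.document.numPages must exist")

    return problems


if __name__ == "__main__":
    sample = {
        "input": {
            "document": {
                "enabled": True,
                "prefix": "IMAGE_PROXIES_PREFIX",
                "numPages": 68,
            },
            "request": {"timestamp": 1637743896177},
        }
    }
    print(validate_document_input(sample))
```

The `input.request.timestamp` field is not checked because the table describes it as optional.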
__
A state where a Lambda function uses the Amazon Textract AnalyzeDocument API to extract tabular metadata from all pages of the PDF document. The raw JSON results are stored at s3://PROXY_BUCKET/UUID/FILE_BASENAME/raw/DATETIME/textract/XXX.json.
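The output location can be sketched as a key-building helper. Everything below is an assumption for illustration: the function name `raw_results_key`, the `page_name` parameter standing in for `XXX`, and in particular the `DATETIME` format, which the text does not specify. Omitting the `DATETIME` segment when no timestamp is given is also an assumption.

```python
from datetime import datetime, timezone


def raw_results_key(uuid, file_basename, page_name, timestamp_ms=None):
    """Build an object key shaped like UUID/FILE_BASENAME/raw/DATETIME/textract/XXX.json.

    Hypothetical helper. The DATETIME format used here (UTC, YYYYMMDDTHHMMSS)
    is an assumed format, not one confirmed by the documentation.
    """
    parts = [uuid, file_basename, "raw"]
    if timestamp_ms is not None:
        # The request timestamp in the sample input is in epoch milliseconds.
        moment = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
        parts.append(moment.strftime("%Y%m%dT%H%M%S"))
    parts.extend(["textract", f"{page_name}.json"])
    return "/".join(parts)


if __name__ == "__main__":
    print(raw_results_key("UUID", "FILE_BASENAME", "XXX", 1637743896177))
```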
__
A Choice state that checks the $.status field. If it is set to COMPLETED, indicating that all pages have been processed, the state machine transitions to the Index analysis results state. Otherwise, it returns to the Analyze document state to process the rest of the document.
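The loop between the Analyze document state and this Choice state can be simulated locally. This is a rough sketch under stated assumptions: the `cursor` field, the `PROCESSING` status value, and the batch size of 10 pages per invocation are all hypothetical; only the `COMPLETED` status and the state names come from the text.

```python
def analyze_document_step(state, pages_per_invocation=10):
    """Simulate one 'Analyze document' invocation that processes a batch of pages.

    'cursor', 'PROCESSING', and the batch size are illustrative assumptions.
    """
    total = state["numPages"]
    cursor = min(state.get("cursor", 0) + pages_per_invocation, total)
    status = "COMPLETED" if cursor >= total else "PROCESSING"
    return {**state, "cursor": cursor, "status": status}


def run_state_machine(num_pages):
    """Return the sequence of states visited for a document with num_pages pages."""
    state = {"numPages": num_pages}
    visited = []
    while True:
        visited.append("Analyze document")
        state = analyze_document_step(state)
        # Choice state: leave the loop only when $.status is COMPLETED.
        if state["status"] == "COMPLETED":
            break
    visited.append("Index analysis results")
    return visited


if __name__ == "__main__":
    for name in run_state_machine(68):
        print(name)
```

Splitting the work across repeated invocations keeps each Lambda run short, which is one plausible reason for looping through a Choice state rather than processing every page in a single call.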
__
A state where a Lambda function downloads and parses the tabular data and indexes it into the Amazon OpenSearch cluster under the textract index.
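Parsing the tabular data could look roughly like the following. It is a sketch, not the solution's parser: the function name `tables_from_blocks` is hypothetical. It relies on the structure of an AnalyzeDocument response, where `TABLE` blocks reference `CELL` blocks and `CELL` blocks reference `WORD` blocks through `CHILD` relationships. The shape of the document actually written to the textract index is not described here, so the sketch stops at plain rows of cell text.

```python
def tables_from_blocks(blocks):
    """Convert Textract blocks into a list of tables, each a list of rows of cell text."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the Ids from every CHILD relationship of a block.
        ids = []
        for relationship in block.get("Relationships") or []:
            if relationship["Type"] == "CHILD":
                ids.extend(relationship["Ids"])
        return ids

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        row_count = max((row for row, _ in cells), default=0)
        column_count = max((column for _, column in cells), default=0)
        tables.append(
            [
                [cells.get((row, column), "") for column in range(1, column_count + 1)]
                for row in range(1, row_count + 1)
            ]
        )
    return tables
```

Each returned table is a simple two-dimensional list, which can then be mapped to whatever document shape the index expects.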
__
The analysis-document Lambda function implements the different states of the Analysis Document State Machine. The following AWS X-Ray trace diagram illustrates the AWS resources this Lambda function communicates with.
__

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": "s3:ListBucket",
          "Resource": "PROXY_BUCKET",
          "Effect": "Allow"
        },
        {
          "Action": [
            "s3:GetObject",
            "s3:PutObject"
          ],
          "Resource": "PROXY_BUCKET/*",
          "Effect": "Allow"
        },
        {
          "Action": [
            "dynamodb:DescribeTable",
            "dynamodb:Scan",
            "dynamodb:Query",
            "dynamodb:UpdateItem",
            "dynamodb:DeleteItem"
          ],
          "Resource": [
            "SERVICE_TOKEN_TABLE"
          ],
          "Effect": "Allow"
        },
        {
          "Action": "textract:AnalyzeDocument",
          "Resource": "*",
          "Effect": "Allow"
        },
        {
          "Action": [
            "es:ESHttpGet",
            "es:ESHttpHead",
            "es:ESHttpPost",
            "es:ESHttpPut",
            "es:ESHttpDelete"
          ],
          "Resource": "OPENSEARCH_CLUSTER",
          "Effect": "Allow"
        }
      ]
    }
