aws-samples/sample-multi-lingual-document-processor

Multi-lingual Document Processor (MDP)

🌟 Overview

The Multi-lingual Document Processor (MDP) is a cloud-native solution that transforms complex multi-modal documents into structured, narrative text optimized for AI and RAG applications. Unlike traditional OCR tools, MDP is designed for documents with varied layouts, multimedia elements, and non-Latin scripts. MDP localises the multi-modal components on each page, extracts information from each component, and then assembles the results into a narrative that preserves where each piece of content appeared. Extracted images and tables are also saved in separate folders for downstream applications that need both structured and unstructured information.

🎥 Video Demo

video_demo.mp4

🚀 Key Features

🌍 Multi-language Support

  • Supported Languages: English, Japanese, Korean
  • Coming Soon: Thai, Hindi

📄 Document Types

  • Reports: Multimedia documents with images, tables, charts
  • Invoices: Tabular structure documents
  • Coming Soon: Magazines, newspapers

🤖 AI-Powered Processing

  • Amazon Bedrock Integration: Advanced LLM-based document analysis
  • Component Extraction: Text, images, tables, charts, infographics
  • Narrative Generation: Converts all elements to descriptive text for downstream RAG applications
  • Confidence Scoring: Reliability assessment for extracted content

🏗️ Architecture

  • Serverless: AWS Lambda + SageMaker Processing Jobs
  • Containerized: Docker-based processing with ECR
  • API-First: RESTful API with authentication
  • Scalable: Auto-scaling based on demand

🏛️ System Architecture

MDP Architecture

Architecture Overview

The Multi-lingual Document Processor follows a serverless architecture designed for scalability, reliability, and cost-effectiveness:

🌐 Frontend Layer

  • API Gateway: RESTful API endpoint with API key authentication
  • Client Applications: Python client (mdp_client.py) and direct API access
  • Authentication: Secure API key-based access control

⚡ Processing Layer

  • Lambda Function: Lightweight orchestrator that receives requests and manages workflow
  • SageMaker Processing Jobs: Heavy-duty document processing in containerized environments
  • Docker Containers: Custom-built images with all dependencies (stored in ECR)

🤖 AI/ML Services

  • Amazon Bedrock: Advanced LLM services for document analysis and narrative generation
  • Amazon Textract: OCR and table extraction for structured data
  • Custom AI Models: Specialized processors for different document types and languages

💾 Storage & Data

  • S3 Buckets:
    • Input documents storage (uploads/, invoices/, etc.)
    • Output results storage (output/)
    • Lambda deployment packages and layers
  • ECR Repository: Docker image storage and versioning

🔧 Infrastructure

  • CloudFormation: Infrastructure as Code for reproducible deployments
  • IAM Roles: Secure, least-privilege access between services
  • CloudWatch: Logging and monitoring for all components

🔄 Processing Flow

The picture below illustrates the processing workflow.

MDP Workflow

  1. 📤 Document Upload: Users upload PDF documents to S3 bucket folders
  2. 🚀 API Request: Client submits processing request via API Gateway
  3. ⚡ Lambda Orchestration: Lambda function validates request and initiates SageMaker job
  4. 🐳 Container Processing: SageMaker spins up Docker container with processing logic
  5. 🤖 AI Analysis: Container uses Bedrock, Textract, CV algorithms and custom algorithms for document analysis
  6. 📝 Content Generation: AI generates narrative descriptions of all document components
  7. 💾 Output Storage: Results saved to S3 in structured format
  8. ✅ Completion: Client receives confirmation and can access processed results
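
Steps 3 and 4 can be pictured as the Lambda function turning an API payload into a SageMaker Processing job request. The sketch below is illustrative only and is not the repository's `lambda_function.py`: the function name, the job-name prefix, the instance type, the volume size, and the environment variable names are all assumptions. It only builds the request dictionary; a real handler would pass it to `boto3.client("sagemaker").create_processing_job(**request)`.

```python
import re
import time

REQUIRED_FIELDS = ("file_key", "bucket", "doc_category", "doc_language", "output_language")


def build_processing_job_request(payload, image_uri, role_arn, now=None):
    """Validate an API payload and build a SageMaker Processing job request.

    Hypothetical sketch: the job-name prefix, instance type, volume size and
    environment variable names are assumptions, not values from the MDP source.
    """
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")

    # Processing job names allow only alphanumerics and hyphens (max 63 chars),
    # so the folder path is sanitised and truncated before the timestamp is added.
    timestamp = int(now if now is not None else time.time())
    folder = re.sub(r"[^a-zA-Z0-9]+", "-", payload["file_key"]).strip("-")
    job_name = f"mdp-{folder[:40]}-{timestamp}"

    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",  # assumed instance type
                "VolumeSizeInGB": 30,            # assumed volume size
            }
        },
        # Request parameters handed to the container (assumed variable names).
        "Environment": {
            "BUCKET": payload["bucket"],
            "FILE_KEY": payload["file_key"],
            "DOC_CATEGORY": payload["doc_category"],
            "DOC_LANGUAGE": payload["doc_language"],
            "OUTPUT_LANGUAGE": payload["output_language"],
        },
    }
```

Keeping the request construction separate from the AWS call makes the validation logic testable without credentials.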

🔒 Security Features

  • API Authentication: API key-based access control
  • IAM Roles: Service-to-service authentication with minimal permissions
  • VPC Support: Optional network isolation for enhanced security
  • Encryption: Data encrypted at rest (S3, ECR) and in transit (HTTPS)

📈 Scalability & Performance

  • Auto-scaling: SageMaker automatically scales based on demand
  • Concurrent Processing: Multiple documents can be processed simultaneously
  • Resource Optimization: Pay-per-use model with automatic resource management
  • Global Deployment: Can be deployed in any AWS region supporting required services

📊 Input / Output Structure

✅ Input Structure

S3 Bucket Organization

s3://your-bucket-name/
├── japanese-reports/          # 📁 Input documents folder
│   ├── document1.pdf          # Individual PDF files
│   └── document2.pdf          # Other PDF files
├── english-invoices/          # 📁 Invoice folder
│   ├── invoice_001.pdf
│   └── invoice_002.pdf
└── english-reports/           # 📁 Report folder
    ├── annual_report.pdf
    └── quarterly_report.pdf

Document Requirements

  • Format: PDF files only (.pdf extension)
  • Size Limits:
    • Recommended: < 50MB per file
    • Maximum: 100MB per file
    • Pages: Up to 50 pages per document
  • Content Types: Text, images, tables, charts, infographics
  • Quality: Minimum 150 DPI for optimal OCR results
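
The size and page limits above can be checked locally before uploading. The following is a minimal sketch, not MDP's own validation code: the names `check_document` and `RequirementCheck` are hypothetical, and it assumes 1MB means 1024 × 1024 bytes and that the caller already knows the page count.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024                  # assumption: binary megabytes
RECOMMENDED_MAX_BYTES = 50 * MB   # "Recommended: < 50MB per file"
HARD_MAX_BYTES = 100 * MB         # "Maximum: 100MB per file"
MAX_PAGES = 50                    # "Up to 50 pages per document"


@dataclass
class RequirementCheck:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self):
        # A document is acceptable when it has no hard errors.
        return not self.errors


def check_document(filename, size_bytes, page_count):
    """Check one document against the documented format, size and page limits."""
    result = RequirementCheck()
    if not filename.lower().endswith(".pdf"):
        result.errors.append("only .pdf files are supported")
    if size_bytes > HARD_MAX_BYTES:
        result.errors.append("file exceeds the 100MB maximum")
    elif size_bytes >= RECOMMENDED_MAX_BYTES:
        result.warnings.append("file is not under the recommended 50MB")
    if page_count > MAX_PAGES:
        result.errors.append("document exceeds 50 pages")
    return result
```

For example, `check_document("annual_report.pdf", 60 * MB, 30)` passes but carries a size warning, while a 60-page file is rejected.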

Supported Document Categories

| Category | Description | Optimization |
| --- | --- | --- |
| report | Multimedia documents with mixed content | General-purpose processing |
| invoice | Tabular documents with structured data | Table extraction focused |
| magazine | Magazine-style layouts (Coming Soon) | Layout-aware processing |
| newspaper | Newspaper formats (Coming Soon) | Column-aware processing |

Language Support

| Language | Code | Status | Script Type |
| --- | --- | --- | --- |
| English | english | ✅ Production | Latin |
| Japanese | japanese | ✅ Production | Hiragana, Katakana, Kanji |
| Korean | korean | ✅ Production | Hangul |
| Thai | thai | 🔄 Coming Soon | Thai script |
| Hindi | hindi | 🔄 Coming Soon | Devanagari |

API Request Format

{
  "file_key": "uploads",              // S3 folder path (required)
  "bucket": "your-bucket-name",       // S3 bucket name (required)
  "doc_category": "report",           // Document category (required)
  "doc_language": "japanese",         // Source language (required)
  "output_language": "english"        // Target language (required)
}
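
The `//` comments in the request format above are annotations; a real request body must be plain JSON without comments. As a minimal sketch, the body could be assembled in Python like this (the helper name `build_payload` is hypothetical and not part of `mdp_client.py`):

```python
import json


def build_payload(folder, bucket, category, language, output_language):
    """Return the request body as a JSON string with the five required fields."""
    payload = {
        "file_key": folder,                  # S3 folder path
        "bucket": bucket,                    # S3 bucket name
        "doc_category": category,            # document category
        "doc_language": language,            # source language
        "output_language": output_language,  # target language
    }
    empty = [name for name, value in payload.items() if not value]
    if empty:
        raise ValueError(f"empty required fields: {', '.join(empty)}")
    return json.dumps(payload)


body = build_payload("uploads", "your-bucket-name", "report", "japanese", "english")
print(body)
```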

Client Usage Examples

# Process Japanese reports in uploads folder
python mdp_client.py --stack-name your-stack \
  --folder uploads \
  --language japanese \
  --category report \
  --output-language english

# Process Korean invoices in specific folder
python mdp_client.py --stack-name your-stack \
  --folder invoices/korean \
  --language korean \
  --category invoice \
  --output-language english

# Process only files with the .pdf extension
python mdp_client.py --stack-name your-stack \
  --folder documents \
  --extensions pdf \
  --language english

Input Validation

The system automatically validates:

  • ✅ File format (PDF only)
  • ✅ File accessibility in S3
  • ✅ Language code validity
  • ✅ Category support
  • ✅ File size limits
  • ⚠️ Content quality (warns if low resolution)
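
The language and category checks in this list could look roughly like the sketch below. It is an assumption about the behaviour, not the service's actual validation code: the function name and messages are hypothetical, and the sets simply mirror the language and category tables in this README.

```python
SUPPORTED_LANGUAGES = {"english", "japanese", "korean"}
COMING_SOON_LANGUAGES = {"thai", "hindi"}
SUPPORTED_CATEGORIES = {"report", "invoice"}
COMING_SOON_CATEGORIES = {"magazine", "newspaper"}


def validate_request(doc_language, doc_category):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []

    language = doc_language.strip().lower()
    if language in COMING_SOON_LANGUAGES:
        problems.append(f"language '{language}' is not available yet")
    elif language not in SUPPORTED_LANGUAGES:
        problems.append(f"unknown language code '{language}'")

    category = doc_category.strip().lower()
    if category in COMING_SOON_CATEGORIES:
        problems.append(f"category '{category}' is not available yet")
    elif category not in SUPPORTED_CATEGORIES:
        problems.append(f"unknown category '{category}'")

    return problems
```

Returning every problem at once, rather than raising on the first one, lets a client report all issues in a single pass.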

Best Practices for Input

  1. File Organization: Group similar documents in folders by category
  2. Naming Convention: Use descriptive filenames (e.g., invoice_2024_001.pdf)
  3. Quality: Ensure documents are clear and readable
  4. Size: Keep files under 50MB for optimal processing speed
  5. Language Consistency: Process documents of the same language together

📁 Output Structure

s3://bucket/output/document-name/
├── final_report.txt           # 📄 Complete narrative description
├── processing_report.json     # 📋 Processing metadata
├── images/                    # 🖼️  Extracted images
│   ├── image_page_1_1.png
│   └── image_page_2_1.png
└── tables/                    # 📊 Extracted tables
    ├── table_page_5_1.csv
    └── table_page_6_1.csv
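
Downstream code often needs to map extracted images and tables back to their source page. The sketch below parses the file names shown in the tree above. The pattern `{kind}_page_{page}_{index}` is inferred from those examples only; treating the last number as a per-page index is an assumption, and both function names are hypothetical.

```python
import re

# Inferred from names such as image_page_1_1.png and table_page_5_1.csv.
ARTIFACT_PATTERN = re.compile(r"^(image|table)_page_(\d+)_(\d+)\.(png|csv)$")


def parse_artifact_name(filename):
    """Split e.g. 'table_page_5_1.csv' into (kind, page, index), or return None."""
    match = ARTIFACT_PATTERN.match(filename)
    if match is None:
        return None
    kind, page, index, _extension = match.groups()
    return kind, int(page), int(index)


def group_by_page(keys):
    """Map page number to the sorted artifact file names found for that page."""
    pages = {}
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # drop any S3 prefix
        parsed = parse_artifact_name(name)
        if parsed is not None:
            pages.setdefault(parsed[1], []).append(name)
    return {page: sorted(names) for page, names in sorted(pages.items())}
```

Files that do not match the pattern, such as `final_report.txt`, are simply skipped.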

🚀 Things to watch out for

  • MDP does not classify documents by category. Before processing, separate documents by type (report or invoice) into separate folders in the S3 bucket that is created when you deploy the stack.
  • The deployment stack first creates an S3 bucket to store all artifacts for the CDK deployment. Place all your data in a new folder (any name) directly under this bucket, and provide the folder name in the payload when calling the solution. Create separate folders for different document categories.
  • MDP version 1.0 can be deployed in regions where BDA (Amazon Bedrock Data Automation) is available, although BDA is used only to handle English reports. Update deploy_dynamic.sh with the right region, and ensure Claude 3.7 Sonnet is accessible in your account in that region. (Regions where BDA is available: us-east-1, us-west-2, us-gov-west-1, eu-central-1, eu-west-1, eu-west-2, ap-south-1, ap-southeast-2)
  • If you already have a stack deployed in an AWS region, delete it before deploying another stack; otherwise role creation may clash and the new stack may fail with errors.
  • To modify the code and build your own custom image, edit the code in the src folder, delete the images folder that was created when you ran the ./build_sagemaker_prebuilt.sh script, and then run ./build_sagemaker_prebuilt.sh again. This rebuilds the Docker image in the images folder.
  • To manage cost, first process a representative document, estimate the cost from its token information, and only then run the full workload.
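
The cost advice above amounts to a simple extrapolation from one document to the whole workload. A minimal sketch follows; the function name is hypothetical, and the token counts and per-1,000-token prices in the usage lines are placeholders, not actual Bedrock pricing or measured MDP usage. It covers model token charges only, not SageMaker, Textract, or S3 costs.

```python
def estimate_workload_cost(input_tokens, output_tokens, documents,
                           input_price_per_1k, output_price_per_1k):
    """Extrapolate the token cost of one representative document to a workload."""
    per_document = (
        (input_tokens / 1000) * input_price_per_1k
        + (output_tokens / 1000) * output_price_per_1k
    )
    return per_document * documents


# Placeholder numbers: replace with the token counts reported for your
# representative document and the current prices for your model and region.
estimate = estimate_workload_cost(
    input_tokens=120_000,
    output_tokens=15_000,
    documents=200,
    input_price_per_1k=0.003,
    output_price_per_1k=0.015,
)
print(f"Estimated model cost: ${estimate:.2f}")
```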

⚡ Quick Start (5 Minutes)

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Docker running
  • Python 3.10+

0. Check for prerequisites

aws sts get-caller-identity

docker info >/dev/null 2>&1 && echo "Docker is running" || echo "Docker is not running"

1. Deploy the Stack

git clone <repository-url>
cd multi-lingual-document-processor
chmod +x build_sagemaker_prebuilt.sh && ./build_sagemaker_prebuilt.sh
chmod +x deploy_dynamic.sh && ./deploy_dynamic.sh

# Record YOUR-STACK-NAME and YOUR-BUCKET-NAME as environment variables

2. Upload Test Document

# Use the bucket name from deployment output
aws s3 cp samples/<your filename>.pdf s3://YOUR-BUCKET-NAME/<uploads folder>/

3. Process Document

# Use the stack name from deployment output
python mdp_client.py --stack-name YOUR-STACK-NAME \
  --folder uploads \
  --language japanese \
  --category report \
  --output-language japanese

4. Check Results

# Monitor processing
python mdp_client.py --stack-name YOUR-STACK-NAME --check-status

# View output files
aws s3 ls s3://YOUR-BUCKET-NAME/output/ --recursive

5. Cleaning Up

# List all MDP stacks
./cleanup_stack.sh --list-stacks
# Clean up a specific stack (with confirmation) 
./cleanup_stack.sh mdp-stack-user-abc123 
# Force cleanup without prompts 
./cleanup_stack.sh mdp-stack-user-abc123 --force 
# Clean up in different region 
./cleanup_stack.sh mdp-stack-user-abc123 --region us-west-2 
# Show help 
./cleanup_stack.sh --help

🛠️ Deployment Details

Automated Deployment

The deploy_dynamic.sh script handles everything:

  • ✅ Resource Creation: S3 bucket, ECR repository, unique naming
  • ✅ Docker Build: Platform-specific image building (ARM64/AMD64)
  • ✅ Image Push: Automated ECR push with authentication
  • ✅ Infrastructure: CloudFormation stack with all AWS resources
  • ✅ Validation: Template validation and deployment verification

Generated Resources

Stack Name:     mdp-stack-{user}-{random}
S3 Bucket:      mdp-bucket-{user}-{random}
ECR Repository: mdp-processor-{user}-{random}
API Gateway:    https://{id}.execute-api.us-east-1.amazonaws.com/prod/process
Lambda Function: {stack-name}-MDPDocumentProcessor

AWS Services Used

  • Lambda: Document processing orchestration
  • SageMaker: Heavy processing jobs in containers
  • S3: Document storage and output
  • ECR: Docker image registry
  • API Gateway: RESTful API with authentication
  • Bedrock: AI/ML document analysis
  • CloudFormation: Infrastructure as Code

📋 Usage Examples

Process Japanese Invoice

python mdp_client.py --stack-name mdp-stack-user-abc123 \
  --process-folder invoices \
  --language japanese \
  --output-language english \
  --category invoice

Process Korean Report

python mdp_client.py --stack-name mdp-stack-user-abc123 \
  --process-folder documents \
  --language korean \
  --output-language english \
  --category report

List Files in Bucket

python mdp_client.py --stack-name mdp-stack-user-abc123 --list-folder uploads

Check Processing Status

python mdp_client.py --stack-name mdp-stack-user-abc123 --check-status

🔍 Monitoring & Troubleshooting

Check Processing Jobs

# List recent SageMaker jobs
aws sagemaker list-processing-jobs --sort-by CreationTime --sort-order Descending --max-items 5

# Get job details
aws sagemaker describe-processing-job --processing-job-name JOB-NAME

View Logs

# Lambda logs
aws logs describe-log-streams --log-group-name "/aws/lambda/STACK-NAME-MDPDocumentProcessor"

# SageMaker logs
aws logs filter-log-events --log-group-name "/aws/sagemaker/ProcessingJobs" --log-stream-name-prefix "JOB-NAME"
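
The same job check can be scripted in Python. The sketch below mirrors the `list-processing-jobs` CLI call above and takes the SageMaker client as an argument so it can be exercised without AWS access. The helper names are hypothetical; `list_processing_jobs` and its `SortBy`, `SortOrder`, and `MaxResults` parameters are part of the boto3 SageMaker client.

```python
def recent_processing_jobs(sagemaker_client, limit=5):
    """Return (name, status) pairs for the most recently created processing jobs."""
    response = sagemaker_client.list_processing_jobs(
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=limit,
    )
    return [
        (job["ProcessingJobName"], job["ProcessingJobStatus"])
        for job in response.get("ProcessingJobSummaries", [])
    ]


def failed_jobs(jobs):
    """Filter the (name, status) pairs down to the names of failed jobs."""
    return [name for name, status in jobs if status == "Failed"]
```

With boto3 installed and credentials configured, it would be called as `recent_processing_jobs(boto3.client("sagemaker"))`; the names it returns can then be passed to `describe-processing-job` or used as the log stream prefix.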

Common Issues

| Issue | Solution |
| --- | --- |
| Docker build fails | Use Dockerfile.sagemaker (most reliable) |
| Processing timeout | Check document size (< 50MB recommended) |
| No output files | Verify input files exist in S3 |
| API authentication | Check API key in CloudFormation outputs |

📁 Project Structure

multi-lingual-document-processor/
├── 📄 README.md                    # This comprehensive guide
├── 🚀 deploy_dynamic.sh            # ⭐ Main deployment script
├── 🐳 Dockerfile.sagemaker         # ⭐ Production Docker image
├── 🐳 Dockerfile.arm64             # ARM64 optimized image
├── 🔧 mdp_client.py                # ⭐ Python client for API
├── ⚙️  lambda_function.py           # Lambda entry point
├── 🔄 processing_script_clean.py   # Main processing logic
├── 📋 CFT_template_dynamic.yml     # CloudFormation template
├── 📋 buildspec.yml                # CodeBuild specification
├── 📋 requirements.txt             # Python dependencies
├── 📋 stack_config.json            # Configuration file
├── 🔧 load_prebuilt_image.sh       # Load pre-built Docker images
├── 🖼️  MDP_2.png                   # Architecture diagram
├── 📄 LICENSE                      # MIT License
├── 📁 src/                         # ⭐ Source code modules
│   ├── config/                     # Configuration management
│   │   ├── config.py               # Main configuration
│   │   └── prompts.py              # AI prompts
│   ├── processors/                 # Document processors
│   │   ├── document_processor.py   # Main document processor
│   │   ├── image_processor.py      # Image analysis
│   │   ├── table_processor.py      # Table extraction
│   │   ├── text_processor.py       # Text processing
│   │   └── [other processors]      # Specialized processors
│   └── utils/                      # Utility functions
│       ├── cleanup_utility.py      # Cleanup tools
│       ├── directory_manager.py    # File management
│       └── postprocessing.py       # Output processing
├── 📁 Layers/                      # ⭐ Lambda layers (dependencies)
│   ├── SM_layer_new-*.zip          # SageMaker layer
│   ├── pandas_layer_new-*.zip      # Pandas layer
│   ├── numpy_layer_new-*.zip       # NumPy layer
│   ├── pydantic_layer-*.zip        # Pydantic layer
│   └── rpds_layer_new-*.zip        # RPDS layer
├── 📁 samples/                     # ⭐ Test documents
│   └── Japanese_doc_shorter_images.pdf
└── 📁 images/                      # Pre-built Docker images
    ├── mdp-sagemaker_latest_sagemaker_amd64.tar.gz
    └── README.md                   # Usage instructions

⭐ Key Files for Users

  • deploy_dynamic.sh - One-command deployment
  • mdp_client.py - Easy-to-use Python client
  • Dockerfile.sagemaker - Production-ready container
  • src/ - All processing logic and configuration
  • Layers/ - Pre-built dependencies for Lambda
  • samples/ - Test documents to verify deployment

🔧 Advanced Configuration

Custom Processing Parameters

Edit src/config/config.py:

# Model settings
BEDROCK_MODEL = "claude-3-7-sonnet-20250219-v1:0"
PROCESSING_TIMEOUT = 900  # seconds
MEMORY_SIZE = 1024        # MB

# Language-specific settings
LANGUAGE_MODELS = {
    "japanese": "claude-3-sonnet",
    "korean": "claude-3-sonnet",
    "english": "claude-3-haiku"
}

Custom Prompts

Edit src/config/prompts.py:

# Document analysis prompts
DOCUMENT_ANALYSIS_PROMPT = """
Analyze this document and provide a detailed narrative...
"""

# Table extraction prompts
TABLE_EXTRACTION_PROMPT = """
Extract and describe the tabular data...
"""

🧪 API Reference

Submit Processing Job

curl -X POST https://API-ENDPOINT/prod/process \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR-API-KEY" \
  -d '{
    "file_key": "uploads",
    "bucket": "your-bucket-name",
    "doc_category": "report",
    "doc_language": "japanese",
    "output_language": "english"
  }'

Response

{
  "message": "Document processing request accepted",
  "requestId": "12345-abcde-67890"
}
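
For clients that cannot shell out to curl, the same request can be built with the Python standard library. This is a minimal sketch, not the implementation in `mdp_client.py`: the function names are hypothetical, and the endpoint and API key are placeholders that come from your own deployment outputs.

```python
import json
import urllib.request


def build_process_request(endpoint, api_key, payload):
    """Build (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        url=endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def submit(request, timeout=30):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_process_request(
    "https://API-ENDPOINT/prod/process",  # placeholder endpoint
    "YOUR-API-KEY",                       # placeholder API key
    {
        "file_key": "uploads",
        "bucket": "your-bucket-name",
        "doc_category": "report",
        "doc_language": "japanese",
        "output_language": "english",
    },
)
# submit(request) would perform the network call and return the response
# dictionary, for example its "message" and "requestId" fields.
```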

🔒 Security & Permissions

Required AWS Permissions

  • S3: Full access for bucket operations
  • Lambda: Function creation and execution
  • SageMaker: Processing job execution
  • ECR: Repository and image management
  • API Gateway: API creation and management
  • CloudFormation: Stack operations
  • IAM: Role and policy management
  • Bedrock: Model access and data automation

Security Features

  • API Key Authentication: Secure API access
  • IAM Role-based Access: Least privilege principle
  • VPC Support: Network isolation (optional)
  • Encryption: S3 and ECR encryption at rest

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test with sample documents
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Documentation: Check this README and inline code comments
  • Issues: Create GitHub issues for bugs or feature requests
  • Logs: Always check CloudWatch logs for troubleshooting
  • AWS Support: Use AWS Support for service-specific issues

🎯 Roadmap

Current Version (v1.0)

  • ✅ Japanese, Korean, English support
  • ✅ Report and invoice processing
  • ✅ Automated deployment
  • ✅ API Gateway integration

Upcoming Features

  • 🔄 Thai and Hindi language support
  • 🔄 Magazine and newspaper document types
  • 🔄 Batch processing capabilities
  • 🔄 Real-time processing API
  • 🔄 Web UI for document upload
  • 🔄 Advanced analytics dashboard

Core Contributors

  • Sujoy Roy (Principal Applied Scientist, AWS)
  • Shreya Goyal (Applied Scientist, AWS)
  • Xiaogang Wang (Senior Applied Scientist, AWS)
  • Iman Abbasnejad (Applied Scientist, AWS)

🚀 Ready to process your multi-lingual documents? Start with the Quick Start guide above!
