The Multi-lingual Document Processor (MDP) is a cloud-native solution that transforms complex multi-modal documents into structured, narrative text optimized for AI and RAG applications. Unlike traditional OCR tools, MDP excels at processing documents with varied layouts, multimedia elements, and non-Latin scripts. MDP locates the multi-modal components on each page, extracts the information from each component, and reassembles it into narrative form that preserves where the content appeared on the page. All extracted images and tables are also saved to separate folders for downstream applications that need both structured and unstructured information.
Demo video: `video_demo.mp4`
- Supported Languages: English, Japanese, Korean
- Coming Soon: Thai, Hindi
- Reports: Multimedia documents with images, tables, charts
- Invoices: Tabular structure documents
- Coming Soon: Magazines, newspapers
- Amazon Bedrock Integration: Advanced LLM-based document analysis
- Component Extraction: Text, images, tables, charts, infographics
- Narrative Generation: Converts all elements to descriptive text for downstream RAG applications
- Confidence Scoring: Reliability assessment for extracted content
- Serverless: AWS Lambda + SageMaker Processing Jobs
- Containerized: Docker-based processing with ECR
- API-First: RESTful API with authentication
- Scalable: Auto-scaling based on demand
The Multi-lingual Document Processor follows a serverless architecture designed for scalability, reliability, and cost-effectiveness:
- API Gateway: RESTful API endpoint with API key authentication
- Client Applications: Python client (`mdp_client.py`) and direct API access
- Authentication: Secure API key-based access control
- Lambda Function: Lightweight orchestrator that receives requests and manages workflow
- SageMaker Processing Jobs: Heavy-duty document processing in containerized environments
- Docker Containers: Custom-built images with all dependencies (stored in ECR)
- Amazon Bedrock: Advanced LLM services for document analysis and narrative generation
- Amazon Textract: OCR and table extraction for structured data
- Custom AI Models: Specialized processors for different document types and languages
- S3 Buckets:
  - Input documents storage (`uploads/`, `invoices/`, etc.)
  - Output results storage (`output/`)
  - Lambda deployment packages and layers
- ECR Repository: Docker image storage and versioning
- CloudFormation: Infrastructure as Code for reproducible deployments
- IAM Roles: Secure, least-privilege access between services
- CloudWatch: Logging and monitoring for all components
The picture below illustrates the processing workflow.
- Document Upload: Users upload PDF documents to S3 bucket folders
- API Request: Client submits a processing request via API Gateway
- Lambda Orchestration: The Lambda function validates the request and initiates a SageMaker job
- Container Processing: SageMaker spins up a Docker container with the processing logic
- AI Analysis: The container uses Bedrock, Textract, and custom computer-vision algorithms for document analysis
- Content Generation: AI generates narrative descriptions of all document components
- Output Storage: Results are saved to S3 in a structured format
- Completion: The client receives confirmation and can access the processed results
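The Lambda orchestration step is essentially a translation of the API request into a SageMaker Processing Job. The following is a minimal boto3 sketch of that hand-off, not the actual `lambda_function.py`; the image URI, role ARN, instance type, and environment variable names are illustrative placeholders.

```python
# Minimal sketch of how an orchestrator could launch a SageMaker Processing Job
# for one request. Image URI, role ARN, instance type and env var names are
# placeholders, not the values used by the deployed stack.
import uuid
import boto3

sagemaker = boto3.client("sagemaker")

def start_processing_job(request: dict) -> str:
    job_name = f"mdp-job-{uuid.uuid4().hex[:8]}"  # hypothetical naming scheme
    sagemaker.create_processing_job(
        ProcessingJobName=job_name,
        RoleArn="arn:aws:iam::ACCOUNT_ID:role/MDPProcessingRole",  # placeholder
        AppSpecification={
            "ImageUri": "ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/mdp-processor:latest",
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
        Environment={
            "FILE_KEY": request["file_key"],
            "BUCKET": request["bucket"],
            "DOC_CATEGORY": request["doc_category"],
            "DOC_LANGUAGE": request["doc_language"],
            "OUTPUT_LANGUAGE": request["output_language"],
        },
    )
    return job_name
```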
- API Authentication: API key-based access control
- IAM Roles: Service-to-service authentication with minimal permissions
- VPC Support: Optional network isolation for enhanced security
- Encryption: Data encrypted at rest (S3, ECR) and in transit (HTTPS)
- Auto-scaling: SageMaker automatically scales based on demand
- Concurrent Processing: Multiple documents can be processed simultaneously
- Resource Optimization: Pay-per-use model with automatic resource management
- Global Deployment: Can be deployed in any AWS region supporting required services
```
s3://your-bucket-name/
├── japanese-reports/            # Input documents folder
│   ├── document1.pdf            # Individual PDF files
│   └── document2.pdf            # Other PDF files
├── english-invoices/            # Invoice folder
│   ├── invoice_001.pdf
│   └── invoice_002.pdf
└── english-reports/             # Reports folder
    ├── annual_report.pdf
    └── quarterly_report.pdf
```
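If you prefer to script the uploads, a minimal boto3 sketch that follows the layout above could look like this (the bucket name, folder names, and local paths are placeholders):

```python
# Upload input PDFs into per-category folders under the deployment bucket.
# Bucket, folder and file names below are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"

s3.upload_file("local/report_q1.pdf", bucket, "japanese-reports/report_q1.pdf")
s3.upload_file("local/invoice_001.pdf", bucket, "english-invoices/invoice_001.pdf")
```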
- Format: PDF files only (`.pdf` extension)
- Size Limits:
  - Recommended: < 50MB per file
  - Maximum: 100MB per file
- Pages: Up to 50 pages per document
- Content Types: Text, images, tables, charts, infographics
- Quality: Minimum 150 DPI for optimal OCR results
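As an optional client-side convenience (not part of MDP itself), the limits above can be checked locally before uploading. The sketch below assumes the `pypdf` package is installed:

```python
# Pre-flight check of a local PDF against the MDP input limits listed above.
# Assumes pypdf is installed (pip install pypdf); not part of the MDP codebase.
import os
from pypdf import PdfReader

MAX_SIZE_MB = 100
RECOMMENDED_SIZE_MB = 50
MAX_PAGES = 50

def precheck_pdf(path: str) -> None:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    pages = len(PdfReader(path).pages)
    if size_mb > MAX_SIZE_MB or pages > MAX_PAGES:
        raise ValueError(f"{path}: {size_mb:.1f} MB / {pages} pages exceeds MDP limits")
    if size_mb > RECOMMENDED_SIZE_MB:
        print(f"Warning: {path} is over {RECOMMENDED_SIZE_MB} MB; processing may be slow")
```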
| Category | Description | Optimization |
|---|---|---|
| `report` | Multimedia documents with mixed content | General-purpose processing |
| `invoice` | Tabular documents with structured data | Table extraction focused |
| `magazine` | Magazine-style layouts (Coming Soon) | Layout-aware processing |
| `newspaper` | Newspaper formats (Coming Soon) | Column-aware processing |
| Language | Code | Status | Script Type |
|---|---|---|---|
| English | `english` | Production | Latin |
| Japanese | `japanese` | Production | Hiragana, Katakana, Kanji |
| Korean | `korean` | Production | Hangul |
| Thai | `thai` | Coming Soon | Thai script |
| Hindi | `hindi` | Coming Soon | Devanagari |
{
"file_key": "uploads", // S3 folder path (required)
"bucket": "your-bucket-name", // S3 bucket name (required)
"doc_category": "report", // Document category (required)
"doc_language": "japanese", // Source language (required)
"output_language": "english" // Target language (required)
}
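The same payload can be submitted directly from Python. Below is a minimal sketch using the `requests` library; the endpoint URL and API key are placeholders and should be taken from your CloudFormation stack outputs:

```python
# Submit a processing request to the MDP API. Endpoint and key are placeholders;
# use the values from your CloudFormation stack outputs.
import requests

API_ENDPOINT = "https://YOUR-API-ID.execute-api.us-east-1.amazonaws.com/prod/process"
API_KEY = "YOUR-API-KEY"

payload = {
    "file_key": "uploads",
    "bucket": "your-bucket-name",
    "doc_category": "report",
    "doc_language": "japanese",
    "output_language": "english",
}

response = requests.post(
    API_ENDPOINT,
    json=payload,
    headers={"x-api-key": API_KEY},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"message": "...", "requestId": "..."}
```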
# Process Japanese reports in uploads folder
python mdp_client.py --stack-name your-stack \
--folder uploads \
--language japanese \
--category report \
--output-language english
# Process Korean invoices in specific folder
python mdp_client.py --stack-name your-stack \
--folder invoices/korean \
--language korean \
--category invoice \
--output-language english
# Process only PDF files with specific extensions
python mdp_client.py --stack-name your-stack \
--folder documents \
--extensions pdf \
--language english
The system automatically validates:
- File format (PDF only)
- File accessibility in S3
- Language code validity
- Category support
- File size limits
- Content quality (warns if low resolution)
- File Organization: Group similar documents in folders by category
- Naming Convention: Use descriptive filenames (e.g., `invoice_2024_001.pdf`)
- Quality: Ensure documents are clear and readable
- Size: Keep files under 50MB for optimal processing speed
- Language Consistency: Process documents of the same language together
```
s3://bucket/output/document-name/
├── final_report.txt           # Complete narrative description
├── processing_report.json     # Processing metadata
├── images/                    # Extracted images
│   ├── image_page_1_1.png
│   └── image_page_2_1.png
└── tables/                    # Extracted tables
    ├── table_page_5_1.csv
    └── table_page_6_1.csv
```
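Once a job finishes, the results can be pulled down with a few lines of boto3. This is an illustrative sketch following the layout above; the bucket and document names are placeholders:

```python
# Download everything MDP produced for one document from the output/ prefix.
# Bucket and document names are placeholders.
import os
import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"
prefix = "output/document-name/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip folder marker objects
        local_path = os.path.join("downloads", os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)
```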
- MDP does not classify documents by category, so before processing, separate your documents by type (report or invoice) into different folders under the S3 bucket created when you deploy the stack.
- The deployment stack first creates an S3 bucket to store all deployment artifacts. All of your data must go directly under a new folder (any name) in this bucket, and that folder name must be provided in the payload when calling the solution. Create separate folders for different document categories.
- MDP version 1.0 can be deployed in regions that have Bedrock Data Automation (BDA), although BDA is used only to handle English reports. Update `deploy_dynamic.sh` with the right region, and ensure Claude 3.7 Sonnet is accessible in your account in the region you deploy to. Regions where BDA is available: `us-east-1`, `us-west-2`, `us-gov-west-1`, `eu-central-1`, `eu-west-1`, `eu-west-2`, `ap-south-1`, `ap-southeast-2`.
- If you already have a stack deployed in an AWS region, delete it before deploying another stack; otherwise role-creation clashes may cause the new stack to throw errors.
- If you wish to modify the code and build your own custom image, modify the code in the `src` folder, delete the `images` folder that was created when you ran the `./build_sagemaker_prebuilt.sh` script, and then run `./build_sagemaker_prebuilt.sh` again. This rebuilds the Docker image in the `images` folder.
- Process a representative document first, estimate the cost from its token information, and only then run the full workload so that cost sensitivities are managed (a rough estimation sketch follows this list).
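The sketch below illustrates the cost-estimation note above. It covers only the Bedrock (LLM) portion of the cost; SageMaker, Textract, and S3 usage add to the total, and the per-token prices are placeholders to be filled in from current AWS pricing:

```python
# Rough Bedrock cost extrapolation from one representative document.
# Prices are placeholders -- fill in current Bedrock pricing for your model.
INPUT_PRICE_PER_1K_TOKENS = 0.0   # USD, from the Bedrock pricing page
OUTPUT_PRICE_PER_1K_TOKENS = 0.0  # USD, from the Bedrock pricing page

def estimate_cost(input_tokens: int, output_tokens: int, num_documents: int) -> float:
    """Extrapolate LLM cost for a workload from one representative document."""
    per_doc = (
        (input_tokens / 1000) * INPUT_PRICE_PER_1K_TOKENS
        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K_TOKENS
    )
    return per_doc * num_documents

# Hypothetical token counts from one representative document, 500-document workload
print(f"Estimated Bedrock cost: ${estimate_cost(120_000, 15_000, 500):,.2f}")
```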
- AWS CLI configured with appropriate permissions
- Docker running
- Python 3.10+
aws sts get-caller-identity
docker info >/dev/null 2>&1 && echo "Docker is running" || echo "Docker is not running"
git clone <repository-url>
cd multi-lingual-document-processor
chmod +x build_sagemaker_prebuilt.sh && ./build_sagemaker_prebuilt.sh
chmod +x deploy_dynamic.sh && ./deploy_dynamic.sh
# Record YOUR-STACK-NAME and YOUR-BUCKET-NAME as environment variables
# Use the bucket name from deployment output
aws s3 cp samples/<your filename>.pdf s3://YOUR-BUCKET-NAME/<uploads folder>/
# Use the stack name from deployment output
python mdp_client.py --stack-name YOUR-STACK-NAME \
--folder uploads \
--language japanese \
--category report \
--output-language japanese
# Monitor processing
python mdp_client.py --stack-name YOUR-STACK-NAME --check-status
# View output files
aws s3 ls s3://YOUR-BUCKET-NAME/output/ --recursive
# List all MDP stacks
./cleanup_stack.sh --list-stacks
# Clean up a specific stack (with confirmation)
./cleanup_stack.sh mdp-stack-user-abc123
# Force cleanup without prompts
./cleanup_stack.sh mdp-stack-user-abc123 --force
# Clean up in different region
./cleanup_stack.sh mdp-stack-user-abc123 --region us-west-2
# Show help
./cleanup_stack.sh --help
The `deploy_dynamic.sh` script handles everything:
- Resource Creation: S3 bucket, ECR repository, unique naming
- Docker Build: Platform-specific image building (ARM64/AMD64)
- Image Push: Automated ECR push with authentication
- Infrastructure: CloudFormation stack with all AWS resources
- Validation: Template validation and deployment verification
Stack Name: mdp-stack-{user}-{random}
S3 Bucket: mdp-bucket-{user}-{random}
ECR Repository: mdp-processor-{user}-{random}
API Gateway: https://{id}.execute-api.us-east-1.amazonaws.com/prod/process
Lambda Function: {stack-name}-MDPDocumentProcessor
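These values are also exposed as CloudFormation stack outputs, so they can be read programmatically. A minimal boto3 sketch follows; the stack name is a placeholder and the exact output key names depend on the template:

```python
# Read the deployed stack's outputs (API endpoint, bucket name, etc.).
# The stack name is a placeholder; output key names vary by template.
import boto3

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="mdp-stack-user-abc123")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}
print(outputs)
```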
- Lambda: Document processing orchestration
- SageMaker: Heavy processing jobs in containers
- S3: Document storage and output
- ECR: Docker image registry
- API Gateway: RESTful API with authentication
- Bedrock: AI/ML document analysis
- CloudFormation: Infrastructure as Code
python mdp_client.py --stack-name mdp-stack-user-abc123 \
--process-folder invoices \
--language japanese \
--output-language english \
--category invoice
python mdp_client.py --stack-name mdp-stack-user-abc123 \
--process-folder documents \
--language korean \
--output-language english \
--category report
python mdp_client.py --stack-name mdp-stack-user-abc123 --list-folder uploads
python mdp_client.py --stack-name mdp-stack-user-abc123 --check-status
# List recent SageMaker jobs
aws sagemaker list-processing-jobs --sort-by CreationTime --sort-order Descending --max-items 5
# Get job details
aws sagemaker describe-processing-job --processing-job-name JOB-NAME
# Lambda logs
aws logs describe-log-streams --log-group-name "/aws/lambda/STACK-NAME-MDPDocumentProcessor"
# SageMaker logs
aws logs filter-log-events --log-group-name "/aws/sagemaker/ProcessingJobs" --log-stream-name-prefix "JOB-NAME"
| Issue | Solution |
|---|---|
| Docker build fails | Use `Dockerfile.sagemaker` (most reliable) |
| Processing timeout | Check document size (< 50MB recommended) |
| No output files | Verify input files exist in S3 |
| API authentication | Check API key in CloudFormation outputs |
```
multi-lingual-document-processor/
├── README.md                        # This comprehensive guide
├── deploy_dynamic.sh                # Main deployment script
├── Dockerfile.sagemaker             # Production Docker image
├── Dockerfile.arm64                 # ARM64 optimized image
├── mdp_client.py                    # Python client for API
├── lambda_function.py               # Lambda entry point
├── processing_script_clean.py       # Main processing logic
├── CFT_template_dynamic.yml         # CloudFormation template
├── buildspec.yml                    # CodeBuild specification
├── requirements.txt                 # Python dependencies
├── stack_config.json                # Configuration file
├── load_prebuilt_image.sh           # Load pre-built Docker images
├── MDP_2.png                        # Architecture diagram
├── LICENSE                          # MIT License
├── src/                             # Source code modules
│   ├── config/                      # Configuration management
│   │   ├── config.py                # Main configuration
│   │   └── prompts.py               # AI prompts
│   ├── processors/                  # Document processors
│   │   ├── document_processor.py    # Main document processor
│   │   ├── image_processor.py       # Image analysis
│   │   ├── table_processor.py       # Table extraction
│   │   ├── text_processor.py        # Text processing
│   │   └── [other processors]       # Specialized processors
│   └── utils/                       # Utility functions
│       ├── cleanup_utility.py       # Cleanup tools
│       ├── directory_manager.py     # File management
│       └── postprocessing.py        # Output processing
├── Layers/                          # Lambda layers (dependencies)
│   ├── SM_layer_new-*.zip           # SageMaker layer
│   ├── pandas_layer_new-*.zip       # Pandas layer
│   ├── numpy_layer_new-*.zip        # NumPy layer
│   ├── pydantic_layer-*.zip         # Pydantic layer
│   └── rpds_layer_new-*.zip         # RPDS layer
├── samples/                         # Test documents
│   └── Japanese_doc_shorter_images.pdf
└── images/                          # Pre-built Docker images
    ├── mdp-sagemaker_latest_sagemaker_amd64.tar.gz
    └── README.md                    # Usage instructions
```
- `deploy_dynamic.sh` - One-command deployment
- `mdp_client.py` - Easy-to-use Python client
- `Dockerfile.sagemaker` - Production-ready container
- `src/` - All processing logic and configuration
- `Layers/` - Pre-built dependencies for Lambda
- `samples/` - Test documents to verify deployment
Edit `src/config/config.py`:
# Model settings
BEDROCK_MODEL = "claude-3-7-sonnet-20250219-v1:0"
PROCESSING_TIMEOUT = 900 # seconds
MEMORY_SIZE = 1024 # MB
# Language-specific settings
LANGUAGE_MODELS = {
"japanese": "claude-3-sonnet",
"korean": "claude-3-sonnet",
"english": "claude-3-haiku"
}
Edit `src/config/prompts.py`:
# Document analysis prompts
DOCUMENT_ANALYSIS_PROMPT = """
Analyze this document and provide a detailed narrative...
"""
# Table extraction prompts
TABLE_EXTRACTION_PROMPT = """
Extract and describe the tabular data...
"""
curl -X POST https://API-ENDPOINT/prod/process \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR-API-KEY" \
-d '{
"file_key": "uploads",
"bucket": "your-bucket-name",
"doc_category": "report",
"doc_language": "japanese",
"output_language": "english"
}'
{
"message": "Document processing request accepted",
"requestId": "12345-abcde-67890"
}
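After the request is accepted, results appear under the `output/` prefix of your bucket once the SageMaker job finishes. The packaged client's `--check-status` flag covers this, but a simple polling sketch (bucket name and timeout are placeholders) could look like:

```python
# Poll the output/ prefix until processed results appear.
# Bucket name and timeout are placeholders; mdp_client.py --check-status
# provides the same information.
import time
import boto3

s3 = boto3.client("s3")

def wait_for_output(bucket: str, prefix: str = "output/", timeout_s: int = 1800) -> list:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [obj["Key"] for obj in listing.get("Contents", [])]
        if keys:
            return keys
        time.sleep(30)
    raise TimeoutError(f"No output under s3://{bucket}/{prefix} after {timeout_s}s")
```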
- S3: Full access for bucket operations
- Lambda: Function creation and execution
- SageMaker: Processing job execution
- ECR: Repository and image management
- API Gateway: API creation and management
- CloudFormation: Stack operations
- IAM: Role and policy management
- Bedrock: Model access and data automation
- API Key Authentication: Secure API access
- IAM Role-based Access: Least privilege principle
- VPC Support: Network isolation (optional)
- Encryption: S3 and ECR encryption at rest
- Fork the repository
- Create a feature branch
- Make your changes
- Test with sample documents
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Check this README and inline code comments
- Issues: Create GitHub issues for bugs or feature requests
- Logs: Always check CloudWatch logs for troubleshooting
- AWS Support: Use AWS Support for service-specific issues
- Japanese, Korean, English support
- Report and invoice processing
- Automated deployment
- API Gateway integration
- Coming Soon: Thai and Hindi language support
- Coming Soon: Magazine and newspaper document types
- Coming Soon: Batch processing capabilities
- Coming Soon: Real-time processing API
- Coming Soon: Web UI for document upload
- Coming Soon: Advanced analytics dashboard
- Sujoy Roy (Principal Applied Scientist, AWS)
- Shreya Goyal (Applied Scientist, AWS)
- Xiaogang Wang (Senior Applied Scientist, AWS)
- Iman Abbasnejad (Applied Scientist, AWS)
Ready to process your multi-lingual documents? Start with the Quick Start guide above!