A high-performance document parsing microservice for the Pipestream AI platform, built on Apache Tika and Quarkus.
The module-parser is a critical pipeline component in the Pipestream AI ecosystem that transforms raw documents into structured, searchable data. It extracts text, metadata, and document structure from 100+ file formats, enabling downstream AI/ML processing, search indexing, and content analysis.
Pipestream AI is a distributed document processing platform where specialized microservices collaborate through gRPC to handle complex workflows. The parser module serves as the ingestion gateway, responsible for:
- Document Understanding: Converting diverse file formats into standardized representations
- Metadata Enrichment: Extracting 1,330+ metadata fields across 14 document types
- Structure Extraction: Building document outlines, link graphs, and hierarchies
- Pipeline Integration: Producing PipeDoc messages for downstream processing modules
- Quality Assurance: Validating document quality and handling parsing errors gracefully
Each parsed document flows through the platform's service mesh, where it can be indexed for search, analyzed by AI models, or transformed by other pipeline modules. All of this is coordinated through the platform's dynamic service discovery and gRPC communication layer.
At its core, this service leverages Apache Tika, the industry-standard content analysis toolkit. Tika provides:
- Universal Format Support: Parsers for documents, images, audio, video, archives, and more
- Metadata Extraction: Access to embedded metadata across formats (EXIF, XMP, Office properties)
- Content Detection: Automatic MIME type detection and character encoding handling
- Text Extraction: Unified API for extracting plain text from binary formats
The module-parser builds upon Tika's foundation by:
- Wrapping Tika's capabilities in modern gRPC and REST APIs
- Mapping Tika's loosely-typed metadata to strongly-typed Protocol Buffers
- Adding specialized extractors for PDF outlines, EPUB structure, and HTML hierarchies
- Providing enterprise features like error handling, configuration management, and service discovery
- Offering developer-friendly testing endpoints and comprehensive documentation
- Tika Core: 3.2.1
- Standard Parsers: Microsoft Office, PDF, OpenDocument, RTF, images, email, and more
- Scientific Parsers: NetCDF climate data, specialized scientific formats
- OCR Support: Tesseract integration for scanned documents and images
The parser produces strongly-typed metadata using an extensive set of Protocol Buffer definitions maintained in the platform-libraries repository.
17 protobuf files define comprehensive metadata structures:
- pdf_metadata.proto - 50+ PDF fields (version, encryption, permissions, producer, etc.)
- office_metadata.proto - Microsoft Office & OpenOffice properties
- image_metadata.proto - TIFF/EXIF/IPTC metadata, camera settings, GPS coordinates
- email_metadata.proto - Email headers, attachments, MAPI properties
- media_metadata.proto - Audio/video XMP metadata
- html_metadata.proto - Web document properties
- epub_metadata.proto - E-book format metadata
- rtf_metadata.proto - Rich Text Format properties
- font_metadata.proto - Font file attributes (TTF, OTF, WOFF)
- database_metadata.proto - Database schema information
- warc_metadata.proto - Web archive format data
- climate_forecast_metadata.proto - NetCDF climate data

- dublin_core.proto - Dublin Core metadata standard
- creative_commons_metadata.proto - Creative Commons licensing

- tika_response.proto - Top-level response structure
- tika_base_metadata.proto - Foundation metadata types
- generic_metadata.proto - Universal metadata fields
This comprehensive schema enables strongly-typed metadata extraction with over 1,330 fields mapped from Tika's interfaces to structured protobuf messages.
- 1,330+ metadata fields across 14 document types
- 18 specialized metadata builders for format-specific extraction
- Automatic mapping from Tika metadata interfaces to structured protobufs
- XMP metadata extraction (Rights, PDF, Digital Media)
- Dublin Core standard metadata support
- Access permissions and security metadata
- PDF Outlines: Bookmark hierarchy extraction via PDFBox
- EPUB Table of Contents: Complete e-book navigation structure
- HTML Outlines: Heading hierarchy (H1-H6) with CSS selector support
- Markdown Structure: Heading extraction with CommonMark parser
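To make the idea of a heading outline concrete, here is a minimal sketch that collects Markdown headings into (level, text) pairs. It is not the module's implementation: the README says the service uses a CommonMark parser, while this sketch swaps in a plain regular expression over ATX headings ("# Title") so that it runs on the JDK alone. The class and record names are hypothetical, and Setext headings and fenced code blocks are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects ATX headings from Markdown text using a regex. */
final class MarkdownOutlineSketch {

    /** One outline entry: heading level (1-6) and the heading text. */
    record Heading(int level, String text) {}

    // Up to three leading spaces, 1-6 '#', required whitespace, the text,
    // and an optional closing run of '#' characters.
    private static final Pattern ATX =
            Pattern.compile("^ {0,3}(#{1,6})[ \\t]+(.*?)(?:[ \\t]+#+)?[ \\t]*$");

    static List<Heading> extract(String markdown) {
        List<Heading> outline = new ArrayList<>();
        for (String line : markdown.split("\\R")) {
            Matcher m = ATX.matcher(line);
            if (m.matches()) {
                outline.add(new Heading(m.group(1).length(), m.group(2).strip()));
            }
        }
        return outline;
    }

    public static void main(String[] args) {
        String doc = "# Guide\n\nIntro text\n\n## Setup ##\n#hashtag\n### Details";
        // Print the outline indented by heading level.
        extract(doc).forEach(h ->
                System.out.println("  ".repeat(h.level() - 1) + h.text()));
    }
}
```

A real parser is the better choice in production because it also understands Setext headings, code fences, and inline markup inside heading text.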
- HTML link extraction with external/internal classification
- Markdown hyperlink discovery
- Link context and metadata tracking
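One way to picture external/internal classification is to resolve each href against the document's base URL and compare hosts. The sketch below does that with java.net.URI only. The class name, the enum, and the rule "same host means internal" are assumptions made for illustration; the README does not describe the rule the service actually applies.

```java
import java.net.URI;

/** Hypothetical sketch: classify a link relative to the page it appears on. */
final class LinkClassifierSketch {

    enum LinkKind { INTERNAL, EXTERNAL, OTHER, INVALID }

    static LinkKind classify(String baseUrl, String href) {
        try {
            URI base = URI.create(baseUrl);
            // Resolves relative paths, fragments, and protocol-relative URLs.
            URI resolved = base.resolve(href.strip());
            String host = resolved.getHost();
            if (host == null) {
                // Opaque URIs such as mailto: carry no host.
                return LinkKind.OTHER;
            }
            // Assumed rule: a link is internal when it stays on the base host.
            return host.equalsIgnoreCase(base.getHost())
                    ? LinkKind.INTERNAL
                    : LinkKind.EXTERNAL;
        } catch (IllegalArgumentException e) {
            // URI.create and URI.resolve reject syntactically invalid input.
            return LinkKind.INVALID;
        }
    }

    public static void main(String[] args) {
        String base = "https://example.com/docs/";
        for (String href : new String[] {"/about", "#intro", "https://other.org/x", "mailto:a@b.com"}) {
            System.out.println(href + " -> " + classify(base, href));
        }
    }
}
```

Comparing full hosts treats subdomains as external; a deployment that wants docs.example.com and example.com grouped together would need a different rule.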
100+ file formats including:
- Documents: PDF, Word (.doc/.docx), PowerPoint, Excel, RTF, OpenDocument
- Images: JPEG, PNG, TIFF, GIF with full EXIF/IPTC metadata
- Emails: EML, MSG, MBOX with attachment metadata
- E-books: EPUB with table of contents
- Archives: WARC web archives
- Media: Audio/video with XMP metadata
- Data: Databases, NetCDF climate files
- Web: HTML, XML with link extraction
- Fonts: TTF, OTF, WOFF with font metrics
- Pre-built configurations for common scenarios:
  - Default parsing with balanced settings
  - Large document processing (optimized memory)
  - Fast processing (speed-optimized)
  - Batch processing (resilient error handling)
  - Strict quality control (fail-fast validation)
- Rich configuration options:
  - Content length limits
  - Metadata extraction controls
  - Timeout management
  - Geo-parser and EMF parser toggles
  - MIME type filtering
  - Outline extraction settings
- gRPC API: High-performance binary protocol for production
- REST API: HTTP endpoints for testing and exploration
- OpenAPI/Swagger: Interactive API documentation at /swagger-ui
- Pre-built examples: Configuration templates for common use cases
- Service Discovery: Automatic Consul registration
- Health Checks: Built-in health monitoring
- Error Handling: Graceful fallbacks and detailed error reporting
- Schema Validation: Configuration validation endpoints
- MIME Type Detection: Automatic content type detection
- Runtime: Quarkus 3.x (Java 21)
- Build System: Gradle with version catalogs
- Parsing Engine: Apache Tika 3.2.1
- Protocols: gRPC + REST (JAX-RS)
- Service Discovery: Consul via SmallRye Stork
- Data Format: Protocol Buffers (protobuf)
- Additional Libraries: PDFBox, CommonMark, Jackson
The main gRPC service implementation (ai.pipestream.module.parser.service.ParserServiceImpl):
- Receives ModuleProcessRequest with document blobs
- Orchestrates parsing through DocumentParser
- Extracts metadata via TikaMetadataExtractor
- Builds structured TikaResponse with format-specific metadata
- Returns ModuleProcessResponse containing PipeDoc
Developer-friendly HTTP endpoints at /api/parser/service/:
- /config - Get parser configuration JSON schema
- /health - Service health check
- /test - Quick parser testing
- /simple-form - Form-based document upload
- /parse-json - JSON-based parsing with configuration
- /parse-file - Direct file upload
- /config/validate - Configuration validation
- /config/examples - Pre-built configuration examples
- /demo/* - Demo document testing
Core parsing engine (ai.pipestream.module.parser.parser.DocumentParser):
- Auto-detects document types via Tika's AutoDetectParser
- Applies custom Tika configurations
- Enforces content length limits
- Handles special cases (fonts, EPUBs, embedded documents)
- Maps Tika metadata to protobuf structures
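As an illustration of what enforcing a content length limit can look like, the sketch below truncates extracted text to a maximum number of characters and reports whether anything was cut. This is an assumption about the behaviour, not a description of DocumentParser: the class and record names are hypothetical, and treating a negative limit as "unlimited" is a convention chosen here for the example.

```java
/** Hypothetical sketch: cap extracted text at a maximum length. */
final class ContentLimitSketch {

    /** The kept text plus a flag saying whether the input was cut. */
    record Limited(String text, boolean truncated) {}

    static Limited limit(String text, int maxContentLength) {
        // Assumed convention: a negative limit means "no limit".
        if (maxContentLength < 0 || text.length() <= maxContentLength) {
            return new Limited(text, false);
        }
        int end = maxContentLength;
        // Avoid cutting between the two halves of a surrogate pair,
        // which would leave an unpaired (invalid) UTF-16 code unit.
        if (end > 0
                && Character.isHighSurrogate(text.charAt(end - 1))
                && Character.isLowSurrogate(text.charAt(end))) {
            end--;
        }
        return new Limited(text.substring(0, end), true);
    }

    public static void main(String[] args) {
        Limited result = limit("hello world", 5);
        System.out.println(result.text() + " (truncated=" + result.truncated() + ")");
    }
}
```

Returning the flag alongside the text lets a caller record in the response that the content is incomplete instead of silently dropping the tail.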
18 specialized metadata builders (3,081 lines total):
- PdfMetadataBuilder - PDF documents
- OfficeMetadataBuilder - Microsoft Office formats
- ImageMetadataBuilder - Images with EXIF/IPTC
- EmailMetadataBuilder - Email messages
- MediaMetadataBuilder - Audio/video files
- HtmlMetadataBuilder - HTML documents
- RtfMetadataBuilder - RTF documents
- EpubMetadataBuilder - EPUB e-books
- DatabaseMetadataBuilder - Database files
- FontMetadataBuilder - Font files
- WarcMetadataBuilder - Web archives
- ClimateForecastMetadataBuilder - Climate data
- CreativeCommonsMetadataBuilder - CC licensing
- Plus extraction utilities for outlines, links, and type detection
┌─────────────────────┐
│ Input Document │
│ (Binary Blob) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ ParserServiceImpl │
│ (gRPC/REST) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ DocumentParser │
│ (Apache Tika) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ TikaMetadataExtractor│
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ DocumentTypeDetector │
└──────────┬──────────┘
│
├─→ PdfMetadataBuilder
├─→ OfficeMetadataBuilder
├─→ ImageMetadataBuilder
├─→ EmailMetadataBuilder
└─→ [Other Builders...]
│
▼
┌─────────────────────┐
│ TikaResponse │
│ (Protobuf) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ PipeDoc │
│ (Search Metadata │
│ + Structured Data)│
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Downstream Pipeline │
│ (Search, AI/ML, │
│ Analytics) │
└─────────────────────┘
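The fan-out step in the diagram, where a detected type selects a format-specific builder, can be sketched as a lookup from MIME type to builder. The builder names below come from this README, but the routing table, the prefix matching, and the class itself are assumptions for illustration; the actual DocumentTypeDetector logic is not shown in this document.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: pick a metadata builder name from a MIME type. */
final class BuilderDispatchSketch {

    // MIME type or prefix -> builder name. The mapping rules are assumed.
    private static final Map<String, String> ROUTES = new LinkedHashMap<>();
    static {
        ROUTES.put("application/pdf", "PdfMetadataBuilder");
        ROUTES.put("application/epub+zip", "EpubMetadataBuilder");
        ROUTES.put("application/rtf", "RtfMetadataBuilder");
        ROUTES.put("message/rfc822", "EmailMetadataBuilder");
        ROUTES.put("text/html", "HtmlMetadataBuilder");
        ROUTES.put("image/", "ImageMetadataBuilder");
        ROUTES.put("audio/", "MediaMetadataBuilder");
        ROUTES.put("video/", "MediaMetadataBuilder");
    }

    static Optional<String> builderFor(String mimeType) {
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String base = mimeType.split(";", 2)[0].strip().toLowerCase(Locale.ROOT);
        // First matching route wins; insertion order is preserved.
        return ROUTES.entrySet().stream()
                .filter(e -> base.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(builderFor("application/pdf").orElse("(generic metadata only)"));
        System.out.println(builderFor("image/png").orElse("(generic metadata only)"));
        System.out.println(builderFor("application/x-unknown").orElse("(generic metadata only)"));
    }
}
```

An empty result stands for the case where no format-specific builder applies and only generic metadata would be produced.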
- Java 21+
- Gradle 8.x
- Docker (optional, for containerized deployment)
- Clone the repository
git clone https://github.com/ai-pipestream/module-parser.git
cd module-parser
- Run in development mode
./gradlew quarkusDev
The service will start on port 39001 with:
- gRPC endpoint: localhost:39001
- REST API: http://localhost:39001/api/parser/service/
- Swagger UI: http://localhost:39001/swagger-ui/
- Test with a sample document
curl -F "file=@sample.pdf" http://localhost:39001/api/parser/service/simple-form
Run tests
./gradlew test
Tests use sample documents from Maven artifacts:
- ai.pipestream.testdata:test-documents
- ai.pipestream.testdata:sample-doc-types
Service: ai.pipestream.module.parser.service.ParserService
Method: Process
- Request: ModuleProcessRequest
  - config - JSON string with ParserConfig
  - document - Document blob
  - request_id - Unique request identifier
- Response: ModuleProcessResponse
  - pipe_doc - Parsed document with metadata
  - status - Processing status
  - error - Error details (if applicable)
Base path: /api/parser/service/
GET /config
- Returns JSON schema for ParserConfig
- Response: application/json
POST /config/validate
- Validates parser configuration
- Body: JSON configuration
- Response: Validation result
GET /config/examples
- Returns pre-built configuration examples
- Query params:
  - scenario - Configuration scenario name
- Response: List of example configurations
POST /simple-form
- Simple file upload parsing
- Body: multipart/form-data
  - file - Document file
- Response: Parsed text and metadata
POST /parse-json
- Parse with full configuration
- Body: JSON with config and base64-encoded document
- Response: Complete TikaResponse
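A request to /parse-json could be assembled along these lines using only the JDK. The field names config and document come from the endpoint description above; that config is embedded as a nested JSON object (rather than a string) is an assumption, as are the class and method names. The request is built but not sent, so the sketch runs without a live service.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch: build a POST /parse-json request. */
final class ParseJsonRequestSketch {

    /** Wraps an already-serialized config JSON and the raw document bytes. */
    static String buildBody(String configJson, byte[] document) {
        String encoded = Base64.getEncoder().encodeToString(document);
        // Standard Base64 output has no characters that need JSON escaping.
        return "{\"config\":" + configJson + ",\"document\":\"" + encoded + "\"}";
    }

    static HttpRequest buildRequest(String baseUrl, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/parser/service/parse-json"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        String config = "{\"parsingOptions\":{\"extractLinks\":true}}";
        String body = buildBody(config, "hello".getBytes(StandardCharsets.UTF_8));
        HttpRequest request = buildRequest("http://localhost:39001", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
        // To send it against a running instance:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Base64 inflates the payload by roughly a third, so for large files the multipart /parse-file endpoint is likely the more economical choice.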
POST /parse-file
- Direct file upload with config
- Body: multipart/form-data
  - file - Document file
  - config - JSON configuration (optional)
- Response: Complete TikaResponse
GET /health
- Service health check
- Response: Health status
GET /test
- Test endpoint with demo document
- Response: Parsing result
GET /demo/{filename}
- Parse demo document by filename
- Path param: filename - Demo document name
- Response: Parsed content
The parser is configured using the ParserConfig record with multiple subsections:
{
"parsingOptions": {
"maxContentLength": 1000000,
"extractMetadata": true,
"extractOutline": true,
"extractLinks": true,
"parseEmbeddedDocuments": true,
"maxRecursionDepth": 10
}
}

{
"advancedOptions": {
"enableGeoParser": false,
"disableEmfParser": true,
"extractXmpRights": true
}
}

{
"contentTypeHandling": {
"titleExtractionStrategy": "AUTO",
"allowedMimeTypes": ["application/pdf", "text/html"],
"blockedMimeTypes": []
}
}

{
"errorHandling": {
"throwOnError": false,
"fallbackStrategy": "BEST_EFFORT"
}
}

Access via /config/examples?scenario=<name>:
- default - Balanced settings for general use
- largeDocumentProcessing - Optimized for large files
- fastProcessing - Speed-optimized
- batchProcessing - Resilient batch operations
- strictQualityControl - Fail-fast validation
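The allowedMimeTypes and blockedMimeTypes settings suggest an allow/block filter. The sketch below shows one plausible reading of them. Two rules in it are assumptions, not documented behaviour: a blocked type always loses even if it is also allowed, and an empty allow list means every type is accepted. The class name is hypothetical.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: decide whether a MIME type should be parsed. */
final class MimeFilterSketch {

    static boolean isAllowed(String mimeType, List<String> allowed, List<String> blocked) {
        // Compare on the bare type: strip parameters and normalize case.
        String type = mimeType.split(";", 2)[0].strip().toLowerCase(Locale.ROOT);
        if (blocked.contains(type)) {
            return false; // assumed: the block list takes precedence
        }
        // Assumed: an empty allow list places no restriction.
        return allowed.isEmpty() || allowed.contains(type);
    }

    public static void main(String[] args) {
        List<String> allowed = List.of("application/pdf", "text/html");
        List<String> blocked = List.of();
        System.out.println(isAllowed("application/pdf", allowed, blocked));
        System.out.println(isAllowed("text/html; charset=UTF-8", allowed, blocked));
        System.out.println(isAllowed("image/png", allowed, blocked));
    }
}
```

The lists are expected to hold lowercase types without parameters, matching the form used in the contentTypeHandling example.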
The parser extracts metadata through a layered mapping system, documented in the files referenced in this section:
Tika metadata is organized into interfaces (e.g., PDF, Office, TIFF, Message). The parser maps each interface to corresponding protobuf messages.
See TIKA_INTERFACE_MAPPING.md for:
- Verified Tika metadata interfaces
- Interface-to-protobuf mappings
- Implementation guidelines
Each Tika metadata field is mapped to specific protobuf fields with type conversion.
See SOURCE_DESTINATION_MAPPING.md for:
- Exact field mappings (1,330+ fields)
- Property-by-property documentation
- Usage examples
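To illustrate field-level mapping with type conversion, the sketch below turns a loosely-typed string map into a typed record. The PdfInfo record is a hypothetical stand-in for a generated protobuf message, and the key names are ones Tika commonly emits for PDFs; treat both as assumptions and check SOURCE_DESTINATION_MAPPING.md for the real mappings. Values that are missing or fail to parse become empty rather than causing an error.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: convert string metadata into typed fields. */
final class TypedMetadataSketch {

    /** Stand-in for a generated protobuf message. */
    record PdfInfo(Optional<String> producer, Optional<Integer> pageCount,
                   Optional<Boolean> encrypted, Optional<Instant> created) {}

    static PdfInfo fromRaw(Map<String, String> raw) {
        // Key names are assumed; unknown keys in the map are simply ignored.
        return new PdfInfo(
                text(raw, "pdf:producer"),
                text(raw, "xmpTPg:NPages").flatMap(TypedMetadataSketch::parseInt),
                text(raw, "pdf:encrypted").map(Boolean::parseBoolean),
                text(raw, "dcterms:created").flatMap(TypedMetadataSketch::parseInstant));
    }

    /** Returns the trimmed value, or empty if absent or blank. */
    private static Optional<String> text(Map<String, String> raw, String key) {
        return Optional.ofNullable(raw.get(key)).map(String::strip).filter(s -> !s.isEmpty());
    }

    private static Optional<Integer> parseInt(String s) {
        try {
            return Optional.of(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    private static Optional<Instant> parseInstant(String s) {
        try {
            return Optional.of(Instant.parse(s)); // expects ISO-8601, e.g. 2024-01-15T10:30:00Z
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fromRaw(Map.of(
                "pdf:producer", "LibreOffice",
                "xmpTPg:NPages", "12",
                "pdf:encrypted", "false",
                "dcterms:created", "2024-01-15T10:30:00Z")));
    }
}
```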
Real-world Tika capabilities vary by document type.
See tika-actual-metadata-fields.md for:
- What metadata is actually available
- Example outputs by document type
- Aspirational vs. real capabilities
module-parser/
├── src/main/java/ai/pipestream/module/parser/
│ ├── service/ # gRPC and REST service implementations
│ ├── parser/ # Core parsing logic
│ ├── metadata/ # Metadata extraction system
│ │ ├── builders/ # Format-specific metadata builders
│ │ └── utils/ # Metadata utilities
│ ├── config/ # Configuration management
│ └── model/ # Domain models
├── src/test/java/ # Comprehensive test suite
├── docs/ # Documentation
│ ├── TIKA_INTERFACE_MAPPING.md
│ ├── SOURCE_DESTINATION_MAPPING.md
│ └── tika-actual-metadata-fields.md
├── build.gradle # Gradle build configuration
└── README.md # This file
Build the project
./gradlew build
Build Docker image
./gradlew build -Dquarkus.container-image.build=true
Run tests
./gradlew test
The codebase follows these engineering practices:
- Comprehensive documentation (8.6 KB interface mapping, 19.9 KB field mapping)
- Extensive testing with format-specific integration tests
- Clear separation of concerns (service, parser, metadata layers)
- Type safety through Protocol Buffers
- Error handling with graceful degradation
Run with live reload:
./gradlew quarkusDev
Features in dev mode:
- Live reload on code changes
- Swagger UI at /swagger-ui/
- Dev UI at /q/dev/
- Local Consul for service discovery
Build and run with Docker:
./gradlew build -Dquarkus.container-image.build=true
docker run -p 39001:39001 ai.pipestream.module/module-parser:latest
The service is designed for cloud-native deployment:
- Health check endpoints for liveness/readiness probes
- Graceful shutdown support
- Externalized configuration via environment variables
- Service discovery integration
- Production (prod): Full service discovery, Consul registration
- Development (dev): Local Consul, compose dev services
- Test (test): Isolated testing without registration
Recommended resources:
- Memory: 10GB heap (for large documents and Tika parsers)
- CPU: 2+ cores
- Disk: Minimal (stateless service)
Configure via JVM args:
-Xmx10g -XX:MaxMetaspaceSize=1g
The parser automatically registers with the Pipestream platform:
- Service Type: PARSER
- Capabilities: Advertises supported MIME types
- Schema: Provides JSON schema for configuration
- Health: Reports health status to platform
Registration is handled by the @GrpcServiceRegistration annotation and platform libraries.
- TIKA_INTERFACE_MAPPING.md - Tika interface to protobuf mappings
- SOURCE_DESTINATION_MAPPING.md - Field-level mapping documentation
- tika-actual-metadata-fields.md - Real-world Tika capabilities
- TODO.md - Project roadmap and implementation tracking
- Apache Tika Documentation - Upstream Tika docs
- Pipestream Platform Libraries - Platform infrastructure
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT License - Copyright 2025 io-pipeline
See LICENSE for full details.
For issues, questions, or contributions:
- Issues: GitHub Issues
- Documentation: See /docs directory
- Platform: Pipestream AI Platform
Built with: Apache Tika 3.2.1, Quarkus 3.x, Java 21, gRPC, Protocol Buffers