Add OCR encode parser module #2769
Pull request overview
This PR introduces a new standard parser module, tika-parser-ocr-encode-module, intended to route OCR image types to a parser that base64-encodes the original image bytes into the extracted XHTML output (instead of running text OCR), enabling downstream OCR processing.
Changes:
- Added new EncodeOCRParser + EncodeOCRConfig implementation and SPI registration.
- Registered the new module in the standard modules reactor, BOM, and standard package.
- Added unit tests and small image fixtures to validate base64 output, config behavior, and supported types.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml | Adds dependency on the new OCR-encode parser module to the standard package. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/pom.xml | Registers the new module in the standard-modules build. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/pom.xml | New module POM defining artifact, dependencies, and build settings. |
| tika-bom/pom.xml | Adds the new module to the BOM for dependency management. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/java/.../EncodeOCRParser.java | Implements the parser that emits base64-encoded image bytes wrapped in markers. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/java/.../EncodeOCRConfig.java | Adds configuration options (size limits, max images, skip, inline content). |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser | Registers the parser via Java SPI. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/java/.../EncodeOCRParserTest.java | Unit tests for base64 output markers, limits, skip behavior, and supported types. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/java/.../EncodeOCRConfigTest.java | Unit tests for config defaults, validation, and clone/update behavior. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-documents/testOCR_encode.png | PNG fixture for encoding tests. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-documents/testOCR_encode.jpg | JPEG fixture for encoding tests. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-skip.xml | Example config intended to enable skip behavior (but has a param naming issue). |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-partial.xml | Example config exercising partial configuration. |
| tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-full.xml | Example config exercising full configuration (but has a param naming issue). |
Copilot claims we should adjust the failing test. It is related to the file testPhoneNumberExtractor.odt. For some reason the extraction returns results in a different order than on my machine. (I don't see how this is related to the proposed change.)
Here's what Copilot says after I told it that adjusting the test is the wrong priority: Adding a globally registered parser changes the set/order of parsers that AutoDetectParser discovers. ServiceLoader iteration order is not guaranteed, and changes in classpath/jar ordering in CI can affect parser selection/behavior. Even if the new parser isn't intended for ODT, it can still perturb overall parser discovery and embedded parsing flow, which in turn changes the emitted SAX events and thus the extracted phone-number order. Additionally, EncodeOCRParser currently advertises support for some non-ocr-* image types (image/jp2, image/jpx, image/x-portable-pixmap), which makes it "more eligible" than intended and increases the chance of unintended participation.
(I'm just posting this; I have no opinion on it myself.)
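The ordering concern quoted above can be illustrated with a small, generic sketch. It is not Tika's own loader code; the Plugin interface and the DeterministicDiscovery class are hypothetical names used only for illustration. The idea is to collect whatever ServiceLoader yields and then sort by class name, so that downstream behavior does not depend on classpath or jar order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.ServiceLoader;

public class DeterministicDiscovery {

    /** Hypothetical stand-in for a parser SPI; this is not a Tika type. */
    public interface Plugin {
        String name();
    }

    /** Loads all providers of a service, then orders them by class name. */
    public static <T> List<T> loadSorted(Class<T> service) {
        List<T> found = new ArrayList<>();
        // Iteration order here follows classpath layout, which can differ between machines.
        for (T provider : ServiceLoader.load(service)) {
            found.add(provider);
        }
        return sortByClassName(found);
    }

    /** Returns a copy of the list sorted by fully qualified class name. */
    public static <T> List<T> sortByClassName(List<T> providers) {
        List<T> sorted = new ArrayList<>(providers);
        sorted.sort(Comparator.comparing(p -> p.getClass().getName()));
        return sorted;
    }
}
```

With a stable sort key, two machines that discover the same providers in different orders end up with the same list, which is the property an order-sensitive test would rely on.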
Thanks for the review! I made it opt-in; that was actually the intended use.
Pull request overview
Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.
At a high level, we've added VLM inference hooks in 4.x: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java

And we also have VLM parsers with a "give me all the text" prompt that should yield similar results to OCR: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java

The other thing we've added is recursive embedded file extraction, so that you can get the List<Metadata>/JSON back and aim an emitter at a file share or S3, and Tika will write the bytes for embedded files there. You can configure it to output only images (I think?).

I understand that you might want to do post-processing/inference at a different stage, though, and this looks decent to me at a quick glance.
Interesting, I was not aware. Sounds like you have done a lot of work to enable image processing outside.
What I found is that base64 encoding is still not as fast as I would have hoped, so if I could change the text interface to reply with protobuf, that would be a substantial speed improvement.
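One measurable part of the base64 cost mentioned above is size growth: every 3 raw bytes become 4 text characters, so the payload is roughly a third larger before any text handling happens, whereas a binary format can carry the raw bytes unchanged. The sketch below only illustrates that arithmetic with the JDK's java.util.Base64; the class name is hypothetical and it does not measure encoding speed.

```java
import java.util.Base64;

public class Base64Overhead {

    /** Length of padded base64 text for a given number of raw bytes: 4 * ceil(n / 3). */
    public static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[3_000_000]; // stand-in for image bytes
        String encoded = Base64.getEncoder().encodeToString(raw);
        // Compare the predicted length with what the JDK encoder actually produced.
        System.out.println(raw.length + " raw bytes, predicted " + encodedLength(raw.length)
                + " chars, actual " + encoded.length() + " chars");
    }
}
```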
Force-pushed from a627cb2 to f6c28d6.
Pull request overview
Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.
Force-pushed from f6c28d6 to 57c3232.
tballison left a comment
This looks really good for your use case, and we've determined that nothing we currently have meets your needs.
Would you be willing to move this to tika-parsers-extended, perhaps?
Add a new parser module (tika-parser-ocr-encode-module) under tika-parsers-extended that base64-encodes image content instead of performing OCR text extraction. This is useful when image data needs to be preserved in the parsed output for downstream processing by an external OCR service. The module handles the same media types as TesseractOCRParser (ocr-png, ocr-jpeg, ocr-tiff, etc.) and supports configurable file size limits and per-parse image count limits via EncodeOCRConfig. Includes 27 unit tests covering encoding, skip-OCR, file size filtering, image limits, supported types, config clone-and-update, and base64 round-trip validation.
Force-pushed from 57c3232 to 6a877fd.
Sure, no problem, done. Thanks!
Sure, done.
Summary
Adds a new tika-parser-ocr-encode-module that base64-encodes image content instead of performing local OCR text extraction. Intended for pipelines that hand the image off to an external OCR/VLM service downstream.

The module is opt-in: it registers no SPI, so callers enable it explicitly via tika-config.xml/tika-config.json or by passing the parser directly.

Output
Each image is emitted as a base64 block delimited by in-body markers and wrapped in <div class="ocr">. The in-body markers are load-bearing: downstream consumers read the BodyContentHandler-stripped plain text and need them to locate and reassemble base64 blocks in document order.

Config (EncodeOCRConfig)
- minFileSizeToOcr / maxFileSizeToOcr: size gates (default max: 100 MB)
- maxImagesToOcr: per-parse cap (default: 50)
- skipOcr: runtime disable
- inlineContent: inline embedded images

Test plan
- mvn test -pl tika-parsers/.../tika-parser-ocr-encode-module: all tests pass
- mvn clean install -am -DskipTests: full build succeeds
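
The Output section says downstream consumers find base64 blocks in the stripped plain text by their markers. A minimal sketch of such a consumer follows. The marker strings START and END are hypothetical placeholders, since the real markers are defined by EncodeOCRParser and are not shown in this conversation; the class name is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Base64BlockExtractor {

    // HYPOTHETICAL markers for illustration only; substitute the strings the parser really emits.
    public static final String START = "[[IMAGE_BASE64_START]]";
    public static final String END = "[[IMAGE_BASE64_END]]";

    // Non-greedy match so consecutive blocks are captured separately, in document order.
    private static final Pattern BLOCK = Pattern.compile(
            Pattern.quote(START) + "(.*?)" + Pattern.quote(END), Pattern.DOTALL);

    /** Returns the decoded bytes of every marker-delimited block found in the text. */
    public static List<byte[]> extract(String plainText) {
        List<byte[]> images = new ArrayList<>();
        Matcher m = BLOCK.matcher(plainText);
        while (m.find()) {
            // The MIME decoder skips line breaks and other non-alphabet characters
            // that a text content handler may have introduced around the payload.
            images.add(Base64.getMimeDecoder().decode(m.group(1)));
        }
        return images;
    }
}
```

Because matches are returned in the order they appear, the resulting list preserves the document order that the markers are meant to protect.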