Add OCR encode parser module by zamf · Pull Request #2769 · apache/tika

zamf · 2026-04-15T16:50:52Z

Summary

Adds a new tika-parser-ocr-encode-module that base64-encodes image content instead of performing local OCR text extraction. Intended for pipelines that hand the image off to an external OCR/VLM service downstream.

The module is opt-in: it registers no SPI, so callers enable it explicitly via tika-config.xml/tika-config.json or by passing the parser directly.

Output

Each image is emitted as:

<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>
<base64 payload>
<<<---IMAGE-BASE64-ENCODED-END--->>>

wrapped in <div class="ocr">. The in-body markers are load-bearing: downstream consumers read the BodyContentHandler-stripped plain text and need them to locate and reassemble base64 blocks in document order.
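
A minimal sketch of how a downstream consumer might locate the blocks in the plain text, in document order. The class name `Base64BlockExtractor` is hypothetical, and the sketch assumes the payload uses the standard base64 alphabet and may be surrounded or broken by whitespace; neither assumption is confirmed by this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical consumer-side helper; not part of the module in this PR.
final class Base64BlockExtractor {
    static final String BEGIN = "<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>";
    static final String END = "<<<---IMAGE-BASE64-ENCODED-END--->>>";

    /** Returns the decoded payloads in the order their markers appear in the text. */
    static List<byte[]> extract(String plainText) {
        List<byte[]> images = new ArrayList<>();
        int pos = 0;
        while (true) {
            int begin = plainText.indexOf(BEGIN, pos);
            if (begin < 0) {
                break; // no further blocks
            }
            int payloadStart = begin + BEGIN.length();
            int end = plainText.indexOf(END, payloadStart);
            if (end < 0) {
                break; // unterminated block: stop rather than guess
            }
            String payload = plainText.substring(payloadStart, end);
            // The MIME decoder skips line breaks and other non-alphabet characters.
            images.add(Base64.getMimeDecoder().decode(payload));
            pos = end + END.length();
        }
        return images;
    }

    public static void main(String[] args) {
        String b64 = Base64.getEncoder()
                .encodeToString("fake image bytes".getBytes(StandardCharsets.UTF_8));
        String text = "Page 1 text\n" + BEGIN + "\n" + b64 + "\n" + END + "\nPage 2 text\n";
        System.out.println("blocks found: " + extract(text).size());
    }
}
```

Because the scan moves strictly forward through the text, the index of each returned payload matches the position of its image in the document.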

Config (`EncodeOCRConfig`)

minFileSizeToOcr / maxFileSizeToOcr — size gates (default max: 100 MB)
maxImagesToOcr — per-parse cap (default: 50)
skipOcr — runtime disable
inlineContent — inline embedded images
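
The interaction of the size gates, the per-parse cap, and the skip flag can be sketched as follows. This is an illustration only, not the code in `EncodeOCRConfig` or `EncodeOCRParser`: the class `EncodeGateSketch`, its method names, the inclusive bounds, and the choice not to count rejected images toward the cap are all assumptions.

```java
// Hypothetical model of the gating options; names and semantics are assumed.
final class EncodeGateSketch {
    private final long minBytes;   // models minFileSizeToOcr
    private final long maxBytes;   // models maxFileSizeToOcr
    private final int maxImages;   // models maxImagesToOcr
    private final boolean skip;    // models skipOcr
    private int encodedSoFar = 0;  // images accepted during this parse

    EncodeGateSketch(long minBytes, long maxBytes, int maxImages, boolean skip) {
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
        this.maxImages = maxImages;
        this.skip = skip;
    }

    /** Returns true and counts the image if it passes every gate. */
    boolean tryAccept(long sizeBytes) {
        if (skip) {
            return false; // runtime disable
        }
        if (sizeBytes < minBytes || sizeBytes > maxBytes) {
            return false; // outside the size window (bounds assumed inclusive)
        }
        if (encodedSoFar >= maxImages) {
            return false; // per-parse cap reached
        }
        encodedSoFar++;
        return true;
    }

    public static void main(String[] args) {
        // 100 MB is taken here as 100 * 1024 * 1024 bytes; the minimum of 0 is assumed.
        EncodeGateSketch gate = new EncodeGateSketch(0, 100L * 1024 * 1024, 50, false);
        System.out.println("accept 2 KB image: " + gate.tryAccept(2048));
    }
}
```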

Test plan

mvn test -pl tika-parsers/.../tika-parser-ocr-encode-module — all tests pass
mvn clean install -am -DskipTests — full build succeeds
CI pipeline

Copilot

Pull request overview

This PR introduces a new standard parser module, tika-parser-ocr-encode-module, intended to route OCR image types to a parser that base64-encodes the original image bytes into the extracted XHTML output (instead of running text OCR), enabling downstream OCR processing.

Changes:

Added new EncodeOCRParser + EncodeOCRConfig implementation and SPI registration.
Registered the new module in the standard modules reactor, BOM, and standard package.
Added unit tests and small image fixtures to validate base64 output, config behavior, and supported types.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.

File	Description
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml	Adds dependency on the new OCR-encode parser module to the standard package.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/pom.xml	Registers the new module in the standard-modules build.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/pom.xml	New module POM defining artifact, dependencies, and build settings.
tika-bom/pom.xml	Adds the new module to the BOM for dependency management.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/java/.../EncodeOCRParser.java	Implements the parser that emits base64-encoded image bytes wrapped in markers.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/java/.../EncodeOCRConfig.java	Adds configuration options (size limits, max images, skip, inline content).
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser	Registers the parser via Java SPI.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/java/.../EncodeOCRParserTest.java	Unit tests for base64 output markers, limits, skip behavior, and supported types.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/java/.../EncodeOCRConfigTest.java	Unit tests for config defaults, validation, and clone/update behavior.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-documents/testOCR_encode.png	PNG fixture for encoding tests.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-documents/testOCR_encode.jpg	JPEG fixture for encoding tests.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-skip.xml	Example config intended to enable skip behavior (but has a param naming issue).
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-partial.xml	Example config exercising partial configuration.
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-full.xml	Example config exercising full configuration (but has a param naming issue).

THausherr · 2026-04-16T08:47:22Z

Copilot claims we should adjust the failing test. The failure is related to the file testPhoneNumberExtractor.odt. For some reason the extraction returns results in a different order than on my machine. (I don't see how this is related to the proposed change.)

THausherr · 2026-04-16T08:56:25Z

Here's what Copilot says after I told it that adjusting the test is the wrong priority:

Adding a globally registered parser changes the set/order of parsers that AutoDetectParser discovers. ServiceLoader iteration order is not guaranteed, and changes in classpath/jar ordering in CI can affect parser selection/behavior. Even if the new parser isn’t intended for ODT, it can still perturb overall parser discovery and embedded parsing flow, which in turn changes the emitted SAX events and thus the extracted phone-number order.

Additionally, EncodeOCRParser currently advertises support for some non-ocr-* image types (image/jp2, image/jpx, image/x-portable-pixmap), which makes it “more eligible” than intended and increases the chance of unintended participation.

Recommended fix (don't weaken the test; fix the regression)

1. Restrict EncodeOCRParser supported types to only image/ocr-* (opt-in via override), removing the non-ocr- image types. That keeps it from being considered for generic image parsing and reduces collateral changes in parse output.

If the intent is truly opt-in-only: consider removing the ServiceLoader registration and requiring explicit inclusion via tika-config.xml. That completely avoids global side effects on unrelated parsing/tests.

(I'm just posting this; I have no opinion on it myself.)
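
The ordering concern raised in the Copilot reply can be illustrated with a toy model. This is not Tika's parser resolution code: `ParserOrderSketch`, `ToyParser`, and `GenericImageParser` are hypothetical names, and the "last registration wins" rule is an assumption chosen only to show why overlapping supported types make the outcome depend on discovery order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; real parser selection in Tika may differ.
final class ParserOrderSketch {
    static final class ToyParser {
        final String name;
        final List<String> supportedTypes;

        ToyParser(String name, String... supportedTypes) {
            this.name = name;
            this.supportedTypes = Arrays.asList(supportedTypes);
        }
    }

    /** Maps each media type to a parser name; later parsers overwrite earlier ones. */
    static Map<String, String> resolve(List<ToyParser> discoveryOrder) {
        Map<String, String> byType = new LinkedHashMap<>();
        for (ToyParser parser : discoveryOrder) {
            for (String type : parser.supportedTypes) {
                byType.put(type, parser.name);
            }
        }
        return byType;
    }

    public static void main(String[] args) {
        ToyParser generic = new ToyParser("GenericImageParser", "image/jp2");
        ToyParser encode = new ToyParser("EncodeOCRParser", "image/ocr-png", "image/jp2");
        // The same two parsers, discovered in opposite orders.
        System.out.println(resolve(Arrays.asList(generic, encode)).get("image/jp2"));
        System.out.println(resolve(Arrays.asList(encode, generic)).get("image/jp2"));
    }
}
```

In this model, a parser that claims only image/ocr-* types never collides with the generic one, which is the motivation behind the first recommendation.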

zamf · 2026-04-16T11:34:12Z

Thanks for the review! I made it opt-in, that was actually the intended use.

Copilot

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

tballison · 2026-04-16T15:28:52Z

At a high level, we've added vlm inference hooks in 4.x: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java

And we also have vlm parsers with a "give me all the text" prompt that should yield similar results to OCR: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java

The other thing we've added is recursive embedded file extraction so that you can get the List<Metadata>/json back and aim an emitter at a file share or s3, and Tika will write the bytes for embedded files there. You can configure it to output only images (I think?).

I understand that you might want to do post-processing/inference at a different stage, though, and this looks decent to me on a quick glance.

zamf · 2026-04-16T15:45:33Z

Interesting, I was not aware. Sounds like you have done a lot of work to enable image processing outside Tika.
I wanted something that:

allows post-processing at a different stage
does not use HTTP, so that it frees up memory quickly and does not have to wait for a VLM to reply
is order-preserving, so that it is easy to reassemble the full document later
does not rely on Tika having access to storage

What I found is that base64 encoding is still not as fast as I would have hoped, so if I could change the text interface to reply with protobuf, that would be a substantial speed improvement.
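
One likely contributor to that cost is size: padded base64 produces 4 output characters for every 3 input bytes, so the text payload is roughly a third larger than the raw image. The sketch below only demonstrates that size relationship; it does not measure encoding speed, and the class name `Base64Overhead` is hypothetical.

```java
import java.util.Base64;

// Illustrates base64 size overhead; not a benchmark.
final class Base64Overhead {
    /** Number of characters in padded base64 output for n input bytes. */
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[1_000_000]; // stand-in for image bytes
        String encoded = Base64.getEncoder().encodeToString(raw);
        System.out.println(raw.length + " raw bytes -> " + encoded.length() + " base64 chars");
        System.out.println("formula predicts: " + encodedLength(raw.length));
    }
}
```

A binary container such as protobuf can carry the raw bytes without this expansion, which is consistent with the improvement anticipated above.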

Copilot

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

tballison

This looks really good for your use case, and we've determined that nothing we currently have meets your needs.

Would you be willing to move this to tika-parsers-extended, perhaps?

Add a new parser module (tika-parser-ocr-encode-module) under tika-parsers-extended that base64-encodes image content instead of performing OCR text extraction. This is useful when image data needs to be preserved in the parsed output for downstream processing by an external OCR service. The module handles the same media types as TesseractOCRParser (ocr-png, ocr-jpeg, ocr-tiff, etc.) and supports configurable file size limits and per-parse image count limits via EncodeOCRConfig. Includes 27 unit tests covering encoding, skip-OCR, file size filtering, image limits, supported types, config clone-and-update, and base64 round-trip validation.

zamf · 2026-04-17T21:33:53Z

Sure, no problem, done. Thanks!

zamf · 2026-04-18T11:25:07Z

Sure, done.

THausherr requested a review from Copilot April 16, 2026 02:55

Copilot started reviewing on behalf of THausherr April 16, 2026 02:55

Copilot AI reviewed Apr 16, 2026

THausherr requested a review from Copilot April 16, 2026 12:31

Copilot started reviewing on behalf of THausherr April 16, 2026 12:31

Copilot AI reviewed Apr 16, 2026