Add OCR encode parser module #2769

Merged: tballison merged 1 commit into apache:main from zamf:add-ocr-encode-module on Apr 20, 2026

zamf commented Apr 15, 2026

Summary

Adds a new tika-parser-ocr-encode-module that base64-encodes image content instead of performing local OCR text extraction. Intended for pipelines that hand the image off to an external OCR/VLM service downstream.

The module is opt-in: it registers no SPI, so callers enable it explicitly via tika-config.xml/tika-config.json or by passing the parser directly.

Output

Each image is emitted as:

<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>
<base64 payload>
<<<---IMAGE-BASE64-ENCODED-END--->>>

wrapped in <div class="ocr">. The in-body markers are load-bearing: downstream consumers read the BodyContentHandler-stripped plain text and need them to locate and reassemble base64 blocks in document order.
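A minimal sketch of how a downstream consumer might locate these blocks in the stripped plain text. It is an illustration only: the function name extract_images is hypothetical, and it assumes the payload is standard base64 that may be surrounded or wrapped by whitespace.

```python
import base64
import re
from typing import List

# Marker strings copied from the PR description.
BEGIN = "<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>"
END = "<<<---IMAGE-BASE64-ENCODED-END--->>>"

# Non-greedy match so that consecutive blocks are not merged into one.
BLOCK_RE = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def extract_images(plain_text: str) -> List[bytes]:
    """Return the decoded bytes of every marked block, in document order.

    Hypothetical helper, not part of Tika.
    """
    images = []
    for match in BLOCK_RE.finditer(plain_text):
        # Drop newlines or spaces around and inside the payload (assumption:
        # the payload may be line-wrapped by intermediate tooling).
        payload = "".join(match.group(1).split())
        images.append(base64.b64decode(payload, validate=True))
    return images
```

Because finditer scans left to right, the returned list follows the order in which the images appear in the document.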

Config (EncodeOCRConfig)

  • minFileSizeToOcr / maxFileSizeToOcr — size gates (default max: 100 MB)
  • maxImagesToOcr — per-parse cap (default: 50)
  • skipOcr — runtime disable
  • inlineContent — inline embedded images
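
The gating options above can be read as a simple decision per image. The sketch below is a stand-in model, not the real EncodeOCRConfig: the class and function names are hypothetical, and it assumes inclusive size bounds, a minimum size default of 0, and that "100 MB" means 100 * 1024 * 1024 bytes. None of these details are stated in the PR description.

```python
from dataclasses import dataclass


@dataclass
class EncodeConfigModel:
    """Hypothetical model of the options listed above (not the Tika class)."""
    min_file_size: int = 0                   # assumed default
    max_file_size: int = 100 * 1024 * 1024   # "100 MB", assumed to be binary MB
    max_images: int = 50                     # per-parse cap from the PR text
    skip_ocr: bool = False                   # runtime disable


def should_encode(size_bytes: int, images_so_far: int, cfg: EncodeConfigModel) -> bool:
    """Decide whether one image would be encoded under the assumed semantics."""
    if cfg.skip_ocr:
        return False
    if images_so_far >= cfg.max_images:
        return False
    # Inclusive bounds are an assumption of this sketch.
    return cfg.min_file_size <= size_bytes <= cfg.max_file_size
```

For example, with the defaults a 1 KB image is encoded unless 50 images have already been emitted in the same parse.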

Test plan

  • mvn test -pl tika-parsers/.../tika-parser-ocr-encode-module — all tests pass
  • mvn clean install -am -DskipTests — full build succeeds
  • CI pipeline

Copilot AI left a comment

Pull request overview

This PR introduces a new standard parser module, tika-parser-ocr-encode-module, intended to route OCR image types to a parser that base64-encodes the original image bytes into the extracted XHTML output (instead of running text OCR), enabling downstream OCR processing.

Changes:

  • Added new EncodeOCRParser + EncodeOCRConfig implementation and SPI registration.
  • Registered the new module in the standard modules reactor, BOM, and standard package.
  • Added unit tests and small image fixtures to validate base64 output, config behavior, and supported types.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.

Summary per file:

  • tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml: Adds dependency on the new OCR-encode parser module to the standard package.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/pom.xml: Registers the new module in the standard-modules build.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/pom.xml: New module POM defining artifact, dependencies, and build settings.
  • tika-bom/pom.xml: Adds the new module to the BOM for dependency management.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/java/.../EncodeOCRParser.java: Implements the parser that emits base64-encoded image bytes wrapped in markers.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/java/.../EncodeOCRConfig.java: Adds configuration options (size limits, max images, skip, inline content).
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser: Registers the parser via Java SPI.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/java/.../EncodeOCRParserTest.java: Unit tests for base64 output markers, limits, skip behavior, and supported types.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/java/.../EncodeOCRConfigTest.java: Unit tests for config defaults, validation, and clone/update behavior.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-documents/testOCR_encode.png: PNG fixture for encoding tests.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-documents/testOCR_encode.jpg: JPEG fixture for encoding tests.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-skip.xml: Example config intended to enable skip behavior (but has a param naming issue).
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-partial.xml: Example config exercising partial configuration.
  • tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-encode-module/src/test/resources/test-configs/tika-config-encodeocr-full.xml: Example config exercising full configuration (but has a param naming issue).

@THausherr

Copilot claims we should adjust the failing test. It is related to the file testPhoneNumberExtractor.odt. For some reason the extraction returns results in a different order than on my machine. (I don't see how this is related to the proposed change.)

@THausherr

Here's what Copilot says after I told it that adjusting the test is the wrong priority:

Adding a globally registered parser changes the set/order of parsers that AutoDetectParser discovers. ServiceLoader iteration order is not guaranteed, and changes in classpath/jar ordering in CI can affect parser selection/behavior. Even if the new parser isn’t intended for ODT, it can still perturb overall parser discovery and embedded parsing flow, which in turn changes the emitted SAX events and thus the extracted phone-number order.

Additionally, EncodeOCRParser currently advertises support for some non-ocr-* image types (image/jp2, image/jpx, image/x-portable-pixmap), which makes it “more eligible” than intended and increases the chance of unintended participation.
Recommended fix (don’t weaken the test; fix the regression)

1. Restrict EncodeOCRParser supported types to only image/ocr-* (opt-in via override), removing the non-ocr-* image types. That keeps it from being considered for generic image parsing and reduces collateral changes in parse output.
2. If the intent is truly opt-in-only: consider removing the ServiceLoader registration and requiring explicit inclusion via tika-config.xml. That completely avoids global side effects on unrelated parsing/tests.

(I'm just posting this; I have no opinion myself on this.)

zamf commented Apr 16, 2026

Thanks for the review! I made it opt-in, that was actually the intended use.

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

tballison commented Apr 16, 2026

At a high level, we've added VLM inference hooks in 4.x: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java

And we also have VLM parsers with a "give me all the text" prompt that should yield similar results to OCR: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java

The other thing we've added is recursive embedded file extraction so that you can get the List<Metadata>/JSON back and aim an emitter at a file share or S3, and Tika will write the bytes for embedded files there. You can configure it to output only images (I think?).

I understand that you might want to do post-processing/inference at a different stage, though, and this looks decent to me on a quick glance.

zamf commented Apr 16, 2026

Interesting, I was not aware. Sounds like you have done a lot of work to enable image processing outside of Tika.
I wanted something that:

  1. allows post-processing at a different stage
  2. does not use HTTP, so that it frees up memory quickly and does not have to wait for a VLM to reply
  3. is order-preserving, so that it is easy to reassemble the full document later
  4. does not rely on Tika having access to storage

What I found is that base64 encoding is still not as fast as I would have hoped, so if I could change the text interface to reply with protobuf, that would be a substantial speed improvement.
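
The order-preserving requirement can be illustrated with a small sketch of a later stage that swaps each base64 block for text produced elsewhere. This is an assumption-laden illustration, not code from the PR: reassemble is a hypothetical name, and the ocr callable stands in for whatever external OCR/VLM service is used.

```python
import base64
import re
from typing import Callable

# Marker strings as given in the PR description.
BEGIN = "<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>"
END = "<<<---IMAGE-BASE64-ENCODED-END--->>>"
BLOCK_RE = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def reassemble(plain_text: str, ocr: Callable[[bytes], str]) -> str:
    """Replace every marked block, in place, with the text returned by ocr.

    Hypothetical helper: the surrounding text is left untouched, so the
    recognized text lands exactly where the image sat in the document.
    """
    def replace_block(match: "re.Match[str]") -> str:
        # Strip whitespace around and inside the payload before decoding.
        payload = "".join(match.group(1).split())
        return ocr(base64.b64decode(payload))

    return BLOCK_RE.sub(replace_block, plain_text)
```

Since the substitution happens at each block's original position, no separate index or storage lookup is needed to restore document order.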

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

zamf force-pushed the add-ocr-encode-module branch from f6c28d6 to 57c3232 on April 17, 2026 12:08

tballison left a comment

This looks really good for your use case, and we've determined that nothing we currently have meets your needs.

Would you be willing to move this to tika-parsers-extended, perhaps?

Add a new parser module (tika-parser-ocr-encode-module) under
tika-parsers-extended that base64-encodes image content instead of
performing OCR text extraction. This is useful when image data needs
to be preserved in the parsed output for downstream processing by an
external OCR service.

The module handles the same media types as TesseractOCRParser
(ocr-png, ocr-jpeg, ocr-tiff, etc.) and supports configurable
file size limits and per-parse image count limits via
EncodeOCRConfig.

Includes 27 unit tests covering encoding, skip-OCR, file size
filtering, image limits, supported types, config clone-and-update,
and base64 round-trip validation.

zamf force-pushed the add-ocr-encode-module branch from 57c3232 to 6a877fd on April 17, 2026 21:22

zamf commented Apr 17, 2026

Sure, no problem, done. Thanks!

zamf commented Apr 18, 2026

Sure, done.

tballison merged commit 7a039f3 into apache:main on Apr 20, 2026
5 checks passed