TIKA-4766: Typed parse response grpc #2916

Draft
krickert wants to merge 4 commits into apache:main from ai-pipestream:typed-parse-response-grpc


@krickert krickert commented Jun 30, 2026


Summary

This change replaces the flat map<string,string> on FetchAndParseReply with a
typed org.apache.tika.grpc.v1.ParseResponse. Parse metadata is mapped from Tika
Metadata into protobuf messages aligned with Tika property interfaces (PDF, Office,
HTML, and the other supported formats), with Dublin Core normalized at the response
root and Creative Commons licensing carried on a dedicated field when present.

The work is split into three Maven modules so clients can depend on the schema and
generated stubs without pulling in the server, and so mapping logic stays testable
outside the gRPC layer:

  • tika-grpc-api: protobuf sources, Java generation, bundled FileDescriptorSet
  • tika-grpc-mapper: ParseResponseMapper and format builders; optional
    ParseResponseDecorator for future extensions (for example document outlines)
  • tika-grpc: existing service; FetchAndParseReply.parse_response replaces
    the removed fields map

This is a breaking change for gRPC clients that read string keys from fields.
Migration is documented in tika-grpc-api/README.md and summarized in
tika-grpc/README.md.

Architecture

Module dependencies

flowchart TB
  subgraph clients [Clients]
    C[gRPC / Java clients]
  end

  subgraph server [tika-grpc]
    S[TikaGrpcServerImpl]
    P[Tika Pipes workers]
  end

  subgraph mapper [tika-grpc-mapper]
    M[ParseResponseMapper]
    B[Format metadata builders]
    D[ParseResponseDecorator optional]
  end

  subgraph api [tika-grpc-api]
    PR[parse_response.proto and format protos]
    FD[FileDescriptorSet in META-INF]
    J[Generated Java stubs]
  end

  subgraph tika [Tika core]
    AD[AutoDetectParser / Pipes]
    MD[Metadata]
  end

  C -->|FetchAndParse| S
  S --> P
  P --> AD
  AD --> MD
  S --> M
  M --> B
  M --> D
  M --> PR
  B --> PR
  PR --> J
  PR --> FD
  S -->|FetchAndParseReply.parse_response| C

Parse pipeline

sequenceDiagram
  participant Client
  participant Grpc as TikaGrpcServerImpl
  participant Pipes as Tika Pipes
  participant Parser as Tika parser
  participant Mapper as ParseResponseMapper

  Client->>Grpc: FetchAndParseRequest
  Grpc->>Pipes: fetch and parse
  Pipes->>Parser: parse bytes
  Parser-->>Pipes: Metadata, body text
  Pipes-->>Grpc: parse result
  Grpc->>Mapper: map metadata, body, status
  Mapper->>Mapper: detect format oneof
  Mapper->>Mapper: build dublin_core
  Mapper->>Mapper: optional CC overlay
  Mapper-->>Grpc: ParseResponse
  Grpc-->>Client: FetchAndParseReply.parse_response

ParseResponse layout

classDiagram
  class ParseResponse {
    string parse_id
    ParseStatus status
    ParseContent content
    DublinCoreMetadata dublin_core
    oneof document_metadata
    CreativeCommonsMetadata creative_commons
    repeated EmbeddedDocument embedded_docs
  }

  class ParseContent {
    string body
    string title
  }

  class PdfMetadata
  class OfficeMetadata
  class HtmlMetadata
  class ImageMetadata
  class GenericMetadata

  ParseResponse --> ParseContent
  ParseResponse --> DublinCoreMetadata
  ParseResponse --> PdfMetadata : pdf
  ParseResponse --> OfficeMetadata : office
  ParseResponse --> HtmlMetadata : html
  ParseResponse --> ImageMetadata : image
  ParseResponse --> GenericMetadata : generic

Breaking API change

| Removed | Replacement |
| --- | --- |
| FetchAndParseReply.fields (field 2, reserved) | FetchAndParseReply.parse_response (field 5) |

Example client reads:

  • Body text: parse_response.content.body
  • Title: parse_response.content.title or parse_response.dublin_core.title
  • PDF producer: parse_response.pdf.doc_info_producer
  • Parse status: parse_response.status
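
A minimal sketch of what the read-side migration could look like for a Java client. The nested classes are hypothetical stand-ins for the generated stubs (the real classes come from tika-grpc-api and are not reproduced here), and the string key used for the old map is an assumed example; only the field paths content.body and pdf.doc_info_producer come from the list above.

```java
import java.util.Map;

// Hypothetical stand-ins for the generated stubs; only the field paths
// (content.body, pdf.doc_info_producer) are taken from the PR description.
final class ClientReadSketch {

    static final class Content {
        final String body;
        final String title;

        Content(String body, String title) {
            this.body = body;
            this.title = title;
        }
    }

    static final class Pdf {
        final String docInfoProducer;

        Pdf(String docInfoProducer) {
            this.docInfoProducer = docInfoProducer;
        }
    }

    static final class ParseResponse {
        final Content content;
        final Pdf pdf; // null when the document carries no PDF metadata

        ParseResponse(Content content, Pdf pdf) {
            this.content = content;
            this.pdf = pdf;
        }
    }

    // Before: string-keyed lookup. The key is an assumed example;
    // a misspelled key simply returns null.
    static String producerFromFields(Map<String, String> fields) {
        return fields.get("pdf:docinfo:producer");
    }

    // After: typed access mirroring parse_response.pdf.doc_info_producer.
    static String producerFromResponse(ParseResponse response) {
        return response.pdf == null ? null : response.pdf.docInfoProducer;
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("pdf:docinfo:producer", "ExampleWriter 1.0");
        ParseResponse response = new ParseResponse(
                new Content("Hello, Tika.", "Sample"), new Pdf("ExampleWriter 1.0"));

        System.out.println(producerFromFields(fields));
        System.out.println(producerFromResponse(response));
        System.out.println(response.content.body);
    }
}
```

With the typed shape, a misspelled accessor fails at compile time, whereas a misspelled map key only shows up as a null at runtime.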

Modules added or updated

| Path | Notes |
| --- | --- |
| tika-grpc-api/ | ~17 proto files under org/apache/tika/grpc/v1/; buf lint config; descriptor bundle |
| tika-grpc-mapper/ | Builders ported from prior Pipestream mapper work; 35 unit tests against Tika test fixtures |
| tika-grpc/ | Depends on api + mapper; TikaGrpcServerImpl uses ParseResponseMapper |
| tika-bom/pom.xml | Lists new artifacts |
| Root pom.xml | Reactor modules |

Test plan

  • ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
  • Confirm FetchAndParseReply no longer exposes fields; clients read parse_response
  • Parse a PDF and an HTML sample via gRPC; verify typed pdf / html oneof and content.body
  • Confirm tika-grpc-api jar contains META-INF/org.apache.tika.grpc.v1.descriptors
  • Review breaking change note with downstream consumers before release

Follow-up (not in this PR)

  • Outline decorators (ParseResponseDecorator) for PDF/HTML heading trees when proto fields are added
  • Expand ClimateForecastMetadataBuilder beyond additional_struct mapping
  • Downstream consumer updates in separate repositories

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (TIKA-XXXX)
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running ./mvnw clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add new module that downstream users will depend upon add it to relevant group in tika-bom/pom.xml.

We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

krickert added 3 commits June 29, 2026 21:19
Add tika-grpc-api (protobuf schema and descriptors) and tika-grpc-mapper
(Tika Metadata to ParseResponse). Wire tika-grpc to return parse_response on
fetch-and-parse RPCs. Include mapper tests and README updates.
Default to typed-parse-response-grpc and fork remote; refuse commits on main.
@krickert krickert changed the title Typed parse response grpc TIKA-4766: Typed parse response grpc Jun 30, 2026
@krickert

Author

@tballison @nddipiazza I'd love a chance to go over this more w/ya - give me a chance to clean it up a bit first.

I lifted the model I've been running live on my own OSS project and moved it here - the models barely changed over time and I've done well with keeping up with it. I'll clean this up a bit more but wanted to start the conversation about this.

This is only a draft - so open to anything. My goals are simple:

  1. Strongly type the response so I'm not pulling out my hair from parsing a second time when it's already strongly typed in Tika.

That's about it. I would tie this into https://github.com/apache/opennlp-sandbox/tree/OPENNLP-1833-grpc-expansion

This will allow for a seamless pipeline from tika -> opennlp -> whatever in the world would like NLP data and embeddings.

@tballison

Contributor

I've only had a chance to look at this briefly.

The challenges as you know: a) under the hood, we're currently storing metadata keys as strings, b) metadata keys are an open set with user-customizable keys, c) there are a lot of keys.

This was claude's review

What's legitimate

  - Replacing a flat map<string,string> with a typed contract is a reasonable thing to want — string-keyed maps are a weak contract for gRPC clients.
  - The tika-grpc-api / tika-grpc-mapper / tika-grpc module split (clients depend on schema, not server) is good hygiene.
  - Unmapped keys aren't simply dropped: each format message has an additional_metadata Struct (e.g. pdf_metadata.proto:253) and the builders fill it via
  MetadataUtils.buildAdditionalMetadata(..., mappedFields). So it's not as lossy as the "replace the map" framing first suggests.

  What doesn't hold up

  1. Dead fields in the public wire contract. ParseResponse.raw_properties (field 14) and repeated MetadataEntry metadata (field 12) are declared in the proto with
  comments promising "anything not captured in typed metadata" — but the mapper populates neither (0 references in ParseResponseMapper.java). Shipping public proto
  fields that are always empty is a contract smell; a client will reasonably read raw_properties and get nothing.

  2. Four overlapping places a value can live. typed format field → per-format additional_metadata Struct → dublin_core → creative_commons, plus the two dead
  channels. There's no single predictable place to find a given property. That's harder to consume correctly than the map it replaces.

  3. It re-detects the format Tika already told you. DocumentTypeDetector.detect() (316 lines of heuristics over metadata) chooses the oneof branch — but Tika
  already set Content-Type. Re-deriving it heuristically is fragile and duplicative, and it's called twice (mapper lines 198 and 327).

  4. The oneof document_metadata fights Tika's data model. Tika metadata routinely spans namespaces (a PDF carrying XMP/DC, EXIF on embedded images, …). Forcing
  exactly one format bucket is precisely why DC and CC had to be pulled out of the oneof as special cases — the model is already breaking under its own weight, and
  every new cross-cutting namespace adds another special case.

  5. A hand-maintained parallel copy of Tika's entire metadata taxonomy. ~5,000 lines of proto + ~4,000 lines of builders that must track org.apache.tika.metadata.*
  by hand forever. When core adds/renames a Property, the typed field silently desyncs and the value quietly drops into additional_metadata. That's a large
  permanent maintenance tax for marginal gain over a good catch-all.

  6. Fidelity loss. Catch-alls use google.protobuf.Struct, which is JSON-ish (numbers are doubles, no native String[]); Tika metadata is uniformly multivalued
  strings. And small tells confirm looseness: hardcoded keys "Keywords"/"Content-Length" instead of Property constants, and content.keywords is a scalar though Tika
  keywords are multivalue.

  7. Scope creep past "typed parse response." EpubStructureExtractor (483 lines) and the detector live in the gRPC mapper — that's parser/detection logic migrating
  into the serialization layer, the wrong altitude. WARC/font/climate-forecast/database schemas widen the surface a lot for a first cut. There's also a
  commit-typed-parse-response.sh script checked in, which doesn't belong in the PR.

My recommendation

  The shape I'd push toward is augment, not replace:
  - Keep one lossless, uniform, actually-populated channel as source of truth (a repeated key/value that preserves multivalue), then layer a small curated set of
  typed convenience fields (DC + a dozen genuinely common PDF/office fields) on top.
  - Trust Content-Type; drop the format oneof in favor of coexisting sub-messages (or just DC + raw).
  - Don't move EPUB structure extraction / type detection into the mapper.
  - Wire or delete the dead fields, and stage the break (add parse_response alongside fields, deprecate, then remove) rather than a 10.7k-line breaking mega-PR.

Point 5 gives me great concern from a maintenance perspective. I haven't looked carefully enough to know if this is real.

Addresses the PR apache#2916 review by turning the typed ParseResponse into a
clean, schema-first gRPC contract instead of a REST-shaped envelope.

Contract (tika-grpc-api):
- Drop the `document_metadata` oneof in favour of independent `optional`
  format submessages. Tika metadata spans namespaces, so forcing exactly
  one bucket was wrong and is why Dublin Core / Creative Commons had to be
  special-cased. Field numbers are preserved, so the change is
  wire-compatible. Format submessages now coexist as peers.
- Remove every google.protobuf.Struct catch-all: the per-format
  `additional_metadata` (incl. climate `additional_scientific_metadata`
  and CC `additional_rights_metadata`), `BaseFields.raw_metadata`, and
  `GenericMetadata`'s four Structs. Removed field numbers are `reserved`.
- Single lossless channel: `ParseResponse.metadata` — one typed
  `MetadataEntry` per Tika key, multivalue-preserving via `text_list`,
  with values coerced via Tika `Property` types (int/double/bool/date).
- Envelope `content_type` (canonical MIME) and typed `primary_format`
  (`DocumentFormatCategory`) so clients branch on an enum, not a string.

Mapper (tika-grpc-mapper):
- Populate the metadata mirror; stop emitting Struct catch-alls.
- Use Tika Property constants instead of string literals
  ("Keywords"/"Content-Length"/"resourceName"); join multivalue keywords.
- Detector reads `Metadata.CONTENT_TYPE` and is called once; maps to the
  typed `DocumentFormatCategory`.

Tests / docs:
- Retarget former catch-all assertions to the typed metadata mirror.
- Add DocumentFormatCategoryTest and ParseResponseMetadataPopulationTest.
- Add ParseResponseCoexistenceTest proving (a) a format submessage and the
  Creative Commons overlay coexist on one response, and (b) a parent and
  its embedded child carry distinct typed formats at the correct altitude
  (parent PDF + child IMAGE via embedded_docs[].parsed_content).
- Update READMEs and mapper design docs; remove the stray
  commit-typed-parse-response.sh helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@krickert

Author

The Claude review was useful and I went through all 7 points. Pushed in 99883db0c.

Short version: the "augment, not replace" shape you recommended and where this landed are the same now. One lossless, populated channel as the source of truth (ParseResponse.metadata), a typed convenience layer on top, Content-Type trusted, and the oneof dropped for coexisting submessages.

Below is a summary of the changes that line up with the concerns that you and Claude pointed out (summary provided by Claude based on my initial write-up):

| # | Concern | Status | What changed |
| --- | --- | --- | --- |
| 1 | Dead wire fields | Fixed | metadata (field 12) populated; raw_properties (14) reserved; empty Structs removed |
| 2 | Four overlapping channels | Down to one | only ParseResponse.metadata is a catch-all now; the 13 per-format additional_metadata, BaseFields.raw_metadata, and 4 GenericMetadata Structs are gone (numbers reserved) |
| 3 | Re-detects format, twice | Fixed | single detect(), reads Content-Type first, exposed as a typed primary_format enum |
| 4 | oneof fights Tika's model | Fixed + test | oneof to coexisting optional peers; ParseResponseCoexistenceTest |
| 5 | Maintaining the mirror | Bounded | desync no longer drops data, see below |
| 6 | Fidelity loss (Struct) | Fixed | Struct gone; MetadataEntry keeps multivalue and native types; Property constants |
| 7 | Scope creep | Partial | helper script removed; EPUB extractor I can split if you want |

Write up (provided by me, edited for grammar by claude)

1. Dead fields. metadata is populated for every key now, raw_properties is reserved. ParseResponseMetadataPopulationTest asserts metadata.names().length == response.getMetadataCount(), so there's no always-empty public field anymore and it's guard-railed.

2. Overlapping channels. One catch-all: ParseResponse.metadata. Every per-format additional_metadata, BaseFields.raw_metadata (that one was a full dump embedded in every message), and the four GenericMetadata Structs are removed, field numbers reserved. Precedence is written up in tika-grpc-api/README.md.

3. Detection. One detect() call that leads with Content-Type (extension and CC are fallbacks only when MIME is missing), surfaced as DocumentFormatCategory primary_format so clients read an enum instead of re-deriving.
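
A rough sketch of the "lead with Content-Type, fall back only when MIME is missing" rule described in point 3. The enum constants, the MIME-to-category table, and the extension fallback are illustrative assumptions, not the PR's actual DocumentTypeDetector; the Creative Commons fallback is left out.

```java
import java.util.Locale;

final class FormatDetectionSketch {

    // Assumed constants; the real DocumentFormatCategory values are not shown in this thread.
    enum DocumentFormatCategory { PDF, HTML, IMAGE, OFFICE, GENERIC }

    // Content-Type decides whenever it is present; the file name is consulted
    // only when no MIME type was supplied.
    static DocumentFormatCategory detect(String contentType, String resourceName) {
        if (contentType != null && !contentType.isBlank()) {
            return fromMime(contentType);
        }
        return fromExtension(resourceName);
    }

    static DocumentFormatCategory fromMime(String contentType) {
        // Drop parameters such as "; charset=UTF-8" before matching.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mime.equals("application/pdf")) {
            return DocumentFormatCategory.PDF;
        }
        if (mime.equals("text/html") || mime.equals("application/xhtml+xml")) {
            return DocumentFormatCategory.HTML;
        }
        if (mime.startsWith("image/")) {
            return DocumentFormatCategory.IMAGE;
        }
        if (mime.startsWith("application/vnd.openxmlformats-officedocument")
                || mime.equals("application/msword")) {
            return DocumentFormatCategory.OFFICE;
        }
        return DocumentFormatCategory.GENERIC;
    }

    static DocumentFormatCategory fromExtension(String resourceName) {
        if (resourceName == null) {
            return DocumentFormatCategory.GENERIC;
        }
        String lower = resourceName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return DocumentFormatCategory.PDF;
        }
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return DocumentFormatCategory.HTML;
        }
        return DocumentFormatCategory.GENERIC;
    }

    public static void main(String[] args) {
        // The declared MIME type wins over a misleading file name.
        System.out.println(detect("application/pdf; charset=binary", "page.html"));
        // No MIME type: the extension is used as a fallback.
        System.out.println(detect(null, "report.PDF"));
    }
}
```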

4. The oneof. Gone, replaced with independent optional submessages (same field numbers, so it stays wire-compatible). They coexist, so DC and CC aren't special cases anymore. ParseResponseCoexistenceTest covers both a format submessage plus the CC overlay on one response, and a parent PDF with an embedded child typed IMAGE on its own parsed_content. That last bit is where "EXIF on embedded images" actually belongs, instead of forcing a second bucket onto the parent.

5. Maintenance. This is the big one. It's often the first "shock" people get when surfacing a gRPC interface, and it's also why most implementations never take off. I believe in keeping the ugly parsing details behind the contract and not pushing that work to the user. That's the center of the decision, and there's honestly no good compromise. (I'd avoid frameworks like Bazel, as that turns into just as much mapping scaffolding and you're at the whim of yet another layer; it's best to own it and go first class. LLMs are great at writing mapping code and will only get better.)

Two things keep it bounded:

  • No silent data loss when a field desyncs. The mirror is lossless and the test above proves every Tika key lands in metadata, typed. So if core renames or adds a Property and a typed field stops matching, the value doesn't disappear into a lossy Struct, it's still in metadata with the right type. Worst case is a convenience field is briefly missing, not a dropped value. That kills the specific failure mode the review called out.
  • It's cheap. I lifted this model from an OSS pipeline I run live and it's been about an hour a year to keep current (thanks to LLMs this is now easy). The interfaces don't move much at all. If they do, we still capture unknowns. (Also, I'll be glad to maintain this - as I need to maintain it anyway for other projects)
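
The "no silent data loss" claim above can be sketched as follows, assuming a plain Map as a stand-in for Tika Metadata and for the ParseResponse.metadata mirror. The builder key and the renamed key are hypothetical; the point is only that a typed convenience field can go empty while the mirror still carries the value.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class DesyncSketch {

    // Key a hypothetical typed builder was written against.
    static final String KNOWN_PRODUCER_KEY = "pdf:docinfo:producer";

    // Typed convenience field: only populated when the key still matches.
    static Optional<String> typedProducer(Map<String, List<String>> metadata) {
        List<String> values = metadata.get(KNOWN_PRODUCER_KEY);
        return (values == null || values.isEmpty())
                ? Optional.empty()
                : Optional.of(values.get(0));
    }

    // Lossless mirror: every key is copied, whether or not a typed field knows it.
    static Map<String, List<String>> mirror(Map<String, List<String>> metadata) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        metadata.forEach((key, values) -> out.put(key, List.copyOf(values)));
        return out;
    }

    public static void main(String[] args) {
        // Simulate core renaming the property (the new key name is made up).
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("pdf:producer", List.of("ExampleWriter 1.0"));

        System.out.println(typedProducer(metadata).isPresent()); // convenience field is missing
        System.out.println(mirror(metadata).get("pdf:producer")); // value is still delivered
        System.out.println(mirror(metadata).size() == metadata.size()); // count guard
    }
}
```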

And the typed layer is opt-in. If you want zero coupling to the taxonomy, you read metadata and ignore the typed fields. The mirror is the contract; the typed fields are convenience. My take is still that if the contract stays dynamic, JSON is the better fit: the whole reason to be on gRPC is the typed schema. Simply put, gRPC is pretty bad at being dynamic. So don't fight it: embrace the type safety, and downstream client code looks great and all those parsing and casting errors you've seen for years just go away.

6. Fidelity. No more Struct, so no JSON-ish doubles and no lost String[]. MetadataEntry keeps multivalue via text_list and native types (int64/double/bool/Timestamp) coerced from Tika Property definitions. Swapped the hardcoded "Keywords"/"Content-Length" for Property constants, and multivalue keywords are joined.
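
A minimal sketch of the coercion idea in point 6, not the PR's mapper. The ValueType enum and the coerce helper are hypothetical; the real code is described as deriving types from Tika Property definitions. In this sketch, multivalue keys stay a list of strings and any value that fails to parse is kept as its original text.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.List;

final class MetadataEntrySketch {

    // Hypothetical value kinds standing in for the types declared on a Tika Property.
    enum ValueType { TEXT, INTEGER, REAL, BOOLEAN, DATE }

    // Returns Long, Double, Boolean, Instant, String, or List<String>.
    static Object coerce(ValueType type, List<String> values) {
        if (values.size() != 1) {
            return List.copyOf(values); // multivalue is preserved as a text list
        }
        String raw = values.get(0);
        String trimmed = raw.trim();
        try {
            switch (type) {
                case INTEGER:
                    return Long.parseLong(trimmed);
                case REAL:
                    return Double.parseDouble(trimmed);
                case BOOLEAN:
                    if (trimmed.equalsIgnoreCase("true")) {
                        return Boolean.TRUE;
                    }
                    if (trimmed.equalsIgnoreCase("false")) {
                        return Boolean.FALSE;
                    }
                    return raw; // not a recognisable boolean: keep the text
                case DATE:
                    return Instant.parse(trimmed); // ISO-8601 instants only in this sketch
                default:
                    return raw;
            }
        } catch (NumberFormatException | DateTimeParseException e) {
            return raw; // unparseable values fall back to the original string
        }
    }

    public static void main(String[] args) {
        System.out.println(coerce(ValueType.INTEGER, List.of("42")));
        System.out.println(coerce(ValueType.INTEGER, List.of("n/a")));
        System.out.println(coerce(ValueType.TEXT, List.of("tika", "grpc")));
    }
}
```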

7. Scope. The commit-typed-parse-response.sh script is gone. The detector trusts Content-Type and sits in the mapper as routing, not parsing, but I can move it. EpubStructureExtractor is a fair hit. I can split it, and trim WARC/font/climate to a thinner first cut, into a follow-up if you'd rather keep this PR lean.

On staging the break: nothing consumes the gRPC fields map yet and this is 4.0, so I did a hard removal instead of deprecate-then-remove. Easy to switch to staged if you'd prefer not to break it in one shot. But again, the fields map is not really the design you want - the JSON interface is far better for this.

Why I care about the typing: it lets me wire Tika into the opennlp-sandbox gRPC work (OPENNLP-1833) and on to embeddings without re-parsing metadata Tika already typed. I also created mapping tools that I'm going to share, which map between protobufs via CEL selectors, making mapping truly dynamic without resorting to Java reflection.

I know this is a lot - and I'll keep up with responding to any concerns.

Where I'd want your read: how curated the typed surface should be (you floated DC plus a dozen common fields), the EPUB-extractor and detector altitude, and hard-remove vs staged. Which of those matter most to you?

I think the popularity of this rests on getting other languages like Rust and Python to play well with the interface. A strongly typed interface, which can feel like a pain in Java, is actually strongly preferred in a Python IDE, and it would make this package very attractive as a first-class parser that exceeds the speed and capabilities of any parser out there.

Tika is great because it's a powerhouse for speed. It's reliable. This would polish up the surface and give that to 12 languages in a solid contract. So to me the contract is most important and why this reply focuses so heavily on it.

@nddipiazza

Contributor

have you had a chance to look into why the CI isn't passing?

@krickert

Author

No problem!

@krickert krickert closed this Jun 30, 2026
@krickert krickert reopened this Jun 30, 2026
@tballison

Contributor

Even with agents to help out, I can't stomach 11k lines of code to nail down maybe 80% of an open set.

I'm really worried about maintenance within the project and then clients having to rebuild their protos when we change metadata definitions.

We've had churn on value types EVEN for dublin core over the history of the project. Even if we limit custom handling to that, clients will still have to rebuild their protos when we make changes.

I'd be ok, maybe, with special handling for dublin core and some of the tika core properties: media type, etc.

Fellow devs (@nddipiazza) what do you think about this?

From claude: The lossless catch-all is the right idea and the part that belongs in Tika — it's what should replace the removed fields map. I'd simplify its shape, though: from repeated MetadataEntry with a typed oneof to a plain multivalue map<string, StringList>. That keeps the native dict lookup clients had with the old map<string,string>, fixes the real gap (multivalue), and drops the per-value typing — which for dynamic keys forces clients to branch on a 6-way union on every read without giving them a compile-time typed accessor anyway. A new or renamed metadata key still never forces a client rebuild, because a key is data, not schema. On top of that map I'd add only special-cased DC + a few core props as typed strings.
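
For comparison, a sketch of how a client might read the suggested map<string, StringList> shape. A plain Java Map<String, List<String>> stands in for the generated map type, and the helper names and example keys are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class MultivalueMapSketch {

    // Every value for a key; an absent key reads as an empty list.
    static List<String> all(Map<String, List<String>> metadata, String key) {
        return metadata.getOrDefault(key, List.of());
    }

    // Convenience for keys that are expected to be single-valued.
    static Optional<String> first(Map<String, List<String>> metadata, String key) {
        return all(metadata, key).stream().findFirst();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = Map.of(
                "dc:subject", List.of("tika", "grpc"),
                "dc:title", List.of("Sample"));

        System.out.println(all(metadata, "dc:subject"));
        System.out.println(first(metadata, "dc:title").orElse("(none)"));
        System.out.println(all(metadata, "unknown:key").isEmpty());
    }
}
```

A new or renamed key changes only the data in the map, so a client built against this shape would not need regenerated stubs.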

@krickert what, specifically, do you need within the Tika project and what can you do outside of Tika to meet your objectives?

3 participants