TIKA-4766: Typed parse response grpc by krickert · Pull Request #2916 · apache/tika

krickert · 2026-06-30T01:24:07Z

Summary

This change replaces the flat map<string,string> on FetchAndParseReply with a
typed org.apache.tika.grpc.v1.ParseResponse. Parse metadata is mapped from Tika
Metadata into protobuf messages aligned with Tika property interfaces (PDF, Office,
HTML, and the other supported formats), with Dublin Core normalized at the response
root and Creative Commons licensing carried on a dedicated field when present.

The work is split into three Maven modules so clients can depend on the schema and
generated stubs without pulling in the server, and so mapping logic stays testable
outside the gRPC layer:

- tika-grpc-api: protobuf sources, Java generation, bundled FileDescriptorSet
- tika-grpc-mapper: ParseResponseMapper and format builders; optional ParseResponseDecorator for future extensions (for example document outlines)
- tika-grpc: existing service; FetchAndParseReply.parse_response replaces the removed fields map

This is a breaking change for gRPC clients that read string keys from fields.
Migration is documented in tika-grpc-api/README.md and summarized in
tika-grpc/README.md.

Architecture

Module dependencies

flowchart TB
  subgraph clients [Clients]
    C[gRPC / Java clients]
  end

  subgraph server [tika-grpc]
    S[TikaGrpcServerImpl]
    P[Tika Pipes workers]
  end

  subgraph mapper [tika-grpc-mapper]
    M[ParseResponseMapper]
    B[Format metadata builders]
    D[ParseResponseDecorator optional]
  end

  subgraph api [tika-grpc-api]
    PR[parse_response.proto and format protos]
    FD[FileDescriptorSet in META-INF]
    J[Generated Java stubs]
  end

  subgraph tika [Tika core]
    AD[AutoDetectParser / Pipes]
    MD[Metadata]
  end

  C -->|FetchAndParse| S
  S --> P
  P --> AD
  AD --> MD
  S --> M
  M --> B
  M --> D
  M --> PR
  B --> PR
  PR --> J
  PR --> FD
  S -->|FetchAndParseReply.parse_response| C

Parse pipeline

sequenceDiagram
  participant Client
  participant Grpc as TikaGrpcServerImpl
  participant Pipes as Tika Pipes
  participant Parser as Tika parser
  participant Mapper as ParseResponseMapper

  Client->>Grpc: FetchAndParseRequest
  Grpc->>Pipes: fetch and parse
  Pipes->>Parser: parse bytes
  Parser-->>Pipes: Metadata, body text
  Pipes-->>Grpc: parse result
  Grpc->>Mapper: map metadata, body, status
  Mapper->>Mapper: detect format oneof
  Mapper->>Mapper: build dublin_core
  Mapper->>Mapper: optional CC overlay
  Mapper-->>Grpc: ParseResponse
  Grpc-->>Client: FetchAndParseReply.parse_response

ParseResponse layout

classDiagram
  class ParseResponse {
    string parse_id
    ParseStatus status
    ParseContent content
    DublinCoreMetadata dublin_core
    oneof document_metadata
    CreativeCommonsMetadata creative_commons
    repeated EmbeddedDocument embedded_docs
  }

  class ParseContent {
    string body
    string title
  }

  class PdfMetadata
  class OfficeMetadata
  class HtmlMetadata
  class ImageMetadata
  class GenericMetadata

  ParseResponse --> ParseContent
  ParseResponse --> DublinCoreMetadata
  ParseResponse --> PdfMetadata : pdf
  ParseResponse --> OfficeMetadata : office
  ParseResponse --> HtmlMetadata : html
  ParseResponse --> ImageMetadata : image
  ParseResponse --> GenericMetadata : generic

Breaking API change

Removed	Replacement
`FetchAndParseReply.fields` (field 2, reserved)	`FetchAndParseReply.parse_response` (field 5)

Example client reads:

Body text: parse_response.content.body
Title: parse_response.content.title or parse_response.dublin_core.title
PDF producer: parse_response.pdf.doc_info_producer
Parse status: parse_response.status
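
A minimal sketch of what the client-side migration might look like. The dataclasses below are hypothetical stand-ins for the generated stubs, not the real generated code; only the field paths (content.body, content.title, dublin_core.title, pdf.doc_info_producer, status) come from the list above. The status value names and the sample strings are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


# Hypothetical stand-ins for the generated protobuf classes.
@dataclass
class ParseContent:
    body: str = ""
    title: str = ""


@dataclass
class DublinCoreMetadata:
    title: str = ""


@dataclass
class PdfMetadata:
    doc_info_producer: str = ""


@dataclass
class ParseResponse:
    status: str = "PARSE_STATUS_UNSPECIFIED"  # assumed value name
    content: ParseContent = field(default_factory=ParseContent)
    dublin_core: DublinCoreMetadata = field(default_factory=DublinCoreMetadata)
    pdf: Optional[PdfMetadata] = None  # absent for non-PDF documents


def title_from_fields(fields: Dict[str, str]) -> str:
    """Old style: the caller has to know the string key."""
    return fields.get("dc:title", "")


def title_from_response(resp: ParseResponse) -> str:
    """New style: prefer content.title, fall back to dublin_core.title."""
    return resp.content.title or resp.dublin_core.title


def pdf_producer(resp: ParseResponse) -> Optional[str]:
    """Return the PDF producer, or None when no PDF metadata is present."""
    return resp.pdf.doc_info_producer if resp.pdf is not None else None


old_fields = {"dc:title": "Annual report",
              "pdf:docinfo:producer": "ExampleWriter 1.0"}
resp = ParseResponse(
    status="PARSE_STATUS_SUCCESS",  # assumed value name
    content=ParseContent(body="Hello", title=""),
    dublin_core=DublinCoreMetadata(title="Annual report"),
    pdf=PdfMetadata(doc_info_producer="ExampleWriter 1.0"),
)
```

With the real stubs the same reads would go through generated accessors, so a misspelled field fails at build or lint time instead of returning an empty string at runtime.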

Modules added or updated

Path	Notes
`tika-grpc-api/`	~17 proto files under `org/apache/tika/grpc/v1/`; buf lint config; descriptor bundle
`tika-grpc-mapper/`	Builders ported from prior Pipestream mapper work; 35 unit tests against Tika test fixtures
`tika-grpc/`	Depends on api + mapper; `TikaGrpcServerImpl` uses `ParseResponseMapper`
`tika-bom/pom.xml`	Lists new artifacts
Root `pom.xml`	Reactor modules

Test plan

./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
Confirm FetchAndParseReply no longer exposes fields; clients read parse_response
Parse a PDF and an HTML sample via gRPC; verify typed pdf / html oneof and content.body
Confirm tika-grpc-api jar contains META-INF/org.apache.tika.grpc.v1.descriptors
Review breaking change note with downstream consumers before release

Follow-up (not in this PR)

Outline decorators (ParseResponseDecorator) for PDF/HTML heading trees when proto fields are added
Expand ClimateForecastMetadataBuilder beyond additional_struct mapping
Downstream consumer updates in separate repositories

krickert · 2026-06-30T01:46:44Z

@tballison @nddipiazza I'd love a chance to go over this more w/ya - give me a chance to clean it up a bit first.

I lifted the model I've been running live on my own OSS project and moved it here - the models barely changed over time and I've done well with keeping up with it. I'll clean this up a bit more but wanted to start the conversation about this.

This is only a draft - so open to anything. My goals are simple:

Strongly type the response so I'm not pulling out my hair from parsing a second time when it's already strongly typed in Tika.

That's about it. I would tie this into https://github.com/apache/opennlp-sandbox/tree/OPENNLP-1833-grpc-expansion

This will allow for a seamless pipeline from tika -> opennlp -> whatever in the world would like NLP data and embeddings.

tballison · 2026-06-30T10:50:17Z

I've only had a chance to look at this briefly.

The challenges as you know: a) under the hood, we're currently storing metadata keys as strings, b) metadata keys are an open set with user-customizable keys, c) there are a lot of keys.

This was Claude's review:

What's legitimate

  - Replacing a flat map<string,string> with a typed contract is a reasonable thing to want — string-keyed maps are a weak contract for gRPC clients.
  - The tika-grpc-api / tika-grpc-mapper / tika-grpc module split (clients depend on schema, not server) is good hygiene.
  - Unmapped keys aren't simply dropped: each format message has an additional_metadata Struct (e.g. pdf_metadata.proto:253) and the builders fill it via
  MetadataUtils.buildAdditionalMetadata(..., mappedFields). So it's not as lossy as the "replace the map" framing first suggests.

  What doesn't hold up

  1. Dead fields in the public wire contract. ParseResponse.raw_properties (field 14) and repeated MetadataEntry metadata (field 12) are declared in the proto with
  comments promising "anything not captured in typed metadata" — but the mapper populates neither (0 references in ParseResponseMapper.java). Shipping public proto
  fields that are always empty is a contract smell; a client will reasonably read raw_properties and get nothing.

  2. Four overlapping places a value can live. typed format field → per-format additional_metadata Struct → dublin_core → creative_commons, plus the two dead
  channels. There's no single predictable place to find a given property. That's harder to consume correctly than the map it replaces.

  3. It re-detects the format Tika already told you. DocumentTypeDetector.detect() (316 lines of heuristics over metadata) chooses the oneof branch — but Tika
  already set Content-Type. Re-deriving it heuristically is fragile and duplicative, and it's called twice (mapper lines 198 and 327).

  4. The oneof document_metadata fights Tika's data model. Tika metadata routinely spans namespaces (a PDF carrying XMP/DC, EXIF on embedded images, …). Forcing
  exactly one format bucket is precisely why DC and CC had to be pulled out of the oneof as special cases — the model is already breaking under its own weight, and
  every new cross-cutting namespace adds another special case.

  5. A hand-maintained parallel copy of Tika's entire metadata taxonomy. ~5,000 lines of proto + ~4,000 lines of builders that must track org.apache.tika.metadata.*
  by hand forever. When core adds/renames a Property, the typed field silently desyncs and the value quietly drops into additional_metadata. That's a large
  permanent maintenance tax for marginal gain over a good catch-all.

  6. Fidelity loss. Catch-alls use google.protobuf.Struct, which is JSON-ish (numbers are doubles, no native String[]); Tika metadata is uniformly multivalued
  strings. And small tells confirm looseness: hardcoded keys "Keywords"/"Content-Length" instead of Property constants, and content.keywords is a scalar though Tika
  keywords are multivalue.

  7. Scope creep past "typed parse response." EpubStructureExtractor (483 lines) and the detector live in the gRPC mapper — that's parser/detection logic migrating
  into the serialization layer, the wrong altitude. WARC/font/climate-forecast/database schemas widen the surface a lot for a first cut. There's also a
  commit-typed-parse-response.sh script checked in, which doesn't belong in the PR.
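
To make point 6 concrete: google.protobuf.Struct stores every number as a double, so a large integer value cannot always round-trip. The sketch below only mimics that storage with a Python float; it does not use the protobuf library, and the helper name is hypothetical.

```python
def to_struct_number(value: int) -> float:
    """Mimic Struct's number_value, which is an IEEE 754 double."""
    return float(value)


# An integer just above 2**53 is not exactly representable as a double.
content_length = 2**53 + 1
round_tripped = int(to_struct_number(content_length))
lost = content_length - round_tripped

# Small values survive unchanged, which is why the problem is easy to miss.
small_round_tripped = int(to_struct_number(1024))
```

Values this large are rare for most metadata keys, but the loss is silent when it does happen.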

My recommendation

  The shape I'd push toward is augment, not replace:
  - Keep one lossless, uniform, actually-populated channel as source of truth (a repeated key/value that preserves multivalue), then layer a small curated set of
  typed convenience fields (DC + a dozen genuinely common PDF/office fields) on top.
  - Trust Content-Type; drop the format oneof in favor of coexisting sub-messages (or just DC + raw).
  - Don't move EPUB structure extraction / type detection into the mapper.
  - Wire or delete the dead fields, and stage the break (add parse_response alongside fields, deprecate, then remove) rather than a 10.7k-line breaking mega-PR.

This gives me great concern from a maintenance perspective. I haven't looked carefully enough to know if this is real.

Addresses the PR apache#2916 review by turning the typed ParseResponse into a clean, schema-first gRPC contract instead of a REST-shaped envelope.

Contract (tika-grpc-api):
- Drop the `document_metadata` oneof in favour of independent `optional` format submessages. Tika metadata spans namespaces, so forcing exactly one bucket was wrong and is why Dublin Core / Creative Commons had to be special-cased. Field numbers are preserved, so the change is wire-compatible. Format submessages now coexist as peers.
- Remove every google.protobuf.Struct catch-all: the per-format `additional_metadata` (incl. climate `additional_scientific_metadata` and CC `additional_rights_metadata`), `BaseFields.raw_metadata`, and `GenericMetadata`'s four Structs. Removed field numbers are `reserved`.
- Single lossless channel: `ParseResponse.metadata`, with one typed `MetadataEntry` per Tika key, multivalue-preserving via `text_list`, with values coerced via Tika `Property` types (int/double/bool/date).
- Envelope `content_type` (canonical MIME) and typed `primary_format` (`DocumentFormatCategory`) so clients branch on an enum, not a string.

Mapper (tika-grpc-mapper):
- Populate the metadata mirror; stop emitting Struct catch-alls.
- Use Tika Property constants instead of string literals ("Keywords"/"Content-Length"/"resourceName"); join multivalue keywords.
- Detector reads `Metadata.CONTENT_TYPE` and is called once; maps to the typed `DocumentFormatCategory`.

Tests / docs:
- Retarget former catch-all assertions to the typed metadata mirror.
- Add DocumentFormatCategoryTest and ParseResponseMetadataPopulationTest.
- Add ParseResponseCoexistenceTest proving (a) a format submessage and the Creative Commons overlay coexist on one response, and (b) a parent and its embedded child carry distinct typed formats at the correct altitude (parent PDF + child IMAGE via embedded_docs[].parsed_content).
- Update READMEs and mapper design docs; remove the stray commit-typed-parse-response.sh helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

krickert · 2026-06-30T13:56:41Z

The Claude review was useful and I went through all 7 points. Pushed in 99883db0c.

Short version: the "augment, not replace" shape you recommended and where this landed are the same now. One lossless, populated channel as the source of truth (ParseResponse.metadata), a typed convenience layer on top, Content-Type trusted, and the oneof dropped for coexisting submessages.

Below is a summary of the changes that line up with the concerns you and Claude pointed out (summary provided by Claude, based on my initial write-up):

#	Concern	Status	What changed
1	Dead wire fields	Fixed	`metadata` (field 12) populated; `raw_properties` (14) reserved; empty Structs removed
2	Four overlapping channels	Down to one	only `ParseResponse.metadata` is a catch-all now; the 13 per-format `additional_metadata`, `BaseFields.raw_metadata`, and 4 `GenericMetadata` Structs are gone (numbers reserved)
3	Re-detects format, twice	Fixed	single `detect()`, reads `Content-Type` first, exposed as a typed `primary_format` enum
4	oneof fights Tika's model	Fixed + test	oneof to coexisting `optional` peers; `ParseResponseCoexistenceTest`
5	Maintaining the mirror	Bounded	desync no longer drops data, see below
6	Fidelity loss (Struct)	Fixed	Struct gone; `MetadataEntry` keeps multivalue and native types; Property constants
7	Scope creep	Partial	helper script removed; EPUB extractor I can split if you want

Write-up (provided by me, edited for grammar by Claude)

1. Dead fields. metadata is populated for every key now, raw_properties is reserved. ParseResponseMetadataPopulationTest asserts metadata.names().length == response.getMetadataCount(), so there's no always-empty public field anymore and it's guard-railed.

2. Overlapping channels. One catch-all: ParseResponse.metadata. Every per-format additional_metadata, BaseFields.raw_metadata (that one was a full dump embedded in every message), and the four GenericMetadata Structs are removed, field numbers reserved. Precedence is written up in tika-grpc-api/README.md.

3. Detection. One detect() call that leads with Content-Type (extension and CC are fallbacks only when MIME is missing), surfaced as DocumentFormatCategory primary_format so clients read an enum instead of re-deriving.
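
A rough sketch of the "Content-Type first, extension only as a fallback" routing described in point 3. This is not the PR's DocumentTypeDetector: the enum member names and the lookup tables are assumptions, and the CC fallback is left out because the thread does not describe it in detail. The metadata keys "Content-Type" and "resourceName" are the ones named elsewhere in this thread.

```python
from enum import Enum
from typing import Dict


class DocumentFormatCategory(Enum):
    # Member names are assumptions; the PR names the enum, not its values.
    UNKNOWN = 0
    PDF = 1
    HTML = 2
    IMAGE = 3


# Hypothetical lookup tables, deliberately tiny.
_MIME_TO_CATEGORY = {
    "application/pdf": DocumentFormatCategory.PDF,
    "text/html": DocumentFormatCategory.HTML,
}
_EXT_TO_CATEGORY = {
    ".pdf": DocumentFormatCategory.PDF,
    ".html": DocumentFormatCategory.HTML,
    ".png": DocumentFormatCategory.IMAGE,
}


def detect(metadata: Dict[str, str]) -> DocumentFormatCategory:
    """Trust Content-Type when present; use the file extension only if it is missing."""
    content_type = metadata.get("Content-Type", "")
    if content_type:
        # Drop parameters such as "; charset=UTF-8" before matching.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in _MIME_TO_CATEGORY:
            return _MIME_TO_CATEGORY[mime]
        if mime.startswith("image/"):
            return DocumentFormatCategory.IMAGE
        # A MIME type was reported: do not second-guess it with heuristics.
        return DocumentFormatCategory.UNKNOWN
    name = metadata.get("resourceName", "").lower()
    for ext, category in _EXT_TO_CATEGORY.items():
        if name.endswith(ext):
            return category
    return DocumentFormatCategory.UNKNOWN
```

Note that a present but unrecognized MIME type maps to UNKNOWN rather than falling through to the extension, which keeps the detector from contradicting what Tika already reported.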

4. The oneof. Gone, replaced with independent optional submessages (same field numbers, so it stays wire-compatible). They coexist, so DC and CC aren't special cases anymore. ParseResponseCoexistenceTest covers both a format submessage plus the CC overlay on one response, and a parent PDF with an embedded child typed IMAGE on its own parsed_content. That last bit is where "EXIF on embedded images" actually belongs, instead of forcing a second bucket onto the parent.

5. Maintenance. This is the big one. It's often the first "shock" people get when surfacing a gRPC interface, and it's also why most implementations never take off. I believe in keeping the ugly parsing details behind the contract and not pushing that work to the user. That's the center of the decision, and there's honestly no good compromise. (Avoid frameworks like bazel: they turn into just as much mapping scaffolding, and you're at the whim of yet another layer. It's best to own it and go first class. LLMs are great at writing mapping code and will only get better.)

Two things keep it bounded:

- No silent data loss when a field desyncs. The mirror is lossless and the test above proves every Tika key lands in metadata, typed. So if core renames or adds a Property and a typed field stops matching, the value doesn't disappear into a lossy Struct, it's still in metadata with the right type. Worst case is a convenience field is briefly missing, not a dropped value. That kills the specific failure mode the review called out.
- It's cheap. I lifted this model from an OSS pipeline I run live and it's been about an hour a year to keep current (thanks to LLMs this is now easy). The interfaces don't move much at all. If they do, we still capture unknowns. (Also, I'll be glad to maintain this, as I need to maintain it anyway for other projects.)
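
The desync argument can be sketched as follows. This is an illustration under stated assumptions, not the mapper's code: `build_mirror`, `build_typed`, and the `TYPED_FIELDS` table are hypothetical, and the renamed key is invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical typed-layer mapping: Tika key -> typed convenience field.
TYPED_FIELDS = {"dc:title": "title"}


def build_mirror(tika: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Lossless channel: one entry per Tika key, all values kept."""
    return [(key, list(values)) for key, values in tika.items()]


def build_typed(tika: Dict[str, List[str]]) -> Dict[str, str]:
    """Convenience layer: only keys the schema knows about."""
    return {
        field_name: tika[key][0]
        for key, field_name in TYPED_FIELDS.items()
        if tika.get(key)
    }


before = {"dc:title": ["Report"], "dc:creator": ["A", "B"]}
# Suppose core renames the title key (an invented rename, for illustration).
after = {"example:title": ["Report"], "dc:creator": ["A", "B"]}

typed_before = build_typed(before)
typed_after = build_typed(after)    # the convenience field goes missing
mirror_after = build_mirror(after)  # the value itself is still delivered
```

The invariant that ParseResponseMetadataPopulationTest is described as asserting corresponds to `len(build_mirror(m)) == len(m)` in this sketch.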

And the typed layer is opt-in. If you want zero coupling to the taxonomy, you read metadata and ignore the typed fields. The mirror is the contract; the typed fields are convenience. My take is still that if the contract stays dynamic, JSON is the better fit; the whole reason to be on gRPC is the typed schema. Simply put, gRPC is pretty bad at being dynamic. So don't fight it: embrace the type safety, and downstream client code looks great and all those parsing and casting errors you've seen for years just go away.

6. Fidelity. No more Struct, so no JSON-ish doubles and no lost String[]. MetadataEntry keeps multivalue via text_list and native types (int64/double/bool/Timestamp) coerced from Tika Property definitions. Swapped the hardcoded "Keywords"/"Content-Length" for Property constants, and multivalue keywords are joined.
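
A hedged sketch of the coercion described in point 6. The `MetadataEntry` dataclass and the `PROPERTY_TYPES` table are hypothetical stand-ins (the real mapper is said to derive types from Tika Property definitions), and falling back to plain text on a parse failure is an assumption of this sketch, not documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Union

Value = Union[int, float, bool, datetime, str, List[str]]


@dataclass
class MetadataEntry:
    key: str
    value: Value


# Hypothetical type table standing in for Tika Property definitions.
PROPERTY_TYPES = {
    "xmpTPg:NPages": "int",
    "dcterms:created": "date",
    "example:encrypted": "bool",  # invented key, for illustration only
}


def coerce(key: str, values: List[str]) -> MetadataEntry:
    """Keep multivalue keys as a list; coerce single values by declared type."""
    if len(values) != 1:
        return MetadataEntry(key, list(values))  # plays the role of text_list
    raw = values[0]
    kind = PROPERTY_TYPES.get(key, "text")
    try:
        if kind == "int":
            return MetadataEntry(key, int(raw))
        if kind == "double":
            return MetadataEntry(key, float(raw))
        if kind == "bool" and raw.strip().lower() in ("true", "false"):
            return MetadataEntry(key, raw.strip().lower() == "true")
        if kind == "date":
            # fromisoformat on older Pythons rejects a trailing "Z".
            return MetadataEntry(key, datetime.fromisoformat(raw.replace("Z", "+00:00")))
    except ValueError:
        pass  # assumed fallback: keep the original text rather than drop it
    return MetadataEntry(key, raw)
```

The fallback branch is what keeps the channel lossless in this sketch: a value that does not parse as its declared type is still delivered as text.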

7. Scope. The commit-typed-parse-response.sh script is gone. The detector trusts Content-Type and sits in the mapper as routing, not parsing, but I can move it. EpubStructureExtractor is a fair hit. I can split it, and trim WARC/font/climate to a thinner first cut, into a follow-up if you'd rather keep this PR lean.

On staging the break: nothing consumes the gRPC fields map yet and this is 4.0, so I did a hard removal instead of deprecate-then-remove. Easy to switch to staged if you'd prefer not to break it in one shot. But again, the fields map is not really the design you want - the JSON interface is far better for this.

Why I care about the typing: it lets me wire Tika into the opennlp-sandbox gRPC work (OPENNLP-1833) and on to embeddings without re-parsing metadata Tika already typed. I also created mapping tools, which I'm going to share, that define the mappings between protobufs via CEL selectors, making mapping truly dynamic without resorting to Java reflection.

I know this is a lot - and I'll keep up with responding to any concerns.

Where I'd want your read: how curated the typed surface should be (you floated DC plus a dozen common fields), the EPUB-extractor and detector altitude, and hard-remove vs staged. Which of those matter most to you?

I think the popularity of this rests on getting other languages like Rust and Python to play well with the interface. So a strongly typed interface, where Java can feel like a pain, is actually strongly preferred in a Python IDE and would make this package very attractive as a first-class parser that exceeds the speed and capabilities of any parser out there.

Tika is great because it's a powerhouse for speed. It's reliable. This would polish up the surface and give that to 12 languages in a solid contract. So to me the contract is most important and why this reply focuses so heavily on it.

nddipiazza · 2026-06-30T17:27:55Z

have you had a chance to look into why the CI isn't passing?

krickert · 2026-06-30T17:43:04Z

No problem!

tballison · 2026-06-30T18:28:57Z

Even with agents to help out, I can't stomach 11k lines of code to nail down maybe 80% of an open set.

I'm really worried about maintenance within the project and then clients having to rebuild their protos when we change metadata definitions.

We've had churn on value types EVEN for dublin core over the history of the project. Even if we limit custom handling to that, clients will still have to rebuild their protos when we make changes.

I'd be ok, maybe, with special handling for dublin core and some of the tika core properties: media type, etc.

Fellow devs (@nddipiazza) what do you think about this?

From claude: The lossless catch-all is the right idea and the part that belongs in Tika — it's what should replace the removed fields map. I'd simplify its shape, though: from repeated MetadataEntry with a typed oneof to a plain multivalue map<string, StringList>. That keeps the native dict lookup clients had with the old map<string,string>, fixes the real gap (multivalue), and drops the per-value typing — which for dynamic keys forces clients to branch on a 6-way union on every read without giving them a compile-time typed accessor anyway. A new or renamed metadata key still never forces a client rebuild, because a key is data, not schema. On top of that map I'd add only special-cased DC + a few core props as typed strings.
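
For comparison, the plain multivalue map shape suggested above might behave like this from a client's point of view. It is a sketch only: the helper names are hypothetical, and a Python dict of lists stands in for `map<string, StringList>`.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def to_multivalue_map(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group repeated (key, value) pairs into key -> values, keeping order."""
    out: Dict[str, List[str]] = {}
    for key, value in pairs:
        out.setdefault(key, []).append(value)
    return out


def first(meta: Dict[str, List[str]], key: str,
          default: Optional[str] = None) -> Optional[str]:
    """Single-value convenience read, like the old map<string,string> lookup."""
    values = meta.get(key)
    return values[0] if values else default


meta = to_multivalue_map([
    ("dc:creator", "A"),
    ("dc:creator", "B"),
    ("Content-Type", "application/pdf"),
])
```

A new or renamed key shows up here as data, so the client code above does not change and no stubs need to be regenerated.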

@krickert what, specifically, do you need within the Tika project and what can you do outside of Tika to meet your objectives?

krickert added 3 commits June 29, 2026 21:19

Replace FetchAndParseReply string map with typed ParseResponse

fce4bc8

Add tika-grpc-api (protobuf schema and descriptors) and tika-grpc-mapper (Tika Metadata to ParseResponse). Wire tika-grpc to return parse_response on fetch-and-parse RPCs. Include mapper tests and README updates.

Use feature branch in typed ParseResponse commit script

d7efe9c

Default to typed-parse-response-grpc and fork remote; refuse commits on main.

removed cruft

95f9a1d

krickert changed the title ~~Typed parse response grpc~~ TIKA-4766: Typed parse response grpc Jun 30, 2026

krickert mentioned this pull request Jun 30, 2026

TIKA-4727: Add experimental strongly-typed protobuf response to tika-grpc #2811

Draft

4 tasks

krickert closed this Jun 30, 2026

krickert reopened this Jun 30, 2026

krickert wants to merge 4 commits into apache:main from ai-pipestream:typed-parse-response-grpc

3 participants
