TIKA-4766: Typed parse response grpc #2916
Add tika-grpc-api (protobuf schema and descriptors) and tika-grpc-mapper (Tika Metadata to ParseResponse). Wire tika-grpc to return parse_response on fetch-and-parse RPCs. Include mapper tests and README updates.
Default to typed-parse-response-grpc and fork remote; refuse commits on main.
|
@tballison @nddipiazza I'd love a chance to go over this more w/ya - give me a chance to clean it up a bit first. I lifted the model I've been running live on my own OSS project and moved it here - the models barely changed over time and I've done well with keeping up with it. I'll clean this up a bit more but wanted to start the conversation about this. This is only a draft - so open to anything. My goals are simple:
That's about it. I would tie this into https://github.com/apache/opennlp-sandbox/tree/OPENNLP-1833-grpc-expansion This will allow for a seamless pipeline from tika -> opennlp -> whatever in the world would like NLP data and embeddings. |
|
I've only had a chance to look at this briefly. The challenges as you know: a) under the hood, we're currently storing metadata keys as strings, b) metadata keys are an open set with user-customizable keys, c) there are a lot of keys. This was claude's review
|
Addresses the PR apache#2916 review by turning the typed ParseResponse into a clean, schema-first gRPC contract instead of a REST-shaped envelope.

Contract (tika-grpc-api):
- Drop the `document_metadata` oneof in favour of independent `optional` format submessages. Tika metadata spans namespaces, so forcing exactly one bucket was wrong and is why Dublin Core / Creative Commons had to be special-cased. Field numbers are preserved, so the change is wire-compatible. Format submessages now coexist as peers.
- Remove every google.protobuf.Struct catch-all: the per-format `additional_metadata` (incl. climate `additional_scientific_metadata` and CC `additional_rights_metadata`), `BaseFields.raw_metadata`, and `GenericMetadata`'s four Structs. Removed field numbers are `reserved`.
- Single lossless channel: `ParseResponse.metadata`, with one typed `MetadataEntry` per Tika key, multivalue-preserving via `text_list`, and values coerced via Tika `Property` types (int/double/bool/date).
- Envelope `content_type` (canonical MIME) and typed `primary_format` (`DocumentFormatCategory`) so clients branch on an enum, not a string.

Mapper (tika-grpc-mapper):
- Populate the metadata mirror; stop emitting Struct catch-alls.
- Use Tika Property constants instead of string literals ("Keywords"/"Content-Length"/"resourceName"); join multivalue keywords.
- Detector reads `Metadata.CONTENT_TYPE` and is called once; maps to the typed `DocumentFormatCategory`.

Tests / docs:
- Retarget former catch-all assertions to the typed metadata mirror.
- Add DocumentFormatCategoryTest and ParseResponseMetadataPopulationTest.
- Add ParseResponseCoexistenceTest proving (a) a format submessage and the Creative Commons overlay coexist on one response, and (b) a parent and its embedded child carry distinct typed formats at the correct altitude (parent PDF + child IMAGE via embedded_docs[].parsed_content).
- Update READMEs and mapper design docs; remove the stray commit-typed-parse-response.sh helper.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The Claude review was useful and I went through all 7 points. Pushed in Short version: the "augment, not replace" shape you recommended and where this landed are the same now. One lossless, populated channel as the source of truth ( Below is a summary of the changes that line up with the concerns that you and Claude pointed out (summary provided by Claude based on my initial write-up):
Write-up (provided by me, edited for grammar by Claude)
1. Dead fields.
2. Overlapping channels. One catch-all:
3. Detection. One
4. The oneof. Gone, replaced with independent
5. Maintenance. This is the big one. It's often the first "shock" people get when surfacing a gRPC interface, and it's also why most implementations never take off. I believe in keeping the ugly parsing details behind the contract and not pushing that work to the user. That's the center of the decision, and there's honestly no good compromise (avoid frameworks like bazel, as they turn into just as much mapping scaffolding and you're at the whim of yet another layer; it's best to own it and go first class. LLMs are great at writing mapping code and will only get better). Two things keep it bounded:
And the typed layer is opt-in. If you want zero coupling to the taxonomy you read
6. Fidelity. No more Struct, so no JSON-ish doubles and no lost String[].
7. Scope. The
On staging the break: nothing consumes the gRPC
Why I care about the typing: it lets me wire Tika into the opennlp-sandbox gRPC work (OPENNLP-1833) and on to embeddings without re-parsing metadata Tika already typed. I also created mapping tools, which I'm going to share, that map between protobufs via CEL selectors, making mapping truly dynamic without resorting to Java reflection.
I know this is a lot, and I'll keep up with responding to any concerns. Where I'd want your read: how curated the typed surface should be (you floated DC plus a dozen common fields), the EPUB-extractor and detector altitude, and hard-remove vs staged. Which of those matter most to you?
I think the popularity of this rests on getting other languages like Rust and Python to play well with the interface. So a strongly typed one, where Java can feel like a pain, is actually strongly preferred in a Python IDE and would make this package very attractive as a first-class parser that exceeds the speed and capabilities of any parser out there. Tika is great because it's a powerhouse for speed. It's reliable. This would polish up the surface and give that to 12 languages in a solid contract. So to me the contract is most important, which is why this reply focuses so heavily on it.
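To make the fidelity point (item 6) concrete: `google.protobuf.Struct` carries every number as a double, so a large integer metadata value cannot round-trip through it. The following is a minimal plain-Java sketch of that effect; the class name is hypothetical and nothing here depends on Tika or protobuf.

```java
// Hypothetical illustration class; not part of the PR.
final class FidelitySketch {

    // Imitates storing an integer in a double-typed slot (as Struct's
    // number_value is) and reading it back as a long.
    static long throughDouble(long value) {
        return (long) (double) value;
    }

    public static void main(String[] args) {
        long small = 1024L;                   // e.g. a content length
        long large = 9_007_199_254_740_993L;  // 2^53 + 1, not representable as a double

        System.out.println(throughDouble(small) == small); // prints true
        System.out.println(throughDouble(large) == large); // prints false
    }
}
```

Values up to 2^53 survive the round trip, which is why the loss tends to go unnoticed until a large identifier or byte count shows up.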
|
have you had a chance to look into why the CI isn't passing? |
|
No problem! |
|
Even with agents to help out, I can't stomach 11k lines of code to nail down maybe 80% of an open set. I'm really worried about maintenance within the project and then clients having to rebuild their protos when we change metadata definitions. We've had churn on value types EVEN for dublin core over the history of the project. Even if we limit custom handling to that, clients will still have to rebuild their protos when we make changes. I'd be ok, maybe, with special handling for dublin core and some of the tika core properties: media type, etc.

Fellow devs (@nddipiazza) what do you think about this?

From claude: The lossless catch-all is the right idea and the part that belongs in Tika; it's what should replace the removed fields map. I'd simplify its shape, though: from repeated MetadataEntry with a typed oneof to a plain multivalue map<string, StringList>. That keeps the native dict lookup clients had with the old map<string,string>, fixes the real gap (multivalue), and drops the per-value typing, which for dynamic keys forces clients to branch on a 6-way union on every read without giving them a compile-time typed accessor anyway. A new or renamed metadata key still never forces a client rebuild, because a key is data, not schema. On top of that map I'd add only special-cased DC + a few core props as typed strings.

@krickert what, specifically, do you need within the Tika project and what can you do outside of Tika to meet your objectives?
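The multivalue-map shape suggested above can be sketched in plain Java. This is only an illustration under assumptions: a `Map<String, List<String>>` stands in for `map<string, StringList>`, and `add` and `firstLong` are hypothetical helpers, not code from the PR. The point it shows is that every value for a key survives, a new key needs no schema change, and the client applies typing only for keys it already understands.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical illustration class; not part of the PR.
final class CatchAllSketch {

    // Appends a value under the key; earlier values are never overwritten.
    static void add(Map<String, List<String>> md, String key, String value) {
        md.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Client-side typing on demand: parse the first value as a long,
    // or return empty if the key is absent or the value is not numeric.
    static OptionalLong firstLong(Map<String, List<String>> md, String key) {
        List<String> values = md.get(key);
        if (values == null || values.isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(values.get(0).trim()));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> md = new LinkedHashMap<>();
        add(md, "dc:subject", "tika");
        add(md, "dc:subject", "grpc");          // second value is kept
        add(md, "Content-Length", "1024");
        add(md, "custom:new-key", "anything");  // new key, no schema change

        System.out.println(md.get("dc:subject"));
        System.out.println(firstLong(md, "Content-Length"));
    }
}
```

The typed `MetadataEntry` variant would move the parsing in `firstLong` to the server side; the trade-off discussed in this thread is whether that is worth a union the client must branch on for every read.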
Summary
This change replaces the flat map<string,string> on FetchAndParseReply with a typed org.apache.tika.grpc.v1.ParseResponse. Parse metadata is mapped from Tika Metadata into protobuf messages aligned with Tika property interfaces (PDF, Office, HTML, and the other supported formats), with Dublin Core normalized at the response root and Creative Commons licensing carried on a dedicated field when present.
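What this means for a client read can be sketched as follows. The classes below are hand-written stand-ins for illustration, not the generated protobuf stubs; only the field names (`content.body`, `pdf.doc_info_producer`) come from this description, and the legacy key string is an assumption about what a client might have looked up in the old map.

```java
import java.util.Map;

// Hypothetical stand-ins for the generated classes; not part of the PR.
final class TypedReadSketch {

    static final class ParseContent {
        final String body;
        ParseContent(String body) { this.body = body; }
    }

    static final class PdfMetadata {
        final String docInfoProducer;
        PdfMetadata(String docInfoProducer) { this.docInfoProducer = docInfoProducer; }
    }

    static final class ParseResponse {
        final ParseContent content;
        final PdfMetadata pdf; // null when the document is not a PDF
        ParseResponse(ParseContent content, PdfMetadata pdf) {
            this.content = content;
            this.pdf = pdf;
        }
    }

    // Before: string-keyed lookup; a mistyped key still compiles and yields null.
    static String producerFromFields(Map<String, String> fields) {
        return fields.get("pdf:docinfo:producer"); // assumed legacy key
    }

    // After: typed accessor; a mistyped field name fails at compile time.
    static String producerFromResponse(ParseResponse response) {
        return response.pdf == null ? null : response.pdf.docInfoProducer;
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("pdf:docinfo:producer", "ExampleProducer");
        ParseResponse typed = new ParseResponse(
                new ParseContent("body text"), new PdfMetadata("ExampleProducer"));

        System.out.println(producerFromFields(fields));
        System.out.println(producerFromResponse(typed));
    }
}
```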
The work is split into three Maven modules so clients can depend on the schema and
generated stubs without pulling in the server, and so mapping logic stays testable
outside the gRPC layer:
- tika-grpc-api: schema, generated stubs, and FileDescriptorSet
- tika-grpc-mapper: ParseResponseMapper and format builders; optional ParseResponseDecorator for future extensions (for example document outlines)
- tika-grpc: FetchAndParseReply.parse_response replaces the removed fields map

This is a breaking change for gRPC clients that read string keys from fields. Migration is documented in tika-grpc-api/README.md and summarized in tika-grpc/README.md.

Architecture
Module dependencies
flowchart TB
  subgraph clients [Clients]
    C[gRPC / Java clients]
  end
  subgraph server [tika-grpc]
    S[TikaGrpcServerImpl]
    P[Tika Pipes workers]
  end
  subgraph mapper [tika-grpc-mapper]
    M[ParseResponseMapper]
    B[Format metadata builders]
    D[ParseResponseDecorator optional]
  end
  subgraph api [tika-grpc-api]
    PR[parse_response.proto and format protos]
    FD[FileDescriptorSet in META-INF]
    J[Generated Java stubs]
  end
  subgraph tika [Tika core]
    AD[AutoDetectParser / Pipes]
    MD[Metadata]
  end
  C -->|FetchAndParse| S
  S --> P
  P --> AD
  AD --> MD
  S --> M
  M --> B
  M --> D
  M --> PR
  B --> PR
  PR --> J
  PR --> FD
  S -->|FetchAndParseReply.parse_response| C

Parse pipeline
ParseResponse layout
classDiagram
  class ParseResponse {
    string parse_id
    ParseStatus status
    ParseContent content
    DublinCoreMetadata dublin_core
    oneof document_metadata
    CreativeCommonsMetadata creative_commons
    repeated EmbeddedDocument embedded_docs
  }
  class ParseContent {
    string body
    string title
  }
  class PdfMetadata
  class OfficeMetadata
  class HtmlMetadata
  class ImageMetadata
  class GenericMetadata
  ParseResponse --> ParseContent
  ParseResponse --> DublinCoreMetadata
  ParseResponse --> PdfMetadata : pdf
  ParseResponse --> OfficeMetadata : office
  ParseResponse --> HtmlMetadata : html
  ParseResponse --> ImageMetadata : image
  ParseResponse --> GenericMetadata : generic

Breaking API change
- Removed: FetchAndParseReply.fields (field 2, reserved)
- Replacement: FetchAndParseReply.parse_response (field 5)

Example client reads:
- parse_response.content.body
- parse_response.content.title or parse_response.dublin_core.title
- parse_response.pdf.doc_info_producer
- parse_response.status

Modules added or updated
- tika-grpc-api/: org/apache/tika/grpc/v1/; buf lint config; descriptor bundle
- tika-grpc-mapper/
- tika-grpc/: TikaGrpcServerImpl uses ParseResponseMapper
- tika-bom/pom.xml
- pom.xml

Test plan
- ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
- FetchAndParseReply no longer exposes fields; clients read parse_response
- pdf/html oneof and content.body
- tika-grpc-api jar contains META-INF/org.apache.tika.grpc.v1.descriptors

Follow-up (not in this PR)
- ParseResponseDecorator for PDF/HTML heading trees when proto fields are added
- ClimateForecastMetadataBuilder beyond additional_struct mapping