Iceberg source: adopt Iceberg's native Arrow reader stack for forward-compatibility with Iceberg spec evolution and performance improvements #19498

@Shekharrajak

Description

Replace Druid's hand-written Iceberg reader path with an opt-in path that delegates reading, delete application, and type handling to Iceberg's official iceberg-arrow library. Initially, both readers will be available. This stops Druid from shipping a custom fork of Iceberg's reader semantics and lets Druid automatically inherit every Iceberg spec evolution (V2 deletes → V3 deletion vectors / row lineage → V4 and beyond), reader optimisation (pushdown, statistics, vectorisation), and format coverage (Parquet/ORC/Avro and future formats) as soon as we bump the Iceberg dependency.

Current

  • IcebergNativeRecordReader is a Druid-maintained reader.
  • Every Iceberg spec improvement (new delete encodings, partition statistics, manifest changes, deletion vectors in V3, row lineage, etc.) requires bespoke Druid implementation work.

After changes

  • A new IcebergArrowReader is activated by setting useArrowReader: true in the input spec; the flag defaults to false initially.
  • Druid converts the resulting Arrow VectorSchemaRoot batches into MapBasedInputRow via one small adapter; InputRow remains the firewall, and nothing else in Druid sees Arrow.
  • Iceberg dependency bumps automatically deliver new spec features and optimisations to Druid users with no Druid code changes.
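
To make the adapter idea concrete, here is a minimal sketch of the column-to-row conversion it would perform. It is not the proposed implementation: to stay dependency-free it uses a hypothetical ColumnarBatch class as a stand-in for Arrow's VectorSchemaRoot, and BatchRowAdapter and toRows are illustrative names that do not come from this issue. In a real adapter, each resulting map would presumably be wrapped in a MapBasedInputRow together with its timestamp and dimension list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an Arrow VectorSchemaRoot: named columns of equal length.
final class ColumnarBatch {
    final List<String> columnNames;
    final List<List<Object>> columns; // columns.get(c).get(r) is the value at column c, row r
    final int rowCount;

    ColumnarBatch(List<String> columnNames, List<List<Object>> columns) {
        if (columnNames.size() != columns.size()) {
            throw new IllegalArgumentException("one name is required per column");
        }
        int rows = columns.isEmpty() ? 0 : columns.get(0).size();
        for (List<Object> column : columns) {
            if (column.size() != rows) {
                throw new IllegalArgumentException("all columns must have the same length");
            }
        }
        this.columnNames = columnNames;
        this.columns = columns;
        this.rowCount = rows;
    }
}

// Illustrative adapter: pivots a columnar batch into one map per row.
final class BatchRowAdapter {
    private BatchRowAdapter() {
    }

    static List<Map<String, Object>> toRows(ColumnarBatch batch) {
        List<Map<String, Object>> rows = new ArrayList<>(batch.rowCount);
        for (int r = 0; r < batch.rowCount; r++) {
            // LinkedHashMap keeps the column order of the batch; null values are preserved.
            Map<String, Object> row = new LinkedHashMap<>();
            for (int c = 0; c < batch.columnNames.size(); c++) {
                row.put(batch.columnNames.get(c), batch.columns.get(c).get(r));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Keeping the conversion in a single place like this is what would let InputRow stay the only boundary type: everything downstream receives plain row maps and never touches the columnar representation.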

Motivation

This is the first, foundational step towards Arrow integration (#19456) and towards seeing Druid and Arrow working together.
