feat: Added IcebergArrowInputSourceReader #19510
Conversation
…st constructor arity
|
Benchmark : |
|
Looking into benchmark module which will be used throughout the arrow and datafusion integration to quickly check the improvements. |
…t line breaking, DateTimes.utc, Maps.newHashMapWithExpectedSize)
6297e2f to
9da5049
Compare
| private final boolean useArrowReader; | ||
|
|
||
| @JsonProperty | ||
| private final int arrowBatchSize; |
There was a problem hiding this comment.
With a batch of 1024 values in a contiguous buffer -> emit SIMD instructions that process 8-16 rows per CPU cycle.
There was a problem hiding this comment.
Decompression amortization -> batch-read helps in decompress once and consume many row, since parquet compressed pages .
|
ingestion time at 100k × 5 ReadArrow implementation : 17ms Index + Persist (= total − read)~330ms total timeArrow implementation: 347 ms Index+persist does substantially more work per row than read. So even though Arrow makes read ~3x faster, that gain is dwarfed when amortized over the much-larger indexing cost. Benchmark added https://github.com/Shekharrajak/druid/pull/1/changes |
…Allocation init failure
|
looks like network error https://github.com/apache/druid/actions/runs/26333466916/job/77523245546?pr=19510 - please trigger the CI check again. |
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 1 |
| P3 | 0 |
| Total | 3 |
Reviewed 7 of 7 changed files.
This is an automated review by Codex GPT-5.5
| File temporaryDirectory | ||
| ) | ||
| { | ||
| if (useArrowReader) { |
There was a problem hiding this comment.
[P1] Arrow path bypasses residual FAIL handling
The useArrowReader branch returns an IcebergArrowInputSourceReader before retrieveIcebergDatafiles() runs, so it skips the residual detection that enforces residualFilterMode=FAIL. IcebergArrowInputSourceReaderTest documents that a non-partition equality filter returns all rows from the file, so a spec with useArrowReader=true and residualFilterMode=FAIL will silently ingest residual rows instead of rejecting the ingestion. Please run the same residual check for the Arrow scan or otherwise honor the fail mode before reading.
| // Push column projection into the scan planner — only requested columns are read from disk. | ||
| final List<String> configuredDims = schema.getDimensionsSpec().getDimensionNames(); | ||
| final String timestampColumn = schema.getTimestampSpec().getTimestampColumn(); | ||
| if (!configuredDims.isEmpty()) { |
There was a problem hiding this comment.
[P1] Projection omits required non-dimension columns
When fixed dimensions are configured, this projection selects only those dimension names plus the timestamp column. Druid carries required transform inputs and aggregator source fields through InputRowSchema.getColumnsFilter(), not DimensionsSpec alone. For example, dimensions=[name] with an aggregator over value will scan only ts/name, so value is missing from the event and downstream metrics/transforms read null. Please drive projection from the full columnsFilter/required fields, or avoid pruning when those required columns cannot be represented safely.
| File temporaryDirectory | ||
| ) | ||
| { | ||
| if (useArrowReader) { |
There was a problem hiding this comment.
[P2] Parallel ingestion bypasses the Arrow reader
This special case only affects direct reader() calls. The input source still reports itself as splittable, and Druid's parallel task runners call createSplits() and then withSplit(); withSplit() returns the delegate input source, so subtasks read raw files through the old delegate path instead of IcebergArrowInputSourceReader. With useArrowReader=true, behavior therefore changes based on maxNumConcurrentSubTasks and can lose the Arrow path's Iceberg semantics. Please make the Arrow mode non-splittable or return split-aware Iceberg input sources that preserve useArrowReader.
Fixes #19498
Description
Release note
Apache Iceberg ingestion now supports an opt-in vectorized reader path backed by iceberg-arrow. Enable by setting "useArrowReader": true on the iceberg input source. The Arrow path automatically applies V2 delete files (positional and equality), handles schema evolution, pushes column projection and predicates into the scan planner, and is 2x–3x faster than the existing path.
Future Iceberg spec features (V3 deletion vectors, row lineage, V4+) become available on Iceberg version bumps with no Druid code changes. Default remains the existing path; both coexist.
This PR has: