feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266

Open
Shekharrajak wants to merge 43 commits into
apache:masterfrom
Shekharrajak:feature/iceberg-v2-delete-support

Conversation

@Shekharrajak
Contributor

@Shekharrajak Shekharrajak commented Apr 6, 2026

Ref #19190

Description

In IcebergCatalog.extractSnapshotDataFiles(), line:

dataFilePaths.add(task.file().location());

discards task.deletes() entirely. Every FileScanTask from tableScan.planFiles() carries a List<DeleteFile> that must
be applied for correct v2 reads. The current code passes only raw file paths to warehouseSource, and Druid's
ParquetReader has no awareness of Iceberg delete files.
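
A minimal sketch of the gap, using hypothetical stand-in types (ScanTask and DeleteRef are illustrative only, not the real Iceberg FileScanTask/DeleteFile classes): collecting only the data file location drops the delete files attached to each task, while keeping the per-task pairing preserves them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeleteAwarePlanning {
    // Hypothetical stand-ins for Iceberg's DeleteFile and FileScanTask.
    record DeleteRef(String path, String content) {}
    record ScanTask(String dataFilePath, List<DeleteRef> deletes) {}

    // Mirrors the behavior described above: only data file paths survive.
    static List<String> pathsOnly(List<ScanTask> tasks) {
        List<String> paths = new ArrayList<>();
        for (ScanTask task : tasks) {
            paths.add(task.dataFilePath());
        }
        return paths;
    }

    // Keeps each data file paired with the delete files that apply to it.
    static Map<String, List<DeleteRef>> pathsWithDeletes(List<ScanTask> tasks) {
        Map<String, List<DeleteRef>> result = new LinkedHashMap<>();
        for (ScanTask task : tasks) {
            result.put(task.dataFilePath(), List.copyOf(task.deletes()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<ScanTask> tasks = List.of(
            new ScanTask("data/a.parquet",
                List.of(new DeleteRef("del/a-pos.parquet", "POSITION"))),
            new ScanTask("data/b.parquet", List.of()));
        System.out.println(pathsOnly(tasks));
        System.out.println(pathsWithDeletes(tasks));
    }
}
```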

Changes

  • DeleteFileInfo.java: Serializable POJO with path, contentType (POSITION/EQUALITY), and equalityFieldIds
  • IcebergFileTaskInputSource.java: per-task InputSource carrying the data file, delete metadata, schema JSON, and warehouseSource
  • IcebergNativeRecordReader.java: manual positional and equality delete application with streaming reads via Parquet.read()
  • IcebergRecordConverter.java: Iceberg Record to Map conversion with full type coverage

Release note

Iceberg V2 Delete File Support: When FileScanTask.deletes() returns a non-empty list, the extension creates per-task
IcebergFileTaskInputSource objects carrying serializable metadata (data file path, delete file paths/types/equality
field IDs, schema JSON). Workers apply deletes at read time via IcebergNativeRecordReader, which reads the
position-delete and equality-delete Parquet files, builds filter sets, and streams the data file while skipping
deleted rows. V1 tables (no delete files) continue to use the existing warehouseSource path unchanged.
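
The read-time filtering described above can be sketched roughly as follows. This is an illustration only, not the code in IcebergNativeRecordReader: rows are modeled as plain maps, the class and method names are hypothetical, and the sketch ignores details such as sequence-number scoping of equality deletes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeleteFilterSketch {
    // Streams rows in file order and skips rows removed by position or equality deletes.
    // deletedPositions: 0-based row ordinals taken from position-delete files.
    // deletedKeys: value tuples from equality-delete files, projected on equalityFields.
    static List<Map<String, Object>> applyDeletes(
            Iterable<Map<String, Object>> rows,
            Set<Long> deletedPositions,
            List<String> equalityFields,
            Set<List<Object>> deletedKeys) {
        List<Map<String, Object>> kept = new ArrayList<>();
        long position = 0;
        for (Map<String, Object> row : rows) {
            long current = position++;
            if (deletedPositions.contains(current)) {
                continue; // removed by a position delete
            }
            if (!deletedKeys.isEmpty() && deletedKeys.contains(project(row, equalityFields))) {
                continue; // removed by an equality delete
            }
            kept.add(row);
        }
        return kept;
    }

    // Builds the comparison key for a row from the equality fields, in order.
    static List<Object> project(Map<String, Object> row, List<String> fields) {
        List<Object> key = new ArrayList<>(fields.size());
        for (String field : fields) {
            key.add(row.get(field));
        }
        return key;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
            Map.of("id", 1, "name", "a"),
            Map.of("id", 2, "name", "b"),
            Map.of("id", 3, "name", "c"));
        Set<Long> deletedPositions = Set.of(0L);
        Set<List<Object>> deletedKeys = new HashSet<>();
        deletedKeys.add(List.of(3));
        // Row 0 is dropped by position, the row with id=3 by equality; only id=2 remains.
        System.out.println(applyDeletes(rows, deletedPositions, List.of("id"), deletedKeys));
    }
}
```

Both filter sets are built once before the data file is streamed, so each row costs only two hash lookups.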


Key changed/added classes in this PR
  • DeleteFileInfo.java
  • IcebergFileTaskInputSource.java
  • IcebergNativeRecordReader.java
  • IcebergRecordConverter.java
  • IcebergInputSource.java
  • IcebergCatalog.java
  • IcebergDruidModule.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Shekharrajak Shekharrajak changed the title Iceberg V2 Delete File Support Iceberg V2 delete file support in druid-iceberg-extensions Apr 6, 2026
@Shekharrajak Shekharrajak changed the title Iceberg V2 delete file support in druid-iceberg-extensions feat: Iceberg V2 delete file support in druid-iceberg-extensions Apr 6, 2026
Member

@FrankChen021 FrankChen021 left a comment

Findings that could not be attached inline:

  • extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.
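
One way to picture the expected behavior, as a sketch with hypothetical names (FileTask and the static helpers below are not the Druid SplittableInputSource API): each file task becomes its own split, so a table with delete files still yields schedulable work instead of an empty stream.

```java
import java.util.List;
import java.util.stream.Stream;

public class SplitPlanningSketch {
    // Hypothetical per-task unit of work: a data file plus the delete files that apply to it.
    record FileTask(String dataFilePath, List<String> deleteFilePaths) {}

    // One split per file task, regardless of whether the task has delete files.
    static Stream<FileTask> createSplits(List<FileTask> tasks) {
        return tasks.stream();
    }

    // Must agree with the number of elements createSplits() produces.
    static int estimateNumSplits(List<FileTask> tasks) {
        return tasks.size();
    }

    // Narrows the input to a single split, the way a worker would receive it.
    static List<FileTask> withSplit(FileTask split) {
        return List.of(split);
    }

    public static void main(String[] args) {
        List<FileTask> tasks = List.of(
            new FileTask("data/a.parquet", List.of("del/a-pos.parquet")),
            new FileTask("data/b.parquet", List.of()));
        System.out.println(estimateNumSplits(tasks));
        createSplits(tasks).forEach(split -> System.out.println(withSplit(split)));
    }
}
```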

@Shekharrajak
Contributor Author

  • extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.

Updated in 309c97b#diff-1b9776e43fa17c32a610eee043a6fd6cbf32b29ae4026d24ea798ac9a6f638bcR168

@Shekharrajak
Contributor Author

Noted the remaining gaps in Iceberg v2 spec support in #19471

@Shekharrajak Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 6c98a82 to bab8237 Compare May 18, 2026 11:53
Member

@FrankChen021 FrankChen021 left a comment

Severity Findings
P0 0
P1 0
P2 1
P3 0
Total 1

Reviewed 15 of 15 changed files. The earlier FileIO metadata follow-up is addressed, so no inline reply is needed; the new finding is below.


This is an automated review by Codex GPT-5.5


// Step 3: Stream data file with delete application
final InputFile dataInputFile = fileIO.newInputFile(dataFilePath);
final CloseableIterable<Record> records = Parquet.read(dataInputFile)
Member

[P2] V2 delete path only reads Parquet files

The new v2 path bypasses the configured inputFormat and always opens the data file with Parquet.read(...), and the delete readers do the same for delete files. The extension still documents Iceberg data files as Parquet, ORC, or Avro via the delegated warehouseSource/inputFormat path, so any ORC/Avro Iceberg table with delete files will now route through this native reader and fail to parse the files instead of using the configured format-specific reader. Please either dispatch by Iceberg FileFormat / reader support or explicitly reject/document v2 delete support as Parquet-only.

Contributor Author

I'll document this as Parquet-only for now; a follow-up PR will add support for the other formats: #19472

Member

Thanks, this now guards the data file format, but I think one edge of the original concern is still open: Iceberg delete files carry their own deleteFile.format(), separate from task.file().format(). Right now DeleteFileInfo only carries path/content/equality fields, and requireParquet(deleteFileInfo.getPath()) checks the data file format, so a PARQUET data file with an ORC/AVRO delete file would still reach Parquet.read(deleteInputFile) instead of being explicitly rejected. Could we preserve the delete-file format and guard each delete file too?

Also, the user docs still say Iceberg data files can be Parquet, ORC, or Avro and the new v2 delete section does not call out the Parquet-only limitation, so it would be worth documenting that limitation there as well.
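
A rough sketch of what guarding each file could look like. The Format enum, the DeleteInfo record, and validateTask are hypothetical names for illustration; the actual DeleteFileInfo and requireParquet in this PR may differ.

```java
import java.util.List;

public class FormatGuardSketch {
    enum Format { PARQUET, ORC, AVRO }

    // Hypothetical delete metadata that carries the delete file's own format.
    record DeleteInfo(String path, String content, Format format) {}

    // Rejects any file that the Parquet-only read path cannot handle.
    static void requireParquet(String path, Format format) {
        if (format != Format.PARQUET) {
            throw new UnsupportedOperationException(
                "Iceberg v2 delete handling supports only PARQUET, but [" + path + "] is " + format);
        }
    }

    // Validates the data file and every delete file before any reader is opened.
    static void validateTask(String dataPath, Format dataFormat, List<DeleteInfo> deletes) {
        requireParquet(dataPath, dataFormat);
        for (DeleteInfo delete : deletes) {
            requireParquet(delete.path(), delete.format());
        }
    }

    public static void main(String[] args) {
        List<DeleteInfo> deletes = List.of(
            new DeleteInfo("del/a-pos.parquet", "POSITION", Format.PARQUET),
            new DeleteInfo("del/a-eq.orc", "EQUALITY", Format.ORC));
        try {
            validateTask("data/a.parquet", Format.PARQUET, deletes);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking every file up front means a mixed-format task fails with a clear message rather than with a Parquet parse error partway through the read.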

@jtuglu1 jtuglu1 self-requested a review May 19, 2026 08:33
@Shekharrajak
Contributor Author

Flaky test reported in #19491.

Please help re-trigger the one failed CI check. We can work on the flaky test separately.

Member

@FrankChen021 FrankChen021 left a comment

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 15 of 15 changed files.


This is an automated review by Codex GPT-5.5


3 participants