feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266
feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266Shekharrajak wants to merge 43 commits into
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
Findings that could not be attached inline:
- extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.
8e510d3 to
309c97b
Compare
updated : 309c97b#diff-1b9776e43fa17c32a610eee043a6fd6cbf32b29ae4026d24ea798ac9a6f638bcR168 |
|
Noted gaps in iceberg v2 spec support #19471 |
… delete application
…d no-delete scenarios
… verifySqlQuery signature
6c98a82 to
bab8237
Compare
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
Reviewed 15 of 15 changed files. The earlier FileIO metadata follow-up is addressed, so no inline reply is needed; the new finding is below.
This is an automated review by Codex GPT-5.5
|
|
||
| // Step 3: Stream data file with delete application | ||
| final InputFile dataInputFile = fileIO.newInputFile(dataFilePath); | ||
| final CloseableIterable<Record> records = Parquet.read(dataInputFile) |
There was a problem hiding this comment.
[P2] V2 delete path only reads Parquet files
The new v2 path bypasses the configured inputFormat and always opens the data file with Parquet.read(...), and the delete readers do the same for delete files. The extension still documents Iceberg data files as Parquet, ORC, or Avro via the delegated warehouseSource/inputFormat path, so any ORC/Avro Iceberg table with delete files will now route through this native reader and fail to parse the files instead of using the configured format-specific reader. Please either dispatch by Iceberg FileFormat / reader support or explicitly reject/document v2 delete support as Parquet-only.
There was a problem hiding this comment.
Let me document for parquet only and follow up PR will add it : #19472
There was a problem hiding this comment.
Thanks, this now guards the data file format, but I think one edge of the original concern is still open: Iceberg delete files carry their own deleteFile.format(), separate from task.file().format(). Right now DeleteFileInfo only carries path/content/equality fields, and requireParquet(deleteFileInfo.getPath()) checks the data file format, so a PARQUET data file with an ORC/AVRO delete file would still reach Parquet.read(deleteInputFile) instead of being explicitly rejected. Could we preserve the delete-file format and guard each delete file too?
Also, the user docs still say Iceberg data files can be Parquet, ORC, or Avro and the new v2 delete section does not call out the Parquet-only limitation, so it would be worth documenting that limitation there as well.
…ceberg V2 delete tests
…id row data requirement
|
Flaky test reported #19491 Please help in triggering the one failed CI check run. We can work on this flaky test separately. |
FrankChen021
left a comment
There was a problem hiding this comment.
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 15 of 15 changed files.
This is an automated review by Codex GPT-5.5
Ref #19190
Description
In IcebergCatalog.extractSnapshotDataFiles(), line:
discards task.deletes() entirely. Every FileScanTask from tableScan.planFiles() carries a List that must
be applied for correct v2 reads. The current code passes only raw file paths to warehouseSource, and Druid's
ParquetReader has zero awareness of Iceberg delete files.
Changes
DeleteFileInfo.java Serializable POJO: path, contentType (POSITION/EQUALITY), equalityFieldIds
IcebergFileTaskInputSource.java Per-task InputSource carrying data file + delete metadata + schema JSON +
warehouseSource
IcebergNativeRecordReader.java Manual positional + equality delete application with streaming reads via
Parquet.read()
IcebergRecordConverter.java Iceberg Record to Map with full type coverage
Release note
Iceberg V2 Delete File Support: When FileScanTask.deletes() returns non-empty, the extension creates per-task IcebergFileTaskInputSource objects
carrying serializable metadata (data file path, delete file paths/types/equality field IDs, schema JSON). Workers
apply deletes at read time via IcebergNativeRecordReader which reads position-delete and equality-delete Parquet
files, builds filter sets, and streams the data file while skipping deleted rows. V1 tables (no delete files) continue
to use the existing warehouseSource path unchanged.
Key changed/added classes in this PR
DeleteFileInfo.javaIcebergFileTaskInputSource.javaIcebergNativeRecordReader.javaIcebergRecordConverter.javaIcebergInputSource.javaIcebergCatalog.javaThis PR has: