Core: Support Hadoop bulk delete API. #15436

Draft
steveloughran wants to merge 5 commits into apache:main from
steveloughran:pr/12055-bulk-delete-2026

Conversation

@steveloughran
Contributor

@steveloughran steveloughran commented Feb 24, 2026

Uses the Hadoop 3.4.0+ BulkDelete API so that S3 object deletions can be done in pages of objects rather than one at a time.

The configuration option "iceberg.hadoop.bulk.delete.enabled" switches to bulk deletes.

All code using the API is in BulkDeleter.java, which also contains a probe for the availability of the operation. This ensures there is no accidental use of the method.

Reflection-based use of the Hadoop 3.4.1+ BulkDelete API so that
S3 object deletions can be done in pages of objects, rather
than one at a time.

* Configuration option "iceberg.hadoop.bulk.delete.enabled" to switch
  to bulk deletes.
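As a rough illustration of the "probe for availability" idea described above, the check can be done with a plain `Class.forName` lookup that does not initialize the class. The class and method names here are hypothetical sketches, not the actual contents of BulkDeleter.java; the real probe may differ.

```java
// Hypothetical sketch of an availability probe for the Hadoop BulkDelete API.
// Taking the class name as a parameter also makes the ClassNotFoundException
// path easy to exercise in tests by passing a nonexistent class.
public class BulkDeleteProbe {

  // The Hadoop 3.4.1+ interface the probe would look for.
  static final String BULK_DELETE_CLASS = "org.apache.hadoop.fs.BulkDelete";

  static boolean bulkDeleteAvailable(String className) {
    try {
      // initialize=false: just resolve the class, do not run static init.
      Class.forName(className, false, BulkDeleteProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | NoClassDefFoundError e) {
      // Older Hadoop on the classpath: the API is absent.
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(bulkDeleteAvailable(BULK_DELETE_CLASS));
  }
}
```

Run without Hadoop on the classpath, the probe reports `false`; with Hadoop 3.4.1+ present it would report `true`.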
@steveloughran steveloughran marked this pull request as draft February 24, 2026 19:06
@steveloughran
Contributor Author

There's something else to consider here: do we need full reflection, given that the method is available at compile time? Instead, we could only use the operations when enabled, catch link failures, and report them better.

Then there'd be Spark tests where 4.0 and 4.1 verify the operation is there, and 3.x expects failure when requested.

Uses the API directly in iceberg-core, which is compiled against Hadoop 3.4.3.
But this is isolated to one class, org.apache.iceberg.hadoop.BulkDeleter, which is
only loaded when bulk delete is enabled with "iceberg.hadoop.bulk.delete.enabled".

There's no attempt at a graceful fallback: if bulk delete is enabled and the API
is not found, bulk delete will fail.
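The "enabled, else fail loudly" behaviour described above could look something like the following. This is a sketch with hypothetical names (the real gating lives in iceberg-core and is keyed off the actual configuration mechanism, not a bare `Map`).

```java
import java.util.Map;

// Hypothetical sketch of flag-gated class loading with no silent fallback:
// when the flag is off, stay on the classic one-at-a-time deletion path;
// when it is on and the API class is missing, fail with a clear message.
public class BulkDeleteLoader {

  static final String ENABLED_KEY = "iceberg.hadoop.bulk.delete.enabled";

  static Class<?> loadBulkDelete(Map<String, String> conf, String className) {
    if (!Boolean.parseBoolean(conf.getOrDefault(ENABLED_KEY, "false"))) {
      return null; // disabled: caller uses per-object deletion
    }
    try {
      return Class.forName(className);
    } catch (ClassNotFoundException e) {
      // Enabled but the Hadoop release on the classpath lacks the API:
      // surface the misconfiguration instead of degrading silently.
      throw new IllegalStateException(
          "Bulk delete enabled via " + ENABLED_KEY + " but " + className
              + " is not on the classpath", e);
    }
  }
}
```

This keeps the failure mode explicit: a deployment that asks for bulk deletes on Hadoop < 3.4.x gets an immediate, attributable error rather than quietly slower deletes.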
Testing is done with a new class in iceberg-spark 3.5.
This works by mocking the ClassNotFoundException (CNFE) failure condition in the
safety probe, allowing tests to point to a nonexistent class.
As a result it verifies that:
* if the class isn't found, bulk delete fails meaningfully;
* the API isn't used.

Ideally these tests would be run in the Spark 3.4/3.5 modules, but their classpath
still pulls in hadoop-3.4.3 and it would be hard work to remove.
