
Spark: Cache ENTRIES scan in RemoveDanglingDeletesSparkAction to avoid duplicate manifest reads#15519

Open

hemanthboyina wants to merge 1 commit into apache:main from hemanthboyina:cache_dangling_deletes

Conversation

@hemanthboyina
Contributor

The findDanglingDeletes() method in RemoveDanglingDeletesSparkAction currently loads the ENTRIES
metadata table twice: once to compute minimum sequence numbers from data files, and again to find
delete file entries. Each load triggers a full scan of all manifest files. This change loads the
ENTRIES table once, caches the live entries, and filters the data and delete paths from the cached
result. The cache is released in a finally block. This halves the manifest I/O for tables
with many manifests.
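A plain-Python sketch of the single-scan idea described above (not Iceberg code; the entry fields and the dangling-delete rule here are simplified assumptions): load the entries once, then derive both the per-partition minimum data sequence numbers and the candidate delete entries from the same cached list, instead of scanning twice.

```python
def find_dangling_deletes(entries):
    """entries: a cached list of dicts with hypothetical fields
    'content' (0 = data file, 1 = delete file), 'partition', and
    'sequence_number'. One pass over the cached list replaces two
    separate scans of the metadata."""
    # First derivation from the cache: minimum data sequence number
    # per partition.
    min_data_seq = {}
    for e in entries:
        if e["content"] == 0:
            p = e["partition"]
            seq = e["sequence_number"]
            min_data_seq[p] = min(min_data_seq.get(p, seq), seq)

    # Second derivation from the same cache: delete entries whose
    # sequence number falls below the partition's minimum data
    # sequence number (a simplified version of the dangling rule).
    return [
        e for e in entries
        if e["content"] == 1
        and e["sequence_number"] < min_data_seq.get(e["partition"], float("inf"))
    ]


entries = [
    {"content": 0, "partition": "a", "sequence_number": 5},
    {"content": 1, "partition": "a", "sequence_number": 3},  # dangling
    {"content": 1, "partition": "a", "sequence_number": 7},  # still live
]
print(find_dangling_deletes(entries))
# → [{'content': 1, 'partition': 'a', 'sequence_number': 3}]
```

In the actual Spark action the analogue would be caching the ENTRIES Dataset and unpersisting it in a finally block, so both filters reuse one materialized scan.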

@github-actions github-actions bot added the spark label Mar 5, 2026
@dramaticlly
Contributor

@hemanthboyina I don't think we load the same manifests twice, because the manifest status is pushed down to filter which manifests to scan; see more in #10203.
