
Spark: Cache ENTRIES scan in RemoveDanglingDeletesSparkAction to avoid duplicate manifest reads#15519

Open

hemanthboyina wants to merge 1 commit into apache:main from hemanthboyina:cache_dangling_deletes

Conversation

@hemanthboyina
Contributor

The findDanglingDeletes() method in RemoveDanglingDeletesSparkAction currently loads the ENTRIES
metadata table twice: once to compute minimum sequence numbers from data files, and again to find
delete file entries. Each load triggers a full scan of all manifest files. This change loads the
ENTRIES table once, caches the live entries, and filters the data and delete paths from the cached
result. The cache is released in a finally block. This halves the manifest I/O for tables
with many manifests.
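A plain-Python sketch of the single-scan idea described above (not Iceberg code; the entry fields and the dangling-delete rule here are simplified assumptions): load the entries once, then derive both the per-partition minimum data sequence numbers and the candidate delete entries from the same cached list, instead of scanning twice.

```python
def find_dangling_deletes(entries):
    """entries: a cached list of dicts with hypothetical fields
    'content' (0 = data file, 1 = delete file), 'partition', and
    'sequence_number'. One pass over the cached list replaces two
    separate scans of the metadata."""
    # First derivation from the cache: minimum data sequence number
    # per partition.
    min_data_seq = {}
    for e in entries:
        if e["content"] == 0:
            p = e["partition"]
            seq = e["sequence_number"]
            min_data_seq[p] = min(min_data_seq.get(p, seq), seq)

    # Second derivation from the same cache: delete entries whose
    # sequence number falls below the partition's minimum data
    # sequence number (a simplified version of the dangling rule).
    return [
        e for e in entries
        if e["content"] == 1
        and e["sequence_number"] < min_data_seq.get(e["partition"], float("inf"))
    ]


entries = [
    {"content": 0, "partition": "a", "sequence_number": 5},
    {"content": 1, "partition": "a", "sequence_number": 3},  # dangling
    {"content": 1, "partition": "a", "sequence_number": 7},  # still live
]
print(find_dangling_deletes(entries))
# → [{'content': 1, 'partition': 'a', 'sequence_number': 3}]
```

In the actual Spark action the analogue would be caching the ENTRIES Dataset and unpersisting it in a finally block, so both filters reuse one materialized scan.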

@github-actions github-actions bot added the spark label Mar 5, 2026
@dramaticlly
Contributor

@hemanthboyina I don't think we load the same manifests twice, because the manifest status is pushed down to filter which manifests to scan; see more in #10203.
