Description
This is a tracking issue for a set of optimizations to the Delta Lake Deletion Vector (DV) processing path in the Velox backend.
When a Delta table receives MERGE, UPDATE, or DELETE operations, Delta writes Deletion Vectors -- small sidecar files containing bitmaps of logically deleted row positions -- instead of rewriting the affected data files. The physical data stays in place until compaction (OPTIMIZE) runs. This makes mutations fast, but shifts cost to reads: every subsequent query that touches files with pending deletions must load the DV bitmaps from storage and filter out deleted rows. Between compaction cycles, this cost is paid on every read query.
On remote storage (ABFS, S3, HDFS), this cost is amplified because every storage operation involves a network round-trip. The optimizations below target three layers of the DV processing path: query planning (JVM), row filtering (C++ native engine), and file I/O.
Tracked PRs
| PR | Area | Description | CI | Status |
|---|---|---|---|---|
| #12390 | JVM (planning) | Eliminate redundant network calls during DV materialization: cache path resolution per partition, read raw DV bytes directly (skip Java deserialize/re-serialize), early-exit guard for non-Delta queries, fused rule execution | Green | Open, in review |
| #12395 | C++ (native engine) | Iterator-based DV bitmap filtering: replace per-row `contains()` with `move_equalorlarger()` iterator so cost scales with actual deletions, not total rows | Green | Merged |
| #12389 | C++ (plan converter) | Remove double `dynamic_pointer_cast` and unnecessary `std::string` copy of DV data in `parseDeltaSplitInfo` | Green | Merged |
| #12400 | JVM/C++ (file I/O) | Enable file handle cache by default with TTL-based eviction, wire previously dead-code TTL config to Velox cache | Green | Open, in review |
Measured improvements
DV bitmap filtering (C++, PR #12395):
| Deletion density | Speedup |
|---|---|
| 1% (sparse, typical after MERGE/UPDATE) | 198x |
| 10% (moderate) | 10x |
| 50% (dense) | 2x |
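To illustrate why the speedup shrinks as deletion density grows, here is a minimal sketch of the two filtering strategies. It is not the code from PR #12395: a sorted `std::vector` stands in for the Roaring bitmap, `std::binary_search` stands in for the per-row `contains()` probe, and `std::lower_bound` stands in for the `move_equalorlarger()` seek. The function names and the byte-mask output are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a deletion vector: sorted deleted row positions.
// (Assumption: the real DV is a Roaring bitmap; a sorted vector keeps
// this sketch dependency-free.)
using DeletedRows = std::vector<uint64_t>;

// Baseline: one membership probe per row in [begin, end), so the work
// grows with the total row count even when almost nothing is deleted.
std::vector<uint8_t> keepMaskPerRow(const DeletedRows& deleted,
                                    uint64_t begin, uint64_t end) {
  assert(begin <= end);
  std::vector<uint8_t> keep(end - begin, 1);
  for (uint64_t row = begin; row < end; ++row) {
    if (std::binary_search(deleted.begin(), deleted.end(), row)) {
      keep[row - begin] = 0;
    }
  }
  return keep;
}

// Iterator-based: seek once to the first deletion >= begin, then visit
// only the deletions that fall inside the batch. The work grows with the
// number of deleted rows in range, not with the batch size.
std::vector<uint8_t> keepMaskByIterator(const DeletedRows& deleted,
                                        uint64_t begin, uint64_t end) {
  assert(begin <= end);
  std::vector<uint8_t> keep(end - begin, 1);
  // Plays the role of the move_equalorlarger(begin) seek.
  auto it = std::lower_bound(deleted.begin(), deleted.end(), begin);
  for (; it != deleted.end() && *it < end; ++it) {
    keep[*it - begin] = 0;
  }
  return keep;
}
```

Both functions return the same mask for the same input. With 1% of rows deleted the iterator version touches roughly one position per hundred rows, while at 50% it touches every second row, which is consistent with the speedup narrowing in the table above.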
DV materialization (JVM, PR #12390):
- Projected up to 20x faster on remote object storage by eliminating redundant HTTP round-trips per file
- Non-Delta queries: 22x faster rule evaluation via early-exit guard
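The per-partition path caching in PR #12390 is JVM code, but the idea is plain memoization and can be sketched in any language. The example below uses C++ for consistency with the other sketches in this issue. `PartitionPathCache`, its `Resolver` callback, and the `remoteCalls()` counter are hypothetical names invented for illustration, and the callback stands in for whatever remote lookup the planner actually performs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: memoizes an expensive per-partition path lookup so
// that files in the same partition share one resolution.
class PartitionPathCache {
 public:
  using Resolver = std::function<std::string(const std::string&)>;

  explicit PartitionPathCache(Resolver resolver)
      : resolver_(std::move(resolver)) {
    assert(resolver_);  // a callable must be supplied
  }

  // Returns the resolved path, invoking the resolver only on first use
  // of a given partition key.
  const std::string& resolve(const std::string& partition) {
    auto it = cache_.find(partition);
    if (it == cache_.end()) {
      ++remoteCalls_;  // counts how often the expensive path is taken
      it = cache_.emplace(partition, resolver_(partition)).first;
    }
    return it->second;
  }

  std::size_t remoteCalls() const { return remoteCalls_; }

 private:
  Resolver resolver_;
  std::unordered_map<std::string, std::string> cache_;
  std::size_t remoteCalls_ = 0;
};
```

Under this sketch, resolving a thousand files spread over ten partitions would invoke the resolver ten times rather than a thousand, which is the kind of round-trip reduction the projection above relies on.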
File handle caching (PR #12400):
- Estimated 40-70% improvement for repeated scans of many small files on remote storage
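As a rough picture of TTL-based eviction for cached file handles, consider the sketch below. It is not the Velox file handle cache or its configuration surface: `FileHandle`, `TtlHandleCache`, and the injectable `now` parameter are hypothetical, and the TTL is assumed to be measured from the time the handle was opened rather than from its last access.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an open remote file.
struct FileHandle {
  std::string path;
};

// Minimal TTL cache: a handle is reused until it is older than `ttl`,
// then it is reopened on the next request.
class TtlHandleCache {
 public:
  using Clock = std::chrono::steady_clock;

  explicit TtlHandleCache(std::chrono::milliseconds ttl) : ttl_(ttl) {
    assert(ttl.count() >= 0);
  }

  // `now` is injectable so expiry can be exercised without sleeping.
  std::shared_ptr<FileHandle> get(const std::string& path,
                                  Clock::time_point now = Clock::now()) {
    auto it = entries_.find(path);
    if (it != entries_.end() && now - it->second.openedAt < ttl_) {
      return it->second.handle;  // fresh entry: skip the remote open
    }
    // Missing or expired: simulate opening the file again.
    auto handle = std::make_shared<FileHandle>(FileHandle{path});
    entries_[path] = Entry{handle, now};
    ++opens_;
    return handle;
  }

  std::size_t opens() const { return opens_; }

 private:
  struct Entry {
    std::shared_ptr<FileHandle> handle;
    Clock::time_point openedAt;
  };

  std::chrono::milliseconds ttl_;
  std::unordered_map<std::string, Entry> entries_;
  std::size_t opens_ = 0;
};
```

Repeated scans of the same small files within the TTL reuse the cached handle and avoid the open round-trip, while expiry bounds how long a stale handle can linger after a file is rewritten or removed.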
Was this issue authored or co-authored using generative AI tooling?
No