
[VL] Optimize Delta Lake Deletion Vector processing on remote storage #12399

Description

@iemejia


This is a tracking issue for a set of optimizations to the Delta Lake Deletion Vector (DV) processing path in the Velox backend.

When a Delta table receives MERGE, UPDATE, or DELETE operations, Delta writes Deletion Vectors -- small sidecar files containing bitmaps of logically deleted row positions -- instead of rewriting the affected data files. The physical data stays in place until compaction (OPTIMIZE) runs. This makes mutations fast, but shifts cost to reads: every subsequent query that touches files with pending deletions must load the DV bitmaps from storage and filter out deleted rows. Between compaction cycles, this cost is paid on every read query.
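To make the write/read trade-off concrete, here is a minimal sketch of the merge-on-read idea described above. It is a toy model, not Delta's or Gluten's implementation: the `DataFile` class and its methods are hypothetical names, and a plain Python `set` stands in for the serialized DV bitmap.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy model of one data file plus its deletion vector (hypothetical)."""
    rows: list
    deletion_vector: set = field(default_factory=set)

    def delete(self, positions):
        # A mutation only records row positions; the data is not rewritten.
        self.deletion_vector.update(positions)

    def read(self):
        # Every read filters out deleted positions while deletions are pending.
        return [r for i, r in enumerate(self.rows)
                if i not in self.deletion_vector]

    def compact(self):
        # Stand-in for OPTIMIZE: rewrite the data and drop the DV.
        self.rows = self.read()
        self.deletion_vector = set()


f = DataFile(rows=["a", "b", "c", "d", "e"])
f.delete([1, 3])
visible_before = f.read()            # deleted rows are hidden from readers
physical_rows_before = len(f.rows)   # but are still physically present
f.compact()
```

In this model `delete` is cheap, `read` carries the filtering cost on every call, and only `compact` removes that cost by rewriting the file.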

On remote storage (ABFS, S3, HDFS), this cost is amplified because every storage operation involves a network round-trip. The optimizations below target three layers of the DV processing path: query planning (JVM), row filtering (C++ native engine), and file I/O.

Tracked PRs

| PR | Area | Description | CI | Status |
|---|---|---|---|---|
| #12390 | JVM (planning) | Eliminate redundant network calls during DV materialization: cache path resolution per partition, read raw DV bytes directly (skip Java deserialize/re-serialize), early-exit guard for non-Delta queries, fused rule execution | Green | Open, in review |
| #12395 | C++ (native engine) | Iterator-based DV bitmap filtering: replace per-row contains() with move_equalorlarger() iterator so cost scales with actual deletions, not total rows | Green | Merged |
| #12389 | C++ (plan converter) | Remove double dynamic_pointer_cast and unnecessary std::string copy of DV data in parseDeltaSplitInfo | Green | Merged |
| #12400 | JVM/C++ (file I/O) | Enable file handle cache by default with TTL-based eviction, wire previously dead-code TTL config to Velox cache | Green | Open, in review |

Measured improvements

DV bitmap filtering (C++, PR #12395):

| Deletion density | Speedup |
|---|---|
| 1% (sparse, typical after MERGE/UPDATE) | 198x |
| 10% (moderate) | 10x |
| 50% (dense) | 2x |
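The following sketch illustrates why the speedup shrinks as deletion density grows. It is not the code from PR #12395: a sorted Python list stands in for the Roaring bitmap, `bisect_left` stands in for the iterator's `move_equalorlarger()` seek, and both function names are hypothetical. The per-row version probes the bitmap once per row, while the iterator version does one seek per deleted position (plus one final seek) and copies the surviving ranges in bulk.

```python
from bisect import bisect_left


def filter_per_row(num_rows, deleted_sorted):
    """Baseline: one membership probe per row, regardless of deletions."""
    deleted = set(deleted_sorted)
    kept, probes = [], 0
    for pos in range(num_rows):
        probes += 1
        if pos not in deleted:
            kept.append(pos)
    return kept, probes


def filter_with_iterator(num_rows, deleted_sorted):
    """Seek-based: jump to the next deleted position, keep the gap before it."""
    kept, seeks = [], 0
    cursor = 0  # first row position not yet classified
    idx = 0     # index into deleted_sorted, only ever moves forward
    while cursor < num_rows:
        # Analogue of move_equalorlarger(cursor) on a bitmap iterator.
        idx = bisect_left(deleted_sorted, cursor, idx)
        seeks += 1
        if idx < len(deleted_sorted):
            next_deleted = min(deleted_sorted[idx], num_rows)
        else:
            next_deleted = num_rows  # no more deletions in this file
        kept.extend(range(cursor, next_deleted))  # bulk-keep the live range
        cursor = next_deleted + 1                 # skip the deleted row
    return kept, seeks


deleted = [10, 500, 999]
kept_a, probes = filter_per_row(1000, deleted)
kept_b, seeks = filter_with_iterator(1000, deleted)
```

With 3 deletions in 1000 rows, the baseline performs 1000 probes and the seek-based version performs 3 seeks. At 50% density the number of seeks approaches half the row count, which is consistent with the smaller gain reported for dense DVs.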

DV materialization (JVM, PR #12390):

  • Projected up to 20x faster on remote object storage by eliminating redundant HTTP round-trips per file
  • Non-Delta queries: 22x faster rule evaluation via early-exit guard
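As a rough illustration of two of the planning-side ideas (per-partition caching of path resolution and the early-exit guard), consider the sketch below. It assumes that path resolution costs one remote call per lookup; `PathResolver`, `materialize_dv_paths`, the `s3://bucket/table` location, and the file names are all made up for the example and do not reflect the actual code in PR #12390.

```python
class PathResolver:
    """Caches the resolved base path per partition (hypothetical helper)."""

    def __init__(self, remote_lookup):
        self._remote_lookup = remote_lookup
        self._cache = {}
        self.remote_calls = 0  # counts simulated network round-trips

    def resolve(self, partition):
        if partition not in self._cache:
            self.remote_calls += 1
            self._cache[partition] = self._remote_lookup(partition)
        return self._cache[partition]


def materialize_dv_paths(files, resolver, is_delta):
    # Early-exit guard: non-Delta plans skip all DV work.
    if not is_delta:
        return []
    return [f"{resolver.resolve(f['partition'])}/{f['dv_name']}" for f in files]


# Four files spread over two partitions (made-up names).
files = [
    {"partition": "date=2024-01-01", "dv_name": "dv_0.bin"},
    {"partition": "date=2024-01-01", "dv_name": "dv_1.bin"},
    {"partition": "date=2024-01-02", "dv_name": "dv_2.bin"},
    {"partition": "date=2024-01-02", "dv_name": "dv_3.bin"},
]
resolver = PathResolver(lambda p: f"s3://bucket/table/{p}")
paths = materialize_dv_paths(files, resolver, is_delta=True)
skipped = materialize_dv_paths(files, resolver, is_delta=False)
```

Here four files trigger two simulated remote lookups instead of four, and the non-Delta call returns without touching the resolver at all.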

File handle caching (PR #12400):

  • Estimated 40-70% improvement for repeated scans of many small files on remote storage
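A TTL-based handle cache can be sketched as follows. This is only an illustration of the general technique, not Velox's file handle cache: the `TtlHandleCache` class is hypothetical, the TTL is assumed to count from insertion rather than last access, and an injected clock is used so the example is deterministic.

```python
import time


class TtlHandleCache:
    """Reuses opened handles until they are older than ttl_seconds."""

    def __init__(self, open_fn, ttl_seconds, clock=time.monotonic):
        self._open_fn = open_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # path -> (handle, time inserted)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[1] < self._ttl:
            self.hits += 1
            return entry[0]
        # Missing or expired: pay the (remote) open cost again.
        self.misses += 1
        handle = self._open_fn(path)
        self._entries[path] = (handle, now)
        return handle


fake_now = [0.0]  # mutable fake clock for the example
cache = TtlHandleCache(
    open_fn=lambda p: f"handle:{p}",
    ttl_seconds=60,
    clock=lambda: fake_now[0],
)
cache.get("part-0.parquet")  # miss: first open
cache.get("part-0.parquet")  # hit: reused within the TTL
fake_now[0] = 61.0
cache.get("part-0.parquet")  # miss: entry expired, reopened
```

Repeated scans within the TTL reuse the handle and avoid the open round-trip, while expiry bounds how long a stale handle can be served.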

Was this issue authored or co-authored using generative AI tooling?

No
