[VL][Delta] Track native Delta writer improvements and optimization support #12025

@malinjawi

Goal

Track Delta-only work to improve native Gluten/Velox support for Delta writer and table optimization paths.

This issue is organized by feature-sized work areas. Each top-level task should map to one reviewable PR or a small stack of tightly related PRs.

Related Work

Feature Tracks

  • OPTIMIZE Compaction

    Native support for Delta OPTIMIZE compaction/bin-packing command paths.

    Scope:

    • offload OPTIMIZE command transactions through GlutenOptimisticTransaction
    • cover path-based and table-name OPTIMIZE forms
    • cover OPTIMIZE ... WHERE partition predicates
    • keep OPTIMIZE read/shuffle/write native where supported
    • validate returned OPTIMIZE metrics and file statistics
    • benchmark compaction on small-file Delta tables

    Related PR:

    Expected coverage:

    • path-based OPTIMIZE
    • table-name OPTIMIZE
    • OPTIMIZE ... WHERE partition_predicate
    • native-write-disabled fallback
    • data correctness before and after compaction
    • Delta log add/remove-file metadata correctness
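
A minimal sketch of how the "Delta log add/remove-file metadata correctness" check could look, written as plain Python over the JSON lines of one commit file. The helper names (`summarize_commit`, `check_compaction`) are hypothetical and not part of Gluten or Delta. It assumes a compaction commit removes only files that were live before the commit and marks its actions with `dataChange=false`; a missing `dataChange` field is conservatively treated as `true`.

```python
import json


def summarize_commit(commit_lines):
    """Collect add/remove paths from one Delta commit (one JSON action per line)."""
    added, removed, data_change = [], [], False
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        for kind, bucket in (("add", added), ("remove", removed)):
            if kind in action:
                bucket.append(action[kind]["path"])
                # Assumption: a missing dataChange flag counts as a data change.
                data_change = data_change or action[kind].get("dataChange", True)
    return {"added": added, "removed": removed, "data_change": data_change}


def check_compaction(live_before, commit_lines):
    """Return a list of problems found in a compaction commit (empty if none)."""
    summary = summarize_commit(commit_lines)
    live_before = set(live_before)
    problems = []
    unknown = set(summary["removed"]) - live_before
    if unknown:
        problems.append(f"removed files that were not live: {sorted(unknown)}")
    reused = set(summary["added"]) & live_before
    if reused:
        problems.append(f"added files that were already live: {sorted(reused)}")
    if summary["data_change"]:
        problems.append("compaction actions should carry dataChange=false")
    return problems
```

A test would read the newest file under `_delta_log/` after OPTIMIZE, pass its lines together with the set of files live before the command, and assert that the returned list is empty.
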
  • Optimized Write

    Native support and correctness hardening for Delta optimized write paths.

    Scope:

    • verify native behavior when the table property delta.autoOptimize.optimizeWrite is set to true
    • verify native behavior when the session conf spark.databricks.delta.optimizeWrite.enabled is set to true
    • verify behavior of the DataFrameWriter option optimizeWrite
    • cover non-partitioned optimized writes
    • cover partitioned optimized writes
    • validate output file sizing and partition layout metadata
    • reduce unnecessary columnar-to-row transitions in write, stats, and commit paths

    Related PRs:

    Expected coverage:

    • non-partitioned append and overwrite
    • partitioned append and overwrite
    • optimized-write table property, SQL conf, and writer option
    • partition values in add-file metadata
    • min/max/nullCount stats in add-file metadata
    • native and fallback plan assertions
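
The expected coverage above is effectively a cross product of dimensions. As a rough sketch, the cases could be enumerated like this; the dictionary keys and the function name `optimized_write_cases` are hypothetical, and only the three setting names come from this issue.

```python
from itertools import product

# The three ways of enabling optimized write listed in the scope above.
SETTING_FOR = {
    "table_property": "delta.autoOptimize.optimizeWrite",
    "sql_conf": "spark.databricks.delta.optimizeWrite.enabled",
    "writer_option": "optimizeWrite",
}
LAYOUTS = ("non_partitioned", "partitioned")
MODES = ("append", "overwrite")
NATIVE_WRITE = (True, False)  # native plan versus fallback plan assertions


def optimized_write_cases():
    """Enumerate every combination of enable mechanism, layout, mode, and native flag."""
    return [
        {"enable_via": via, "layout": layout, "mode": mode, "native_write": native}
        for via, layout, mode, native in product(
            SETTING_FOR, LAYOUTS, MODES, NATIVE_WRITE
        )
    ]
```

This yields 3 x 2 x 2 x 2 = 24 cases. Each case would then assert partition values and min/max/nullCount stats in the add-file metadata, plus the expected native or fallback plan shape.
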
  • OPTIMIZE ZORDER

    Native support for Delta ZORDER layout operations.

    Scope:

    • add native support for Delta ZORDER expressions such as InterleaveBits
    • add native support for RangePartitionId
    • keep ZORDER read/shuffle/sort/write native where supported
    • improve fallback diagnostics when ZORDER cannot stay native
    • validate ZORDER output correctness and Delta log metadata
    • benchmark ZORDER on larger Delta layout workloads

    Expected coverage:

    • OPTIMIZE ... ZORDER BY (...)
    • OPTIMIZE ... WHERE ... ZORDER BY (...)
    • single-column and multi-column ZORDER
    • native expression coverage for ZORDER planning
    • data correctness and Delta log metadata after ZORDER
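
For reviewers less familiar with Z-ordering, here is a small illustration of bit interleaving, the idea behind an expression such as InterleaveBits. This is a generic sketch: the bit order (most significant bit first, columns taken in the order given) and the integer return type are assumptions, and it may not match Delta's implementation bit for bit.

```python
def interleave_bits(values, width=32):
    """Interleave the bits of non-negative integers into a single Z-order key.

    Bits are taken from most significant to least significant; at each bit
    position one bit is taken from every input value, in input order.
    """
    for v in values:
        if v < 0 or v >= (1 << width):
            raise ValueError(f"value {v} does not fit in {width} unsigned bits")
    key = 0
    for bit in range(width - 1, -1, -1):
        for v in values:
            key = (key << 1) | ((v >> bit) & 1)
    return key
```

With `width=2`, `interleave_bits([0b11, 0b00], width=2)` produces `0b1010` and `interleave_bits([0b00, 0b11], width=2)` produces `0b0101`. Sorting rows by such a key places rows that are close in all ZORDER columns near each other, which is what the sort and write stages then rely on.
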
  • Data Skipping Stats

    Correctness coverage for Delta data-skipping metadata generated by native write and optimization paths.

    Scope:

    • verify native Delta writes preserve min/max/nullCount stats
    • verify stats behavior with partition columns
    • verify stats behavior with delta.dataSkippingNumIndexedCols
    • verify stats remain usable after native writes, optimized writes, and OPTIMIZE

    Expected coverage:

    • stats in Delta add-file JSON
    • partitioned and non-partitioned tables
    • columns inside and outside the indexed stats range
    • queries that rely on data skipping after native writes
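
A sketch of how the "stats in Delta add-file JSON" check might be expressed, assuming the `stats` field of an add action is a JSON string with `numRecords`, `minValues`, `maxValues`, and `nullCount` keys. The helpers `missing_stats` and `may_contain` are hypothetical names for illustration, and `may_contain` models only a simple equality predicate.

```python
import json


def parse_stats(add_action):
    """Decode the JSON-encoded stats string of an add action, or None if absent."""
    raw = add_action.get("stats")
    return json.loads(raw) if raw else None


def missing_stats(add_action, indexed_columns):
    """Return the indexed columns that lack nullCount or min/max stats."""
    stats = parse_stats(add_action)
    if stats is None:
        return list(indexed_columns)
    missing = []
    for col in indexed_columns:
        null_count = stats.get("nullCount", {}).get(col)
        if null_count is None:
            missing.append(col)
            continue
        # Assumption: an all-null column legitimately has no min/max values.
        all_null = null_count == stats.get("numRecords")
        has_range = (
            col in stats.get("minValues", {}) and col in stats.get("maxValues", {})
        )
        if not all_null and not has_range:
            missing.append(col)
    return missing


def may_contain(add_action, column, value):
    """Return False only when min/max stats prove the file cannot match column == value."""
    stats = parse_stats(add_action)
    if stats is None:
        return True
    lo = stats.get("minValues", {}).get(column)
    hi = stats.get("maxValues", {}).get(column)
    if lo is None or hi is None:
        return True  # no usable stats, so the file cannot be skipped
    return lo <= value <= hi
```

Columns outside the indexed stats range would be expected to appear in the `missing_stats` result, while columns inside it should not.
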
  • Auto Compaction

    Native behavior and correctness coverage for Delta auto compaction after successful writes.

    Scope:

    • investigate whether post-commit auto compaction runs through native write paths
    • cover table property delta.autoOptimize.autoCompact
    • cover session config spark.databricks.delta.autoCompact.enabled
    • validate partition selection, file stats, and commit metadata after auto compaction

    Expected coverage:

    • auto compaction on non-partitioned tables
    • auto compaction on partitioned tables
    • minimum-file threshold behavior
    • native/fallback diagnostics for post-commit compaction work
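
To make the minimum-file threshold case concrete, here is a simplified model of partition selection for a test oracle. It is not Delta's actual selection logic: the parameter names `min_num_files` and `small_file_bytes` are hypothetical, and the rule (a partition qualifies once it holds at least `min_num_files` small files) is an assumption for illustration.

```python
def partitions_to_compact(files, min_num_files, small_file_bytes):
    """Return partition keys holding at least min_num_files small files.

    files: iterable of dicts shaped like Delta add actions, with
    'partitionValues' (dict) and 'size' (bytes). A non-partitioned table
    uses an empty partitionValues dict and therefore the key ().
    """
    counts = {}
    for f in files:
        if f["size"] < small_file_bytes:
            key = tuple(sorted(f["partitionValues"].items()))
            counts[key] = counts.get(key, 0) + 1
    # Keep first-seen order so results are deterministic for assertions.
    return [key for key, n in counts.items() if n >= min_num_files]
```

A test can build the expected partition list with this helper and compare it against the partitions touched by the post-commit compaction in the Delta log.
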
  • Delta Checkpoints And Log Compaction

    Evaluate whether there is meaningful Gluten/Velox execution work in Delta checkpoint and log compaction paths.

    Scope:

    • evaluate Delta multi-part checkpoint write paths
    • evaluate Delta log compaction paths
    • only open implementation PRs if there is execution work beyond Delta log metadata handling

    Expected coverage:

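    • a written investigation result stating whether checkpoint or log compaction paths contain execution work that Gluten/Velox can offload
    • follow-up issue or PR only if native execution can add value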
  • Performance And Diagnostics

    Benchmark and explain remaining overhead after native execution improvements.

    Scope:

    • profile stage time for native execution versus Delta planning/log/listing/commit overhead
    • benchmark non-partitioned Delta writes
    • benchmark partitioned Delta writes
    • benchmark Delta optimized writes
    • benchmark Delta OPTIMIZE compaction
    • benchmark Delta OPTIMIZE ZORDER after native ZORDER expression support lands
    • use larger Delta datasets where write volume dominates fixed planning and commit overhead

    Expected coverage:

    • before/after numbers for each feature track
    • stage-level breakdown when native speedup is hidden by fixed overhead
    • clear fallback diagnostics for unsupported pieces
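
The point about native speedup being hidden by fixed overhead can be made concrete with a small calculation. This sketch assumes a job splits cleanly into execution time that native offload accelerates and fixed time (planning, log, listing, commit) that it does not; the function name and the numbers in the note below are illustrative, not measurements.

```python
def end_to_end_speedup(exec_seconds, fixed_seconds, native_speedup):
    """End-to-end speedup when only the execution part gets faster."""
    if exec_seconds < 0 or fixed_seconds < 0 or native_speedup <= 0:
        raise ValueError("times must be non-negative and speedup positive")
    before = exec_seconds + fixed_seconds
    if before == 0:
        raise ValueError("total time must be positive")
    after = exec_seconds / native_speedup + fixed_seconds
    return before / after
```

For example, with 10 s of execution, 30 s of fixed overhead, and a 2x native speedup, the job goes from 40 s to 35 s, roughly 1.14x end to end. With 100 s of execution and the same overhead it goes from 130 s to 80 s, about 1.63x, which is why larger datasets give a clearer signal.
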

Boundaries

  • Keep each PR reviewable and focused
  • Prefer correctness tests before benchmark-only changes
  • Split command offload, native expression support, metadata correctness, and benchmark work into separate patches where practical
