[VL][DELTA] Support reading Delta Lake Change Data Feed (CDF) without falling back to vanilla Spark #12195

@felipepessoto

Description


Gluten currently does not offload reads of a Delta table's Change Data Feed, whether issued through spark.read.format("delta").option("readChangeFeed", "true")... or through the table_changes() SQL function. These queries run entirely on vanilla Spark instead of the Velox backend.

Why it falls back today

A normal Delta scan is a FileSourceScanExec whose relation.fileFormat is a DeltaParquetFileFormat. Gluten's OffloadDeltaScan only matches that exact case and rewrites it into a DeltaScanTransformer:

case scan: FileSourceScanExec
    if scan.relation.fileFormat.getClass == classOf[DeltaParquetFileFormat] =>
  DeltaScanTransformer(scan)

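CDF reads do not produce that plan. Delta builds them through CDCReader.DeltaCDFRelation, a generic BaseRelation whose buildScan returns RDD[Row], so Spark plans the read as a row-based data source scan (RowDataSourceScanExec) rather than a FileSourceScanExec.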

Because the resulting plan is not a FileSourceScanExec over DeltaParquetFileFormat, OffloadDeltaScan never matches it, so the entire query (scan + projections building the metadata columns) stays on vanilla Spark.
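The mismatch can be modeled in a few lines of plain Scala. This is only a sketch: FileScan, RowScan, OffloadedDeltaScan and OffloadRuleModel are hypothetical stand-ins for FileSourceScanExec, the row-based scan that a BaseRelation such as DeltaCDFRelation produces, DeltaScanTransformer and OffloadDeltaScan. They are not the real Spark, Delta or Gluten classes.

```scala
// Stand-in types only: simplified models, NOT the real Spark/Delta/Gluten classes.
sealed trait Format
case object DeltaParquet extends Format // models DeltaParquetFileFormat
case object OtherFormat extends Format  // models any other file format

sealed trait Plan
// Models a FileSourceScanExec and the file format of its relation.
final case class FileScan(format: Format) extends Plan
// Models the row-based scan planned for a BaseRelation (the CDF case).
final case class RowScan(relationName: String) extends Plan
// Models the DeltaScanTransformer that wraps an offloaded scan.
final case class OffloadedDeltaScan(scan: FileScan) extends Plan

object OffloadRuleModel {
  // Same shape as the rule quoted above: exactly one plan shape is rewritten.
  def offload(plan: Plan): Plan = plan match {
    case scan: FileScan if scan.format == DeltaParquet =>
      OffloadedDeltaScan(scan)
    case other =>
      other // no match, so the node is left as-is and runs on vanilla Spark
  }
}
```

In this model, OffloadRuleModel.offload(RowScan("DeltaCDFRelation")) returns its input unchanged, which is the fallback described above: the rule has no case that could ever see the parquet files behind the CDF relation.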

Proposed work

  • Recognize the CDF scan path (DeltaCDFRelation / the CDC file indexes) and offload the underlying parquet reads to Velox.
  • Materialize the synthesized _change_type / _commit_version / _commit_timestamp columns (literals + projections) so they can be produced natively rather than forcing a fallback.
  • Add gluten-ut coverage for batch CDF reads (readChangeFeed and table_changes()), including add/remove/cdc-file combinations and column mapping.
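One possible way to think about the second item is as a per-file set of constant columns that a native scan could append as literals. The sketch below is plain Scala with hypothetical names (CdfFileKind, CommitInfo, CdfMetadataColumns) and is not Delta or Gluten code. It assumes that rows read from add files are reported as "insert", rows read from remove files as "delete", and that CDC files already store _change_type in their data. These assumptions should be checked against the Delta version being targeted.

```scala
// Hypothetical model, not Delta or Gluten code.
sealed trait CdfFileKind
case object AddFileKind extends CdfFileKind    // data file added by the commit
case object RemoveFileKind extends CdfFileKind // data file removed by the commit
case object CdcFileKind extends CdfFileKind    // dedicated change-data file

// Commit-level values shared by every row read from the commit's files.
final case class CommitInfo(version: Long, timestampMillis: Long)

object CdfMetadataColumns {
  // Column names as listed in this issue.
  val ChangeType = "_change_type"
  val CommitVersion = "_commit_version"
  val CommitTimestamp = "_commit_timestamp"

  // Constant columns to append to every row of a file of the given kind.
  def constantsFor(kind: CdfFileKind, commit: CommitInfo): Map[String, Any] = {
    val perCommit = Map[String, Any](
      CommitVersion -> commit.version,
      CommitTimestamp -> commit.timestampMillis
    )
    kind match {
      // Assumption: every row of an added file counts as an insert.
      case AddFileKind => perCommit + (ChangeType -> "insert")
      // Assumption: every row of a removed file counts as a delete.
      case RemoveFileKind => perCommit + (ChangeType -> "delete")
      // Assumption: the change type is a real column inside CDC files.
      case CdcFileKind => perCommit
    }
  }
}
```

For example, CdfMetadataColumns.constantsFor(RemoveFileKind, CommitInfo(3L, 1000L)) yields the three metadata columns with _change_type set to "delete". If the values really are constant per file, they fit the literal-plus-projection approach proposed above without a row-based fallback.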

Gluten version

main branch

Labels: enhancement (New feature or request)
