Description
Gluten currently does not offload reads of Delta tables' Change Data Feed (spark.read.format("delta").option("readChangeFeed", "true")... or the table_changes() SQL function). These queries run entirely on vanilla Spark instead of the Velox backend.
Why it falls back today
A normal Delta scan is a FileSourceScanExec whose relation.fileFormat is a DeltaParquetFileFormat. Gluten's OffloadDeltaScan only matches that exact case and rewrites it into a DeltaScanTransformer:
    case scan: FileSourceScanExec
        if scan.relation.fileFormat.getClass == classOf[DeltaParquetFileFormat] =>
      DeltaScanTransformer(scan)
CDF reads do not produce that plan. Delta builds them through CDCReader.DeltaCDFRelation, a generic BaseRelation whose buildScan returns RDD[Row].
Because the resulting plan is not a FileSourceScanExec over DeltaParquetFileFormat, OffloadDeltaScan never matches it, so the entire query (scan + projections building the metadata columns) stays on vanilla Spark.
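To make the miss concrete, here is a minimal, self-contained sketch of the matching logic. Every `Toy*` type is a hypothetical stand-in and not a real Spark, Delta, or Gluten class; `ToyRowScan` represents whatever physical node Spark plans for a row-based BaseRelation. It only illustrates why an exact-class match on the file format cannot see a row-based relation.

```scala
// Toy stand-ins for the real classes; all names here are illustrative only.
sealed trait ToyFileFormat
class ToyParquetFormat extends ToyFileFormat
class ToyDeltaParquetFormat extends ToyParquetFormat

sealed trait ToyPlan
case class ToyFileSourceScan(fileFormat: ToyFileFormat) extends ToyPlan
// Stands in for the node planned for a BaseRelation whose buildScan returns RDD[Row].
case class ToyRowScan(relationName: String) extends ToyPlan
case class ToyDeltaScanTransformer(scan: ToyFileSourceScan) extends ToyPlan

object ToyOffloadDeltaScan {
  // Same shape as the rule quoted above: an exact class match on the file format.
  def offload(plan: ToyPlan): ToyPlan = plan match {
    case scan: ToyFileSourceScan
        if scan.fileFormat.getClass == classOf[ToyDeltaParquetFormat] =>
      ToyDeltaScanTransformer(scan)
    case other =>
      other // no match: the node is left untouched, i.e. it stays on vanilla Spark
  }
}
```

In this model a `ToyFileSourceScan(new ToyDeltaParquetFormat)` is rewritten, while `ToyRowScan("DeltaCDFRelation")` is returned unchanged, which corresponds to the fallback described above.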
Proposed work
- Recognize the CDF scan path (DeltaCDFRelation / the CDC file indexes) and offload the underlying parquet reads to Velox.
- Materialize the synthesized _change_type / _commit_version / _commit_timestamp columns (literals + projections) so they can be produced natively rather than forcing a fallback.
- Add gluten-ut coverage for batch CDF reads (readChangeFeed and table_changes()), including add/remove/cdc-file combinations and column mapping.
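As a rough illustration of the second item, the sketch below models where each metadata column could come from for each kind of file. It assumes (my understanding of Delta's CDF behavior, not verified against the Delta source here) that add files map to the literal "insert", remove files map to the literal "delete", and cdc files carry _change_type as a stored column. All type and object names (`CdfFileKind`, `ColumnSource`, `CdfMetadataColumns`, and so on) are hypothetical and do not exist in Gluten or Delta.

```scala
// Hypothetical model; none of these types exist in Gluten or Delta.
sealed trait CdfFileKind
case object AddFile extends CdfFileKind
case object RemoveFile extends CdfFileKind
case object CdcFile extends CdfFileKind

final case class CommitInfo(version: Long, timestampMillis: Long)

// Where a column value comes from: a per-file constant or a column stored in the file.
sealed trait ColumnSource
final case class LiteralValue(value: Any) extends ColumnSource
final case class FromFile(columnName: String) extends ColumnSource

object CdfMetadataColumns {
  // Assumed mapping: add -> "insert", remove -> "delete", cdc -> read from the file.
  def sources(kind: CdfFileKind, commit: CommitInfo): Map[String, ColumnSource] = {
    val changeType: ColumnSource = kind match {
      case AddFile    => LiteralValue("insert")
      case RemoveFile => LiteralValue("delete")
      case CdcFile    => FromFile("_change_type")
    }
    Map(
      "_change_type"      -> changeType,
      "_commit_version"   -> LiteralValue(commit.version),
      "_commit_timestamp" -> LiteralValue(commit.timestampMillis)
    )
  }
}
```

If the mapping holds, most of the metadata columns reduce to per-file constants, which is what would let a native scan append them as literals instead of falling back for the projection.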
Gluten version
main branch