[VL][Delta] Add DV scan info extraction utility#12197

Draft
malinjawi wants to merge 1 commit into apache:main from malinjawi:split/delta-dv-scan-info-utils-pr

What changes are proposed in this pull request?

This PR is the next split from the Delta deletion-vector (DV) scan stack, following the native reader support already merged in #12040 and before the full JVM scan handoff work from #12131.

It adds a focused Scala utility layer that extracts the essential DV scan information from Spark/Delta PartitionedFile metadata without changing scan offload behavior yet.

Main changes:

  • add DeltaDeletionVectorScanInfo for Delta 3.3 and Delta 4.0 source sets
  • extract per-file DV scan info from PartitionedFile metadata:
    • row-index filter type
    • deletion-vector descriptor and cardinality
    • serialized DV bitmap payload bytes
    • normalized non-DV metadata columns
  • keep the utility independent from Substrait, Velox native split conversion, and scan offload behavior
  • add focused Delta 3.3 and Delta 4.0 tests for DV extraction, keep-all/no-DV extraction, and invalid partial DV metadata
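
To make the extraction contract above easier to review, here is a minimal, dependency-free sketch of the idea. It is not the code in this PR: `DvScanInfo`, `DvScanInfoSketch`, the `sketch_*` metadata keys, and the `"keep_all"` / `"drop_marked"` values are all hypothetical names chosen for illustration, and a plain `Map[String, Any]` stands in for the metadata carried by a `PartitionedFile`. The sketch assumes three behaviors matching the tests listed above: no DV keys means keep-all, a complete set of DV keys yields full scan info, and a partial set is rejected.

```scala
// Illustrative model only; the real utility is DeltaDeletionVectorScanInfo.
sealed trait RowIndexFilterType
object RowIndexFilterType {
  case object KeepAll extends RowIndexFilterType    // no DV: every row is kept
  case object DropMarked extends RowIndexFilterType // rows listed in the DV are dropped
}

// Note: payload is an Array, so case-class equality compares it by reference.
final case class DvScanInfo(
    filterType: RowIndexFilterType,
    descriptor: Option[String],
    cardinality: Long,
    payload: Array[Byte],
    otherMetadata: Map[String, Any])

object DvScanInfoSketch {
  // Hypothetical metadata keys (assumptions, not Delta's real constants).
  val FilterTypeKey = "sketch_row_index_filter_type"
  val DescriptorKey = "sketch_dv_descriptor"
  val CardinalityKey = "sketch_dv_cardinality"
  val PayloadKey = "sketch_dv_payload"
  private val DvKeys = Set(FilterTypeKey, DescriptorKey, CardinalityKey, PayloadKey)

  private def parseFilterType(v: Any): Option[RowIndexFilterType] = v match {
    case "keep_all"    => Some(RowIndexFilterType.KeepAll)
    case "drop_marked" => Some(RowIndexFilterType.DropMarked)
    case _             => None
  }

  def extract(metadata: Map[String, Any]): Either[String, DvScanInfo] = {
    // Everything that is not a DV key is passed through as "other" metadata.
    val other = metadata.filterNot { case (k, _) => DvKeys.contains(k) }
    val present = DvKeys.filter(metadata.contains)
    if (present.isEmpty) {
      // No DV attached to this file: keep all rows.
      Right(DvScanInfo(RowIndexFilterType.KeepAll, None, 0L, Array.emptyByteArray, other))
    } else if (present != DvKeys) {
      // Partial DV metadata is treated as invalid rather than silently ignored.
      val missing = (DvKeys -- present).toSeq.sorted.mkString(", ")
      Left(s"partial DV metadata, missing: $missing")
    } else {
      (parseFilterType(metadata(FilterTypeKey)),
        metadata(DescriptorKey),
        metadata(CardinalityKey),
        metadata(PayloadKey)) match {
        case (Some(ft), d: String, c: Long, p: Array[Byte]) =>
          Right(DvScanInfo(ft, Some(d), c, p, other))
        case _ =>
          Left("DV metadata has unexpected values or value types")
      }
    }
  }
}
```

Returning `Either` here is a design choice of the sketch; the actual utility may signal invalid partial metadata differently, for example by throwing.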

This PR is intentionally utility-only:

  • no Substrait proto changes
  • no native/C++ changes
  • no Delta scan rule replacement
  • no end-to-end scan offload behavior change yet

Those pieces stay in follow-up PRs after this API is reviewed.

How was this patch tested?

Validation run:

  • JAVA_HOME=$(/usr/libexec/java_home -v 17) ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • JAVA_HOME=$(/usr/libexec/java_home -v 17) ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • git diff --check

I also tried to run the focused suite with dev/run-scala-test.sh, but the local runner failed during classpath compilation while switching profiles, before the suite could execute. As a result, the new tests have been compiled but not yet run locally; the module-level Spark 3.5 and Spark 4.0 test-compile checks above pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB
