[SPARK-53805][SQL] Push Variant into DSv2 scan #52522

huaxingao · 2025-10-06T03:42:56Z

What changes were proposed in this pull request?

Push Variant into DSv2 scan

Why are the changes needed?

With this change, a DSv2 scan only needs to fetch the shredded columns that the plan actually requires.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun

Thank you so much, @huaxingao .

cc @chenhao-db and @cloud-fan from SPARK-53805 .

#49235

dongjoon-hyun

+1, LGTM from my side.

dongjoon-hyun · 2025-10-07T21:15:02Z

cc @peter-toth , too.

singhpk234 · 2025-10-07T21:52:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PushVariantIntoScan.scala

      hadoopFsRelation@HadoopFsRelation(_, _, _, _, _: ParquetFileFormat, _), _)) =>
        rewritePlan(p, projectList, filters, relation, hadoopFsRelation)
+      case p@PhysicalOperation(projectList, filters, relation: DataSourceV2Relation) =>
+        rewriteV2RelationPlan(p, projectList, filters, relation.output, relation)


If we are already passing the relation, do we need to pass relation.output separately?

I overlooked this. Removed.

singhpk234 · 2025-10-07T22:35:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala

      SchemaPruning,
      GroupBasedRowLevelOperationScanPlanning,
      V1Writes,
+      PushVariantIntoScan,


PushVariantIntoScan now runs before PruneFileSourcePartition, which I think was for v1 sources. Does this matter? Or, to ask it another way, was it placed later before simply because it was a new rule?

I don't think variant columns will ever be used in the partition schema. Schema transformations by PushVariantIntoScan shouldn't affect partition pruning in v1 sources.

cloud-fan · 2025-10-09T03:40:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PushVariantIntoScan.scala

      relation @ LogicalRelationWithTable(
      hadoopFsRelation@HadoopFsRelation(_, _, _, _, _: ParquetFileFormat, _), _)) =>
        rewritePlan(p, projectList, filters, relation, hadoopFsRelation)
+      case p@PhysicalOperation(projectList, filters, relation: DataSourceV2Relation) =>


Is there any code we can share between the v1 rewritePlan and the v2 rewriteV2RelationPlan?

Yes, there’s shared logic. I intentionally left the v1 rewritePlan unchanged in this PR to keep the diff small and easier to review. After this merges, I’ll do a small follow-up to have v1 rewritePlan reuse the common code. If you prefer, I can fold that refactor into this PR.

It's actually harder to review, as I can't tell what the key difference is between the v1 and v2 versions with the current PR...

Sorry for the confusion. I have updated the code.

The logic for transforming variant columns to struct is identical between DSv1 and DSv2. Now they both use the same helper methods (collectAndRewriteVariants, buildAttributeMap, buildFilterAndProject).

The only difference is how the transformed schema is communicated to the data source. DSv1 stores the new schema in HadoopFsRelation.dataSchema and the file source reads this field directly; DSv2 has no schema field to update. The schema is communicated later when V2ScanRelationPushDown calls pruneColumns.
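To make the distinction above concrete, here is a minimal toy model in plain Scala. It does not use Spark and is not the code in this PR: `FieldType`, `Schema`, `ToyV1Relation`, `ToyV2ScanBuilder`, and `rewriteVariants` are all hypothetical names invented for illustration, and the struct field name `a` is arbitrary. It only sketches the idea that the variant-to-struct rewrite is shared, while the way the rewritten schema reaches the source differs.

```scala
// Toy model only: every type below is a hypothetical stand-in, not a Spark class.
sealed trait FieldType
case object IntType extends FieldType
case object VariantType extends FieldType
final case class StructOf(fields: Seq[(String, FieldType)]) extends FieldType

final case class Schema(fields: Seq[(String, FieldType)])

// v1 style: the relation owns its data schema, so the rewrite produces a copy
// of the relation that carries the new schema.
final case class ToyV1Relation(dataSchema: Schema)

// v2 style: the relation has no schema field to update. The scan builder is
// told the required schema later, through a pruneColumns-like callback.
final class ToyV2ScanBuilder {
  private var required: Option[Schema] = None
  def pruneColumns(requiredSchema: Schema): Unit = { required = Some(requiredSchema) }
  def requiredSchema: Option[Schema] = required
}

object VariantPushdownSketch {
  // Shared step: swap each requested variant column for a struct that holds
  // only the requested fields. Other columns pass through unchanged.
  def rewriteVariants(
      schema: Schema,
      requested: Map[String, Seq[(String, FieldType)]]): Schema =
    Schema(schema.fields.map {
      case (name, VariantType) if requested.contains(name) =>
        (name, StructOf(requested(name)): FieldType)
      case other => other
    })

  def main(args: Array[String]): Unit = {
    val original = Schema(Seq("id" -> IntType, "v" -> VariantType))
    val rewritten = rewriteVariants(original, Map("v" -> Seq("a" -> IntType)))

    // v1: the rewritten schema is stored on the relation itself.
    val v1 = ToyV1Relation(original).copy(dataSchema = rewritten)

    // v2: the rewritten schema is handed over in a separate, later step.
    val builder = new ToyV2ScanBuilder
    builder.pruneColumns(rewritten)

    println(v1.dataSchema == rewritten)
    println(builder.requiredSchema.contains(rewritten))
  }
}
```

In this sketch the same `rewriteVariants` result feeds both paths, which mirrors the claim that only the hand-off differs between DSv1 and DSv2.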

[SPARK-53805][SQL] Push Variant into DSv2 scan

cd8e0d7

github-actions bot added the SQL label Oct 6, 2025

add new line at end of file

c8b9df5

dongjoon-hyun reviewed Oct 7, 2025

huaxingao mentioned this pull request Oct 7, 2025

Spark 4.0: Add variant round trip test for Spark apache/iceberg#14276

Open

dongjoon-hyun approved these changes Oct 7, 2025

singhpk234 reviewed Oct 7, 2025

address comments

2092fce

cloud-fan reviewed Oct 9, 2025

reuse common code

9bb25cc
