[SPARK-33482][SQL] Fix FileScan canonicalization #31820

peter-toth · 2021-03-12T18:44:51Z

What changes were proposed in this pull request?

This PR adds canonicalization to FileScan.partitionFilters and FileScan.dataFilters in BatchScanExec nodes.

Why are the changes needed?

Partition filters and data filters added to FileScan (in #27112 and #27157) cause the canonicalized forms of some BatchScanExec nodes not to match, which prevents some reuse possibilities.
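
To illustrate the kind of mismatch described here, below is a minimal toy sketch in plain Scala. It assumes the mismatch comes from filters that reference attributes by generated expression IDs, so two otherwise identical scans compare as different until the IDs are rewritten relative to the node's output. `Attr`, `GreaterThan` and `normalize` are hypothetical stand-ins for illustration only, not Spark's classes or its actual normalization code.

```scala
object CanonicalizationSketch {
  // Hypothetical, simplified stand-ins; these are NOT Spark's classes.
  final case class Attr(name: String, exprId: Long)
  final case class GreaterThan(attr: Attr, value: Int)

  // Rewrite each attribute's id to its position in `output`, so that two
  // filter lists differing only in generated ids compare equal afterwards.
  // An attribute that is not found in `output` gets the ordinal -1.
  def normalize(filters: Seq[GreaterThan], output: Seq[Attr]): Seq[GreaterThan] =
    filters.map { f =>
      val ordinal = output.indexWhere(_.exprId == f.attr.exprId)
      f.copy(attr = f.attr.copy(exprId = ordinal.toLong))
    }
}
```

With this model, a filter over `Attr("id", 10L)` and one over `Attr("id", 42L)` are unequal as written, but become equal once each is normalized against its own output.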

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT.

peter-toth · 2021-03-12T19:05:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala

@@ -86,7 +86,7 @@ trait FileScan extends Scan

  override def equals(obj: Any): Boolean = obj match {
    case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&


This change is not related to this PR, but it looks like an && is missing here.

Nice catch. It seems that #27112 introduced this.
cc @dongjoon-hyun if we want to backport this to 3.1 and 3.0.

Oh... nice catch. cc: @gengliangwang , too.

Credit goes to @bersprockets for catching this in SPARK-33482.
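
For readers wondering why a missing && matters: in Scala, a line that ends with a complete expression is treated as its own statement, so the comparison on that line is evaluated and then discarded, and only the last expression in the block becomes the result. The following is a toy sketch of that effect. `ToyScan`, `buggyEquals` and `fixedEquals` are hypothetical names for illustration and do not reproduce the real FileScan code.

```scala
object MissingAndSketch {
  // Hypothetical stand-in for a scan; this is NOT Spark's FileScan.
  final case class ToyScan(fileIndex: String, readSchema: String, filters: Set[String])

  // Buggy shape: there is no `&&` at the end of the first line, so it is a
  // separate statement whose Boolean result is thrown away. Only the
  // filter comparison decides the outcome.
  def buggyEquals(a: ToyScan, b: ToyScan): Boolean = {
    a.fileIndex == b.fileIndex && a.readSchema == b.readSchema
    a.filters == b.filters
  }

  // Fixed shape: the trailing `&&` joins both lines into one expression.
  def fixedEquals(a: ToyScan, b: ToyScan): Boolean = {
    a.fileIndex == b.fileIndex && a.readSchema == b.readSchema &&
      a.filters == b.filters
  }
}
```

In this model, two scans over different files and schemas but with the same filters are reported as equal by `buggyEquals` and as different by `fixedEquals`.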

SparkQA · 2021-03-12T19:27:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40602/

SparkQA · 2021-03-12T19:31:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40602/

c21 · 2021-03-12T19:34:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala

@@ -86,7 +86,7 @@ trait FileScan extends Scan

  override def equals(obj: Any): Boolean = obj match {
    case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&


Nice catch. It seems that #27112 introduced this.
cc @dongjoon-hyun if we want to backport this to 3.1 and 3.0.

c21 · 2021-03-12T19:34:37Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+            df.collect()
+            df.explain()


do we really need these two statements?

No. I left explain() there accidentally; removed in c0fb9b2.

maropu · 2021-03-13T06:20:07Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+  test("SPARK-33482: Fix FileScan canonicalization") {
+    Seq(true, false).foreach { aqe =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
+        SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {


Do we need to set this config for the purpose of this test?

Well, the reuse exchange code is very different in the AQE and non-AQE paths, but I think you are right: we just need to test the canonicalization fix, so I've removed this in c0fb9b2.

SparkQA · 2021-03-13T12:38:54Z

Test build #136026 has finished for PR 31820 at commit c0fb9b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2021-03-14T05:01:48Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+              |JOIN t AS t2 ON t2.id = t1.id
+              |JOIN t AS t3 ON t3.id = t2.id
+              |""".stripMargin)
+          df.collect()


do we really need df.collect() here? shouldn't AdaptiveSparkPlanHelper.collect() below take care of going through query plan properly with AQE being enabled?

Yes, we do need to run the query first and then check the plan, as this is an AQE-compatible query where ReusedExchangeExec nodes are inserted during execution.

c21

LGTM with minor test comment.

dongjoon-hyun

This is a nice patch, @peter-toth. For better traceability, please make the FileScan change in a separate PR. It looks worth having a new JIRA because it looks like a correctness issue. And, if possible, add a separate test case focused on FileScan.equals.

dongjoon-hyun · 2021-03-15T20:02:30Z

Also, cc @cloud-fan .

peter-toth · 2021-03-16T08:33:19Z

This is a nice patch, @peter-toth. For better traceability, please make the FileScan change in a separate PR. It looks worth having a new JIRA because it looks like a correctness issue. And, if possible, add a separate test case focused on FileScan.equals.

All right, reverted in 483686d and extracted to #31848

cloud-fan · 2021-03-16T09:12:57Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala

+      case s: FileScan =>
+        s.withFilters(
+          QueryPlan.normalizePredicates(s.partitionFilters, output),
+          QueryPlan.normalizePredicates(s.dataFilters, output))


This works, but is a bit hacky as it doesn't apply to all the Scan implementations.

I think we should add doc in the Scan interface to explain how the hashCode/equals should be implemented.
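
As a rough illustration of what such documentation might ask of an implementation, here is a toy sketch of a scan-like class whose equals and hashCode are derived from the same key, with filters compared as a set so that their order does not matter. `ToyFileScan` and its fields are hypothetical, and the choice to ignore filter order is an assumption made for this example, not a statement of what the Scan interface requires.

```scala
object ScanEqualitySketch {
  // Hypothetical scan-like class; this is NOT Spark's Scan or FileScan.
  final class ToyFileScan(
      val path: String,
      val schema: Seq[String],
      val filters: Seq[String]) {

    // Single source of truth for both equals and hashCode. Filters are
    // compared as a set, so their order is ignored (an assumption here).
    private def key: (String, Seq[String], Set[String]) =
      (path, schema, filters.toSet)

    override def equals(obj: Any): Boolean = obj match {
      case that: ToyFileScan => key == that.key
      case _ => false
    }

    // Equal objects must produce equal hash codes, so hash the same key.
    override def hashCode(): Int = key.hashCode()
  }
}
```

Deriving both methods from one key keeps them consistent, which matters whenever such objects are looked up in hash-based collections.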

SparkQA · 2021-03-16T09:33:41Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40687/

SparkQA · 2021-03-16T09:42:42Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40687/

peter-toth · 2021-03-18T18:03:30Z

Reviewers, I've moved the e2e test to #31848 and fixed the issue there by changing FileScan.equals().

maropu · 2021-03-19T01:41:11Z

Can we close this now?

[SPARK-33482][SQL] Fix FileScan canonicalization

99c1482

github-actions bot added the SQL label Mar 12, 2021

peter-toth commented Mar 12, 2021

c21 reviewed Mar 12, 2021

maropu reviewed Mar 13, 2021

address comments

c0fb9b2

maropu approved these changes Mar 13, 2021

HyukjinKwon reviewed Mar 14, 2021

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala (outdated)

HyukjinKwon reviewed Mar 14, 2021

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala (outdated)

HyukjinKwon approved these changes Mar 14, 2021

c21 reviewed Mar 14, 2021

c21 approved these changes Mar 14, 2021

peter-toth and others added 2 commits March 14, 2021 08:32

Update sql/core/src/main/scala/org/apache/spark/sql/execution/datasou…

20624d1

…rces/v2/BatchScanExec.scala Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

Update sql/core/src/main/scala/org/apache/spark/sql/execution/datasou…

52cc2ca

…rces/v2/BatchScanExec.scala Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

dongjoon-hyun reviewed Mar 15, 2021

revert FileScan.equals change from this PR

483686d

peter-toth mentioned this pull request Mar 16, 2021

[SPARK-33482][SPARK-34756][SQL] Fix FileScan equality check #31848

Closed

cloud-fan reviewed Mar 16, 2021

peter-toth closed this Mar 19, 2021
