Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33482][SQL] Fix FileScan canonicalization #31820

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Mar 12, 2021

What changes were proposed in this pull request?

This PR adds canonicalization to FileScan.partitionFilters and FileScan.dataFilters in BatchScanExec nodes.

Why are the changes needed?

Partition filters and data filters added to FileScan (in #27112 and #27157) caused that canonicalized form of some BatchScanExec nodes don't match and this prevents some reuse possibilities.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT.

@github-actions github-actions bot added the SQL label Mar 12, 2021
@@ -86,7 +86,7 @@ trait FileScan extends Scan

override def equals(obj: Any): Boolean = obj match {
case f: FileScan =>
fileIndex == f.fileIndex && readSchema == f.readSchema
fileIndex == f.fileIndex && readSchema == f.readSchema &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not related to this PR, but it looks like a && is missing from here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. It seems that #27112 introduced this.
cc @dongjoon-hyun if we want to backport this to 3.1 and 3.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... nice catch. cc: @gengliangwang , too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Credit goes to @bersprockets for catching this in SPARK-33482.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow ..

@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40602/

@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40602/

@@ -86,7 +86,7 @@ trait FileScan extends Scan

override def equals(obj: Any): Boolean = obj match {
case f: FileScan =>
fileIndex == f.fileIndex && readSchema == f.readSchema
fileIndex == f.fileIndex && readSchema == f.readSchema &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. It seems that #27112 introduced this.
cc @dongjoon-hyun if we want to backport this to 3.1 and 3.0.

Comment on lines 4085 to 4086
df.collect()
df.explain()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need these two statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I left explain() there accidentally, removed in: c0fb9b2

test("SPARK-33482: Fix FileScan canonicalization") {
Seq(true, false).foreach { aqe =>
withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to set this config for this test purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, reuse exchange code is very different in AQE/non-AQE paths, but I think you are right as we just need to test the canonicalization fix so I've removed this in c0fb9b2

@SparkQA
Copy link

SparkQA commented Mar 13, 2021

Test build #136026 has finished for PR 31820 at commit c0fb9b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|JOIN t AS t2 ON t2.id = t1.id
|JOIN t AS t3 ON t3.id = t2.id
|""".stripMargin)
df.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need df.collect() here? shouldn't AdaptiveSparkPlanHelper.collect() below take care of going through query plan properly with AQE being enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do need to run the query first and then check the plan as this is an AQE compatible query where ReusedExchangeExec nodes are inserted during execution.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor test comment.

peter-toth and others added 2 commits March 14, 2021 08:32
…rces/v2/BatchScanExec.scala

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
…rces/v2/BatchScanExec.scala

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice patch, @peter-toth . For a better traceability, please proceed FileScan change in a separate PR. It looks worth to have a new JIRA because it looks like a correctness issue. And, if possible, with a separate test case focus on FileScan.equals.

@dongjoon-hyun
Copy link
Member

Also, cc @cloud-fan .

@peter-toth
Copy link
Contributor Author

This is a nice patch, @peter-toth . For a better traceability, please proceed FileScan change in a separate PR. It looks worth to have a new JIRA because it looks like a correctness issue. And, if possible, with a separate test case focus on FileScan.equals.

All right, reverted in 483686d and extracted to #31848

case s: FileScan =>
s.withFilters(
QueryPlan.normalizePredicates(s.partitionFilters, output),
QueryPlan.normalizePredicates(s.dataFilters, output))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but is a bit hacky as it doesn't apply to all the Scan implementations.

I think we should add doc in the Scan interface to explain how the hashCode/equals should be implemented.

@SparkQA
Copy link

SparkQA commented Mar 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40687/

@SparkQA
Copy link

SparkQA commented Mar 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40687/

@peter-toth
Copy link
Contributor Author

Reviewers, I've moved the e2e test to #31848 and fixed the issue there by changing FileScan.equals().

@maropu
Copy link
Member

maropu commented Mar 19, 2021

We can close this now?

@peter-toth peter-toth closed this Mar 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants