[SPARK-33482][SQL] Fix FileScan canonicalization #31820
@@ -86,7 +86,7 @@ trait FileScan extends Scan

   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&
This change is not related to this PR, but it looks like a `&&` is missing here.
Nice catch. It seems that #27112 introduced this.
cc @dongjoon-hyun if we want to backport this to 3.1 and 3.0.
Oh... nice catch. cc: @gengliangwang , too.
Credit goes to @bersprockets for catching this in SPARK-33482.
wow ..
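For readers following this thread, here is a minimal sketch of why the missing `&&` matters. It assumes the original `equals` continued with a filter comparison on the next line, which the hunk above does not show; `BrokenScan`, `FixedScan`, and `MissingAndDemo` are hypothetical stand-ins, not Spark classes. Without the trailing `&&`, Scala ends the statement at the line break, so the first comparison is evaluated and discarded, and the compiler may only emit a warning.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark classes.
class BrokenScan(val fileIndex: String, val readSchema: String, val filters: Set[String]) {
  override def equals(obj: Any): Boolean = obj match {
    case f: BrokenScan =>
      // No trailing `&&`: the statement ends at the line break,
      // so this comparison is evaluated and then discarded.
      fileIndex == f.fileIndex && readSchema == f.readSchema
      // Only this last expression becomes the result of the branch.
      filters == f.filters
    case _ => false
  }
  override def hashCode(): Int = filters.hashCode()
}

class FixedScan(val fileIndex: String, val readSchema: String, val filters: Set[String]) {
  override def equals(obj: Any): Boolean = obj match {
    case f: FixedScan =>
      // The trailing `&&` joins both lines into a single expression.
      fileIndex == f.fileIndex && readSchema == f.readSchema &&
        filters == f.filters
    case _ => false
  }
  override def hashCode(): Int = (fileIndex, readSchema, filters).hashCode()
}

object MissingAndDemo {
  // Same filters, but different file index and schema.
  val brokenEqual: Boolean =
    new BrokenScan("a", "s1", Set("x > 1")) == new BrokenScan("b", "s2", Set("x > 1"))
  val fixedEqual: Boolean =
    new FixedScan("a", "s1", Set("x > 1")) == new FixedScan("b", "s2", Set("x > 1"))
}
```

In this sketch `MissingAndDemo.brokenEqual` is `true` even though the two scans read different files, while `MissingAndDemo.fixedEqual` is `false`.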
df.collect()
df.explain()
do we really need these two statements?
No. I left `explain()` there accidentally; removed in c0fb9b2.
test("SPARK-33482: Fix FileScan canonicalization") {
  Seq(true, false).foreach { aqe =>
    withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "",
      SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe.toString) {
Do we need to set this config for the purpose of this test?
Well, the reuse-exchange code is very different in the AQE and non-AQE paths, but I think you are right: we just need to test the canonicalization fix, so I've removed this in c0fb9b2.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala
|JOIN t AS t2 ON t2.id = t1.id
|JOIN t AS t3 ON t3.id = t2.id
|""".stripMargin)
df.collect()
Do we really need `df.collect()` here? Shouldn't `AdaptiveSparkPlanHelper.collect()` below take care of going through the query plan properly with AQE enabled?
Yes, we do need to run the query first and then check the plan, because this is an AQE-compatible query where `ReusedExchangeExec` nodes are inserted during execution.
LGTM with minor test comment.
…rces/v2/BatchScanExec.scala Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
This is a nice patch, @peter-toth. For better traceability, please make the `FileScan` change in a separate PR. It looks worth having a new JIRA because it looks like a correctness issue. And, if possible, add a separate test case focused on `FileScan.equals`.
Also, cc @cloud-fan.
case s: FileScan =>
  s.withFilters(
    QueryPlan.normalizePredicates(s.partitionFilters, output),
    QueryPlan.normalizePredicates(s.dataFilters, output))
This works, but is a bit hacky as it doesn't apply to all the `Scan` implementations.
I think we should add doc in the `Scan` interface to explain how the `hashCode/equals` should be implemented.
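As one possible reading of the `hashCode/equals` contract mentioned above, here is a minimal sketch. It is an assumption for illustration, not the documentation that was later added to the `Scan` interface; `ToyScan` and `ReuseSketch` are hypothetical names. The idea is that both methods derive from the same fields, and that filters compare as a set so their order does not matter.

```scala
// Hypothetical scan description; not Spark's Scan interface.
final class ToyScan(val path: String, val filters: Seq[String]) {
  // Compare filters as a set so that their order does not affect equality.
  private def key: (String, Set[String]) = (path, filters.toSet)

  override def equals(obj: Any): Boolean = obj match {
    case s: ToyScan => key == s.key
    case _ => false
  }

  // Derived from the same fields as equals, so equal scans hash alike.
  override def hashCode(): Int = key.hashCode()
}

object ReuseSketch {
  val a = new ToyScan("/data/t", Seq("x > 1", "y < 2"))
  val b = new ToyScan("/data/t", Seq("y < 2", "x > 1"))

  // A lookup keyed by scan finds the entry stored under an equal scan.
  val cache: Map[ToyScan, String] = Map(a -> "exchange-1")
  val reused: Option[String] = cache.get(b)
}
```

Here `ReuseSketch.reused` is `Some("exchange-1")`, because `a` and `b` differ only in filter order.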
Reviewers, I've moved the e2e test to #31848 and fixed the issue there by changing
Can we close this now?
What changes were proposed in this pull request?
This PR adds canonicalization to `FileScan.partitionFilters` and `FileScan.dataFilters` in `BatchScanExec` nodes.
Why are the changes needed?
Partition filters and data filters added to `FileScan` (in #27112 and #27157) cause the canonicalized forms of some `BatchScanExec` nodes not to match, which prevents some reuse opportunities.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new UT.
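
The canonicalization described in this PR can be illustrated with a simplified, self-contained model. This is a sketch under stated assumptions, not Spark's implementation: `Attr`, `GreaterThan`, and `normalize` are hypothetical names, and the model only assumes that normalization replaces each attribute's per-instance expression ID with its position in the scan's output.

```scala
// Hypothetical, simplified model; the names below are not Spark's API.
final case class Attr(name: String, exprId: Long)
final case class GreaterThan(attr: Attr, value: Int)

object CanonicalizationSketch {
  // Replace each attribute's per-instance exprId with its position in
  // `output` and drop the name, so only the structure remains.
  def normalize(filters: Seq[GreaterThan], output: Seq[Attr]): Seq[GreaterThan] =
    filters.map { f =>
      val ordinal = output.indexWhere(_.exprId == f.attr.exprId)
      f.copy(attr = Attr("none", ordinal.toLong))
    }

  // Two scans of the same table get different exprIds for the same column.
  val out1 = Seq(Attr("id", 10L))
  val out2 = Seq(Attr("id", 20L))
  val filters1 = Seq(GreaterThan(out1.head, 1))
  val filters2 = Seq(GreaterThan(out2.head, 1))

  // Raw filters differ only by exprId, which is enough to break equality.
  val rawEqual: Boolean = filters1 == filters2
  // After normalization both filters refer to output position 0.
  val canonicalEqual: Boolean =
    normalize(filters1, out1) == normalize(filters2, out2)
}
```

In this model `rawEqual` is `false` and `canonicalEqual` is `true`, which mirrors why un-normalized filters could keep otherwise identical scan nodes from matching.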