
[SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 #29564

Closed
wants to merge 11 commits into apache:branch-2.4 from mingjialiu:branch-2.4

Conversation

@mingjialiu
Contributor

mingjialiu commented Aug 27, 2020

What changes were proposed in this pull request?

Override the doCanonicalize function of class DataSourceV2ScanExec.

Why are the changes needed?

Query plans that read from DataSourceV2 never reuse exchanges. This change makes a DataSourceV2 plan reuse exchanges the same way DataSourceV1 and Parquet plans do.

Direct reason: the equals method of DataSourceV2ScanExec returns false when comparing two identical V2 scans (same table, output, and pushed filters).

Actual cause: with the query plan's default doCanonicalize implementation, the pushedFilters of DataSourceV2ScanExec are not canonicalized correctly; in particular, the expression IDs of the predicate columns are not normalized.

SPARK-32708 was not caught by the tests I previously added for [SPARK-32609] because the issue only appears when the same filtered column carries different expression IDs (e.g., joining table t1 with itself); a repro sketch follows below.
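A minimal repro sketch of the scenario described above (the data source format name is a hypothetical placeholder, and this is not code from the PR): a self-join gives each side the same column under a different ExprId, so the two scans only compare equal once their pushed filters are canonicalized, which is what lets ReusedExchange kick in.

```scala
// Hypothetical DSv2 implementation; "com.example.MyDataSourceV2" is a placeholder.
val t1 = spark.read
  .format("com.example.MyDataSourceV2")
  .load()
  .filter("d_day_name IS NOT NULL")

// Self-join: both sides push down IsNotNull(d_day_name), but each side's
// attribute carries a different ExprId (e.g. d_day_name#22364 vs #22420).
val joined = t1.as("a").join(t1.as("b"), "d_day_name")

// Before this fix the physical plan builds two identical exchanges; with it,
// the second one should appear as ReusedExchange.
joined.explain()
```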

Does this PR introduce any user-facing change?

no

How was this patch tested?

unit test added

@gatorsmile
Member

ok to test

@gatorsmile
Member

2.4?

gatorsmile changed the title [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2 → [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 Aug 27, 2020
@@ -54,7 +55,7 @@ case class DataSourceV2Relation(
tableIdent.map(_.unquotedString).getOrElse(s"${source.name}:unknown")
}

-  override def pushedFilters: Seq[Expression] = Seq.empty
+  override def pushedFilters: Seq[Filter] = Seq.empty
Member

Why do we need to change Expression to Filter, which is a public interface?

Contributor Author

More explanation on why I changed Expression to org.apache.spark.sql.sources.Filter:

DataSourceV2ScanExec.pushedFilters is defined as a sequence of Expressions, whose equals method takes the expression ID into account. So, for example, the expression isnotnull(d_day_name#22364) is not considered equal to isnotnull(d_day_name#22420). Therefore, the right thing is to define and compare pushedFilters as the Filter class (see the sketch below).

In both the Spark 3.0 and the affected Spark 2.4 test suites, Filter is the class being used, and the above four places seem to be the only ones that miss declaring pushedFilters as Filter. (Because pushedFilters is declared the right way in those test suites, SPARK-32708 was not caught by the tests I previously added for SPARK-32609, another exchange-reuse bug.)

The use of Expression was introduced by PR [SPARK-23203][SQL] DataSourceV2: Use immutable logical plan. From that PR's description and original intent, I don't see a compelling reason to keep Expression here.
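A minimal sketch (not code from this PR) of why the two representations compare differently: Catalyst expressions embed an ExprId in their attribute references, so structurally identical predicates built in different analysis runs are not equal, while org.apache.spark.sql.sources.Filter only records the column name.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, IsNotNull}
import org.apache.spark.sql.sources.{IsNotNull => SourceIsNotNull}
import org.apache.spark.sql.types.StringType

// Each AttributeReference(...)() call allocates a fresh ExprId, mimicking
// d_day_name#22364 vs d_day_name#22420 from the example above.
val a1 = AttributeReference("d_day_name", StringType)()
val a2 = AttributeReference("d_day_name", StringType)()

IsNotNull(a1) == IsNotNull(a2)                                 // false: ExprIds differ
SourceIsNotNull("d_day_name") == SourceIsNotNull("d_day_name") // true: compares by name
```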

@maropu
Member

maropu commented Aug 28, 2020

If the issue does not happen in branch-3.0+, I think we first need to check in the commit history which commit resolved it. If we find it, we might be able to just cherry-pick it.

@SparkQA

SparkQA commented Aug 28, 2020

Test build #127968 has finished for PR 29564 at commit 704efbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

In branch-3.0, there is a mix-in trait SupportsPushDownFilters, which was introduced by #19136 and #19424.

However, if we are going to cherry-pick the PRs mentioned above, then there will be reasons to cherry-pick the other Data Source V2 related PRs into 2.4 as well.
@mingjialiu is there a strong reason to use Data Source V2 on branch 2.4 instead of 3.0? There seems to be quite some work involved in syncing the DS v2 API changes to branch 2.4.
cc @cloud-fan as well.

@cloud-fan
Contributor

If exchange reuse is broken, it means plan equality is broken somewhere. I think Seq[Expression] is OK as long as we canonicalize it before comparing it. FileSourceScanExec also contains Seq[Expression] and it's fine.

Can you look into it more and have a more surgical fix?

@mingjialiu
Contributor Author

> If exchange reuse is broken, it means plan equality is broken somewhere. I think Seq[Expression] is OK as long as we canonicalize it before comparing it. FileSourceScanExec also contains Seq[Expression] and it's fine.
>
> Can you look into it more and have a more surgical fix?

Updated with a more surgical fix. Please review.
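For reference, roughly what the surgical fix looks like (reconstructed from the diff fragment shown below in this thread and from FileSourceScanExec's doCanonicalize; the surrounding constructor arguments are an assumption based on the 2.4 signature of DataSourceV2ScanExec, not a verbatim copy of the patch):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSeq
import org.apache.spark.sql.catalyst.plans.QueryPlan

// Inside DataSourceV2ScanExec: normalize ExprIds in the output and in the
// pushed-down predicates, so two scans of the same table canonicalize to the
// same plan and ReuseExchange can match them.
override def doCanonicalize(): DataSourceV2ScanExec = {
  DataSourceV2ScanExec(
    output.map(QueryPlan.normalizeExprId(_, output)),
    source,
    options,
    QueryPlan.normalizePredicates(
      pushedFilters,
      AttributeSeq(pushedFilters.flatMap(_.references).distinct)),
    reader)
}
```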

@mingjialiu
Contributor Author

> In branch-3.0, there is a mix-in trait SupportsPushDownFilters, which was introduced by #19136 and #19424.
>
> However, if we are going to cherry-pick the PRs mentioned above, then there will be reasons to cherry-pick the other Data Source V2 related PRs into 2.4 as well.
> @mingjialiu is there a strong reason to use Data Source V2 on branch 2.4 instead of 3.0? There seems to be quite some work involved in syncing the DS v2 API changes to branch 2.4.
> cc @cloud-fan as well.

My org still relies heavily on 2.4

@gengliangwang
Member

@mingjialiu the new fix looks more reasonable. Could you add a test case for the changes?

      options,
      QueryPlan.normalizePredicates(
        pushedFilters,
        AttributeSeq(pushedFilters.flatMap(_.references).distinct)),
Contributor

should we use output here?

@cloud-fan
Contributor

The fix LGTM, can you add a test?

@SparkQA

SparkQA commented Sep 10, 2020

Test build #128481 has finished for PR 29564 at commit a6e4709.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mingjialiu
Contributor Author

mingjialiu commented Sep 10, 2020

> The fix LGTM, can you add a test?

Regarding test coverage, it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression IDs for the same column? Or test suggestions?

The key to a repro is to have the same column assigned different expression IDs.
(Related implementation detail: the old expression ID is preserved if the column is not found.)
Details explained in email.

mingjialiu closed this Sep 10, 2020
mingjialiu reopened this Sep 10, 2020
@mingjialiu
Contributor Author

> The fix LGTM, can you add a test?
>
> Regarding test coverage, it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression IDs for the same column? Or test suggestions?
>
> The key to a repro is to have the same column assigned different expression IDs.
> (Related implementation detail: the old expression ID is preserved if the column is not found.)
> Details explained in email.

> The fix LGTM, can you add a test?

Test added. Please review.

@mingjialiu
Contributor Author

> The fix LGTM, can you add a test?
>
> Regarding test coverage, it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression IDs for the same column? Or test suggestions?
>
> The key to a repro is to have the same column assigned different expression IDs.
> (Related implementation detail: the old expression ID is preserved if the column is not found.)
> Details explained in email.

Please ignore this message. I figured out that a column's expression ID is consistent within the same DataFrame.
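(A quick sketch of that observation, not from the PR: within one DataFrame lineage a column keeps a single ExprId, so the mismatch only shows up across two separate reads or a self-join.)

```scala
val df  = spark.range(5).toDF("i")
val id1 = df.queryExecution.analyzed.output.head.exprId
val id2 = df.select("i").queryExecution.analyzed.output.head.exprId
id1 == id2   // true: same DataFrame lineage, same ExprId

val other = spark.range(5).toDF("i")
id1 == other.queryExecution.analyzed.output.head.exprId   // false: a fresh read gets a fresh ExprId
```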

@SparkQA

SparkQA commented Sep 11, 2020

Test build #128541 has finished for PR 29564 at commit 98483c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2020

Test build #128543 has finished for PR 29564 at commit 8b864e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
}
}
}

test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {
Contributor

If we don't have an end-to-end test, how about a low-level UT? Create two DataSourceV2ScanExec instances and check scan1.sameResult(scan2).

Member

@cloud-fan I think this test case does create two DataSourceV2ScanExec instances and perform the check. It looks OK to me.
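A rough sketch of that kind of low-level check (assuming the suite's AdvancedDataSourceV2 test source, which supports filter pushdown; this is an illustration, not the exact test added in this PR):

```scala
// Inside DataSourceV2Suite: two separate reads of the same source give the
// same columns under different ExprIds, and the canonicalized scans should
// still compare equal via sameResult.
test("DataSourceV2ScanExec canonicalization sketch") {
  def v2Scan(df: DataFrame): DataSourceV2ScanExec =
    df.queryExecution.executedPlan.collect {
      case s: DataSourceV2ScanExec => s
    }.head

  val q1 = spark.read.format(classOf[AdvancedDataSourceV2].getName).load().filter("i > 6")
  val q2 = spark.read.format(classOf[AdvancedDataSourceV2].getName).load().filter("i > 6")

  // sameResult compares the canonicalized plans, which is exactly what
  // ReuseExchange relies on.
  assert(v2Scan(q1).sameResult(v2Scan(q2)))
}
```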

gengliangwang changed the title [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 → [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 Sep 11, 2020
@mingjialiu
Contributor Author

The test pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction is failing. @gatorsmile @cloud-fan @gengliangwang, do you think the failure is related to this change? If so, any suggestions on how to fix it?

@gengliangwang
Member

I think we can rely on the Jenkins test result as well.

@SparkQA

SparkQA commented Sep 11, 2020

Test build #128574 has finished for PR 29564 at commit acadafe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

LGTM

mingjialiu closed this Sep 14, 2020
mingjialiu reopened this Sep 14, 2020
@mingjialiu
Contributor Author

@cloud-fan @gatorsmile Hi, it looks to me that 1 approval is not enough for merging. Can you please approve this PR if everything looks good to you?

@gengliangwang
Member

Thanks, merging to branch 2.4

gengliangwang pushed a commit that referenced this pull request Sep 14, 2020
[SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

### What changes were proposed in this pull request?

Override the doCanonicalize function of class DataSourceV2ScanExec.

### Why are the changes needed?

Query plans that read from DataSourceV2 never reuse exchanges. This change makes a DataSourceV2 plan reuse exchanges the same way DataSourceV1 and Parquet plans do.

Direct reason: the equals method of DataSourceV2ScanExec returns false when comparing two identical V2 scans (same table, output, and pushed filters).

Actual cause: with the query plan's default doCanonicalize implementation, the pushedFilters of DataSourceV2ScanExec are not canonicalized correctly; in particular, the expression IDs of the predicate columns are not normalized.

[SPARK-32708](https://issues.apache.org/jira/browse/SPARK-32708#) was not caught by the [tests](https://github.com/apache/spark/blob/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala#L392) I previously added for [SPARK-32609] because the issue only appears when the same filtered column carries different expression IDs (e.g., joining table t1 with itself).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

unit test added

Closes #29564 from mingjialiu/branch-2.4.

Authored-by: mingjial <mingjial@google.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
@dongjoon-hyun
Member

Thank you, @mingjialiu and all!

@dongjoon-hyun
Member

@gengliangwang, could you update the Fix Version of SPARK-32708?

@gengliangwang
Member

@dongjoon-hyun sure.

@dongjoon-hyun
Member

@gengliangwang, the Fix Version still seems to be empty.
(Screenshot: the SPARK-32708 JIRA page showing an empty Fix Version field, taken Sep 21, 2020.)

@gengliangwang
Member

@dongjoon-hyun Sorry, it's updated now.

@dongjoon-hyun
Member

Thank you, @gengliangwang !

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
[SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

### What changes were proposed in this pull request?

Override the doCanonicalize function of class DataSourceV2ScanExec.

### Why are the changes needed?

Query plans that read from DataSourceV2 never reuse exchanges. This change makes a DataSourceV2 plan reuse exchanges the same way DataSourceV1 and Parquet plans do.

Direct reason: the equals method of DataSourceV2ScanExec returns false when comparing two identical V2 scans (same table, output, and pushed filters).

Actual cause: with the query plan's default doCanonicalize implementation, the pushedFilters of DataSourceV2ScanExec are not canonicalized correctly; in particular, the expression IDs of the predicate columns are not normalized.

[SPARK-32708](https://issues.apache.org/jira/browse/SPARK-32708#) was not caught by the [tests](https://github.com/apache/spark/blob/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala#L392) I previously added for [SPARK-32609] because the issue only appears when the same filtered column carries different expression IDs (e.g., joining table t1 with itself).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

unit test added

Closes apache#29564 from mingjialiu/branch-2.4.

Authored-by: mingjial <mingjial@google.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>

RB=2951163
BUG=APA-51006,LIHADOOP-62503
G=spark-reviewers
R=yezhou,ekrogen,smahadik
A=smahadik