
[SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 #29564

Closed
wants to merge 11 commits into apache:branch-2.4 from mingjialiu:branch-2.4

Conversation

@mingjialiu
Contributor

mingjialiu commented Aug 27, 2020

What changes were proposed in this pull request?

Override the doCanonicalize function of class DataSourceV2ScanExec.

Why are the changes needed?

Query plans that read from DataSourceV2 never reuse exchanges. This change makes a DataSourceV2 plan reuse exchanges the same way DataSourceV1 and Parquet plans do.

Direct reason: the equals method of DataSourceV2ScanExec returns false when comparing two identical V2 scans (same table, output, and pushed filters).

Actual cause: with the query plan's default doCanonicalize implementation, the pushedFilters of DataSourceV2ScanExec are not canonicalized correctly; in particular, the expression IDs of the predicate columns are not normalized.

SPARK-32708 was not caught by the tests I previously added for [SPARK-32609] because the issue only appears when the same filtered column carries different expression IDs (e.g., joining table t1 with itself); a repro sketch follows below.
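A minimal repro sketch of the scenario described above (the data source format name is a hypothetical placeholder, and this is not code from the PR): a self-join gives each side the same column under a different ExprId, so the two scans only compare equal once their pushed filters are canonicalized, which is what lets ReusedExchange kick in.

```scala
// Hypothetical DSv2 implementation; "com.example.MyDataSourceV2" is a placeholder.
val t1 = spark.read
  .format("com.example.MyDataSourceV2")
  .load()
  .filter("d_day_name IS NOT NULL")

// Self-join: both sides push down IsNotNull(d_day_name), but each side's
// attribute carries a different ExprId (e.g. d_day_name#22364 vs #22420).
val joined = t1.as("a").join(t1.as("b"), "d_day_name")

// Before this fix the physical plan builds two identical exchanges; with it,
// the second one should appear as ReusedExchange.
joined.explain()
```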

Does this PR introduce any user-facing change?

no

How was this patch tested?

unit test added

@gatorsmile
Member

ok to test

@gatorsmile
Member

2.4?

gatorsmile changed the title [Spark 32708] Query optimization fails to reuse exchange with DataSourceV2 → [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 Aug 27, 2020
@@ -54,7 +55,7 @@ case class DataSourceV2Relation(
tableIdent.map(_.unquotedString).getOrElse(s"${source.name}:unknown")
}

-  override def pushedFilters: Seq[Expression] = Seq.empty
+  override def pushedFilters: Seq[Filter] = Seq.empty
Member

Why do we need to change Expression to Filter, which is a public interface?

Contributor Author

More explanation on why I changed Expression to org.apache.spark.sql.sources.Filter:

DataSourceV2ScanExec.pushedFilters is defined as a sequence of Expressions, whose equals method takes the expression ID into account. So, for example, the expression isnotnull(d_day_name#22364) is not considered equal to isnotnull(d_day_name#22420). Therefore, the right thing is to define and compare pushedFilters as the Filter class (see the sketch below).

In both the Spark 3.0 and the affected Spark 2.4 test suites, Filter is the class being used, and the above four places seem to be the only ones that miss declaring pushedFilters as Filter. (Because pushedFilters is declared the right way in those test suites, SPARK-32708 was not caught by the tests I previously added for SPARK-32609, another exchange-reuse bug.)

The use of Expression was introduced by PR [SPARK-23203][SQL] DataSourceV2: Use immutable logical plan. From that PR's description and original intent, I don't see a compelling reason to keep Expression here.
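A minimal sketch (not code from this PR) of why the two representations compare differently: Catalyst expressions embed an ExprId in their attribute references, so structurally identical predicates built in different analysis runs are not equal, while org.apache.spark.sql.sources.Filter only records the column name.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, IsNotNull}
import org.apache.spark.sql.sources.{IsNotNull => SourceIsNotNull}
import org.apache.spark.sql.types.StringType

// Each AttributeReference(...)() call allocates a fresh ExprId, mimicking
// d_day_name#22364 vs d_day_name#22420 from the example above.
val a1 = AttributeReference("d_day_name", StringType)()
val a2 = AttributeReference("d_day_name", StringType)()

IsNotNull(a1) == IsNotNull(a2)                                 // false: ExprIds differ
SourceIsNotNull("d_day_name") == SourceIsNotNull("d_day_name") // true: compares by name
```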

@maropu
Member

maropu commented Aug 28, 2020

If the issue does not happen in branch-3.0+, I think we first need to check in the commit history which commit resolved it. If we find it, we might be able to just cherry-pick it.

@SparkQA

SparkQA commented Aug 28, 2020

Test build #127968 has finished for PR 29564 at commit 704efbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

In branch-3.0, there is a mix-in trait SupportsPushDownFilters, which was introduced by #19136 and #19424.

However, if we are going to cherry-pick the PRs mentioned above, then there will be reasons to cherry-pick the other Data Source V2 related PRs into 2.4 as well.
@mingjialiu is there a strong reason to use Data Source V2 on branch 2.4 instead of 3.0? There seems to be quite some work involved in syncing the DS v2 API changes to branch 2.4.
cc @cloud-fan as well.

@cloud-fan
Contributor

If exchange reuse is broken, it means plan equality is broken somewhere. I think Seq[Expression] is OK as long as we canonicalize it before comparing it. FileSourceScanExec also contains Seq[Expression] and it's fine.

Can you look into it more and have a more surgical fix?

@mingjialiu
Contributor Author

> If exchange reuse is broken, it means plan equality is broken somewhere. I think Seq[Expression] is OK as long as we canonicalize it before comparing it. FileSourceScanExec also contains Seq[Expression] and it's fine.
>
> Can you look into it more and have a more surgical fix?

Updated with a more surgical fix. Please review.
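For reference, roughly what the surgical fix looks like (reconstructed from the diff fragment shown below in this thread and from FileSourceScanExec's doCanonicalize; the surrounding constructor arguments are an assumption based on the 2.4 signature of DataSourceV2ScanExec, not a verbatim copy of the patch):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSeq
import org.apache.spark.sql.catalyst.plans.QueryPlan

// Inside DataSourceV2ScanExec: normalize ExprIds in the output and in the
// pushed-down predicates, so two scans of the same table canonicalize to the
// same plan and ReuseExchange can match them.
override def doCanonicalize(): DataSourceV2ScanExec = {
  DataSourceV2ScanExec(
    output.map(QueryPlan.normalizeExprId(_, output)),
    source,
    options,
    QueryPlan.normalizePredicates(
      pushedFilters,
      AttributeSeq(pushedFilters.flatMap(_.references).distinct)),
    reader)
}
```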

@mingjialiu
Contributor Author

> In branch-3.0, there is a mix-in trait SupportsPushDownFilters, which was introduced by #19136 and #19424.
>
> However, if we are going to cherry-pick the PRs mentioned above, then there will be reasons to cherry-pick the other Data Source V2 related PRs into 2.4 as well.
> @mingjialiu is there a strong reason to use Data Source V2 on branch 2.4 instead of 3.0? There seems to be quite some work involved in syncing the DS v2 API changes to branch 2.4.
> cc @cloud-fan as well.

My org still relies heavily on 2.4

@gengliangwang
Member

@mingjialiu the new fix looks more reasonable. Could you add a test case for the changes?

      options,
      QueryPlan.normalizePredicates(
        pushedFilters,
        AttributeSeq(pushedFilters.flatMap(_.references).distinct)),
Contributor

should we use output here?

@cloud-fan
Contributor

The fix LGTM, can you add a test?

@SparkQA

SparkQA commented Sep 10, 2020

Test build #128481 has finished for PR 29564 at commit a6e4709.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mingjialiu
Contributor Author

mingjialiu commented Sep 10, 2020

> The fix LGTM, can you add a test?

Regarding test coverage, it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression IDs for the same column? Or test suggestions?

The key to a repro is to have the same column assigned different expression IDs.
(Related implementation detail: the old expression ID is preserved if the column is not found.)
Details explained in email.

mingjialiu closed this Sep 10, 2020
mingjialiu reopened this Sep 10, 2020
@mingjialiu
Contributor Author

> The fix LGTM, can you add a test?
>
> Regarding test coverage, it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression IDs for the same column? Or test suggestions?
>
> The key to a repro is to have the same column assigned different expression IDs.
> (Related implementation detail: the old expression ID is preserved if the column is not found.)
> Details explained in email.

> The fix LGTM, can you add a test?

Test added. Please review.

@mingjialiu
Contributor Author

> The fix LGTM, can you add a test?
>
> Regarding test coverage, it's a bit tricky to repro in a unit test. Can I get some pointers on populating different expression IDs for the same column? Or test suggestions?
>
> The key to a repro is to have the same column assigned different expression IDs.
> (Related implementation detail: the old expression ID is preserved if the column is not found.)
> Details explained in email.

Please ignore this message. I figured out that a column's expression ID is consistent within the same DataFrame.
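(A quick sketch of that observation, not from the PR: within one DataFrame lineage a column keeps a single ExprId, so the mismatch only shows up across two separate reads or a self-join.)

```scala
val df  = spark.range(5).toDF("i")
val id1 = df.queryExecution.analyzed.output.head.exprId
val id2 = df.select("i").queryExecution.analyzed.output.head.exprId
id1 == id2   // true: same DataFrame lineage, same ExprId

val other = spark.range(5).toDF("i")
id1 == other.queryExecution.analyzed.output.head.exprId   // false: a fresh read gets a fresh ExprId
```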

@SparkQA

SparkQA commented Sep 11, 2020

Test build #128541 has finished for PR 29564 at commit 98483c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2020

Test build #128543 has finished for PR 29564 at commit 8b864e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -393,6 +393,29 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
}
}
}

test("SPARK-32708: same columns with different ExprIds should be equal after canonicalization ") {
Contributor

If we don't have an end-to-end test, how about a low-level UT? Create two DataSourceV2ScanExec instances and check scan1.sameResult(scan2).

Member

@cloud-fan I think this test case does create two DataSourceV2ScanExec instances and perform the check. It looks OK to me.
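A rough sketch of that kind of low-level check (assuming the suite's AdvancedDataSourceV2 test source, which supports filter pushdown; this is an illustration, not the exact test added in this PR):

```scala
// Inside DataSourceV2Suite: two separate reads of the same source give the
// same columns under different ExprIds, and the canonicalized scans should
// still compare equal via sameResult.
test("DataSourceV2ScanExec canonicalization sketch") {
  def v2Scan(df: DataFrame): DataSourceV2ScanExec =
    df.queryExecution.executedPlan.collect {
      case s: DataSourceV2ScanExec => s
    }.head

  val q1 = spark.read.format(classOf[AdvancedDataSourceV2].getName).load().filter("i > 6")
  val q2 = spark.read.format(classOf[AdvancedDataSourceV2].getName).load().filter("i > 6")

  // sameResult compares the canonicalized plans, which is exactly what
  // ReuseExchange relies on.
  assert(v2Scan(q1).sameResult(v2Scan(q2)))
}
```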

gengliangwang changed the title [WIP][SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 → [SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2 Sep 11, 2020
@mingjialiu
Contributor Author

The test pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction is failing. @gatorsmile @cloud-fan @gengliangwang, do you think the failure is related to this change? If so, any suggestions on how to fix it?

@gengliangwang
Member

I think we can rely on the Jenkins test result as well.

@SparkQA

SparkQA commented Sep 11, 2020

Test build #128574 has finished for PR 29564 at commit acadafe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

LGTM

mingjialiu closed this Sep 14, 2020
mingjialiu reopened this Sep 14, 2020
@mingjialiu
Contributor Author

@cloud-fan @gatorsmile Hi, it looks to me that 1 approval is not enough for merging. Can you please approve this PR if everything looks good to you?

@gengliangwang
Member

Thanks, merging to branch 2.4

gengliangwang pushed a commit that referenced this pull request Sep 14, 2020
[SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

### What changes were proposed in this pull request?

Override the doCanonicalize function of class DataSourceV2ScanExec.

### Why are the changes needed?

Query plans that read from DataSourceV2 never reuse exchanges. This change makes a DataSourceV2 plan reuse exchanges the same way DataSourceV1 and Parquet plans do.

Direct reason: the equals method of DataSourceV2ScanExec returns false when comparing two identical V2 scans (same table, output, and pushed filters).

Actual cause: with the query plan's default doCanonicalize implementation, the pushedFilters of DataSourceV2ScanExec are not canonicalized correctly; in particular, the expression IDs of the predicate columns are not normalized.

[SPARK-32708](https://issues.apache.org/jira/browse/SPARK-32708#) was not caught by the [tests](https://github.com/apache/spark/blob/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala#L392) I previously added for [SPARK-32609] because the issue only appears when the same filtered column carries different expression IDs (e.g., joining table t1 with itself).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

unit test added

Closes #29564 from mingjialiu/branch-2.4.

Authored-by: mingjial <mingjial@google.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
@dongjoon-hyun
Member

Thank you, @mingjialiu and all!

@dongjoon-hyun
Member

@gengliangwang, could you update the Fix Version of SPARK-32708?

@gengliangwang
Member

@dongjoon-hyun sure.

@dongjoon-hyun
Member

@gengliangwang, the Fix Version still seems to be empty.
(Screenshot: the SPARK-32708 JIRA page showing an empty Fix Version field, taken Sep 21, 2020.)

@gengliangwang
Member

@dongjoon-hyun Sorry, it's updated now.

@dongjoon-hyun
Member

Thank you, @gengliangwang !

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
[SPARK-32708] Query optimization fails to reuse exchange with DataSourceV2

### What changes were proposed in this pull request?

Override the doCanonicalize function of class DataSourceV2ScanExec.

### Why are the changes needed?

Query plans that read from DataSourceV2 never reuse exchanges. This change makes a DataSourceV2 plan reuse exchanges the same way DataSourceV1 and Parquet plans do.

Direct reason: the equals method of DataSourceV2ScanExec returns false when comparing two identical V2 scans (same table, output, and pushed filters).

Actual cause: with the query plan's default doCanonicalize implementation, the pushedFilters of DataSourceV2ScanExec are not canonicalized correctly; in particular, the expression IDs of the predicate columns are not normalized.

[SPARK-32708](https://issues.apache.org/jira/browse/SPARK-32708#) was not caught by the [tests](https://github.com/apache/spark/blob/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala#L392) I previously added for [SPARK-32609] because the issue only appears when the same filtered column carries different expression IDs (e.g., joining table t1 with itself).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

unit test added

Closes apache#29564 from mingjialiu/branch-2.4.

Authored-by: mingjial <mingjial@google.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>

RB=2951163
BUG=APA-51006,LIHADOOP-62503
G=spark-reviewers
R=yezhou,ekrogen,smahadik
A=smahadik