
[SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression #29270

Closed
wants to merge 57 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Jul 28, 2020

What changes were proposed in this pull request?

This PR proposes to detect possible regressions inside SparkPlan. To achieve this goal, it adds a base test suite called PlanStabilitySuite. The basic workflow of this suite is similar to SQLQueryTestSuite: it also uses SPARK_GENERATE_GOLDEN_FILES to decide whether to regenerate the golden files or to compare against the golden result for each input query. The difference is that PlanStabilitySuite uses the serialized explain result (.txt format) of the SparkPlan as the output of a query, instead of the data result.
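A minimal sketch of this workflow (the helper names below are illustrative, not the PR's actual API):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object GoldenFileSketch {
  // Regenerate golden files when the env var is set, otherwise compare.
  private val regenerateGoldenFiles =
    sys.env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  // Stand-in for serializing a query's explain output to text.
  private def explainOf(query: String): String = ???

  def checkWithGoldenFile(query: String, goldenPath: String): Unit = {
    val actual = explainOf(query)
    val golden = Paths.get(goldenPath)
    if (regenerateGoldenFiles) {
      Files.write(golden, actual.getBytes(StandardCharsets.UTF_8)) // regenerate
    } else {
      val expected = new String(Files.readAllBytes(golden), StandardCharsets.UTF_8)
      assert(actual == expected, s"Plan changed vs. golden file $goldenPath")
    }
  }
}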

Since SparkPlan is non-deterministic for various reasons (e.g., expression IDs change, expression order changes), we reduce the plan to a simplified version that only contains node names and references, and we only track important nodes, e.g., Exchange and SubqueryExec, in the simplified plan.
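A minimal sketch of the idea, assuming normalization by regex and a name-only tree dump (not the suite's exact code):

import org.apache.spark.sql.execution.SparkPlan

object SimplifySketch {
  // Normalize volatile details: expression IDs like #123 become #x.
  def normalizeIds(plan: String): String = plan.replaceAll("#\\d+", "#x")

  // Keep only node names, indented by depth, so the output stays stable
  // across runs even when expression IDs or expression order change.
  def simplify(node: SparkPlan, depth: Int = 0): String =
    ("  " * depth) + node.nodeName + "\n" +
      node.children.map(simplify(_, depth + 1)).mkString
}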

We reuse the TPC-DS queries (v1.4, v2.7, modified) to test plan stability. Currently, each TPC-DS query has exactly one corresponding simplified golden plan.

This PR also includes some refactoring, extracting TPCDSBase from TPCDSQuerySuite so that PlanStabilitySuite can use the TPC-DS queries as well.
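Roughly, the shape of that extraction (the members shown are illustrative):

trait TPCDSBase {
  // Shared TPC-DS pieces: table schemas, table creation helpers,
  // and the query resources.
  def tpcdsQueries: Seq[String]
  def createTables(): Unit
}

Both TPCDSQuerySuite and PlanStabilitySuite can then mix the trait in.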

Why are the changes needed?

Spark is getting more and more complex, and any change might cause a regression unintentionally. Spark already has benchmarks to catch performance regressions, but it has no way to detect regressions inside SparkPlan. It would be good to detect possible regressions early, during testing, rather than at runtime.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added PlanStabilitySuite and its subclasses.

@Ngone51
Member Author

Ngone51 commented Jul 28, 2020

Reminder for reviewers: please focus on these files: PlanStabilitySuite.scala, TPCDSBase.scala, TPCDSSchema.scala, and TPCDSQuerySuite.scala. The remaining files are all generated golden files for the TPC-DS queries.

@Ngone51
Member Author

Ngone51 commented Jul 28, 2020

ping @gengliangwang @maropu @viirya @cloud-fan Please take a look, thanks!

@SparkQA

SparkQA commented Jul 28, 2020

Test build #126717 has finished for PR 29270 at commit 1c5d192.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2020

Test build #126719 has finished for PR 29270 at commit f23776d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126764 has finished for PR 29270 at commit d101674.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126763 has finished for PR 29270 at commit ebd6f4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 30, 2020

@Ngone51 Thanks for the work; it looks nice to me. One question: do we need to dump all the plans (parsed, analyzed, optimized, and executed) for this PR's purpose? https://github.com/apache/spark/pull/29270/files#diff-5d15f99eb9e3c212603e4754629d0b7bR168 I mean, wouldn't dumping only the formatted physical plans (or only the optimized logical plans and physical plans) be enough?

@maropu
Member

maropu commented Jul 30, 2020

in case there would be a query that has multiple acceptable SparkPlan.

What's a concrete case (or query) of multiple acceptable SparkPlans? Is that the case where the actual query running time is environment-dependent?

@HyukjinKwon
Member

Hm, do we really need to test all plans as string output? It's going to make backporting very difficult whenever we make changes in the plans. It's even more difficult than the .sql files, where we check only the output, because this tracks the internal plans as well.

@Ngone51
Member Author

Ngone51 commented Jul 30, 2020

I mean, wouldn't dumping only the formatted physical plans (or only the optimized logical plans and physical plans) be enough?

I think our overall purpose is to reduce performance regressions when executing the query. The executed plan is the one that directly affects execution. If the executed plan doesn't change, there shouldn't be an obvious performance regression, even though the parsed/optimized plans may change over time.

I also agree with @HyukjinKwon's point. That's also why we only check performance-sensitive nodes in the simplified plan, since other nodes can easily change arbitrarily.

@Ngone51
Member Author

Ngone51 commented Jul 30, 2020

What's a concrete case (or query) of multiple acceptable SparkPlans? Is that the case where the actual query running time is environment-dependent?

I don't have a concrete case. This is actually based on Spark's original assumption that a query may have multiple physical plans and Spark needs to select the best one based on some rules, e.g., cost-based rules. Though Spark has never made that a reality.

@maropu
Member

maropu commented Jul 30, 2020

I don't have a concrete case. This is actually based on Spark's original assumption that a query may have multiple physical plans and Spark needs to select the best one based on some rules, e.g., cost-based rules. Though Spark has never made that a reality.

Hm, I see. If so, how about separating it from this PR and focusing on the main purpose (adding the test suite to check plan diffs)? I'm a bit worried that no one will use that feature.

@cloud-fan
Contributor

One of my worries: this test generates plans with empty tables, so we lose test coverage for things like SMJ. Can the variant feature help improve the test coverage here?

a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 14, 2020
### What changes were proposed in this pull request?

Use the `LinkedHashMap` instead of `immutable.Map` to hold the `Window` expressions in `ExtractWindowExpressions.addWindow`.

### Why are the changes needed?

This is a bug fix for apache/spark#29270. In that PR, the plans generated on Jenkins (especially for queries q47, q49, and q57) never match the golden plans generated on my laptop.

This happens because `ExtractWindowExpressions.addWindow` uses an `immutable.Map` to hold the `Window` expressions, keyed by `(spec.partitionSpec, spec.orderSpec, WindowFunctionType.functionType(expr))`, and converts the map to a `Seq` at the end. The `Seq` is then used to add Window operators on top of the child plan. However, for the same query, the order of the Window expressions inside the `Seq` can be nondeterministic when the expression IDs change (which affects the keys). As a result, the same query can produce different plans because of the nondeterministic order of Window operators (see the toy illustration after this commit message).

Therefore, we use `LinkedHashMap`, which records the insertion order of entries, to make the order deterministic.

### Does this PR introduce _any_ user-facing change?

Possibly: users will now always see the same plan for a query with multiple Window operators.

### How was this patch tested?

It's really hard to create a reproducible demo. I just tested manually with apache/spark#29270 and it looks good.

Closes #29432 from Ngone51/fix-addWindow.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
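A toy illustration of the nondeterminism described in the commit message above (not the PR's code): once an immutable.Map grows past four entries it becomes a hash trie, so iteration order follows key hashes rather than insertion order, whereas mutable.LinkedHashMap iterates in insertion order:

import scala.collection.mutable

object OrderSketch extends App {
  // With five or more entries, immutable.Map's iteration order depends on
  // the keys' hash codes. In addWindow the keys embed expressions, so the
  // order shifts when expression IDs shift.
  val byHash = Map("e" -> 5, "d" -> 4, "c" -> 3, "b" -> 2, "a" -> 1)
  println(byHash.toSeq) // order reflects hashing, not insertion

  // LinkedHashMap records insertion order, so the resulting Seq is stable.
  val byInsertion = mutable.LinkedHashMap("e" -> 5, "d" -> 4, "c" -> 3, "b" -> 2, "a" -> 1)
  println(byInsertion.toSeq) // (e,5), (d,4), (c,3), (b,2), (a,1)
}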
@SparkQA

SparkQA commented Aug 15, 2020

Test build #127470 has finished for PR 29270 at commit 5b223fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// exploit data statistics.
conf.setConf(SQLConf.CBO_ENABLED, true)
conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)
Member

Wouldn't it be better to also check the preferSortMergeJoin=false case? I think some users actively use hash joins instead.

Member Author

Hmm, I'm not sure about this, but it seems preferSortMergeJoin is enabled by default? Or do you mean we should test with preferSortMergeJoin both enabled and disabled?

Contributor

Let's do it later. TPCDSQuerySuite doesn't test shuffle hash join either, and we should fix that as well.

Member Author

The configurations have been extracted to the base class (TPCDSBase). If we change them here, I think it will also take effect for TPCDSQuerySuite.

Contributor

I believe it needs more config tuning to make sure we plan a shuffle hash join rather than a sort merge join or broadcast hash join (a sketch of the idea is below).
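For reference, a sketch of the kind of tuning meant here; these SQLConf entries exist, but whether they suffice for the TPC-DS plans would need verification:

// Sketch: steer the planner toward shuffle hash join by disabling both
// the sort-merge preference and broadcast joins.
conf.setConf(SQLConf.PREFER_SORTMERGEJOIN, false)
conf.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, -1L)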

@maropu
Member

maropu commented Aug 15, 2020

I have no more comments; it looks okay to me except for the one open comment.

@Ngone51
Member Author

Ngone51 commented Aug 17, 2020

Have you checked the running time? https://github.com/apache/spark/pull/29270/files#r469675879

I may not fully understand the usage of @ExtendedSQLTest. The total running time is about 2 minutes both with and without the annotation (I used the same command). Maybe we don't need it? @maropu

@cloud-fan
Contributor

The total running time is about 2 minutes both with and without the annotation

Then I think we don't need the tag.

@Ngone51
Member Author

Ngone51 commented Aug 17, 2020

Then I think we don't need the tag.

Ok.

@SparkQA

SparkQA commented Aug 17, 2020

Test build #127506 has finished for PR 29270 at commit ebdf7fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9f2893c Aug 17, 2020
@Ngone51
Member Author

Ngone51 commented Aug 18, 2020

thank you all!!

@gatorsmile
Member

great job!

(1) Scan parquet default.customer
Output [3]: [c_customer_sk#1, c_current_cdemo_sk#2, c_current_addr_sk#3]
Batched: true
Location: InMemoryFileIndex [file:/Users/yi.wu/IdeaProjects/spark/sql/core/spark-warehouse/org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite/customer]
Member

@Ngone51 Could we avoid including Location? Maybe we can use replaceNotIncludedMsg:

protected def replaceNotIncludedMsg(line: String): String = {
  line.replaceAll("#\\d+", "#x")
    .replaceAll(
      s"Location.*$clsName/",
      s"Location $notIncludedMsg/{warehouse_dir}/")
    .replaceAll("Created By.*", s"Created By $notIncludedMsg")
    .replaceAll("Created Time.*", s"Created Time $notIncludedMsg")
    .replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
}

Member Author

Oh, I see. Thanks for pointing it out. I'll make a follow-up soon.

* @param explain the full explain output; this is saved to help debug later as the simplified
* plan is not too useful for debugging
*/
private def generateApprovedPlanFile(plan: SparkPlan, name: String, explain: String): Unit = {
Member

I think we usually say a golden file instead of an approved file.

Member Author

I can rename it to generateGoldenFile, since we prefer "expected"/"golden" over "approved".

* WholeStageCodegen
* Project [c_customer_id]
*/
def getSimplifiedPlan(node: SparkPlan, depth: Int): String = {
Member

Using the same name for the inner and outer functions looks a bit odd...

Member Author

Let me rename it to simplifyNode.

private def isApproved(dir: File, actualSimplifiedPlan: String): Boolean = {
  val file = new File(dir, "simplified.txt")
  val approved = FileUtils.readFileToString(file, StandardCharsets.UTF_8)
  approved == actualSimplifiedPlan
}
Member

If actualSimplifiedPlan implies an expected-vs-actual comparison, I think it would be nicer to rename approved to expected.

Member Author

Sounds reasonable.

@Ngone51
Member Author

Ngone51 commented Aug 25, 2020

@HyukjinKwon I can address your comments in the follow-up PR: #29537

HyukjinKwon pushed a commit that referenced this pull request Aug 25, 2020
### What changes were proposed in this pull request?

1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.

2. Reuse `replaceNotIncludedMsg` to normalize the explain plans generated in `PlanStabilitySuite`.

### Why are the changes needed?

This is a follow-up of #29270.
It eliminates personal information (e.g., local directories) from the explain plan.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated test.

Closes #29537 from Ngone51/follow-up-plan-stablity.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
*
* To re-generate golden files for entire suite, run:
* {{{
* SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *PlanStability[WithStats]Suite"
Member

Does this work?

In my environment, I have to do something like

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *PlanStabilitySuite *PlanStabilityWithStatsSuite"

Contributor

[WithStats] means you can optionally include it when running the command; it's not supported SBT syntax. The two concrete invocations are spelled out below.
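Spelled out, the two concrete invocations are:

build/sbt "sql/test-only *PlanStabilitySuite"
build/sbt "sql/test-only *PlanStabilityWithStatsSuite"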

*
* To run the entire test suite:
* {{{
* build/sbt "sql/test-only *PlanStability[WithStats]Suite"
Member

Same here.
