
[SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression #29270

Closed
wants to merge 57 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Jul 28, 2020

What changes were proposed in this pull request?

This PR proposes to detect possible regressions inside SparkPlan. To achieve this goal, it adds a base test suite called PlanStabilitySuite. The basic workflow of this suite is similar to SQLQueryTestSuite: it also uses SPARK_GENERATE_GOLDEN_FILES to decide whether to regenerate the golden files or to compare against the golden result for each input query. The difference is that PlanStabilitySuite uses the serialized explain result (.txt format) of the SparkPlan as the output of a query, instead of the data result.
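A minimal sketch of this workflow (the helper names below are illustrative, not the PR's actual API):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object GoldenFileSketch {
  // Regenerate golden files when the env var is set, otherwise compare.
  private val regenerateGoldenFiles =
    sys.env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  // Stand-in for serializing a query's explain output to text.
  private def explainOf(query: String): String = ???

  def checkWithGoldenFile(query: String, goldenPath: String): Unit = {
    val actual = explainOf(query)
    val golden = Paths.get(goldenPath)
    if (regenerateGoldenFiles) {
      Files.write(golden, actual.getBytes(StandardCharsets.UTF_8)) // regenerate
    } else {
      val expected = new String(Files.readAllBytes(golden), StandardCharsets.UTF_8)
      assert(actual == expected, s"Plan changed vs. golden file $goldenPath")
    }
  }
}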

Since SparkPlan is non-deterministic for various reasons (e.g., expression IDs change, expression order changes), we reduce the plan to a simplified version that only contains node names and references, and we only track important nodes, e.g., Exchange and SubqueryExec, in the simplified plan.
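A minimal sketch of the idea, assuming normalization by regex and a name-only tree dump (not the suite's exact code):

import org.apache.spark.sql.execution.SparkPlan

object SimplifySketch {
  // Normalize volatile details: expression IDs like #123 become #x.
  def normalizeIds(plan: String): String = plan.replaceAll("#\\d+", "#x")

  // Keep only node names, indented by depth, so the output stays stable
  // across runs even when expression IDs or expression order change.
  def simplify(node: SparkPlan, depth: Int = 0): String =
    ("  " * depth) + node.nodeName + "\n" +
      node.children.map(simplify(_, depth + 1)).mkString
}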

We reuse the TPC-DS queries (v1.4, v2.7, modified) to test plan stability. Currently, each TPC-DS query has exactly one corresponding simplified golden plan.

This PR also includes some refactoring, extracting TPCDSBase from TPCDSQuerySuite so that PlanStabilitySuite can use the TPC-DS queries as well.
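Roughly, the shape of that extraction (the members shown are illustrative):

trait TPCDSBase {
  // Shared TPC-DS pieces: table schemas, table creation helpers,
  // and the query resources.
  def tpcdsQueries: Seq[String]
  def createTables(): Unit
}

Both TPCDSQuerySuite and PlanStabilitySuite can then mix the trait in.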

Why are the changes needed?

Spark is getting more and more complex, and any change might cause a regression unintentionally. Spark already has benchmarks to catch performance regressions, but it has no way to detect regressions inside SparkPlan. It would be good to detect possible regressions early, during testing, rather than at runtime.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added PlanStabilitySuite and its subclasses.

@Ngone51
Member Author

Ngone51 commented Jul 28, 2020

Reminder for reviewers: please focus on these files: PlanStabilitySuite.scala, TPCDSBase.scala, TPCDSSchema.scala, and TPCDSQuerySuite.scala. The remaining files are all generated golden files for the TPC-DS queries.

@Ngone51
Member Author

Ngone51 commented Jul 28, 2020

ping @gengliangwang @maropu @viirya @cloud-fan Please take a look, thanks!

@SparkQA

SparkQA commented Jul 28, 2020

Test build #126717 has finished for PR 29270 at commit 1c5d192.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2020

Test build #126719 has finished for PR 29270 at commit f23776d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126764 has finished for PR 29270 at commit d101674.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126763 has finished for PR 29270 at commit ebd6f4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 30, 2020

@Ngone51 Thanks for the work; it looks nice to me. One question: do we need to dump all the plans (parsed, analyzed, optimized, and executed) for this PR's purpose? https://github.com/apache/spark/pull/29270/files#diff-5d15f99eb9e3c212603e4754629d0b7bR168 I mean, wouldn't dumping only the formatted physical plans (or only the optimized logical plans and physical plans) be enough?

@maropu
Member

maropu commented Jul 30, 2020

in case there would be a query that has multiple acceptable SparkPlan.

What's a concrete case (or query) of multiple acceptable SparkPlans? Is that the case where the actual query running time is environment-dependent?

@HyukjinKwon
Member

Hm, do we really need to test all plans as string output? It's going to make backporting very difficult whenever we make changes in the plans. It's even more difficult than the .sql files, where we check only the output, because this tracks the internal plans as well.

@Ngone51
Member Author

Ngone51 commented Jul 30, 2020

I mean, wouldn't dumping only the formatted physical plans (or only the optimized logical plans and physical plans) be enough?

I think our overall purpose is to reduce performance regressions when executing the query. The executed plan is the one that directly affects execution. If the executed plan doesn't change, there shouldn't be an obvious performance regression, even though the parsed/optimized plans may change over time.

I also agree with @HyukjinKwon's point. That's also why we only check performance-sensitive nodes in the simplified plan, since other nodes can easily change arbitrarily.

@Ngone51
Member Author

Ngone51 commented Jul 30, 2020

What's a concrete case (or query) of multiple acceptable SparkPlans? Is that the case where the actual query running time is environment-dependent?

I don't have a concrete case. This is actually based on Spark's original assumption that a query may have multiple physical plans and Spark needs to select the best one based on some rules, e.g., cost-based rules. Though Spark has never made that a reality.

@maropu
Member

maropu commented Jul 30, 2020

I don't have a concrete case. This is actually based on Spark's original assumption that a query may have multiple physical plans and Spark needs to select the best one based on some rules, e.g., cost-based rules. Though Spark has never made that a reality.

Hm, I see. If so, how about separating it from this PR and focusing on the main purpose (adding the test suite to check plan diffs)? I'm a bit worried that no one will use that feature.

@cloud-fan
Contributor

One of my worries: this test generates plans with empty tables, so we lose test coverage for things like SMJ. Can the variant feature help improve the test coverage here?

a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 14, 2020
### What changes were proposed in this pull request?

Use the `LinkedHashMap` instead of `immutable.Map` to hold the `Window` expressions in `ExtractWindowExpressions.addWindow`.

### Why are the changes needed?

This is a bug fix for apache/spark#29270. In that PR, the plans generated on Jenkins (especially for queries q47, q49, and q57) never match the golden plans generated on my laptop.

This happens because `ExtractWindowExpressions.addWindow` uses an `immutable.Map` to hold the `Window` expressions, keyed by `(spec.partitionSpec, spec.orderSpec, WindowFunctionType.functionType(expr))`, and converts the map to a `Seq` at the end. The `Seq` is then used to add Window operators on top of the child plan. However, for the same query, the order of the Window expressions inside the `Seq` can be nondeterministic when the expression IDs change (which affects the keys). As a result, the same query can produce different plans because of the nondeterministic order of Window operators (see the toy illustration after this commit message).

Therefore, we use `LinkedHashMap`, which records the insertion order of entries, to make the order deterministic.

### Does this PR introduce _any_ user-facing change?

Possibly: users will now always see the same plan for a query with multiple Window operators.

### How was this patch tested?

It's really hard to create a reproducible demo. I just tested manually with apache/spark#29270 and it looks good.

Closes #29432 from Ngone51/fix-addWindow.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
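A toy illustration of the nondeterminism described in the commit message above (not the PR's code): once an immutable.Map grows past four entries it becomes a hash trie, so iteration order follows key hashes rather than insertion order, whereas mutable.LinkedHashMap iterates in insertion order:

import scala.collection.mutable

object OrderSketch extends App {
  // With five or more entries, immutable.Map's iteration order depends on
  // the keys' hash codes. In addWindow the keys embed expressions, so the
  // order shifts when expression IDs shift.
  val byHash = Map("e" -> 5, "d" -> 4, "c" -> 3, "b" -> 2, "a" -> 1)
  println(byHash.toSeq) // order reflects hashing, not insertion

  // LinkedHashMap records insertion order, so the resulting Seq is stable.
  val byInsertion = mutable.LinkedHashMap("e" -> 5, "d" -> 4, "c" -> 3, "b" -> 2, "a" -> 1)
  println(byInsertion.toSeq) // (e,5), (d,4), (c,3), (b,2), (a,1)
}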
@SparkQA

SparkQA commented Aug 15, 2020

Test build #127470 has finished for PR 29270 at commit 5b223fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// exploit data statistics.
conf.setConf(SQLConf.CBO_ENABLED, true)
conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)
Member

Wouldn't it be better to also check the preferSortMergeJoin=false case? I think some users actively use hash joins instead.

Member Author

Hmm, I'm not sure about this, but it seems preferSortMergeJoin is enabled by default? Or do you mean we should test with preferSortMergeJoin both enabled and disabled?

Contributor

Let's do it later. TPCDSQuerySuite doesn't test shuffle hash join either, and we should fix that as well.

Member Author

The configurations have been extracted to the base class (TPCDSBase). If we change them here, I think it will also take effect for TPCDSQuerySuite.

Contributor

I believe it needs more config tuning to make sure we plan a shuffle hash join rather than a sort merge join or broadcast hash join (a sketch of the idea is below).
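For reference, a sketch of the kind of tuning meant here; these SQLConf entries exist, but whether they suffice for the TPC-DS plans would need verification:

// Sketch: steer the planner toward shuffle hash join by disabling both
// the sort-merge preference and broadcast joins.
conf.setConf(SQLConf.PREFER_SORTMERGEJOIN, false)
conf.setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, -1L)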

@maropu
Member

maropu commented Aug 15, 2020

I have no more comments; it looks okay to me except for the one open comment.

@Ngone51
Member Author

Ngone51 commented Aug 17, 2020

Have you checked the running time? https://github.com/apache/spark/pull/29270/files#r469675879

I may not fully understand the usage of @ExtendedSQLTest. The total running time is about 2 minutes both with and without the annotation (I used the same command). Maybe we don't need it? @maropu

@cloud-fan
Contributor

The total running time is about 2 minutes both with and without the annotation

Then I think we don't need the tag.

@Ngone51
Member Author

Ngone51 commented Aug 17, 2020

Then I think we don't need the tag.

Ok.

@SparkQA

SparkQA commented Aug 17, 2020

Test build #127506 has finished for PR 29270 at commit ebdf7fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9f2893c Aug 17, 2020
@Ngone51
Member Author

Ngone51 commented Aug 18, 2020

thank you all!!

@gatorsmile
Member

great job!

(1) Scan parquet default.customer
Output [3]: [c_customer_sk#1, c_current_cdemo_sk#2, c_current_addr_sk#3]
Batched: true
Location: InMemoryFileIndex [file:/Users/yi.wu/IdeaProjects/spark/sql/core/spark-warehouse/org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite/customer]
Member

@Ngone51 Could we avoid including Location? Maybe we can use replaceNotIncludedMsg:

protected def replaceNotIncludedMsg(line: String): String = {
  line.replaceAll("#\\d+", "#x")
    .replaceAll(
      s"Location.*$clsName/",
      s"Location $notIncludedMsg/{warehouse_dir}/")
    .replaceAll("Created By.*", s"Created By $notIncludedMsg")
    .replaceAll("Created Time.*", s"Created Time $notIncludedMsg")
    .replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
}

Member Author

Oh, I see. Thanks for pointing it out. I'll make a follow-up soon.

* @param explain the full explain output; this is saved to help debug later as the simplified
* plan is not too useful for debugging
*/
private def generateApprovedPlanFile(plan: SparkPlan, name: String, explain: String): Unit = {
Member

I think we usually say a golden file instead of an approved file.

Member Author

I can rename it to generateGoldenFile, since we prefer "expected"/"golden" over "approved".

* WholeStageCodegen
* Project [c_customer_id]
*/
def getSimplifiedPlan(node: SparkPlan, depth: Int): String = {
Member

Using the same name for the inner and outer functions looks a bit odd...

Member Author

Let me rename it to simplifyNode.

private def isApproved(dir: File, actualSimplifiedPlan: String): Boolean = {
  val file = new File(dir, "simplified.txt")
  val approved = FileUtils.readFileToString(file, StandardCharsets.UTF_8)
  approved == actualSimplifiedPlan
}
Member

If actualSimplifiedPlan implies an expected-vs-actual comparison, I think it would be nicer to rename approved to expected.

Member Author

Sounds reasonable.

@Ngone51
Member Author

Ngone51 commented Aug 25, 2020

@HyukjinKwon I can address your comments in the follow-up PR: #29537

HyukjinKwon pushed a commit that referenced this pull request Aug 25, 2020
### What changes were proposed in this pull request?

1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.

2. Reuse `replaceNotIncludedMsg` to normalize the explain plans generated in `PlanStabilitySuite`.

### Why are the changes needed?

This is a follow-up of #29270.
It eliminates personal information (e.g., local directories) from the explain plan.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated test.

Closes #29537 from Ngone51/follow-up-plan-stablity.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
*
* To re-generate golden files for entire suite, run:
* {{{
* SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *PlanStability[WithStats]Suite"
Member

Does this work?

In my environment, I have to do something like

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *PlanStabilitySuite *PlanStabilityWithStatsSuite"

Contributor

[WithStats] means you can optionally include it when running the command; it's not supported SBT syntax. The two concrete invocations are spelled out below.
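Spelled out, the two concrete invocations are:

build/sbt "sql/test-only *PlanStabilitySuite"
build/sbt "sql/test-only *PlanStabilityWithStatsSuite"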

*
* To run the entire test suite:
* {{{
* build/sbt "sql/test-only *PlanStability[WithStats]Suite"
Member

Same here.
