[SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation #21668

maropu · 2018-06-29T06:58:31Z

What changes were proposed in this pull request?

This PR proposes a new, independent config so that LogicalRelation can use rowCount to compute data statistics in logical plans even if CBO is disabled. On master, we currently cannot enable StarSchemaDetection.reorderStarJoins: we need to turn off CBO to enable it, but StarSchemaDetection internally references the rowCount that LogicalRelation only provides when CBO is enabled.

Why are the changes needed?

Plan stats are useful beyond CBO, e.g., for the star-schema detector and dynamic partition pruning.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests in DataFrameJoinSuite.

maropu · 2018-06-29T06:58:42Z

This comes from #20345.

SparkQA · 2018-06-29T11:05:56Z

Test build #92460 has finished for PR 21668 at commit f0db73b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

cloud-fan · 2018-07-03T04:03:00Z

yea this is a real problem, but I feel a better solution is to integrate the StarSchemaDetection into CBO. How hard will it be?

maropu · 2018-07-03T04:18:38Z

yea, ok. I'll recheck this again. Thanks!

maropu · 2018-07-04T05:27:13Z

One refactoring idea is to inject the functionality of ReorderJoin (=StarSchemaDetection) into CostBasedJoinReorder:

In the batch rule Join Reorder (Once strategy), if spark.sql.cbo.starSchemaDetection is enabled (false by default), the rule applies star schema detection first. If a fact table is found, the dimension tables are reordered by the cost-based algorithm. If spark.sql.cbo.starSchemaDetection is disabled, the rule just uses CostBasedJoinReorder.

Currently, we have ReorderJoin (=StarSchemaDetection) in a batch with the fixedPoint strategy, so I think that if we could remove this rule from there, we would skip the unnecessary checks that ReorderJoin causes on every rule iteration.

@cloud-fan WDYT?
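
For readers following the proposal above, its control flow can be sketched roughly as follows. This is a toy model under stated assumptions, not Spark code: `Table`, `detectFactTable`, `costBasedOrder`, and `reorder` are hypothetical names, the "largest table is the fact table" heuristic and the ascending-row-count ordering are placeholders for the real detection and cost-based algorithms, and falling back to the cost-based order when no fact table is found is an assumption the comment does not spell out.

```scala
object JoinReorderSketch {
  // Toy stand-in for a relation with a row-count estimate (not a Spark class).
  final case class Table(name: String, rowCount: Long)

  // Hypothetical heuristic: the strictly largest table is treated as the fact table.
  def detectFactTable(tables: Seq[Table]): Option[Table] =
    if (tables.size < 2) None
    else {
      val sorted = tables.sortBy(_.rowCount).reverse
      if (sorted(0).rowCount > sorted(1).rowCount) Some(sorted(0)) else None
    }

  // Placeholder for the cost-based algorithm: order by ascending row count.
  def costBasedOrder(tables: Seq[Table]): Seq[Table] = tables.sortBy(_.rowCount)

  // One rule, applied once: star schema detection first (if enabled),
  // then cost-based ordering of whatever remains.
  def reorder(tables: Seq[Table], starSchemaDetection: Boolean): Seq[Table] =
    if (starSchemaDetection) {
      detectFactTable(tables) match {
        case Some(fact) => fact +: costBasedOrder(tables.filterNot(_ == fact))
        case None       => costBasedOrder(tables) // assumed fallback
      }
    } else {
      costBasedOrder(tables)
    }
}
```

With the flag off, only the cost-based path runs; with it on, the detected fact table is pinned first and only the dimension tables go through the cost-based ordering.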

cloud-fan · 2018-07-04T07:46:34Z

sounds reasonable, also cc @wzhfy @maryannxue

maropu · 2018-07-12T06:03:29Z

@cloud-fan If no problem, could you check #20345 and merge it first? Based on that, I'd like to start refactoring for the approach.

maropu · 2018-07-21T08:20:49Z

ping

SparkQA · 2018-07-21T12:36:48Z

Test build #93382 has finished for PR 21668 at commit 0b1f751.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-03T04:15:11Z

@cloud-fan ping

SparkQA · 2019-05-21T13:36:36Z

Test build #105617 has finished for PR 21668 at commit 0b1f751.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-09T15:31:24Z

Test build #107408 has finished for PR 21668 at commit 0b1f751.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-07-19T03:30:59Z

retest this please

HyukjinKwon · 2019-07-19T03:31:11Z

Seems fine to me too

maropu · 2019-07-19T03:40:32Z

thx for your response, @HyukjinKwon

SparkQA · 2019-07-19T03:43:22Z

Test build #107881 has finished for PR 21668 at commit 0b1f751.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-07-19T03:45:23Z

will fix in hours.

SparkQA · 2019-07-19T07:05:01Z

Test build #107887 has finished for PR 21668 at commit fabd8ee.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-07-19T07:12:16Z

retest this please

SparkQA · 2019-07-19T10:53:15Z

Test build #107888 has finished for PR 21668 at commit fabd8ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-07-21T00:03:13Z

ping @dongjoon-hyun @HyukjinKwon

dongjoon-hyun · 2019-07-21T01:19:41Z

@maropu . What is the relationship with #20345? Do you want to go without that?

[SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation #21668 (comment)

maropu · 2019-07-21T01:47:34Z

This pr comes from #20345 (comment). Could you check that comment? IIUC we cannot enable StarSchemaDetection.reorderStarJoins now.

HyukjinKwon · 2019-07-26T05:43:10Z

@wzhfy @maryannxue do you have any comment on this PR?

SparkQA · 2019-11-21T13:15:25Z

Test build #114231 has finished for PR 21668 at commit 8038f1b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-21T17:19:35Z

Test build #114244 has finished for PR 21668 at commit 897163c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-11-21T22:10:30Z

Could you check this? @cloud-fan

dongjoon-hyun · 2019-11-22T00:20:11Z

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

-      withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
+      withSQLConf(
+          SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString,
+        SQLConf.PLAN_STATS_ENABLED.key -> "false") {


indentation?

dongjoon-hyun · 2019-11-22T00:22:55Z

Hi, @maropu .
I'm wondering if #20345 supersedes this PR.
It was the original @cloud-fan 's suggestion (#21668 (comment)), and it seems that you created #20345 for that. Do we still need this if we have #20345 ?

maropu · 2019-11-23T00:19:31Z

Ah, thanks for the comment, @dongjoon-hyun! To be honest, I forgot the comment above... (thanks for reminding me).

On second thought, yea, I personally think that this PR is still worth a try. Currently, on master, spark.sql.cbo.enabled=true directly means the cost-based join reorder + BasicStatsPlanVisitor. Recently, new features (e.g., dynamic partition pruning) have come to depend on LogicalPlanVisitor[Statistics]. To use dynamic partition pruning + BasicStatsPlanVisitor, we need to set spark.sql.cbo.enabled=true. But this also activates the cost-based join reorder.

I think how to collect data stats (BasicStatsPlanVisitor or SizeInBytesOnlyStatsPlanVisitor) is orthogonal to the join reorder logic, and it would be better to be able to turn them on/off individually.

What I propose is the two things as follows;

Add a new config to control how to collect data stats (this pr)
Since the name of spark.sql.cbo.enabled is ambiguous, rename it to spark.sql.cbo.joinReorder.enabled
If dynamic partition pruning is one of the CBO features, rename spark.sql.optimizer.dynamicPartitionPruning.enabled to spark.sql.cbo.dynamicPartitionPruning.enabled?

WDYT? @cloud-fan @dongjoon-hyun

(off-topic: I personally think CBO is one of optimizer features, so better to move spark.sql.cbo.enabled to spark.sql.optimizer.cbo.enabled?)
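
To illustrate the orthogonality argument in the comment above, here is a minimal sketch contrasting one coupled flag with two independent ones. It is a simplified model, not Spark's implementation: `CoupledConf`, `SplitConf`, and the `StatsVisitor` objects are hypothetical stand-ins, and the coupled behavior is taken from the comment's own description of spark.sql.cbo.enabled.

```scala
object StatsConfSketch {
  sealed trait StatsVisitor
  case object BasicStats extends StatsVisitor      // stands in for BasicStatsPlanVisitor
  case object SizeInBytesOnly extends StatsVisitor // stands in for SizeInBytesOnlyStatsPlanVisitor

  // Coupled: a single flag decides both the stats visitor and join reorder,
  // as the comment describes for spark.sql.cbo.enabled.
  final case class CoupledConf(cboEnabled: Boolean) {
    def statsVisitor: StatsVisitor = if (cboEnabled) BasicStats else SizeInBytesOnly
    def joinReorder: Boolean = cboEnabled
  }

  // Split: two hypothetical flags, so each decision can be toggled on its own.
  final case class SplitConf(planStatsEnabled: Boolean, joinReorderEnabled: Boolean) {
    def statsVisitor: StatsVisitor = if (planStatsEnabled) BasicStats else SizeInBytesOnly
    def joinReorder: Boolean = joinReorderEnabled
  }
}
```

In the coupled model, asking for the richer statistics always switches join reorder on as well; in the split model, `SplitConf(planStatsEnabled = true, joinReorderEnabled = false)` expresses the combination the comment wants for features such as dynamic partition pruning.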

dongjoon-hyun · 2019-11-23T18:51:43Z

@maropu . I agree with you because this PR aims at a simple, clear idea that is better than what we have now.
For the other comments (@gatorsmile , @cloud-fan ), I believe we can adjust inside that PR after merging this because there is no ETA for them.

dongjoon-hyun · 2019-11-23T18:52:54Z

Could you address #21668 (comment) , too? If there is no other feedback here, that is the only (nit) blocker for me. :)

maropu · 2019-11-24T00:27:44Z

oh, I missed your comment, thanks!

dongjoon-hyun · 2019-11-24T03:05:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

@@ -634,7 +634,7 @@ case class HiveTableRelation(
  )

  override def computeStats(): Statistics = {
-    tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled))
+    tableMeta.stats.map(_.toPlanStats(output, conf.planStatsEnabled))


Oh, @maropu . I'm wondering if the following is better. If someone already has cboEnabled=true, this protects them from a potential regression caused by the new option, because the default value of the new option is false. What do you think about that?

-    tableMeta.stats.map(_.toPlanStats(output, conf.planStatsEnabled))
+    tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled))

Ah, I see. It looks reasonable to me, and I'll update.
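
The effect of the suggested `conf.cboEnabled || conf.planStatsEnabled` fallback can be checked with a small, self-contained model. This is only a sketch under assumptions: `Statistics`, `CatalogStatistics`, and `Conf` here are simplified stand-ins defined locally, and this `toPlanStats` omits the `output` argument that the real method in the diff takes.

```scala
object PlanStatsFallbackSketch {
  // Simplified stand-ins; the real Spark classes carry more fields.
  final case class Statistics(sizeInBytes: Long, rowCount: Option[Long] = None)

  final case class CatalogStatistics(sizeInBytes: Long, rowCount: Option[Long] = None) {
    // Propagate rowCount only when the flag is on; otherwise keep size only.
    def toPlanStats(planStatsEnabled: Boolean): Statistics =
      if (planStatsEnabled && rowCount.isDefined) Statistics(sizeInBytes, rowCount)
      else Statistics(sizeInBytes)
  }

  // Hypothetical config holder with the two flags discussed in the review.
  final case class Conf(cboEnabled: Boolean, planStatsEnabled: Boolean)

  // With the `||`, users who already set cboEnabled=true keep getting rowCount
  // even though the new option defaults to false.
  def computeStats(catalogStats: CatalogStatistics, conf: Conf): Statistics =
    catalogStats.toPlanStats(conf.cboEnabled || conf.planStatsEnabled)
}
```

Under this model, rowCount is dropped only when both flags are off, which is the regression guard the review comment asks for.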

SparkQA · 2019-11-24T04:28:19Z

Test build #114326 has finished for PR 21668 at commit 5221c94.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-11-24T06:02:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala

@@ -41,7 +41,7 @@ case class LogicalRelation(

  override def computeStats(): Statistics = {
    catalogTable
-      .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled)))
+      .flatMap(_.stats.map(_.toPlanStats(output, conf.planStatsEnabled)))


Oops. Maybe, this is another instance for the following?

-      .flatMap(_.stats.map(_.toPlanStats(output, conf.planStatsEnabled)))
+      .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled)))

dongjoon-hyun · 2019-11-24T06:04:42Z

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala

@@ -354,7 +354,7 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
    assert(catalogTable.stats.get.colStats == Map("c1" -> emptyCatalogColStat))

    // Check relation statistics
-    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") {


This change can be reverted from this PR.

dongjoon-hyun · 2019-11-24T06:05:41Z

...core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala

@@ -505,7 +505,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSparkSession {
    Seq("orc", "").foreach { useV1SourceReaderList =>
      // This test case depends on the size of ORC in statistics.
      withSQLConf(
-        SQLConf.CBO_ENABLED.key -> "true",
+        SQLConf.PLAN_STATS_ENABLED.key -> "true",


This one also can be reverted from this PR.

dongjoon-hyun · 2019-11-24T06:06:13Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

          val relationStats = spark.table(tbl).queryExecution.optimizedPlan.stats
          assert(relationStats.sizeInBytes == catalogStats.sizeInBytes)
          assert(relationStats.rowCount.isEmpty)
        }
        spark.sessionState.catalog.refreshTable(TableIdentifier(tbl))
-        withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+        withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") {


This one also can be reverted.

dongjoon-hyun · 2019-11-24T06:06:24Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala

@@ -42,7 +42,7 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
    checkKeywordsNotExist(sql(explainCostCommand),
      "Parsed Logical Plan", "Analyzed Logical Plan")

-    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+    withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") {


This one can be reverted.

maropu · 2019-11-24T07:52:57Z

ok, @dongjoon-hyun, all the comments addressed.

SparkQA · 2019-11-24T08:05:01Z

Test build #114335 has finished for PR 21668 at commit 21222f0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-24T08:05:02Z

Test build #114336 has finished for PR 21668 at commit bd26ce7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-11-24T08:05:28Z

retest this please

SparkQA · 2019-11-24T11:54:41Z

Test build #114337 has finished for PR 21668 at commit bd26ce7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Merged to master.
Thank you, @maropu . This is an improvement. If there is a refactoring, it should keep and extend this improvement at least.

cc @gatorsmile and @cloud-fan .

HyukjinKwon

LGTM too

gatorsmile reviewed Jul 2, 2018

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

maropu force-pushed the PlanStatsConf branch from f0db73b to 0b1f751 Compare July 21, 2018 08:20

dongjoon-hyun added the SQL label Jun 14, 2019

maropu force-pushed the PlanStatsConf branch from 0b1f751 to fabd8ee Compare July 19, 2019 06:22

Fix

e412d23

maropu force-pushed the PlanStatsConf branch from fabd8ee to d9b7051 Compare November 21, 2019 01:01

Fix

897163c

dongjoon-hyun changed the title ~~[SPARK-24690][SQL] Add a new config to control plan stats computation in LogicalRelation~~ [SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation Nov 22, 2019

dongjoon-hyun reviewed Nov 22, 2019

Fix

5221c94

dongjoon-hyun reviewed Nov 24, 2019

Fix

21222f0

dongjoon-hyun reviewed Nov 24, 2019

Fix

bd26ce7

dongjoon-hyun approved these changes Nov 24, 2019

dongjoon-hyun closed this in 3f3a18f Nov 24, 2019

HyukjinKwon reviewed Nov 24, 2019
