Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation #21668

Closed
wants to merge 6 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Jun 29, 2018

What changes were proposed in this pull request?

This pr proposes a new independent config so that LogicalRelation could use rowCount to compute data statistics in logical plans even if CBO disabled. In the master, we currently cannot enable StarSchemaDetection.reorderStarJoins because we need to turn off CBO to enable it but StarSchemaDetection internally references the rowCount that is used in LogicalRelation if CBO disabled.

Why are the changes needed?

Plan stats are pretty useful other than CBO, e.g., star-schema detector and dynamic partition pruning.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests in DataFrameJoinSuite.

@maropu
Copy link
Member Author

maropu commented Jun 29, 2018

This comes from #20345.

@SparkQA
Copy link

SparkQA commented Jun 29, 2018

Test build #92460 has finished for PR 21668 at commit f0db73b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

yea this is a real problem, but I feel a better solution is to integrate the StarSchemaDetection into CBO. How hard will it be?

@maropu
Copy link
Member Author

maropu commented Jul 3, 2018

yea, ok. I'll recheck this again. Thanks!

@maropu
Copy link
Member Author

maropu commented Jul 4, 2018

One of refactoring ideas is to inject the functionality of ReorderJoin(=StarSchemaDetection)
into CostBasedJoinReorder;

In the batch rule Join Reorder (Once strategy), if spark.sql.cbo.starSchemaDetection enabled (false by default), the rule applies star schema detection first. If a fact table found, dimension tables are reordered by the cost-based algorithm. If spark.sql.cbo.starSchemaDetection disabled, the rule just uses CostBasedJoinReorder.

Currently, we have ReorderJoin(=StarSchemaDetection) in the batch rule with fixedPoint strategy,
so, I thnk that, if we could remove this rule from there, we would skip unnecessary checks caused by ReorderJoin per rule iteration.

@cloud-fan WDYT?

@cloud-fan
Copy link
Contributor

sounds reasonable, also cc @wzhfy @maryannxue

@maropu
Copy link
Member Author

maropu commented Jul 12, 2018

@cloud-fan If no problem, could you check #20345 and merge it first? Based on that, I'd like to start refactoring for the approach.

@maropu
Copy link
Member Author

maropu commented Jul 21, 2018

ping

@SparkQA
Copy link

SparkQA commented Jul 21, 2018

Test build #93382 has finished for PR 21668 at commit 0b1f751.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Aug 3, 2018

@cloud-fan ping

@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105617 has finished for PR 21668 at commit 0b1f751.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 9, 2019

Test build #107408 has finished for PR 21668 at commit 0b1f751.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@HyukjinKwon
Copy link
Member

Seems fine to me too

@maropu
Copy link
Member Author

maropu commented Jul 19, 2019

thx for your response, @HyukjinKwon

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107881 has finished for PR 21668 at commit 0b1f751.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jul 19, 2019

will fix in hours.

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107887 has finished for PR 21668 at commit fabd8ee.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jul 19, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107888 has finished for PR 21668 at commit fabd8ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jul 21, 2019

ping @dongjoon-hyun @HyukjinKwon

@dongjoon-hyun
Copy link
Member

@maropu . What is the relationship with #20345? Do you want to go without that?

@maropu
Copy link
Member Author

maropu commented Jul 21, 2019

This pr comes from #20345 (comment). Could you check that comment? IIUC we cannot enable StarSchemaDetection.reorderStarJoins now.

@HyukjinKwon
Copy link
Member

@wzhfy @maryannxue do you have any comment on this PR?

@SparkQA
Copy link

SparkQA commented Nov 21, 2019

Test build #114231 has finished for PR 21668 at commit 8038f1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2019

Test build #114244 has finished for PR 21668 at commit 897163c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Nov 21, 2019

Could you check this? @cloud-fan

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-24690][SQL] Add a new config to control plan stats computation in LogicalRelation [SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation Nov 22, 2019
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
withSQLConf(
SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString,
SQLConf.PLAN_STATS_ENABLED.key -> "false") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation?

@dongjoon-hyun
Copy link
Member

Hi, @maropu .
I'm wondering if #20345 supersedes this PR.
It was the original @cloud-fan 's suggestion (#21668 (comment)), and it seems that you created #20345 for that. Do we still need this if we have #20345 ?

@maropu
Copy link
Member Author

maropu commented Nov 23, 2019

Ah, thanks for the comment, @dongjoon-hyun! To be honest, I forgot the comment above... (thanks for reminding me).

On second thoughts, yea, I personally think that this pr is still worth a try. Currently, in the master, spark.sql.cbo.enabled=true directly means the cost-based join reorder + BasicStatsPlanVisitor. Recently, the new features (e.g., the dynamic part pruning) depend on LogicalPlanVisitor[Statistics] . To use the dynamic part pruning + BasicStatsPlanVisitor, we need to set spark.sql.cbo.enabled=true. But, this also activates the cost-based join reorder.

I think how to collect data stats (BasicStatsPlanVisitor or SizeInBytesOnlyStatsPlanVisitor) is orthogonal to join reorder logics and it'd better to be able to turn on/off them individually.

What I propose is the two things as follows;

  • Add a new config to control how to collect data stats (this pr)
  • Since the name of spark.sql.cbo.enabled is ambiguous, rename it to spark.sql.cbo.joinReorder.enabled
  • If the dynamic part pruning is one of CBO features, rename spark.sql.optimizer.dynamicPartitionPruning.enabled to spark.sql.cbo.dynamicPartitionPruning.enabled?

WDYT? @cloud-fan @dongjoon-hyun

(off-topic: I personally think CBO is one of optimizer features, so better to move spark.sql.cbo.enabled to spark.sql.optimizer.cbo.enabled?)

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Nov 23, 2019

@maropu . I agree with you because this PR aims the simple clear idea which is better than now.
For the other comments (@gatorsmile , @cloud-fan ), I believe we can adjust inside that PR after merging this because there is no ETA for them.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Nov 23, 2019

Could you address #21668 (comment) , too? If there is no other feedbacks here, that is the only (nit) blocker for me. :)

@maropu
Copy link
Member Author

maropu commented Nov 24, 2019

oh, I missed you comment, thanks!

@@ -634,7 +634,7 @@ case class HiveTableRelation(
)

override def computeStats(): Statistics = {
tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled))
tableMeta.stats.map(_.toPlanStats(output, conf.planStatsEnabled))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, @maropu . I'm wondering if the following is better. If someone already is cboEnabled=true, this will protect the potential regression due to the new option because the new default value of new option is false. How do you think about that?

- tableMeta.stats.map(_.toPlanStats(output, conf.planStatsEnabled))
+ tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. It looks resonable to me, and I'll update.

@SparkQA
Copy link

SparkQA commented Nov 24, 2019

Test build #114326 has finished for PR 21668 at commit 5221c94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -41,7 +41,7 @@ case class LogicalRelation(

override def computeStats(): Statistics = {
catalogTable
.flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled)))
.flatMap(_.stats.map(_.toPlanStats(output, conf.planStatsEnabled)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Maybe, this is another instance for the following?

-  .flatMap(_.stats.map(_.toPlanStats(output, conf.planStatsEnabled)))
+  .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled)))

@@ -354,7 +354,7 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
assert(catalogTable.stats.get.colStats == Map("c1" -> emptyCatalogColStat))

// Check relation statistics
withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change can be reverted from this PR.

@@ -505,7 +505,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSparkSession {
Seq("orc", "").foreach { useV1SourceReaderList =>
// This test case depends on the size of ORC in statistics.
withSQLConf(
SQLConf.CBO_ENABLED.key -> "true",
SQLConf.PLAN_STATS_ENABLED.key -> "true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one also can be reverted from this PR.

val relationStats = spark.table(tbl).queryExecution.optimizedPlan.stats
assert(relationStats.sizeInBytes == catalogStats.sizeInBytes)
assert(relationStats.rowCount.isEmpty)
}
spark.sessionState.catalog.refreshTable(TableIdentifier(tbl))
withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one also can be reverted.

@@ -42,7 +42,7 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
checkKeywordsNotExist(sql(explainCostCommand),
"Parsed Logical Plan", "Analyzed Logical Plan")

withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can be reverted.

@maropu
Copy link
Member Author

maropu commented Nov 24, 2019

ok, @dongjoon-hyun, all the comments addressed.

@SparkQA
Copy link

SparkQA commented Nov 24, 2019

Test build #114335 has finished for PR 21668 at commit 21222f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 24, 2019

Test build #114336 has finished for PR 21668 at commit bd26ce7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Nov 24, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Nov 24, 2019

Test build #114337 has finished for PR 21668 at commit bd26ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.
Thank you, @maropu . This is an improvement. If there is a refactoring, it should keep and extend this improvement at least.

cc @gatorsmile and @cloud-fan .

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants