-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation #21668
Conversation
This comes from #20345. |
Test build #92460 has finished for PR 21668 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
Show resolved
Hide resolved
yea this is a real problem, but I feel a better solution is to integrate the StarSchemaDetection into CBO. How hard will it be? |
yea, ok. I'll recheck this again. Thanks! |
One of refactoring ideas is to inject the functionality of In the batch rule Currently, we have @cloud-fan WDYT? |
sounds reasonable, also cc @wzhfy @maryannxue |
@cloud-fan If no problem, could you check #20345 and merge it first? Based on that, I'd like to start refactoring for the approach. |
ping |
Test build #93382 has finished for PR 21668 at commit
|
@cloud-fan ping |
Test build #105617 has finished for PR 21668 at commit
|
Test build #107408 has finished for PR 21668 at commit
|
retest this please |
Seems fine to me too |
thx for your response, @HyukjinKwon |
Test build #107881 has finished for PR 21668 at commit
|
will fix in hours. |
Test build #107887 has finished for PR 21668 at commit
|
retest this please |
Test build #107888 has finished for PR 21668 at commit
|
This pr comes from #20345 (comment). Could you check that comment? IIUC we cannot enable StarSchemaDetection.reorderStarJoins now. |
@wzhfy @maryannxue do you have any comment on this PR? |
fabd8ee
to
d9b7051
Compare
Test build #114231 has finished for PR 21668 at commit
|
Test build #114244 has finished for PR 21668 at commit
|
Could you check this? @cloud-fan |
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) { | ||
withSQLConf( | ||
SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString, | ||
SQLConf.PLAN_STATS_ENABLED.key -> "false") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation?
Hi, @maropu . |
Ah, thanks for the comment, @dongjoon-hyun! To be honest, I forgot the comment above... (thanks for reminding me). On second thoughts, yea, I personally think that this pr is still worth a try. Currently, in the master, I think how to collect data stats ( What I propose is the two things as follows;
WDYT? @cloud-fan @dongjoon-hyun (off-topic: I personally think CBO is one of optimizer features, so better to move |
@maropu . I agree with you because this PR aims the simple clear idea which is better than now. |
Could you address #21668 (comment) , too? If there is no other feedbacks here, that is the only (nit) blocker for me. :) |
oh, I missed you comment, thanks! |
@@ -634,7 +634,7 @@ case class HiveTableRelation( | |||
) | |||
|
|||
override def computeStats(): Statistics = { | |||
tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled)) | |||
tableMeta.stats.map(_.toPlanStats(output, conf.planStatsEnabled)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, @maropu . I'm wondering if the following is better. If someone already is cboEnabled=true
, this will protect the potential regression due to the new option because the new default value of new option is false
. How do you think about that?
- tableMeta.stats.map(_.toPlanStats(output, conf.planStatsEnabled))
+ tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. It looks resonable to me, and I'll update.
Test build #114326 has finished for PR 21668 at commit
|
@@ -41,7 +41,7 @@ case class LogicalRelation( | |||
|
|||
override def computeStats(): Statistics = { | |||
catalogTable | |||
.flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled))) | |||
.flatMap(_.stats.map(_.toPlanStats(output, conf.planStatsEnabled))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Maybe, this is another instance for the following?
- .flatMap(_.stats.map(_.toPlanStats(output, conf.planStatsEnabled)))
+ .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled)))
@@ -354,7 +354,7 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils | |||
assert(catalogTable.stats.get.colStats == Map("c1" -> emptyCatalogColStat)) | |||
|
|||
// Check relation statistics | |||
withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { | |||
withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can be reverted from this PR.
@@ -505,7 +505,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSparkSession { | |||
Seq("orc", "").foreach { useV1SourceReaderList => | |||
// This test case depends on the size of ORC in statistics. | |||
withSQLConf( | |||
SQLConf.CBO_ENABLED.key -> "true", | |||
SQLConf.PLAN_STATS_ENABLED.key -> "true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one also can be reverted from this PR.
val relationStats = spark.table(tbl).queryExecution.optimizedPlan.stats | ||
assert(relationStats.sizeInBytes == catalogStats.sizeInBytes) | ||
assert(relationStats.rowCount.isEmpty) | ||
} | ||
spark.sessionState.catalog.refreshTable(TableIdentifier(tbl)) | ||
withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { | ||
withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one also can be reverted.
@@ -42,7 +42,7 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto | |||
checkKeywordsNotExist(sql(explainCostCommand), | |||
"Parsed Logical Plan", "Analyzed Logical Plan") | |||
|
|||
withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { | |||
withSQLConf(SQLConf.CBO_ENABLED.key -> "true", SQLConf.PLAN_STATS_ENABLED.key -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one can be reverted.
ok, @dongjoon-hyun, all the comments addressed. |
Test build #114335 has finished for PR 21668 at commit
|
Test build #114336 has finished for PR 21668 at commit
|
retest this please |
Test build #114337 has finished for PR 21668 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master.
Thank you, @maropu . This is an improvement. If there is a refactoring, it should keep and extend this improvement at least.
cc @gatorsmile and @cloud-fan .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too
What changes were proposed in this pull request?
This pr proposes a new independent config so that
LogicalRelation
could userowCount
to compute data statistics in logical plans even if CBO disabled. In the master, we currently cannot enableStarSchemaDetection.reorderStarJoins
because we need to turn off CBO to enable it butStarSchemaDetection
internally references therowCount
that is used in LogicalRelation if CBO disabled.Why are the changes needed?
Plan stats are pretty useful other than CBO, e.g., star-schema detector and dynamic partition pruning.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added tests in
DataFrameJoinSuite
.