[SPARK-31877][SQL]Avoid stats computation for Hive table #28686
Conversation
@viirya @dongjoon-hyun @maropu @gatorsmile Can any of you help review the change?
ok to test
Looks okay to me.
Test build #123362 has finished for PR 28686 at commit
Test build #123404 has finished for PR 28686 at commit
Test build #123418 has finished for PR 28686 at commit
Failure doesn't look genuine to me.
retest this please
Test build #123443 has finished for PR 28686 at commit
@@ -654,7 +654,7 @@ case class HiveTableRelation(
     tableMeta: CatalogTable,
     dataCols: Seq[AttributeReference],
     partitionCols: Seq[AttributeReference],
-    tableStats: Option[Statistics] = None,
+    tableStats: Option[Statistics] = Some(Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes)),
We still need to use Option?
Test build #123456 has finished for PR 28686 at commit
Test build #123457 has finished for PR 28686 at commit
@@ -654,7 +654,8 @@ case class HiveTableRelation(
     tableMeta: CatalogTable,
     dataCols: Seq[AttributeReference],
     partitionCols: Seq[AttributeReference],
-    tableStats: Option[Statistics] = None,
+    tableStats: Option[Statistics] = Option(Statistics(sizeInBytes
+      = SQLConf.get.defaultSizeInBytes)),
I meant tableStats: Statistics = Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes),
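The contrast between the diff above and this suggestion can be sketched with simplified stand-in types (the one-field Statistics below is a toy, not Spark's; the relation classes are invented here purely to show the two signatures side by side):

```scala
// Sketch only: a toy Statistics so both signatures compile standalone.
case class Statistics(sizeInBytes: BigInt)

// As in the current diff: the Option wrapper stays, but the default is
// always Some(...), so the None case no longer carries information.
case class RelOptionDefault(
    tableStats: Option[Statistics] = Some(Statistics(sizeInBytes = Long.MaxValue)))

// As the reviewer suggests: if the default is never None, the Option can
// be dropped and callers read tableStats directly, without unwrapping.
case class RelPlainDefault(
    tableStats: Statistics = Statistics(sizeInBytes = Long.MaxValue))
```

The design point is that an `Option` whose default is always `Some` forces every call site to handle a `None` that can no longer occur.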
Test build #123495 has finished for PR 28686 at commit
RelationConversions(conf, catalog) +:
new DetermineTableStats(session) +:
DetermineTableStats updates statistics in HiveTableRelation for some cases. Updated statistics will be propagated into HadoopFsRelation, and so the sizeInBytes depends on it. As you change the order of DetermineTableStats to after RelationConversions, it could change the sizeInBytes of the converted HadoopFsRelation.
That said, if the user is willing to use fallBackToHdfsForStatsEnabled to calculate table size, this change will make it not work as before.
@viirya Hive tables are converted to HiveTableScanExec instead of LogicalRelation (which in turn uses HadoopFsRelation). This happens in org.apache.spark.sql.hive.HiveStrategies.HiveTableScans. It will not affect HadoopFsRelation, I think. Let me know if I am missing something.
Note: conversion to HiveTableScanExec happens for Hive tables with any format other than Parquet and ORC. In the case of ORC and Parquet, it happens when the flag to convert to a datasource table is disabled.
I mean previously DetermineTableStats would calculate the table size and update it in HiveTableRelation before RelationConversions ran. Then in RelationConversions, this calculated table size would be propagated into HadoopFsRelation and used by sizeInBytes.
Now, as you change the rule order, when running RelationConversions, even if users enable fallBackToHdfsForStatsEnabled, the HadoopFsRelation won't get the table size calculated in DetermineTableStats (because this rule is run after RelationConversions now).
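This ordering concern can be illustrated with a toy model (the Rel type, the 1234-byte fallback size, and the two functions below are invented for illustration; the real rules operate on Catalyst logical plans):

```scala
// Toy model of the two rule orders; not Spark code.
case class Rel(sizeInBytes: BigInt, converted: Boolean = false)

// Stand-in for DetermineTableStats: fills in an HDFS-derived size, but only
// while the relation is still a Hive relation (not yet converted).
def determineTableStats(r: Rel): Rel =
  if (!r.converted) r.copy(sizeInBytes = BigInt(1234)) else r

// Stand-in for RelationConversions: converts, carrying stats along as-is.
def relationConversions(r: Rel): Rel = r.copy(converted = true)

val default = BigInt(Long.MaxValue)
// Old order: stats first, then convert -> converted relation sees 1234.
val oldOrder = relationConversions(determineTableStats(Rel(default)))
// New order: convert first -> the stats rule no longer matches, so the
// converted relation keeps the default size estimate.
val newOrder = determineTableStats(relationConversions(Rel(default)))
// oldOrder.sizeInBytes == BigInt(1234); newOrder.sizeInBytes == default
```

In this sketch the reordering silently changes the size estimate the optimizer would see, which is exactly the regression being described.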
@viirya Thanks for catching this, I think this re-order will not be useful. I will decline this pull request.
Updated statistics will be propagated into HadoopFsRelation and so the sizeInBytes depends on it.
Ah, I see. I missed that code flow.
And I think DetermineTableStats doesn't always calculate the table size. It is controlled by fallBackToHdfsForStatsEnabled and only applies to non-partitioned tables. If the user wants to avoid the computation, fallBackToHdfsForStatsEnabled can be disabled for that.
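For reference, the flag discussed here is exposed to users as the spark.sql.statistics.fallBackToHdfs SQL conf (the internal name fallBackToHdfsForStatsEnabled is the SQLConf accessor); a user who wants to skip the HDFS size scan can turn it off, e.g.:

```scala
// Assumes a live SparkSession named `spark`; the config key is the one
// backing fallBackToHdfsForStatsEnabled in SQLConf.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "false")
```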
Test build #123497 has finished for PR 28686 at commit
@viirya @maropu To add to how this change is useful, I took the example of the q17.sql TPCDS query at scale 1000, on non-partitioned data.
The time is also pretty high due to SPARK-31850. The stats computed are not used and can be avoided completely.
@@ -171,6 +171,32 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA
     testDropTable(isDatasourceTable = false)
   }

+  test("DetermineTableStats should not cause any plan changes" +
Test build #124814 has finished for PR 28686 at commit
Test build #124813 has finished for PR 28686 at commit
Test build #124819 has finished for PR 28686 at commit
Retest this please.
Test build #125134 has finished for PR 28686 at commit
@dongjoon-hyun The error doesn't seem to be related to the change. Can you take a look, and if it's intermittent, can we re-trigger the tests?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
As part of the DetermineTableStats rule, we compute the stats for a HiveTableRelation, which can be an expensive operation, and it could happen multiple times for a query (SPARK-31850).
In most cases (the flag for converting Parquet and ORC tables to datasource tables is enabled by default in the master branch), the RelationConversions rule converts the HiveTableRelation to a LogicalRelation.
When the conversion happens, the stats computed for the HiveTableRelation do not get used.
In this change, stats computation is avoided by performing the conversion before computing stats.
Why are the changes needed?
With the change, stats for Hive table will not be computed unnecessarily.
Does this PR introduce any user-facing change?
No
How was this patch tested?
It was tested on a local machine and the behaviour was verified.