Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31877][SQL]Avoid stats computation for Hive table #28686

Closed
wants to merge 9 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -654,7 +654,7 @@ case class HiveTableRelation(
tableMeta: CatalogTable,
dataCols: Seq[AttributeReference],
partitionCols: Seq[AttributeReference],
tableStats: Option[Statistics] = None,
tableStats: Option[Statistics] = Some(Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to use Option?

@transient prunedPartitions: Option[Seq[CatalogTablePartition]] = None)
extends LeafNode with MultiInstanceRelation {
assert(tableMeta.identifier.database.isDefined)
Expand Down
Expand Up @@ -82,8 +82,8 @@ class HiveSessionStateBuilder(session: SparkSession, parentState: Option[Session

override val postHocResolutionRules: Seq[Rule[LogicalPlan]] =
new DetectAmbiguousSelfJoin(conf) +:
new DetermineTableStats(session) +:
RelationConversions(conf, catalog) +:
new DetermineTableStats(session) +:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DetermineTableStats updates statistics in HiveTableRelation for some cases. Updated statistics will be propagated into HadoopFsRelation and so the sizeInBytes depends on it.

As you change the order of DetermineTableStats to after RelationConversions, it could change the sizeInBytes of converted HadoopFsRelation.

That said, if the user is willing to use fallBackToHdfsForStatsEnabled to calculate table size, this change will make it not work as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Hive tables are converted To HiveTableScanExec instead of Logical Relation(which inturn uses HadoopFSRelation) . This happens in org.apache.spark.sql.hive.HiveStrategies.HiveTableScans . It will not affect HadoopFSRelation I think. Let me know if I am missing some thing
Note: conversion to HiveTableScanExec happen for Hive table with any formats other than Parquet and Orc.
In case or Orc and Parquet, it happens when the flag to convert to datasource table is disabled.

Copy link
Member

@viirya viirya Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean previously DetermineTableStats will calculate table size and update in HiveTableRelation before RelationConversions. Then in RelationConversions, this calculated table size will be propagated into HadoopFsRelation and used by sizeInBytes.

Now as you change the rule order, when running RelationConversions, even users enable fallBackToHdfsForStatsEnabled, the HadoopFsRelation won't get the table size calculated in DetermineTableStats (because this rule is run after RelationConversions now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Thanks for catching this, I think this re-order will not be useful. I will decline this pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated statistics will be propagated into HadoopFsRelation and so the sizeInBytes depends on it.

Ah, I see. I missed that code flow.

PreprocessTableCreation(session) +:
PreprocessTableInsertion(conf) +:
DataSourceAnalysis(conf) +:
Expand Down
Expand Up @@ -62,13 +62,15 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
}

// No statistics information if "cost" is not specified
checkKeywordsNotExist(sql("EXPLAIN SELECT * FROM src "), "sizeInBytes", "rowCount")
checkKeywordsExist(sql("EXPLAIN SELECT * FROM src "), "sizeInBytes=8.0 EiB")
checkKeywordsNotExist(sql("EXPLAIN SELECT * FROM src "), "rowCount")
}

test("explain extended command") {
checkKeywordsExist(sql(" explain select * from src where key=123 "),
"== Physical Plan ==",
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
"Statistics(sizeInBytes=8.0 EiB)")

checkKeywordsNotExist(sql(" explain select * from src where key=123 "),
"== Parsed Logical Plan ==",
Expand All @@ -81,7 +83,6 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
"Type",
"Provider",
"Properties",
"Statistics",
"Location",
"Serde Library",
"InputFormat",
Expand Down