[SPARK-22745][SQL] read partition stats from Hive #19932
wzhfy wants to merge 4 commits into apache:master
Conversation
Test build #84680 has finished for PR 19932 at commit
// Here we are reading statistics from Hive.
// Note that this statistics could be overridden by Spark's statistics if that's available.
val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
The code below is moved into a new method, readHiveStats.
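For context, here is a minimal self-contained sketch of what such an extracted helper could look like. It assumes plain string keys ("totalSize", "rawDataSize", "numRows") in place of Hive's StatsSetupConst constants and returns the three optional values as a tuple; the PR's actual readHiveStats signature and return type may differ.

```scala
// Sketch only: stands in for the real helper in HiveClientImpl, which
// reads Hive-maintained statistics out of a table's or partition's
// parameter map. Keys here are assumptions mirroring StatsSetupConst.
object HiveStatsSketch {
  private val TOTAL_SIZE = "totalSize"
  private val RAW_DATA_SIZE = "rawDataSize"
  private val ROW_COUNT = "numRows"

  def readHiveStats(properties: Map[String, String])
      : (Option[BigInt], Option[BigInt], Option[BigInt]) = {
    // Each statistic is optional: Hive only writes these parameters
    // when the corresponding value has been computed.
    val totalSize = properties.get(TOTAL_SIZE).map(BigInt(_))
    val rawDataSize = properties.get(RAW_DATA_SIZE).map(BigInt(_))
    val rowCount = properties.get(ROW_COUNT).map(BigInt(_))
    (totalSize, rawDataSize, rowCount)
  }
}
```

Because the helper only depends on the parameter map, the same code path can serve both table-level and partition-level parameters.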
test("SPARK- - read Hive's statistics for partition") {
oh, I forgot it, thanks!
assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000)
assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000)
assert(queryStats("2010-01-02", "10") === None)
assert(queryStats("2010-01-02", "11") === None)
After the change, these checks are no longer right, since we now read Hive stats. So I removed them.
Test build #84689 has finished for PR 19932 at commit
val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
// TODO: check if this estimate is valid for tables after partition pruning.
do we still need this TODO?
good catch, we can remove this
// TODO: still fill the rowCount even if sizeInBytes is empty. Might break anything?
None
}
val hiveStats = readHiveStats(properties)
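The surrounding diff suggests the caller picks a size for sizeInBytes from the values the helper returns. A hedged sketch of that selection, assuming Spark's usual preference for totalSize over rawDataSize and skipping non-positive values (the function name here is illustrative, not the PR's code):

```scala
// Sketch of size selection: prefer Hive's totalSize, fall back to
// rawDataSize, and treat zero or negative values as "not available".
object SizeSketch {
  def chooseSizeInBytes(
      totalSize: Option[BigInt],
      rawDataSize: Option[BigInt]): Option[BigInt] =
    totalSize.filter(_ > 0).orElse(rawDataSize.filter(_ > 0))
}
```

With this shape, a partition whose totalSize parameter is 0 but whose rawDataSize is positive still gets a usable size estimate.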
def fromHivePartition(hp: HivePartition): CatalogTablePartition = {
  val apiPartition = hp.getTPartition
  val properties: Map[String, String] =
    if (hp.getParameters != null) hp.getParameters.asScala.toMap else Map.empty
nit: if it can't fit in one line, prefer

val xxx = if (...) {
  ...
} else {
  ...
}
partition = spark.sessionState.catalog
  .getPartition(TableIdentifier(tableName), Map("ds" -> "2017-01-01"))
assert(partition.stats.get.sizeInBytes == 5812)
I'm expecting totalSize to be picked here, so that sizeInBytes would change. Did I miss something?
totalSize already exists after the INSERT INTO command, so sizeInBytes doesn't change after the ANALYZE command; only rowCount is added.
LGTM
Test build #84819 has finished for PR 19932 at commit
retest this please
Test build #84829 has finished for PR 19932 at commit
thanks, merging to master!
What changes were proposed in this pull request?

Currently Spark can read table stats (e.g. totalSize, numRows) from Hive. We can also support reading partition stats from Hive, using the same logic.

How was this patch tested?

Added a new test case and modified an existing test case.
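Putting the pieces together, the end-to-end idea can be sketched without any Spark or Hive dependencies. This is a simplified illustration only: PartitionStats stands in for Spark's CatalogStatistics, the string keys mirror Hive's stats parameters, and the preference order (totalSize, then rawDataSize) is an assumption about how the size is chosen.

```scala
// Simplified stand-in for CatalogStatistics.
case class PartitionStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

object PartitionStatsSketch {
  // Turn a partition's parameter map (as stored by Hive) into optional
  // statistics, the same way table parameters are already handled.
  def statsFromParameters(params: Map[String, String]): Option[PartitionStats] = {
    val totalSize = params.get("totalSize").map(BigInt(_))
    val rawDataSize = params.get("rawDataSize").map(BigInt(_))
    val rowCount = params.get("numRows").map(BigInt(_))
    totalSize.filter(_ > 0)
      .orElse(rawDataSize.filter(_ > 0))
      .map(size => PartitionStats(size, rowCount.filter(_ >= 0)))
  }
}
```

A partition with no size parameters yields None, matching the tests above where partitions of "2010-01-02" have no stats until Hive writes them.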