
[SPARK-22745][SQL] read partition stats from Hive#19932

Closed
wzhfy wants to merge 4 commits into apache:master from wzhfy:read_hive_partition_stats

Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Dec 9, 2017

What changes were proposed in this pull request?

Currently Spark can read table stats (e.g. totalSize, numRows) from Hive; we can also support reading partition stats from Hive using the same logic.

How was this patch tested?

Added a new test case and modified an existing test case.
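For reference, a minimal self-contained sketch of the shared stats-reading logic described above (an assumption pieced together from the snippets quoted later in this review, with CatalogStatistics simplified; not the exact merged code):

```scala
// Hive stores stats as string-valued properties on a table and, in the same
// way, on each partition, so one parser can serve both. The key names mirror
// Hive's StatsSetupConst constants; CatalogStatistics is simplified here.
case class CatalogStatistics(sizeInBytes: BigInt, rowCount: Option[BigInt])

def readHiveStats(properties: Map[String, String]): Option[CatalogStatistics] = {
  val totalSize = properties.get("totalSize").map(BigInt(_))
  val rawDataSize = properties.get("rawDataSize").map(BigInt(_))
  val rowCount = properties.get("numRows").map(BigInt(_))
  if (totalSize.isDefined && totalSize.get > 0) {
    Some(CatalogStatistics(totalSize.get, rowCount.filter(_ > 0)))
  } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
    Some(CatalogStatistics(rawDataSize.get, rowCount.filter(_ > 0)))
  } else {
    // A non-positive size means the stats are missing or invalid; report none.
    None
  }
}

// Partition parameters as Hive might record them right after an insert:
// a size is known but no row count has been computed yet.
assert(readHiveStats(Map("totalSize" -> "2000")) ==
  Some(CatalogStatistics(BigInt(2000), None)))
```

The precedence (totalSize first, then rawDataSize) follows the order in which the quoted snippets read the properties; treat it as an assumption about the final code.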

@SparkQA

SparkQA commented Dec 9, 2017

Test build #84680 has finished for PR 19932 at commit 48b81b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Dec 9, 2017

cc @cloud-fan @gatorsmile


// Here we are reading statistics from Hive.
// Note that these statistics could be overridden by Spark's statistics if available.
val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
Contributor Author

@wzhfy wzhfy Dec 9, 2017


The code below is moved to a new method readHiveStats

}
}

test("SPARK- - read Hive's statistics for partition") {
Member


SPARK- -> SPARK-22745?

Contributor Author


oh, I forgot it, thanks!

assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000)
assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000)
assert(queryStats("2010-01-02", "10") === None)
assert(queryStats("2010-01-02", "11") === None)
Contributor Author


After the change, these checks are no longer right since we read Hive stats, so I removed them.

@SparkQA

SparkQA commented Dec 10, 2017

Test build #84689 has finished for PR 19932 at commit 09a7c05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
// TODO: check if this estimate is valid for tables after partition pruning.
Contributor


do we still need this TODO?

Contributor Author


good catch, we can remove this

// TODO: still fill the rowCount even if sizeInBytes is empty. Might break anything?
None
}
val hiveStats = readHiveStats(properties)
Contributor


nit: we can inline it

def fromHivePartition(hp: HivePartition): CatalogTablePartition = {
val apiPartition = hp.getTPartition
val properties: Map[String, String] =
if (hp.getParameters != null) hp.getParameters.asScala.toMap else Map.empty
Contributor


nit: if can't fit in one line, prefer

val xxx = if {
  ...
} else {
  ...
}

partition = spark.sessionState.catalog
.getPartition(TableIdentifier(tableName), Map("ds" -> "2017-01-01"))

assert(partition.stats.get.sizeInBytes == 5812)
Contributor


I'm expecting totalSize to be picked up here and sizeInBytes to change; did I miss something?

Contributor Author


totalSize already exists after the INSERT INTO command, so here sizeInBytes doesn't change after the ANALYZE command; only rowCount is added.
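To make the sequencing concrete, a hypothetical sketch (table and helper names are illustrative, in the style of the test suite, not the actual test):

```scala
// INSERT makes Hive record totalSize in the partition's parameters, so Spark
// already sees a sizeInBytes for this partition at this point.
sql("INSERT INTO TABLE t PARTITION (ds = '2017-01-01') SELECT * FROM src")

// ANALYZE then computes and adds rowCount; sizeInBytes keeps the totalSize
// value recorded at insert time, which is why the assertion above expects
// the same size before and after.
sql("ANALYZE TABLE t PARTITION (ds = '2017-01-01') COMPUTE STATISTICS")
```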

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84819 has finished for PR 19932 at commit b80c8f3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84829 has finished for PR 19932 at commit b80c8f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 7453ab0 Dec 13, 2017