
[SPARK-22745][SQL] read partition stats from Hive#19932

Closed
wzhfy wants to merge 4 commits into apache:master from wzhfy:read_hive_partition_stats

Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Dec 9, 2017

What changes were proposed in this pull request?

Currently Spark can read table stats (e.g. totalSize, numRows) from Hive; we can also support reading partition stats from Hive using the same logic.

How was this patch tested?

Added a new test case and modified an existing test case.
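For reference, a minimal self-contained sketch of the shared stats-reading logic described above (an assumption pieced together from the snippets quoted later in this review, with CatalogStatistics simplified; not the exact merged code):

```scala
// Hive stores stats as string-valued properties on a table and, in the same
// way, on each partition, so one parser can serve both. The key names mirror
// Hive's StatsSetupConst constants; CatalogStatistics is simplified here.
case class CatalogStatistics(sizeInBytes: BigInt, rowCount: Option[BigInt])

def readHiveStats(properties: Map[String, String]): Option[CatalogStatistics] = {
  val totalSize = properties.get("totalSize").map(BigInt(_))
  val rawDataSize = properties.get("rawDataSize").map(BigInt(_))
  val rowCount = properties.get("numRows").map(BigInt(_))
  if (totalSize.isDefined && totalSize.get > 0) {
    Some(CatalogStatistics(totalSize.get, rowCount.filter(_ > 0)))
  } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
    Some(CatalogStatistics(rawDataSize.get, rowCount.filter(_ > 0)))
  } else {
    // A non-positive size means the stats are missing or invalid; report none.
    None
  }
}

// Partition parameters as Hive might record them right after an insert:
// a size is known but no row count has been computed yet.
assert(readHiveStats(Map("totalSize" -> "2000")) ==
  Some(CatalogStatistics(BigInt(2000), None)))
```

The precedence (totalSize first, then rawDataSize) follows the order in which the quoted snippets read the properties; treat it as an assumption about the final code.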

@SparkQA

SparkQA commented Dec 9, 2017

Test build #84680 has finished for PR 19932 at commit 48b81b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Dec 9, 2017

cc @cloud-fan @gatorsmile


// Here we are reading statistics from Hive.
// Note that these statistics could be overridden by Spark's statistics if available.
val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
Contributor Author

@wzhfy wzhfy Dec 9, 2017


The code below is moved to a new method readHiveStats

}
}

test("SPARK- - read Hive's statistics for partition") {
Member


SPARK- -> SPARK-22745?

Contributor Author


oh, I forgot it, thanks!

assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000)
assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000)
assert(queryStats("2010-01-02", "10") === None)
assert(queryStats("2010-01-02", "11") === None)
Contributor Author


After the change, these checks are no longer right since we read Hive stats, so I removed them.

@SparkQA

SparkQA commented Dec 10, 2017

Test build #84689 has finished for PR 19932 at commit 09a7c05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
// TODO: check if this estimate is valid for tables after partition pruning.
Contributor


do we still need this TODO?

Contributor Author


good catch, we can remove this

// TODO: still fill the rowCount even if sizeInBytes is empty. Might break anything?
None
}
val hiveStats = readHiveStats(properties)
Contributor


nit: we can inline it

def fromHivePartition(hp: HivePartition): CatalogTablePartition = {
val apiPartition = hp.getTPartition
val properties: Map[String, String] =
if (hp.getParameters != null) hp.getParameters.asScala.toMap else Map.empty
Contributor


nit: if can't fit in one line, prefer

val xxx = if {
  ...
} else {
  ...
}

partition = spark.sessionState.catalog
.getPartition(TableIdentifier(tableName), Map("ds" -> "2017-01-01"))

assert(partition.stats.get.sizeInBytes == 5812)
Contributor


I'm expecting totalSize to be picked up here and sizeInBytes to change; did I miss something?

Contributor Author


totalSize already exists after the INSERT INTO command, so here sizeInBytes doesn't change after the ANALYZE command; only rowCount is added.
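To make the sequencing concrete, a hypothetical sketch (table and helper names are illustrative, in the style of the test suite, not the actual test):

```scala
// INSERT makes Hive record totalSize in the partition's parameters, so Spark
// already sees a sizeInBytes for this partition at this point.
sql("INSERT INTO TABLE t PARTITION (ds = '2017-01-01') SELECT * FROM src")

// ANALYZE then computes and adds rowCount; sizeInBytes keeps the totalSize
// value recorded at insert time, which is why the assertion above expects
// the same size before and after.
sql("ANALYZE TABLE t PARTITION (ds = '2017-01-01') COMPUTE STATISTICS")
```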

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84819 has finished for PR 19932 at commit b80c8f3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84829 has finished for PR 19932 at commit b80c8f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 7453ab0 Dec 13, 2017