
[SPARK-20986] [SQL] Reset table's statistics after PruneFileSourcePartitions rule. #18205

Closed
wants to merge 7 commits

Conversation

lianhuiwang
Contributor

What changes were proposed in this pull request?

After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset: PruneFileSourcePartitions can filter out unnecessary partitions, so the statistics should change accordingly.
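
The fix amounts to rebuilding the table's CatalogStatistics from the pruned file size. A self-contained toy model (plain Scala, no Spark; the case classes only mirror the names of Spark's, they are not Spark's actual implementation):

```scala
// Toy model: after partition pruning, the catalog statistics attached to
// the table must be rebuilt from the pruned file size, otherwise the
// optimizer keeps using the stale, full-table size.
case class CatalogStatistics(sizeInBytes: BigInt)
case class CatalogTable(name: String, stats: Option[CatalogStatistics])
case class FileIndex(sizeInBytes: Long) // stand-in for the pruned file index

def resetStatsAfterPruning(table: CatalogTable, prunedIndex: FileIndex): CatalogTable =
  // Replace whatever stats were there with stats based on the pruned files,
  // mirroring logicalRelation.catalogTable.map(_.copy(stats = ...)) in the patch.
  table.copy(stats = Some(CatalogStatistics(BigInt(prunedIndex.sizeInBytes))))

val fullTable = CatalogTable("partTbl", Some(CatalogStatistics(BigInt(10000))))
val pruned    = resetStatsAfterPruning(fullTable, FileIndex(2500L))
// pruned.stats now reflects only the bytes of the surviving partitions.
```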

How was this patch tested?

Added a unit test.

test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule") {
  withTempView("tempTbl", "partTbl") {
    spark.range(1000).selectExpr("id").createOrReplaceTempView("tempTbl")
    sql("CREATE TABLE partTbl (id INT) PARTITIONED BY (part INT) STORED AS parquet")
Member

Hi, @lianhuiwang .
withTable("partTbl") instead of withTempView(..., "partTbl")?

Contributor Author

Yes, thanks.

@SparkQA

SparkQA commented Jun 5, 2017

Test build #77744 has finished for PR 18205 at commit 20a6043.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2017

Test build #77763 has finished for PR 18205 at commit c53a0c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2017

Test build #77842 has finished for PR 18205 at commit c53a0c7.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor

wzhfy commented Jun 10, 2017

Can you please explain why the reset is needed?
To my understanding, we don't have any statistics in CatalogStatistics initially, and we can get the right size in computeStats through HadoopFsRelation.sizeInBytes, because in the original code we already replace it with prunedFsRelation.

@wzhfy
Contributor

wzhfy commented Jun 10, 2017

OK, I get your point. But the test case does not clearly show the problem. We can first analyze the table to fill the stats in CatalogStatistics, then show the difference after partition pruning.

val prunedLogicalRelation = logicalRelation.copy(relation = prunedFsRelation)

val withStats = logicalRelation.catalogTable.map(_.copy(
  stats = Some(CatalogStatistics(sizeInBytes = BigInt(prunedFileIndex.sizeInBytes)))))
Contributor

Add a comment here indicating we are resetting stats based on the pruned file size?

Contributor Author

Yes, thanks.

val df = sql("SELECT * FROM partTbl where part = 1")
val query = df.queryExecution.analyzed.analyze
val sizes1 = query.collect {
  case relation: LogicalRelation => relation.computeStats(conf).sizeInBytes
Contributor

@wzhfy wzhfy Jun 10, 2017

We'd better not compute stats for an analyzed plan. We can use spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)).stats to query the catalog stats.

Contributor Author

Yes, thanks.

val sizes2 = Optimize.execute(query).collect {
  case relation: LogicalRelation => relation.computeStats(conf).sizeInBytes
}
assert(sizes2.size === 1, s"Size wrong for:\n ${df.queryExecution}")
Contributor

assert the new size in catalog stats is larger than the previous one, and equal to computeStats(conf).sizeInBytes?

Contributor Author

I don't think it has changed the stats in the catalog. After the optimizer runs, the size in the catalog stats is larger than computeStats(conf).sizeInBytes because of the partition pruning.

Contributor

LogicalRelation overrides computeStats, and it will use CatalogStatistics if it exists.
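
The precedence described here can be sketched as a toy function (plain Scala, no Spark; this only illustrates the fallback order, it is not Spark's actual code):

```scala
// computeStats prefers the catalog statistics when they exist and only
// falls back to the relation's own size estimate otherwise. With stale
// catalog stats present, the pruned relation's smaller size would be
// ignored -- which is why the PR resets the catalog stats after pruning.
case class CatalogStatistics(sizeInBytes: BigInt)

def computeSizeInBytes(catalogStats: Option[CatalogStatistics],
                       relationSizeInBytes: BigInt): BigInt =
  catalogStats.map(_.sizeInBytes).getOrElse(relationSizeInBytes)

val staleSize = computeSizeInBytes(Some(CatalogStatistics(BigInt(10000))), BigInt(2500))
val freshSize = computeSizeInBytes(None, BigInt(2500))
```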

Contributor

I don't think it has changed the stats in the catalog.

Don't we reset the catalog stats using the pruned size here?

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77892 has finished for PR 18205 at commit 120662e.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

val df = sql("SELECT * FROM partTbl where part = 1")
val query = df.queryExecution.analyzed.analyze
val sizes1 = query.collect {
  case relation: LogicalRelation => relation.computeStats(conf).sizeInBytes
Contributor

@wzhfy wzhfy Jun 11, 2017

Can we get the catalog stats by relation.catalogTable.get.stats.get here and check it? I just think we need to cover this reset code path.

Contributor Author

Yes, thanks.

}

val tableName = "partTbl"
sql(s"analyze table partTbl compute STATISTICS")
Contributor

nit: ANALYZE TABLE partTbl COMPUTE STATISTICS

Contributor Author

Yes, thanks.


withSQLConf(SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key -> "true") {
  val df = sql("SELECT * FROM partTbl where part = 1")
  val query = df.queryExecution.analyzed.analyze
Contributor

@wzhfy wzhfy Jun 11, 2017

nit: just df.queryExecution.analyzed?

Contributor Author

Because there is a SubqueryAlias plan, I think we need analyze() to eliminate it.

Contributor

But why do we need to eliminate SubqueryAlias here?

test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule") {
  withTempView("tempTbl") {
    withTable("partTbl") {
      spark.range(1000).selectExpr("id").createOrReplaceTempView("tempTbl")
Contributor

For this test, we can use a much smaller size (e.g. 10) to accelerate testing.

Contributor Author

Yes, thanks.

@lianhuiwang
Contributor Author

@wzhfy I have addressed your comments. Thanks.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77934 has finished for PR 18205 at commit f7c3dfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(sizes1.size === 1, s"Size wrong for:\n ${df.queryExecution}")
assert(sizes1(0) == tableStats.get.sizeInBytes)
val sizes2 = Optimize.execute(query).collect {
  case relation: LogicalRelation => relation.catalogTable.get.stats.get.sizeInBytes
Contributor

@wzhfy wzhfy Jun 12, 2017

Fixed the wrong place? For size1, could you get the catalog stats? We'd better not computeStats for an analyzed plan. After optimization, for size2 or size3, we can get sizes from both the catalog stats and computeStats, and see if they are equal and larger than size1.

Contributor Author

Yes, thanks. I have updated it.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #77958 has finished for PR 18205 at commit d1513a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor

wzhfy commented Jun 13, 2017

LGTM, ping @cloud-fan


// Change table stats based on the sizeInBytes of pruned files
val withStats = logicalRelation.catalogTable.map(_.copy(
  stats = Some(CatalogStatistics(sizeInBytes = BigInt(prunedFileIndex.sizeInBytes)))))
Contributor

do we ignore all column stats here?

Contributor Author

Yes. Now it replaces the stats of CatalogTable with a new CatalogStatistics(), like DetermineTableStats does.

Contributor

Column stats are collected at table level; here we need partition-specific stats, so we can ignore the column stats.

Contributor

Ah, actually we have to: the column stats are table-level and are invalid for partitions.

test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule") {
  withTempView("tempTbl") {
    withTable("partTbl") {
      spark.range(10).selectExpr("id").createOrReplaceTempView("tempTbl")
Contributor

to simplify the test:

spark.range(10).select('id, 'id % 3 as 'p).write.partitionBy("p").saveAsTable("t")

Contributor Author

Yes, great, thanks.
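
The suggested one-liner writes ten rows into three partitions (p = id % 3). A self-contained sketch of the grouping it produces (plain Scala, no Spark; `partitions` is just an illustrative name):

```scala
// Model of write.partitionBy("p") over spark.range(10) with p = id % 3:
// rows are grouped into one bucket per distinct partition value.
val rows = (0L until 10L).map(id => (id, id % 3))
val partitions: Map[Long, Seq[Long]] =
  rows.groupBy(_._2).map { case (p, rs) => p -> rs.map(_._1) }

// Partition p = 1 holds ids 1, 4, 7 -- the rows that survive the test's
// "WHERE part = 1" predicate after pruning.
```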

}
assert(sizes1.size === 1, s"Size wrong for:\n ${df.queryExecution}")
assert(sizes1(0) == tableStats.get.sizeInBytes)
val relations = Optimize.execute(query).collect {
Contributor

df.queryExecution.optimized

}
assert(relations.size === 1, s"Size wrong for:\n ${df.queryExecution}")
val size2 = relations(0).computeStats(conf).sizeInBytes
val size3 = relations(0).catalogTable.get.stats.get.sizeInBytes
Contributor

nit:

assert(size2 == relations(0).catalogTable.get.stats.get.sizeInBytes)
assert(size2 < tableStats.get.sizeInBytes)

@lianhuiwang
Contributor Author

@cloud-fan I have addressed your comments. Thanks.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #77989 has finished for PR 18205 at commit 16a3f7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

We have another related PR: #14655

@cloud-fan
Contributor

thanks, merging to master/2.2!

asfgit pushed a commit that referenced this pull request Jun 14, 2017
…itions rule.

## What changes were proposed in this pull request?
After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset: PruneFileSourcePartitions can filter out unnecessary partitions, so the statistics should change accordingly.

## How was this patch tested?
Added a unit test.

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #18205 from lianhuiwang/SPARK-20986.

(cherry picked from commit 8b5b2e2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 8b5b2e2 Jun 14, 2017
@lianhuiwang
Contributor Author

@cloud-fan Thanks.

dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017