
Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Jun 9, 2017

What changes were proposed in this pull request?

Currently, hive's stats are read into CatalogStatistics, while spark's stats are also persisted through CatalogStatistics. As a result, hive's stats can be unexpectedly propagated into spark's stats.

For example, for a catalog table, we read stats from hive, e.g. "totalSize", and put them into CatalogStatistics. Then, by running an "ALTER TABLE" command, we store the stats in CatalogStatistics back into the metastore as spark's stats (because we cannot tell whether they came from spark or not). But spark's stats should only be generated by the "ANALYZE" command, so this behavior is unexpected for "ALTER TABLE".

Secondly, once we have spark's stats in the metastore, after inserting new data we still cannot get the right sizeInBytes in CatalogStatistics, even though hive has updated "totalSize" in the metastore, because we prefer spark's stats (which should not exist) over hive's stats.

A running example is shown in JIRA.

To fix this, we add a new method alterTableStats to store spark's stats, and let alterTable keep existing stats.
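The fix can be sketched with a simplified toy model (this is not the actual Spark code; CatalogTable and CatalogStatistics are reduced to a few fields, and ToyCatalog stands in for the external catalog): alterTable ignores any stats on the incoming definition and keeps what was already stored, while the new alterTableStats is the only path that writes spark's stats.

```scala
// Simplified, hypothetical model of the proposed behavior.
case class CatalogStatistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

case class CatalogTable(
    name: String,
    properties: Map[String, String] = Map.empty,
    stats: Option[CatalogStatistics] = None)

class ToyCatalog {
  private var tables = Map.empty[String, CatalogTable]

  def createTable(t: CatalogTable): Unit = tables += (t.name -> t)
  def getTable(name: String): CatalogTable = tables(name)

  // ALTER TABLE path: ignore stats on the input, keep whatever was stored before.
  def alterTable(t: CatalogTable): Unit = {
    val old = tables(t.name)
    tables += (t.name -> t.copy(stats = old.stats))
  }

  // ANALYZE path: the only entry point that overwrites spark's stats.
  def alterTableStats(name: String, newStats: Option[CatalogStatistics]): Unit =
    tables += (name -> tables(name).copy(stats = newStats))
}
```

With this split, an ALTER TABLE SET TBLPROPERTIES after an ANALYZE no longer clobbers or re-labels the analyzed stats.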

How was this patch tested?

Added new tests.

@wzhfy wzhfy changed the title Separation between spark's stats and hive's stats [SPARK-21031] [SQL] Clearly separate spark's stats and hive's stats Jun 9, 2017
@SparkQA

SparkQA commented Jun 9, 2017

Test build #77837 has started for PR 18248 at commit 0d56f16.

@wzhfy
Contributor Author

wzhfy commented Jun 9, 2017

cc @cloud-fan @gatorsmile

@gatorsmile
Member

To be honest, I also hit this error and plan to fix it. Fortunately, I have not started it yet. : )

Member


If we are designing the interface like this, we might need to refactor it again in the near future. Stats could be collected from Spark, imported from Hive, set by external users, or even from the data source API v2 (in the future).

Contributor Author

@wzhfy wzhfy Jun 9, 2017


If we have another source of stats, we can just add a field and then decide which one to use. That is, we collect stats from different sources in CatalogStatistics and unify them when converting to the plan's Statistics.
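As a rough sketch of that idea (field names here like sparkStats and hiveTotalSize are invented for illustration, not the real CatalogStatistics fields): keep each source's stats in its own field and choose between them only at the point of conversion to the plan's Statistics.

```scala
// Simplified stand-in for the plan-side statistics.
case class Statistics(sizeInBytes: BigInt)

// Hypothetical multi-source catalog stats: one field per source.
case class CatalogStatistics(
    sparkStats: Option[BigInt],    // generated only by ANALYZE
    hiveTotalSize: Option[BigInt]  // read from hive's "totalSize" property
) {
  // Prefer spark's stats when present; otherwise fall back to hive's,
  // then to a default size.
  def toPlanStats(defaultSize: BigInt): Statistics =
    Statistics(sparkStats.orElse(hiveTotalSize).getOrElse(defaultSize))
}
```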

@wzhfy
Contributor Author

wzhfy commented Jun 9, 2017

retest this please

@SparkQA

SparkQA commented Jun 9, 2017

Test build #77847 has finished for PR 18248 at commit 0d56f16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExternalStatistics(

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77860 has finished for PR 18248 at commit 835b6f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think the real issue is that, we mistakenly add statistics in ALTER TABLE. This is because ExternalCatalog.alterTable is heavily used when we wanna change something for a table. I think it would be better to introduce a ExternalCatalog.alterTableStats and use it in ANALYZE TABLE, so that ALTER TABLE won't add statistics.

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

@cloud-fan Actually that was my first version.
It also has problems: if we generate spark's stats first (through the analyze command and alterTableStats) and then run a regular alter table command, the stats will be lost. That's because we don't store the stats (from CatalogStatistics) through alterTable, and in CatalogStatistics we don't know where the stats came from (hive or spark), so we can't decide whether or not to store them in alterTable.

@cloud-fan
Contributor

alterTable won't set new stats but can still keep existing stats, can we implement this?

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

@cloud-fan How can we tell in alterTable whether it's new stats or not?

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

I mean, how can we keep existing stats? Since we cannot tell whether they come from hive or spark: if we store them as spark's stats, we come back to the original problem; if we don't, we lose stats that were actually generated by spark.

@cloud-fan
Contributor

alterTable can read stats from the old table and keep them: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L564

basically alterTable should ignore the stats from the input table metadata and keep the stats as they were in the old table metadata.
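The rule above can be sketched in isolation (names like TableMeta and restoreOldStats are illustrative, not the real HiveExternalCatalog code): alterTable discards any stats carried on the input definition and restores the stats from the previously stored table.

```scala
// Minimal stand-in for a stored table definition.
case class TableMeta(name: String, comment: Option[String], stats: Option[BigInt])

// The merge performed inside a hypothetical alterTable: take everything from the
// input except the stats, which come from the old (stored) metadata.
def restoreOldStats(oldTable: TableMeta, input: TableMeta): TableMeta =
  input.copy(stats = oldTable.stats)
```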

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

@cloud-fan Oh, right, let me try. Thanks!

@wzhfy wzhfy force-pushed the separateHiveStats branch from 835b6f2 to 2649135 on June 10, 2017
@SparkQA

SparkQA commented Jun 10, 2017

Test build #77867 has started for PR 18248 at commit 2649135.

@wzhfy wzhfy changed the title [SPARK-21031] [SQL] Clearly separate spark's stats and hive's stats [SPARK-21031] [SQL] Add alterTableStats to store spark's stats and let alterTable keep existing stats Jun 10, 2017
}
}

test("alter table SET TBLPROPERTIES after analyze table") {
Contributor Author

@wzhfy wzhfy Jun 10, 2017


Here I found the logic is the same for the two cases except for the command itself (SET and UNSET, respectively), so I extracted the common logic.

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77868 has started for PR 18248 at commit 38d03d7.

TABLE_PARTITION_PROVIDER -> TABLE_PARTITION_PROVIDER_FILESYSTEM
}

// Sets the `schema`, `partitionColumnNames` and `bucketSpec` from the old table definition,
Contributor


please update the comments to include stats

@cloud-fan
Contributor

LGTM except one minor comment

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77869 has finished for PR 18248 at commit 0ba01ac.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

retest this please

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77877 has finished for PR 18248 at commit 0ba01ac.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

retest this please

@gatorsmile
Member

In addition to e2e test cases, we also need to add unit test cases in SessionCatalogSuite and ExternalCatalogSuite.

override def alterTable(tableDefinition: CatalogTable): Unit = withClient {
assert(tableDefinition.identifier.database.isDefined)
val db = tableDefinition.identifier.database.get
requireTableExists(db, tableDefinition.identifier.table)
Member


Please update the description of this function too.

@gatorsmile
Member

LGTM except the above two comments.

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77885 has finished for PR 18248 at commit 0ba01ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77896 has finished for PR 18248 at commit 221d052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in a7c61c1 Jun 12, 2017
dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017
…et `alterTable` keep existing stats


Author: Zhenhua Wang <wzh_zju@163.com>

Closes apache#18248 from wzhfy/separateHiveStats.
dilipbiswal pushed a commit to dilipbiswal/spark that referenced this pull request Aug 4, 2017
…et `alterTable` keep existing stats
