
Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Jun 9, 2017

What changes were proposed in this pull request?

Currently, hive's stats are read into CatalogStatistics, while spark's stats are also persisted through CatalogStatistics. As a result, hive's stats can be unexpectedly propagated into spark's stats.

For example, for a catalog table, we read stats from hive, e.g. "totalSize", and put them into CatalogStatistics. Then, by running an "ALTER TABLE" command, we store the stats in CatalogStatistics back into the metastore as spark's stats (because we cannot tell whether they came from spark or not). But spark's stats should only be generated by the "ANALYZE" command, so this behavior is unexpected for "ALTER TABLE".

Secondly, once we have spark's stats in the metastore, after inserting new data we still cannot get the right sizeInBytes in CatalogStatistics, even though hive has updated "totalSize" in the metastore, because we prefer spark's stats (which should not exist) over hive's stats.

A running example is shown in JIRA.

To fix this, we add a new method alterTableStats to store spark's stats, and let alterTable keep existing stats.
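The fix can be sketched with a simplified toy model (this is not the actual Spark code; CatalogTable and CatalogStatistics are reduced to a few fields, and ToyCatalog stands in for the external catalog): alterTable ignores any stats on the incoming definition and keeps what was already stored, while the new alterTableStats is the only path that writes spark's stats.

```scala
// Simplified, hypothetical model of the proposed behavior.
case class CatalogStatistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

case class CatalogTable(
    name: String,
    properties: Map[String, String] = Map.empty,
    stats: Option[CatalogStatistics] = None)

class ToyCatalog {
  private var tables = Map.empty[String, CatalogTable]

  def createTable(t: CatalogTable): Unit = tables += (t.name -> t)
  def getTable(name: String): CatalogTable = tables(name)

  // ALTER TABLE path: ignore stats on the input, keep whatever was stored before.
  def alterTable(t: CatalogTable): Unit = {
    val old = tables(t.name)
    tables += (t.name -> t.copy(stats = old.stats))
  }

  // ANALYZE path: the only entry point that overwrites spark's stats.
  def alterTableStats(name: String, newStats: Option[CatalogStatistics]): Unit =
    tables += (name -> tables(name).copy(stats = newStats))
}
```

With this split, an ALTER TABLE SET TBLPROPERTIES after an ANALYZE no longer clobbers or re-labels the analyzed stats.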

How was this patch tested?

Added new tests.

@wzhfy wzhfy changed the title Separation between spark's stats and hive's stats [SPARK-21031] [SQL] Clearly separate spark's stats and hive's stats Jun 9, 2017
@SparkQA

SparkQA commented Jun 9, 2017

Test build #77837 has started for PR 18248 at commit 0d56f16.

@wzhfy
Contributor Author

wzhfy commented Jun 9, 2017

cc @cloud-fan @gatorsmile

@gatorsmile
Member

To be honest, I also hit this error and plan to fix it. Fortunately, I have not started it yet. : )

Member


If we are designing the interface like this, we might need to refactor it again in the near future. Stats could be collected from Spark, imported from Hive, set by external users, or even from the data source API v2 (in the future).

Contributor Author

@wzhfy wzhfy Jun 9, 2017


If we have another source of stats, we can just add a field and then decide which one to use. That is, we collect stats from different sources in CatalogStatistics and unify them when converting to the plan's Statistics.
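As a rough sketch of that idea (field names here like sparkStats and hiveTotalSize are invented for illustration, not the real CatalogStatistics fields): keep each source's stats in its own field and choose between them only at the point of conversion to the plan's Statistics.

```scala
// Simplified stand-in for the plan-side statistics.
case class Statistics(sizeInBytes: BigInt)

// Hypothetical multi-source catalog stats: one field per source.
case class CatalogStatistics(
    sparkStats: Option[BigInt],    // generated only by ANALYZE
    hiveTotalSize: Option[BigInt]  // read from hive's "totalSize" property
) {
  // Prefer spark's stats when present; otherwise fall back to hive's,
  // then to a default size.
  def toPlanStats(defaultSize: BigInt): Statistics =
    Statistics(sparkStats.orElse(hiveTotalSize).getOrElse(defaultSize))
}
```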

@wzhfy
Contributor Author

wzhfy commented Jun 9, 2017

retest this please

@SparkQA

SparkQA commented Jun 9, 2017

Test build #77847 has finished for PR 18248 at commit 0d56f16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExternalStatistics(

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77860 has finished for PR 18248 at commit 835b6f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think the real issue is that, we mistakenly add statistics in ALTER TABLE. This is because ExternalCatalog.alterTable is heavily used when we wanna change something for a table. I think it would be better to introduce a ExternalCatalog.alterTableStats and use it in ANALYZE TABLE, so that ALTER TABLE won't add statistics.

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

@cloud-fan Actually that was my first version.
It also has problems: if we generate spark's stats first (through the analyze command and alterTableStats) and then run a regular alter table command, the stats will be lost. That's because we don't store the stats (from CatalogStatistics) through alterTable, and in CatalogStatistics we don't know where the stats came from (hive or spark), so we can't decide whether or not to store them in alterTable.

@cloud-fan
Contributor

alterTable won't set new stats but can still keep existing stats, can we implement this?

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

@cloud-fan How can we tell in alterTable whether it's new stats or not?

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

I mean, how can we keep existing stats? Since we cannot tell whether they come from hive or spark: if we store them as spark's stats, we come back to the original problem; if we don't, we lose stats that were actually generated by spark.

@cloud-fan
Contributor

alterTable can read stats from the old table and keep them: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L564

basically alterTable should ignore the stats from the input table metadata and keep the stats as they were in the old table metadata.
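The rule above can be sketched in isolation (names like TableMeta and restoreOldStats are illustrative, not the real HiveExternalCatalog code): alterTable discards any stats carried on the input definition and restores the stats from the previously stored table.

```scala
// Minimal stand-in for a stored table definition.
case class TableMeta(name: String, comment: Option[String], stats: Option[BigInt])

// The merge performed inside a hypothetical alterTable: take everything from the
// input except the stats, which come from the old (stored) metadata.
def restoreOldStats(oldTable: TableMeta, input: TableMeta): TableMeta =
  input.copy(stats = oldTable.stats)
```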

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

@cloud-fan Oh, right, let me try. Thanks!

@wzhfy wzhfy force-pushed the separateHiveStats branch from 835b6f2 to 2649135 on June 10, 2017
@SparkQA

SparkQA commented Jun 10, 2017

Test build #77867 has started for PR 18248 at commit 2649135.

@wzhfy wzhfy changed the title [SPARK-21031] [SQL] Clearly separate spark's stats and hive's stats [SPARK-21031] [SQL] Add alterTableStats to store spark's stats and let alterTable keep existing stats Jun 10, 2017
}
}

test("alter table SET TBLPROPERTIES after analyze table") {
Contributor Author

@wzhfy wzhfy Jun 10, 2017


Here I found the logic is the same for the two cases except for the command itself (SET and UNSET, respectively), so I extracted the common logic.

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77868 has started for PR 18248 at commit 38d03d7.

TABLE_PARTITION_PROVIDER -> TABLE_PARTITION_PROVIDER_FILESYSTEM
}

// Sets the `schema`, `partitionColumnNames` and `bucketSpec` from the old table definition,
Contributor


please update the comments to include stats

@cloud-fan
Contributor

LGTM except one minor comment

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77869 has finished for PR 18248 at commit 0ba01ac.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

retest this please

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77877 has finished for PR 18248 at commit 0ba01ac.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Jun 10, 2017

retest this please

@gatorsmile
Member

In addition to e2e test cases, we also need to add unit test cases in SessionCatalogSuite and ExternalCatalogSuite.

override def alterTable(tableDefinition: CatalogTable): Unit = withClient {
assert(tableDefinition.identifier.database.isDefined)
val db = tableDefinition.identifier.database.get
requireTableExists(db, tableDefinition.identifier.table)
Member


Please update the description of this function too.

@gatorsmile
Member

LGTM except the above two comments.

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77885 has finished for PR 18248 at commit 0ba01ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77896 has finished for PR 18248 at commit 221d052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in a7c61c1 Jun 12, 2017
dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017
…et `alterTable` keep existing stats


Author: Zhenhua Wang <wzh_zju@163.com>

Closes apache#18248 from wzhfy/separateHiveStats.
dilipbiswal pushed a commit to dilipbiswal/spark that referenced this pull request Aug 4, 2017
…et `alterTable` keep existing stats
