[SPARK-21031] [SQL] Add alterTableStats to store spark's stats and let alterTable keep existing stats
#18248
|
Test build #77837 has started for PR 18248 at commit |
|
To be honest, I also hit this error and plan to fix it. Fortunately, I have not started it yet. : ) |
If we are designing the interface like this, we might need to refactor it again in the near future. Stats could be collected from Spark, imported from Hive, set by external users, or even from the data source API v2 (in the future).
If we have another source of stats, we can just add a field and then decide which one to use. That is, we collect the different sources of stats in CatalogStatistics and unify them when converting to the plan's Statistics.
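A minimal sketch of that idea, assuming one optional field per stats source and a simple precedence rule when converting to plan-level statistics. The names `TableSizeStats`, `MultiSourceStats`, and `toPlanStats` are hypothetical and are not part of Spark; the precedence (spark's stats over hive's) follows the behavior described in this PR.

```scala
// Hypothetical illustration only: not Spark's actual CatalogStatistics.
final case class TableSizeStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// One optional field per source; adding a new source means adding a field.
final case class MultiSourceStats(
    sparkStats: Option[TableSizeStats],
    hiveStats: Option[TableSizeStats]) {

  // Unify the sources when converting to the plan's statistics:
  // prefer spark's stats, fall back to hive's, otherwise nothing.
  def toPlanStats: Option[TableSizeStats] = sparkStats.orElse(hiveStats)
}
```

Keeping the sources in separate fields means a later write cannot confuse hive's numbers with spark's, which is the ambiguity this thread is concerned with.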
|
retest this please |
|
Test build #77847 has finished for PR 18248 at commit
|
|
Test build #77860 has finished for PR 18248 at commit
|
|
I think the real issue is that, we mistakenly add statistics in |
|
@cloud-fan Actually that is my first version. |
|
|
|
@cloud-fan How can we tell in |
|
I mean, how can we keep existing stats? Since we cannot tell whether they are from hive or spark, if we store them as spark's stats, we come back to the original problem. If we don't, we lose the stats if they were actually generated by spark. |
|
basically |
|
@cloud-fan Oh, right, let me try. Thanks! |
|
Test build #77867 has started for PR 18248 at commit |
alterTableStats to store spark's stats and let alterTable keep existing stats
| } | ||
| } | ||
|
|
||
| test("alter table SET TBLPROPERTIES after analyze table") { |
Here I found that the logic is the same for the two cases except for the command (set and unset, respectively), so I extracted the common logic.
|
Test build #77868 has started for PR 18248 at commit |
| TABLE_PARTITION_PROVIDER -> TABLE_PARTITION_PROVIDER_FILESYSTEM | ||
| } | ||
|
|
||
| // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from the old table definition, |
Please update the comments to include stats.
|
LGTM except one minor comment |
|
Test build #77869 has finished for PR 18248 at commit
|
|
retest this please |
|
Test build #77877 has finished for PR 18248 at commit
|
|
retest this please |
|
In addition to e2e test cases, we also need to add unit test cases in SessionCatalogSuite and ExternalCatalogSuite. |
| override def alterTable(tableDefinition: CatalogTable): Unit = withClient { | ||
| assert(tableDefinition.identifier.database.isDefined) | ||
| val db = tableDefinition.identifier.database.get | ||
| requireTableExists(db, tableDefinition.identifier.table) |
Please update the description of this function too.
|
LGTM except the above two comments. |
|
Test build #77885 has finished for PR 18248 at commit
|
|
Test build #77896 has finished for PR 18248 at commit
|
|
thanks, merging to master! |
…et `alterTable` keep existing stats

## What changes were proposed in this pull request?

Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. As a result, hive's stats can be unexpectedly propagated into spark's stats.

For example, for a catalog table, we read stats from hive, e.g. "totalSize", and put them into `CatalogStatistics`. Then, by using the "ALTER TABLE" command, we store the stats in `CatalogStatistics` into the metastore as spark's stats (because we don't know whether they are from spark or not). But spark's stats should only be generated by the "ANALYZE" command, so this is unexpected behavior for "ALTER TABLE".

Secondly, now that we have spark's stats in the metastore, after inserting new data, although hive updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (which should not exist) over hive's stats.

A running example is shown in [JIRA](https://issues.apache.org/jira/browse/SPARK-21031).

To fix this, we add a new method `alterTableStats` to store spark's stats, and let `alterTable` keep existing stats.

## How was this patch tested?

Added new tests.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes apache#18248 from wzhfy/separateHiveStats.
What changes were proposed in this pull request?

Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. As a result, hive's stats can be unexpectedly propagated into spark's stats.

For example, for a catalog table, we read stats from hive, e.g. "totalSize", and put them into `CatalogStatistics`. Then, by using the "ALTER TABLE" command, we store the stats in `CatalogStatistics` into the metastore as spark's stats (because we don't know whether they are from spark or not). But spark's stats should only be generated by the "ANALYZE" command, so this is unexpected behavior for "ALTER TABLE".

Secondly, now that we have spark's stats in the metastore, after inserting new data, although hive updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (which should not exist) over hive's stats.

A running example is shown in [JIRA](https://issues.apache.org/jira/browse/SPARK-21031).

To fix this, we add a new method `alterTableStats` to store spark's stats, and let `alterTable` keep existing stats.

How was this patch tested?

Added new tests.
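
The split described above can be sketched with a toy in-memory catalog. This is only an illustration under stated assumptions: `ToyStats`, `ToyTable`, and `ToyCatalog` are hypothetical stand-ins, and the method signatures are simplified and may differ from the real `alterTable` and `alterTableStats` in Spark's catalog classes. It shows only the contract: `alterTable` ignores any stats carried by the incoming definition and keeps the stored ones, while `alterTableStats` is the only entry point that writes stats.

```scala
// Toy model only: these are not Spark's real classes or signatures.
final case class ToyStats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

final case class ToyTable(
    name: String,
    properties: Map[String, String] = Map.empty,
    stats: Option[ToyStats] = None)

final class ToyCatalog {
  private val tables = scala.collection.mutable.Map.empty[String, ToyTable]

  // New tables start without stats in this toy model.
  def createTable(table: ToyTable): Unit = {
    tables(table.name) = table.copy(stats = None)
  }

  def getTable(name: String): ToyTable = tables(name)

  // Updates the definition but keeps whatever stats are already stored;
  // stats carried by `newDefinition` are deliberately ignored.
  def alterTable(newDefinition: ToyTable): Unit = {
    val existing = tables(newDefinition.name)
    tables(newDefinition.name) = newDefinition.copy(stats = existing.stats)
  }

  // The only method that writes stats (what an ANALYZE-style command would call).
  def alterTableStats(name: String, stats: Option[ToyStats]): Unit = {
    tables(name) = tables(name).copy(stats = stats)
  }
}

object ToyCatalogDemo {
  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog
    catalog.createTable(ToyTable("t"))

    // A caller holds a definition whose stats came from elsewhere,
    // e.g. a size derived from hive's "totalSize".
    val withForeignStats =
      catalog.getTable("t").copy(stats = Some(ToyStats(BigInt(100))))

    // Changing a property must not persist those stats.
    catalog.alterTable(withForeignStats.copy(properties = Map("k" -> "v")))
    println(catalog.getTable("t").stats.isEmpty)

    // Stats are stored only through the dedicated method...
    catalog.alterTableStats("t", Some(ToyStats(BigInt(100), Some(BigInt(5)))))

    // ...and a later alterTable without stats does not drop them.
    catalog.alterTable(ToyTable("t"))
    println(catalog.getTable("t").stats.isDefined)
  }
}
```

With this separation, a property change no longer needs to know where the in-memory stats came from, which is the question raised in the review discussion.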