-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105
Conversation
Test build #77359 has finished for PR 18105 at commit
|
I think we should always trust Spark's table stats over Hive's, no matter CBO is on or not. If users update the stats at hive side, it's their own responsibility to update it at Spark side. IIUC |
Test build #77365 has finished for PR 18105 at commit
|
@cloud-fan > What was the behavior before? Previously, analyze table command only updates the size of table, and it uses the same hive stats name "totalSize", and stores it in metastore through table parameter. Now since we are using a different name for spark's size ("spark.sql.statistics.totalSize"), they can be different. |
I think we'd better respect the "totalSize" stats when cbo is disabled, otherwise user has no way back to the default behavior (which has the right size info) unless he re-runs the analyze command. I personally think that's unfriendly to users. |
If users have not analyzed the table in Spark yet, we should respect the stats from hive metastore. But if users have already run the analyze table command in Spark, I think it's fair to ask them to re-analyze if data changed. BTW I don't think the analyze table command is bound with CBO, if you think the behavior is reasonable when CBO is on, I think it's also reasonable when CBO is off. |
@cloud-fan I mean the behavior when cbo is disabled should be the same as the behavior previously without cbo. |
I don't think the analyze table command is bound with CBO, neither. I just want to change how we read stats from metastore. That is, which stats (spark or hive) we respect based on cbo switch. |
Now, we have two sources of statistics. We need a mechanism to decide which one should be chosen. I also think we should respect Spark-generated statistics over Hive's when it is available. We might need to update the code comments at least to document the behaviors we choose. |
@gatorsmile OK. Then it's consistent with the current implementation. I'll change the description of this pr and update the code comments based on this mechanism. |
LGTM pending Jenkins |
Test build #77490 has finished for PR 18105 at commit
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
Now, we have two sources of statistics, i.e. Spark's stats and Hive's stats. Spark's stats is generated by running "analyze" command in Spark. Once it's available, we respect this stats over Hive's.
This pr is to clearly document in related code the mechanism to choose between these two sources of stats.
How was this patch tested?
Not related.