[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105

wzhfy · 2017-05-25T09:33:23Z

What changes were proposed in this pull request?

Now, we have two sources of statistics, i.e. Spark's stats and Hive's stats. Spark's stats is generated by running "analyze" command in Spark. Once it's available, we respect this stats over Hive's.

This pr is to clearly document in related code the mechanism to choose between these two sources of stats.

How was this patch tested?

Not related.

wzhfy · 2017-05-25T10:30:53Z

cc @cloud-fan @gatorsmile

SparkQA · 2017-05-25T10:54:49Z

Test build #77359 has finished for PR 18105 at commit 4655292.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-25T12:46:42Z

I think we should always trust Spark's table stats over Hive's, no matter CBO is on or not. If users update the stats at hive side, it's their own responsibility to update it at Spark side.

IIUC AnalyzeTableCommand appears before CBO right? What was the behavior before?

SparkQA · 2017-05-25T14:12:39Z

Test build #77365 has finished for PR 18105 at commit 59619ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-05-25T16:29:34Z

@cloud-fan > What was the behavior before?

Previously, analyze table command only updates the size of table, and it uses the same hive stats name "totalSize", and stores it in metastore through table parameter.

Now since we are using a different name for spark's size ("spark.sql.statistics.totalSize"), they can be different.
Actually the parameter "totalSize" will change even in spark when you put some new data into the table, which seems to be hive metastore's behavior.

wzhfy · 2017-05-25T16:35:49Z

I think we'd better respect the "totalSize" stats when cbo is disabled, otherwise user has no way back to the default behavior (which has the right size info) unless he re-runs the analyze command. I personally think that's unfriendly to users.

cloud-fan · 2017-05-25T17:53:11Z

If users have not analyzed the table in Spark yet, we should respect the stats from hive metastore. But if users have already run the analyze table command in Spark, I think it's fair to ask them to re-analyze if data changed. BTW I don't think the analyze table command is bound with CBO, if you think the behavior is reasonable when CBO is on, I think it's also reasonable when CBO is off.

wzhfy · 2017-05-26T02:18:39Z

@cloud-fan I mean the behavior when cbo is disabled should be the same as the behavior previously without cbo.
Previously, size is read from "totalSize", and it changes after update.
Now, when cbo is enabled, I agree that user is responsible to re-analyze. But when cbo is disabled, I think user expects the behavior without cbo, i.e. read "totalSize" from metastore.

wzhfy · 2017-05-26T02:21:52Z

I don't think the analyze table command is bound with CBO, neither. I just want to change how we read stats from metastore. That is, which stats (spark or hive) we respect based on cbo switch.

gatorsmile · 2017-05-26T20:40:49Z

Now, we have two sources of statistics. We need a mechanism to decide which one should be chosen. I also think we should respect Spark-generated statistics over Hive's when it is available.

We might need to update the code comments at least to document the behaviors we choose.

wzhfy · 2017-05-28T15:54:51Z

I also think we should respect Spark-generated statistics over Hive's when it is available.

@gatorsmile OK. Then it's consistent with the current implementation. I'll change the description of this pr and update the code comments based on this mechanism.

gatorsmile · 2017-05-28T16:57:00Z

LGTM pending Jenkins

SparkQA · 2017-05-28T18:36:20Z

Test build #77490 has finished for PR 18105 at commit 556a56f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-05-28T20:33:24Z

Thanks! Merging to master.

wzhfy changed the title ~~[SPARK-20881] [SQL] Use Hive's stats in metastore when cbo is disabled~~ [SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics May 28, 2017

add document

556a56f

wzhfy force-pushed the cboSwitchStats branch from 59619ee to 556a56f Compare May 28, 2017 16:56

asfgit closed this in 9d0db5a May 28, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105

[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105

wzhfy commented May 25, 2017 •

edited

Loading

wzhfy commented May 25, 2017

SparkQA commented May 25, 2017

cloud-fan commented May 25, 2017

SparkQA commented May 25, 2017

wzhfy commented May 25, 2017 •

edited

Loading

wzhfy commented May 25, 2017 •

edited

Loading

cloud-fan commented May 25, 2017 •

edited

Loading

wzhfy commented May 26, 2017

wzhfy commented May 26, 2017 •

edited

Loading

gatorsmile commented May 26, 2017 •

edited

Loading

wzhfy commented May 28, 2017

gatorsmile commented May 28, 2017

SparkQA commented May 28, 2017

gatorsmile commented May 28, 2017

[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105

[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105

Conversation

wzhfy commented May 25, 2017 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

wzhfy commented May 25, 2017

SparkQA commented May 25, 2017

cloud-fan commented May 25, 2017

SparkQA commented May 25, 2017

wzhfy commented May 25, 2017 • edited Loading

wzhfy commented May 25, 2017 • edited Loading

cloud-fan commented May 25, 2017 • edited Loading

wzhfy commented May 26, 2017

wzhfy commented May 26, 2017 • edited Loading

gatorsmile commented May 26, 2017 • edited Loading

wzhfy commented May 28, 2017

gatorsmile commented May 28, 2017

SparkQA commented May 28, 2017

gatorsmile commented May 28, 2017

wzhfy commented May 25, 2017 •

edited

Loading

wzhfy commented May 25, 2017 •

edited

Loading

wzhfy commented May 25, 2017 •

edited

Loading

cloud-fan commented May 25, 2017 •

edited

Loading

wzhfy commented May 26, 2017 •

edited

Loading

gatorsmile commented May 26, 2017 •

edited

Loading