Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics #18105

Closed
wants to merge 1 commit into from

Conversation

wzhfy
Copy link
Contributor

@wzhfy wzhfy commented May 25, 2017

What changes were proposed in this pull request?

Now, we have two sources of statistics, i.e. Spark's stats and Hive's stats. Spark's stats is generated by running "analyze" command in Spark. Once it's available, we respect this stats over Hive's.

This pr is to clearly document in related code the mechanism to choose between these two sources of stats.

How was this patch tested?

Not related.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 25, 2017

cc @cloud-fan @gatorsmile

@SparkQA
Copy link

SparkQA commented May 25, 2017

Test build #77359 has finished for PR 18105 at commit 4655292.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

I think we should always trust Spark's table stats over Hive's, no matter CBO is on or not. If users update the stats at hive side, it's their own responsibility to update it at Spark side.

IIUC AnalyzeTableCommand appears before CBO right? What was the behavior before?

@SparkQA
Copy link

SparkQA commented May 25, 2017

Test build #77365 has finished for PR 18105 at commit 59619ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 25, 2017

@cloud-fan > What was the behavior before?

Previously, analyze table command only updates the size of table, and it uses the same hive stats name "totalSize", and stores it in metastore through table parameter.

Now since we are using a different name for spark's size ("spark.sql.statistics.totalSize"), they can be different.
Actually the parameter "totalSize" will change even in spark when you put some new data into the table, which seems to be hive metastore's behavior.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 25, 2017

I think we'd better respect the "totalSize" stats when cbo is disabled, otherwise user has no way back to the default behavior (which has the right size info) unless he re-runs the analyze command. I personally think that's unfriendly to users.

@cloud-fan
Copy link
Contributor

cloud-fan commented May 25, 2017

If users have not analyzed the table in Spark yet, we should respect the stats from hive metastore. But if users have already run the analyze table command in Spark, I think it's fair to ask them to re-analyze if data changed. BTW I don't think the analyze table command is bound with CBO, if you think the behavior is reasonable when CBO is on, I think it's also reasonable when CBO is off.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 26, 2017

@cloud-fan I mean the behavior when cbo is disabled should be the same as the behavior previously without cbo.
Previously, size is read from "totalSize", and it changes after update.
Now, when cbo is enabled, I agree that user is responsible to re-analyze. But when cbo is disabled, I think user expects the behavior without cbo, i.e. read "totalSize" from metastore.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 26, 2017

I don't think the analyze table command is bound with CBO, neither. I just want to change how we read stats from metastore. That is, which stats (spark or hive) we respect based on cbo switch.

@gatorsmile
Copy link
Member

gatorsmile commented May 26, 2017

Now, we have two sources of statistics. We need a mechanism to decide which one should be chosen. I also think we should respect Spark-generated statistics over Hive's when it is available.

We might need to update the code comments at least to document the behaviors we choose.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 28, 2017

I also think we should respect Spark-generated statistics over Hive's when it is available.

@gatorsmile OK. Then it's consistent with the current implementation. I'll change the description of this pr and update the code comments based on this mechanism.

@wzhfy wzhfy changed the title [SPARK-20881] [SQL] Use Hive's stats in metastore when cbo is disabled [SPARK-20881] [SQL] Clearly document the mechanism to choose between two sources of statistics May 28, 2017
@gatorsmile
Copy link
Member

LGTM pending Jenkins

@SparkQA
Copy link

SparkQA commented May 28, 2017

Test build #77490 has finished for PR 18105 at commit 556a56f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 9d0db5a May 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants