Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-22248 Fix statistics persisting issues #801

Closed
wants to merge 1 commit into from

Conversation

miklosgergely
Copy link
Contributor

  • During the thrift call the XXXXColumnStatsDataInspector was transformed into a XXXXColumnStatsData object, which then was converted back, by calling the xxxxInspectorFromStats functions. The new object was never put back though to the aggregateStats, so all the modifications made by the XXXXColumnStatsMerger was made on an object that was never used again. Added aggregateColStats.getStatsData().setXXXXStats(aggregateData); calls to put them there, so the changes made by the merger are actually in effect.

  • The min value was miscalculated for Long and Double types, as the null value was treated as 0. It was fixed by calculating the min values by also using the isSetLowValue() function.

  • In case of vector_coalesce_3.q the bad statistics made the engine "think" that the column is a primary key following some heuristics based on statistics, and made it guess the statistics in a different way, thus is the output change.


private double getMinValue(DoubleColumnStatsDataInspector aggregateData, DoubleColumnStatsDataInspector newData) {
if (!aggregateData.isSetLowValue() && !newData.isSetLowValue()) {
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are both unset, why do we return 0 here instead of leaving it unset? I am not sure whether this will ever happen, but it seems the right behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original logic was that the value for the statistics should be 0 (as we don't have anything for that column yet), but I agree with you, it is nicer to leave it unset in this case. In effect it's the same, but the code is cleaner. Fixed.

@@ -30,7 +30,7 @@
public void merge(ColumnStatisticsObj aggregateColStats, ColumnStatisticsObj newColStats) {
DoubleColumnStatsDataInspector aggregateData = doubleInspectorFromStats(aggregateColStats);
DoubleColumnStatsDataInspector newData = doubleInspectorFromStats(newColStats);
aggregateData.setLowValue(Math.min(aggregateData.getLowValue(), newData.getLowValue()));
setMinValue(aggregateData, newData);
aggregateData.setHighValue(Math.max(aggregateData.getHighValue(), newData.getHighValue()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the same for high value?

@miklosgergely miklosgergely deleted the HIVE-22248 branch October 5, 2019 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants