HIVE-28082: HiveAggregateReduceFunctionsRule could generate an inconsistent result #5091
- IIUC, based on the SQL standard, an exception should be thrown when a cast fails. In the case of `avg('text')`, the actual parameter `'text'` should be cast to some numeric type.
Unfortunately, Hive has a special behaviour: when a cast fails it doesn't throw an exception but logs a warning into hive.log, the cast returns `null`, and execution continues.
At this point it is not clear what the end result should be:
  - according to the standard, the single-column input table of the aggregate function should not contain `null` values, and if this input is empty the function result should be `null`
  - however, in Hive's case the null values are introduced by an invalid cast.
- The `HiveAggregateReduceFunctionsRule` seems to be OK.
The results of `avg(argument)` and `sum(argument)/count(argument)` must be the same, and this is what happens in the CBO path. However, in the non-CBO path the results do not match when the parameter is a character string:
set hive.cbo.enable=false;
select avg('text'), sum('text')/count('text');
NULL 0.0
In general we tend to go with the CBO path.
I understand that Hive 2 worked differently, but IMHO it is acceptable to have breaking changes when we step major versions. So if we want to fix something in this area, we should fix the non-CBO path to produce the same result as the CBO path.
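To make the mismatch between `avg('text')` and `sum('text')/count('text')` concrete, here is a toy Python model. It is only a sketch of the semantics discussed in this comment, not Hive's implementation: `lenient_cast`, `avg_direct`, and `avg_decomposed` are hypothetical names, and the assumption that the sum of zero successfully cast values is `0.0` is taken from the observed `NULL 0.0` output rather than from the Hive source.

```python
from typing import Optional, Sequence


def lenient_cast(value: Optional[str]) -> Optional[float]:
    """Model of a cast that yields None on failure instead of raising."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def avg_direct(values: Sequence[Optional[str]]) -> Optional[float]:
    """Average over successfully cast values; None when nothing remains."""
    nums = [n for n in map(lenient_cast, values) if n is not None]
    return sum(nums) / len(nums) if nums else None


def avg_decomposed(values: Sequence[Optional[str]]) -> Optional[float]:
    """sum(x) / count(x), where count sees the original (uncast) strings."""
    # Assumption: the sum is 0.0 when no cast succeeds, as the output suggests.
    total = sum((n for n in map(lenient_cast, values) if n is not None), 0.0)
    # count(x) counts non-null inputs, regardless of whether they are numeric.
    count = sum(1 for v in values if v is not None)
    return total / count if count else None


print(avg_direct(["text"]), avg_decomposed(["text"]))
print(avg_direct(["1", "3"]), avg_decomposed(["1", "3"]))
```

In this model the two forms agree on numeric strings and diverge only when a cast fails, because the failed value is dropped from the numerator but still counted in the denominator.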
@kasakrisz Thanks for checking. I double-checked the SQL:2023 Part 2. (A) (B) If we allow implicit cast here(I've not found the rule justifying it), it is reasonable to follow the rules of CAST. Based on (C) If the implicit numeric cast of So, if (A) we follow the standard strictly or (B) we accept implicit cast, In summary, if we accept a breaking change, I would say either of the following ways sounds consistent. What do you think?
Chiming in late on this topic ..
Yes, it seems reasonable because Hive doesn't follow the standard in the case of I think we can consider
Thank you both. I believe we strongly agree with some points.
Let me check AVG/STDDEV_(POP|SAMP)/VAR_(POP|SAMP) and CAST, and propose reasonable behaviors.
This table illustrates the summary of the current behavior tested by 468ea05. I will add my proposal in the next comment.
@okumin
Please add tests for
ql/src/test/results/clientpositive/llap/cbo_rp_groupby3_noskew_multi_distinct.q.out
@kasakrisz Thanks. I found
My PoC is here. We need #5245.
I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket? Literally,
It is OK to file another Jira to track this.
...java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateReduceFunctionsRule.java
Rebased & followed your comments. Waiting for CI to finish...
@@ -92,4 +92,4 @@ FROM src
 POSTHOOK: type: QUERY
 POSTHOOK: Input: default@src
 #### A masked pattern was here ####
-0.0 0.0 NULL NULL
+0.0 NULL NULL NULL
I believe this diff is expected based on the discussion.
https://github.com/apache/hive/blob/e78e27bc593c20a312c76b7ba9c06b96176bc647/ql/src/test/queries/clientpositive/udaf_number_format.q
CI passed.
BTW, I filed another ticket for this change.
+1
What changes were proposed in this pull request?
This PR would disable `HiveAggregateReduceFunctionsRule` when the argument of AVG/STDDEV_(POP|SAMP)/VAR_(POP|SAMP) has a string-like type.
https://issues.apache.org/jira/browse/HIVE-28082
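The condition described above can be pictured as a simple guard. The following Python sketch is purely illustrative and is not the Java code in this PR: `can_reduce`, `STRING_LIKE_TYPES`, and `GUARDED_AGGREGATES` are hypothetical names, and treating STRING, VARCHAR, and CHAR as the string-like types is an assumption.

```python
# Hypothetical names throughout; this is not the Java change from the PR.
# Assumption: "string-like" means Hive's STRING, VARCHAR, and CHAR types.
STRING_LIKE_TYPES = {"STRING", "VARCHAR", "CHAR"}
GUARDED_AGGREGATES = {"AVG", "STDDEV_POP", "STDDEV_SAMP", "VAR_POP", "VAR_SAMP"}


def can_reduce(aggregate: str, argument_type: str) -> bool:
    """Return False when decomposing the aggregate could change its result."""
    # Drop a length parameter such as the "(10)" in "varchar(10)".
    base_type = argument_type.split("(")[0].strip().upper()
    if aggregate.upper() in GUARDED_AGGREGATES and base_type in STRING_LIKE_TYPES:
        return False
    return True


print(can_reduce("AVG", "DOUBLE"))
print(can_reduce("AVG", "varchar(10)"))
```

With such a guard, numeric arguments would still be decomposed, while string-like arguments would keep the original aggregate call and therefore its original semantics.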
Why are the changes needed?
While testing Hive 4, we found that the behavior of those UDFs has changed since Hive 2. We identified `HiveAggregateReduceFunctionsRule` as the root cause. We believe Calcite rules should not change the original semantics.
This PR would disable the rule when it could cause inconsistency. That's because it is not simple to decompose `AVG(str)` into `SUM(str)` and `COUNT(*)`, or to correctly transform `STDDEV_*` or `VAR_*`.
Does this PR introduce any user-facing change?
The results of these UDFs could change for string-like arguments.
Is the change a dependency upgrade?
No
How was this patch tested?
I added/updated test cases. The first commit adds a new test case that records the result on the current master branch. Results change with or without CBO.