Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28082: HiveAggregateReduceFunctionsRule could generate an inconsistent result #5091

Merged
merged 4 commits into from
Jun 5, 2024

Conversation

okumin
Copy link
Contributor

@okumin okumin commented Feb 20, 2024

What changes were proposed in this pull request?

This PR would disable HiveAggregateReduceFunctionsRule when the type of argument of AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) is a string-like type.

https://issues.apache.org/jira/browse/HIVE-28082

Why are the changes needed?

While testing Hive 4, we found the behavior of those UDFs changed since Hive 2. We identified HiveAggregateReduceFunctionsRule as the root cause. We believe Calcite rules should not change the original semantics.

This PR would disable the rule when it could cause inconsistency. That's because it is not simple to decompose AVG(str) into SUM(str) and COUNT(*), or correctly transform STDDEV_* or VAR_*.

Does this PR introduce any user-facing change?

The result could change if a user uses the UDFs.

Is the change a dependency upgrade?

No

How was this patch tested?

I added/updated test cases. You can see the result of the current master branch through the first commit to add a new test case. Results change with or without CBO.

Copy link

sonarcloud bot commented Feb 21, 2024

Quality Gate Passed Quality Gate passed

Issues
2 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@okumin okumin changed the title [WIP] HiveAggregateReduceFunctionsRule could generate an inconsistent result Feb 21, 2024
@okumin okumin marked this pull request as ready for review February 21, 2024 11:23
@okumin okumin changed the title HiveAggregateReduceFunctionsRule could generate an inconsistent result HIVE-28082: HiveAggregateReduceFunctionsRule could generate an inconsistent result Feb 21, 2024
@deniskuzZ deniskuzZ requested a review from kasakrisz April 9, 2024 11:38
Copy link
Contributor

@kasakrisz kasakrisz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. IIUC based on SQL standard an exception should be thrown when a cast fails. In case of avg('text') the actual parameter 'text' should be casted to some numeric type.
    Unfortunately Hive has a special behaviour: when a cast fails it doesn't throw exception but logs a warn into hive.log, the cast returns null and execution continues.
    At this point it is not clear what the end result should be:
  • according to the standard if the input single column table of the aggregate function should not contain null values and if this input is empty the function result should be null
  • however in Hive's case the null values are introduced due to an invalid cast.
  1. The HiveAggregateReduceFunctionsRule seems to be ok.
    The result of avg(argument) and sum(argument)/count(argument) must be the same and this is what happening in the CBO path. However in case of non-CBO path the results does not match when the parameter is character
set hive.cbo.enable=false;
select avg('text'), sum('text')/count('text');
NULL	0.0

In general we tend to go with the CBO path.

I understand that Hive 2 worked differently but IMHO it is acceptable to have breaking changes when we step major versions. So if we want to fix something at this area we should fix the non-CBO path to have the same result as CBO path.

@okumin
Copy link
Contributor Author

okumin commented Apr 18, 2024

@kasakrisz Thanks for checking. I double-checked the SQL:2023 Part 2.

(A) 10.9 <aggregate function> -> Syntax Rules -> 7) -> g) mentions the specification of SUM and AVG. It says the type of value expression shall be numeric. So, if we strictly adhere to the standard, it could be best to throw an exception when a string type is specified.

(B) If we allow implicit cast here(I've not found the rule justifying it), it is reasonable to follow the rules of CAST. Based on 6.13 <cast specification> -> General Rules -> 8) -> b), it should throw an exception.

(C) If the implicit numeric cast of text can be null here(though it sounds like we accumulate too many assumptions, it can keep the current behavior...), based on 10.9 <aggregate function> -> General Rules -> 7) -> a), we should eliminate null values with warning. If the result set is empty, based on 10.9 <aggregate function> -> General Rules -> 7) -> d) -> ⅱ), it should return NULL.

So, if (A) we follow the standard strictly or (B) we accept implicit cast, SUM and AVG should fail.
If we justify (C) the null conversion, avg('text') and sum('text') should return NULL. NULL / 1 should be NULL based on 6.30 <numeric value expression> -> General Rules -> 1). So, in this case, all the cases, avg('text'), sum('text'), sum('text') / count('text') should be null with raising a warning.

In summary, if we accept a breaking change, I would say either of the following ways sounds consistent. What do you think?

  • We no longer accept the string type for AVG or SUM. It leads to a huge incompatibility, but we can follow the standard
  • We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.

@amansinha100
Copy link

Chiming in late on this topic ..
I think that the CBO path should be the main focus since it has been the default for quite some time now - with Hive 3 and Hive 4. In this path, to avoid breaking change, we should not throw an exception for the string argument type for the aggregate function even though the SQL standard suggests otherwise. In an ETL operation, there could be one row out of millions that has 'dirty' data with string type but Hive has been permissive for such data, otherwise the whole job would fail. If it means that we should let SUM('text') return NULL (similar to AVG('text')), it seems a better option than erroring out.

@kasakrisz
Copy link
Contributor

We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.

Yes, it seems reasonable because Hive doesn't follow the standard in the case of cast function either: Cast returns NULL instead of throwing exception due to the reasons Aman mentioned.

I think we can consider sum(string) as an overload of the standard sum function which should behave like sum(cast('text' as <numeric_type>))

@okumin
Copy link
Contributor Author

okumin commented May 2, 2024

Thank you both. I believe we strongly agree with some points.

  • We are not motivated to keep the current behaviors of the non-CBO mode. Those of CBO are rather preferred
  • We would like to accept texts as an argument of AVG or other UDFs here

Let me check AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) and CAST, and propose reasonable behaviors.

@okumin
Copy link
Contributor Author

okumin commented May 5, 2024

This table illustrates the summary of the current behavior tested by 468ea05. I will add my proposal in the next comment.

CBO Non-CBO Note
$SUM0(c_numeric) 228.0 228.0
$SUM0(CAST(c_numeric AS DOUBLE)) 228.0 228.0
$SUM0(c_non_numeric) 0.0 0.0
$SUM0(CAST(c_non_numeric AS DOUBLE)) 0.0 0.0
$SUM0(c_mix) 193.0 193.0
$SUM0(CAST(c_mix AS DOUBLE)) 193.0 193.0
AVG(c_numeric) 28.5 28.5
AVG(CAST(c_numeric AS DOUBLE)) 28.5 28.5
AVG(c_non_numeric) 0.0 NULL Inconsistent
AVG(CAST(c_non_numeric AS DOUBLE)) NULL NULL
AVG(c_mix) 24.125 32.167 Inconsistent
AVG(CAST(c_mix AS DOUBLE)) 32.167 32.167
STDDEV_POP(c_numeric) 47.344 47.344
STDDEV_POP(CAST(c_numeric AS DOUBLE)) 47.344 47.344
STDDEV_POP(c_non_numeric) NULL NULL
STDDEV_POP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
STDDEV_POP(c_mix) 48.401 53.524 Inconsistent
STDDEV_POP(CAST(c_mix AS DOUBLE)) 53.524 53.524
STDDEV_SAMP(c_numeric) 50.613 50.613
STDDEV_SAMP(CAST(c_numeric AS DOUBLE)) 50.613 50.613
STDDEV_SAMP(c_non_numeric) NULL NULL
STDDEV_SAMP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
STDDEV_SAMP(c_mix) 51.742 58.632 Inconsistent
STDDEV_SAMP(CAST(c_mix AS DOUBLE)) 58.632 58.632
VAR_POP(c_numeric) 2241.5 2241.5
VAR_POP(CAST(c_numeric AS DOUBLE)) 2241.5 2241.5
VAR_POP(c_non_numeric) NULL NULL
VAR_POP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
VAR_POP(c_mix) 2342.609 2864.806 Inconsistent
VAR_POP(CAST(c_mix AS DOUBLE)) 2864.806 2864.806
VAR_SAMP(c_numeric) 2561.714 2561.714
VAR_SAMP(CAST(c_numeric AS DOUBLE)) 2561.714 2561.714
VAR_SAMP(c_non_numeric) NULL NULL
VAR_SAMP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
VAR_SAMP(c_mix) 2677.268 3437.767 Inconsistent
VAR_SAMP(CAST(c_mix AS DOUBLE)) 3437.767 3437.767
SUM(c_numeric) 228.0 228.0
SUM(CAST(c_numeric AS DOUBLE)) 228.0 228.0
SUM(c_non_numeric) 0.0 0.0
SUM(CAST(c_non_numeric AS DOUBLE)) NULL NULL
SUM(c_mix) 193.0 193.0
SUM(CAST(c_mix AS DOUBLE)) 193.0 193.0
COUNT(c_numeric) 8 8
COUNT(CAST(c_numeric AS DOUBLE)) 8 8
COUNT(c_non_numeric) 8 8
COUNT(CAST(c_non_numeric AS DOUBLE)) 0 0
COUNT(c_mix) 8 8
COUNT(CAST(c_mix AS DOUBLE)) 6 6

@kasakrisz
Copy link
Contributor

@okumin
Thanks for the research and sharing the details of your results. I think these are the issues to solve:

  • as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path.
    Swapping these lines may help to solve this:
    ((SumDoubleAgg)agg).nonNullCount++;
    ((SumDoubleAgg)agg).sum += PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputOI);
  • Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.

Please add tests for sum(c_non_numeric) and sum(c_mix) if not exists.

@okumin
Copy link
Contributor Author

okumin commented May 13, 2024

@kasakrisz Thanks. I found cbo_rp_groupby3_noskew_multi_distinct.q fails and it turned out the root cause is not HiveAggregateReduceFunctionsRule but CBO Return Path + non-map-side aggregation. We may check the problem first.

#5245

Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.

My PoC is here. We need #5245 .
da7507d

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path.
Swapping these lines may help to solve this:

I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket? Literally, SUM is not a member accelerated by HiveAggregateReduceFunctionsRule. And if we add an explicit cast, at least all problems related to HiveAggregateReduceFunctionsRule will likely be gone. It is up to your convenience.

CBO Non-CBO Note
SUM(c_non_numeric) 0.0 0.0
SUM(CAST(c_non_numeric AS DOUBLE)) NULL NULL

@kasakrisz
Copy link
Contributor

kasakrisz commented May 13, 2024

@okumin

I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket?

It is ok to file another jira to track this.

Sorry I'm confused. Is #5245 somehow related to this patch? Could you please clarify.
Found the change in cbo_rp_groupby3_noskew_multi_distinct.q

@kasakrisz
Copy link
Contributor

@okumin
#5245 was meged to master.
Could you please rebase this patch.

@okumin
Copy link
Contributor Author

okumin commented Jun 4, 2024

Rebased & followed your comments. Waiting for CI to finish...
e78e27b

Copy link

sonarcloud bot commented Jun 4, 2024

Quality Gate Passed Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@@ -92,4 +92,4 @@ FROM src
POSTHOOK: type: QUERY
POSTHOOK: Input: default@src
#### A masked pattern was here ####
0.0 0.0 NULL NULL
0.0 NULL NULL NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okumin
Copy link
Contributor Author

okumin commented Jun 5, 2024

CI passed.

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path.
Swapping these lines may help to solve this:

BTW, I filed another ticket for this change.
https://issues.apache.org/jira/browse/HIVE-28302

Copy link
Contributor

@kasakrisz kasakrisz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@kasakrisz kasakrisz merged commit 529fd45 into apache:master Jun 5, 2024
6 checks passed
@okumin okumin deleted the HIVE-28082-agg-reduce branch June 5, 2024 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants