[VL] Flushable distinct agg caused correctness issue #4421
It seems the aggregation before partial_count should not be flushable. Can we rely solely on whether the data is sent directly to an exchange to decide whether a flushable aggregation can be generated? cc @zhztheplayer
When the aggregate function list is empty, we should not use a flushable aggregation, as it may produce duplicate rows.
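A minimal sketch (plain Python, hypothetical operator and parameter names, not Gluten/Velox code) of why a flushable partial distinct aggregation can emit duplicate rows: once the operator flushes its seen-keys state to bound memory, a key that arrives again after the flush is emitted a second time, so a downstream partial_count over-counts.

```python
def flushable_partial_distinct(rows, flush_threshold):
    """Yield each key the first time it is seen, but clear the seen-set
    whenever it grows past flush_threshold (a stand-in for the memory
    pressure that triggers a flush in a real flushable aggregation)."""
    seen = set()
    for key in rows:
        if key not in seen:
            seen.add(key)
            yield key
        if len(seen) >= flush_threshold:
            seen.clear()  # flush: downstream may see this key again

rows = [1, 2, 3, 1, 2, 3]
out = list(flushable_partial_distinct(rows, flush_threshold=3))
# The flush lets 1, 2, 3 through twice (out == [1, 2, 3, 1, 2, 3]).
# A downstream partial_count that assumed distinct input would count 6
# instead of 3 unless a later stage re-deduplicates by key.
```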
Thank you for reporting @Yohahaha. I think this is very likely similar to the possible corner case I mentioned in #4312 (comment).
We should enhance FlushableAggregateRule to handle this case. Our CI tests with result comparisons probably don't process enough data to trigger a flush. I'll take a look into that as well.
Flushing on distinct aggregation can be useful, e.g., flushing on the scan side and then performing the final distinct on the reducers. The problem here is that Spark has specialized logic for count(distinct); more generally, though, it comes down to the aggregate-key distribution. If there is an exchange on the output side of the aggregation, and that exchange is enforced to redistribute rows belonging to the same aggregate group, even when produced on different mappers, to the same reducer, then the partial aggregation can be considered safe to flush.
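The safety condition described above can be sketched in plain Python (hypothetical function names, not the actual Gluten/Velox implementation): a mapper-side partial count that flushes mid-stream is still correct as long as a downstream exchange hash-partitions the partial results by key, because the final aggregation on each reducer merges all duplicate flushed groups for a key.

```python
from collections import defaultdict

def partial_count_with_flush(rows, flush_threshold):
    """Mapper-side partial count per key; flushes its hash table whenever
    it reaches flush_threshold groups, so the same key may be emitted in
    several (key, partial_count) pairs."""
    acc = defaultdict(int)
    for key in rows:
        acc[key] += 1
        if len(acc) >= flush_threshold:
            yield from acc.items()
            acc.clear()  # flush partial groups downstream
    yield from acc.items()

def exchange_and_final_count(mapper_outputs, num_reducers=2):
    """Hash-partition partial results by key, then sum per reducer.
    Every partial for a given key lands on the same reducer, so the
    duplicate groups produced by flushing are merged correctly."""
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for partials in mapper_outputs:
        for key, cnt in partials:
            reducers[hash(key) % num_reducers][key] += cnt
    merged = {}
    for r in reducers:
        merged.update(r)  # keys are disjoint across reducers
    return merged

mappers = [["a", "b", "a"], ["b", "a", "c"]]
partials = [list(partial_count_with_flush(m, 2)) for m in mappers]
result = exchange_and_final_count(partials)
# result == {"a": 3, "b": 2, "c": 1}: correct despite flushing, because
# the exchange re-groups all partials for each key onto one reducer.
```

Without the key-based exchange (e.g., Spark's specialized count(distinct) path, where the partial count happens before such a redistribution), the duplicate flushed groups are never merged and the result is inflated.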
Update: the issue likely affects Spark versions >= 3.3.
Backend
VL (Velox)
Bug description
TPC-DS q38 contains a distinct aggregation; when the partial distinct aggregation is flushed, the final result is incorrect.
expect: [26056]
actual: [26131]
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response