Fix TopK DISTINCT aggregation preserving NULLs#22571
Conversation
alamb
left a comment
There was a problem hiding this comment.
Looks good to me overall @kumarUjjawal -- i had a few questions
I see you say
The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.
I don't have any sense of how complicated this would be / not be -- did you try it?
Also, perhaps @avantgardnerio you can help review this PR as I think you are familiar with the original logic
|
|
||
| impl GroupedTopKAggregateStream { | ||
| fn is_distinct(&self) -> bool { | ||
| self.aggregate_arguments.is_empty() |
There was a problem hiding this comment.
I think you can get no aggregates for just a normal SELECT x,y,z GROUP BY x,y,z type query, Though maybe that has the same semantics
| for row_idx in 0..len { | ||
| if has_nulls && vals.is_null(row_idx) { | ||
| if self.is_distinct() { | ||
| self.null_group_seen = true; |
There was a problem hiding this comment.
this is in the (hot) inner loop and I worry about the performance implications of this change
Is it enough to check outside the loop? something like
let has_nulls = vals.null_count() > 0;
if has_nulls && self.is_distinct() {
self.null_group_seen = true;
}
for row_idx in 0..len {
...
}| // For DISTINCT case (no aggregate expressions), only use the group key column | ||
| // since the schema only has one field and key/value are the same | ||
| if self.aggregate_arguments.is_empty() { | ||
| if self.is_distinct() { |
There was a problem hiding this comment.
can we try and encapsulate some of this logic in a helper function perhaps to try and keep this code easy to read?
It would also perhaps help to add some comments here about why concat is needed
I did try that as my first approach but I find it broad and since the only question I needed to answer was did we see a NULL group key and since SQL DISTINCT has only one NULL group, a boolean seemed good choice. Let me know what you think, I can go back to the other approach. |
Which issue does this PR close?
Rationale for this change
TopK aggregation dropped NULL group keys for ordered DISTINCT queries.
For example,
SELECT DISTINCT v FROM t ORDER BY v ASC NULLS FIRST LIMIT 1could return an empty string instead of NULL when TopK aggregation was enabled.What changes are included in this PR?
This PR preserves NULL group keys for DISTINCT TopK aggregation by tracking whether a NULL group key was seen separately from the heap.
The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.
The stream also now marks itself done after emitting, so NULL-only DISTINCT results are emitted once and do not repeat.
Are these changes tested?
Yes
Are there any user-facing changes?
No API Change