Is your feature request related to a problem or challenge?
#6904 introduces some fancy new hashing and ways to implement aggregates
min/max for strings (`StringArray`, `LargeStringArray`, etc.) now uses the slower `Accumulator` implementation, which could be made much faster
Describe the solution you'd like
I would like to implement a fast GroupsAccumulator for Min/Max
Describe alternatives you've considered
here is one potential way to implement it:
We could store the current minimum for all groups in the same Rows 🤔 and track an index into that Rows for the current minimum for each group.
This would require an extra copy of the input values, but it could probably be vectorized pretty well, as shown in the following diagram.
Sorry, what I meant was something like the following, where the accumulator stores only the current minimum values.
This approach could leave `min_storage` full of "garbage" if many batches produce new minimums, but I think we could heuristically "compact" `min_storage` when it gets too large (if it held 2*num_groups rows, for example)
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│ Accumulator │
state
┌─────────┐ ┌─────────┐ │ ┌─────────┐ ┌─────────┐ │
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │
│ │ A │ │ │ │ A │─┼────┐ │ │ │ D │ │ ┌─────┼─│ 1 │ │ │
│ ├─────┤ │ │ ├─────┤ │ │ │ ├─────┤ │ │ │ ├─────┤ │
│ │ B │ │ │ │ B │ │ └──┼─┼▶│ A │◀┼────┘ │ │ 0 │ │ │
│ ├─────┤ │ │ ├─────┤ │ │ └─────┘ │ │ └─────┘ │
│ │ A │ │ │ │ A │ │ │ │ │ │ │ │
│ ├─────┤ │ │ ├─────┤ │ │ │ │ │
│ │ A │ │ │ │ A │ │ │ │ │ │ │ │
│ ├─────┤ │ │ ├─────┤ │ │ │ │ │
│ │ C │ │ │ │ C │ │ │ │ │ │ │ │
│ └─────┘ │ │ └─────┘ │ │ │ │ │
└─────────┘ └─────────┘ │ └─────────┘ └─────────┘ │
input input │ min_storage: min_values │
values values Rows
(Array) (Rows) └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
step 1: step 2: for
convert any value step 3: min value
arguments to that is a new (per group) is
Row format group tracked as an
minimum, copy index into
it to a min_storage `Rows`
second `Rows`
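The scheme in the diagram can be sketched roughly as follows. This is a minimal illustration only, not the DataFusion implementation: `MinStorage` is a hypothetical stand-in for an arrow `Rows` buffer (plain bytes plus offsets), `MinBytesAccumulator` and its methods are made-up names, the `update_batch` signature is simplified rather than the real `GroupsAccumulator` one, nulls are ignored, and the "more than 2*num_groups rows" compaction threshold is just the heuristic suggested above.

```rust
/// Hypothetical stand-in for arrow's `Rows`: byte values appended back to
/// back in one buffer, addressed through an offsets vector.
struct MinStorage {
    data: Vec<u8>,
    offsets: Vec<usize>, // value i lives in data[offsets[i]..offsets[i + 1]]
}

impl MinStorage {
    fn new() -> Self {
        Self { data: Vec::new(), offsets: vec![0] }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn get(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Appends a value and returns its index.
    fn push(&mut self, value: &[u8]) -> usize {
        self.data.extend_from_slice(value);
        self.offsets.push(self.data.len());
        self.len() - 1
    }
}

/// Hypothetical accumulator: one optional index into `min_storage` per group.
struct MinBytesAccumulator {
    min_storage: MinStorage,
    min_values: Vec<Option<usize>>,
}

impl MinBytesAccumulator {
    fn new() -> Self {
        Self { min_storage: MinStorage::new(), min_values: Vec::new() }
    }

    /// Simplified update: `values[i]` belongs to group `group_indices[i]`.
    fn update_batch(&mut self, values: &[&[u8]], group_indices: &[usize], num_groups: usize) {
        self.min_values.resize(num_groups, None);
        for (&value, &group) in values.iter().zip(group_indices) {
            let is_new_min = match self.min_values[group] {
                Some(idx) => value < self.min_storage.get(idx),
                None => true,
            };
            if is_new_min {
                // The previous minimum stays behind in storage as garbage.
                let idx = self.min_storage.push(value);
                self.min_values[group] = Some(idx);
            }
        }
        // Assumed heuristic: compact once storage exceeds 2 * num_groups rows.
        if self.min_storage.len() > 2 * num_groups {
            self.compact();
        }
    }

    /// Copies only the live minimums into fresh storage and re-points indexes.
    fn compact(&mut self) {
        let mut fresh = MinStorage::new();
        for slot in self.min_values.iter_mut() {
            if let Some(idx) = *slot {
                *slot = Some(fresh.push(self.min_storage.get(idx)));
            }
        }
        self.min_storage = fresh;
    }

    /// Current minimum for `group`, if the group has seen any value.
    fn min(&self, group: usize) -> Option<&[u8]> {
        self.min_values
            .get(group)
            .copied()
            .flatten()
            .map(|idx| self.min_storage.get(idx))
    }
}
```

Feeding batches through `update_batch` and then reading `min(group)` gives the per-group minimum. A real version would compare values in the row format produced by the converter instead of raw byte slices, and would do the comparison and copy steps batch-wise so they vectorize.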
alamb changed the title from "Implement fast min/max accumulator for strings (now it uses the slower path)" to "Implement fast min/max accumulator for binary / strings (now it uses the slower path)" on Jan 8, 2024
See #6800 (comment) for more details
Additional context
No response