You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The improved grouping algorithm on #790 improves grouping performance in general for DataFusion and is also general in that it works for all types of keys.
However, @sundy-li noted on #790 (comment) that additional performance is likely possible by special casing "small" and fixed sized keys.
Describe the solution you'd like
From @sundy-li ' comment:
Introduce the variant hash methods would help in this case.
E.G:
Query which group by 3 columns, which are [u8, u8, u16], a fixed hash key U32 will be enough.
We can allocate one large fixed memory than multiple vec allocate.
@Dandandan also suggests that for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.
The text was updated successfully, but these errors were encountered:
If a column is nullable, we can use another byte to store the nullable bits.
If [u8, u8, u16] are all nullable, u64 key can be used.
How do you avoid that hashes are very similar to each other and only differ in some bits?
As of now each element gets hashed using ahash. With grouping the values in one value I am wondering whether it's good enough for the hashtable? Or would you hash that again?
With grouping the values in one value I am wondering whether it's good enough for the hashtable? Or would you hash that again?
We don't care about the rehash in hashmap, it's the problem of hashmap.
We just ensure the Key is unique (Fixed keys are represented as Number, String key can use hash256 method, don't need to care about the key conflict because it's as safe as crack the bank's password ).
With the specified key, we can get unique AggregateFunctionState from the HashMap, then we calculate/merge this row to the state. So the block is not modified by take, we just modified the state, and only need one function or expr for each aggregate function.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The improved grouping algorithm on #790 improves grouping performance in general for DataFusion and is also general in that it works for all types of keys.
However, @sundy-li noted on #790 (comment) that additional performance is likely possible by special casing "small" and fixed sized keys.
Describe the solution you'd like
From @sundy-li ' comment:
Introduce the variant hash methods would help in this case.
E.G:
Query which group by 3 columns, which are [u8, u8, u16], a fixed hash key U32 will be enough.
Refer:
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by.rs#L17-L36
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by_hash.rs#L264-L274
Alternate Ideas
@Dandandan also suggests that for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.
The text was updated successfully, but these errors were encountered: