-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the range of the join keys is small and we have integer keys, instead of using a hashtable we can use a array/vec which starts on the minimum value of the values in the build-side key range.
Idea from:
duckdb/duckdb#1959
Describe the solution you'd like
Filter on
- join constraint should be on single integer (boolean/byte also possible, as long as it is limited in domain) keys.
- no duplicates (?) - not sure whether its required
- If no duplicates - > we can see how much unique values we have -> should be less than max limit, e.g. less than 100K values.
- Get statistics on min/max values. It should be a small enough build-side (e.g. max 100K key values)
If this is all true we can copy the offsets to somethig like a Vec<Option<u64>> (or arrow equivalent) which contains the offsets starting at the minimum offset at index 0. Each element is indexed by x - MIN(x).
Instead of hashing the right-side values and probing, we can compute y - MIN(x) for the right side and index directly into the array. Checking hash collisions is not necessary for this approach.
Describe alternatives you've considered
Additional context