Skip to content

Hash join with direct indexing #816

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the range of the join keys is small and we have integer keys, instead of using a hashtable we can use a array/vec which starts on the minimum value of the values in the build-side key range.

Idea from:
duckdb/duckdb#1959

Describe the solution you'd like
Filter on

  • join constraint should be on single integer (boolean/byte also possible, as long as it is limited in domain) keys.
  • no duplicates (?) - not sure whether its required
  • If no duplicates - > we can see how much unique values we have -> should be less than max limit, e.g. less than 100K values.
  • Get statistics on min/max values. It should be a small enough build-side (e.g. max 100K key values)

If this is all true we can copy the offsets to somethig like a Vec<Option<u64>> (or arrow equivalent) which contains the offsets starting at the minimum offset at index 0. Each element is indexed by x - MIN(x).

Instead of hashing the right-side values and probing, we can compute y - MIN(x) for the right side and index directly into the array. Checking hash collisions is not necessary for this approach.

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceMake DataFusion faster

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions