Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize hash join collision check #50

Closed
Dandandan opened this issue Apr 24, 2021 · 4 comments · Fixed by #6724
Closed

Vectorize hash join collision check #50

Dandandan opened this issue Apr 24, 2021 · 4 comments · Fixed by #6724
Assignees
Labels
enhancement New feature or request performance

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Apr 24, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Further optimize the hash join algorithm

Describe the solution you'd like
There are a couple of optimizations we could implement:

  • Vectorize the row-equality check which now uses the equal_rows functions. We should be able to speed this up by vectorizing this, and also specialize it for handling non-null batches too. We probably can utilize the kernels take and eq here.
  • Don't use a Hashmap but a Vec (or similar) with a certain amount of buckets (proportional to the number of rows or the expected number of keys in the left side). I tried this, but as it causes much more collisions than we have currently, it causes a big (3x) slowdown, so vectorizing the collision check is a prerequisite.

Additional context

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487

@Dandandan Dandandan added the enhancement New feature or request label Apr 24, 2021
@Dandandan Dandandan changed the title Hash join further optimization / vectorization Hash join further vectorization Apr 24, 2021
@Dandandan Dandandan changed the title Hash join further vectorization Vectorize hash join collision check Apr 25, 2021
@Dandandan Dandandan self-assigned this Jun 19, 2023
@Dandandan
Copy link
Contributor Author

Dandandan commented Jun 19, 2023

FYI @metesynnada I'm working on a prototype.

Translated the code out of https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487 to this, which seems roughly equivalent:

    let mut to_check: Vec<(u64, usize)> = hash_values
        .iter()
        .enumerate()
        .flat_map(|(row, hash_value)| {
            build_hashmap
                .map
                .get(*hash_value, |(hash, _)| *hash_value == *hash)
                .map(|(_, v)| (*v - 1, row))
        })
        .collect();

    while to_check.len() > 0 {
        // TODO Perform column-wise (vectorized) equality check

        // check next items
        to_check = to_check
            .iter()
            .flat_map(|(index, row)| {
                let next = build_hashmap.next[*index as usize];
                (next != 0).then(|| (next - 1, *row))
            })
            .collect();
    }

@metesynnada
Copy link
Contributor

I will be thinking about this as well. Building an array for the equality check might improve the equality check performance, however, compute::take will copy the elements so I am not sure without a performance test on different cases.

@Dandandan
Copy link
Contributor Author

Yeah - we sure have to test real-world performance :)

@Dandandan
Copy link
Contributor Author

Dandandan commented Jun 20, 2023

Looks like on TPC-H we get some nice improvements overall #6724

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants