Skip to content

Conversation

@metesynnada
Copy link
Contributor

@metesynnada metesynnada commented Aug 22, 2023

Which issue does this PR close?

Continue on #6679.

Rationale for this change

The current implementation of the JoinHashMap and SymmetricJoinHashMap types could benefit from being more generic and flexible. Specifically, the ability to support different types of list data structures for chaining, as well as handling resizing in a more idiomatic and efficient manner, would be advantageous. This PR introduces the JoinHashMapType trait and implements it for both JoinHashMap and PruningJoinHashMap, which allows for more code reuse and a clearer separation of concerns.

In this PR, Several unused hash join utilities are removed. Also, we can introduce a vectorized implementation of SymmetricHashJoin that includes hash collision checks.

What changes are included in this PR?

  • Added a JoinHashMapType trait with methods for handling the mutable map and mutable list, as well as a method as_any_mut for dynamic downcasting.
  • Implemented the JoinHashMapType trait for both JoinHashMap and PruningJoinHashMap.
  • Updated the update_hash function to use the JoinHashMapType trait and only resize the list in the case of PruningJoinHashMap.
  • Updated the build_equal_condition_join_indices function to use the JoinHashMapType trait and introduced an offset parameter.
  • Added inline comments and docstrings for better readability and documentation of the code.

I have removed smallvec completely from the code, but I am unsure whether or not to remove it from Cargo.toml.

Are these changes tested?

Yes, the changes are covered by the existing tests. No new tests were required as the new implementation preserves the existing functionality. All tests passed successfully after the changes were applied.

Are there any user-facing changes?

No, the changes made in this PR are internal and do not affect the public API or the functionality of the crate.

cc @Dandandan

@github-actions github-actions bot added the core Core DataFusion crate label Aug 22, 2023
@metesynnada metesynnada changed the title Upstream/prunable hash join Merge hash join implementations and remove leftover utilities Aug 22, 2023
@metesynnada metesynnada changed the title Merge hash join implementations and remove leftover utilities Merge hash table implementations and remove leftover utilities Aug 22, 2023
@Dandandan
Copy link
Contributor

Dandandan commented Aug 22, 2023

I have removed smallvec completely from the code, but I am unsure whether or not to remove it from Cargo.toml.

I think if this is the last usage, we should remove it from Cargo.toml as well

@Dandandan
Copy link
Contributor

Dandandan commented Aug 22, 2023

I wonder whether you're seeing performance improvements in symmetric / streaming hash join with this PR?

@metesynnada
Copy link
Contributor Author

I created a small benchmark for streaming using tpch data.

Benchmark streaming.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ apache_main ┃ upstream_prunable-hash-join ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │   1483.13ms │                   1483.27ms │     no change │
│ QQuery 2     │  11033.15ms │                   6903.41ms │ +1.60x faster │
└──────────────┴─────────────┴─────────────────────────────┴───────────────┘

First query is

SELECT
    o_orderkey
FROM
    orders,
    lineitem
WHERE
  o_orderdate = l_shipdate
  AND l_orderkey >= o_orderkey - 10
  AND l_orderkey < o_orderkey + 10
  AND l_returnflag = 'R'

and the second one is

SELECT
    o_orderkey
FROM
    orders,
    lineitem
WHERE
        o_orderstatus = l_linestatus
  AND l_orderkey >= o_orderkey - 10
  AND l_orderkey < o_orderkey + 10
  AND l_returnflag = 'R'
    LIMIT 10000;

The second query involves key pairs with low cardinality. While smallvec was effective in allocating new keys, deleting from it resulted in performance issues. With the removal of the smallvec mechanism in this PR, we have significantly improved performance.

metesynnada and others added 2 commits August 23, 2023 17:09
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to go from my perspective, what do you think @Dandandan?

@ozankabak
Copy link
Contributor

I will go ahead and merge this PR after CI passes. We will file a follow-on PR in case any other suggestions come in post merge

@ozankabak ozankabak merged commit da8a244 into apache:main Aug 23, 2023
@Dandandan
Copy link
Contributor

Thanks @metesynnada @ozankabak 🙏

@Dandandan
Copy link
Contributor

This is good to go from my perspective, what do you think @Dandandan?

Looks great, thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants