Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFusion] Optimize hash join inner workings, null handling fix #24

Merged
merged 19 commits into from
Apr 28, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Apr 21, 2021

> 5% speed up for query 5 by speeding up some hot functions

  • Don't store hashes in hashtable and allocate capacity upfront (adding some memory usage for non-high cardinality inputs, but this grows at the same rate as the left side input). This avoids re-hashing (copying + calling hash functions) when resizing the hash table
  • Do some more optimization for hashing primitives and avoid .value(i) which does a bound check.
  • Don't call hash_combine on joins with one column
  • Fix null equality / enable test

Closes #44
Fixes #195

PR

Query 5 avg time: 88.59 ms

Master

Query 5 avg time: 95.91 ms

@Dandandan Dandandan changed the title Micro optimize hash join functions [DataFusion] Micro optimize hash join functions Apr 21, 2021
@Dandandan Dandandan changed the title [DataFusion] Micro optimize hash join functions [DataFusion] [WIP] Micro optimize hash join functions Apr 21, 2021
@Dandandan Dandandan changed the title [DataFusion] [WIP] Micro optimize hash join functions [DataFusion] Micro optimize hash join functions Apr 21, 2021
@Dandandan Dandandan changed the title [DataFusion] Micro optimize hash join functions [DataFusion] Optimize hash join inner workings Apr 21, 2021
@codecov-commenter
Copy link

codecov-commenter commented Apr 24, 2021

Codecov Report

Merging #24 (e1c0a4e) into master (245f0b8) will increase coverage by 0.16%.
The diff coverage is 63.00%.

❗ Current head e1c0a4e differs from pull request most recent head 0e1bdb4. Consider uploading reports for the commit 0e1bdb4 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
+ Coverage   76.24%   76.41%   +0.16%     
==========================================
  Files         134      134              
  Lines       23051    23181     +130     
==========================================
+ Hits        17576    17714     +138     
+ Misses       5475     5467       -8     
Impacted Files Coverage Δ
ballista/rust/client/src/context.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/main.rs 0.00% <0.00%> (ø)
benchmarks/src/bin/nyctaxi.rs 0.00% <ø> (ø)
datafusion-examples/examples/flight_server.rs 0.00% <0.00%> (ø)
datafusion-examples/examples/simple_udaf.rs 0.00% <ø> (ø)
datafusion/src/physical_plan/expressions/case.rs 72.91% <16.66%> (-0.39%) ⬇️
benchmarks/src/bin/tpch.rs 35.35% <33.33%> (+0.20%) ⬆️
datafusion/src/datasource/csv.rs 73.33% <69.86%> (-9.28%) ⬇️
datafusion/src/physical_plan/csv.rs 79.09% <75.72%> (-4.13%) ⬇️
datafusion/src/physical_plan/hash_join.rs 85.94% <78.57%> (-0.49%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 245f0b8...0e1bdb4. Read the comment docs.

@Dandandan Dandandan changed the title [DataFusion] Optimize hash join inner workings [DataFusion] Optimize hash join inner workings, null fix Apr 24, 2021
@Dandandan Dandandan changed the title [DataFusion] Optimize hash join inner workings, null fix [DataFusion] Optimize hash join inner workings, null hanling fix Apr 24, 2021
@Dandandan Dandandan changed the title [DataFusion] Optimize hash join inner workings, null hanling fix [DataFusion] Optimize hash join inner workings, null handling fix Apr 24, 2021
@Dandandan
Copy link
Contributor Author

@alamb @jorgecarleitao FYI if you have time, I am not planning any changes to this PR for now.

I think the more important part of the PR is the fix wrt null handling which turned out to be a one line change.

The other part is some small performance tweaks and prepares it a bit for further improvements down the road.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, @Dandandan and thanks for the nudge.

I went through this. I have a change that I think we should do it, but otherwise looks great. Super cool.

datafusion/src/physical_plan/hash_join.rs Outdated Show resolved Hide resolved
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();

match (left_array.is_null($left), left_array.is_null($right)) {
(true, true) => true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the line, right? Damm, so many wrong things for this one.

Really great finding, @Dandandan !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! To be fair I also added this line myself 😆

I think we also should start testing DataFusion more rigorously wrt null handling, but one step at a time.

Copy link
Contributor Author

@Dandandan Dandandan Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it is, I noticed the values method doesn't take offset into account right now, so this is maybe an optimization worth doing once hashing is included as arrow kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣

Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Sorry for the delay @Dandandan -- I am about out of time today, but I will take a careful look tomorrow morning

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused about the lack of keys in the hash map, but otherwise things look good to me

datafusion/src/physical_plan/hash_join.rs Show resolved Hide resolved
datafusion/src/physical_plan/hash_join.rs Show resolved Hide resolved
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();

match (left_array.is_null($left), left_array.is_null($right)) {
(true, true) => true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣

}
}
} else {
if $multi_col {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special casing multi-value join columns is a good one

@alamb alamb merged commit 3371574 into apache:master Apr 28, 2021
@houqp houqp added ballista bug Something isn't working datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 29, 2021
andygrove added a commit that referenced this pull request Jan 12, 2023
* Initial commit

* initial commit

* failing test

* table scan projection

* closer

* test passes, with some hacks

* use DataFrame (#2)

* update README

* update dependency

* code cleanup (#3)

* Add support for Filter operator and BinaryOp expressions (#4)

* GitHub action (#5)

* Split code into producer and consumer modules (#6)

* Support more functions and scalar types (#7)

* Use substrait 0.1 and datafusion 8.0 (#8)

* use substrait 0.1

* use datafusion 8.0

* update datafusion to 10.0 and substrait to 0.2 (#11)

* Add basic join support (#12)

* Added fetch support (#23)

Added fetch to consumer

Added limit to producer

Added unit tests for limit

Added roundtrip_fill_none() for testing when None input can be converted to 0

Update src/consumer.rs

Co-authored-by: Andy Grove <andygrove73@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>

* Upgrade to DataFusion 13.0.0 (#25)

* Add sort consumer and producer (#24)

Add consumer

Add producer and test

Modified error string

* Add serializer/deserializer (#26)

* Add plan and function extension support (#27)

* Add plan and function extension support

* Removed unwraps

* Implement GROUP BY (#28)

* Add consumer, producer and tests for aggregate relation

Change function extension registration from absolute to relative anchor
(reference)

Remove operator to/from reference

* Fixed function registration bug

* Add test

* Addressed PR comments

* Changed field reference from mask to direct reference (#29)

* Changed field reference from masked reference to direct reference

* Handle unsupported case (struct with child)

* Handle SubqueryAlias (#30)

Fixed aggregate function register bug

* Add support for SELECT DISTINCT (#31)

Add test case

* Implement BETWEEN (#32)

* Add case (#33)

* Implement CASE WHEN

* Add more case to test

* Addressed comments

* feat: support explicit catalog/schema names in ReadRel (#34)

* feat: support explicit catalog/schema names in ReadRel

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix: use re-exported expr crate

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* move files to subfolder

* RAT

* remove rust.yaml

* revert .gitignore changes

* tomlfmt

* tomlfmt

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: JanKaul <jankaul@mailbox.org>
Co-authored-by: nseekhao <37189615+nseekhao@users.noreply.github.com>
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
alamb pushed a commit that referenced this pull request Aug 8, 2024
…calculations, limit/order/distinct (#11756)

* Fix unparser derived table with columns include calculations, limit/order/distinct (#24)

* compare format output to make sure the two level of projects match

* add method to find inner projection that could be nested under limit/order/distinct

* use format! for matching in unparser sort optimization too

* refactor

* use to_string and also put comments in

* clippy

* fix unparser derived table contains cast (#25)

* fix unparser derived table contains cast

* remove dbg
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix null handling hash join Optimize hash join inner workings
5 participants