Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented May 2, 2021

Which issue does this PR close?

Closes #235
Closes #239
Closes #143

Rationale for this change

Fixes behavior of left join with regard to multiple batches on the right side and 0 right side batches.

This way it also also likely much faster as it avoids keeping generating / keeping / indexing into a HashSet for each right side batch.

What changes are included in this PR?

  • We add a Vec<bool> to keep track of left-side rows that didn't match with the right side. (number o left rows bytes extra memory usage for left joins, not that much compared to storing the left side data and hashmap + indices. It could be more memory-efficient by using a bitmap instead.
  • Generate rows for any item that is not marked as visited.
  • Add some tests for 0 / multiple batches on the right

Are there any user-facing changes?

Only correctness improvements

@Dandandan Dandandan requested a review from andygrove May 2, 2021 11:08
@codecov-commenter
Copy link

codecov-commenter commented May 2, 2021

Codecov Report

Merging #238 (ab760af) into master (c945b03) will increase coverage by 0.30%.
The diff coverage is 96.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #238      +/-   ##
==========================================
+ Coverage   76.46%   76.76%   +0.30%     
==========================================
  Files         135      134       -1     
  Lines       23250    23248       -2     
==========================================
+ Hits        17777    17847      +70     
+ Misses       5473     5401      -72     
Impacted Files Coverage Δ
datafusion/src/physical_plan/hash_utils.rs 100.00% <ø> (ø)
datafusion/src/physical_plan/hash_join.rs 87.21% <96.42%> (+1.27%) ⬆️
ballista/rust/core/src/client.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/main.rs 0.00% <0.00%> (ø)
ballista/rust/scheduler/src/main.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/execution_loop.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/flight_service.rs 0.00% <0.00%> (ø)
...ust/core/src/execution_plans/unresolved_shuffle.rs 50.00% <0.00%> (ø)
ballista/rust/executor/src/lib.rs
ballista/rust/scheduler/src/lib.rs 21.13% <0.00%> (+1.62%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c945b03...ab760af. Read the comment docs.

@Dandandan Dandandan requested a review from jorgecarleitao May 2, 2021 11:31
@Dandandan Dandandan changed the title Fix left join unmatched rows Fix Left join implementation is incorrect for 0 or multiple batches on the right side May 2, 2021
@Dandandan Dandandan mentioned this pull request May 2, 2021
3 tasks
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I pulled this branch locally and confirmed that this fixes the failing tests. Thanks, @Dandandan!

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yap, the implementation makes a lot of sense. Thanks a lot @Dandandan !

@Dandandan
Copy link
Contributor Author

Merging this when it's green

@Dandandan Dandandan merged commit 3072df6 into apache:master May 2, 2021
@houqp houqp added the bug Something isn't working label Jul 31, 2021
mustafasrepo added a commit that referenced this pull request Dec 25, 2023
* Add first reverse support for partitioned join conversion

* Minor changes

* minor changes

* Minor changes

* Add ordering requirement propogation

* Remove wrong check

* Simplifications

* Simplifications

* Minor changes

* Minor changes

* add test case

* Review

* Propagate group by and aggregate through join

* Minor changes

* Minor changes

* Simplifications

* Buggy state

* Minor changes

* Simplifications

* Add comments

* Update comments

* Update join_pipeline_selection.rs

* Mini

* Update comments

* Fix formatting

* Review

---------

Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com>
Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
mustafasrepo added a commit that referenced this pull request Dec 25, 2023
* Add first reverse support for partitioned join conversion

* Minor changes

* minor changes

* Minor changes

* Add ordering requirement propogation

* Remove wrong check

* Simplifications

* Simplifications

* Minor changes

* Minor changes

* add test case

* Review

* Propagate group by and aggregate through join

* Minor changes

* Minor changes

* Simplifications

* Buggy state

* Minor changes

* Simplifications

* Add comments

* Update comments

* Update join_pipeline_selection.rs

* Mini

* Update comments

* Fix formatting

* Review

---------

Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com>
Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
* Only maps FIXED_LEN_BYTE_ARRAY to String for uuid type

* remove redundant code

---------

Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>
H0TB0X420 pushed a commit to H0TB0X420/datafusion that referenced this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

6 participants