Refactor the Hash Join #4377

liukun4515 · 2022-11-25T14:23:26Z

Which issue does this PR close?

Closes #4355
Closes #4247
Closes #4356

Rationale for this change

What changes are included in this PR?

The changes are described in issue #4356:

combine the logic of building left/right indices
use the join type to adjust left/right indices
output the result according to the join type
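
As a rough illustration of the second step, here is a minimal sketch that uses plain vectors instead of Arrow arrays. `JoinKind`, `adjust_indices`, and the `Option<u64>` representation of a null left index are hypothetical names chosen for this example; they are not the code in this PR, and only a reduced set of join types is covered.

```rust
/// Hypothetical, reduced set of join types used only for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JoinKind {
    Inner,
    Right,
    RightSemi,
    RightAnti,
}

/// Adjusts matched (left, right) index pairs according to the join type.
/// `right_len` is the number of rows in the current probe-side (right) batch.
/// A `None` left index stands for a null row on the left side.
pub fn adjust_indices(
    kind: JoinKind,
    left: Vec<u64>,
    right: Vec<u32>,
    right_len: usize,
) -> (Vec<Option<u64>>, Vec<u32>) {
    // Record which right rows found at least one match.
    let mut matched = vec![false; right_len];
    for &r in &right {
        matched[r as usize] = true;
    }
    match kind {
        // Inner join: the matched pairs are already the answer.
        JoinKind::Inner => (left.into_iter().map(Some).collect(), right),
        // Right join: keep the matched pairs, then append every unmatched
        // right row paired with a null left index.
        JoinKind::Right => {
            let mut new_left: Vec<Option<u64>> = left.into_iter().map(Some).collect();
            let mut new_right = right;
            for (row, &seen) in matched.iter().enumerate() {
                if !seen {
                    new_left.push(None);
                    new_right.push(row as u32);
                }
            }
            (new_left, new_right)
        }
        // Right semi / anti: each right row appears at most once, and the
        // left side is not part of the output (modelled here as nulls).
        JoinKind::RightSemi | JoinKind::RightAnti => {
            let keep_matched = kind == JoinKind::RightSemi;
            let rows: Vec<u32> = matched
                .iter()
                .enumerate()
                .filter_map(|(row, &seen)| {
                    if seen == keep_matched {
                        Some(row as u32)
                    } else {
                        None
                    }
                })
                .collect();
            (vec![None; rows.len()], rows)
        }
    }
}
```

The point of the sketch is that index building is shared by all join types, and only this adjustment step depends on the join type.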

Are these changes tested?

Are there any user-facing changes?

liukun4515 · 2022-11-25T14:28:27Z

datafusion/core/src/physical_plan/joins/hash_join.rs

-                    Some(result.map(|x| x.0))
                }
+                // the right side has been consumed
+                // TODO: Some(Err) case


resolved by https://github.com/apache/arrow-datafusion/pull/4373/files

liukun4515 · 2022-11-25T14:29:34Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+                            Some(result)
+                        }
+                        Err(_) => {
+                            // TODO why the type of result stream is `Result<T, ArrowError>`, and not the `DataFusionError`


@alamb @Dandandan why do we use ArrowError here instead of DataFusionError?

Is it the same as #4172?

yes

liukun4515 · 2022-11-25T14:30:12Z

cc @alamb @Dandandan @mingmwang @jackwener PTAL

alamb · 2022-11-26T13:02:09Z

I will try to find time to review this carefully over the next few days -- joins are a complicated subject, so thanks for taking them on. However, they aren't a very high priority for my day job at InfluxData yet, so finding time to review this kind of PR is hard for me.

jackwener · 2022-11-26T13:07:26Z

I will review this PR carefully tomorrow, thanks @liukun4515

liukun4515 · 2022-11-27T03:28:59Z

datafusion/core/tests/sql/joins.rs


 #[tokio::test]
-#[ignore = "Test ignored, will be enabled after fixing right semi join bug"]
-// https://github.com/apache/arrow-datafusion/issues/4247


This fixes the right semi join bug. cc @mingmwang

jackwener

Great job! 👍 The whole work is very clear.

jackwener · 2022-11-27T09:06:07Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+// "+----+----+-----+----+----+-----+"
+// And the result of left and right indices
+// left indices:  5,6,6,4
+// right indices: 3,4,5,3


jackwener · 2022-11-27T13:02:27Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+    RecordBatch::try_new(Arc::new(schema.clone()), columns)
 }

+// Get left and right indices which is satisfies the on condition in the Join


It is easy to forget that the filter and the equal condition are both part of the on condition, and that they are implicitly combined with AND.

Suggested change

// Get left and right indices which is satisfies the on condition in the Join

// Get left and right indices which satisfy the on condition (including equal_condition and filter_in_join) in the Join
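
To make the "equal condition AND filter" point concrete, here is a minimal sketch with plain slices and a `HashMap` as the build-side table. `matching_indices` and the closure-based `filter` parameter are hypothetical and only illustrate the idea; the real code works on Arrow arrays and a `JoinFilter`.

```rust
use std::collections::HashMap;

/// Returns the (left_index, right_index) pairs that satisfy the full on
/// condition: equality of the join keys AND the extra join filter.
pub fn matching_indices<F>(
    left_keys: &[i32],
    right_keys: &[i32],
    filter: F,
) -> Vec<(usize, usize)>
where
    F: Fn(usize, usize) -> bool,
{
    // Build side: map each key to the left rows that hold it.
    let mut table: HashMap<i32, Vec<usize>> = HashMap::new();
    for (row, &key) in left_keys.iter().enumerate() {
        table.entry(key).or_default().push(row);
    }

    // Probe side: the equality condition finds candidate pairs, and the
    // filter is then applied to each candidate (implicit AND).
    let mut pairs = Vec::new();
    for (right_row, key) in right_keys.iter().enumerate() {
        if let Some(left_rows) = table.get(key) {
            for &left_row in left_rows {
                if filter(left_row, right_row) {
                    pairs.push((left_row, right_row));
                }
            }
        }
    }
    pairs
}
```

A pair that passes the equality check but fails the filter is treated exactly like a pair that never matched, which matters for the outer, semi, and anti join types.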

jackwener · 2022-11-27T13:38:23Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+        JoinType::LeftSemi | JoinType::LeftAnti => {
+            // matched or unmatched left row will be produced in the end of loop
+            (
+                UInt64Array::from_iter_values(vec![]),
+                UInt32Array::from_iter_values(vec![]),
+            )


We could add a TODO noting a possible optimization here, because a semi join doesn't need to wait until the end.
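
A minimal sketch of that idea, assuming a per-row `visited` bitmap for the build (left) side; the function names are hypothetical and this is not what the PR implements. A left semi join could emit a left row the first time it is matched, while a left anti join still has to wait until the probe side is exhausted.

```rust
/// Records the matches found for one probe batch and returns the left rows
/// that were matched for the first time. A left semi join could emit these
/// rows immediately instead of waiting for the end of the probe side.
pub fn newly_matched_left_rows(visited: &mut [bool], matched_left: &[usize]) -> Vec<usize> {
    let mut fresh = Vec::new();
    for &row in matched_left {
        if !visited[row] {
            visited[row] = true;
            fresh.push(row);
        }
    }
    fresh
}

/// Left anti join cannot be emitted early: a row is only known to be
/// unmatched once every probe batch has been processed.
pub fn unmatched_left_rows(visited: &[bool]) -> Vec<usize> {
    visited
        .iter()
        .enumerate()
        .filter_map(|(row, &seen)| if seen { None } else { Some(row) })
        .collect()
}
```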

alamb

Thank you @liukun4515 -- I reviewed this code and tests carefully; I found it easier to read / understand. 🏅

alamb · 2022-11-29T16:23:02Z

datafusion/core/src/physical_plan/joins/hash_join.rs

-                PrimitiveArray::<UInt32Type>::from(right),
-            ))
-        }
-        JoinType::RightSemi => {


alamb · 2022-11-29T16:29:49Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+        let matched_size = left_indices.len();
+        let unmatched_size = appended_right_indices.len();
+        let total_size = matched_size + unmatched_size;
+        // the new left indices: left_indices + null array
+        // the new right indices: right_indices + appended_right_indices
+        let new_left_indices = (0..total_size)
+            .map(|pos| {
+                if pos < matched_size {
+                    unsafe { Some(left_indices.value_unchecked(pos)) }
+                } else {
+                    None
+                }
+            })
+            .collect::<UInt64Array>();
+        let new_right_indices = (0..total_size)
+            .map(|pos| {
+                if pos < matched_size {
+                    unsafe { Some(right_indices.value_unchecked(pos)) }
+                } else {
+                    unsafe {
+                        Some(appended_right_indices.value_unchecked(pos - matched_size))
+                    }
+                }
+            })
+            .collect::<UInt32Array>();
+        (new_left_indices, new_right_indices)


I think you might be able to do this without unsafe and more concisely:

Suggested change

        let matched_size = left_indices.len();
        let unmatched_size = appended_right_indices.len();
        // the new left indices: left_indices + null array
        // the new right indices: right_indices + appended_right_indices
        let new_left_indices = left_indices
            .iter()
            .chain(std::iter::repeat(None).take(unmatched_size))
            .collect::<UInt64Array>();
        let new_right_indices = right_indices
            .iter()
            .chain(appended_right_indices.iter())
            .collect::<UInt32Array>();
        (new_left_indices, new_right_indices)
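
The same `chain` / `repeat` shape can be tried out without the arrow crate. The sketch below uses plain slices and `Vec<Option<u64>>` as a stand-in for a nullable `UInt64Array`; `append_unmatched_right` is a hypothetical name used only for this illustration.

```rust
/// Appends the unmatched right rows to the matched pairs, pairing each of
/// them with a null (`None`) left index, without any `unsafe` access.
pub fn append_unmatched_right(
    left_indices: &[u64],
    right_indices: &[u32],
    appended_right_indices: &[u32],
) -> (Vec<Option<u64>>, Vec<u32>) {
    let unmatched_size = appended_right_indices.len();
    // the new left indices: left_indices + nulls
    let new_left_indices: Vec<Option<u64>> = left_indices
        .iter()
        .copied()
        .map(Some)
        .chain(std::iter::repeat(None).take(unmatched_size))
        .collect();
    // the new right indices: right_indices + appended_right_indices
    let new_right_indices: Vec<u32> = right_indices
        .iter()
        .chain(appended_right_indices.iter())
        .copied()
        .collect();
    (new_left_indices, new_right_indices)
}
```

Both output vectors end up with the same length, matched size plus unmatched size, which is the invariant the index-based version computed by hand.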

alamb · 2022-11-29T16:40:06Z

datafusion/core/src/physical_plan/joins/hash_join.rs

+        // b2 = 10
+        build_table(
+            ("a2", &vec![2, 4, 6, 8, 10, 12]),
+            ("b2", &vec![2, 4, 6, 8, 10, 10]),


I recommend changing the order of the inputs so they are not sorted, to add additional coverage.

For example:

Suggested change

("b2", &vec![2, 4, 6, 8, 10, 10]),

("b2", &vec![8, 10, 6, 2, 10, 4]),

Bonus points for fuzzing and trying several different combinations

If I change the order of b2, I need to apply the same change to a2, because there are filters on a2 and the join equality condition is between b2 and b1.

The result for c2 in the right semi and right anti joins will also change.

cc @alamb
Commit a24ba09 changes the order of b2 and a2.
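
One way to keep the columns consistent when reordering test data is to apply a single row permutation to every column. The helper below is a hypothetical sketch, not the code from commit a24ba09; `permute` and the `order` slice are names assumed for illustration.

```rust
/// Reorders a column so that output row `i` is input row `order[i]`.
/// Applying the same `order` to every column keeps each row intact.
pub fn permute<T: Copy>(column: &[T], order: &[usize]) -> Vec<T> {
    order.iter().map(|&row| column[row]).collect()
}
```

For example, with `order = [3, 4, 2, 0, 5, 1]`, the sorted `b2` column `[2, 4, 6, 8, 10, 10]` becomes `[8, 10, 6, 2, 10, 4]`, and applying the same `order` to `a2` moves each `a2` value together with its `b2` value.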

ursabot · 2022-12-01T02:41:28Z

Benchmark runs are scheduled for baseline = 48f0f3a and contender = 8e0556b. 8e0556b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liukun4515 added 2 commits November 24, 2022 11:50

fix bug: right anti join with filter

8308f2d

refactor the hash join, and combine common method

5a38a23

github-actions bot added the core Core DataFusion crate label Nov 25, 2022

liukun4515 requested review from Dandandan, alamb and andygrove November 25, 2022 14:30

Merge remote-tracking branch 'upstream/master' into refactor_hash_join

99dd10e

jackwener reviewed Nov 27, 2022

liukun4515 added 2 commits November 28, 2022 11:06

Merge remote-tracking branch 'upstream/master' into refactor_hash_join

999e00c

optimize the comments

d0394b4

alamb mentioned this pull request Nov 28, 2022

[Epic] Generate runtime errors if the memory budget is exceeded #3941

Closed

4 tasks

alamb approved these changes Nov 29, 2022

liukun4515 added 2 commits November 30, 2022 13:27

address comments; change the data order

a24ba09

Merge remote-tracking branch 'upstream/master' into refactor_hash_join

05e6d9f

alamb approved these changes Nov 30, 2022

liukun4515 merged commit 8e0556b into apache:master Dec 1, 2022

liukun4515 deleted the refactor_hash_join branch December 1, 2022 02:40
