Skip to content

Conversation

@liukun4515
Copy link
Contributor

Which issue does this PR close?

Closes #4355
Closes #4247
Closes #4356

Rationale for this change

What changes are included in this PR?

described in this issue #4356

  • combine the logic of building left/right indices
  • use the join type to adjust left/right indices
  • output the result according to the join type

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 25, 2022
Some(result.map(|x| x.0))
}
// the right side has been consumed
// TODO: Some(Err) case
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some(result)
}
Err(_) => {
// TODO why the type of result stream is `Result<T, ArrowError>`, and not the `DataFusionError`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @Dandandan why we use the ArrowError instead the DataFusionError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it same with #4172?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it same with #4172?

yes

@liukun4515
Copy link
Contributor Author

cc @alamb @Dandandan @mingmwang @jackwener PTAL

@alamb
Copy link
Contributor

alamb commented Nov 26, 2022

I will try and find time to review this carefully over the next few days -- joins are a complicated subject so thanks for taking them on. However they aren't very high priority to my day job at InfluxData yet so finding time to review this kind of PR is hard for me

@jackwener
Copy link
Member

I will review this PR carefully tomorrow, thanks @liukun4515


#[tokio::test]
#[ignore = "Test ignored, will be enabled after fixing right semi join bug"]
// https://github.com/apache/arrow-datafusion/issues/4247
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix the bug for right semi join cc @mingmwang

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! 👍 The whole work is very clear.

// "+----+----+-----+----+----+-----+"
// And the result of left and right indices
// left indices: 5,6,6,4
// right indices: 3,4,5,3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

RecordBatch::try_new(Arc::new(schema.clone()), columns)
}

// Get left and right indices which is satisfies the on condition in the Join
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is easy to forget the filter and equal-conditon both are on-condition and they are combined by And implicitly.

Suggested change
// Get left and right indices which is satisfies the on condition in the Join
// Get left and right indices which is satisfies the on condition (include equal_conditon and filter_in_join) in the Join

Comment on lines 1222 to 1227
JoinType::LeftSemi | JoinType::LeftAnti => {
// matched or unmatched left row will be produced in the end of loop
(
UInt64Array::from_iter_values(vec![]),
UInt32Array::from_iter_values(vec![]),
)
Copy link
Member

@jackwener jackwener Nov 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a TODO optimized points
Because semi don't need to wait the end

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liukun4515 -- I reviewed this code and tests carefully; I found it easier to read / understand. 🏅

PrimitiveArray::<UInt32Type>::from(right),
))
}
JoinType::RightSemi => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment on lines 1244 to 1269
let matched_size = left_indices.len();
let unmatched_size = appended_right_indices.len();
let total_size = matched_size + unmatched_size;
// the new left indices: left_indices + null array
// the new right indices: right_indices + appended_right_indices
let new_left_indices = (0..total_size)
.map(|pos| {
if pos < matched_size {
unsafe { Some(left_indices.value_unchecked(pos)) }
} else {
None
}
})
.collect::<UInt64Array>();
let new_right_indices = (0..total_size)
.map(|pos| {
if pos < matched_size {
unsafe { Some(right_indices.value_unchecked(pos)) }
} else {
unsafe {
Some(appended_right_indices.value_unchecked(pos - matched_size))
}
}
})
.collect::<UInt32Array>();
(new_left_indices, new_right_indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be able to do this without unsafe and more concisely:

Suggested change
let matched_size = left_indices.len();
let unmatched_size = appended_right_indices.len();
let total_size = matched_size + unmatched_size;
// the new left indices: left_indices + null array
// the new right indices: right_indices + appended_right_indices
let new_left_indices = (0..total_size)
.map(|pos| {
if pos < matched_size {
unsafe { Some(left_indices.value_unchecked(pos)) }
} else {
None
}
})
.collect::<UInt64Array>();
let new_right_indices = (0..total_size)
.map(|pos| {
if pos < matched_size {
unsafe { Some(right_indices.value_unchecked(pos)) }
} else {
unsafe {
Some(appended_right_indices.value_unchecked(pos - matched_size))
}
}
})
.collect::<UInt32Array>();
(new_left_indices, new_right_indices)
let matched_size = left_indices.len();
let unmatched_size = appended_right_indices.len();
// the new left indices: left_indices + null array
// the new right indices: right_indices + appended_right_indices
let new_left_indices = left_indices
.iter()
.chain(std::iter::repeat(None).take(unmatched_size))
.collect::<UInt64Array>();
let new_right_indices = right_indices
.iter()
.chain(appended_right_indices.iter())
.collect::<UInt32Array>();
(new_left_indices, new_right_indices)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// b2 = 10
build_table(
("a2", &vec![2, 4, 6, 8, 10, 12]),
("b2", &vec![2, 4, 6, 8, 10, 10]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend changing the order of the inputs so they are no sorted to add additional coverage

For example:

Suggested change
("b2", &vec![2, 4, 6, 8, 10, 10]),
("b2", &vec![8, 10 6, 2, 10, 4]),

Bonus points for fuzzing and trying several different combinations

Copy link
Contributor Author

@liukun4515 liukun4515 Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i change the order for b2, i need to change a2 with the same changes, because there are some filter for a2 and the join_equal is b2 with b1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the result for c2 in the rightsemi or rightanti will be changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @alamb
The commit a24ba09 for change the order of the b2 and a2

@liukun4515 liukun4515 merged commit 8e0556b into apache:master Dec 1, 2022
@liukun4515 liukun4515 deleted the refactor_hash_join branch December 1, 2022 02:40
@ursabot
Copy link

ursabot commented Dec 1, 2022

Benchmark runs are scheduled for baseline = 48f0f3a and contender = 8e0556b. 8e0556b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor the code of the HashJoin RightSemi/RightAnti HashJoin has bug, the left_indices is never populated, causing failure to apply join filters.

4 participants