Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: propagate EmptyRelation for more join types #10963

Merged

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented Jun 17, 2024

Which issue does this PR close?

Doesn't fully close, but related to #10967

Rationale for this change

Currently the PropagateEmptyRelation relation only can optimize inner joins (probably the most important).

This PR updates the join handling to include simple cases for left, right, left-semi, right-semi, left-anti, and right-anti joins.

What changes are included in this PR?

Slight refactoring of the join match statement PropagateEmptyRelation::rewrite then add the aforementioned joins.

Are these changes tested?

Add tests for the added join types to confirm they're rewritten to an empty relation.

Are there any user-facing changes?

No

@github-actions github-actions bot added the optimizer Optimizer rules label Jun 17, 2024
@tshauck tshauck force-pushed the fill-out-empty-relation-opt-for-more-join-types branch from dda1845 to 5a82986 Compare June 17, 2024 19:32
@@ -400,6 +433,182 @@ mod tests {
assert_together_optimized_plan_eq(plan, expected)
}

#[test]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm realizing these could/should be slt tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having some end to end slt would be good too, but for these rules I think having good unit tests are key as well

01)LeftSemi Join: t1.t1_id = __correlated_sq_1.t2_id
02)--TableScan: t1 projection=[t1_id, t1_name]
03)--EmptyRelation
logical_plan EmptyRelation
Copy link
Contributor Author

@tshauck tshauck Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the EmptyRelation in the original test is causing the entire join to be optimized out with the additional join types now propagating emptyrelation.

Is there a way I can preserve this test's testing of just the subquery rewrite logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

causing the entire join to be optimized out with the additional join types now propagating emptyrelation.

I agree with your assesment. I think the limit 0 means this plan is better

In terms of preserving the test of the subquery rewrite I think it does (otherwise it wouldn't be rewritten to a SEMI JOIN and thus the join couldn't be removed)

@tshauck tshauck marked this pull request as ready for review June 17, 2024 20:57
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tshauck -- I think this looks like a nice improvement to me

I agree it would be nice to add some slt tests for this feature as well as maybe reduce repetition in the unit tests, but I don't think they are required to merge the PR

I'll plan to leave this one open for another day or two to give others a chance to review it as well if they would like

@@ -400,6 +433,182 @@ mod tests {
assert_together_optimized_plan_eq(plan, expected)
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having some end to end slt would be good too, but for these rules I think having good unit tests are key as well

01)LeftSemi Join: t1.t1_id = __correlated_sq_1.t2_id
02)--TableScan: t1 projection=[t1_id, t1_name]
03)--EmptyRelation
logical_plan EmptyRelation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

causing the entire join to be optimized out with the additional join types now propagating emptyrelation.

I agree with your assesment. I think the limit 0 means this plan is better

In terms of preserving the test of the subquery rewrite I think it does (otherwise it wouldn't be rewritten to a SEMI JOIN and thus the join couldn't be removed)

if left_empty || right_empty {
return Ok(Transformed::yes(LogicalPlan::EmptyRelation(
EmptyRelation {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked that this code faithfully implements the semantics in the comments, and I did some mental review and verification to convince myself that these semantics are correct

i wonder if @jackwener has a moment to double check too?

}

#[test]
fn test_right_anti_join_empty_right_table() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add the negative cases ? Like test right antijoin_empty_left_table and assert it wasn't rewritten?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see

fn test_join_empty_propagation_rules() -> Result<()> {
// test left join with empty left
empty_left_and_right_lp(true, false, JoinType::Left, true)?;
// test right join with empty right
empty_left_and_right_lp(false, true, JoinType::Right, true)?;
// test left semi join with empty left
empty_left_and_right_lp(true, false, JoinType::LeftSemi, true)?;
// test left semi join with empty right
empty_left_and_right_lp(false, true, JoinType::LeftSemi, true)?;
// test right semi join with empty left
empty_left_and_right_lp(true, false, JoinType::RightSemi, true)?;
// test right semi join with empty right
empty_left_and_right_lp(false, true, JoinType::RightSemi, true)?;
// test left anti join empty left
empty_left_and_right_lp(true, false, JoinType::LeftAnti, true)?;
// test right anti join empty right
empty_left_and_right_lp(false, true, JoinType::RightAnti, true)
}
#[test]
fn test_join_empty_propagation_rules_noop() -> Result<()> {
// these cases should not result in an empty relation
// test left join with empty right
empty_left_and_right_lp(false, true, JoinType::Left, false)?;
// test right join with empty left
empty_left_and_right_lp(true, false, JoinType::Right, false)?;
// test left semi with non-empty left and right
empty_left_and_right_lp(false, false, JoinType::LeftSemi, false)?;
// test right semi with non-empty left and right
empty_left_and_right_lp(false, false, JoinType::RightSemi, false)?;
// test left anti join with non-empty left and right
empty_left_and_right_lp(false, false, JoinType::LeftAnti, false)?;
// test left anti with non-empty left and empty right
empty_left_and_right_lp(false, true, JoinType::LeftAnti, false)?;
// test right anti join with non-empty left and right
empty_left_and_right_lp(false, false, JoinType::RightAnti, false)?;
// test right anti with empty left and non-empty right
empty_left_and_right_lp(true, false, JoinType::RightAnti, false)
}
(second half)

@tshauck
Copy link
Contributor Author

tshauck commented Jun 18, 2024

Thanks for the feedback @alamb, I took some steps in d8bd11a to improve the tests by reducing redundancy and adding negative cases

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these rewrites are correct.

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for some small details, I have no problem with this PR.
In addition, I hope to see more slt tests.

Thanks @tshauck

Comment on lines 201 to 221
pub fn assert_optimized_plan_eq_with_rules(
rules: Vec<Arc<dyn OptimizerRule + Send + Sync>>,
plan: LogicalPlan,
expected: &str,
) -> Result<()> {
let optimized_plan = generate_optimized_plan_with_rules(rules, plan);
let formatted_plan = format!("{optimized_plan:?}");
assert_eq!(formatted_plan, expected);
Ok(())
}

pub fn assert_optimized_plan_ne_with_rules(
rules: Vec<Arc<dyn OptimizerRule + Send + Sync>>,
plan: LogicalPlan,
expected: &str,
) -> Result<()> {
let optimized_plan = generate_optimized_plan_with_rules(rules, plan);
let formatted_plan = format!("{optimized_plan:?}");
assert_ne!(formatted_plan, expected);
Ok(())
}
Copy link
Member

@jackwener jackwener Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can merge them into one function and use a param to determine assert_ne or assert_eq

Like

pub fn assert_optimized_plan(
    rules: Vec<Arc<dyn OptimizerRule + Send + Sync>>,
    plan: LogicalPlan,
    expected: &str,
    eq: bool,
) -> Result<()> {
    let optimized_plan = generate_optimized_plan_with_rules(rules, plan);
    let formatted_plan = format!("{optimized_plan:?}");
    if eq {
        assert_eq!(formatted_plan, expected);
    } else {
        assert_ne!(formatted_plan, expected);
    }
    Ok(())
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 401f2e2

@github-actions github-actions bot added the core Core datafusion crate label Jun 20, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @tshauck @jonahgao and @jackwener

@alamb alamb merged commit 18042fd into apache:main Jun 20, 2024
23 checks passed
@tshauck tshauck deleted the fill-out-empty-relation-opt-for-more-join-types branch June 20, 2024 16:59
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jun 22, 2024
* feat: propagate empty for more join types

* feat: update subquery de-correlation test

* tests: simplify tests

* refactor: better name

* style: clippy

* refactor: update tests

* refactor: rename

* refactor: fix spellings

* add slt tests
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jun 22, 2024
* feat: propagate empty for more join types

* feat: update subquery de-correlation test

* tests: simplify tests

* refactor: better name

* style: clippy

* refactor: update tests

* refactor: rename

* refactor: fix spellings

* add slt tests
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: propagate empty for more join types

* feat: update subquery de-correlation test

* tests: simplify tests

* refactor: better name

* style: clippy

* refactor: update tests

* refactor: rename

* refactor: fix spellings

* add slt tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate optimizer Optimizer rules sqllogictest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants