JOIN conditions are order dependent #778
Conversation
```rust
    .map(|c| c.normalize(&self.plan))
    .collect::<Result<_>>()
    .or_else(|_| {
        swap = true;
```
Shouldn't "swapping" be done on an individual level, e.g. for left1=right1 and right2=left2 it will swap everything now?
Yes, thanks for your review. I will address this.
I have refactored this to operate at an individual condition level and I think the code is much more succinct.
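The per-condition approach described above can be sketched as follows. This is a minimal standalone model, not the actual DataFusion code: `normalize_pair` and the string-slice "schemas" are illustrative stand-ins for column normalization against the left and right plans. The point is that each condition is checked and, if needed, swapped individually, so `left1 = right1 AND right2 = left2` works without flipping the whole key list.

```rust
// Minimal sketch (assumed names, not DataFusion's API): normalize one join
// condition at a time, swapping only that condition if its first column
// resolves against the right relation instead of the left.
fn normalize_pair(
    left_cols: &[&str],
    right_cols: &[&str],
    l: &str,
    r: &str,
) -> Option<(String, String)> {
    let in_left = |c: &str| left_cols.contains(&c);
    let in_right = |c: &str| right_cols.contains(&c);
    if in_left(l) && in_right(r) {
        // already oriented left = right
        Some((l.to_string(), r.to_string()))
    } else if in_left(r) && in_right(l) {
        // swap just this one condition
        Some((r.to_string(), l.to_string()))
    } else {
        // neither orientation resolves: error out, as the builder does
        None
    }
}

fn main() {
    let left = ["t1_id", "t1_name"];
    let right = ["t2_id", "t2_name"];
    // left1 = right1 stays as-is
    assert_eq!(
        normalize_pair(&left, &right, "t1_id", "t2_id"),
        Some(("t1_id".to_string(), "t2_id".to_string()))
    );
    // right2 = left2 is swapped individually
    assert_eq!(
        normalize_pair(&left, &right, "t2_name", "t1_name"),
        Some(("t1_name".to_string(), "t2_name".to_string()))
    );
    // unknown column fails in both orientations
    assert_eq!(normalize_pair(&left, &right, "t1_id", "t3_id"), None);
    println!("ok");
}
```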
alamb
left a comment
Welcome back @seddonm1 ! It is great to see a contribution from you. Thank you very much
```rust
let mut ctx = create_join_context("t1_id", "t2_id")?;
let sql = "SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t1_id = t2_id ORDER BY t1_id";
let actual = execute(&mut ctx, sql).await;
let equivalent_sql = [
```
I wonder if this test case and those below it add any additional coverage compared to `equijoin_multiple_condition_ordering`. Like, are there bugs that would cause `equijoin_multiple_condition_ordering` to pass but the others to fail?
My main objective was to ensure nothing specific to equijoin (INNER) would break this implementation for the other join types (which occurred when developing it). This may be more relevant if any further optimisations occur.
```rust
    .into_iter()
    .map(|c| c.into().normalize(right))
    .collect::<Result<_>>()?;
let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) =
```
I don't fully understand this code -- if `LogicalPlanBuilder::join` is called it has two inputs, a left (`self`) and a right (`right`), so the idea that `left_keys` might contain references to `right` is confusing to me.
I wonder if there might be a better place to fix up the errors -- perhaps somewhere further up the call stack than `LogicalPlanBuilder::join` -- for example `extract_join_keys` in the SQL planner (especially since this bug seems related to SQL planning):
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs#L372-L380
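For context, a pass like `extract_join_keys` walks the `ON` expression and collects equality pairs. The sketch below is a simplified standalone model of that idea: the `Expr` enum is a toy stand-in for DataFusion's expression type, and non-equijoin predicates are simply ignored here rather than handled.

```rust
// Toy expression tree (assumed, simplified): columns, equality, conjunction.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

// Collect (left, right) column pairs from `a = b AND c = d ...`.
// Note the pairs come out in whatever orientation the user wrote them,
// which is exactly why a later normalization/swap step is needed.
fn extract_join_keys(expr: &Expr, keys: &mut Vec<(String, String)>) {
    match expr {
        Expr::Eq(l, r) => {
            if let (Expr::Column(l), Expr::Column(r)) = (l.as_ref(), r.as_ref()) {
                keys.push((l.clone(), r.clone()));
            }
        }
        Expr::And(l, r) => {
            extract_join_keys(l, keys);
            extract_join_keys(r, keys);
        }
        Expr::Column(_) => {}
    }
}

fn main() {
    let col = |s: &str| Box::new(Expr::Column(s.to_string()));
    // t1.a = t2.a AND t2.b = t1.b -- note the mixed orientation
    let on = Expr::And(
        Box::new(Expr::Eq(col("t1.a"), col("t2.a"))),
        Box::new(Expr::Eq(col("t2.b"), col("t1.b"))),
    );
    let mut keys = vec![];
    extract_join_keys(&on, &mut keys);
    assert_eq!(
        keys,
        vec![
            ("t1.a".to_string(), "t2.a".to_string()),
            ("t2.b".to_string(), "t1.b".to_string()),
        ]
    );
    println!("ok");
}
```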
> Welcome back @seddonm1 ! It is great to see a contribution from you. Thank you very much
Thanks @alamb. Unfortunately my employment situation makes it hard to contribute too much but I am still very interested in seeing Datafusion succeed.
The `LogicalPlanBuilder::join` has left and right keys, but they are left and right relative to a join condition, not to the left or right relation.
The `left_keys` and `right_keys` are where the join conditions are resolved to the underlying relations. Postgres does not require the conditions to be written in a specific order, so we need to be able to handle both orientations.
Yeah, I think the argument name makes sense with the old assumption that left_keys should only refer to columns from the left plan. With this fix, it can become confusing.
Perhaps a better signature for this function would be something along the lines of: `join_keys: Vec<(impl Into<Column>, impl Into<Column>)>`.
If we want the dataframe API to also support order independent JOIN conditions, then it is indeed better to address the fix within this builder method. Otherwise I agree it would be better to fix it in the SQL planner instead. I am leaning towards supporting this in dataframe API as well.
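A standalone sketch of how the proposed pair-based signature could look. The `Column` struct and `join_keys` function here are minimal stand-ins (not DataFusion's types); the point is that a single list of pairs replaces the parallel `left_keys`/`right_keys` vectors and can be unzipped back when the plan needs them.

```rust
// Minimal stand-in for a column reference (assumed, not the real type).
#[derive(Debug)]
struct Column {
    name: String,
}

impl From<&str> for Column {
    fn from(s: &str) -> Self {
        Column { name: s.to_string() }
    }
}

// Proposed shape: one vector of (left, right) pairs, each side convertible
// into a Column, unzipped into the two parallel lists internally.
fn join_keys(keys: Vec<(impl Into<Column>, impl Into<Column>)>) -> (Vec<Column>, Vec<Column>) {
    keys.into_iter().map(|(l, r)| (l.into(), r.into())).unzip()
}

fn main() {
    let (l, r) = join_keys(vec![("t1_id", "t2_id"), ("t1_name", "t2_name")]);
    assert_eq!(l[1].name, "t1_name");
    assert_eq!(r[0].name, "t2_id");
    println!("ok");
}
```

One nicety of this shape is that each condition travels as a unit, so a per-condition swap never risks misaligning the two lists.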
Changing the signature to be `join_keys: Vec<(impl Into<Column>, impl Into<Column>)>` instead of `left_keys: ...` and `right_keys: ...` makes sense to me.
@alamb @houqp I have updated the PR with @alamb's suggested `join_keys: Vec<(impl Into<Column>, impl Into<Column>)>` signature.
Regarding @houqp's dataframe suggestion, I am not sure how we should proceed; I would expect users of the dataframe API to be much more capable of adapting to logically specifying keys in the 'correct' order, versus SQL users who just want it to behave like any other SQL engine.
Dandandan
left a comment
Looks great. Thanks. @seddonm1
houqp
left a comment
LGTM, thanks for the fix!
> Regarding @houqp dataframe suggestion, I am not sure how we proceed; I would expect users of the dataframe API to be much more capable of adapting to logically specifying keys in the 'correct' order vs SQL users who just want it to behave like any other SQL engine?
I agree. If it ever becomes a problem in the future, we can address it with minimal code change since this PR has already done all the heavy lifting.
🎉 thanks again @seddonm1 !
Which issue does this PR close?
Closes #777.
Rationale for this change
Currently the logical plan builder assumes that the join condition fields are provided in the same order as the tables are specified. This differs from Postgres, which does not care about the order in which the two tables' columns appear in the condition. This PR aims to rectify this discrepancy.
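As a concrete illustration (adapted from the PR's own `equivalent_sql` tests), the following two queries should plan and return identical results, even though the `ON` columns appear in opposite orders:

```sql
-- Postgres accepts either orientation of the equality in the ON clause.
SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t1_id = t2_id ORDER BY t1_id;
SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t2_id = t1_id ORDER BY t1_id;
```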
What changes are included in this PR?
This change allows the conditions to be reversed, and will fail as previously if neither relation contains the column. A lot of the tests in `sql.rs` have been updated to test both orderings.
Are there any user-facing changes?
This does not change public APIs, and should not be noticeable by any current users.