Support optional filter in Join #2509

yjshen · 2022-05-11T08:23:41Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It would be useful to support filters inside the join operator itself, instead of a join operator followed by a separate filter. There are two reasons:

For a semi-join, the columns from one table are removed after the join, which makes a later filter that references both sides impossible.
Filtering records earlier, inside the join, reduces the number of records (batches) that need to be constructed.
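
To illustrate the semi-join point, here is a minimal sketch in plain Rust (not DataFusion code). The `Row` type and `left_semi_join` function are hypothetical names made up for this example; rows are modeled as `(key, value)` pairs and the join is a naive nested loop. The filter sees both sides while they are still available, which a filter placed after the semi-join could not.

```rust
/// A toy row: (join key, value). Hypothetical stand-in for a record batch row.
type Row = (i64, i64);

/// Left semi join with an optional filter evaluated inside the join.
/// A left row is emitted at most once, if some right row has the same key
/// and the filter (which sees both rows) passes for that pair.
fn left_semi_join(
    left: &[Row],
    right: &[Row],
    filter: Option<&dyn Fn(&Row, &Row) -> bool>,
) -> Vec<Row> {
    left.iter()
        .filter(|l: &&Row| {
            right
                .iter()
                // Equi-join condition first, then the optional join filter.
                .any(|r| l.0 == r.0 && filter.map_or(true, |f| f(*l, r)))
        })
        .copied()
        .collect()
}
```

The output only has left-side columns, so a predicate such as "left value differs from right value" has to run inside `left_semi_join`; it cannot be expressed against the result.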

Describe the solution you'd like

Add pub type FilterOn = (Vec<Column>, Vec<Column>, Arc<dyn PhysicalExpr>); and an Option<FilterOn> for JoinExec.
Generate an intermediate record batch from the arrays referenced by the filter expr, and rebind the expr to point to the right columns of the newly generated batch.
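
The rebinding step could look roughly like the sketch below. This is a toy model under stated assumptions, not the DataFusion API: `Expr`, `FilterOn` (here a struct rather than the proposed tuple), `build_filter`, and `passes` are hypothetical names, columns are plain indices into a joined schema (left columns first), and evaluation is row-at-a-time over `i64` values instead of over Arrow arrays.

```rust
use std::collections::HashMap;

/// Toy expression tree; a hypothetical stand-in for `Arc<dyn PhysicalExpr>`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col(usize),
    Lit(i64),
    Add(Box<Expr>, Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
}

/// Toy counterpart of the proposed `FilterOn`: the columns needed from each
/// side, plus the expression rebound to the intermediate layout
/// `[left_cols..., right_cols...]`.
struct FilterOn {
    left_cols: Vec<usize>,
    right_cols: Vec<usize>,
    expr: Expr,
}

/// Collect referenced column indices in first-seen order, without duplicates.
fn collect_cols(expr: &Expr, out: &mut Vec<usize>) {
    match expr {
        Expr::Col(i) => {
            if !out.contains(i) {
                out.push(*i);
            }
        }
        Expr::Lit(_) => {}
        Expr::Add(a, b) | Expr::Gt(a, b) => {
            collect_cols(a, out);
            collect_cols(b, out);
        }
    }
}

/// Rewrite every column index through `map` (old index -> new index).
fn rebind(expr: &Expr, map: &HashMap<usize, usize>) -> Expr {
    match expr {
        Expr::Col(i) => Expr::Col(map[i]),
        Expr::Lit(v) => Expr::Lit(*v),
        Expr::Add(a, b) => Expr::Add(Box::new(rebind(a, map)), Box::new(rebind(b, map))),
        Expr::Gt(a, b) => Expr::Gt(Box::new(rebind(a, map)), Box::new(rebind(b, map))),
    }
}

/// Split an expression over the joined schema (left columns first) into
/// per-side column lists and a rebound expression.
fn build_filter(expr: &Expr, left_width: usize) -> FilterOn {
    let mut cols = Vec::new();
    collect_cols(expr, &mut cols);
    let left_cols: Vec<usize> = cols.iter().copied().filter(|c| *c < left_width).collect();
    let right_cols: Vec<usize> = cols
        .iter()
        .copied()
        .filter(|c| *c >= left_width)
        .map(|c| c - left_width)
        .collect();
    let mut map = HashMap::new();
    for (pos, c) in left_cols.iter().enumerate() {
        map.insert(*c, pos);
    }
    for (pos, c) in right_cols.iter().enumerate() {
        map.insert(*c + left_width, left_cols.len() + pos);
    }
    let rebound = rebind(expr, &map);
    FilterOn { left_cols, right_cols, expr: rebound }
}

/// Evaluate an expression against one intermediate row; booleans are 0/1.
fn eval(expr: &Expr, row: &[i64]) -> i64 {
    match expr {
        Expr::Col(i) => row[*i],
        Expr::Lit(v) => *v,
        Expr::Add(a, b) => eval(a, row) + eval(b, row),
        Expr::Gt(a, b) => (eval(a, row) > eval(b, row)) as i64,
    }
}

/// Build the intermediate row from only the needed columns, then evaluate.
fn passes(filter: &FilterOn, left_row: &[i64], right_row: &[i64]) -> bool {
    let mut row: Vec<i64> = filter.left_cols.iter().map(|c| left_row[*c]).collect();
    row.extend(filter.right_cols.iter().map(|c| right_row[*c]));
    eval(&filter.expr, &row) != 0
}
```

With a left schema of width 3 and a filter `Col(1) + Col(4) > 100` over the joined schema, `build_filter` keeps left column 1 and right column 1, and the rebound expression reads `Col(0) + Col(1) > 100` against the two-column intermediate row. Only the columns the filter touches are copied.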

Describe alternatives you've considered

pub type FilterOn = Vec<(Column, Column, datafusion_expr::Operator)>;
Normalize each filter into the two sides of a binary op, e.g. rewrite t1.a + t2.b > 100 as t1.a > 100 - t2.b. Evaluate a and 100 - b separately as two columns, then apply the binary expr calculation logic.

But this approach would be quite limited, since it greatly restricts the expressions that could be used in a join filter.
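
For comparison, the alternative can be sketched as follows. Again this is plain Rust with hypothetical names (`CmpOp`, `SplitFilter`, `matching_pairs`), `i64` values only, and a naive pairwise loop; it ignores integer overflow in the rewritten right-hand side. Each side is evaluated once into its own column, and only a single comparison operator joins the two.

```rust
/// Comparison operators allowed between the two normalized sides.
#[derive(Clone, Copy)]
enum CmpOp {
    Gt,
    Lt,
    Eq,
}

/// One normalized filter: `left_expr(left_row) <op> right_expr(right_row)`.
/// Each side may only reference columns of its own table.
struct SplitFilter {
    left_expr: fn(&[i64]) -> i64,
    op: CmpOp,
    right_expr: fn(&[i64]) -> i64,
}

/// Evaluate each side once per row, then compare the two resulting columns
/// pairwise. Returns the (left index, right index) pairs that pass.
fn matching_pairs(
    f: &SplitFilter,
    left_rows: &[Vec<i64>],
    right_rows: &[Vec<i64>],
) -> Vec<(usize, usize)> {
    let lhs: Vec<i64> = left_rows.iter().map(|r| (f.left_expr)(r.as_slice())).collect();
    let rhs: Vec<i64> = right_rows.iter().map(|r| (f.right_expr)(r.as_slice())).collect();
    let mut out = Vec::new();
    for (i, l) in lhs.iter().enumerate() {
        for (j, r) in rhs.iter().enumerate() {
            let keep = match f.op {
                CmpOp::Gt => l > r,
                CmpOp::Lt => l < r,
                CmpOp::Eq => l == r,
            };
            if keep {
                out.push((i, j));
            }
        }
    }
    out
}
```

For t1.a + t2.b > 100 the two sides would be `a` and `100 - b` with `CmpOp::Gt`. A filter such as t1.a * t2.b > t1.c + t2.d has no obvious split of this shape, which is the limitation noted above.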

Additional context

Consider part of the SparkSQL plan for TPC-DS query 95 as an example:

+- SortMergeJoin [ws_order_number#251], [ws_order_number#285], Inner, NOT (ws_warehouse_sk#249 = ws_warehouse_sk#283)
   :- Sort [ws_order_number#251 ASC NULLS FIRST], false, 0
   :  +- CustomShuffleReader coalesced
   :     +- ShuffleQueryStage 4
   :        +- ReusedExchange [ws_warehouse_sk#249, ws_order_number#251], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
   +- Sort [ws_order_number#285 ASC NULLS FIRST], false, 0
      +- CustomShuffleReader coalesced
         +- ShuffleQueryStage 5
            +- ReusedExchange [ws_warehouse_sk#283, ws_order_number#285], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]

yjshen · 2022-05-11T09:21:12Z

A related issue #2496.

yjshen · 2022-05-26T10:36:21Z

Reopened and renamed to track sort-merge join filter as well.

alamb · 2022-05-26T21:28:23Z

@yjshen do you mind if I close this ticket and open another describing the support needed for sort-merge join? I think it might be clearer to a future reader that we just needed to extend the support in HashJoin to MergeJoin, whereas the description of this ticket may now confuse people, as it talks about differing implementation possibilities.

yjshen · 2022-05-27T00:20:39Z

Got it; I will open a new issue instead.

alamb · 2022-05-27T10:31:19Z

> Got it; I will open a new issue instead.

Thanks @yjshen !

yjshen added the enhancement New feature or request label May 11, 2022

korowa mentioned this issue May 22, 2022

Support for non equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591

Merged

alamb closed this as completed in #2591 May 26, 2022

yjshen changed the title ~~Support optional filter in Join~~ Support optional filter in SortMergeJoin May 26, 2022

yjshen reopened this May 26, 2022

yjshen closed this as completed May 27, 2022

yjshen changed the title ~~Support optional filter in SortMergeJoin~~ Support optional filter in Join May 27, 2022

yjshen mentioned this issue May 27, 2022

Support optional filter in SortMergeJoin #2628

Open
