Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support optional filter in Join #2509

Closed
yjshen opened this issue May 11, 2022 · 5 comments · Fixed by #2591
Closed

Support optional filter in Join #2509

yjshen opened this issue May 11, 2022 · 5 comments · Fixed by #2591
Labels
enhancement New feature or request

Comments

@yjshen
Copy link
Member

yjshen commented May 11, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It would be necessary to support filters in the join operator, instead of a join operator followed by a filter, the necessity comes from two folds:

  1. semi-join that columns from one table would be removed after the join, which would make a filter that references both sides impossible.
  2. filter out records earlier to reduce the need for constructing more records(batches).

Describe the solution you'd like

  1. pub type FilterOn = (Vec<Column>, Vec<Column>, Arc<dyn PhysicalExpr>); and Option<FilterOn> for JoinExec.
  2. generate record batch using arrays in the filter expr, but rebind the expr to point to the right columns of the newly generated batch.

Describe alternatives you've considered

  1. pub type FilterOn = Vec<(Column, Column, datafusion_expr::Operator)>;
  2. Normalize each filter into two sides of a binary op. like: t1.a + t2.b > 100 to t1.a > 100 - t2.b. evaluates a , 100-b separately as two columns and apply binary expr calculation logic.

But the approach would be quite limited since it greatly limits the expressions that could be used in a join filter.

Additional context

Consider Part of TPC-DS query-95's SparkSQL plan as an example:

+- SortMergeJoin [ws_order_number#251], [ws_order_number#285], Inner, NOT (ws_warehouse_sk#249 = ws_warehouse_sk#283)
   :- Sort [ws_order_number#251 ASC NULLS FIRST], false, 0
   :  +- CustomShuffleReader coalesced
   :     +- ShuffleQueryStage 4
   :        +- ReusedExchange [ws_warehouse_sk#249, ws_order_number#251], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
   +- Sort [ws_order_number#285 ASC NULLS FIRST], false, 0
      +- CustomShuffleReader coalesced
         +- ShuffleQueryStage 5
            +- ReusedExchange [ws_warehouse_sk#283, ws_order_number#285], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
@yjshen yjshen added the enhancement New feature or request label May 11, 2022
@yjshen
Copy link
Member Author

yjshen commented May 11, 2022

A related issue #2496.

@yjshen yjshen changed the title Support optional filter in Join Support optional filter in SortMergeJoin May 26, 2022
@yjshen yjshen reopened this May 26, 2022
@yjshen
Copy link
Member Author

yjshen commented May 26, 2022

Reopened and renamed to track sort-merge join filter as well.

@alamb
Copy link
Contributor

alamb commented May 26, 2022

@yjshen do you mind if I close this ticket and reopen another describing the support needed for Sort-merge? I think it might be clearer to a future reader that we just needed to extend the support in HashJoin to MergeJoin whereas the description of this ticket now may confuse people as it talks about differing implementation possibilities

@yjshen
Copy link
Member Author

yjshen commented May 27, 2022

Get it; I will open a new issue instead.

@yjshen yjshen closed this as completed May 27, 2022
@yjshen yjshen changed the title Support optional filter in SortMergeJoin Support optional filter in Join May 27, 2022
@alamb
Copy link
Contributor

alamb commented May 27, 2022

Get it; I will open a new issue instead.

Thanks @yjshen !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants