Skip to content

Push Down Filter Subexpressions in Nested Loop Joins as Projections #17904

@tobixdev

Description

@tobixdev

Is your feature request related to a problem or challenge?

If a subexpression in a join filter expression can be evaluated on a single side of an NL Join, we could push down this subexpression to the respective side. We do not do this for all joins because often evaluating the expression after any equi-joins is more performant. However, the NL Join doesn't have a huge performance benefit from equi-joins and so it could benefit from such an approach.

In theory, this should improve performance as the subexpression does not need to be evaluated for the cartesian product in the join. In practice the question is whether this optimization is worth its additional complexity. I think it is worth it for complex UDFs, but for simple ones not so much. Initially, we built this optimization in RDF Fusion and as I've now a bit more time on my hand I could try to upstream it. There we have some costly UDFs and see significant improvements for some queries (~60 %).

Example:

Before:

NestedLoopJoinExec: join_type=LeftSemi, filter=complex_udf(lhs.a) > rhs.b, projection=[rhs.b@1]
     <lhs>
     <rhs>

After:

NestedLoopJoinExec: join_type=LeftSemi, filter= join_proj_push_down_1 > rhs.b, projection=[rhs.b@1]
      ProjectionExec: expr=[a@0 as a, complex_udf(lhs.a) as join_proj_push_down_1]
           <lhs>
     <rhs>

Describe the solution you'd like

Discuss whether we want to include this optimization in DataFusion.

Describe alternatives you've considered

Some alternative thoughts:

  • Just don't include the optimization
  • Include it, but not in the default Optimizer (Do we have something like that?)
  • We could also implement this as an optimization in the NL Join stream itself. However, this makes the join more complicated.
  • Implement it as logical plan optimization that checks whether the Join will become an NL join (duplicates planning logic)

Additional context

I'll create a draft PR that would upstream the optimization pass to DataFusion. The optimization changes the execution plan of two TPC-H queries, but I doubt that we will see much gains there.

I am not so sure about the naming of this optimization as it could be mixed up with regular projection push down. Feel free to change it.

Here is the PR with the code: #17906

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions