Projection Pushdown through user defined LogicalPlan nodes. #9690

mustafasrepo · 2024-03-19T10:58:03Z

Which issue does this PR close?

Improves the situation on #9146.

Rationale for this change

Currently, we do not have support for projection pushdown through user defined logical plan nodes. This can cause moving around unnecessary columns during execution, when plan involves user defined logical plan node.

What changes are included in this PR?

This PR adds support for projection pushdown through user defined logical plan nodes, when

UserDefined Logical Plan node has single child.
Output schema of the UserDefined node contains all of the fields in its input (May have more fields as in window).

Are these changes tested?

Yes

Are there any user-facing changes?

alamb

Thank you @mustafasrepo -- I think this is a very interesting and useful PR, but I am not sure the handling of parent indicies is quite correct

alamb · 2024-03-19T16:08:20Z

datafusion/optimizer/src/optimize_projections.rs

@@ -1192,4 +1291,24 @@ mod tests {
        \n    TableScan: test projection=[a]";
        assert_optimized_plan_equal(&plan, expected)
    }
+
+    // Optimize Projections Rule, pushes down projection through users defined logical plan node.


Can we please add a few more tests:

The NoOpUserDefined plan itself refers to column a in its expressions

The NoOpUserDefined plan itself refers to column b in its expressions

The NoOpUserDefined plan itself refers to column a + b in its expressions

I added tests for these cases.

datafusion/optimizer/src/optimize_projections.rs

alamb · 2024-03-19T16:31:10Z

datafusion/optimizer/src/optimize_projections.rs

+            let child = children[0];
+            let node_schema = extension.node.schema();
+            let child_schema = child.schema();
+            if let Some(parent_required_indices_mapped) =


This doesn't seem right to me -- I don't think we can assume that a column named "foo" on the input is the same as a column named "foo" in the output schema (this isn't true for LogicalPlan::Projection for example).

I think the information that is needed is "given the parent only needs some of the output columns of this UserDefinedPlan, can it avoid requiring additional input columns"

Given the UserDefinedPlan is basically a black box and the optimizer doesn't know how it works, I think the only way to get this information would be to add some sort of API to UserDefined node

I suggest for this PR, only push down information about used expressions (don't try and incorporate a projection from the parent) and we can handle the parent projection information in a follow on PR

I am thinking of an API like

trait UserDefinedLogicalNode { ... /// If only the specified indices of the output of node are needed /// by the consumer, optionally returns a new UserDefinedNode that /// computes only those indices, in the specified order. /// /// Return Ok(None), the default, if no such pushdown is possible /// /// This is used to implement projection pushdown for user defined nodes fn try_project(&self, indices: &[usize]) -> Result<Option<Arc<dyn Self>>> { Ok(None) }

I suggest for this PR, only push down information about used expressions (don't try and incorporate a projection from the parent) and we can handle the parent projection information in a follow on PR.

This behavior may not be safe. Assuming a user defined operator similar to LogicalPlan::Filter. Following Plan

Projection(a,b,c) --UserDefineFilter(d=0) ----TableScan(a,b,c,d)

If we don't incorporate requirements from the parent, user defined filter may insert Projection(d) below it. Which would produce following invalid plan:

Projection(a,b,c) (a,b,c is missing at its input) --UserDefineFilter(d=0) ----Projection(d) ------TableScan(a,b,c,d)

I am thinking of an API like

trait UserDefinedLogicalNode { ... /// If only the specified indices of the output of node are needed /// by the consumer, optionally returns a new UserDefinedNode that /// computes only those indices, in the specified order. /// /// Return Ok(None), the default, if no such pushdown is possible /// /// This is used to implement projection pushdown for user defined nodes fn try_project(&self, indices: &[usize]) -> Result<Option<Arc<dyn Self>>> { Ok(None) }

I think, implementers of the try_project needs to insert LogicalPlan::Projection on top of their input to pushdown projection. This might not be the case, if I am missing something. If that is the case, this might be cumbersome for the users. What about an API something like

/// Returns the necessary expressions at each child to calculate argument. /// assuming exprs refers to valid fields at the output of the operator. /// Return Ok(None), the default, Cannot determine necessary expressions at the children to calculate given exprs. fn necessary_children_exprs(&self, exprs: &[Expr]) -> Option<Vec<Vec<Expr>>> { Ok(None) }

with an API like above, we can handle projection insertion outside the trait method.

Which would produce following invalid plan:

That is a good point.

What about an API something like

That is a great idea, though I don't fully understand why it uses Exprs

How about a slight refinement in terms of input/output columns:

/// Returns the necessary input columns to this node required to compute /// the columns in the output schema /// /// This is used for projection pushdown when DataFusion has determined that /// only a subset of the output columns of this node are needed by its parents. /// This API is used to tell DataFusion which, if any, of the input columns are no longer /// needed. /// /// Return `Ok(None)`, the default, if this information can not be determined /// Returns `Ok(input_columns)` with the column indexes from this nodes' input that are /// needed to compute `output_columns`) fn necessary_children_exprs(&self, output_columns: &[usize]) -> Option<Vec<usize>> { Ok(None) }

🤔

This is much better. Knowing the mapping of the Columns is adequate. Using Expr has no additional support. Thanks @alamb for this suggestion. I will update this PR to use this mechanism. Then we can discuss further. I will let you know when it ready for further review.

@alamb I introduced new API fn necessary_children_exprs(&self, output_columns: &[usize]) -> Option<Vec<Vec<usize>>> to this PR (where return type is changed to return result per child). I also added a new test case showing the support for user defined nodes with multiple children. This Pr is ready for further review.

alamb · 2024-03-19T16:46:25Z

I also filed #9698 to try and clarify the documentation about what LogicalPlan::expressions() actually does

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

mustafasrepo · 2024-03-20T07:12:59Z

Thank you @mustafasrepo -- I think this is a very interesting and useful PR, but I am not sure the handling of parent indicies is quite correct

I think, you are right, we shouldn't rely on name match during analysis.

alamb · 2024-03-21T17:46:36Z

I am sorry -- I am behind this week on reviews. I will give this one another look shortly

…ined

alamb

Thank you @mustafasrepo -- this PR looks good to me.

I had a few suggestions that would be nice to fix prior to merging but I don't think they are required

datafusion/optimizer/src/optimize_projections.rs

alamb · 2024-03-27T16:08:56Z

Thanks again @mustafasrepo

) * Naive support for schema preserving plans * Add mapping support between schemas * Fix name * Update comment * Update comment * Do not calculate mapping for unnecessary sections * Update datafusion/optimizer/src/optimize_projections.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add new tests * Add new api to get necessary columns * Add new test for multi children * Address reviews --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

mustafasrepo added 2 commits March 19, 2024 13:33

Naive support for schema preserving plans

41d06f4

Add mapping support between schemas

cd7072a

github-actions bot added the optimizer Optimizer rules label Mar 19, 2024

mustafasrepo added 4 commits March 19, 2024 14:06

Fix name

8b8ccab

Update comment

708f449

Update comment

f190d46

Do not calculate mapping for unnecessary sections

95db5be

alamb reviewed Mar 19, 2024

View reviewed changes

alamb mentioned this pull request Mar 19, 2024

Minor: Improve documentation for LogicalPlan::expressions #9698

Merged

alamb mentioned this pull request Mar 19, 2024

DataFusion weekly project plan (Andrew Lamb) - March 18, 2024 #9675

Closed

7 tasks

mustafasrepo and others added 2 commits March 20, 2024 09:03

Update datafusion/optimizer/src/optimize_projections.rs

c9566c6

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Add new tests

1b2c408

mustafasrepo added 2 commits March 25, 2024 14:56

Add new api to get necessary columns

bd49625

Merge branch 'apache_main' into feature/projection_push_down_user_def…

fc689df

…ined

github-actions bot added the logical-expr Logical plan and expressions label Mar 25, 2024

Add new test for multi children

cb38c79

alamb changed the title ~~Projection Pushdown through trivial user defined LogicalPlan nodes.~~ Projection Pushdown through user defined LogicalPlan nodes. Mar 27, 2024

alamb approved these changes Mar 27, 2024

View reviewed changes

datafusion/optimizer/src/optimize_projections.rs Outdated Show resolved Hide resolved

datafusion/optimizer/src/optimize_projections.rs Show resolved Hide resolved

alamb mentioned this pull request Mar 27, 2024

DataFusion weekly project plan (Andrew Lamb) - March 25, 2024 #9796

Closed

6 tasks

Address reviews

5349c0f

mustafasrepo merged commit ba8f1af into apache:main Mar 27, 2024
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Projection Pushdown through user defined LogicalPlan nodes. #9690

Projection Pushdown through user defined LogicalPlan nodes. #9690

mustafasrepo commented Mar 19, 2024

alamb left a comment

alamb Mar 19, 2024

mustafasrepo Mar 20, 2024

alamb Mar 19, 2024

alamb Mar 19, 2024

mustafasrepo Mar 20, 2024 •

edited

Loading

mustafasrepo Mar 20, 2024

alamb Mar 22, 2024

mustafasrepo Mar 25, 2024

mustafasrepo Mar 25, 2024 •

edited

Loading

alamb commented Mar 19, 2024

mustafasrepo commented Mar 20, 2024

alamb commented Mar 21, 2024

alamb left a comment

alamb commented Mar 27, 2024

Projection Pushdown through user defined LogicalPlan nodes. #9690

Projection Pushdown through user defined LogicalPlan nodes. #9690

Conversation

mustafasrepo commented Mar 19, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

alamb Mar 19, 2024

Choose a reason for hiding this comment

mustafasrepo Mar 20, 2024

Choose a reason for hiding this comment

alamb Mar 19, 2024

Choose a reason for hiding this comment

alamb Mar 19, 2024

Choose a reason for hiding this comment

mustafasrepo Mar 20, 2024 • edited Loading

Choose a reason for hiding this comment

mustafasrepo Mar 20, 2024

Choose a reason for hiding this comment

alamb Mar 22, 2024

Choose a reason for hiding this comment

mustafasrepo Mar 25, 2024

Choose a reason for hiding this comment

mustafasrepo Mar 25, 2024 • edited Loading

Choose a reason for hiding this comment

alamb commented Mar 19, 2024

mustafasrepo commented Mar 20, 2024

alamb commented Mar 21, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb commented Mar 27, 2024

mustafasrepo Mar 20, 2024 •

edited

Loading

mustafasrepo Mar 25, 2024 •

edited

Loading