
ARROW-9619: [Rust] [DataFusion] Add predicate push-down #7880

Closed · wants to merge 5 commits into from

Conversation

@jorgecarleitao (Member) commented Aug 2, 2020

This PR adds a new optimizer to push filters down. For example, a plan of the form

```
Selection: #SUM(c) Gt Int64(10)\
  Selection: #b Gt Int64(10)\
    Aggregate: groupBy=[[#b]], aggr=[[SUM(#c)]]\
      Projection: #a AS b, #c\
        TableScan: test projection=None"
```

is converted to

```
Selection: #SUM(c) Gt Int64(10)\
  Aggregate: groupBy=[[#b]], aggr=[[SUM(#c)]]\
    Projection: #a AS b, #c\
      Selection: #a Gt Int64(10)\
        TableScan: test projection=None";
```

(note how the filter expression changed from `#b Gt Int64(10)` to `#a Gt Int64(10)`, and how only the filter on the key of the aggregate was pushed)

This works by performing two passes on the plan. On the first pass (analyze), it identifies:

  1. all filters on the plan (selections)
  2. all projections on the plan (projections)
  3. all places where a filter on a column cannot be pushed down (break_points)

After this pass, it computes the maximum depth that a filter can be pushed down as well as the new expression that the filter should have, given all the projections that exist.

On the second pass (optimize), it:

  • removes all old filters
  • adds all new filters

See comments on the code for details.
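
For intuition, here is a self-contained toy of the two passes on a drastically simplified, linear plan. It is a sketch under stated assumptions, not this PR's code: break points other than the scan (aggregates, limits) are omitted, and `Node`, `push_down`, and the string-based predicates are illustrative stand-ins.

```
// Toy two-pass filter pushdown over a linear plan, listed top-down.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Scan,                                         // filters stop here
    Projection { alias: String, column: String }, // SELECT column AS alias
    Selection { column: String, predicate: String },
}

fn push_down(plan: Vec<Node>) -> Vec<Node> {
    // Pass 1 (analyze): collect and remove the filters, rewriting each
    // filtered column through every projection it crosses
    // (e.g. `#b Gt Int64(10)` becomes `#a Gt Int64(10)`).
    let mut filters: Vec<(String, String)> = Vec::new();
    let mut kept = Vec::new();
    for node in plan {
        match node {
            Node::Selection { column, predicate } => filters.push((column, predicate)),
            Node::Projection { alias, column } => {
                for f in filters.iter_mut() {
                    if f.0 == alias {
                        f.0 = column.clone();
                    }
                }
                kept.push(Node::Projection { alias, column });
            }
            other => kept.push(other),
        }
    }
    // Pass 2 (optimize): rebuild the plan, re-adding the rewritten filters
    // as deep as they can go (here: directly above the scan).
    let mut out = Vec::new();
    for node in kept {
        if node == Node::Scan {
            for (column, predicate) in filters.drain(..) {
                out.push(Node::Selection { column, predicate });
            }
        }
        out.push(node);
    }
    out
}

fn main() {
    let plan = vec![
        Node::Selection { column: "b".into(), predicate: "Gt Int64(10)".into() },
        Node::Projection { alias: "b".into(), column: "a".into() },
        Node::Scan,
    ];
    // The Selection ends up below the Projection, now filtering on `a`.
    println!("{:#?}", push_down(plan));
}
```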

This PR is built on top of #7879 (first two commits).

FYI @andygrove @sunchao


@jorgecarleitao (Member, Author) commented:

Could any of you @alamb @houqp @nevi-me @paddyhoran help out here? I think this significantly speeds up querying for anything more complex, as we run aggregations and projections on much less data.

@alamb (Contributor) commented Aug 12, 2020 via email

@alamb (Contributor) left a review:

All in all, I think this is a good initial Selection pushdown optimization.

I spent a while reading the code, and while I can't say I was able to follow the algorithm exactly (I found the split of logic between calculating where to push each Selection and actually pushing them hard to follow), I think the number and breadth of tests is good and in general sufficient to convince me this code does what it says.

One question I had was whether you have any particular SQL queries this optimization will help with. The one kind I could think of (a HAVING clause, which is normally executed after the aggregate but could be pushed down) doesn't appear to be implemented yet:

```
> SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status HAVING status = '2XX';
NotImplemented("HAVING is not implemented yet")
```

And typically the WHERE clause in SQL statements already starts as far down in the plan as possible:

```
> explain SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;
+--------------+----------------------------------------------------------------+
| plan_type    | plan                                                           |
+--------------+----------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|              |   Selection: #path Eq Utf8("/api/v2/write")                    |
|              |     TableScan: http_api_requests_total projection=Some([6, 8]) |
+--------------+----------------------------------------------------------------+
```

Another case that might be interesting would be a Selection of ANDs where one clause could be pushed down and another could not. For example,

```
Selection: #b Gt Int64(10) AND SUM(c) Gt Int64(10)\
  Aggregate: groupBy=[[#b]], aggr=[[SUM(#c)]]\
    Projection: #a AS b, #c\
      TableScan: test projection=None"
```

could still be converted into the following (though there are now a different number of LogicalPlan nodes):

```
Selection: #SUM(c) Gt Int64(10)\
  Aggregate: groupBy=[[#b]], aggr=[[SUM(#c)]]\
    Projection: #a AS b, #c\
      Selection: #a Gt Int64(10)\
        TableScan: test projection=None";
```
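
Splitting such a conjunction is mechanical; below is a minimal, self-contained sketch in which `Expr` is a stand-in for DataFusion's expression type (an assumption for illustration, not the real API):

```
// Flatten nested ANDs so each conjunct can be pushed independently:
// `a AND (b AND c)` -> [a, b, c].
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

fn split_conjunction(expr: &Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(left, right) => {
            split_conjunction(left, out);
            split_conjunction(right, out);
        }
        other => out.push(other.clone()),
    }
}

fn main() {
    // #b Gt Int64(10) AND SUM(c) Gt Int64(10)
    let filter = Expr::And(
        Box::new(Expr::Gt(
            Box::new(Expr::Column("b".into())),
            Box::new(Expr::Literal(10)),
        )),
        Box::new(Expr::Gt(
            Box::new(Expr::Column("SUM(c)".into())),
            Box::new(Expr::Literal(10)),
        )),
    );
    let mut parts = Vec::new();
    split_conjunction(&filter, &mut parts);
    // `#b Gt Int64(10)` could be pushed below the aggregate;
    // `SUM(c) Gt Int64(10)` could not.
    println!("{:?}", parts);
}
```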

On this snippet from the diff:

```
.filter(col("a").eq(&Expr::Literal(ScalarValue::Int64(1))))?
.build()?;

// not part of the test, just good to know:
```
Review comment (Contributor):
👍

@jorgecarleitao (Member, Author) commented:

Thank you very much @alamb for reviewing it!

This optimizer is mostly useful with the table or DataFrame API, where a view can be declared as a sequence of statements organized for logical clarity and code organization rather than for execution.

One example is when we have a DataFrame df that was constructed optimally, but we would like to look only at rows where 'a' > 2. Instead of going through the code that built that DataFrame and investigating where the filter should be placed, we can just write df.filter(df['a'] > 2).collect() and let the optimizer figure out where to place it.
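
As a sketch of that usage in Rust, under the assumption of a DataFrame-style API (the names below follow present-day DataFusion, not necessarily the API as it existed at the time of this PR):

```
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // `df` may have been built elsewhere through many projections and
    // aggregates; we do not need to know where a filter belongs in it.
    let df = ctx.read_csv("data.csv", CsvReadOptions::new()).await?;
    // Appending the filter at the end is enough: the pushdown optimizer
    // moves it as close to the scan as the plan allows.
    let results = df.filter(col("a").gt(lit(2)))?.collect().await?;
    println!("{} batches", results.len());
    Ok(())
}
```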

I incorporated the comments above into #7879, as IMO they are part of that PR, and rebased the whole thing. I will still address your comment about not fully understanding the algorithm by adding a more extended comment, and maybe try drawing some ASCII art to better explain the idea, so that it is not only in my head.

@alamb (Contributor) left a review:

Thanks @jorgecarleitao !

@jorgecarleitao (Member, Author) commented:

@alamb, I have added a comment describing the algorithm. Could you take a look and evaluate whether it helps in understanding the underlying code?

@alamb (Contributor) left a review:

I started reading through this more carefully, @jorgecarleitao, and there are some additional test cases I want to try (when there are several selections). Thanks for the comments -- they are super helpful.

@houqp (Member) left a review:

great work 👍

@alamb (Contributor) commented Aug 15, 2020

I have had this nagging sensation that the algorithm isn't quite right when the same column is used multiple times. I finally came up with an example that shows part of what I have been worrying about.

Here is a new test that passes on this branch but that I think is incorrect. Specifically, with two Selections on the same column separated by a Limit, one of the Selections is lost by the algorithm as written. Am I missing something?


```
#[test]
fn filter_2_breaks_limits() -> Result<()> {
    let table_scan = test_table_scan()?;
    let plan = LogicalPlanBuilder::from(&table_scan)
        .project(vec![col("a")])?
        .filter(col("a").lt_eq(&Expr::Literal(ScalarValue::Int64(1))))?
        .limit(1)?
        .project(vec![col("a")])?
        .filter(col("a").gt_eq(&Expr::Literal(ScalarValue::Int64(1))))?
        .build()?;
    // Should be able to move both filters below the projections

    // not part of the test
    assert_eq!(
        format!("{:?}", plan),
        "Selection: #a GtEq Int64(1)\
         \n  Projection: #a\
         \n    Limit: 1\
         \n      Selection: #a LtEq Int64(1)\
         \n        Projection: #a\
         \n          TableScan: test projection=None"
    );

    // This just seems wrong: we lost a selection...
    let expected = "\
    Projection: #a\
    \n  Selection: #a GtEq Int64(1)\
    \n    Limit: 1\
    \n      Projection: #a\
    \n        TableScan: test projection=None";

    assert_optimized_plan_eq(&plan, expected);
    Ok(())
}
```

FYI @jorgecarleitao

@jorgecarleitao jorgecarleitao marked this pull request as draft August 15, 2020 18:39
@jorgecarleitao (Member, Author) commented:

@alamb, thank you so much for taking the time to think through this and come up with an example. I agree with you that it is wrong. I will evaluate whether the current approach can cope with this, or whether we will have to scrap it and start from a different direction.

I changed this PR back to draft as it is obviously out of spec.

@jorgecarleitao jorgecarleitao marked this pull request as ready for review August 16, 2020 03:10
@jorgecarleitao (Member, Author) commented:

@alamb @houqp @andygrove, I think this is ready for re-review.

I modified the result returned by analyze to ensure that we do not lose relevant information (the loss that led to the error @alamb found).

I also found and fixed another error related to the placement of two filters at the same depth, which caused filters to be dropped: their expressions are now ANDed together instead, which has the added bonus of combining filters whenever possible.
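
A minimal sketch of that ANDing step, with `Expr` again a self-contained stand-in rather than the PR's actual type:

```
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    GtEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// AND together all filter expressions that landed at the same depth,
/// producing a single Selection predicate (None if there were no filters).
fn combine(filters: Vec<Expr>) -> Option<Expr> {
    filters
        .into_iter()
        .reduce(|acc, e| Expr::And(Box::new(acc), Box::new(e)))
}

fn main() {
    let parts = vec![
        Expr::GtEq(Box::new(Expr::Column("b".into())), Box::new(Expr::Literal(1))),
        Expr::GtEq(Box::new(Expr::Column("a".into())), Box::new(Expr::Literal(1))),
    ];
    // `#b GtEq Int64(1) And #a GtEq Int64(1)` as one predicate.
    println!("{:?}", combine(parts));
}
```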

All the changes are in new commits, in case it is easier for the review.

On this snippet from the diff:

```
if max_depth.is_none() {
    // It is unlikely that the plan is correct without break points, as all
    // scans add break points. We just return the plan and let others handle
    // the error.
    return Ok(plan.clone());
}
```
Review comment (Member):

Shouldn't we return an error here instead if the plan is not correct?

@jorgecarleitao (Member, Author) replied:

The comment is poorly written. What I was trying to say is that we allow the optimizer not to return an error on poorly designed plans, in which case it just does not perform any optimization. This way, the user is likely to receive a better error message.

This is a design decision that we need to make with respect to optimizers (error or ignore?). I have no strong opinion about it either way: we can also return an error.

Let me know what you prefer and I will change it.

@houqp (Member) commented Aug 16, 2020

Something that can be left for a future optimization: we can also go the other direction, i.e. break And filters into individual boolean expressions so these filters can be partially pushed further down the plan.

@jorgecarleitao (Member, Author) commented:

> Something that can be left for a future optimization: we can also go the other direction, i.e. break And filters into individual boolean expressions so these filters can be partially pushed further down the plan.

Yep, good idea. As far as I have experienced, Spark is not doing this - at least up to Spark 2.4.5.

This implements the filter pushdown optimization with a double-pass algorithm.

Currently, "limit" and aggregates block the push-down of the filter.

This currently does not push the filter into the scan, as we do not yet support filtered scans.

Big kudos to @alamb for identifying this error.
@alamb (Contributor) left a review:

I think this is looking good. I spent a bunch of time trying to come up with counterexamples and fool the pushdown logic, and I could not. Nice work @jorgecarleitao

Some example plans it handled without issue:

```
---- optimizer::filter_push_down::tests::filter_2_aggs stdout ----
********Original plan:
Selection: #a Eq Int64(1)
  Projection: #a AS b, #b AS a
    Selection: #a GtEq Int64(1)
      Projection: #a AS b, #b AS a
        Selection: #a LtEq Int64(1)
          Aggregate: groupBy=[[#a, #b]], aggr=[[MIN(#b)]]
            TableScan: test projection=None
********Optimized plan:
Projection: #a AS b, #b AS a
  Projection: #a AS b, #b AS a
    Aggregate: groupBy=[[#a, #b]], aggr=[[MIN(#b)]]
      Selection: #a Eq Int64(1) And #b GtEq Int64(1) And #a LtEq Int64(1)
        TableScan: test projection=None
********Original plan:
Selection: #b GtEq Int64(1)
  Selection: #b LtEq Int64(1)
    Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
      Selection: #a GtEq Int64(1)
        Projection: #a AS b, #b AS a
          Selection: #a LtEq Int64(1)
            Aggregate: groupBy=[[#a, #b]], aggr=[[MIN(#b)]]
              TableScan: test projection=None
********Optimized plan:
Selection: #b GtEq Int64(1) And #b LtEq Int64(1)
  Aggregate: groupBy=[[#a]], aggr=[[MIN(#b)]]
    Projection: #a AS b, #b AS a
      Aggregate: groupBy=[[#a, #b]], aggr=[[MIN(#b)]]
        Selection: #b GtEq Int64(1) And #a LtEq Int64(1)
          TableScan: test projection=None
```

On this snippet from the diff:

```
Projection: #a\
\n Selection: #a GtEq Int64(1)\
\n Limit: 1\
\n Projection: #a\
```
Review comment (Contributor):

Something doesn't quite seem right here: I am surprised that one Projection is left in the plan while another is not. It is fine given that this pass is just supposed to push Selections, but it seems odd.

@jorgecarleitao (Member, Author) replied:

Sorry, I did not understand your comment. I thought that both projections were left in the plan (lines 590 and 593).

Reply (Contributor):

You are right -- I apologize, I misread the diff. This looks good to me.

@alamb (Contributor) commented Aug 17, 2020

> Something that can be left for a future optimization: we can also go the other direction, i.e. break And filters into individual boolean expressions so these filters can be partially pushed further down the plan.
>
> Yep, good idea. As far as I have experienced, Spark is not doing this - at least up to Spark 2.4.5.

I filed https://issues.apache.org/jira/browse/ARROW-9771 to track this suggestion

@andygrove (Member) left a review:

It might also be useful to reference the Apache Spark optimizer rules as we implement new rules in this project. Their PredicatePushDown rule starts around line 1100 here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

I haven't had time to review this PR fully but am happy to approve based on the other reviews.

@houqp (Member) commented Aug 18, 2020

One thing I do like about Spark's optimizer is that all optimization rules share a common plan-tree traversal and mutation routine, which makes individual optimization rules easier to reason about. I can see us adopting the same pattern in the future to simplify the existing code base.
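
A minimal sketch of that shared-traversal idea, with `Plan` and the rule as stand-ins rather than DataFusion's or Spark's actual types:

```
#[derive(Debug, Clone)]
enum Plan {
    Scan(String),
    Filter(String, Box<Plan>),
    Project(Vec<String>, Box<Plan>),
}

/// One generic bottom-up rewriter; each optimization rule is then just a
/// function applied at every node, with no traversal code of its own.
fn transform_up(plan: Plan, rule: &dyn Fn(Plan) -> Plan) -> Plan {
    let plan = match plan {
        Plan::Filter(expr, child) => Plan::Filter(expr, Box::new(transform_up(*child, rule))),
        Plan::Project(exprs, child) => Plan::Project(exprs, Box::new(transform_up(*child, rule))),
        leaf => leaf,
    };
    rule(plan)
}

fn main() {
    let plan = Plan::Project(
        vec!["a".into()],
        Box::new(Plan::Filter("a > 2".into(), Box::new(Plan::Scan("t".into())))),
    );
    // A trivial identity rule; a real rule would, e.g., swap a Filter
    // below a Project when the columns allow it.
    let rewritten = transform_up(plan, &|p| p);
    println!("{:?}", rewritten);
}
```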

@andygrove andygrove closed this in 197f903 Aug 19, 2020
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Sep 8, 2020

Closes apache#7880 from jorgecarleitao/filter_push

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
@jorgecarleitao jorgecarleitao deleted the filter_push branch October 28, 2020 04:17
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021