
[SPARK-47672][SQL] Avoid double eval from filter pushDown #45802

Open · wants to merge 6 commits into master from SPARK-47672-avoid-double-eval-from-filter-pushdown
Conversation

@holdenk (Contributor) commented Apr 2, 2024

What changes were proposed in this pull request?

Changes the filter pushdown optimizer so that it does not push a filter down past a projection of an element the filter references, when we reasonably expect that computing that element is likely to be expensive.

This introduces an "expectedCost" mechanism, which we may or may not want. Previous filter-ordering work used filter pushdown eligibility as an approximation of expression cost, but here we need more granularity. As an alternative, we could introduce a boolean "expensive" flag rather than a numeric cost.
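
For illustration only, a minimal sketch of what such a cost heuristic might look like over Catalyst expression trees; the object name, method name, and constants here are assumptions, not the PR's actual implementation:

```scala
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical sketch: walk an expression tree and sum per-node costs.
// The constants are illustrative guesses, not values from the PR.
object ExpectedCost {
  def expectedCost(e: Expression): Double = {
    val selfCost = e match {
      case _: ScalaUDF               => 100.0 // opaque user code: assume expensive
      case _: RLike                  => 50.0  // regex evaluation
      case _: JsonToStructs          => 50.0  // JSON parsing
      case _: Attribute | _: Literal => 0.0   // just a reference or a constant
      case _                         => 1.0   // simple arithmetic, comparisons, etc.
    }
    selfCost + e.children.map(expectedCost).sum
  }
}
```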

Future Work / What else remains to do?

Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the thing we are referencing is expensive, rather than the condition itself. We could do this as a follow-up item or as part of this PR.

Why are the changes needed?

Currently Spark may compute expensive operations (like JSON parsing, UDF evaluation, etc.) twice as a result of pushing a filter down past a projection.
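
As a concrete illustration of the problem (a hedged reconstruction, not the gist mentioned under testing below; assumes an active SparkSession named spark):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// A deliberately slow UDF standing in for JSON parsing, regexes, etc.
val expensive = udf { (s: String) =>
  Thread.sleep(1) // simulate costly work
  s.length
}

val df = spark.range(1000).select($"id".cast("string").as("s"))
val projected = df.select(expensive($"s").as("n"))

// When the filter is pushed below the Project, the condition is rewritten
// from n > 10 to expensive(s) > 10, so expensive() runs once for the
// filter and again for the projected column n.
projected.filter($"n" > 10).explain(true)
```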

Does this PR introduce any user-facing change?

This SQL optimizer change may alter the plans of some user queries; results should be the same, and hopefully a little faster.

How was this patch tested?

New tests were added to FilterPushdownSuite, and the initial double-evaluation problem was confirmed with a GitHub gist.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label Apr 2, 2024
…n an 'expensive' projected operation (rlike)
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns.
@holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown branch from 5c6a780 to 19bc6d7 on April 2, 2024 21:53
@holdenk requested a review from cloud-fan on April 4, 2024 23:10
@cloud-fan (Contributor) commented

Oh, this is a hard one. The cost of predicates is hard to estimate, and so is the benefit, since that requires estimating the selectivity and the input data volume.

cc @kelvinjian-db @jchen5

@holdenk (Contributor, Author) commented Apr 5, 2024

It is. In general, since we still apply the filter after the projection, if a user has created a projection with a named field and then filtered on that field, they are probably doing so intentionally because they don't want to double-evaluate the named field. That, plus some basic cost heuristics (simple math is cheap; UDFs can be expensive, and so can regexes), should be a net win.

@mridulm (Contributor) commented Apr 5, 2024

+CC @shardulm94

@holdenk (Contributor, Author) commented Apr 11, 2024

Another possible solution would be to break up the projection and move the part of the projection that is used in the filter down with the filter, unless the only thing the projection adds is the filter field, in which case we'd leave it as is.

This logic starts to get more complex, but I think in that case it's probably more of a "pure" win (i.e., no downsides). WDYT @cloud-fan?
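
A hedged sketch of the plan shapes that split might produce (illustrative, not actual optimizer output):

```scala
// Before pushdown: the filter references projected column n = expensive(s),
// and the projection also computes other columns.
//
//   Filter(n > 10)
//   +- Project(expensive(s) AS n, other(s) AS o)
//      +- Relation(s)
//
// After splitting: only the expression the filter needs moves below the
// Filter; the rest of the projection stays above and reuses n directly,
// so expensive(s) is evaluated exactly once.
//
//   Project(n, other(s) AS o)
//   +- Filter(n > 10)
//      +- Project(expensive(s) AS n, s)
//         +- Relation(s)
```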

@holdenk (Contributor, Author) commented May 6, 2024

Do folks have a preference between this approach & the one in #46143?

@holdenk (Contributor, Author) commented May 8, 2024

CC @cloud-fan do you have thoughts / cycles?

@cloud-fan (Contributor) commented

I've been thinking hard about it. Filter pushdown should always be beneficial if we don't duplicate expressions, and the new With expression can avoid expression duplication.

So my proposal is: when we push down a filter and are about to duplicate some expressions, let's use With to avoid it. At the end of the optimizer, we run the rule RewriteWithExpression to rewrite With and pull common expressions out into a Project below. The data source pushdown rule doesn't require the scan node to be the direct child of Filter, so everything should work as before.
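
For context, a hedged sketch of the With / RewriteWithExpression mechanism being proposed (plan shapes are illustrative, not actual optimizer output):

```scala
// A condition that would otherwise evaluate expensive(s) twice:
//
//   Filter(expensive(s) > 10 AND expensive(s) < 100)
//   +- Relation(s)
//
// can instead bind the common expression once with With:
//
//   Filter(With(ref := expensive(s)) { ref > 10 AND ref < 100 })
//   +- Relation(s)
//
// At the end of optimization, RewriteWithExpression pulls the common
// expression into a Project below the Filter and projects it away above:
//
//   Project(s)
//   +- Filter(_common_expr_0 > 10 AND _common_expr_0 < 100)
//      +- Project(s, expensive(s) AS _common_expr_0)
//         +- Relation(s)
```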

@holdenk (Contributor, Author) commented May 9, 2024

Let me take a look at the With functionality, but that sounds potentially reasonable.
