
Conversation

@holdenk (Contributor) commented Apr 20, 2024

What changes were proposed in this pull request?

Changes the filter pushdown optimizer to not push filters down past projections of an element if we reasonably expect that computing that element is likely to be expensive.

This is a slightly more complex alternative to #45802 that also moves parts of projections down so that the filters can move further down.

An expression can indicate that it is too expensive for double evaluation to be worth the potential savings from pushdown (by default we treat all UDFs this way).
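
For illustration, a minimal sketch of the shape of that hook (the trait and method names here are hypothetical, not the PR's actual API):

  import org.apache.spark.sql.catalyst.expressions.Expression

  // Hypothetical opt-out marker: an expression overrides this to signal that
  // evaluating it twice likely costs more than the benefit of pushing a
  // filter below the projection that computes it.
  trait MaybeExpensive { self: Expression =>
    def tooExpensiveToDuplicate: Boolean = false
  }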

Future Work / What else remains to do?

Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the referenced expression is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.

Why are the changes needed?

Currently Spark may compute expensive operations (JSON parsing, UDF evaluation, etc.) twice as a result of filter pushdown past projections.
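
For example (a minimal sketch; the table name events and its string payload column are assumed):

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  val schema = new StructType().add("id", LongType)
  val df = spark.table("events")
    .select(from_json(col("payload"), schema).as("data"))
    .filter(col("data").isNotNull)

  // Pushing the filter below the Project inlines the alias, producing roughly:
  //   Project [from_json(payload) AS data]
  //   +- Filter isnotnull(from_json(payload))
  //      +- events
  // so from_json runs twice for every row that survives the filter.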

Does this PR introduce any user-facing change?

This SQL optimizer change may affect the plans of some user queries; results should be the same, and hopefully a little faster.

How was this patch tested?

New tests were added to the FilterPushDownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.

Was this patch authored or co-authored using generative AI tooling?

Used Claude to generate more test coverage.

@github-actions github-actions bot added the SQL label Apr 20, 2024
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from a97d56f to f1d1ddd on May 6, 2024
@holdenk holdenk changed the title [WIP][SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown [SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown May 8, 2024
@holdenk (Contributor, Author) commented May 8, 2024

CC @cloud-fan do you have thoughts / cycles?

@mridulm (Contributor) commented May 9, 2024

+CC @shardulm94

@github-actions bot commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 21, 2024
@github-actions github-actions bot closed this Aug 22, 2024
@holdenk holdenk removed the Stale label Dec 2, 2024
@holdenk holdenk reopened this Dec 2, 2024
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from ac85ead to a5d8400 on December 5, 2024
@holdenk (Contributor, Author) commented Dec 5, 2024

Hi @cloud-fan, it looks like the With suggestion ended up being more complicated than originally thought (see #46499 (comment)). In the interest of progress, and to avoid double evaluation of a lot of really expensive things we don't need, I intend to update this PR and merge it. We can still circle back to the With approach eventually.

@cloud-fan (Contributor) commented Dec 6, 2024

Sorry for the late response on this project. I think the With approach is not that complicated, and I'm fixing the nested With limitation here: #49093. After this is merged, I can follow up with the actual pushdown implementation if @zml1206 can't continue his work.
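
Schematically, the With approach lets the Filter and the Project share a single evaluation of the expensive expression (an illustrative plan shape only; in Catalyst, With is an expression-level construct, not a plan node):

Project(c AS c1)
  Filter(c > 0)
    With(c := expensive(x))
      leaf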

@github-actions bot commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 22, 2025
@github-actions github-actions bot closed this Mar 23, 2025
@holdenk (Contributor, Author) commented Nov 3, 2025

Just wondering if we have a consensus on the best way to go about this, @zml1206 / @cloud-fan? Based on the more than a year since the With change, I'm thinking it might be more complicated than we originally thought. I can re-explore as well if @zml1206 is busy, but we could also go for the simpler solution in the meantime, since double UDF evaluation is bad.

@holdenk holdenk reopened this Nov 3, 2025
@holdenk holdenk removed the Stale label Nov 3, 2025
@cloud-fan (Contributor) commented

Hi @holdenk, we tried very hard to solve this issue efficiently but failed. The idea was to let the Filter carry a project list and push them down together, but when we push through a Project/Aggregate that also contains a project list, we may still hit expression duplication and need to make a decision based on cost.

Sorry, I should have come back to this PR earlier. I think we can simplify it a bit, as we will likely never have a practical cost model for Spark expressions. Let's just avoid duplicating UDF expressions (those extending the marker trait UserDefinedExpression) during filter pushdown, and add a config to enable it.
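
A minimal sketch of that check (the helper name is mine, not the PR's):

  import org.apache.spark.sql.catalyst.expressions.{Expression, UserDefinedExpression}

  // True if pushing a condition below a Project would duplicate a UDF: after
  // inlining the Project's aliases into the condition, it contains an
  // expression extending the UserDefinedExpression marker trait.
  def wouldDuplicateUdf(inlinedCond: Expression): Boolean =
    inlinedCond.exists(_.isInstanceOf[UserDefinedExpression])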

@holdenk (Contributor, Author) commented Nov 3, 2025

Sounds like a plan, I'll work on simplifying this code.

holdenk and others added 8 commits November 3, 2025 15:50
…n an 'expensive' projected operation (rlike)

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… we can't do partial on the ||

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ct to 'save'

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ushDown

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… projection that we are using

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…o aliases

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
sfc-gh-hkarau and others added 6 commits December 18, 2025 15:25
… || where we can't

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…pushdown-split-projection

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from 6716982 to 9c83e99 on December 22, 2025
@holdenk (Contributor, Author) commented Dec 22, 2025

Assuming this still passes CI (fingers crossed), now that I've updated the docs and simplified the flow of the optimization rule a little bit, I intend to merge this next week.

// We have at least one filter that we can split the projection around.
// We're now going to add projections one at a time for the expensive
// components of each group of filters. We'll keep track of what we added
// for the previous filter(s) so we don't add anything twice.
@cloud-fan (Contributor) commented Dec 23, 2025

A different way to look at this problem and design the algorithm: a Project with N named expressions can be split into at most N Projects, e.g. Project(expr1 AS c1, expr2 AS c2, expr3 AS c3) can be split into

Project(c1, c2, expr3 AS c3)
  Project(leaf.output, c1, expr2 AS c2)
    Project(leaf.output, expr1 AS c1)
      leaf

Each condition references a subset or all of the N named expressions. Let's group the conditions by the number of referenced named expressions: group1 means it references only one named expression, and so on.

For conditions in group1, we rank the named expressions by "how many conditions can be evaluated with it" and pick the best; say it's e1. We create the first Project to evaluate e1 and move the conditions that can't yet be evaluated to group2.

For conditions in group2, we again rank the remaining named expressions (excluding e1) by "how many conditions can be evaluated with it" and pick the best. For named expressions with the same rank, we sort by reference count and pick the best; say it's e2. We then repeat the same process for group3, group4, ..., groupN.

This algorithm tries to run as many filters as early as possible.
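
A rough sketch of the grouping step (types and names here are illustrative):

  import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

  // Bucket each pushable condition by how many of the Project's aliased
  // outputs it references: group 1 needs one extra Project below it,
  // group 2 needs two, and so on.
  def groupByAliasRefCount(
      conditions: Seq[Expression],
      aliasedOutputs: Set[Attribute]): Map[Int, Seq[Expression]] =
    conditions.groupBy(c => c.references.count(aliasedOutputs.contains))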

@holdenk (Contributor, Author) replied

Yes, that's true, but given your previous statement about how adding projections is not free, I don't think that's the right way to structure this.

A reviewer (Contributor) replied

> Let's group the conditions by the number of referenced named expressions: group1 means it references only one named expression, and so on.
> ...
> For conditions in group1, we rank the named expressions by "how many conditions can be evaluated with it" and pick the best

This seems like a good heuristic to me for splitting the projections.

if (!SQLConf.get.avoidDoubleFilterEval) {
  (cond, AttributeMap.empty[Alias])
} else {
  val (replaced, usedAliases) = replaceAliasWhileTracking(cond, aliasMap)
A reviewer (Contributor) commented

Seems like you don't use replaced anywhere, or am I missing something?

@holdenk (Contributor, Author) replied

It's a good point; we currently re-calculate replaced further downstream when we need it.

@cloud-fan (Contributor) commented Jan 5, 2026

To discuss #46143 (comment) further:

> Yes, that's true, but given your previous statement about how adding projections is not free, I don't think that's the right way to structure this.

That's why my initial suggestion was to not do this optimization at all. We just keep the Filter above the Project. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in the Project now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given that you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

@holdenk (Contributor, Author) commented Jan 5, 2026

> That's why my initial suggestion was to not do this optimization at all. We just keep the Filter above the Project. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in the Project now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given that you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

So just always leave complex filters up and don't attempt to split them if needed? I think that's sub-optimal for fairly self-evident reasons, but if you still find the current implementation too complex I could move it into a follow-on PR so there's less to review here, and we just fix the perf regression introduced in 3.0.

@cloud-fan (Contributor) commented

A follow-up SGTM; at least we can fix the perf regression first.

@holdenk (Contributor, Author) commented Jan 6, 2026

Awesome, I'll rework this then :)
