
Conversation

@holdenk (Contributor) commented Apr 20, 2024

What changes were proposed in this pull request?

Changes the filter pushdown optimizer to not push filters down past projections of an element if we reasonably expect that computing that element is likely to be expensive.

This is a slightly more complex alternative to #45802 that also moves parts of projections down so that the filters can move further down.

An expression can indicate that it is too expensive for double evaluation to be worth the potential savings from pushdown (by default we treat all UDFs this way).
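
For illustration, a minimal sketch of the shape of that hook (the trait and method names here are hypothetical, not the PR's actual API):

  import org.apache.spark.sql.catalyst.expressions.Expression

  // Hypothetical opt-out marker: an expression overrides this to signal that
  // evaluating it twice likely costs more than the benefit of pushing a
  // filter below the projection that computes it.
  trait MaybeExpensive { self: Expression =>
    def tooExpensiveToDuplicate: Boolean = false
  }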

Future Work / What else remains to do?

Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the referenced expression is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.

Why are the changes needed?

Currently Spark may compute expensive operations (JSON parsing, UDF evaluation, etc.) twice as a result of filter pushdown past projections.
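
For example (a minimal sketch; the table name events and its string payload column are assumed):

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  val schema = new StructType().add("id", LongType)
  val df = spark.table("events")
    .select(from_json(col("payload"), schema).as("data"))
    .filter(col("data").isNotNull)

  // Pushing the filter below the Project inlines the alias, producing roughly:
  //   Project [from_json(payload) AS data]
  //   +- Filter isnotnull(from_json(payload))
  //      +- events
  // so from_json runs twice for every row that survives the filter.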

Does this PR introduce any user-facing change?

This SQL optimizer change may affect the plans of some user queries; results should be the same, and hopefully a little faster.

How was this patch tested?

New tests were added to the FilterPushDownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.

Was this patch authored or co-authored using generative AI tooling?

Used Claude to generate more test coverage.

@github-actions github-actions bot added the SQL label Apr 20, 2024
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from a97d56f to f1d1ddd on May 6, 2024
@holdenk holdenk changed the title [WIP][SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown [SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown May 8, 2024
@holdenk (Contributor, Author) commented May 8, 2024

CC @cloud-fan do you have thoughts / cycles?

@mridulm (Contributor) commented May 9, 2024

+CC @shardulm94

@github-actions bot commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 21, 2024
@github-actions github-actions bot closed this Aug 22, 2024
@holdenk holdenk removed the Stale label Dec 2, 2024
@holdenk holdenk reopened this Dec 2, 2024
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from ac85ead to a5d8400 on December 5, 2024
@holdenk (Contributor, Author) commented Dec 5, 2024

Hi @cloud-fan, it looks like the With suggestion ended up being more complicated than originally thought (see #46499 (comment)). In the interest of progress, and to avoid double evaluation of a lot of really expensive things we don't need, I intend to update this PR and merge it. We can still circle back to the With approach eventually.

@cloud-fan (Contributor) commented Dec 6, 2024

Sorry for the late response on this project. I think the With approach is not that complicated, and I'm fixing the nested With limitation here: #49093. After this is merged, I can follow up with the actual pushdown implementation if @zml1206 can't continue his work.
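
Schematically, the With approach lets the Filter and the Project share a single evaluation of the expensive expression (an illustrative plan shape only; in Catalyst, With is an expression-level construct, not a plan node):

Project(c AS c1)
  Filter(c > 0)
    With(c := expensive(x))
      leaf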

@github-actions bot commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 22, 2025
@github-actions github-actions bot closed this Mar 23, 2025
@holdenk (Contributor, Author) commented Nov 3, 2025

Just wondering if we have a consensus on the best way to go about this, @zml1206 / @cloud-fan? Based on the more than a year since the With change, I'm thinking it might be more complicated than we originally thought. I can re-explore as well if @zml1206 is busy, but we could also go for the simpler solution in the meantime, since double UDF evaluation is bad.

@holdenk holdenk reopened this Nov 3, 2025
@holdenk holdenk removed the Stale label Nov 3, 2025
@cloud-fan (Contributor) commented

Hi @holdenk, we tried very hard to solve this issue efficiently but failed. The idea was to let the Filter carry a project list and push them down together, but when we push through a Project/Aggregate that also contains a project list, we may still hit expression duplication and need to make a decision based on cost.

Sorry, I should have come back to this PR earlier. I think we can simplify it a bit, as we will likely never have a practical cost model for Spark expressions. Let's just avoid duplicating UDF expressions (those extending the marker trait UserDefinedExpression) during filter pushdown, and add a config to enable it.
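
A minimal sketch of that check (the helper name is mine, not the PR's):

  import org.apache.spark.sql.catalyst.expressions.{Expression, UserDefinedExpression}

  // True if pushing a condition below a Project would duplicate a UDF: after
  // inlining the Project's aliases into the condition, it contains an
  // expression extending the UserDefinedExpression marker trait.
  def wouldDuplicateUdf(inlinedCond: Expression): Boolean =
    inlinedCond.exists(_.isInstanceOf[UserDefinedExpression])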

@holdenk (Contributor, Author) commented Nov 3, 2025

Sounds like a plan, I'll work on simplifying this code.

holdenk and others added 8 commits November 3, 2025 15:50
…n an 'expensive' projected operation (rlike)

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… we can't do partial on the ||

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ct to 'save'

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…ushDown

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… projection that we are using

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…o aliases

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
sfc-gh-hkarau and others added 6 commits December 18, 2025 15:25
… || where we can't

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…pushdown-split-projection

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown-split-projection branch from 6716982 to 9c83e99 on December 22, 2025
@holdenk (Contributor, Author) commented Dec 22, 2025

Assuming this still passes CI (fingers crossed), now that I've updated the docs and simplified the flow of the optimization rule a little bit, I intend to merge this next week.

// We have at least one filter that we can split the projection around.
// We're now going to add projections one at a time for the expensive
// components of each group of filters. We'll keep track of what we added
// for the previous filter(s) so we don't add anything twice.
@cloud-fan (Contributor) commented Dec 23, 2025

A different way to look at this problem and design the algorithm: a Project with N named expressions can be split into at most N Projects, e.g. Project(expr1 AS c1, expr2 AS c2, expr3 AS c3) can be split into

Project(c1, c2, expr3 AS c3)
  Project(leaf.output, c1, expr2 AS c2)
    Project(leaf.output, expr1 AS c1)
      leaf

Each condition references a subset or all of the N named expressions. Let's group the conditions by the number of referenced named expressions: group1 means it references only one named expression, and so on.

For conditions in group1, we rank the named expressions by "how many conditions can be evaluated with it" and pick the best; say it's e1. We create the first Project to evaluate e1 and move the conditions that can't yet be evaluated to group2.

For conditions in group2, we again rank the remaining named expressions (excluding e1) by "how many conditions can be evaluated with it" and pick the best. For named expressions with the same rank, we sort by reference count and pick the best; say it's e2. We then repeat the same process for group3, group4, ..., groupN.

This algorithm tries to run as many filters as early as possible.
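
A rough sketch of the grouping step (types and names here are illustrative):

  import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

  // Bucket each pushable condition by how many of the Project's aliased
  // outputs it references: group 1 needs one extra Project below it,
  // group 2 needs two, and so on.
  def groupByAliasRefCount(
      conditions: Seq[Expression],
      aliasedOutputs: Set[Attribute]): Map[Int, Seq[Expression]] =
    conditions.groupBy(c => c.references.count(aliasedOutputs.contains))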

@holdenk (Contributor, Author) replied

Yes, that's true, but given your previous statement about how adding projections is not free, I don't think that's the right way to structure this.

A reviewer (Contributor) replied

> Let's group the conditions by the number of referenced named expressions: group1 means it references only one named expression, and so on.
> ...
> For conditions in group1, we rank the named expressions by "how many conditions can be evaluated with it" and pick the best

This seems like a good heuristic to me for splitting the projections.

if (!SQLConf.get.avoidDoubleFilterEval) {
  (cond, AttributeMap.empty[Alias])
} else {
  val (replaced, usedAliases) = replaceAliasWhileTracking(cond, aliasMap)
A reviewer (Contributor) commented

Seems like you don't use replaced anywhere, or am I missing something?

@holdenk (Contributor, Author) replied

It's a good point; we currently re-calculate replaced further downstream when we need it.

@cloud-fan (Contributor) commented Jan 5, 2026

To discuss #46143 (comment) further:

> Yes, that's true, but given your previous statement about how adding projections is not free, I don't think that's the right way to structure this.

That's why my initial suggestion was to not do this optimization at all. We just keep the Filter above the Project. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in the Project now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given that you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

@holdenk (Contributor, Author) commented Jan 5, 2026

> That's why my initial suggestion was to not do this optimization at all. We just keep the Filter above the Project. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in the Project now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given that you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

So just always leave complex filters up and don't attempt to split them if needed? I think that's sub-optimal for fairly self-evident reasons, but if you still find the current implementation too complex I could move it into a follow-on PR so there's less to review here, and we just fix the perf regression introduced in 3.0.

@cloud-fan (Contributor) commented

A follow-up SGTM; at least we can fix the perf regression first.

@holdenk (Contributor, Author) commented Jan 6, 2026

Awesome, I'll rework this then :)
