[SPARK-47672][SQL] Avoid double eval from filter pushDown w/ projection pushdown #46143
Conversation
CC @cloud-fan do you have thoughts / cycles?

+CC @shardulm94

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Hi @cloud-fan, looks like the "with" suggestion ended up being more complicated than originally suggested (see #46499 (comment)). In the interest of progress, and to avoid double evaluation of a lot of really expensive things we don't need, I intend to update this PR and merge it. We can still circle back to the with approach eventually.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Just wondering if we have a consensus on the best way to go about this, @zml1206 / @cloud-fan? I'm thinking, given it's been more than a year since the with change, that it might be more complicated than we originally thought. I can re-explore as well if @zml1206 is busy, but we could also go for the simpler solution in the meantime, since double UDF evaluation is bad.
Hi @holdenk, we tried very hard to solve this issue efficiently but failed. The idea was to let the filter carry a project list and push them down together, but when we push through a Project/Aggregate which also contains a project list, we may still hit expression duplication and need to make a decision based on cost. Sorry, I should have come back to this PR earlier. I think we can simplify it a bit, as we will likely never have a practical cost model for Spark expressions. Let's just avoid UDF expressions (extends marker expression …)
Sounds like a plan, I'll work on simplifying this code. |
Assuming this still passes CI (fingers crossed), and given that I've updated the docs and simplified the flow of the optimization rule a little bit, I intend to merge this next week.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
// We have at least one filter that we can split the projection around.
// We're going to now add projections one at a time for the expensive components for
// each group of filters. We'll keep track of what we added for the previous filter(s)
// so we don't double add anything.
A different way to look at this problem and design the algorithm: a Project with N named expressions can be split into at most N Projects. For example, Project(expr1 AS c1, expr2 AS c2, expr3 AS c3) can be split into
Project(c1, c2, expr3 AS c3)
Project(leaf.output, c1, expr2 AS c2)
Project(leaf.output, expr1 AS c1)
leaf
Each condition references a subset (or all) of the N named expressions. Let's group the conditions by the number of referenced named expressions: group1 means only one named expression is referenced, and so on.
For conditions in group1, we rank the named expressions by "how many conditions can be evaluated with it" and pick the best. Say it's e1: we create the first Project to evaluate e1 and move the conditions that can't yet be evaluated to group2.
For conditions in group2, we again rank the remaining named expressions (excluding e1) by "how many conditions can be evaluated with it" and pick the best. For named expressions with the same rank, we sort by ref count and pick the best, say e2, and we repeat the same process for group3, group4, ..., groupN.
This algorithm tries to run as many filters as early as possible.
Yes, that's true, but given your previous statement about how adding projections is not free, I don't think that's the right way to structure this.
let's group the conditions by the number of referenced named expressions, group1 means only reference one named expression, and so on.
...
For conditions in group1, we rank the named expressions by "how many conditions can be evaluated with it" and pick the best

This seems like a good heuristic to me to split the projections around.
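As a rough illustration of the greedy ranking heuristic discussed above, here is a small Python sketch. The actual PR is Scala/Catalyst; the function name, the alias/condition encoding, and the tie-breaking detail are all hypothetical, and projection costs are ignored, so this shows only the greedy skeleton:

```python
def split_order(named_exprs, conditions):
    """Pick the order in which to materialize projected aliases so that
    as many filter conditions as possible can run as early as possible.

    named_exprs: the set of alias names produced by the Project.
    conditions:  a list of sets, each holding the aliases one filter
                 condition references.
    Returns the chosen alias materialization order."""
    remaining = list(conditions)
    available = set()                  # aliases already materialized
    order = []
    candidates = sorted(named_exprs)   # sorted for deterministic ties
    while remaining and candidates:
        def evaluable(alias):
            # conditions that become runnable once `alias` is added
            return sum(1 for refs in remaining if refs <= available | {alias})
        def refcount(alias):
            # tie-break: how many pending conditions mention the alias
            return sum(1 for refs in remaining if alias in refs)
        # Rank: most conditions unlocked first, then highest ref count.
        best = max(candidates, key=lambda a: (evaluable(a), refcount(a)))
        order.append(best)
        available.add(best)
        candidates.remove(best)
        remaining = [refs for refs in remaining if not refs <= available]
    return order
```

With conditions [{'c1'}, {'c1'}, {'c2'}], c1 unlocks two filters and is materialized first; a condition referencing two aliases simply waits until both are available, mirroring the group1/group2 progression described above.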
if (!SQLConf.get.avoidDoubleFilterEval) {
  (cond, AttributeMap.empty[Alias])
} else {
  val (replaced, usedAliases) = replaceAliasWhileTracking(cond, aliasMap)
Seems like you don't use replaced anywhere, or am I missing something?
It's a good point; we currently re-calculate replaced again further downstream when we need it.
To discuss #46143 (comment) further:

That's why my initial suggestion was to not do this optimization at all. We just keep the …

So just always leave up complex filters and don't attempt to split them if needed? I think that's sub-optimal for fairly self-evident reasons, but if you still find the current implementation too complex I could move it into a follow-on PR so there's less to review here and we just fix the perf regression introduced in 3.0.

A followup SGTM, at least we can fix the perf regression first.

Awesome, I'll rework this then :)
What changes were proposed in this pull request?
Changes the filter pushDown optimizer to not push down past projections of the same element if we reasonably expect that computing that element is likely to be expensive.
This is a slightly complex alternative to #45802, which also moves parts of projections down so that the filters can move further down.
An expression can indicate that it is too expensive to be worth the potential savings of being double evaluated as a result of pushdown (by default we do this for all UDFs).
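The gating decision can be sketched outside Catalyst. In this hypothetical Python model, EXPENSIVE_OPS, the tuple encoding of expression trees, and can_push_past_project are all made up for illustration (the real change uses a marker trait on Catalyst expressions inside Optimizer.scala):

```python
# Toy expression trees: a leaf is a column name (str); an internal node
# is a tuple (op, *children).
EXPENSIVE_OPS = {"udf", "from_json"}   # ops we refuse to re-evaluate

def is_expensive(expr):
    """True if the tree contains any op we consider too costly to duplicate."""
    if isinstance(expr, str):
        return False
    op, *children = expr
    return op in EXPENSIVE_OPS or any(is_expensive(c) for c in children)

def references(expr):
    """Column/alias names mentioned anywhere in the tree."""
    if isinstance(expr, str):
        return {expr}
    _, *children = expr
    refs = set()
    for c in children:
        refs |= references(c)
    return refs

def can_push_past_project(cond, project_aliases):
    """project_aliases maps alias -> expression produced by the Project.
    Refuse pushdown when substituting an alias into the condition would
    duplicate an expensive expression on both sides of the Project."""
    return not any(
        alias in project_aliases and is_expensive(project_aliases[alias])
        for alias in references(cond)
    )
```

For example, with aliases {"y": ("udf", "x"), "z": ("upper", "x")}, a condition on y stays above the Project while a condition on z is still pushed down.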
Future Work / What else remains to do?
Right now, if a cond is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the thing we are referencing is expensive rather than the condition itself. We could do this as a follow-up item or as part of this PR.
Why are the changes needed?
Currently Spark may double compute expensive operations (like json parsing, UDF eval, etc.) as a result of filter pushdown past projections.
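As a toy illustration in plain Python (not Catalyst; expensive_udf and the constant in the filter are stand-ins): pushing Filter(y > 2) below Project(udf(x) AS y) substitutes the alias, so the UDF appears in both the filter and the projection and runs again for every surviving row:

```python
calls = 0

def expensive_udf(x):
    """Stand-in for a costly UDF, JSON parse, etc. Counts invocations."""
    global calls
    calls += 1
    return x * 2

rows = [1, 2, 3]

# Plan without pushdown: Project(udf(x) AS y) then Filter(y > 2).
# The UDF runs exactly once per input row.
projected = [expensive_udf(x) for x in rows]
kept = [y for y in projected if y > 2]
calls_without_pushdown = calls

# Naive pushdown: Filter(udf(x) > 2) then Project(udf(x) AS y).
# The UDF runs once per row in the filter, and again per surviving row.
calls = 0
kept_pushed = [expensive_udf(x) for x in rows if expensive_udf(x) > 2]
calls_with_pushdown = calls

assert kept == kept_pushed   # same result, strictly more UDF calls
```

Here the non-pushed plan makes 3 UDF calls while the pushed plan makes 5 (3 in the filter plus 2 in the projection), which is exactly the regression this PR avoids for expensive expressions.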
Does this PR introduce any user-facing change?
This SQL optimizer change may impact some user queries; results should be the same and hopefully a little faster.
How was this patch tested?
New tests were added to the FilterPushdownSuite, and the initial problem of double evaluation was confirmed with a GitHub gist.
Was this patch authored or co-authored using generative AI tooling?
Claude was used to generate more test coverage.