[SPARK-47672][SQL] Avoid double eval from filter pushDown #45802
Conversation
…n an 'expensive' projected operation (rlike)
… we can't do partial on the ||
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns.
Force-pushed from 5c6a780 to 19bc6d7
Oh, this is a hard one. The cost of predicates is hard to estimate, and so is the benefit, as we need to estimate the selectivity and the input data volume.
It is. In general, I think that since we still apply the filter post-projection, if a user has created a projection with a named field and then filtered on that field, the user is probably doing that intentionally, since they don't want to double-eval the named field. That, plus some basic cost heuristics (simple math is cheap; UDFs can be expensive, and so can regexes), should be a net win.
CC @shardulm94
Another possible solution would be to break up the projection and move the part of the projection that is used in the filter down with the filter, unless the only thing the projection is adding is the filter field, in which case we'd leave it as is. This logic starts to get more complex, but I think in that case it's probably more of a "pure" win (i.e. no downsides). WDYT @cloud-fan ?
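For what it's worth, here's a rough sketch of that split with toy plan nodes (plain Scala, not Catalyst's actual classes; `expensive1`/`expensive2` are hypothetical costly expressions):

```scala
// Toy logical plan nodes; expressions are plain strings for brevity.
sealed trait Plan
case class Scan(cols: Seq[String]) extends Plan
case class Project(exprs: Seq[String], child: Plan) extends Plan
case class Filter(cond: String, child: Plan) extends Plan

val scan = Scan(Seq("a", "b"))

// Before: the filter sits above a projection computing two expensive fields.
val before: Plan =
  Filter("x > 1",
    Project(Seq("expensive1(a) AS x", "expensive2(b) AS y"), scan))

// After: split the projection. The part the filter needs (x) stays below the
// filter so it is computed once; the rest (y) moves above the filter, so
// expensive2 is now only evaluated for rows that survive the filter.
val after: Plan =
  Project(Seq("x", "expensive2(b) AS y"),
    Filter("x > 1",
      Project(Seq("expensive1(a) AS x", "b"), scan)))

println(after)
```

The upside in this sketch is that neither expensive expression is ever evaluated twice, and the one not referenced by the filter is only evaluated on surviving rows.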
Do folks have a preference between this approach & the one in #46143 ?
CC @cloud-fan do you have thoughts / cycles?
I've been thinking hard about it. Filter pushdown should always be beneficial if we don't duplicate expressions, and the new `With` expression lets us avoid duplication. So my proposal is: when we push down a filter and we are about to duplicate some expressions, let's use `With` to share them instead.
Let me take a look at the `With` functionality, but that sounds potentially reasonable.
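To make sure I follow, here's a toy sketch (plain Scala, no Catalyst) of the sharing a `With`-style binding would buy us; the counter stands in for an expensive common sub-expression:

```scala
// evals counts evaluations of an expensive common sub-expression.
var evals = 0
def expensiveExpr(x: Int): Int = { evals += 1; x * 10 }

val input = Seq(1, 2, 3)

// Duplicated: after pushdown, the filter condition and the projection each
// evaluate the expression independently.
evals = 0
input.filter(r => expensiveExpr(r) > 10).map(r => expensiveExpr(r))
println(s"duplicated: $evals evals") // 3 filter evals + 2 project evals = 5

// With-style sharing: bind the common expression once per row, and let both
// the filter condition and the projected output reference the binding.
evals = 0
input.flatMap { r =>
  val shared = expensiveExpr(r) // evaluated exactly once per row
  if (shared > 10) Some(shared) else None
}
println(s"shared:     $evals evals") // 3
```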
What changes were proposed in this pull request?
Changes the filter pushDown optimizer rule to not push a filter down past a projection of the same element if we reasonably expect that computing that element is expensive.
This introduces an "expectedCost" mechanism, which we may or may not want to keep. Previous filter-ordering work used filter pushdown eligibility as an approximation of expression cost, but here we need more granularity. As an alternative, we could introduce a boolean "expensive" flag rather than numeric costs.
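As a rough illustration of the idea (toy expression ADT and arbitrary cost constants, not the PR's actual API), an expectedCost heuristic might look like:

```scala
// Hypothetical cost model over a toy expression ADT; node types and
// constants are illustrative, not Spark's actual classes.
sealed trait Expr { def children: Seq[Expr] = Nil }
case class Attr(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr { override def children = Seq(l, r) }
case class RegexMatch(input: Expr, pattern: String) extends Expr { override def children = Seq(input) }
case class Udf(input: Expr) extends Expr { override def children = Seq(input) }

def expectedCost(e: Expr): Int = {
  val selfCost = e match {
    case _: Attr       => 0   // plain column reference: free
    case _: Add        => 1   // simple math: cheap
    case _: RegexMatch => 100 // regex evaluation: expensive
    case _: Udf        => 100 // opaque UDF: assume expensive
  }
  selfCost + e.children.map(expectedCost).sum
}

// Only duplicate (i.e. push past the projection) when the expression is cheap.
val duplicationThreshold = 50
def safeToDuplicate(e: Expr): Boolean = expectedCost(e) < duplicationThreshold

println(safeToDuplicate(Add(Attr("a"), Attr("b"))))    // true  (cost 1)
println(safeToDuplicate(RegexMatch(Attr("a"), "f.o"))) // false (cost 100)
```

The boolean-flag alternative mentioned above would collapse `expectedCost` into a simple "contains an expensive node" check.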
Future Work / What else remains to do?
Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the referenced projection element is expensive, rather than the condition itself. We could do this as a follow-up item or as part of this PR.
Why are the changes needed?
Currently Spark may double-compute expensive operations (like JSON parsing, UDF evaluation, etc.) as a result of filter pushdown past projections.
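A minimal sketch of the problem (plain Scala standing in for the plan; the regex models an expensive projected expression like `rlike`):

```scala
// regexEvals counts evaluations of an expensive projected expression.
var regexEvals = 0
def isError(line: String): Boolean = { regexEvals += 1; line.matches(".*ERROR.*") }

val lines = Seq("ok", "ERROR: boom", "ok", "ERROR: again")

// Plan as written: Project(isError(line) AS e) then Filter(e) --
// one regex evaluation per row.
regexEvals = 0
lines.map(isError).filter(identity)
println(s"filter above project: $regexEvals evals") // 4

// After pushdown past the project: the filter re-applies the expression
// below the project, and survivors pay for it again in the projection.
regexEvals = 0
lines.filter(isError).map(isError)
println(s"filter pushed down:   $regexEvals evals") // 4 + 2 = 6
```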
Does this PR introduce any user-facing change?
This SQL optimizer change may impact some user query plans; results should be the same, and hopefully a little faster.
How was this patch tested?
New tests were added to the FilterPushDownSuite, and the initial double-evaluation problem was confirmed with a GitHub gist.
Was this patch authored or co-authored using generative AI tooling?
No