
[SPARK-47672][SQL] Avoid double eval from filter pushDown #45802

Open · wants to merge 6 commits into master from SPARK-47672-avoid-double-eval-from-filter-pushdown
Conversation

@holdenk (Contributor) commented Apr 2, 2024

What changes were proposed in this pull request?

Changes the filter pushdown optimizer so that it does not push a filter down past a projection of an element the filter references, when we reasonably expect that computing that element is likely to be expensive.

This introduces an "expectedCost" mechanism, which we may or may not want. Previous filter-ordering work used filter pushdown eligibility as an approximation of expression cost, but here we need more granularity. As an alternative, we could introduce a boolean "expensive" flag rather than a numeric cost.
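
For illustration only, a minimal sketch of what such a cost heuristic might look like over Catalyst expression trees; the object name, method name, and constants here are assumptions, not the PR's actual implementation:

```scala
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical sketch: walk an expression tree and sum per-node costs.
// The constants are illustrative guesses, not values from the PR.
object ExpectedCost {
  def expectedCost(e: Expression): Double = {
    val selfCost = e match {
      case _: ScalaUDF               => 100.0 // opaque user code: assume expensive
      case _: RLike                  => 50.0  // regex evaluation
      case _: JsonToStructs          => 50.0  // JSON parsing
      case _: Attribute | _: Literal => 0.0   // just a reference or a constant
      case _                         => 1.0   // simple arithmetic, comparisons, etc.
    }
    selfCost + e.children.map(expectedCost).sum
  }
}
```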

Future Work / What else remains to do?

Right now, if a condition is expensive and it references something in the projection, we don't push down. We could probably do better and gate this on whether the thing we are referencing is expensive, rather than the condition itself. We could do this as a follow-up item or as part of this PR.

Why are the changes needed?

Currently Spark may compute expensive operations (like JSON parsing, UDF evaluation, etc.) twice as a result of pushing a filter down past a projection.
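
As a concrete illustration of the problem (a hedged reconstruction, not the gist mentioned under testing below; assumes an active SparkSession named spark):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// A deliberately slow UDF standing in for JSON parsing, regexes, etc.
val expensive = udf { (s: String) =>
  Thread.sleep(1) // simulate costly work
  s.length
}

val df = spark.range(1000).select($"id".cast("string").as("s"))
val projected = df.select(expensive($"s").as("n"))

// When the filter is pushed below the Project, the condition is rewritten
// from n > 10 to expensive(s) > 10, so expensive() runs once for the
// filter and again for the projected column n.
projected.filter($"n" > 10).explain(true)
```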

Does this PR introduce any user-facing change?

This SQL optimizer change may alter the plans of some user queries; results should be the same, and hopefully a little faster.

How was this patch tested?

New tests were added to FilterPushdownSuite, and the initial double-evaluation problem was confirmed with a GitHub gist.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label Apr 2, 2024
…n an 'expensive' projected operation (rlike)
… (idk seems better than the hack of determining if something is eligible for pushdown, but we could also use that maybe? idk) & start updating optimizer filter pushdown past a project for _partial_ pushdowns.
@holdenk force-pushed the SPARK-47672-avoid-double-eval-from-filter-pushdown branch from 5c6a780 to 19bc6d7 on April 2, 2024 21:53
@holdenk requested a review from cloud-fan on April 4, 2024 23:10
@cloud-fan (Contributor) commented

Oh, this is a hard one. The cost of predicates is hard to estimate, and so is the benefit, since that requires estimating the selectivity and the input data volume.

cc @kelvinjian-db @jchen5

@holdenk (Contributor, Author) commented Apr 5, 2024

It is. In general, since we still apply the filter after the projection, if a user has created a projection with a named field and then filtered on that field, they are probably doing so intentionally because they don't want to double-evaluate the named field. That, plus some basic cost heuristics (simple math is cheap; UDFs can be expensive, and so can regexes), should be a net win.

@mridulm (Contributor) commented Apr 5, 2024

+CC @shardulm94

@holdenk (Contributor, Author) commented Apr 11, 2024

Another possible solution would be to break up the projection and move the part of the projection that is used in the filter down with the filter, unless the only thing the projection adds is the filter field, in which case we'd leave it as is.

This logic starts to get more complex, but I think in that case it's probably more of a "pure" win (i.e., no downsides). WDYT @cloud-fan?
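
A hedged sketch of the plan shapes that split might produce (illustrative, not actual optimizer output):

```scala
// Before pushdown: the filter references projected column n = expensive(s),
// and the projection also computes other columns.
//
//   Filter(n > 10)
//   +- Project(expensive(s) AS n, other(s) AS o)
//      +- Relation(s)
//
// After splitting: only the expression the filter needs moves below the
// Filter; the rest of the projection stays above and reuses n directly,
// so expensive(s) is evaluated exactly once.
//
//   Project(n, other(s) AS o)
//   +- Filter(n > 10)
//      +- Project(expensive(s) AS n, s)
//         +- Relation(s)
```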

@holdenk (Contributor, Author) commented May 6, 2024

Do folks have a preference between this approach & the one in #46143?

@holdenk (Contributor, Author) commented May 8, 2024

CC @cloud-fan do you have thoughts / cycles?

@cloud-fan (Contributor) commented

I've been thinking hard about it. Filter pushdown should always be beneficial if we don't duplicate expressions, and the new With expression can avoid expression duplication.

So my proposal is: when we push down a filter and are about to duplicate some expressions, let's use With to avoid it. At the end of the optimizer, we run the rule RewriteWithExpression to rewrite With and pull common expressions out into a Project below. The data source pushdown rule doesn't require the scan node to be the direct child of Filter, so everything should work as before.
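
For context, a hedged sketch of the With / RewriteWithExpression mechanism being proposed (plan shapes are illustrative, not actual optimizer output):

```scala
// A condition that would otherwise evaluate expensive(s) twice:
//
//   Filter(expensive(s) > 10 AND expensive(s) < 100)
//   +- Relation(s)
//
// can instead bind the common expression once with With:
//
//   Filter(With(ref := expensive(s)) { ref > 10 AND ref < 100 })
//   +- Relation(s)
//
// At the end of optimization, RewriteWithExpression pulls the common
// expression into a Project below the Filter and projects it away above:
//
//   Project(s)
//   +- Filter(_common_expr_0 > 10 AND _common_expr_0 < 100)
//      +- Project(s, expensive(s) AS _common_expr_0)
//         +- Relation(s)
```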

@holdenk (Contributor, Author) commented May 9, 2024

Let me take a look at the With functionality, but that sounds potentially reasonable.
