[SPARK-28217][SQL] Allow a pluggable statistics plan visitor for a logical plan.#25015
imback82 wants to merge 3 commits into apache:master
Conversation
Can one of the admins verify this patch?

[SPARK-27602](https://issues.apache.org/jira/browse/SPARK-27602) I've thought about doing this. You can make it more extensible.

@AngersZhuuuu: Thank you for sharing the JIRA! I could not find a PR in the JIRA you linked to. Did you already work on this? Is there a PR you could share? Thanks!

@srowen do you have any feedback on this PR? Thanks.
This one really isn't my area.

@srowen do you know anyone who can help?

I usually check with git to see who wrote or touched the code. Maybe @rxin?

Thank you @srowen! Also CCing @gatorsmile in case he has any suggestions.

What's the use case here? How does one use this without having fields to store stats?
My goal is to get the real scan data size of a partitioned table, since the current framework only gets table-level statistics. Fixing this runs into a problem that can't be avoided: column statistics across multiple partitions. I haven't found an elegant way to handle that.

That seems like something we should fix in the data source API, rather than going through a specific stats plan visitor though.
Today, cost/stats calculation in Catalyst is hard-coded and difficult to extend/customize (i.e., it only supports the "size in bytes" and "basic stats" plan visitors). Cost/stats estimation has been known as a hard problem for decades, and people have tried numerous approaches in both literature and practice. Indeed, some of our own customers have requested the flexibility to plug in their own cost/stats calculation mechanisms. This PR provides an extension point where a user can plug in a custom statistics plan visitor that estimates stats/costs differently from the built-in ones, without, of course, disrupting the existing use cases.
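To illustrate the extension point being discussed, here is a minimal, self-contained sketch of the visitor pattern Catalyst uses for statistics. The plan node types, the `HalfSelectivityVisitor`, and its selectivity rule are all hypothetical stand-ins for this example; only the overall shape (one `visit*` method per node type, dispatched from a single `visit`) mirrors Spark's built-in visitors.

```scala
// Toy stand-ins for Spark's Statistics and LogicalPlan (hypothetical, for illustration).
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

sealed trait LogicalPlan
case class Relation(sizeInBytes: BigInt, rows: BigInt) extends LogicalPlan
case class Filter(child: LogicalPlan) extends LogicalPlan
case class Project(child: LogicalPlan) extends LogicalPlan

// The extension point: one method per node type, dispatched by a single visit().
trait StatsPlanVisitor {
  def visit(p: LogicalPlan): Statistics = p match {
    case r: Relation  => visitRelation(r)
    case f: Filter    => visitFilter(f)
    case pr: Project  => visitProject(pr)
  }
  def visitRelation(r: Relation): Statistics
  def visitFilter(f: Filter): Statistics
  def visitProject(p: Project): Statistics
}

// A custom visitor applying a flat 50% selectivity to filters — the kind of
// policy a user could supply that the built-in visitors do not implement.
object HalfSelectivityVisitor extends StatsPlanVisitor {
  def visitRelation(r: Relation): Statistics =
    Statistics(r.sizeInBytes, Some(r.rows))
  def visitFilter(f: Filter): Statistics = {
    val c = visit(f.child)
    Statistics(c.sizeInBytes / 2, c.rowCount.map(_ / 2))
  }
  def visitProject(p: Project): Statistics = visit(p.child)
}

val plan = Filter(Relation(sizeInBytes = BigInt(1000), rows = BigInt(100)))
val stats = HalfSelectivityVisitor.visit(plan)
```

Because each node type gets its own `visit*` method, a plugged-in visitor can override estimation for exactly the operators it cares about and delegate the rest to a default.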
That seems like something you'd want to be able to specify using some hint, rather than a whole framework that depends on all of the internals of Spark SQL, doesn't it?
To the extent that I understand, query hints are usually used for giving the optimizer additional information about how it should handle a given query (e.g., use broadcast join). However, the computation of costs/stats requires a traversal over the entire plan. For example, how can one easily override
But if you are building a custom query optimizer, why don't you just compute stats there?
Ah, sorry for the confusion. I think my PR description was misleading. I meant to say "However, this is a bit limited since there is no way to plug in a custom plan visitor from which a query optimizer can benefit" (not a custom query optimizer). I will update the original description.
@rxin, does my previous comment address your question? To reiterate, this PR allows a custom plan visitor to be plugged in for the existing optimizer. Please let me know what you think.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
What changes were proposed in this pull request?
Spark currently has two built-in statistics plan visitors:
`SizeInBytesOnlyStatsPlanVisitor` and `BasicStatsPlanVisitor`. However, this is a bit limited since there is no way to plug in a custom plan visitor from which a query optimizer can benefit. This PR allows the user to specify a custom stats plan visitor via a Spark conf to override the built-in one:
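The exact conf key added by this PR is not shown in this excerpt; the key and class name below are illustrative placeholders only, sketching how such a setting might look:

```
# Hypothetical conf key and class name, for illustration only.
spark.sql.statsPlanVisitorClass=com.example.MyCustomStatsPlanVisitor
```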
How was this patch tested?
Existing and new unit tests.
cc: @rapoth