Skip to content

[SPARK-28217][SQL] Allow a pluggable statistics plan visitor for a logical plan.#25015

Closed
imback82 wants to merge 3 commits intoapache:masterfrom
imback82:logical_plan_stats
Closed

[SPARK-28217][SQL] Allow a pluggable statistics plan visitor for a logical plan.#25015
imback82 wants to merge 3 commits intoapache:masterfrom
imback82:logical_plan_stats

Conversation

@imback82
Copy link
Contributor

@imback82 imback82 commented Jun 29, 2019

What changes were proposed in this pull request?

Spark currently has two built-in statistics plan visitor: SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor. However, this is a bit limited since there is no way to plug in a custom plan visitor from which a query optimizer can benefit.

This PR allowers the user to specify a custom stat plan visitor via a Spark conf to override the built-in one:

// First create your custom stat plan visitor.
class MyStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
  // Implement LogicalPlanVisitor[Statistics] trait
}

// Set the visitor via Spark conf.
spark.conf.set("spark.sql.catalyst.statsPlanVisitorClass", "MyStatsPlanVisitor")

// Now, stat() on a LogicalPlan object will use MyStatsPlanVisitor as the stat plan visitor.

How was this patch tested?

Existing and new unit tests.

cc: @rapoth

@imback82 imback82 changed the title [SPARK-28217][SQL] Allow a custom statistics logical plan visitor to be plugged in. [SPARK-28217][SQL] Allow a pluggable statistics plan visitor for a logical plan. Jun 29, 2019
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@AngersZhuuuu
Copy link
Contributor

[SPARK-27602] (https://issues.apache.org/jira/browse/SPARK-27602)

I've thought about doing this. You can make it more extensible.

@rapoth
Copy link

rapoth commented Jul 1, 2019

@AngersZhuuuu: Thank you for sharing the JIRA! I could not find a PR in the JIRA you linked to. Did you already work on this? Is there a PR you could share? Thanks!

@imback82
Copy link
Contributor Author

imback82 commented Jul 2, 2019

@srowen do you have any feedbacks on this PR? Thanks.

@srowen
Copy link
Member

srowen commented Jul 2, 2019

This one really isn't my area.

@imback82
Copy link
Contributor Author

imback82 commented Jul 2, 2019

@srowen do you know anyone who can help?

@srowen
Copy link
Member

srowen commented Jul 2, 2019

I usually check with git to see who wrote or touched the code. Maybe @rxin?

@rapoth
Copy link

rapoth commented Jul 2, 2019

Thank you @srowen! Also CCing @gatorsmile in case he has any suggestions.

@rxin
Copy link
Contributor

rxin commented Jul 2, 2019

What's the use case here? How does one use this without having fields to store stats?

@AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu: Thank you for sharing the JIRA! I could not find a PR in the JIRA you linked to. Did you already work on this? Is there a PR you could share? Thanks!

My goal is to get the real scan data size of a partition table, since current framework just get table level statistic. To fix this , it will have a problem can't avoid . That is column statistic of multi partition.I haven't found elegant method.

@rxin
Copy link
Contributor

rxin commented Jul 3, 2019

That seems like something we should fix in the data source API, rather than going through a specific stats plan visitor though.

@imback82
Copy link
Contributor Author

imback82 commented Jul 3, 2019

What's the use case here? How does one use this without having fields to store stats?

Today, cost/stats calculation in Catalyst is hard-coded and difficult to extend/customize (i.e., it only supports "size in bytes" and "basic stats" plan visitor). Cost/stats estimation/calculation has been known as a hard problem for decades, and people have been trying numerous approaches in both literature and practice. Indeed, some of our own customers have requested flexibility that allows them to plug-in their own cost/stats calculation mechanisms. This PR provides an extension point where a user can plug in a custom statistics plan visitor which can estimate/calculate stats/costs differently from the built-in ones, without of course, disrupting the existing use cases.

@rxin
Copy link
Contributor

rxin commented Jul 3, 2019

That seems like something you'd want to be able to specify using some hint, rather than a whole framework that depends on all of the internals of Spark SQL, doesn't it?

@imback82
Copy link
Contributor Author

imback82 commented Jul 3, 2019

To the extent that I understand, query hints are usually used for giving the optimizer additional information about how it should handle a given query (e.g., use broadcast join). However, the computation of costs/stats requires a traversal over the entire plan. For example, how can one easily override LogicalPlanVisitor[T].visitorJoin behavior with a "hint"?

@rxin
Copy link
Contributor

rxin commented Jul 3, 2019

But if you are building a custom query optimizer, why don't you just compute stats there?

@imback82
Copy link
Contributor Author

imback82 commented Jul 3, 2019

Ah, sorry for the confusion. I think my PR description was misleading. I meant to say "However, this is a bit limited since there is no way to plug in a custom plan visitor from which a query optimizer can benefit" (not a custom query optimizer).

I will update the original description.

@imback82
Copy link
Contributor Author

imback82 commented Jul 8, 2019

@rxin, does my previous comment address your question? To reiterate, this PR allows a custom plan visitor to be plugged in for the existing optimizer. Please let me know what you think.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 28, 2019
@github-actions github-actions bot closed this Dec 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants