[SPARK-28217][SQL] Allow a pluggable statistics plan visitor for a logical plan.#25015
imback82 wants to merge 3 commits into apache:master
Conversation
Can one of the admins verify this patch?

[SPARK-27602](https://issues.apache.org/jira/browse/SPARK-27602) I've thought about doing this. You can make it more extensible.

@AngersZhuuuu: Thank you for sharing the JIRA! I could not find a PR in the JIRA you linked to. Did you already work on this? Is there a PR you could share? Thanks!

@srowen do you have any feedback on this PR? Thanks.
This one really isn't my area.

@srowen do you know anyone who can help?

I usually check with git to see who wrote or touched the code. Maybe @rxin?

Thank you @srowen! Also CCing @gatorsmile in case he has any suggestions.

What's the use case here? How does one use this without having fields to store stats?
My goal is to get the real scan data size of a partitioned table, since the current framework only gets table-level statistics. Fixing this runs into a problem that can't be avoided: column statistics across multiple partitions. I haven't found an elegant way to handle that.

That seems like something we should fix in the data source API, rather than going through a specific stats plan visitor though.
Today, cost/stats calculation in Catalyst is hard-coded and difficult to extend/customize (i.e., it only supports the "size in bytes" and "basic stats" plan visitors). Cost/stats estimation has been known as a hard problem for decades, and people have tried numerous approaches in both literature and practice. Indeed, some of our own customers have requested the flexibility to plug in their own cost/stats calculation mechanisms. This PR provides an extension point where a user can plug in a custom statistics plan visitor that estimates stats/costs differently from the built-in ones, without, of course, disrupting the existing use cases.
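To illustrate the extension point being discussed, here is a minimal, self-contained sketch of the visitor pattern Catalyst uses for statistics. The plan node types, the `HalfSelectivityVisitor`, and its selectivity rule are all hypothetical stand-ins for this example; only the overall shape (one `visit*` method per node type, dispatched from a single `visit`) mirrors Spark's built-in visitors.

```scala
// Toy stand-ins for Spark's Statistics and LogicalPlan (hypothetical, for illustration).
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

sealed trait LogicalPlan
case class Relation(sizeInBytes: BigInt, rows: BigInt) extends LogicalPlan
case class Filter(child: LogicalPlan) extends LogicalPlan
case class Project(child: LogicalPlan) extends LogicalPlan

// The extension point: one method per node type, dispatched by a single visit().
trait StatsPlanVisitor {
  def visit(p: LogicalPlan): Statistics = p match {
    case r: Relation  => visitRelation(r)
    case f: Filter    => visitFilter(f)
    case pr: Project  => visitProject(pr)
  }
  def visitRelation(r: Relation): Statistics
  def visitFilter(f: Filter): Statistics
  def visitProject(p: Project): Statistics
}

// A custom visitor applying a flat 50% selectivity to filters — the kind of
// policy a user could supply that the built-in visitors do not implement.
object HalfSelectivityVisitor extends StatsPlanVisitor {
  def visitRelation(r: Relation): Statistics =
    Statistics(r.sizeInBytes, Some(r.rows))
  def visitFilter(f: Filter): Statistics = {
    val c = visit(f.child)
    Statistics(c.sizeInBytes / 2, c.rowCount.map(_ / 2))
  }
  def visitProject(p: Project): Statistics = visit(p.child)
}

val plan = Filter(Relation(sizeInBytes = BigInt(1000), rows = BigInt(100)))
val stats = HalfSelectivityVisitor.visit(plan)
```

Because each node type gets its own `visit*` method, a plugged-in visitor can override estimation for exactly the operators it cares about and delegate the rest to a default.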
That seems like something you'd want to be able to specify using some hint, rather than a whole framework that depends on all of the internals of Spark SQL, doesn't it?
To the extent that I understand, query hints are usually used for giving the optimizer additional information about how it should handle a given query (e.g., use broadcast join). However, the computation of costs/stats requires a traversal over the entire plan. For example, how can one easily override
But if you are building a custom query optimizer, why don't you just compute stats there?
Ah, sorry for the confusion. I think my PR description was misleading. I meant to say "However, this is a bit limited since there is no way to plug in a custom plan visitor from which a query optimizer can benefit" (not a custom query optimizer). I will update the original description.
@rxin, does my previous comment address your question? To reiterate, this PR allows a custom plan visitor to be plugged in for the existing optimizer. Please let me know what you think.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
What changes were proposed in this pull request?
Spark currently has two built-in statistics plan visitors:
`SizeInBytesOnlyStatsPlanVisitor` and `BasicStatsPlanVisitor`. However, this is a bit limited since there is no way to plug in a custom plan visitor from which a query optimizer can benefit. This PR allows the user to specify a custom stats plan visitor via a Spark conf to override the built-in one:
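The exact conf key added by this PR is not shown in this excerpt; the key and class name below are illustrative placeholders only, sketching how such a setting might look:

```
# Hypothetical conf key and class name, for illustration only.
spark.sql.statsPlanVisitorClass=com.example.MyCustomStatsPlanVisitor
```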
How was this patch tested?
Existing and new unit tests.
cc: @rapoth