Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16669][SQL]Adding partition prunning to Metastore statistics f… #14655

Closed
wants to merge 1 commit into from

Conversation

Parth-Brahmbhatt
Copy link
Contributor

What changes were proposed in this pull request?

Adding partition prunning to Metastore statistics for better join selection.

Currently the metastore statistics returns the size of entire table which results in Join selection stretagy to not use broadcast joins even when only a single partition from a large table is selected.This PR addresses that issue by only estimating the size of the partition by applying partition pruning during size estimation. Currently it only works with partition columns used with equality checks under AND,OR,IN Operators. If a partition column is used in any other operator, it defaults back to total table size. We have also introduced a configuration to enable this optimization which will be off by default. Instead of trying to calculate the path we could make a metastore query to get all the valid paths but for simplicity we are just building the path in code.

How was this patch tested?

Unit tests added.

…or better join selection.

Currently the metastore statistics returns the size of entire table which results in Join selection stretagy to not use broadcast joins even when only a single partition from a large table is selected.This PR addresses that issue by only estimating the size of the partition by applying partition pruning during size estimation. Currently it only works with partition columns used with equality checks under AND,OR,IN Operators. If a partition column is used in any other operator, it defaults back to total table size. We have also introduced a configuration to enable this optimization which will be off by default. Instead of trying to calculate the path we could make a metastore query to get all the valid paths but for simplicity we are just building the path in code.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@rxin
Copy link
Contributor

rxin commented Aug 16, 2016

cc @cloud-fan and @gatorsmile - both are working on refactoring some of these code.

@cloud-fan
Copy link
Contributor

If we gonna do this, I'd like to have a more general approach, which should also work for data source tables.

@gatorsmile
Copy link
Member

Will this be part of the CBO work? The size estimation or statistics collection is being re-designed for CBO, right?

@Parth-Brahmbhatt
Copy link
Contributor Author

@cloud-fan How do you suggest to change this? I started with Metastore as internally that is the most used datasource and will benefit from partition pruning at planning stage. I am open to any suggestions and will modify the code accordingly.

@gatorsmile
Copy link
Member

Found a related JIRA: https://issues.apache.org/jira/browse/SPARK-17129

@Parth-Brahmbhatt
Copy link
Contributor Author

@gatorsmile not sure if its the same issue. The issue you are pointing at talks about storing the actual partition level stats, which could be used by this PR but until its available we could rely on file system level statistics. Also given this is config driven which is disabled by default it should have no perf impact.

@gatorsmile
Copy link
Member

How about waiting for a few days until that is delivered? Let us see whether that might simplify your PR.

@Parth-Brahmbhatt
Copy link
Contributor Author

@gatorsmile not sure if it will simplify much in this case as most of the complexity is in figuring out what partitions can be pruned which I don't think will go away. We will rely on hive metastore instead of hdfs for size calculation whenever partition level stats are stored and available but that part of the code is not really complex.

I am fine waiting for the patch to be delivered.

@gatorsmile
Copy link
Member

Thank you!

@gatorsmile
Copy link
Member

@Parth-Brahmbhatt Are you still interested in this PR? Our stats refactoring has been finished in the release of 2.2. Thank you!

@Parth-Brahmbhatt
Copy link
Contributor Author

I will re-evaluate and update or close the PR.

@gatorsmile
Copy link
Member

@Parth-Brahmbhatt Thank you!

@wzhfy
Copy link
Contributor

wzhfy commented Jun 13, 2017

Seems this PR aims to solve similar problem as SPARK-15616?

@lianhuiwang
Copy link
Contributor

@wzhfy Yes, I think this is same with SPARK-15616.

@Parth-Brahmbhatt
Copy link
Contributor Author

Closing this PR given its a duplicate at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants