[SPARK-16669][SQL]Adding partition prunning to Metastore statistics f… #14655

Parth-Brahmbhatt · 2016-08-15T22:15:51Z

What changes were proposed in this pull request?

Adding partition pruning to Metastore statistics for better join selection.

Currently, the metastore statistics return the size of the entire table, so the join selection strategy does not use broadcast joins even when only a single partition of a large table is selected. This PR addresses that by applying partition pruning during size estimation, so that only the size of the selected partitions is estimated. For now it only works when partition columns are used in equality checks under the AND, OR, and IN operators. If a partition column is used in any other operator, the estimate falls back to the total table size. We have also introduced a configuration to enable this optimization; it is off by default. Instead of building the partition paths in code, we could query the metastore for all valid paths, but for simplicity we just build the paths in code.
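To make the pruning rule described above concrete, here is a minimal, self-contained Python sketch. It is not the code in this PR: the predicate classes (`Eq`, `In`, `And`, `Or`, `Other`), the `Partition` record, and the functions `surviving`, `estimate_size`, and `fits_broadcast` are all hypothetical names chosen for illustration, and an in-memory partition list stands in for the partition paths the PR builds. The 10 MB threshold is assumed to mirror the default of `spark.sql.autoBroadcastJoinThreshold`.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set, Tuple, Union


@dataclass(frozen=True)
class Eq:  # column = value
    column: str
    value: str


@dataclass(frozen=True)
class In:  # column IN (values...)
    column: str
    values: Tuple[str, ...]


@dataclass(frozen=True)
class And:
    left: "Pred"
    right: "Pred"


@dataclass(frozen=True)
class Or:
    left: "Pred"
    right: "Pred"


@dataclass(frozen=True)
class Other:  # any operator the sketch does not understand (>, LIKE, ...)
    columns: Tuple[str, ...]


Pred = Union[Eq, In, And, Or, Other]


@dataclass(frozen=True)
class Partition:
    spec: Dict[str, str]  # e.g. {"dt": "2016-08-15"}
    size_in_bytes: int


def surviving(pred: Pred, parts: List[Partition],
              part_cols: Set[str]) -> Optional[FrozenSet[int]]:
    """Indices of partitions that may match, or None if pruning is unsupported."""
    everything = frozenset(range(len(parts)))
    if isinstance(pred, (Eq, In)):
        if pred.column not in part_cols:
            return everything  # not a partition column: nothing can be pruned
        wanted = {pred.value} if isinstance(pred, Eq) else set(pred.values)
        return frozenset(i for i, p in enumerate(parts)
                         if p.spec.get(pred.column) in wanted)
    if isinstance(pred, (And, Or)):
        left = surviving(pred.left, parts, part_cols)
        right = surviving(pred.right, parts, part_cols)
        if left is None or right is None:
            return None  # one side is unsupported: give up on pruning
        return left & right if isinstance(pred, And) else left | right
    # Other: unsupported only when it touches a partition column.
    return None if any(c in part_cols for c in pred.columns) else everything


def estimate_size(pred: Pred, parts: List[Partition], part_cols: Set[str],
                  enabled: bool = False) -> int:
    """Pruned size estimate; falls back to the total table size."""
    total = sum(p.size_in_bytes for p in parts)
    if not enabled:  # the optimization is off by default
        return total
    kept = surviving(pred, parts, part_cols)
    if kept is None:
        return total
    return sum(parts[i].size_in_bytes for i in kept)


def fits_broadcast(size_in_bytes: int, threshold: int = 10 * 1024 * 1024) -> bool:
    """True when the estimate is small enough for a broadcast join."""
    return size_in_bytes <= threshold
```

With a table whose partitions are 500 GB, 1 MB, and 2 MB, a filter such as `Eq("dt", ...)` on the 1 MB partition would produce an estimate under the threshold once the flag is enabled, whereas the unpruned estimate would rule out a broadcast join.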

How was this patch tested?

Unit tests added.

AmplabJenkins · 2016-08-15T22:17:14Z

Can one of the admins verify this patch?

rxin · 2016-08-16T07:46:18Z

cc @cloud-fan and @gatorsmile - both are working on refactoring some of this code.

cloud-fan · 2016-08-16T07:56:12Z

If we are going to do this, I'd like a more general approach that also works for data source tables.

gatorsmile · 2016-08-16T08:20:53Z

Will this be part of the CBO work? The size estimation or statistics collection is being re-designed for CBO, right?

Parth-Brahmbhatt · 2016-08-16T16:25:21Z

@cloud-fan How do you suggest changing this? I started with the Metastore because internally it is our most-used data source and will benefit from partition pruning at the planning stage. I am open to any suggestions and will modify the code accordingly.

gatorsmile · 2016-08-18T23:46:00Z

Found a related JIRA: https://issues.apache.org/jira/browse/SPARK-17129

Parth-Brahmbhatt · 2016-08-19T17:52:16Z

@gatorsmile I'm not sure it's the same issue. The issue you are pointing at is about storing actual partition-level stats, which this PR could use, but until they are available we could rely on file-system-level statistics. Also, since this is config-driven and disabled by default, it should have no perf impact.

gatorsmile · 2016-08-19T18:19:57Z

How about waiting for a few days until that is delivered? Let us see whether that might simplify your PR.

Parth-Brahmbhatt · 2016-08-25T22:17:17Z

@gatorsmile I'm not sure it will simplify much in this case, as most of the complexity is in figuring out which partitions can be pruned, and I don't think that will go away. We will rely on the Hive metastore instead of HDFS for size calculation whenever partition-level stats are stored and available, but that part of the code is not really complex.

I am fine waiting for the patch to be delivered.
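The fallback described in this comment (use partition-level stats from the Hive metastore when they exist, otherwise measure the files) can be sketched as follows. This is an illustrative Python sketch, not code from the PR: `partition_size_in_bytes` and the `file_system_size` callback are hypothetical names, and reading the size from a `totalSize` entry in the partition parameters is an assumption about how the metastore exposes it.

```python
from typing import Callable, Mapping


def partition_size_in_bytes(params: Mapping[str, str],
                            file_system_size: Callable[[], int]) -> int:
    """Prefer the metastore's recorded size; otherwise ask the file system."""
    raw = params.get("totalSize")  # assumed key for partition-level stats
    if raw is not None:
        try:
            size = int(raw)
        except ValueError:
            size = -1  # unparsable value: treat it as missing
        if size > 0:
            return size
    # No usable stats recorded: fall back to summing file sizes, which the
    # caller supplies lazily so the file system is only hit when needed.
    return file_system_size()
```

Passing the file-system lookup as a callback keeps the expensive listing off the planning path whenever the metastore already has a usable value.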

gatorsmile · 2016-08-25T22:26:57Z

Thank you!

gatorsmile · 2017-06-13T19:05:13Z

@Parth-Brahmbhatt Are you still interested in this PR? Our stats refactoring was finished in the 2.2 release. Thank you!

Parth-Brahmbhatt · 2017-06-13T19:37:00Z

I will re-evaluate and update or close the PR.

gatorsmile · 2017-06-13T19:46:39Z

@Parth-Brahmbhatt Thank you!

wzhfy · 2017-06-13T21:13:37Z

It seems this PR aims to solve a similar problem to SPARK-15616?

lianhuiwang · 2017-06-14T15:42:35Z

@wzhfy Yes, I think this is the same as SPARK-15616.

Parth-Brahmbhatt · 2017-06-14T18:34:06Z

Closing this PR given it's a duplicate at this point.

gatorsmile mentioned this pull request Jun 13, 2017

[SPARK-20986] [SQL] Reset table's statistics after PruneFileSourcePartitions rule. #18205

Closed

Parth-Brahmbhatt closed this Jun 14, 2017
