
[SPARK-19020] [SQL] Cardinality estimation of aggregate operator #16431

Closed
wants to merge 4 commits

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Dec 29, 2016

What changes were proposed in this pull request?

Support cardinality estimation of aggregate operator

How was this patch tested?

Add test cases

@wzhfy
Contributor Author

wzhfy commented Dec 29, 2016

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70710 has started for PR 16431 at commit c064595.

@wzhfy
Contributor Author

wzhfy commented Dec 29, 2016

retest this please

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70714 has finished for PR 16431 at commit c064595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 5, 2017

Test build #70931 has finished for PR 16431 at commit fc229e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

// The number of output rows must not be larger than child's number of rows.
// Note that this also covers the case of uniqueness of column. If one of the group-by columns
Contributor

I don't get what this note means.

Contributor Author

If the aggregate has three group-by columns, e.g. group by a, b, c, the number of output rows is estimated by ndv(a) * ndv(b) * ndv(c). This is an upper bound, since it assumes the data contains every combination of values of a, b and c, and the product can become very large. So previously, I had two methods to set tighter bounds:

  1. #row of the aggregate must be <= #row of the child.
  2. If one of the group-by columns is a primary key (e.g. column a), each distinct value of a can appear only once in the records, so the number of possible combinations of a, b, c equals ndv(a), and thus #row of the aggregate with group by a, b, c equals ndv(a).

But later I noticed that since a is a primary key, ndv(a) is actually equal to #row of the child, so case 2 is covered by case 1.
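The bounding rule discussed above can be sketched as a small standalone function. This is a hypothetical helper to illustrate the idea, not the actual Spark implementation:

```scala
object AggRowEstimate {
  // Estimate the output row count of a group-by aggregate: the product
  // of the group-by columns' distinct counts (ndv), capped at the
  // child's row count, since an aggregate can never output more rows
  // than its input.
  def estimateAggRows(ndvs: Seq[BigInt], childRows: BigInt): BigInt =
    ndvs.product.min(childRows)
}
```

For example, with group-by columns whose ndvs are 2, 5 and 10 and a child of 30 rows, the raw product 100 is capped at 30; with an empty ndv list (no group-by columns) the product is 1, which matches the single-row output of a global aggregate.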

Contributor

OK, I don't think you need this explanation here; it simply makes it more confusing. You are just putting an upper bound on cardinality, and that explains everything.

@rxin
Contributor

rxin commented Jan 9, 2017

Can you update the pull request and the test cases to use the new test infra?

@wzhfy
Contributor Author

wzhfy commented Jan 9, 2017

OK, I'll update this pr today.

class AggEstimationSuite extends StatsEstimationTestBase {

/** Column info: names and column stats for group-by columns */
val (key11, colStat11) = (attr("key11"), ColumnStat(2, Some(1), Some(2), 0, 4, 4))
Contributor

Can we put these into a map (from attribute to stat) so it is easier to read?

Also, use named arguments when creating ColumnStat; otherwise it is too difficult to tell what the 0 or the 4 means.
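A sketch of what this suggestion could look like, using a local stand-in for ColumnStat (the field names here are an assumption based on the shape of the positional call above, not the verified Spark API):

```scala
// Local stand-in for Spark's ColumnStat, only to illustrate the shape;
// the field names are an assumption.
case class ColumnStat(
    distinctCount: BigInt,
    min: Option[Any],
    max: Option[Any],
    nullCount: BigInt,
    avgLen: Long,
    maxLen: Long)

// A map from column name to its stats keeps related values together,
// and named arguments make clear what each 0 or 4 means.
val columnInfo: Map[String, ColumnStat] = Map(
  "key11" -> ColumnStat(distinctCount = 2, min = Some(1), max = Some(2),
    nullCount = 0, avgLen = 4, maxLen = 4))
```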


/** Tables for testing */
/** Data for table1: (1, 10), (2, 10) */
val table1 = StatsTestPlan(
Contributor

Can we put all the tables into the test cases? Defined here, they are farther away from the test cases, which makes the tests more difficult to read.

object AggregateEstimation {
import EstimationUtils._

def estimate(agg: Aggregate): Option[Statistics] = {
Contributor

I'd document the algorithm in the javadoc.
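One possible shape for that documentation, attached to a simplified standalone version of the method (the parameter types here are stand-ins, not Spark's actual Aggregate/Statistics classes):

```scala
/**
 * Estimates the output row count of an aggregate.
 *
 * Algorithm: if the child's row count and the distinct counts (ndv) of
 * all group-by columns are known, estimate the output row count as the
 * product of the group-by columns' ndvs, capped at the child's row
 * count (an aggregate never outputs more rows than its input). If any
 * statistic is missing, return None so the caller can fall back to the
 * default estimation.
 */
def estimate(groupByNdvs: Option[Seq[BigInt]],
             childRowCount: Option[BigInt]): Option[BigInt] =
  for {
    ndvs <- groupByNdvs
    rows <- childRowCount
  } yield ndvs.product.min(rows)
```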

}
if (rowCountsExist(agg.child) && colStatsExist) {
// Initial value for agg without group expressions
var outputRows: BigInt = 1
Contributor

Can you write this using a reduceOption?


scala> Seq(1, 2, 3).map(i => BigInt(i)).reduceOption(_ * _).getOrElse(BigInt(1))
res5: scala.math.BigInt = 6

Not sure if it is clearer, though.

@SparkQA

SparkQA commented Jan 9, 2017

Test build #71074 has finished for PR 16431 at commit 41474d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2017

Test build #71081 has finished for PR 16431 at commit c95067f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expectedRowCount = 1)
}

test("there's a primary key in group-by columns") {
Contributor

This test case is basically the same as the next one, isn't it?

private val nameToColInfo: Map[String, (Attribute, ColumnStat)] =
columnInfo.map(kv => kv._1.name -> kv)

test("empty group-by column") {
Contributor

Can you also add a test case for empty output?

@rxin
Contributor

rxin commented Jan 9, 2017

I'm going to merge this first. Please, in your PR to update the project tests, also address my comments on the aggregate tests.

import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._


class AggEstimationSuite extends StatsEstimationTestBase {
Contributor

Can you also rename this to AggregateEstimationSuite?

@asfgit asfgit closed this in 15c2bd0 Jan 9, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Support cardinality estimation of aggregate operator

## How was this patch tested?

Add test cases

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#16431 from wzhfy/aggEstimation.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017