[SPARK-19020] [SQL] Cardinality estimation of aggregate operator #16431
Conversation
Test build #70710 has started for PR 16431 at commit
retest this please
Test build #70714 has finished for PR 16431 at commit
Test build #70931 has finished for PR 16431 at commit
// The number of output rows must not be larger than child's number of rows.
// Note that this also covers the case of uniqueness of column. If one of the group-by columns
I don't get what this note means.
If the aggregate has three group-by columns, e.g. group by a, b, c, the number of output rows is estimated by ndv(a) * ndv(b) * ndv(c). It's an upper bound, assuming the data has every combination of values of a, b and c. But this product can become very large, so previously I had two methods to set tighter bounds:
- #row of the aggregate must be <= #row of child.
- if one of the group-by columns is a primary key (e.g. column a), each distinct value of a can appear only once in records, then the number of possible combinations of a, b, c is equal to ndv(a), thus #row of the aggregate with group by a, b, c is equal to ndv(a).
But later, I noticed that since a is a primary key, ndv(a) is actually equal to #row of child. So case 2 is covered by case 1.
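The bound described above can be sketched in a few lines of Scala. This is a simplified illustration of the idea, not the actual Spark implementation; the object and method names are hypothetical.

```scala
// Hypothetical sketch: the product of per-column NDVs is an upper bound
// on the aggregate's output rows, capped by the child's row count.
object AggCardinalitySketch {
  def estimateOutputRows(ndvs: Seq[BigInt], childRowCount: BigInt): BigInt = {
    // Upper bound assuming every combination of group-by values occurs
    val product = ndvs.foldLeft(BigInt(1))(_ * _)
    // Tighter bound: output rows cannot exceed the child's row count
    product.min(childRowCount)
  }

  def main(args: Array[String]): Unit = {
    // group by a, b, c with ndv(a)=10, ndv(b)=20, ndv(c)=30 over 100 rows:
    // 10 * 20 * 30 = 6000, capped at 100
    println(estimateOutputRows(Seq(10, 20, 30).map(BigInt(_)), BigInt(100)))
  }
}
```

Note how the primary-key case falls out for free: if column a is a primary key, ndv(a) already equals the child row count, so the cap is what gets returned.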
OK I don't think you need this explanation here -- it simply makes it more confusing. You are just putting an upper bound on cardinality, and that explains everything.
Can you update the pull request and the test cases to use the new test infra?
OK, I'll update this pr today.
class AggEstimationSuite extends StatsEstimationTestBase {

  /** Column info: names and column stats for group-by columns */
  val (key11, colStat11) = (attr("key11"), ColumnStat(2, Some(1), Some(2), 0, 4, 4))
Can we put these into a map so it is easier to read? A map from attribute to stat.
Also use named arguments when creating ColumnStat; otherwise it is too difficult to read what the 0 or 4 means.
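The reviewer's two suggestions could look roughly like the sketch below. The ColumnStat field names here are modeled on Spark 2.1's catalyst ColumnStat (distinctCount, min, max, nullCount, avgLen, maxLen) but the case class is redefined locally so the example stands alone; treat the exact signature as an assumption.

```scala
// Simplified stand-in for catalyst's ColumnStat (field names assumed)
case class ColumnStat(
    distinctCount: BigInt,
    min: Option[Any],
    max: Option[Any],
    nullCount: BigInt,
    avgLen: Long,
    maxLen: Long)

object NamedArgsSketch {
  // One map from column name to its stats, built with named arguments
  // so each literal's meaning is visible at the call site
  val columnInfo: Map[String, ColumnStat] = Map(
    "key11" -> ColumnStat(distinctCount = 2, min = Some(1), max = Some(2),
      nullCount = 0, avgLen = 4, maxLen = 4),
    "key12" -> ColumnStat(distinctCount = 1, min = Some(10), max = Some(10),
      nullCount = 0, avgLen = 4, maxLen = 4))

  def main(args: Array[String]): Unit = {
    println(columnInfo("key11").distinctCount)
  }
}
```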
/** Tables for testing */
/** Data for table1: (1, 10), (2, 10) */
val table1 = StatsTestPlan(
Can we put all the tables into the test cases? They are far away from the test cases, making the tests more difficult to read.
object AggregateEstimation {
  import EstimationUtils._

  def estimate(agg: Aggregate): Option[Statistics] = {
I'd document the algorithm in the javadoc.
}
if (rowCountsExist(agg.child) && colStatsExist) {
  // Initial value for agg without group expressions
  var outputRows: BigInt = 1
Can you write this using a reduceOption?

scala> Seq(1, 2, 3).map(i => BigInt(i)).reduceOption(_ * _).getOrElse(BigInt(1))
res5: scala.math.BigInt = 6

Not sure if it is more clear.
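Applied to the diff above, the reviewer's suggestion would replace the mutable running-product with a single expression. A minimal sketch (hypothetical helper name; the real method works on column stats, not raw NDVs):

```scala
object ReduceOptionSketch {
  // Multiply the per-column NDVs; fall back to 1 for an aggregate with
  // no group-by expressions, matching the `var outputRows: BigInt = 1`
  // initial value in the diff.
  def outputRows(ndvs: Seq[BigInt]): BigInt =
    ndvs.reduceOption(_ * _).getOrElse(BigInt(1))

  def main(args: Array[String]): Unit = {
    println(outputRows(Seq(BigInt(2), BigInt(3)))) // product of NDVs
    println(outputRows(Seq.empty))                 // no group-by columns
  }
}
```

The trade-off the reviewer flags is real: the expression form avoids mutation, but the loop form makes the empty-group-by case explicit.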
Test build #71074 has finished for PR 16431 at commit
Test build #71081 has finished for PR 16431 at commit
  expectedRowCount = 1)
}

test("there's a primary key in group-by columns") {
This test case is basically the same as the next one, isn't it?
private val nameToColInfo: Map[String, (Attribute, ColumnStat)] =
  columnInfo.map(kv => kv._1.name -> kv)

test("empty group-by column") {
Can you also add a test case for empty output?
I'm going to merge this first. Please update the project test in your pr, and also address my comments for the aggregate tests.
import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._

class AggEstimationSuite extends StatsEstimationTestBase {
Can you also rename this to AggregateEstimationSuite?
## What changes were proposed in this pull request?

Support cardinality estimation of aggregate operator

## How was this patch tested?

Add test cases

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#16431 from wzhfy/aggEstimation.