[SPARK-19020] [SQL] Cardinality estimation of aggregate operator #16431
Conversation
Test build #70710 has started for PR 16431 at commit
retest this please
Test build #70714 has finished for PR 16431 at commit
Test build #70931 has finished for PR 16431 at commit
// The number of output rows must not be larger than child's number of rows.
// Note that this also covers the case of uniqueness of column. If one of the group-by columns
I don't get what this note means.
If the aggregate has three group-by columns, e.g. group by a, b, c, the number of output rows is estimated by ndv(a) * ndv(b) * ndv(c). It's an upper bound, assuming the data has every combination of values of a, b and c. But this product can become very large, so previously I had two methods to set tighter bounds:
- #row of the aggregate must be <= #row of child.
- if one of the group-by columns is a primary key (e.g. column a), each distinct value of a can appear only once in records, then the number of possible combinations of a, b, c is equal to ndv(a), thus #row of the aggregate with group by a, b, c is equal to ndv(a).
But later, I noticed that since a is a primary key, ndv(a) is actually equal to #row of child. So case 2 is covered by case 1.
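The bound described above can be sketched in a few lines of Scala. This is a simplified illustration of the idea, not the actual Spark implementation; the object and method names are hypothetical.

```scala
// Hypothetical sketch: the product of per-column NDVs is an upper bound
// on the aggregate's output rows, capped by the child's row count.
object AggCardinalitySketch {
  def estimateOutputRows(ndvs: Seq[BigInt], childRowCount: BigInt): BigInt = {
    // Upper bound assuming every combination of group-by values occurs
    val product = ndvs.foldLeft(BigInt(1))(_ * _)
    // Tighter bound: output rows cannot exceed the child's row count
    product.min(childRowCount)
  }

  def main(args: Array[String]): Unit = {
    // group by a, b, c with ndv(a)=10, ndv(b)=20, ndv(c)=30 over 100 rows:
    // 10 * 20 * 30 = 6000, capped at 100
    println(estimateOutputRows(Seq(10, 20, 30).map(BigInt(_)), BigInt(100)))
  }
}
```

Note how the primary-key case falls out for free: if column a is a primary key, ndv(a) already equals the child row count, so the cap is what gets returned.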
OK I don't think you need this explanation here -- it simply makes it more confusing. You are just putting an upper bound on cardinality, and that explains everything.
Can you update the pull request and the test cases to use the new test infra?
OK, I'll update this pr today.
class AggEstimationSuite extends StatsEstimationTestBase {

  /** Column info: names and column stats for group-by columns */
  val (key11, colStat11) = (attr("key11"), ColumnStat(2, Some(1), Some(2), 0, 4, 4))
Can we put these into a map so it is easier to read? A map from attribute to stat.
Also use named arguments when creating ColumnStat; otherwise it is too difficult to read what the 0 or 4 means.
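The reviewer's two suggestions could look roughly like the sketch below. The ColumnStat field names here are modeled on Spark 2.1's catalyst ColumnStat (distinctCount, min, max, nullCount, avgLen, maxLen) but the case class is redefined locally so the example stands alone; treat the exact signature as an assumption.

```scala
// Simplified stand-in for catalyst's ColumnStat (field names assumed)
case class ColumnStat(
    distinctCount: BigInt,
    min: Option[Any],
    max: Option[Any],
    nullCount: BigInt,
    avgLen: Long,
    maxLen: Long)

object NamedArgsSketch {
  // One map from column name to its stats, built with named arguments
  // so each literal's meaning is visible at the call site
  val columnInfo: Map[String, ColumnStat] = Map(
    "key11" -> ColumnStat(distinctCount = 2, min = Some(1), max = Some(2),
      nullCount = 0, avgLen = 4, maxLen = 4),
    "key12" -> ColumnStat(distinctCount = 1, min = Some(10), max = Some(10),
      nullCount = 0, avgLen = 4, maxLen = 4))

  def main(args: Array[String]): Unit = {
    println(columnInfo("key11").distinctCount)
  }
}
```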
/** Tables for testing */
/** Data for table1: (1, 10), (2, 10) */
val table1 = StatsTestPlan(
Can we put all the tables into the test cases? They are far away from the test cases, making the tests more difficult to read.
object AggregateEstimation {
  import EstimationUtils._

  def estimate(agg: Aggregate): Option[Statistics] = {
I'd document the algorithm in the javadoc.
}
if (rowCountsExist(agg.child) && colStatsExist) {
  // Initial value for agg without group expressions
  var outputRows: BigInt = 1
Can you write this using a reduceOption?

scala> Seq(1, 2, 3).map(i => BigInt(i)).reduceOption(_ * _).getOrElse(BigInt(1))
res5: scala.math.BigInt = 6

Not sure if it is more clear.
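Applied to the diff above, the reviewer's suggestion would replace the mutable running-product with a single expression. A minimal sketch (hypothetical helper name; the real method works on column stats, not raw NDVs):

```scala
object ReduceOptionSketch {
  // Multiply the per-column NDVs; fall back to 1 for an aggregate with
  // no group-by expressions, matching the `var outputRows: BigInt = 1`
  // initial value in the diff.
  def outputRows(ndvs: Seq[BigInt]): BigInt =
    ndvs.reduceOption(_ * _).getOrElse(BigInt(1))

  def main(args: Array[String]): Unit = {
    println(outputRows(Seq(BigInt(2), BigInt(3)))) // product of NDVs
    println(outputRows(Seq.empty))                 // no group-by columns
  }
}
```

The trade-off the reviewer flags is real: the expression form avoids mutation, but the loop form makes the empty-group-by case explicit.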
Test build #71074 has finished for PR 16431 at commit
Test build #71081 has finished for PR 16431 at commit
  expectedRowCount = 1)
}

test("there's a primary key in group-by columns") {
This test case is basically the same as the next one, isn't it?
private val nameToColInfo: Map[String, (Attribute, ColumnStat)] =
  columnInfo.map(kv => kv._1.name -> kv)

test("empty group-by column") {
Can you also add a test case for empty output?
I'm going to merge this first. Please update the project test in your pr, and also address my comments for the aggregate tests.
import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._

class AggEstimationSuite extends StatsEstimationTestBase {
Can you also rename this to AggregateEstimationSuite?
## What changes were proposed in this pull request?

Support cardinality estimation of aggregate operator

## How was this patch tested?

Add test cases

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#16431 from wzhfy/aggEstimation.