opt: Propagate null counts through statsbuilder #30827
Conversation
itsbilal
self-assigned this
Oct 1, 2018
itsbilal
requested a review
from cockroachdb/sql-opt-prs
as a
code owner
Oct 1, 2018
rytaft
requested changes
Oct 1, 2018
This is a great start!
I did a first pass -- my main comments are:
- For multi-column stats, I believe we are determining null count based on whether at least one column in the column set is null. I think your code assumes that all columns must be null.
- We are not including null values in the distinct count. So if all values in a column are null, distinct count is 0.
- We've already determined which columns are not null in the logicalPropsBuilder. So you should be able to just use relProps.NotNullCols.
- There are a few places where you are setting NullCount to 0 where I think that might be a bit premature. I marked a couple of them, but it might be good to cross-reference against relProps.NotNullCols.
@RaduBerinde - can you confirm points 1 and 2?
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/metadata.go, line 169 at r1 (raw file):
// notnull is whether this column is guaranteed to not hold nulls.
notnull bool
I don't think we need to store this in the metadata. We are already storing it in the logical properties.
pkg/sql/opt/memo/statistics_builder.go, line 332 at r1 (raw file):
// duplicates where all columns are NULL.
if fd.ColsAreLaxKey(colSet) {
	colStat.DistinctCount = s.RowCount
I think you should update this to be:
colStat.DistinctCount = s.RowCount - colStat.NullCount
(and move it below calculation of colStat.NullCount)
pkg/sql/opt/memo/statistics_builder.go, line 333 at r1 (raw file):
if fd.ColsAreLaxKey(colSet) {
	colStat.DistinctCount = s.RowCount
	colStat.NullCount = 1
Why is this equal to 1? I don't think we can make any assumptions about the null count for lax keys. You probably want to calculate this using unknownNullCountRatio.
pkg/sql/opt/memo/statistics_builder.go, line 337 at r1 (raw file):
}
if s.RowCount == 0 {
Are we ever hitting this case? I thought we handled this case elsewhere (but I'm probably mistaken). Either way, you should move this case up above, so we don't call ColsAreLaxKey if it's not necessary.
pkg/sql/opt/memo/statistics_builder.go, line 359 at r1 (raw file):
colStatLeaf := sb.colStatLeaf(util.MakeFastIntSet(i), s, fd)
distinctCount *= colStatLeaf.DistinctCount
nullFactor *= colStatLeaf.NullCount / s.RowCount
I think we are defining null count to be at least one column in the column set is null. This seems to be calculating the null count assuming all columns must be null.
pkg/sql/opt/memo/statistics_builder.go, line 447 at r1 (raw file):
// Calculate row count and selectivity
// -----------------------------------
savedRowCount := s.RowCount
[nit] I think inputRowCount would be a bit more consistent with the naming in the rest of this file. You could also just use inputStats.RowCount to avoid creating a new variable.
pkg/sql/opt/memo/statistics_builder.go, line 841 at r1 (raw file):
	colStat.DistinctCount = s.RowCount
}
// Similarly, the null count should be no larger than (RowCount - DistinctCount + 1).
I believe by convention we are not treating NULL as part of the distinct count. So if all values are NULL, distinct count should be 0.
So this should be no larger than (RowCount - DistinctCount)
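A minimal sketch of the bound described in this comment, assuming NULL is not counted as a distinct value. The function name clampNullCount and its plain float64 parameters are hypothetical and are not taken from statistics_builder.go.

```go
package main

import "fmt"

// clampNullCount caps nullCount at rowCount - distinctCount.
// Assumption: NULL is not part of the distinct count, so every distinct
// value accounts for at least one non-NULL row.
func clampNullCount(rowCount, distinctCount, nullCount float64) float64 {
	upper := rowCount - distinctCount
	if upper < 0 {
		upper = 0
	}
	if nullCount > upper {
		return upper
	}
	return nullCount
}

func main() {
	// 10 rows with 4 distinct non-NULL values leave room for at most 6 NULLs.
	fmt.Println(clampNullCount(10, 4, 8))
	// An all-NULL column: distinct count 0, so all 10 rows may be NULL.
	fmt.Println(clampNullCount(10, 0, 10))
}
```

With the "+ 1" variant from the quoted code, the all-NULL column would need a distinct count of 1 to reach the same bound, which is the convention this comment argues against.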
pkg/sql/opt/memo/statistics_builder.go, line 925 at r1 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in both column sets.
See comment above - I think it should be nulls in either column set
pkg/sql/opt/memo/statistics_builder.go, line 933 at r1 (raw file):
	colStat.DistinctCount = s.RowCount
}
// Similarly, the null count should be no larger than (RowCount - DistinctCount + 1).
See comment above - I think it should be (RowCount - DistinctCount)
pkg/sql/opt/memo/statistics_builder.go, line 973 at r1 (raw file):
colStat, _ := s.ColStats.Add(colSet)
colStat.DistinctCount = 1
colStat.NullCount = 0
It's possible that the output of the scalar group by could be NULL (feel free to just add a TODO here so we don't forget about it)
pkg/sql/opt/memo/statistics_builder.go, line 1048 at r1 (raw file):
leftNullCount := min(1, leftColStat.NullCount)
rightNullCount := min(1, rightColStat.NullCount)
Why are you reducing to 1 here? I think this is only valid for union, intersect and except, not for the ALL variants.
pkg/sql/opt/memo/statistics_builder.go, line 1055 at r1 (raw file):
case opt.UnionOp, opt.UnionAllOp:
	colStat.DistinctCount = leftColStat.DistinctCount + rightColStat.DistinctCount
	colStat.NullCount = min(leftNullCount, rightNullCount)
I think this should be the sum here
pkg/sql/opt/memo/statistics_builder.go, line 1256 at r1 (raw file):
// The ordinality column is a key, so every row is distinct.
colStat.DistinctCount = s.RowCount
colStat.NullCount = 0
Unless def.ColID is the only column, I'm not sure we can say NullCount is definitely zero
pkg/sql/opt/memo/statistics_builder.go, line 1593 at r1 (raw file):
// This function should be called before selectivityFromNullCounts.
//
func (sb *statisticsBuilder) updateNullCountsFromConstraint(
Is there any reason you can't just use the NotNullCols logical property? I don't think you need to look at the constraint at all (that was already done in logicalPropsBuilder).
pkg/sql/opt/memo/statistics_builder.go, line 1877 at r1 (raw file):
// Short circuit if no rows
if rowCount <= 0 {
	return selectivity
Do we ever hit this case?
itsbilal
reviewed
Oct 2, 2018
Thanks for the review! I should have addressed everything you pointed out. The logictest failure that existed in the first revision should also be fixed now.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/metadata.go, line 169 at r1 (raw file):
Previously, rytaft wrote…
I don't think we need to store this in the metadata. We are already storing it in the logical properties.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 332 at r1 (raw file):
Previously, rytaft wrote…
I think you should update this to be:
colStat.DistinctCount = s.RowCount - colStat.NullCount
(and move it below calculation of colStat.NullCount)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 333 at r1 (raw file):
Previously, rytaft wrote…
Why is this equal to 1? I don't think we can make any assumptions about the null count for lax keys. You probably want to calculate this using unknownNullCountRatio.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 337 at r1 (raw file):
Previously, rytaft wrote…
Are we ever hitting this case? I thought we handled this case elsewhere (but I'm probably mistaken). Either way, you should move this case up above, so we don't call ColsAreLaxKey if it's not necessary.
Done. Removed it since we weren't hitting it.
pkg/sql/opt/memo/statistics_builder.go, line 359 at r1 (raw file):
Previously, rytaft wrote…
I think we are defining null count to be at least one column in the column set is null. This seems to be calculating the null count assuming all columns must be null.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 447 at r1 (raw file):
Previously, rytaft wrote…
[nit] I think inputRowCount would be a bit more consistent with the naming in the rest of this file. You could also just use inputStats.RowCount to avoid creating a new variable.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 841 at r1 (raw file):
Previously, rytaft wrote…
I believe by convention we are not treating NULL as part of the distinct count. So if all values are NULL, distinct count should be 0.
So this should be no larger than (RowCount - DistinctCount)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 925 at r1 (raw file):
Previously, rytaft wrote…
See comment above - I think it should be nulls in either column set
Done.
pkg/sql/opt/memo/statistics_builder.go, line 933 at r1 (raw file):
Previously, rytaft wrote…
See comment above - I think it should be (RowCount - DistinctCount)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 973 at r1 (raw file):
Previously, rytaft wrote…
It's possible that the output of the scalar group by could be NULL (feel free to just add a TODO here so we don't forget about it)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1048 at r1 (raw file):
Previously, rytaft wrote…
Why are you reducing to 1 here? I think this is only valid for union, intersect and except, not for the ALL variants.
Done. Special-cased the set and bag operations separately.
pkg/sql/opt/memo/statistics_builder.go, line 1055 at r1 (raw file):
Previously, rytaft wrote…
I think this should be the sum here
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1256 at r1 (raw file):
Previously, rytaft wrote…
Unless def.ColID is the only column, I'm not sure we can say NullCount is definitely zero
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1593 at r1 (raw file):
Previously, rytaft wrote…
Is there any reason you can't just use the NotNullCols logical property? I don't think you need to look at the constraint at all (that was already done in logicalPropsBuilder).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1877 at r1 (raw file):
Previously, rytaft wrote…
Do we ever hit this case?
Yes, we do hit it in some cases it seems. And since we divide by rowCount here, it's best to short circuit it early on.
rytaft
requested changes
Oct 2, 2018
Still looking, but added a few comments to get started
Reviewed 2 of 37 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):
query TTTTT
EXPLAIN (TYPES) SELECT min(y) FROM xyz WHERE x = 1
Why did these plans change? It seems like the new plans are worse than the old ones.
pkg/sql/opt/memo/statistics_builder.go, line 329 at r2 (raw file):
// If some of the columns are a lax key, the distinct count equals the row
// count.
Update comment to mention null count.
pkg/sql/opt/memo/statistics_builder.go, line 330 at r2 (raw file):
// If some of the columns are a lax key, the distinct count equals the row
// count.
if fd.ColsAreLaxKey(colSet) {
I think ColsAreLaxKey returns true for both strict and lax keys. So you should probably add a check if colSet.SubsetOf(notNullCols) { colStat.NullCount = 0 } (similar to below).
pkg/sql/opt/memo/statistics_builder.go, line 618 at r2 (raw file):
// There are no columns in this expression, so it must be a constant.
colStat.DistinctCount = 1
colStat.NullCount = 0
Seems like this expression could also be NULL. Maybe you could check if it's a NullOp, and if so set colStat.NullCount to the row count?
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
colStat, _ = s.ColStats.Add(colSet)
colStat.DistinctCount = leftColStat.DistinctCount * rightColStat.DistinctCount
colStat.NullCount = leftColStat.NullCount * rightColStat.NullCount
Is this right? I'm having a hard time convincing myself what the right value should be.
pkg/sql/opt/memo/statistics_builder.go, line 919 at r2 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in either column sets.
[nit] column set
pkg/sql/opt/memo/statistics_builder.go, line 1123 at r2 (raw file):
colStat.DistinctCount = float64(len(distinct))
// We cannot read values, so just use the default
// null count ratio here.
I think you can check if the operator is NullOp
pkg/sql/opt/memo/statistics_builder.go, line 1228 at r2 (raw file):
colStat, _ := ev.Logical().Relational.Stats.ColStats.Add(colSet)
colStat.DistinctCount = 1
colStat.NullCount = 0
Another one I'm not convinced should be 0
pkg/sql/opt/memo/statistics_builder.go, line 1325 at r2 (raw file):
if s.RowCount == 1 {
	colStat.DistinctCount = 1
	colStat.NullCount = 0
ditto
pkg/sql/opt/memo/statistics_builder.go, line 1328 at r2 (raw file):
} else {
	colStat.DistinctCount = s.RowCount * unknownDistinctCountRatio
	colStat.NullCount = s.RowCount * math.Pow(unknownNullCountRatio, float64(colSet.Len()))
This needs to be updated
pkg/sql/opt/memo/statistics_builder.go, line 1554 at r2 (raw file):
}
sb.updateNullCountsFromProps(ev, relProps)
I think this should be called outside applyConstraint so it will get called if a scan is unconstrained.
pkg/sql/opt/memo/statistics_builder.go, line 1571 at r2 (raw file):
numUnappliedConjuncts = 0
for i := 0; i < cs.Length(); i++ {
	sb.updateNullCountsFromProps(ev, relProps)
same here - this should be moved out of applyConstraintSet
pkg/sql/opt/memo/statistics_builder.go, line 1588 at r2 (raw file):
// updateNullCountsFromProps zeroes null counts for columns that cannot
// have nulls in them, usually due to a column property or an application.
// of a null-excluding filter. The actual determinatino of non-nullable
determinatino -> determination
pkg/sql/opt/memo/statistics_builder.go, line 1589 at r2 (raw file):
// have nulls in them, usually due to a column property or an application.
// of a null-excluding filter. The actual determinatino of non-nullable
// columns is done in the logical prop builder.
[nit] logical props builder
rytaft
requested changes
Oct 3, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):
})
colStat.DistinctCount = min(distinctCount, s.RowCount)
colStat.NullCount = min(nullCount, s.RowCount)
In other places you are setting this to min(nullCount, s.RowCount-colStat.DistinctCount). I don't have a strong opinion about which is better, but at least we should be consistent.
pkg/sql/opt/memo/statistics_builder.go, line 473 at r2 (raw file):
}
return colStat
It may be worth adding the following check at the end of every colStatXXX function: if colSet.SubsetOf(relProps.NotNullCols) { colStat.NullCount = 0 }. (See if it changes any test output -- alternatively, just check the code that's calculating NotNullCols inside logicalPropsBuilder, and see which operations can possibly change it from their input. Maybe these cases are already covered by the filters in Scan, Select and Join.)
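The check suggested here can be sketched in isolation. The colSet and colStat types below are hypothetical stand-ins for the optimizer's column set and column statistic types. Only the shape of the check (zero the null count when the column set is a subset of the not-null columns) comes from the comment.

```go
package main

import "fmt"

// colSet is a hypothetical stand-in for the optimizer's column set type.
type colSet map[int]struct{}

// makeColSet builds a colSet from a list of column IDs.
func makeColSet(cols ...int) colSet {
	s := make(colSet, len(cols))
	for _, c := range cols {
		s[c] = struct{}{}
	}
	return s
}

// subsetOf reports whether every column in s is also in other.
func (s colSet) subsetOf(other colSet) bool {
	for c := range s {
		if _, ok := other[c]; !ok {
			return false
		}
	}
	return true
}

// colStat is a hypothetical, reduced column statistic.
type colStat struct {
	distinctCount float64
	nullCount     float64
}

// finalizeNullCount zeroes the null count when every column in cols is
// known to be non-nullable; otherwise the estimate is left untouched.
func finalizeNullCount(stat *colStat, cols, notNullCols colSet) {
	if cols.subsetOf(notNullCols) {
		stat.nullCount = 0
	}
}

func main() {
	notNull := makeColSet(1, 2)

	a := colStat{distinctCount: 10, nullCount: 5}
	finalizeNullCount(&a, makeColSet(1), notNull)
	fmt.Println(a.nullCount) // column 1 is not null, so the count is zeroed

	b := colStat{distinctCount: 10, nullCount: 5}
	finalizeNullCount(&b, makeColSet(1, 3), notNull)
	fmt.Println(b.nullCount) // column 3 may be null, so the estimate stays
}
```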
pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
If colSet contains a single column, the null count equals min(1, inputColStat.NullCount). For example:
> create table t (x int, y int);
> insert into t values (null, 1), (null, 1), (1, 2), (1, null), (1, 2);
> select count(*), x from t group by x;
count | x
+-------+------+
2 | NULL
3 | 1
(2 rows)
Multi-column stats are more complicated, and I'm not really sure what the best approach is. See if you can come up with something that seems reasonable.... (happy to discuss ideas offline)
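The single-column case can be checked by counting directly. This sketch groups the x column from the example above and counts how many output groups have a NULL key. The nullableInt type and groupByStats function are hypothetical helpers for illustration only.

```go
package main

import "fmt"

// nullableInt models a SQL INT that may be NULL (hypothetical helper type).
type nullableInt struct {
	val   int
	valid bool // false means NULL
}

// groupByStats groups by a single column and returns the number of output
// rows (groups) and how many of those groups have a NULL key.
func groupByStats(col []nullableInt) (groups, nullGroups int) {
	seen := make(map[nullableInt]bool)
	for _, v := range col {
		if !v.valid {
			v.val = 0 // normalize so that all NULLs share one key
		}
		if !seen[v] {
			seen[v] = true
			groups++
			if !v.valid {
				nullGroups++
			}
		}
	}
	return groups, nullGroups
}

func main() {
	null := nullableInt{}
	one := nullableInt{val: 1, valid: true}
	// The x column of table t from the example: NULL, NULL, 1, 1, 1.
	x := []nullableInt{null, null, one, one, one}
	groups, nullGroups := groupByStats(x)
	fmt.Println(groups, nullGroups) // prints "2 1"
}
```

However many NULLs the input column holds, they collapse into one group, which is where min(1, inputColStat.NullCount) comes from.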
pkg/sql/opt/memo/statistics_builder.go, line 1058 at r2 (raw file):
// (the non-ALL variants).
leftNullCount := min(1, leftColStat.NullCount)
rightNullCount := min(1, rightColStat.NullCount)
This only works if each side has one column. I think you should be consistent with whatever you come up with for multi-column stats for grouping columns and apply that here.
pkg/sql/opt/memo/statistics_builder.go, line 1062 at r2 (raw file):
switch ev.Operator() {
case opt.UnionOp:
	colStat.NullCount = max(leftNullCount, rightNullCount)
I think this should probably be leftNullCount + rightNullCount. I think you can actually use the same formulas for the regular and ALL variants, and put them in the switch above. The difference will be that for the ALL variants, use:
leftNullCount = leftColStat.NullCount
rightNullCount = rightColStat.NullCount
And for the regular variants, use the multi-column stats for grouping columns estimation.
pkg/sql/opt/memo/statistics_builder.go, line 1070 at r2 (raw file):
	colStat.NullCount = min(leftColStat.NullCount, rightColStat.NullCount)
case opt.ExceptOp:
	colStat.NullCount = max(0, leftNullCount-rightNullCount)
Seems like you might be under-estimating null count here, especially in the case of more than one column. I'd probably just use leftNullCount, similar to the distinct count calculation above. It's definitely not perfect, but at least it's consistent.
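One way to read the suggestions in the last two comments, restricted to a single column, is sketched below. The setOp enum and setOpNullCount function are hypothetical. The min(1, ...) reduction for the non-ALL variants is the single-column grouping estimate from earlier in this review, and the multi-column case is deliberately left out.

```go
package main

import (
	"fmt"
	"math"
)

// setOp is a hypothetical enum standing in for the optimizer's set operators.
type setOp int

const (
	union setOp = iota
	unionAll
	intersect
	intersectAll
	except
	exceptAll
)

// setOpNullCount estimates the output null count for a single column.
// The same formulas are used for the regular and ALL variants; only the
// inputs differ.
func setOpNullCount(op setOp, leftNulls, rightNulls float64) float64 {
	switch op {
	case union, intersect, except:
		// The non-ALL variants deduplicate, so with one column each side
		// contributes at most one NULL row.
		leftNulls = math.Min(1, leftNulls)
		rightNulls = math.Min(1, rightNulls)
	}
	switch op {
	case union, unionAll:
		return leftNulls + rightNulls
	case intersect, intersectAll:
		return math.Min(leftNulls, rightNulls)
	default: // except, exceptAll
		return leftNulls
	}
}

func main() {
	fmt.Println(setOpNullCount(unionAll, 5, 3)) // 8
	fmt.Println(setOpNullCount(union, 5, 3))    // 2
	fmt.Println(setOpNullCount(exceptAll, 5, 3)) // 5
}
```

Note that the sum for a deduplicating UNION can reach 2 even though a single-column result holds at most one NULL row, so this is an upper estimate rather than an exact count.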
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics for non-key
whether or not the column is a key is not relevant for null count - only whether the column is nullable.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
) {
	relProps.NotNullCols.ForEach(func(col int) {
		colStat := sb.ensureColStat(util.MakeFastIntSet(col), math.MaxFloat64, ev, relProps)
There is a potential problem here, which is that if a particular column is not already in s.ColStats, we are just copying the distinct count from its input. The distinct count should probably be scaled based on the selectivity of the filter, like we do in colStatScan, colStatSelect, etc. This also means that we may need to move this function call after any call to selectivityFromDistinctCounts. (But that kind of breaks the flow of the calling functions -- play around with it and see if you can fix the logic without hurting the flow too much)
pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):
// Find the minimum distinct and null counts for all columns in this equivalency
// group.
I'm not sure it's necessary to update null counts here. Equivalency groups are created based on predicates like x = y, which are null-rejecting. So both x and y will have their null counts set to 0 inside updateNullCountsFromProps. Although maybe I'm missing an edge case where x and y can both be null... it would be good to confirm before you rip this out.
pkg/sql/opt/memo/statistics_builder.go, line 1788 at r2 (raw file):
if inputStat.DistinctCount != 0 && colStat.DistinctCount < inputStat.DistinctCount {
	newDistinct := colStat.DistinctCount
	oldDistinct := inputStat.DistinctCount
[nit] Why don't you move these definitions outside the if statement so we can use them in the if statement as well.
pkg/sql/opt/memo/statistics_builder.go, line 1805 at r2 (raw file):
// i in
// {constrained
// columns}
Nice formula!
pkg/sql/opt/memo/statistics_builder.go, line 1828 at r2 (raw file):
	panic("rowCount passed in was too small")
}
if inputStat.NullCount != 0 && colStat.NullCount < inputStat.NullCount {
Isn't it sufficient to check if colStat.NullCount < inputStat.NullCount?
pkg/sql/opt/memo/statistics_builder.go, line 1969 at r2 (raw file):
// Cases of NULL in a constraint should be ignored. For example,
// without knowledge of the data distribution, /a: (/NULL - /10] should
// have the same estimated selectivity as /a: [/10 - ].
You should also mention that we are handling the selectivity of NULL constraints separately in selectivityFromNullCounts
RaduBerinde
reviewed
Oct 3, 2018
Yeah what Rebecca said above is correct.
[nit] Just a tip for future changes - I found that for bigger changes, it helps to separate the code changes (and new tests) from the existing test changes (in different commits). It's easier to review, and it's easier to rebase (if you have a conflict in a test file, you can just nuke the entire commit and re-run tests with -rewrite)
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in either column sets.
colStat.NullCount += lookupColStat.NullCount
I was looking at r1 and was going to suggest the formula rowcount * (f1 + f2 - f1 * f2) where f1,f2 are the two null/rowcount fractions.
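The formula is inclusion-exclusion over two events assumed to be independent: a row has a NULL in the first column set, or in the second. A minimal sketch follows. The function name and parameters are hypothetical, and the zero-row guard is an added assumption to avoid dividing by zero.

```go
package main

import "fmt"

// combinedNullCount estimates how many rows have a NULL in at least one of
// two column sets, assuming the two sets are independent.
func combinedNullCount(rowCount, nullCount1, nullCount2 float64) float64 {
	if rowCount <= 0 {
		return 0 // assumption: no rows means no nulls, and avoids 0/0
	}
	f1 := nullCount1 / rowCount // null fraction of the first column set
	f2 := nullCount2 / rowCount // null fraction of the second column set
	return rowCount * (f1 + f2 - f1*f2)
}

func main() {
	// 100 rows, 10 NULLs in one set and 20 in the other: about 28 rows are
	// expected to have a NULL in either set (10 + 20 - 2 overlapping).
	fmt.Println(combinedNullCount(100, 10, 20))
}
```

Simply adding the two null counts, as in the quoted line, would count the expected overlap twice.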
rytaft
Oct 3, 2018
Contributor
pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):
Previously, RaduBerinde wrote…
I was looking at r1 and was going to suggest the formula rowcount * (f1 + f2 - f1 * f2) where f1,f2 are the two null/rowcount fractions.
Great idea!
itsbilal
reviewed
Oct 4, 2018
Thanks for all your help so far @rytaft and @RaduBerinde! I should have addressed almost everything major that was pointed out. The one major exception is the regression where indexed min(x)'s are no longer replaced with a limit, which I'm still looking into.
I've pushed my work so far but feel free to hold off on the review until I've addressed that.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):
Previously, rytaft wrote…
Why did these plans change? It seems like the new plans are worse than the old ones.
Looking into this.
pkg/sql/opt/memo/statistics_builder.go, line 329 at r2 (raw file):
Previously, rytaft wrote…
Update comment to mention null count.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 330 at r2 (raw file):
Previously, rytaft wrote…
I think ColsAreLaxKey returns true for both strict and lax keys. So you should probably add a check if colSet.SubsetOf(notNullCols) { colStat.NullCount = 0 } (similar to below).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):
Previously, rytaft wrote…
In other places you are setting this to min(nullCount, s.RowCount-colStat.DistinctCount). I don't have a strong opinion about which is better, but at least we should be consistent.
Done - moved to RowCount only.
pkg/sql/opt/memo/statistics_builder.go, line 618 at r2 (raw file):
Previously, rytaft wrote…
Seems like this expression could also be NULL. Maybe you could check if it's a NullOp, and if so set colStat.NullCount to the row count?
Done.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
Is this right? I'm having a hard time convincing myself what the right value should be.
Any ideas of alternatives? It's hard to deduce this without understanding how the rows interleave - but there may be a decent heuristic.
pkg/sql/opt/memo/statistics_builder.go, line 919 at r2 (raw file):
Previously, rytaft wrote…
[nit] column set
Done.
pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):
Previously, rytaft wrote…
Great idea!
Done.
pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
Previously, rytaft wrote…
If colSet contains a single column, the null count equals min(1, inputColStat.NullCount). For example:
> create table t (x int, y int);
> insert into t values (null, 1), (null, 1), (1, 2), (1, null), (1, 2);
> select count(*), x from t group by x;
count | x
+-------+------+
2 | NULL
3 | 1
(2 rows)
Multi-column stats are more complicated, and I'm not really sure what the best approach is. See if you can come up with something that seems reasonable.... (happy to discuss ideas offline)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1058 at r2 (raw file):
Previously, rytaft wrote…
This only works if each side has one column. I think you should be consistent with whatever you come up with for multi-column stats for grouping columns and apply that here.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1062 at r2 (raw file):
Previously, rytaft wrote…
I think this should probably be leftNullCount + rightNullCount. I think you can actually use the same formulas for the regular and ALL variants, and put them in the switch above. The difference will be that for the ALL variants, use:
leftNullCount = leftColStat.NullCount
rightNullCount = rightColStat.NullCount
And for the regular variants, use the multi-column stats for grouping columns estimation.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1070 at r2 (raw file):
Previously, rytaft wrote…
Seems like you might be under-estimating null count here, especially in the case of more than one column. I'd probably just use leftNullCount, similar to the distinct count calculation above. It's definitely not perfect, but at least it's consistent.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1123 at r2 (raw file):
Previously, rytaft wrote…
I think you can check if the operator is NullOp
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1228 at r2 (raw file):
Previously, rytaft wrote…
Another one I'm not convinced should be 0
Done - went with unknownNullCountRatio
pkg/sql/opt/memo/statistics_builder.go, line 1325 at r2 (raw file):
Previously, rytaft wrote…
ditto
Done - went with unknownNullCountRatio
pkg/sql/opt/memo/statistics_builder.go, line 1328 at r2 (raw file):
Previously, rytaft wrote…
This needs to be updated
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
Previously, rytaft wrote…
whether or not the column is a key is not relevant for null count - only whether the column is nullable.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1554 at r2 (raw file):
Previously, rytaft wrote…
I think this should be called outside applyConstraint so it will get called if a scan is unconstrained.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1571 at r2 (raw file):
Previously, rytaft wrote…
same here - this should be moved out of applyConstraintSet
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1588 at r2 (raw file):
Previously, rytaft wrote…
determinatino -> determination
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1589 at r2 (raw file):
Previously, rytaft wrote…
[nit] logical props builder
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, rytaft wrote…
There is a potential problem here, which is that if a particular column is not already in s.ColStats, we are just copying the distinct count from its input. The distinct count should probably be scaled based on the selectivity of the filter, like we do in colStatScan, colStatSelect, etc. This also means that we may need to move this function call after any call to selectivityFromDistinctCounts. (But that kind of breaks the flow of the calling functions -- play around with it and see if you can fix the logic without hurting the flow too much)
Is this an issue though? It seems like we only call this function after we've already called applyConstraint or applyConstraintSet - so either the distinct count has already been updated to match the constraint, or that particular row is unconstrained and will later have its counts modified when we call ApplySelectivity. I may need to dive in deeper to be confident with this solution though.
pkg/sql/opt/memo/statistics_builder.go, line 1788 at r2 (raw file):
Previously, rytaft wrote…
[nit] Why don't you move these definitions outside the if statement so we can use them in the if statement as well.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1828 at r2 (raw file):
Previously, rytaft wrote…
Isn't it sufficient to check if colStat.NullCount < inputStat.NullCount?
Done.
rytaft
requested changes
Oct 5, 2018
Nice changes! I've added a few more comments inline, but my big comment now is that I think you need to add some more tests in memo/testdata/stats that explicitly exercise all the changes you've made. Also, take a look at the stats in the existing tests and make sure you can convince yourself that they make sense.
Reviewed 2 of 37 files at r1, 6 of 37 files at r2, 33 of 33 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 332 at r2 (raw file):
if fd.ColsAreLaxKey(colSet) {
	colStat.NullCount = unknownNullCountRatio * s.RowCount
	colStat.DistinctCount = s.RowCount - colStat.NullCount
I think this old formula was better for this particular case (sorry if my comment below was confusing -- I didn't mean for you to change this one).
pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done - moved to RowCount only.
This is probably fine, but maybe we should see which one produces better plans. It's possible we should actually be doing something in between these two extremes.... It would be good to add some tests that explicitly exercise this case.
pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
Nice!
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
I'd just say:
// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Is this an issue though? It seems like we only call this function after we've already called applyConstraint or applyConstraintSet - so either the distinct count has already been updated to match the constraint, or that particular row is unconstrained and will later have its counts modified when we call ApplySelectivity. I may need to dive in deeper to be confident with this solution though.
The problem is you're not calling ApplySelectivity on this particular colStat. Once it's in the cache it doesn't get updated again. So I think you need to add a call to ApplySelectivity here, but make sure that doesn't cause the final selectivity to change.
pkg/sql/opt/memo/statistics_builder.go, line 195 at r3 (raw file):
// statsFromChild retrieves the main statistics struct from a specific child
// of the given expression.
func (sb *statisticsBuilder) statFromChild(ev ExprView, childIdx int) *props.Statistics {
[nit] statFromChild -> statsFromChild
pkg/sql/opt/memo/statistics_builder.go, line 940 at r3 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in either column set.
f1 := lookupColStat.NullCount / tableStats.RowCount
Since ApplySelectivity updates null count, I think this denominator should be inputStats.RowCount
pkg/sql/opt/memo/statistics_builder.go, line 1106 at r3 (raw file):
case opt.ExceptOp, opt.ExceptAllOp:
	colStat.DistinctCount = leftColStat.DistinctCount
	colStat.NullCount = leftColStat.NullCount
leftColStat.NullCount -> leftNullCount
pkg/sql/opt/memo/statistics_builder_test.go, line 177 at r3 (raw file):
statsFunc(
	cs2,
	"[rows=3.33333333e+09, distinct(2)=500, null(2)=0]",
This is an example of the issue where the distinct count of (2) is just copied from the input as-is. It should be scaled with a call to ApplySelectivity.
pkg/sql/opt/memo/testdata/stats/select, line 199 at r3 (raw file):
# Hide stats because when run with testrace, more stats are populated than
# otherwise. See comment in MemoizeDenormExpr in memo.go.
opt format=hide-stats
One thing you can do to get around this is specifically request certain column stats using the colstats directive. I think that will allow you to ensure you get the same results for test and testrace.
pkg/sql/opt/props/statistics.go, line 84 at r3 (raw file):
s.RowCount = 0
for i, n := 0, s.ColStats.Count(); i < n; i++ {
	s.ColStats.Get(i).DistinctCount = 0
update null count here too
pkg/sql/opt/props/statistics.go, line 165 at r3 (raw file):
c.DistinctCount = d - d*math.Pow(1-selectivity, n/d)
// Since the null count is a more simpler count of all null rows, we can
[nit] more simpler -> simple
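The distinct-count line quoted above can be exercised on its own. In this sketch d is the input distinct count and n the input row count, as in the quoted expression. The function names are hypothetical. Scaling the null count linearly by the selectivity is an assumption, since the quoted comment is cut off before it says what is done.

```go
package main

import (
	"fmt"
	"math"
)

// scaleDistinct applies the quoted formula d - d*(1-selectivity)^(n/d):
// each distinct value has n/d rows on average, and it survives the filter
// unless all of those rows are removed.
func scaleDistinct(d, n, selectivity float64) float64 {
	if d == 0 {
		return 0 // nothing to scale, and avoids dividing by zero
	}
	return d - d*math.Pow(1-selectivity, n/d)
}

// scaleNulls is an assumed linear scaling: NULL rows are filtered at the
// same rate as any other row.
func scaleNulls(nullCount, selectivity float64) float64 {
	return nullCount * selectivity
}

func main() {
	// 1000 rows, 100 distinct values, half of the rows filtered out:
	// almost all distinct values are expected to survive (about 99.9).
	fmt.Println(scaleDistinct(100, 1000, 0.5))
	// The null count drops in proportion to the selectivity.
	fmt.Println(scaleNulls(40, 0.5))
}
```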
rytaft
requested changes
Oct 6, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Any ideas of alternatives? It's hard to deduce this without understanding how the rows interleave - but there may be a decent heuristic.
I think we should follow a similar approach to how we calculated the row count. So there are 3 steps:
- Estimate the null count for the cross join
- Scale the null count based on the selectivity of the filter
- Adjust if needed for outer joins
For 1, I think the formula is something like:
crossJoinNullCount := leftColStat.NullCount * s.RowCount + rightColStat.NullCount * s.RowCount - leftColStat.NullCount * rightColStat.NullCount
For 2, you can just call colStat.ApplySelectivity
For 3, take a look at what we're doing in buildJoin. For example, for left join, you should adjust the null count like this:
colStat.NullCount = max(innerJoinNullCount, leftColStat.NullCount)
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
} else {
	inputRowCount := sb.statFromChild(ev, 0 /* childIdx */).RowCount
	colStat.NullCount = ((colStat.DistinctCount + 1) / inputRowCount) * inputColStat.NullCount
Thinking about this more, I think this formula makes more sense if we include null values in the distinct count. I think that would make a lot of things here easier to reason about. @RaduBerinde, is there any reason we can't treat NULL as another distinct value? For example, consider the following data:
a | b
-----+-----
NULL | 1
NULL | 1
NULL | NULL
1 | 2
3 | 4
I think we should treat this as having DistinctCount=4 and NullCount=3. Does that seem reasonable?
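To make the proposed counting rule concrete, here is a minimal Go sketch that counts the sample rows above. It assumes NULL is treated as just another distinct value and that a row counts toward the null count when at least one column is NULL. The `value` type and `countStats` helper are hypothetical and are not part of the optimizer code.

```go
package main

import "fmt"

// value is a hypothetical nullable integer; valid == false stands for SQL NULL.
type value struct {
	v     int
	valid bool
}

func null() value     { return value{} }
func val(v int) value { return value{v: v, valid: true} }

// countStats returns the number of distinct rows, treating NULL as an
// ordinary value, and the number of rows with at least one NULL column.
func countStats(rows [][2]value) (distinct, nulls int) {
	seen := make(map[[2]value]struct{})
	for _, r := range rows {
		seen[r] = struct{}{}
		if !r[0].valid || !r[1].valid {
			nulls++
		}
	}
	return len(seen), nulls
}

func main() {
	rows := [][2]value{
		{null(), val(1)},
		{null(), val(1)},
		{null(), null()},
		{val(1), val(2)},
		{val(3), val(4)},
	}
	d, n := countStats(rows)
	fmt.Printf("DistinctCount=%d NullCount=%d\n", d, n)
}
```

Under these assumptions the two (NULL, 1) rows collapse into one distinct value, which is where the 4 comes from.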
RaduBerinde
reviewed
Oct 8, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, rytaft wrote…
Thinking about this more, I think this formula makes more sense if we include null values in the distinct count. I think that would make a lot of things here easier to reason about. @RaduBerinde, is there any reason we can't treat NULL as another distinct value? For example, consider the following data:
a | b
-----+-----
NULL | 1
NULL | 1
NULL | NULL
1 | 2
3 | 4
I think we should treat this as having DistinctCount=4 and NullCount=3. Does that seem reasonable?
Sure. We'll need to update the sampler.go to put the NULLs in the sketch too but it's a simple change.
We can also change what "null" means in the multi-column case if the other definition makes more sense. We aren't even calculating multi-column stats yet so only some comments would need to be updated.
itsbilal
reviewed
Oct 9, 2018
Should have addressed all comments and issues pointed out so far. Still going through the test output for anything I may have missed, and gonna be writing more tests in opt/memo/testdata/stats next to exercise the non-zero null count cases. There's no observable difference in pkg/sql/opt/bench benchmarks, at least after benchstat-ing 5 runs.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 332 at r2 (raw file):
Previously, rytaft wrote…
I think this old formula was better for this particular case (sorry if my comment below was confusing -- I didn't mean for you to change this one).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 473 at r2 (raw file):
Previously, rytaft wrote…
It may be worth adding the following check at the end of every colStatXXX function: if colSet.SubsetOf(relProps.NotNullCols) { colStat.NullCount = 0 }. (See if it changes any test output -- alternatively, just check the code that's calculating NotNullCols inside logicalPropsBuilder, and see which operations can possibly change it from their input. Maybe these cases are already covered by the filters in Scan, Select and Join.)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
I think we should follow a similar approach to how we calculated the row count. So there are 3 steps:
- Estimate the null count for the cross join
- Scale the null count based on the selectivity of the filter
- Adjust if needed for outer joins
For 1, I think the formula is something like:
crossJoinNullCount := leftColStat.NullCount * s.RowCount + rightColStat.NullCount * s.RowCount - leftColStat.NullCount * rightColStat.NullCount
For 2, you can just call colStat.ApplySelectivity
For 3, take a look at what we're doing in buildJoin. For example, for left join, you should adjust the null count like this:
colStat.NullCount = max(innerJoinNullCount, leftColStat.NullCount)
Done - but feel free to suggest more changes since it's a fairly complex expression. The test stat output generated by it seems reasonable though.
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
Previously, rytaft wrote…
I'd just say:
// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, rytaft wrote…
The problem is you're not calling ApplySelectivity on this particular colStat. Once it's in the cache it doesn't get updated again. So I think you need to add a call to ApplySelectivity here, but make sure that doesn't cause the final selectivity to change.
Done. Required moving updateNullCountsFromProps after really every other selectivityFrom* call - not sure if that's an ideal idiom going forward but it works reliably.
pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):
Previously, rytaft wrote…
I'm not sure it's necessary to update null counts here. Equivalency groups are created based on predicates like x = y, which are null-rejecting. So both x and y will have their null counts set to 0 inside updateNullCountsFromProps. Although maybe I'm missing an edge case where x and y can both be null... it would be good to confirm before you rip this out.
Would that still work for null counts? If I add a panic for the case where one of the equivGroup cols is not in NotNullCols, it does panic - so maybe we do need to propagate non-zero null counts in some cases?
pkg/sql/opt/memo/statistics_builder.go, line 1969 at r2 (raw file):
Previously, rytaft wrote…
You should also mention that we are handling the selectivity of NULL constraints separately in selectivityFromNullCounts
Done.
pkg/sql/opt/memo/statistics_builder.go, line 195 at r3 (raw file):
Previously, rytaft wrote…
[nit] statFromChild -> statsFromChild
Done.
pkg/sql/opt/memo/statistics_builder.go, line 940 at r3 (raw file):
Previously, rytaft wrote…
Since ApplySelectivity updates null count, I think this denominator should be inputStats.RowCount
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, RaduBerinde wrote…
Sure. We'll need to update the sampler.go to put the NULLs in the sketch too but it's a simple change.
We can also change what "null" means in the multi-column case if the other definition makes more sense. We aren't even calculating multi-column stats yet so only some comments would need to be updated.
Can this be part of a follow up change? It's not necessary for the introduction of null counts, and it can just be a simple reorganization once this behemoth is in.
pkg/sql/opt/memo/statistics_builder.go, line 1106 at r3 (raw file):
Previously, rytaft wrote…
leftColStat.NullCount -> leftNullCount
Done.
pkg/sql/opt/memo/statistics_builder_test.go, line 177 at r3 (raw file):
Previously, rytaft wrote…
This is an example of the issue where the distinct count of (2) is just copied from the input as-is. It should be scaled with a call to ApplySelectivity.
Done. The exact output here didn't change even after the fix because the distinct count is so much smaller than the row count, that the ApplySelectivity formula barely modifies it.
pkg/sql/opt/memo/testdata/stats/select, line 199 at r3 (raw file):
Previously, rytaft wrote…
One thing you can do to get around this is specifically request certain column stats using the colstats directive. I think that will allow you to ensure you get the same results for test and testrace.
That didn't really help, since that only works on the root expression. If exploration rules swap out a child expression, colstats has no impact on that output. I added an additional call to logPropsBuilder.buildProps in the tester that's run on every child to fix this.
pkg/sql/opt/props/statistics.go, line 84 at r3 (raw file):
Previously, rytaft wrote…
update null count here too
Done.
pkg/sql/opt/props/statistics.go, line 165 at r3 (raw file):
Previously, rytaft wrote…
[nit] more simpler -> simple
Done.
RaduBerinde
reviewed
Oct 10, 2018
Hi Bilal, great work on a tedious change! I have some comments, I'll do a final review after your next update.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/memo.go, line 433 at r4 (raw file):
// DeriveLogicalProps derives logical props for the specified expression,
// usually resulting in population of child stats in the test output tree.
func DeriveLogicalProps(evalCtx *tree.EvalContext, ev ExprView) {
[nit] The comment should explain why/when this should be used. It's a very narrow use-case and it should be explicit that it's just for testing.
pkg/sql/opt/memo/memo.go, line 507 at r4 (raw file):
m.groups = append(m.groups, makeMemoGroup(tmpGroupID, denorm)) ev := MakeNormExprView(m, tmpGroupID) // Building out logical props could lead to more lazy population of
One possibility is to set a flag in logPropsBuilder to skip the building of stats (this code doesn't care about stats).
pkg/sql/opt/memo/memo.go, line 511 at r4 (raw file):
// when run with and without -race. // // TODO: Figure out a way to either run this for all tests, or
[nit] usually TODO(yourname) (btw the name isn't indicative of who is supposed to work on it, it's really indicative of who one should ask for clarifications)
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done - but feel free to suggest more changes since it's a fairly complex expression. The test stat output generated by it seems reasonable though.
I don't understand the formula (it doesn't look right to me).. Can you add some comments explaining it?
s.RowCount is the number of estimated rows of the join right? Say rightNullCount is 0 and leftNullCount is 100. We would have 100 times more nulls than rows? Or say s.RowCount is very small, this whole thing can easily turn negative.
I don't understand why we're multiplying s.RowCount with the count of the nulls (and not the fraction of rows that have nulls). The formula I would use would be similar to the one in colStatIndexJoin (s.RowCount * (f1 + f2 - f1 * f2)).
Also, what test stat output are you looking at specifically? This is only for multi-column stats which are unused unless we explicitly request them in tests using colstat=(1,2,3)
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Can this be part of a follow up change? It's not necessary for the introduction of null counts, and it can just be a simple reorganization once this behemoth is in.
Absolutely. You can also file an issue if it's not something you want to work on right away.
pkg/sql/opt/props/statistics.go, line 135 at r4 (raw file):
// DistinctCount is the estimated number of distinct values of this
// set of columns for this expression.
[nit] Mention here how this interacts with NullCount (specifically mention that any rows that have at least a null don't contribute to this count). And maybe leave a TODO to consider changing things so DistinctCount counts these as well (as Becca suggested).
pkg/sql/opt/testutils/opt_tester.go, line 341 at r4 (raw file):
// Derive logical props for all child expressions, even
// denormalized ones - to make statistics look complete
[nit] This could use a bit more explanation, it's not enough for someone looking at this for the first time. Explain that we're not interested in the logical props, but calculating them (specifically the statistics) will trigger lazy calculation of column stats in child expressions.
itsbilal
reviewed
Oct 10, 2018
Thanks for the review Radu! I've added more tests that exercise many of the complex parts of this. Also addressed the panics we were seeing.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, RaduBerinde wrote…
I don't understand the formula (it doesn't look right to me).. Can you add some comments explaining it?
s.RowCount is the number of estimated rows of the join right? Say rightNullCount is 0 and leftNullCount is 100. We would have 100 times more nulls than rows? Or say s.RowCount is very small, this whole thing can easily turn negative.
I don't understand why we're multiplying s.RowCount with the count of the nulls (and not the fraction of rows that have nulls). The formula I would use would be similar to the one in colStatIndexJoin (s.RowCount * (f1 + f2 - f1 * f2)).
Also, what test stat output are you looking at specifically? This is only for multi-column stats which are unused unless we explicitly request them in tests using colstat=(1,2,3)
That makes more sense, actually, to use the independent probability rule. I've added a simple explanation in the comments.
I've also added a basic test case with multi-col stats across join columns so you can see the output, but it doesn't seem right just yet. I'll keep looking into this tomorrow.
rytaft
requested changes
Oct 11, 2018
Reviewed 1 of 31 files at r4.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
That makes more sense, actually, to use the independent probability rule. I've added a simple explanation in the comments.
I've also added a basic test case with multi-col stats across join columns so you can see the output, but it doesn't seem right just yet. I'll keep looking into this tomorrow.
Sorry, I screwed up the formula for the null count in the cross join in my formula. I think it should be:
crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount
Does that make more sense? Maybe Radu's solution is still better - haven't had time to think about it much, but I think this new formula should also work.
RaduBerinde
reviewed
Oct 11, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
Sorry, I screwed up the formula for the null count in the cross join in my formula. I think it should be:
crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount
Does that make more sense? Maybe Radu's solution is still better - haven't had time to think about it much, but I think this new formula should also work.
Your formula is equivalent to leftRowCount * rightRowCount * (f1 + f2 - f1 * f2), which becomes s.RowCount * (f1 + f2 - f1 * f2) (my formula) after multiplying with Selectivity (right?) So either one works, just don't use my formula and then also apply Selectivity :)
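The equivalence claimed here can be checked numerically. The Go sketch below is an illustration with made-up inputs: it computes the cross join null count from the revised formula, scales it by the selectivity, and compares that with s.RowCount * (f1 + f2 - f1 * f2), where f1 and f2 are the null fractions on each side. The function names and sample numbers are hypothetical and do not come from statistics_builder.go.

```go
package main

import "fmt"

// crossJoinNullCount is the revised formula from the thread: rows of the
// cross product in which the left or the right column set is null.
func crossJoinNullCount(leftNulls, leftRows, rightNulls, rightRows float64) float64 {
	return leftNulls*rightRows + rightNulls*leftRows - leftNulls*rightNulls
}

// joinNullCount is the fraction-based formula: join rows multiplied by the
// probability that at least one side is null, assuming independence.
func joinNullCount(joinRows, f1, f2 float64) float64 {
	return joinRows * (f1 + f2 - f1*f2)
}

func main() {
	leftNulls, leftRows := 100.0, 1000.0
	rightNulls, rightRows := 50.0, 200.0
	selectivity := 0.01

	// Route 1: cross join null count, then apply the join selectivity once.
	viaCross := crossJoinNullCount(leftNulls, leftRows, rightNulls, rightRows) * selectivity

	// Route 2: selectivity is already folded into the join row count, so it
	// must not be applied a second time.
	joinRows := leftRows * rightRows * selectivity
	viaFractions := joinNullCount(joinRows, leftNulls/leftRows, rightNulls/rightRows)

	fmt.Println(viaCross, viaFractions)
}
```

Both routes land on the same estimate (about 650 for these inputs), which is why applying the selectivity on top of the fraction-based formula would count it twice.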
rytaft
Oct 11, 2018
Contributor
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, RaduBerinde wrote…
Your formula is equivalent to leftRowCount * rightRowCount * (f1 + f2 - f1 * f2), which becomes s.RowCount * (f1 + f2 - f1 * f2) (my formula) after multiplying with Selectivity (right?) So either one works, just don't use my formula and then also apply Selectivity :)
You’re totally right - thanks, Radu!
itsbilal
reviewed
Oct 11, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
You’re totally right - thanks, Radu!
👍
Whoops, my bad about the double application of selectivity. Fixed - please take another look! The stats in test files (specifically the ones at the bottom of memo/testdata/stats/join) look better now.
pkg/sql/opt/testutils/opt_tester.go, line 341 at r4 (raw file):
Previously, RaduBerinde wrote…
[nit] This could use a bit more explanation, it's not enough for someone looking at this for the first time. Explain that we're not interested in the logical props, but calculating them (specifically the statistics) will trigger lazy calculation of column stats in child expressions.
Done
pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Looking into this.
Done. We were not processing selectivity for one of the nullable-but-null-rejected columns due to an input FD.
rytaft
requested changes
Oct 11, 2018
You're making great progress! This is tricky/subtle stuff.
Related to one of your comments below, if there's any change I suggested that feels overwhelming or somewhat orthogonal to this work, feel free to write a TODO and open an issue. You don't need to make null counts perfect in one PR (it took many PRs to get the distinct count code to where it is right now, and it's still not perfect).
Reviewed 4 of 31 files at r4, 2 of 12 files at r5.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done. Required moving updateNullCountsFromProps after really every other selectivityFrom* call - not sure if that's an ideal idiom going forward but it works reliably.
Yea, I think we're going to need to do some refactoring at some point, but this is fine for this PR.
pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Would that still work for null counts? If I add a panic for the case where one of the equivGroup cols is not in NotNullCols, it does panic - so maybe we do need to propagate non-zero null counts in some cases?
Ok thanks for checking! I don't think this logic is hurting anything, so best to leave it as-is.
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, RaduBerinde wrote…
Absolutely. You can also file an issue if it's not something you want to work on right away.
Yes, this can definitely wait until another PR. I'm not even 100% sure this is something we should do since it may make the selectivity calculations more difficult. Filing an issue sounds like a good idea.
pkg/sql/opt/memo/statistics_builder.go, line 566 at r5 (raw file):
// Note that for null count selectivity calculations, we use the un-reduced constraint
// columns. This is so we don't miss any null-rejecting filters.
s.ApplySelectivity(sb.selectivityFromNullCounts(nonReducedCols, ev, s, inputRowCount))
Can you add an example to the comment of a query where this matters?
pkg/sql/opt/memo/statistics_builder.go, line 737 at r5 (raw file):
s.RowCount = leftStats.RowCount * rightStats.RowCount // If one of the two sides has 0 rows, pick the other side. inputRowCount := max(s.RowCount, max(leftStats.RowCount, rightStats.RowCount))
I think you probably just want to use s.RowCount here. Then you'll make adjustments for outer joins below where we're doing similar adjustments for row count. (If one of the sides has 0 rows, the join output will have 0 rows unless it's an outer join)
pkg/sql/opt/memo/statistics_builder.go, line 857 at r5 (raw file):
rightColStat := *sb.colStatFromJoinRight(rightCols, ev) // Null count estimation - assume a cross join and then bump the null count later
With Radu's formula you're assuming an inner join, not a cross join
pkg/sql/opt/memo/statistics_builder.go, line 859 at r5 (raw file):
// Null count estimation - assume a cross join and then bump the null count later // based on the type of join. // Here, f1 and f2 are probabilities of nulls in either sides of the join.
[nit] in either sides -> on either side
pkg/sql/opt/memo/statistics_builder.go, line 895 at r5 (raw file):
leftJoinNullCount := max(colStat.NullCount, leftColStat.NullCount)
rightJoinNullCount := max(colStat.NullCount, rightColStat.NullCount)
colStat.NullCount = max(leftJoinNullCount, rightJoinNullCount)
I don't think this logic for full join is right -- take a look at the logic for row count in buildJoin. Following that logic it would be colStat.NullCount = leftJoinNullCount + rightJoinNullCount - colStat.NullCount. You can see why this is necessary if you think about the case where no columns match the join condition (e.g., SELECT a FULL OUTER JOIN b ON False).
In addition, there is one thing that is different for null count v. row count in the case of outer joins: we also need to take into account the null-extended rows. For example, in the case of a left join, the right columns have an additional null count which is equal to leftJoinRowCount - innerJoinRowCount. For right join, the left columns will have additional nulls, and for full join, both sides will have additional nulls. So you'll need to add one or two extra terms to all of the outer join calculations here.
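As a rough illustration of the null-extended term, the Go sketch below estimates the null count of right-side columns in a left join, and the full join row count that the null count logic is meant to mirror. It assumes that the left join row count is max(innerJoinRowCount, leftRowCount) and that the full join row count is leftJoinRowCount + rightJoinRowCount - innerJoinRowCount, which is how the row count logic is described in this thread. All function names are hypothetical, and the sketch deliberately ignores column sets that span both sides of the join.

```go
package main

import (
	"fmt"
	"math"
)

// leftJoinRowCount assumes every left row appears at least once in the output.
func leftJoinRowCount(innerRows, leftRows float64) float64 {
	return math.Max(innerRows, leftRows)
}

// fullJoinRowCount combines both outer sides and removes the inner join
// rows that would otherwise be counted twice.
func fullJoinRowCount(innerRows, leftRows, rightRows float64) float64 {
	return math.Max(innerRows, leftRows) + math.Max(innerRows, rightRows) - innerRows
}

// rightColNullsInLeftJoin estimates nulls in right-side columns of a left
// join: nulls surviving the inner join, plus one null per null-extended row.
func rightColNullsInLeftJoin(innerNulls, innerRows, leftRows float64) float64 {
	nullExtended := leftJoinRowCount(innerRows, leftRows) - innerRows
	return innerNulls + nullExtended
}

func main() {
	// ON false: the inner join is empty, so every left row is null-extended.
	fmt.Println(rightColNullsInLeftJoin(0, 0, 10))
	fmt.Println(fullJoinRowCount(0, 10, 5))

	// Partial match: 8 inner rows (1 of them null on the right), 10 left rows.
	fmt.Println(rightColNullsInLeftJoin(1, 8, 10))
}
```

For a column set that covers both sides, simply adding the null-extended rows to the nulls carried over from the inputs can count the same row twice, so that case would need an overlap correction on top of this sketch.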
Sorry I forgot to mention this earlier!
pkg/sql/opt/memo/statistics_builder.go, line 896 at r5 (raw file):
rightJoinNullCount := max(colStat.NullCount, rightColStat.NullCount) colStat.NullCount = max(leftJoinNullCount, rightJoinNullCount) }
All of this logic also needs to apply when the requested columns only come from the left or right side of the join (in the cases above where rightCols.Empty() or leftCols.Empty()). Also, make sure that for those cases you're not applying selectivity twice since we need to call colStat.ApplySelectivity for the distinct count.
Finally, I think you'll want to copy most -- if not all -- of this logic to buildJoin as well (or make a helper function and call it in both places).
pkg/sql/opt/memo/statistics_builder.go, line 1730 at r5 (raw file):
colStat = sb.copyColStat(colSet, s, sb.colStatFromInput(colSet, ev)) if s.Selectivity != 1.0 { colStat.ApplySelectivity(s.Selectivity, s.RowCount)
Should we be using inputRowCount instead of s.RowCount?
pkg/sql/opt/memo/statistics_builder.go, line 1955 at r5 (raw file):
if inputStat.NullCount > rowCount {
	fmt.Println(rowCount)
	fmt.Println(inputStat.NullCount)
excess debug stuff here
pkg/sql/opt/memo/testdata/logprops/constraints, line 695 at r5 (raw file):
├── columns: k:1(int!null) u:2(int!null) v:3(int!null) ├── constraint: /3/2/1: [/1/2 - /1/2] [/3/2 - /3/2] [/5/2 - /5/2] ├── stats: [rows=0.29403, distinct(1)=9.00134986e-05, null(1)=0, distinct(2)=0.29403, null(2)=0, distinct(3)=0.29403, null(3)=0]
I think this distinct count is wrong. If you look at the query, there is no reason the distinct count of k should be very different from the row count. Do you know what is causing this distinct count to be so tiny?
pkg/sql/opt/memo/testdata/stats/groupby, line 174 at r5 (raw file):
└── select ├── columns: y:2(int) z:3(float!null) s:4(string!null) max:5(int) ├── stats: [rows=120, distinct(3)=25.0620682, null(3)=0, distinct(4)=2, null(4)=0]
Here's another case of a distinct count that seems smaller than it should be.
itsbilal
reviewed
Oct 15, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 566 at r5 (raw file):
Previously, rytaft wrote…
Can you add an example to the comment of a query where this matters?
Done.
pkg/sql/opt/memo/statistics_builder.go, line 737 at r5 (raw file):
Previously, rytaft wrote…
I think you probably just want to use s.RowCount here. Then you'll make adjustments for outer joins below where we're doing similar adjustments for row count. (If one of the sides has 0 rows, the join output will have 0 rows unless it's an outer join)
It's a problem if one of the RowCounts is a fraction - since it results in the null count selectivity formula not working out. It's a corner case and I've added a comment why a max works.
pkg/sql/opt/memo/statistics_builder.go, line 857 at r5 (raw file):
Previously, rytaft wrote…
With Radu's formula you're assuming an inner join, not a cross join
Done.
pkg/sql/opt/memo/statistics_builder.go, line 859 at r5 (raw file):
Previously, rytaft wrote…
[nit] in either sides -> on either side
Done.
pkg/sql/opt/memo/statistics_builder.go, line 895 at r5 (raw file):
Previously, rytaft wrote…
I don't think this logic for full join is right -- take a look at the logic for row count in buildJoin. Following that logic it would be colStat.NullCount = leftJoinNullCount + rightJoinNullCount - colStat.NullCount. You can see why this is necessary if you think about the case where no columns match the join condition (e.g., SELECT a FULL OUTER JOIN b ON False).
In addition, there is one thing that is different for null count v. row count in the case of outer joins: we also need to take into account the null-extended rows. For example, in the case of a left join, the right columns have an additional null count which is equal to leftJoinRowCount - innerJoinRowCount. For right join, the left columns will have additional nulls, and for full join, both sides will have additional nulls. So you'll need to add one or two extra terms to all of the outer join calculations here.
Sorry I forgot to mention this earlier!
Done.
pkg/sql/opt/memo/statistics_builder.go, line 896 at r5 (raw file):
Previously, rytaft wrote…
All of this logic also needs to apply when the requested columns only come from the left or right side of the join (in the cases above where rightCols.Empty() or leftCols.Empty()). Also, make sure that for those cases you're not applying selectivity twice since we need to call colStat.ApplySelectivity for the distinct count.
Finally, I think you'll want to copy most -- if not all -- of this logic to buildJoin as well (or make a helper function and call it in both places).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1730 at r5 (raw file):
Previously, rytaft wrote…
Should we be using inputRowCount instead of s.RowCount?
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1955 at r5 (raw file):
Previously, rytaft wrote…
excess debug stuff here
Done.
pkg/sql/opt/memo/testdata/logprops/constraints, line 695 at r5 (raw file):
Previously, rytaft wrote…
I think this distinct count is wrong. If you look at the query, there is no reason the distinct count of k should be very different from the row count. Do you know what is causing this distinct count to be so tiny?
Done. It was the inputRowCount change you outlined earlier.
pkg/sql/opt/memo/testdata/stats/groupby, line 174 at r5 (raw file):
Previously, rytaft wrote…
Here's another case of a distinct count that seems smaller than it should be.
Done.
rytaft
requested changes
Oct 15, 2018
Reviewed 3 of 12 files at r5, 3 of 22 files at r6, 1 of 1 files at r7.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 566 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
Nice example. I think it would help to say that the filter y IS NOT NULL is inferred in this case due to the aggregate (it's not obvious where it's coming from, otherwise).
pkg/sql/opt/memo/statistics_builder.go, line 737 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
It's a problem if one of the RowCounts is a fraction - since it results in the null count selectivity formula not working out. It's a corner case and I've added a comment why a max works.
I'm still a bit confused about how this is a problem for null count but not for distinct count. I can see what you mean that fractions could cause an issue, but the way you're using max here feels kind of arbitrary.
pkg/sql/opt/memo/statistics_builder.go, line 908 at r6 (raw file):
} sb.adjustNullCountsForOuterJoins(colSet, ev, relProps, leftCols, rightCols)
I think it would be good to add a switch statement similar to the one you added in buildJoin so we only call this for outer joins.
pkg/sql/opt/memo/statistics_builder.go, line 946 at r6 (raw file):
rightStats = sb.makeTableStatistics(lookupJoinDef.Table) } else { rightStats = sb.statsFromChild(ev, 1 /* childIdx */)
I think it would be better to pass in inputRowCount instead of redoing all this calculation. You already have inputRowCount available in both calling functions.
pkg/sql/opt/memo/statistics_builder.go, line 969 at r6 (raw file):
} rightNullCount = colStat.NullCount }
Seems like this could be a separate helper function. You already have this logic in colStatJoin, so you could just call it for buildJoin, and pass in leftNullCount and rightNullCount as parameters to this function.
pkg/sql/opt/memo/statistics_builder.go, line 982 at r6 (raw file):
colStat.NullCount = max(colStat.NullCount, leftNullCount) if !rightCols.Empty() { colStat.NullCount += s.RowCount - innerJoinRowCount
It feels like we might be double-counting some nulls here and below, since some rows that have null values on the left could be the same rows that are null-extended on the right. Perhaps we should subtract the expected number of collisions between nulls from the left and the null-extended values on the right (similar to what you did above with Radu's formula).
pkg/sql/opt/memo/statistics_builder.go, line 993 at r6 (raw file):
case opt.FullJoinOp, opt.FullJoinApplyOp: // All rows from both sides should be in the result.
rows -> nulls
pkg/sql/opt/memo/testdata/stats/groupby, line 284 at r7 (raw file):
"created_at": "2018-01-01 2:10:00.00000+00:00", "row_count": 2000, "distinct_count": 600
I think this should have a non-zero null count in order for these tests to make sense
pkg/sql/opt/memo/testdata/stats/groupby, line 310 at r7 (raw file):
└── scan a ├── columns: x:1(int!null) y:2(int) z:3(float!null) s:4(string) ├── stats: [rows=2000, distinct(1,2)=1980, null(1,2)=20]
Hmm why didn't these values change after you updated the null count stats?
itsbilal
reviewed
Oct 16, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/opt/memo/statistics_builder.go, line 737 at r5 (raw file):
Previously, rytaft wrote…
I'm still a bit confused about how this is a problem for null count but not for distinct count. I can see what you mean that fractions could cause an issue, but the way you're using max here feels kind of arbitrary.
Distinct count calculations don't use the rowcount though - and so any multiplications of newDistinct/oldDistinct to the selectivity always reduce the selectivity value (i.e. make it more selective). For null counts, we have rowCount in the denominator, and so we're trying to ensure it's never less than the null counts - otherwise we may unintentionally multiply the selectivity against a fraction outside [0,1].
I guess one more intuitive and less arbitrary alternative is to cap the fraction in selectivityFromNullCounts at [0,1] - but that might hide away other places where we're sending a non-ideal row count.
pkg/sql/opt/memo/statistics_builder.go, line 908 at r6 (raw file):
Previously, rytaft wrote…
I think it would be good to add a switch statement similar to the one you added in buildJoin so we only call this for outer joins.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 946 at r6 (raw file):
Previously, rytaft wrote…
I think it would be better to pass in inputRowCount instead of redoing all this calculation. You already have inputRowCount available in both calling functions.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 969 at r6 (raw file):
Previously, rytaft wrote…
Seems like this could be a separate helper function. You already have this logic in colStatJoin, so you could just call it for buildJoin, and pass in leftNullCount and rightNullCount as parameters to this function.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 982 at r6 (raw file):
Previously, rytaft wrote…
It feels like we might be double-counting some nulls here and below, since some rows that have null values on the left could be the same rows that are null-extended on the right. Perhaps we should subtract the expected number of collisions between nulls from the left and the null-extended values on the right (similar to what you did above with Radu's formula).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 993 at r6 (raw file):
Previously, rytaft wrote…
rows -> nulls
Done.
pkg/sql/opt/memo/testdata/stats/groupby, line 284 at r7 (raw file):
Previously, rytaft wrote…
I think this should have a non-zero null count in order for these tests to make sense
Done.
pkg/sql/opt/memo/testdata/stats/groupby, line 310 at r7 (raw file):
Previously, rytaft wrote…
Hmm why didn't these values change after you updated the null count stats?
It was a bug in the ColsAreLaxKey case in colStatLeaf - where we'd ignore column statistics even if we had them. Fixed.
pkg/sql/opt/props/statistics.go, line 135 at r4 (raw file):
Previously, RaduBerinde wrote…
[nit] Mention here how this interacts with NullCount (specifically mention that any rows that have at least a null don't contribute to this count). And maybe leave a TODO to consider changing things so DistinctCount counts these as well (as Becca suggested).
Done.
rytaft
requested changes
Oct 16, 2018
Reviewed 1 of 7 files at r8.
pkg/sql/opt/memo/statistics_builder.go, line 737 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Distinct count calculations don't use the rowcount though - and so any multiplications of newDistinct/oldDistinct to the selectivity always reduce the selectivity value (i.e. make it more selective). For null counts, we have rowCount in the denominator, and so we're trying to ensure it's never less than the null counts - otherwise we may unintentionally multiply the selectivity against a fraction outside [0,1].
I guess one more intuitive and less arbitrary alternative is to cap the fraction in selectivityFromNullCounts at [0,1] - but that might hide away other places where we're sending a non-ideal row count.
Ah I see the problem. selectivityFromNullCounts is correct for Select, but not for Join.
This general formula from your comment is right:
// ┬-┬ ⎛ old null(i) - new null(i)⎞
// selectivity = │ │ ⎜ 1 - ------------------------ ⎟
// ┴ ┴ ⎝ old row count(i) ⎠
// i in
// {constrained
// columns}
but the problem is that the values for old null(i) and new null(i) need to be different for join. The values you are using -- inputStat.NullCount and colStat.NullCount -- work for Select, but they need to change for Join. I think you'll want to use something like crossJoinNullCount instead of inputStat.NullCount for old null(i) (from my revised formula: crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount). Not sure exactly what new null(i) should be, but it seems like it's either equal to 0 or old null(i), right?
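For illustration, here is a minimal Go sketch of the arithmetic discussed in this comment. It is not the PR's code: nullSelectivity, the zero-row guard, and the sample statistics are assumptions made up for the example. Only the crossJoinNullCount expression and the per-column selectivity term come from the formulas quoted above.

```go
package main

import "fmt"

// crossJoinNullCount estimates how many rows of a cross join carry a NULL in
// the column set: rows with a NULL from the left, plus rows with a NULL from
// the right, minus the rows counted twice because both sides are NULL.
func crossJoinNullCount(leftNull, rightNull, leftRows, rightRows float64) float64 {
	return leftNull*rightRows + rightNull*leftRows - leftNull*rightNull
}

// nullSelectivity (hypothetical helper) is one factor of the product in the
// formula above: 1 - (old null - new null) / old row count.
func nullSelectivity(oldNull, newNull, oldRowCount float64) float64 {
	if oldRowCount == 0 {
		return 1 // assumption: an empty input filters nothing
	}
	return 1 - (oldNull-newNull)/oldRowCount
}

func main() {
	// Made-up stats: 100 left rows (10 NULL), 50 right rows (5 NULL).
	oldNull := crossJoinNullCount(10, 5, 100, 50)
	fmt.Println("cross join null count:", oldNull)

	// Assume a null-rejecting join condition, so new null(i) is 0 and the
	// old row count is the cross join row count.
	fmt.Println("selectivity:", nullSelectivity(oldNull, 0, 100*50))
}
```

With these numbers, the cross join null count is 950 out of 5000 rows, so the selectivity factor is roughly 0.81. That matches (1 - 10/100) * (1 - 5/50), which is a quick sanity check on the inclusion-exclusion term.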
pkg/sql/opt/memo/statistics_builder.go, line 977 at r8 (raw file):
if joinType == opt.LookupJoinOp {
	lookupJoinDef := ev.Private().(*LookupJoinDef)
	joinType = lookupJoinDef.JoinType
I think you should pass in joinType
pkg/sql/opt/memo/statistics_builder.go, line 1012 at r8 (raw file):
relProps *props.Relational,
leftCols opt.ColSet,
rightCols opt.ColSet,
I don't think you need all these parameters anymore. I'd also pass in colStat and s.RowCount -- then I think you can get rid of this other stuff.
pkg/sql/opt/memo/statistics_builder.go, line 1028 at r8 (raw file):
	lookupJoinDef := ev.Private().(*LookupJoinDef)
	joinType = lookupJoinDef.JoinType
}
Same here - pass in joinType and get rid of ev
pkg/sql/opt/memo/statistics_builder.go, line 1039 at r8 (raw file):
colStat.NullCount = max(colStat.NullCount, leftNullCount)
if !rightCols.Empty() {
	f1 := colStat.NullCount / s.RowCount
I think this should be leftNullCount instead of colStat.NullCount
pkg/sql/opt/memo/statistics_builder.go, line 1048 at r8 (raw file):
colStat.NullCount = max(colStat.NullCount, rightNullCount)
if !leftCols.Empty() {
	f1 := colStat.NullCount / s.RowCount
And I think this should be rightNullCount
pkg/sql/opt/memo/statistics_builder.go, line 1061 at r8 (raw file):
f1 := colStat.NullCount / s.RowCount
f2 := (s.RowCount - innerJoinRowCount) / s.RowCount
colStat.NullCount += s.RowCount * (f2 - f1*f2)
And here it gets tricky -- I think you want to do something like:
if !leftCols.Empty() {
f1 := rightNullCount / rightJoinRowCount
f2 := (rightJoinRowCount - innerJoinRowCount) / rightJoinRowCount
colStat.NullCount += rightJoinRowCount * (f2 - f1*f2)
}
if !rightCols.Empty() {
...
}
Where
leftJoinRowCount := max(innerJoinRowCount, leftStats.RowCount)
rightJoinRowCount := max(innerJoinRowCount, rightStats.RowCount)
(Does that seem right? I may have made a mistake somewhere...)
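As a worked example, the following Go sketch transcribes the arithmetic suggested above for one side of an outer join. It is only a sketch: outerJoinAddedNulls, the zero-row guard, and the sample numbers are assumptions, and the reading of f1 and f2 in the comments is an interpretation of the suggestion rather than confirmed behavior of the PR.

```go
package main

import (
	"fmt"
	"math"
)

// outerJoinAddedNulls (hypothetical helper) returns the value that the
// suggested formula adds to colStat.NullCount for one side of an outer join.
//
//	f2 is read here as the fraction of output rows that were null-extended.
//	f1 is read here as the fraction of rows already NULL.
//	f2 - f1*f2 subtracts the expected overlap of the two, which assumes
//	they are independent.
func outerJoinAddedNulls(nullCount, sideRowCount, innerJoinRowCount float64) float64 {
	// Mirrors: rightJoinRowCount := max(innerJoinRowCount, rightStats.RowCount)
	joinRowCount := math.Max(innerJoinRowCount, sideRowCount)
	if joinRowCount == 0 {
		return 0 // assumption: no rows means no nulls to add
	}
	f1 := nullCount / joinRowCount
	f2 := (joinRowCount - innerJoinRowCount) / joinRowCount
	return joinRowCount * (f2 - f1*f2)
}

func main() {
	// Made-up stats: 100 rows on this side, 10 of them NULL, and an
	// estimated inner join row count of 60.
	fmt.Println(outerJoinAddedNulls(10, 100, 60))

	// If the inner join already yields at least as many rows as this side
	// has, nothing is null-extended and nothing is added.
	fmt.Println(outerJoinAddedNulls(10, 100, 150))
}
```

In the first call f1 is 0.1 and f2 is 0.4, so about 36 nulls are added rather than the full 40 null-extended rows. The difference is the expected collision with rows that were already NULL.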
rytaft
requested changes
Oct 16, 2018
pkg/sql/opt/memo/statistics_builder.go, line 894 at r8 (raw file):
	colStat.ApplySelectivity(s.Selectivity, inputRowCount)
}
leftNullCount = colStat.NullCount
As is, I don't think the null count calculation in colStat.ApplySelectivity is correct for joins, so I don't think you should be using this value. The problem is that s.Selectivity is in relation to the inputRowCount (cross join row count), but colStat.NullCount is just the null count for the left table.
I think the correct formula here is leftNullCount = colStat.NullCount * rightRowCount * s.Selectivity, and you should do it for all join types before you call ApplySelectivity (right after the call to copyColStat). This is basically calculating the number of null values for this column set after an inner join, which is what you want for your formulas below. (Also, make sure to set colStat.NullCount = leftNullCount after the call to ApplySelectivity. You'll fix it for outer joins in your code below.)
Alternatively, a (probably) better approach would be to write a function -- perhaps a method on ColumnStatistic -- called ApplyJoinSelectivity, which takes the join type as a parameter, and encapsulates all this complexity. Happy to help think through the details of that offline.
Similar logic applies to the cases below.
(Also fix in the leftRightNullCounts function)
rytaft
Oct 17, 2018
Contributor
pkg/sql/opt/memo/statistics_builder.go, line 894 at r8 (raw file):
Previously, rytaft wrote…
As is, I don't think the null count calculation in colStat.ApplySelectivity is correct for joins, so I don't think you should be using this value. The problem is that s.Selectivity is in relation to the inputRowCount (cross join row count), but colStat.NullCount is just the null count for the left table.
I think the correct formula here is leftNullCount = colStat.NullCount * rightRowCount * s.Selectivity, and you should do it for all join types before you call ApplySelectivity (right after the call to copyColStat). This is basically calculating the number of null values for this column set after an inner join, which is what you want for your formulas below. (Also, make sure to set colStat.NullCount = leftNullCount after the call to ApplySelectivity. You'll fix it for outer joins in your code below.)
Alternatively, a (probably) better approach would be to write a function -- perhaps a method on ColumnStatistic -- called ApplyJoinSelectivity, which takes the join type as a parameter, and encapsulates all this complexity. Happy to help think through the details of that offline.
Similar logic applies to the cases below.
(Also fix in the leftRightNullCounts function)
Sorry, thinking about this more, I'm making it more complicated than it needs to be. There are basically three things you need for each of these cases to pass into adjustNullCountsForOuterJoin:
- leftNullCount, which is the null count for the left side columns from colSet before the join and before any selectivity has been applied. To fix this, you just need to move this line leftNullCount = colStat.NullCount above the switch statement, so it's before the call to ApplySelectivity. (Same with the else case.)
- rightNullCount, which is symmetric for the right side. (So move rightNullCount = colStat.NullCount above the switch in the next case, and same with the else case.)
- colStat.NullCount, which should be updated to contain the estimated null count after an inner join for the full column set. You're already doing this correctly in the else case, but not for if rightCols.Empty() or if leftCols.Empty(). I think you can just move the logic that you already have in the else case below the if block so it applies to all three cases.
So no need to change very much here... let me know if this doesn't make sense - thanks!
itsbilal
reviewed
Oct 17, 2018
pkg/sql/opt/memo/statistics_builder.go, line 566 at r5 (raw file):
Previously, rytaft wrote…
Nice example. I think it would help to say that the filter y IS NOT NULL is inferred in this case due to the aggregate (it's not obvious where it's coming from, otherwise).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 737 at r5 (raw file):
Previously, rytaft wrote…
Ah I see the problem. selectivityFromNullCounts is correct for Select, but not for Join.
This general formula from your comment is right:
// ┬-┬ ⎛ old null(i) - new null(i)⎞
// selectivity = │ │ ⎜ 1 - ------------------------ ⎟
// ┴ ┴ ⎝ old row count(i) ⎠
// i in
// {constrained
// columns}
but the problem is that the values for old null(i) and new null(i) need to be different for join. The values you are using -- inputStat.NullCount and colStat.NullCount -- work for Select, but they need to change for Join. I think you'll want to use something like crossJoinNullCount instead of inputStat.NullCount for old null(i) (from my revised formula: crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount). Not sure exactly what new null(i) should be, but it seems like it's either equal to 0 or old null(i), right?
Done. I've implemented a new selectivity function specifically for the join case (called joinSelectivityFromNullCounts) which implements the formula that you mentioned. Thanks so much for thinking through this!
pkg/sql/opt/memo/statistics_builder.go, line 894 at r8 (raw file):
Previously, rytaft wrote…
Sorry, thinking about this more, I'm making it more complicated than it needs to be. There are basically three things you need for each of these cases to pass into adjustNullCountsForOuterJoin:
- leftNullCount, which is the null count for the left side columns from colSet before the join and before any selectivity has been applied. To fix this, you just need to move this line leftNullCount = colStat.NullCount above the switch statement, so it's before the call to ApplySelectivity. (Same with the else case.)
- rightNullCount, which is symmetric for the right side. (So move rightNullCount = colStat.NullCount above the switch in the next case, and same with the else case.)
- colStat.NullCount, which should be updated to contain the estimated null count after an inner join for the full column set. You're already doing this correctly in the else case, but not for if rightCols.Empty() or if leftCols.Empty(). I think you can just move the logic that you already have in the else case below the if block so it applies to all three cases.
So no need to change very much here... let me know if this doesn't make sense - thanks!
Done. That makes sense - thanks again! It looks like this change also helped improve null count generation from the join operator in cases where we had high null counts down the tree.
pkg/sql/opt/memo/statistics_builder.go, line 977 at r8 (raw file):
Previously, rytaft wrote…
I think you should pass in joinType
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1012 at r8 (raw file):
Previously, rytaft wrote…
I don't think you need all these parameters anymore. I'd also pass in colStat and s.RowCount -- then I think you can get rid of this other stuff.
I think having leftCols and rightCols is very much valuable in checking which side the cols fall on. We can calculate it fairly cheaply from ev, but it's still good to have it passed in. That said I've eliminated ev, relProps, and cols.
pkg/sql/opt/memo/statistics_builder.go, line 1028 at r8 (raw file):
Previously, rytaft wrote…
Same here - pass in joinType and get rid of ev
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1039 at r8 (raw file):
Previously, rytaft wrote…
I think this should be leftNullCount instead of colStat.NullCount
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1048 at r8 (raw file):
Previously, rytaft wrote…
And I think this should be rightNullCount
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1061 at r8 (raw file):
Previously, rytaft wrote…
And here it gets tricky -- I think you want to do something like:
if !leftCols.Empty() {
f1 := rightNullCount / rightJoinRowCount
f2 := (rightJoinRowCount - innerJoinRowCount) / rightJoinRowCount
colStat.NullCount += rightJoinRowCount * (f2 - f1*f2)
}
if !rightCols.Empty() {
...
}
Where
leftJoinRowCount := max(innerJoinRowCount, leftStats.RowCount)
rightJoinRowCount := max(innerJoinRowCount, rightStats.RowCount)
(Does that seem right? I may have made a mistake somewhere...)
Done.
rytaft
requested changes
Oct 17, 2018
Great stuff! Getting close... I'll take another look at the test output once you make the last few changes below
pkg/sql/opt/memo/statistics_builder.go, line 1012 at r8 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I think having leftCols and rightCols is very much valuable in checking which side the cols fall on. We can calculate it fairly cheaply from ev, but it's still good to have it passed in. That said I've eliminated ev, relProps, and cols.
Ah right - makes sense. This parameter list is now kind of crazy long, which is something that @andy-kimball generally advises against. I'm not really sure what to do about it, because I think it's still preferable to recalculating all of this stuff.
I'm inclined to leave it as-is so as not to further delay this PR, but it's something to keep in mind. Sometimes you can create a simple struct or refactor a bit to clean things like this up.
pkg/sql/opt/memo/statistics_builder.go, line 1039 at r8 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
Something is still nagging me about this formula. I think what you actually want is:
colStat.NullCount += rowCount - innerJoinRowCount * (1 - leftNullCount/leftRowCount)
where leftNullCount/leftRowCount is f1 from your calculation in colStatJoin
Does this seem right? If so, update here and below for right and full join.
pkg/sql/opt/memo/statistics_builder.go, line 988 at r9 (raw file):
// specified join. If either leftCols or rightCols are empty, the corresponding
// side's return value is zero.
func (sb *statisticsBuilder) leftRightNullCounts(
You need to update this function based on what we discussed above. Should be pretty simple now, I think:
if !leftCols.Empty() {
leftColStat := sb.colStatFromJoinLeft(leftCols, ev)
leftNullCount = leftColStat.NullCount
}
if !rightCols.Empty() {
rightColStat := sb.colStatFromJoinRight(rightCols, ev)
rightNullCount = rightColStat.NullCount
}
return leftNullCount, rightNullCount
pkg/sql/opt/memo/statistics_builder.go, line 2145 at r9 (raw file):
// This selectivity will be used later to update the row count.
//
func (sb *statisticsBuilder) joinSelectivityFromNullCounts(
Nice function!
pkg/sql/opt/memo/statistics_builder.go, line 2149 at r9 (raw file):
ev ExprView,
s *props.Statistics,
rowCount float64,
I'd rename this crossJoinRowCount or inputRowCount for clarity.
rytaft
requested changes
Oct 17, 2018
pkg/sql/opt/memo/statistics_builder.go, line 767 at r9 (raw file):
// Update null counts for non-nullable columns.
sb.updateNullCountsFromProps(ev, relProps, inputRowCount)
Just realized - null counts that are not getting set to 0 in updateNullCountsFromProps probably need to be updated. I'm guessing that they are currently equal to leftNullCount or rightNullCount, and they need to instead use the formula you're using below in colStatJoin to find the inner join null count.
rytaft
requested changes
Oct 17, 2018
Reviewed 2 of 31 files at r4, 1 of 12 files at r5, 5 of 22 files at r6, 3 of 7 files at r8, 6 of 8 files at r9.
pkg/sql/opt/memo/memo.go, line 514 at r9 (raw file):
//
// TODO(itsbilal): Figure out a way to either run this for all tests, or
// cleanly restore all column stats after this step.
Do you still need this TODO? Is this not handled by DeriveLogicalProps?
pkg/sql/opt/memo/statistics_builder.go, line 1197 at r9 (raw file):
// Estimate the row count based on the distinct count of the grouping
// columns.
colStat := sb.copyColStatFromChild(groupingColSet, ev, s)
Null count for colStat needs to be updated here. (Feel free to just add a TODO for now)
itsbilal commented Oct 1, 2018
NOTE: This is a large diff with mostly minor test changes. Implementation work is confined to just these files:
This change takes null counts that are already collected in table
statistics, and propagates them through the different operators in
statistics_builder. It also uses these null counts to generate
selectivities which are then applied to the row count.
The expectation is that this change will lead to much better cardinality
estimation, especially in workloads involving lots of null values that
get filtered out by an operator or constraint. Previously, we treated
nulls like any other distinct value.
Fixes #30289
Release note: None