opt: Propagate null counts through statsbuilder #30827
Conversation
Force-pushed e6f2e47 to fa3e9fb.
This is a great start!
I did a first pass -- my main comments are:
- For multi-column stats, I believe we are determining null count based on whether at least one column in the column set is null. I think your code assumes that all columns must be null.
- We are not including null values in the distinct count. So if all values in a column are null, distinct count is 0.
- We've already determined which columns are not null in the logicalPropsBuilder. So you should be able to just use relProps.NotNullCols.
- There are a few places where you are setting NullCount to 0 where I think that might be a bit premature. I marked a couple of them, but it might be good to cross-reference against relProps.NotNullCols.
@RaduBerinde - can you confirm points 1 and 2?
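For concreteness, the two conventions in points 1 and 2 could be sketched roughly like this (a toy Go illustration using nil for SQL NULL — not the actual statisticsBuilder code):

```go
package main

import "fmt"

// row uses nil to stand in for SQL NULL.
type row []*int

func intp(v int) *int { return &v }

var sample = []row{
	{nil, intp(1)},
	{intp(1), nil},
	{intp(1), intp(2)},
}

// nullCount counts rows in which at least one column in colSet is NULL
// (point 1: not "all columns NULL").
func nullCount(rows []row, colSet []int) int {
	n := 0
	for _, r := range rows {
		for _, c := range colSet {
			if r[c] == nil {
				n++
				break
			}
		}
	}
	return n
}

// distinctCount counts distinct non-NULL values in one column; NULLs are
// excluded, so an all-NULL column has distinct count 0 (point 2).
func distinctCount(rows []row, col int) int {
	seen := map[int]bool{}
	for _, r := range rows {
		if r[col] != nil {
			seen[*r[col]] = true
		}
	}
	return len(seen)
}

func main() {
	fmt.Println(nullCount(sample, []int{0, 1})) // 2: two rows have a NULL in some column
	fmt.Println(distinctCount(sample, 0))       // 1: NULLs excluded from the count
}
```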
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/opt/metadata.go, line 169 at r1 (raw file):
// notnull is whether this column is guaranteed to not hold nulls.
notnull bool
I don't think we need to store this in the metadata. We are already storing it in the logical properties.
pkg/sql/opt/memo/statistics_builder.go, line 332 at r1 (raw file):
// duplicates where all columns are NULL.
if fd.ColsAreLaxKey(colSet) {
	colStat.DistinctCount = s.RowCount
I think you should update this to be:
colStat.DistinctCount = s.RowCount - colStat.NullCount
(and move it below the calculation of colStat.NullCount.)
pkg/sql/opt/memo/statistics_builder.go, line 333 at r1 (raw file):
if fd.ColsAreLaxKey(colSet) {
	colStat.DistinctCount = s.RowCount
	colStat.NullCount = 1
Why is this equal to 1? I don't think we can make any assumptions about the null count for lax keys. You probably want to calculate this using unknownNullCountRatio.
pkg/sql/opt/memo/statistics_builder.go, line 337 at r1 (raw file):
}
if s.RowCount == 0 {
Are we ever hitting this case? I thought we handled this case elsewhere (but I'm probably mistaken). Either way, you should move this case up above, so we don't call ColsAreLaxKey if it's not necessary.
pkg/sql/opt/memo/statistics_builder.go, line 359 at r1 (raw file):
colStatLeaf := sb.colStatLeaf(util.MakeFastIntSet(i), s, fd)
distinctCount *= colStatLeaf.DistinctCount
nullFactor *= colStatLeaf.NullCount / s.RowCount
I think we are defining the null count as the number of rows where at least one column in the column set is null. This seems to be calculating the null count assuming all columns must be null.
pkg/sql/opt/memo/statistics_builder.go, line 447 at r1 (raw file):
// Calculate row count and selectivity
// -----------------------------------
savedRowCount := s.RowCount
[nit] I think inputRowCount would be a bit more consistent with the naming in the rest of this file. You could also just use inputStats.RowCount to avoid creating a new variable.
pkg/sql/opt/memo/statistics_builder.go, line 841 at r1 (raw file):
colStat.DistinctCount = s.RowCount
}
// Similarly, the null count should be no larger than (RowCount - DistinctCount + 1).
I believe by convention we are not treating NULL as part of the distinct count. So if all values are NULL, distinct count should be 0. So this should be no larger than (RowCount - DistinctCount).
pkg/sql/opt/memo/statistics_builder.go, line 925 at r1 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in both column sets.
See comment above - I think it should be nulls in either column set
pkg/sql/opt/memo/statistics_builder.go, line 933 at r1 (raw file):
colStat.DistinctCount = s.RowCount
}
// Similarly, the null count should be no larger than (RowCount - DistinctCount + 1).
See comment above - I think it should be (RowCount - DistinctCount).
pkg/sql/opt/memo/statistics_builder.go, line 973 at r1 (raw file):
colStat, _ := s.ColStats.Add(colSet)
colStat.DistinctCount = 1
colStat.NullCount = 0
It's possible that the output of the scalar group by could be NULL (feel free to just add a TODO here so we don't forget about it).
pkg/sql/opt/memo/statistics_builder.go, line 1048 at r1 (raw file):
leftNullCount := min(1, leftColStat.NullCount)
rightNullCount := min(1, rightColStat.NullCount)
Why are you reducing to 1 here? I think this is only valid for union, intersect and except, not for the ALL variants.
pkg/sql/opt/memo/statistics_builder.go, line 1055 at r1 (raw file):
case opt.UnionOp, opt.UnionAllOp:
	colStat.DistinctCount = leftColStat.DistinctCount + rightColStat.DistinctCount
	colStat.NullCount = min(leftNullCount, rightNullCount)
I think this should be the sum here.
pkg/sql/opt/memo/statistics_builder.go, line 1256 at r1 (raw file):
// The ordinality column is a key, so every row is distinct.
colStat.DistinctCount = s.RowCount
colStat.NullCount = 0
Unless def.ColID is the only column, I'm not sure we can say NullCount is definitely zero.
pkg/sql/opt/memo/statistics_builder.go, line 1593 at r1 (raw file):
// This function should be called before selectivityFromNullCounts.
func (sb *statisticsBuilder) updateNullCountsFromConstraint(
Is there any reason you can't just use the NotNullCols logical property? I don't think you need to look at the constraint at all (that was already done in logicalPropsBuilder).
pkg/sql/opt/memo/statistics_builder.go, line 1877 at r1 (raw file):
// Short circuit if no rows
if rowCount <= 0 {
	return selectivity
Do we ever hit this case?
Thanks for the review! I should have addressed everything you pointed out. The logictest failure that existed in the first revision should also be fixed now.
pkg/sql/opt/metadata.go, line 169 at r1 (raw file):
Previously, rytaft wrote…
I don't think we need to store this in the metadata. We are already storing it in the logical properties.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 332 at r1 (raw file):
Previously, rytaft wrote…
I think you should update this to be:
colStat.DistinctCount = s.RowCount - colStat.NullCount
(and move it below the calculation of colStat.NullCount.)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 333 at r1 (raw file):
Previously, rytaft wrote…
Why is this equal to 1? I don't think we can make any assumptions about the null count for lax keys. You probably want to calculate this using unknownNullCountRatio.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 337 at r1 (raw file):
Previously, rytaft wrote…
Are we ever hitting this case? I thought we handled this case elsewhere (but I'm probably mistaken). Either way, you should move this case up above, so we don't call ColsAreLaxKey if it's not necessary.
Done. Removed it since we weren't hitting it.
pkg/sql/opt/memo/statistics_builder.go, line 359 at r1 (raw file):
Previously, rytaft wrote…
I think we are defining null count to be at least one column in the column set is null. This seems to be calculating the null count assuming all columns must be null.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 447 at r1 (raw file):
Previously, rytaft wrote…
[nit] I think inputRowCount would be a bit more consistent with the naming in the rest of this file. You could also just use inputStats.RowCount to avoid creating a new variable.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 841 at r1 (raw file):
Previously, rytaft wrote…
I believe by convention we are not treating NULL as part of the distinct count. So if all values are NULL, distinct count should be 0. So this should be no larger than (RowCount - DistinctCount).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 925 at r1 (raw file):
Previously, rytaft wrote…
See comment above - I think it should be nulls in either column set
Done.
pkg/sql/opt/memo/statistics_builder.go, line 933 at r1 (raw file):
Previously, rytaft wrote…
See comment above - I think it should be (RowCount - DistinctCount)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 973 at r1 (raw file):
Previously, rytaft wrote…
It's possible that the output of the scalar group by could be NULL (feel free to just add a TODO here so we don't forget about it).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1048 at r1 (raw file):
Previously, rytaft wrote…
Why are you reducing to 1 here? I think this is only valid for union, intersect and except, not for the ALL variants.
Done. Special-cased the set and bag operations separately.
pkg/sql/opt/memo/statistics_builder.go, line 1055 at r1 (raw file):
Previously, rytaft wrote…
I think this should be the sum here
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1256 at r1 (raw file):
Previously, rytaft wrote…
Unless def.ColID is the only column, I'm not sure we can say NullCount is definitely zero.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1593 at r1 (raw file):
Previously, rytaft wrote…
Is there any reason you can't just use the NotNullCols logical property? I don't think you need to look at the constraint at all (that was already done in logicalPropsBuilder).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1877 at r1 (raw file):
Previously, rytaft wrote…
Do we ever hit this case?
Yes, it seems we do hit it in some cases. And since we divide by rowCount here, it's best to short circuit it early on.
Force-pushed fa3e9fb to 2690406.
Still looking, but added a few comments to get started
Reviewed 2 of 37 files at r1.
pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):
query TTTTT
EXPLAIN (TYPES) SELECT min(y) FROM xyz WHERE x = 1
Why did these plans change? It seems like the new plans are worse than the old ones.
pkg/sql/opt/memo/statistics_builder.go, line 329 at r2 (raw file):
// If some of the columns are a lax key, the distinct count equals the row
// count.
Update comment to mention null count.
pkg/sql/opt/memo/statistics_builder.go, line 330 at r2 (raw file):
// If some of the columns are a lax key, the distinct count equals the row
// count.
if fd.ColsAreLaxKey(colSet) {
I think ColsAreLaxKey returns true for both strict and lax keys. So you should probably add a check:
if colSet.SubsetOf(notNullCols) { colStat.NullCount = 0 }
(similar to below).
pkg/sql/opt/memo/statistics_builder.go, line 618 at r2 (raw file):
// There are no columns in this expression, so it must be a constant.
colStat.DistinctCount = 1
colStat.NullCount = 0
Seems like this expression could also be NULL. Maybe you could check if it's a NullOp, and if so set colStat.NullCount to the row count?
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
colStat, _ = s.ColStats.Add(colSet)
colStat.DistinctCount = leftColStat.DistinctCount * rightColStat.DistinctCount
colStat.NullCount = leftColStat.NullCount * rightColStat.NullCount
Is this right? I'm having a hard time convincing myself what the right value should be.
pkg/sql/opt/memo/statistics_builder.go, line 919 at r2 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in either column sets.
[nit] column set
pkg/sql/opt/memo/statistics_builder.go, line 1123 at r2 (raw file):
colStat.DistinctCount = float64(len(distinct))
// We cannot read values, so just use the default
// null count ratio here.
I think you can check if the operator is NullOp.
pkg/sql/opt/memo/statistics_builder.go, line 1228 at r2 (raw file):
colStat, _ := ev.Logical().Relational.Stats.ColStats.Add(colSet)
colStat.DistinctCount = 1
colStat.NullCount = 0
Another one I'm not convinced should be 0.
pkg/sql/opt/memo/statistics_builder.go, line 1325 at r2 (raw file):
if s.RowCount == 1 {
	colStat.DistinctCount = 1
	colStat.NullCount = 0
ditto
pkg/sql/opt/memo/statistics_builder.go, line 1328 at r2 (raw file):
} else {
	colStat.DistinctCount = s.RowCount * unknownDistinctCountRatio
	colStat.NullCount = s.RowCount * math.Pow(unknownNullCountRatio, float64(colSet.Len()))
This needs to be updated.
pkg/sql/opt/memo/statistics_builder.go, line 1554 at r2 (raw file):
}
sb.updateNullCountsFromProps(ev, relProps)
I think this should be called outside applyConstraint so it will get called if a scan is unconstrained.
pkg/sql/opt/memo/statistics_builder.go, line 1571 at r2 (raw file):
numUnappliedConjuncts = 0
for i := 0; i < cs.Length(); i++ {
	sb.updateNullCountsFromProps(ev, relProps)
Same here - this should be moved out of applyConstraintSet.
pkg/sql/opt/memo/statistics_builder.go, line 1588 at r2 (raw file):
// updateNullCountsFromProps zeroes null counts for columns that cannot
// have nulls in them, usually due to a column property or an application
// of a null-excluding filter. The actual determinatino of non-nullable
determinatino -> determination
pkg/sql/opt/memo/statistics_builder.go, line 1589 at r2 (raw file):
// have nulls in them, usually due to a column property or an application
// of a null-excluding filter. The actual determinatino of non-nullable
// columns is done in the logical prop builder.
[nit] logical props builder
pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):
})
colStat.DistinctCount = min(distinctCount, s.RowCount)
colStat.NullCount = min(nullCount, s.RowCount)
In other places you are setting this to min(nullCount, s.RowCount-colStat.DistinctCount). I don't have a strong opinion about which is better, but at least we should be consistent.
pkg/sql/opt/memo/statistics_builder.go, line 473 at r2 (raw file):
}
return colStat
It may be worth adding the following check at the end of every colStatXXX function:
if colSet.SubsetOf(relProps.NotNullCols) { colStat.NullCount = 0 }
(See if it changes any test output -- alternatively, just check the code that's calculating NotNullCols inside logicalPropsBuilder, and see which operations can possibly change it from their input. Maybe these cases are already covered by the filters in Scan, Select and Join.)
pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
If colSet contains a single column, the null count equals min(1, inputColStat.NullCount). For example:
> create table t (x int, y int);
> insert into t values (null, 1), (null, 1), (1, 2), (1, null), (1, 2);
> select count(*), x from t group by x;
  count |  x
+-------+------+
      2 | NULL
      3 |    1
(2 rows)
Multi-column stats are more complicated, and I'm not really sure what the best approach is. See if you can come up with something that seems reasonable.... (happy to discuss ideas offline)
pkg/sql/opt/memo/statistics_builder.go, line 1058 at r2 (raw file):
// (the non-ALL variants).
leftNullCount := min(1, leftColStat.NullCount)
rightNullCount := min(1, rightColStat.NullCount)
This only works if each side has one column. I think you should be consistent with whatever you come up with for multi-column stats for grouping columns and apply that here.
pkg/sql/opt/memo/statistics_builder.go, line 1062 at r2 (raw file):
switch ev.Operator() {
case opt.UnionOp:
	colStat.NullCount = max(leftNullCount, rightNullCount)
I think this should probably be leftNullCount + rightNullCount. I think you can actually use the same formulas for the regular and ALL variants, and put them in the switch above. The difference will be that for the ALL variants, use:
leftNullCount = leftColStat.NullCount
rightNullCount = rightColStat.NullCount
And for the regular variants, use the multi-column stats for grouping columns estimation.
pkg/sql/opt/memo/statistics_builder.go, line 1070 at r2 (raw file):
colStat.NullCount = min(leftColStat.NullCount, rightColStat.NullCount)
case opt.ExceptOp:
	colStat.NullCount = max(0, leftNullCount-rightNullCount)
Seems like you might be under-estimating the null count here, especially in the case of more than one column. I'd probably just use leftNullCount, similar to the distinct count calculation above. It's definitely not perfect, but at least it's consistent.
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics for non-key
Whether or not the column is a key is not relevant for null count - only whether the column is nullable.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
) {
relProps.NotNullCols.ForEach(func(col int) {
	colStat := sb.ensureColStat(util.MakeFastIntSet(col), math.MaxFloat64, ev, relProps)
There is a potential problem here, which is that if a particular column is not already in s.ColStats, we are just copying the distinct count from its input. The distinct count should probably be scaled based on the selectivity of the filter, like we do in colStatScan, colStatSelect, etc. This also means that we may need to move this function call after any call to selectivityFromDistinctCounts. (But that kind of breaks the flow of the calling functions -- play around with it and see if you can fix the logic without hurting the flow too much)
pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):
// Find the minimum distinct and null counts for all columns in this equivalency
// group.
I'm not sure it's necessary to update null counts here. Equivalency groups are created based on predicates like x = y, which are null-rejecting. So both x and y will have their null counts set to 0 inside updateNullCountsFromProps. Although maybe I'm missing an edge case where x and y can both be null... it would be good to confirm before you rip this out.
pkg/sql/opt/memo/statistics_builder.go, line 1788 at r2 (raw file):
if inputStat.DistinctCount != 0 && colStat.DistinctCount < inputStat.DistinctCount {
	newDistinct := colStat.DistinctCount
	oldDistinct := inputStat.DistinctCount
[nit] Why don't you move these definitions outside the if statement so we can use them in the if statement as well?
pkg/sql/opt/memo/statistics_builder.go, line 1805 at r2 (raw file):
// i in
// {constrained
// columns}
Nice formula!
pkg/sql/opt/memo/statistics_builder.go, line 1828 at r2 (raw file):
panic("rowCount passed in was too small")
}
if inputStat.NullCount != 0 && colStat.NullCount < inputStat.NullCount {
Isn't it sufficient to check colStat.NullCount < inputStat.NullCount?
pkg/sql/opt/memo/statistics_builder.go, line 1969 at r2 (raw file):
// Cases of NULL in a constraint should be ignored. For example,
// without knowledge of the data distribution, /a: (/NULL - /10] should
// have the same estimated selectivity as /a: [/10 - ].
You should also mention that we are handling the selectivity of NULL constraints separately in selectivityFromNullCounts.
Yeah, what Rebecca said above is correct.
[nit] Just a tip for future changes - I found that for bigger changes, it helps to separate the code changes (and new tests) from the existing test changes (in different commits). It's easier to review, and it's easier to rebase (if you have a conflict in a test file, you can just nuke the entire commit and re-run tests with -rewrite).
pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in either column sets.
colStat.NullCount += lookupColStat.NullCount
I was looking at r1 and was going to suggest the formula rowcount * (f1 + f2 - f1 * f2), where f1 and f2 are the two null/rowcount fractions.
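The suggested formula can be sketched as a standalone helper (illustrative only; eitherNullCount is not a statisticsBuilder function):

```go
package main

import "fmt"

// eitherNullCount estimates, under an independence assumption, how many rows
// have a NULL in at least one of two column sets, given each set's null
// count and the total row count. It mirrors the suggested
// rowcount * (f1 + f2 - f1*f2), where f1 and f2 are null/rowcount fractions.
func eitherNullCount(null1, null2, rowCount float64) float64 {
	if rowCount == 0 {
		return 0
	}
	f1 := null1 / rowCount
	f2 := null2 / rowCount
	return rowCount * (f1 + f2 - f1*f2)
}

func main() {
	// 100 rows, 25 nulls in one column set, 50 in the other:
	// 100 * (0.25 + 0.5 - 0.125) = 62.5.
	fmt.Println(eitherNullCount(25, 50, 100)) // 62.5
}
```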
pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):
Previously, RaduBerinde wrote…
Great idea!
Force-pushed 2690406 to 09d255e.
Thanks for all your help so far @rytaft and @RaduBerinde! I should have addressed mostly everything major that was pointed out - the one major exception being the regression with indexed min(x)'s not being replaced with a limit anymore, which I'm still looking into.
I've pushed my work so far but feel free to hold off on the review until I've addressed that.
pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):
Previously, rytaft wrote…
Why did these plans change? It seems like the new plans are worse than the old ones.
Looking into this.
pkg/sql/opt/memo/statistics_builder.go, line 329 at r2 (raw file):
Previously, rytaft wrote…
Update comment to mention null count.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 330 at r2 (raw file):
Previously, rytaft wrote…
I think ColsAreLaxKey returns true for both strict and lax keys. So you should probably add a check:
if colSet.SubsetOf(notNullCols) { colStat.NullCount = 0 }
(similar to below).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):
Previously, rytaft wrote…
In other places you are setting this to min(nullCount, s.RowCount-colStat.DistinctCount). I don't have a strong opinion about which is better, but at least we should be consistent.
Done - moved to RowCount only.
pkg/sql/opt/memo/statistics_builder.go, line 618 at r2 (raw file):
Previously, rytaft wrote…
Seems like this expression could also be NULL. Maybe you could check if it's a NullOp, and if so set colStat.NullCount to the row count?
Done.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
Is this right? I'm having a hard time convincing myself what the right value should be.
Any ideas of alternatives? It's hard to deduce this without understanding how the rows interleave - but there may be a decent heuristic.
pkg/sql/opt/memo/statistics_builder.go, line 919 at r2 (raw file):
Previously, rytaft wrote…
[nit] column set
Done.
pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):
Previously, rytaft wrote…
Great idea!
Done.
pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
Previously, rytaft wrote…
If colSet contains a single column, the null count equals min(1, inputColStat.NullCount). For example:
> create table t (x int, y int);
> insert into t values (null, 1), (null, 1), (1, 2), (1, null), (1, 2);
> select count(*), x from t group by x;
  count |  x
+-------+------+
      2 | NULL
      3 |    1
(2 rows)
Multi-column stats are more complicated, and I'm not really sure what the best approach is. See if you can come up with something that seems reasonable.... (happy to discuss ideas offline)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1058 at r2 (raw file):
Previously, rytaft wrote…
This only works if each side has one column. I think you should be consistent with whatever you come up with for multi-column stats for grouping columns and apply that here.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1062 at r2 (raw file):
Previously, rytaft wrote…
I think this should probably be leftNullCount + rightNullCount. I think you can actually use the same formulas for the regular and ALL variants, and put them in the switch above. The difference will be that for the ALL variants, use:
leftNullCount = leftColStat.NullCount
rightNullCount = rightColStat.NullCount
And for the regular variants, use the multi-column stats for grouping columns estimation.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1070 at r2 (raw file):
Previously, rytaft wrote…
Seems like you might be under-estimating the null count here, especially in the case of more than one column. I'd probably just use leftNullCount, similar to the distinct count calculation above. It's definitely not perfect, but at least it's consistent.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1123 at r2 (raw file):
Previously, rytaft wrote…
I think you can check if the operator is NullOp.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1228 at r2 (raw file):
Previously, rytaft wrote…
Another one I'm not convinced should be 0
Done - went with unknownNullCountRatio
pkg/sql/opt/memo/statistics_builder.go, line 1325 at r2 (raw file):
Previously, rytaft wrote…
ditto
Done - went with unknownNullCountRatio
pkg/sql/opt/memo/statistics_builder.go, line 1328 at r2 (raw file):
Previously, rytaft wrote…
This needs to be updated
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
Previously, rytaft wrote…
whether or not the column is a key is not relevant for null count - only whether the column is nullable.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1554 at r2 (raw file):
Previously, rytaft wrote…
I think this should be called outside applyConstraint so it will get called if a scan is unconstrained.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1571 at r2 (raw file):
Previously, rytaft wrote…
Same here - this should be moved out of applyConstraintSet.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1588 at r2 (raw file):
Previously, rytaft wrote…
determinatio -> determination
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1589 at r2 (raw file):
Previously, rytaft wrote…
[nit] logical props builder
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, rytaft wrote…
There is a potential problem here, which is that if a particular column is not already in s.ColStats, we are just copying the distinct count from its input. The distinct count should probably be scaled based on the selectivity of the filter, like we do in colStatScan, colStatSelect, etc. This also means that we may need to move this function call after any call to selectivityFromDistinctCounts. (But that kind of breaks the flow of the calling functions -- play around with it and see if you can fix the logic without hurting the flow too much)
Is this an issue though? It seems like we only call this function after we've already called applyConstraint or applyConstraintSet - so either the distinct count has already been updated to match the constraint, or that particular row is unconstrained and will later have its counts modified when we call ApplySelectivity. I may need to dive in deeper to be confident with this solution though.
pkg/sql/opt/memo/statistics_builder.go, line 1788 at r2 (raw file):
Previously, rytaft wrote…
[nit] Why don't you move these definitions outside the if statement so we can use them in the if statement as well?
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1828 at r2 (raw file):
Previously, rytaft wrote…
Isn't it sufficient to check colStat.NullCount < inputStat.NullCount?
Done.
Nice changes! I've added a few more comments inline, but my big comment now is that I think you need to add some more tests in memo/testdata/stats that explicitly exercise all the changes you've made. Also, take a look at the stats in the existing tests and make sure you can convince yourself that they make sense.
Reviewed 2 of 37 files at r1, 6 of 37 files at r2, 33 of 33 files at r3.
pkg/sql/opt/memo/statistics_builder.go, line 332 at r2 (raw file):
if fd.ColsAreLaxKey(colSet) {
	colStat.NullCount = unknownNullCountRatio * s.RowCount
	colStat.DistinctCount = s.RowCount - colStat.NullCount
I think this old formula was better for this particular case (sorry if my comment below was confusing -- I didn't mean for you to change this one).
pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done - moved to RowCount only.
This is probably fine, but maybe we should see which one produces better plans. It's possible we should actually be doing something in between these two extremes.... It would be good to add some tests that explicitly exercise this case.
pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
Nice!
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
I'd just say:
// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Is this an issue though? It seems like we only call this function after we've already called applyConstraint or applyConstraintSet - so either the distinct count has already been updated to match the constraint, or that particular row is unconstrained and will later have its counts modified when we call ApplySelectivity. I may need to dive in deeper to be confident with this solution though.
The problem is you're not calling ApplySelectivity on this particular colStat. Once it's in the cache it doesn't get updated again. So I think you need to add a call to ApplySelectivity here, but make sure that doesn't cause the final selectivity to change.
pkg/sql/opt/memo/statistics_builder.go, line 195 at r3 (raw file):
// statsFromChild retrieves the main statistics struct from a specific child
// of the given expression.
func (sb *statisticsBuilder) statFromChild(ev ExprView, childIdx int) *props.Statistics {
[nit] statFromChild -> statsFromChild
pkg/sql/opt/memo/statistics_builder.go, line 940 at r3 (raw file):
// Assuming null columns are completely independent, calculate
// the expected value of having nulls in either column set.
f1 := lookupColStat.NullCount / tableStats.RowCount
Since ApplySelectivity updates the null count, I think this denominator should be inputStats.RowCount.
pkg/sql/opt/memo/statistics_builder.go, line 1106 at r3 (raw file):
case opt.ExceptOp, opt.ExceptAllOp:
	colStat.DistinctCount = leftColStat.DistinctCount
	colStat.NullCount = leftColStat.NullCount
leftColStat.NullCount -> leftNullCount
pkg/sql/opt/memo/statistics_builder_test.go, line 177 at r3 (raw file):
statsFunc(
	cs2,
	"[rows=3.33333333e+09, distinct(2)=500, null(2)=0]",
This is an example of the issue where the distinct count of (2) is just copied from the input as-is. It should be scaled with a call to ApplySelectivity.
pkg/sql/opt/memo/testdata/stats/select, line 199 at r3 (raw file):
# Hide stats because when run with testrace, more stats are populated than
# otherwise. See comment in MemoizeDenormExpr in memo.go.
opt format=hide-stats
One thing you can do to get around this is specifically request certain column stats using the colstats directive. I think that will allow you to ensure you get the same results for test and testrace.
pkg/sql/opt/props/statistics.go, line 84 at r3 (raw file):
s.RowCount = 0
for i, n := 0, s.ColStats.Count(); i < n; i++ {
	s.ColStats.Get(i).DistinctCount = 0
Update the null count here too.
pkg/sql/opt/props/statistics.go, line 165 at r3 (raw file):
c.DistinctCount = d - d*math.Pow(1-selectivity, n/d)
// Since the null count is a more simpler count of all null rows, we can
[nit] more simpler -> simple
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Any ideas of alternatives? It's hard to deduce this without understanding how the rows interleave - but there may be a decent heuristic.
I think we should follow a similar approach to how we calculated the row count. So there are 3 steps:
- Estimate the null count for the cross join
- Scale the null count based on the selectivity of the filter
- Adjust if needed for outer joins
For 1, I think the formula is something like:
```go
crossJoinNullCount := leftColStat.NullCount * s.RowCount + rightColStat.NullCount * s.RowCount - leftColStat.NullCount * rightColStat.NullCount
```

For 2, you can just call `colStat.ApplySelectivity`.

For 3, take a look at what we're doing in `buildJoin`. For example, for left join, you should adjust the null count like this:

```go
colStat.NullCount = max(innerJoinNullCount, leftColStat.NullCount)
```
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
```go
} else {
	inputRowCount := sb.statFromChild(ev, 0 /* childIdx */).RowCount
	colStat.NullCount = ((colStat.DistinctCount + 1) / inputRowCount) * inputColStat.NullCount
```
Thinking about this more, I think this formula makes more sense if we include null values in the distinct count. I think that would make a lot of things here easier to reason about. @RaduBerinde, is there any reason we can't treat NULL as another distinct value? For example, consider the following data:
```
  a  |  b
-----+-----
NULL |  1
NULL |  1
NULL | NULL
  1  |  2
  3  |  4
```
I think we should treat this as having DistinctCount=4 and NullCount=3. Does that seem reasonable?
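Under the semantics proposed here -- NULL participates in distinctness, and a tuple counts as null when at least one column in the set is NULL -- the example data does work out to DistinctCount=4 and NullCount=3. A quick standalone check (the type and function names are invented for illustration):

```go
package main

import "fmt"

// tuple models one row of the (a, b) column set; nil means NULL.
type tuple struct{ a, b *int }

func v(i int) *int { return &i }

// stats returns (distinctCount, nullCount) under the proposed semantics:
// NULL is its own distinct value, and a tuple is "null" if at least one
// of its columns is NULL.
func stats(rows []tuple) (int, int) {
	seen := map[string]bool{}
	nulls := 0
	str := func(p *int) string {
		if p == nil {
			return "NULL"
		}
		return fmt.Sprint(*p)
	}
	for _, r := range rows {
		seen[str(r.a)+"|"+str(r.b)] = true
		if r.a == nil || r.b == nil {
			nulls++
		}
	}
	return len(seen), nulls
}

func main() {
	rows := []tuple{
		{nil, v(1)}, {nil, v(1)}, {nil, nil}, {v(1), v(2)}, {v(3), v(4)},
	}
	d, n := stats(rows)
	fmt.Println(d, n) // prints: 4 3
}
```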
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, rytaft wrote…
Thinking about this more, I think this formula makes more sense if we include null values in the distinct count. I think that would make a lot of things here easier to reason about. @RaduBerinde, is there any reason we can't treat NULL as another distinct value? For example, consider the following data:
```
  a  |  b
-----+-----
NULL |  1
NULL |  1
NULL | NULL
  1  |  2
  3  |  4
```
I think we should treat this as having DistinctCount=4 and NullCount=3. Does that seem reasonable?
Sure. We'll need to update `sampler.go` to put the NULLs in the sketch too but it's a simple change.
We can also change what "null" means in the multi-column case if the other definition makes more sense. We aren't even calculating multi-column stats yet so only some comments would need to be updated.
Force-pushed from 6383db0 to b90a309
Should have addressed all comments and issues pointed out so far. Still going through the test output for anything I may have missed, and gonna be writing more tests in `opt/memo/testdata/stats` next to exercise the non-zero null count cases. There's no observable difference in `pkg/sql/opt/bench` benchmarks, at least after benchstat-ing 5 runs.
pkg/sql/opt/memo/statistics_builder.go, line 332 at r2 (raw file):
Previously, rytaft wrote…
I think this old formula was better for this particular case (sorry if my comment below was confusing -- I didn't mean for you to change this one).
Done.
pkg/sql/opt/memo/statistics_builder.go, line 473 at r2 (raw file):
Previously, rytaft wrote…
It may be worth adding the following check at the end of every `colStatXXX` function:

```go
if colSet.SubsetOf(relProps.NotNullCols) {
	colStat.NullCount = 0
}
```

(See if it changes any test output -- alternatively, just check the code that's calculating `NotNullCols` inside `logicalPropsBuilder`, and see which operations can possibly change it from their input. Maybe these cases are already covered by the filters in `Scan`, `Select` and `Join`.)
Done.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
I think we should follow a similar approach to how we calculated the row count. So there are 3 steps:
- Estimate the null count for the cross join
- Scale the null count based on the selectivity of the filter
- Adjust if needed for outer joins
For 1, I think the formula is something like:
```go
crossJoinNullCount := leftColStat.NullCount * s.RowCount + rightColStat.NullCount * s.RowCount - leftColStat.NullCount * rightColStat.NullCount
```

For 2, you can just call `colStat.ApplySelectivity`.

For 3, take a look at what we're doing in `buildJoin`. For example, for left join, you should adjust the null count like this:

```go
colStat.NullCount = max(innerJoinNullCount, leftColStat.NullCount)
```
Done - but feel free to suggest more changes since it's a fairly complex expression. The test stat output generated by it seems reasonable though.
pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
Previously, rytaft wrote…
I'd just say:
```go
// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics.
```
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):
Previously, rytaft wrote…
The problem is you're not calling `ApplySelectivity` on this particular `colStat`. Once it's in the cache it doesn't get updated again. So I think you need to add a call to `ApplySelectivity` here, but make sure that doesn't cause the final selectivity to change.
Done. Required moving `updateNullCountsFromProps` after really every other `selectivityFrom*` call - not sure if that's an ideal idiom going forward but it works reliably.
pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):
Previously, rytaft wrote…
I'm not sure it's necessary to update null counts here. Equivalency groups are created based on predicates like `x = y`, which are null-rejecting. So both `x` and `y` will have their null counts set to 0 inside `updateNullCountsFromProps`. Although maybe I'm missing an edge case where x and y can both be null... it would be good to confirm before you rip this out.
Would that still work for null counts? If I add a panic for the case where one of the equivGroup cols is not in `NotNullCols`, it does panic - so maybe we do need to propagate non-zero null counts in some cases?
pkg/sql/opt/memo/statistics_builder.go, line 1969 at r2 (raw file):
Previously, rytaft wrote…
You should also mention that we are handling the selectivity of NULL constraints separately in `selectivityFromNullCounts`.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 195 at r3 (raw file):
Previously, rytaft wrote…
[nit] statFromChild -> statsFromChild
Done.
pkg/sql/opt/memo/statistics_builder.go, line 940 at r3 (raw file):
Previously, rytaft wrote…
Since `ApplySelectivity` updates null count, I think this denominator should be `inputStats.RowCount`.
Done.
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, RaduBerinde wrote…
Sure. We'll need to update `sampler.go` to put the NULLs in the sketch too but it's a simple change. We can also change what "null" means in the multi-column case if the other definition makes more sense. We aren't even calculating multi-column stats yet so only some comments would need to be updated.
Can this be part of a follow up change? It's not necessary for the introduction of null counts, and it can just be a simple reorganization once this behemoth is in.
pkg/sql/opt/memo/statistics_builder.go, line 1106 at r3 (raw file):
Previously, rytaft wrote…
leftColStat.NullCount -> leftNullCount
Done.
pkg/sql/opt/memo/statistics_builder_test.go, line 177 at r3 (raw file):
Previously, rytaft wrote…
This is an example of the issue where the distinct count of (2) is just copied from the input as-is. It should be scaled with a call to `ApplySelectivity`.
Done. The exact output here didn't change even after the fix because the distinct count is so much smaller than the row count that the `ApplySelectivity` formula barely modifies it.
pkg/sql/opt/memo/testdata/stats/select, line 199 at r3 (raw file):
Previously, rytaft wrote…
One thing you can do to get around this is specifically request certain column stats using the `colstats` directive. I think that will allow you to ensure you get the same results for `test` and `testrace`.
That didn't really help, since that only works on the root expression. If exploration rules swap out a child expression, colstats has no impact on that output. I added an additional call to `logPropsBuilder.buildProps` in the tester that's run on every child to fix this.
pkg/sql/opt/props/statistics.go, line 84 at r3 (raw file):
Previously, rytaft wrote…
update null count here too
Done.
pkg/sql/opt/props/statistics.go, line 165 at r3 (raw file):
Previously, rytaft wrote…
[nit] more simpler -> simple
Done.
Hi Bilal, great work on a tedious change! I have some comments, I'll do a final review after your next update.
pkg/sql/opt/memo/memo.go, line 433 at r4 (raw file):
```go
// DeriveLogicalProps derives logical props for the specified expression,
// usually resulting in population of child stats in the test output tree.
func DeriveLogicalProps(evalCtx *tree.EvalContext, ev ExprView) {
```
[nit] The comment should explain why/when this should be used. It's a very narrow use-case and it should be explicit that it's just for testing.
pkg/sql/opt/memo/memo.go, line 507 at r4 (raw file):
```go
m.groups = append(m.groups, makeMemoGroup(tmpGroupID, denorm))
ev := MakeNormExprView(m, tmpGroupID)
// Building out logical props could lead to more lazy population of
```
One possibility is to set a flag in logPropsBuilder to skip the building of stats (this code doesn't care about stats).
pkg/sql/opt/memo/memo.go, line 511 at r4 (raw file):
```go
// when run with and without -race.
//
// TODO: Figure out a way to either run this for all tests, or
```
[nit] usually TODO(yourname)
(btw the name isn't indicative of who is supposed to work on it, it's really indicative of who one should ask for clarifications)
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done - but feel free to suggest more changes since it's a fairly complex expression. The test stat output generated by it seems reasonable though.
I don't understand the formula (it doesn't look right to me). Can you add some comments explaining it?

`s.RowCount` is the number of estimated rows of the join, right? Say rightNullCount is 0 and leftNullCount is 100. We would have 100 times more nulls than rows? Or say `s.RowCount` is very small, this whole thing can easily turn negative.

I don't understand why we're multiplying `s.RowCount` with the count of the nulls (and not the fraction of rows that have nulls). The formula I would use would be similar to the one in `colStatIndexJoin` (`s.RowCount * (f1 + f2 - f1 * f2)`).

Also, what test stat output are you looking at specifically? This is only for multi-column stats, which are unused unless we explicitly request them in tests using `colstat=(1,2,3)`.
pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Can this be part of a follow up change? It's not necessary for the introduction of null counts, and it can just be a simple reorganization once this behemoth is in.
Absolutely. You can also file an issue if it's not something you want to work on right away.
pkg/sql/opt/props/statistics.go, line 135 at r4 (raw file):
```go
// DistinctCount is the estimated number of distinct values of this
// set of columns for this expression.
```
[nit] Mention here how this interacts with NullCount (specifically mention that any rows that have at least a null don't contribute to this count). And maybe leave a TODO to consider changing things so `DistinctCount` counts these as well (as Becca suggested).
pkg/sql/opt/testutils/opt_tester.go, line 341 at r4 (raw file):
```go
// Derive logical props for all child expressions, even
// denormalized ones - to make statistics look complete
```
[nit] This could use a bit more explanation, it's not enough for someone looking at this for the first time. Explain that we're not interested in the logical props, but calculating them (specifically the statistics) will trigger lazy calculation of column stats in child expressions.
Force-pushed from 0e030ac to 31a73eb
Thanks for the review Radu! I've added more tests that exercise many of the complex parts of this. Also addressed the panics we were seeing.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, RaduBerinde wrote…
I don't understand the formula (it doesn't look right to me).. Can you add some comments explaining it?

`s.RowCount` is the number of estimated rows of the join right? Say rightNullCount is 0 and leftNullCount is 100. We would have 100 times more nulls than rows? Or say `s.RowCount` is very small, this whole thing can easily turn negative.

I don't understand why we're multiplying `s.RowCount` with the count of the nulls (and not the fraction of rows that have nulls). The formula I would use would be similar to the one in `colStatIndexJoin` (`s.RowCount * (f1 + f2 - f1 * f2)`).

Also, what test stat output are you looking at specifically? This is only for multi-column stats which are unused unless we explicitly request them in tests using `colstat=(1,2,3)`
That makes more sense, actually, to use the independent probability rule. I've added a simple explanation in the comments.
I've also added a basic test case with multi-col stats across join columns so you can see the output, but it doesn't seem right just yet. I'll keep looking into this tomorrow.
Force-pushed from 31a73eb to f5594ad
Reviewed 1 of 31 files at r4.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
That makes more sense, actually, to use the independent probability rule. I've added a simple explanation in the comments.
I've also added a basic test case with multi-col stats across join columns so you can see the output, but it doesn't seem right just yet. I'll keep looking into this tomorrow.
Sorry, I screwed up the formula for the null count in the cross join in my formula. I think it should be:
```go
crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount
```
Does that make more sense? Maybe Radu's solution is still better - haven't had time to think about it much, but I think this new formula should also work.
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
Sorry, I screwed up the formula for the null count in the cross join in my formula. I think it should be:
```go
crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount
```
Does that make more sense? Maybe Radu's solution is still better - haven't had time to think about it much, but I think this new formula should also work.
Your formula is equivalent to `leftRowCount * rightRowCount * (f1 + f2 - f1 * f2)`, which becomes `s.RowCount * (f1 + f2 - f1 * f2)` (my formula) after multiplying with `Selectivity` (right?). So either one works, just don't use my formula and then also apply `Selectivity` :)
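The equivalence of the two forms is easy to spot-check numerically. A minimal sketch with invented counts (keeping Radu's caveat in mind: the fractional form already includes the cross-join row count, so selectivity should only be applied once afterward):

```go
package main

import (
	"fmt"
	"math"
)

// expanded is the form from the review comment: nulls contributed by each
// side of the cross join, minus the double-counted "both null" rows.
func expanded(leftNull, rightNull, leftRows, rightRows float64) float64 {
	return leftNull*rightRows + rightNull*leftRows - leftNull*rightNull
}

// fractional is the equivalent form: combine the per-side null fractions
// with the independence rule, then scale by the cross-join row count.
func fractional(leftNull, rightNull, leftRows, rightRows float64) float64 {
	f1 := leftNull / leftRows
	f2 := rightNull / rightRows
	return leftRows * rightRows * (f1 + f2 - f1*f2)
}

func main() {
	a := expanded(100, 50, 1000, 500)
	b := fractional(100, 50, 1000, 500)
	fmt.Printf("%.2f %.2f agree=%v\n", a, b, math.Abs(a-b) < 1e-6)
}
```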
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file): Previously, RaduBerinde wrote…
You’re totally right - thanks, Radu! 👍
Force-pushed from f5594ad to 3609c21
pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):
Previously, rytaft wrote…
You’re totally right - thanks, Radu! 👍
Whoops, my bad about the double application of selectivity. Fixed - please take another look! The stats in test files (specifically the ones at the bottom of `memo/testdata/stats/join`) look better now.
pkg/sql/opt/testutils/opt_tester.go, line 341 at r4 (raw file):
Previously, RaduBerinde wrote…
[nit] This could use a bit more explanation, it's not enough for someone looking at this for the first time. Explain that we're not interested in the logical props, but calculating them (specifically the statistics) will trigger lazy calculation of column stats in child expressions.
Done
pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Looking into this.
Done. We were not processing selectivity for one of the nullable-but-null-rejected columns due to an input FD.
Force-pushed from 26815e3 to f3e803c
pkg/sql/opt/memo/statistics_builder.go, line 821 at r11 (raw file):
Previously, rytaft wrote…
move this switch inside the loop (just above `adjustNullCountsForOuterJoins`) so you still call `innerJoinNullCount` for inner joins.
Done.
Reviewed 1 of 1 files at r13.
pkg/sql/opt/memo/statistics_builder.go, line 818 at r13 (raw file):
```go
// Loop through all colSets added in this step, and adjust null counts
// for outer join operations accordingly.
```
Update comment - this is not just updating outer joins anymore
Force-pushed from f3e803c to a741216
pkg/sql/opt/memo/statistics_builder.go, line 818 at r13 (raw file):
Previously, rytaft wrote…
Update comment - this is not just updating outer joins anymore
Done.
Getting close now! I think you still need to fix up the tests a bit. I really like that you bumped up the null counts for some additional tests, but I think you need to be a bit more deliberate about which queries you choose to test with. Instead of copying a bunch of queries from the existing set of tests, try to think of a small number of new ones that will test some specific part of your code (or reuse a query if you can explain in a comment why it's interesting for null counts). Try to get as much coverage as possible for all the different edge cases you are considering.
I'm making a bit of a fuss about the tests because I think it's concerning that the test output in the join test file hasn't changed since r9 even though you've made several non-trivial changes to the code since then. Also it's difficult to review when there are a lot of new tests and it's not clear what they are testing.
Reviewed 1 of 22 files at r6, 1 of 7 files at r8, 1 of 8 files at r9, 1 of 1 files at r14.
pkg/sql/opt/memo/testdata/stats/scan, line 314 at r14 (raw file):
```
└── scan a
    ├── columns: y:2(int) s:3(string)
    └── stats: [rows=3000, distinct(2,3)=1000, null(2,3)=2000]
```
null(2,3) should probably be less than 2000. I think you need to add some logic in `colStatLeaf` to subtract the expected number of collisions (or feel free to add a TODO)
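For intuition only: if the multi-column null count means rows where every column in the set is NULL, independence gives an estimate well below 2000 for numbers like these. The per-column null counts below (2000 each out of 3000 rows) are invented for illustration, and the "at least one NULL" interpretation discussed earlier in the thread would combine the fractions as f1 + f2 - f1*f2 instead:

```go
package main

import "fmt"

// allNullCount estimates rows where both columns are NULL, assuming the
// two columns' null fractions are independent. Interpreting a
// multi-column null as "all columns NULL" is an assumption here, not a
// statement about what the PR's colStatLeaf actually does.
func allNullCount(rows, null1, null2 float64) float64 {
	return rows * (null1 / rows) * (null2 / rows)
}

func main() {
	// 3000 rows, 2000 NULLs per column -> 3000 * (2/3)^2 ≈ 1333.
	fmt.Printf("%.0f\n", allNullCount(3000, 2000, 2000))
}
```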
Force-pushed from a741216 to caf6edb
Thanks for the review! The emphasis on tests makes sense. I've added more tests (+ more colStats on existing tests) that should cover all the join cases, but I'll be sifting through it all more closely on Monday. Also need to resolve a 20-file merge conflict with master, now that the memo change is in.
pkg/sql/opt/memo/testdata/stats/scan, line 314 at r14 (raw file):
Previously, rytaft wrote…
null(2,3) should probably be less than 2000. I think you need to add some logic in `colStatLeaf` to subtract the expected number of collisions (or feel free to add a TODO)
Added a simple collision factor calculation that works well for 2-column cases but starts to underestimate collisions for larger sets of columns. I've added a TODO to think of a better general solution for n-way collisions.
Force-pushed from 0a3435b to a33c8b3
Thanks for the reviews so far! I've resolved the gigantic merge conflict with master - but in the process I had to comment out
Force-pushed from a33c8b3 to eb21d96
Thanks for all your work on this -- that rebase was probably a pain. If you haven't already, please open an issue for changing distinct counts to include null values. Feel free to assign it to me.
You still need more tests -- I think at least one new test to cover each operator that you changed as part of this PR (e.g., are nulls getting counted correctly in `VALUES`? `PROJECT`?)
Reviewed 33 of 33 files at r16.
Force-pushed from eb21d96 to 95f6b9e
Just added:
- Tests for most missing operations
- Fix for testrace failure due to extra logprop building (ended up going with Radu's suggestion to disable stat building in `checkExpr`, formerly in `MemoizeDenormExpr`). The bug ended up being a side-effect of the fact that, in the new memo, we preemptively delete references to logical prop builder and metadata once we know optimization is complete - so it's too late to build more logical props inside opt_tester.go after that point. Simply not building extra stats in `testrace` is the easier way out.
- One minor fix in `colStatSetNode` uncovered by tests
pkg/sql/opt/memo/memo.go, line 507 at r4 (raw file):
Previously, RaduBerinde wrote…
One possibility is to set a flag in logPropsBuilder to skip the building of stats (this code doesn't care about stats).
Done.
I like the new tests! I think you just need to update one of the tests for project.
Reviewed 10 of 10 files at r17.
pkg/sql/opt/memo/testdata/stats/project, line 260 at r17 (raw file):
```
# Test that the stats are calculated correctly for synthesized columns.
build
SELECT * FROM (SELECT concat(s, y::string) FROM a) AS q(v) WHERE v = 'foo'
```
Can you also add a synthesized column that is `NULL`?
pkg/sql/opt/memo/testdata/stats/set, line 629 at r17 (raw file):
]' ----
Nice tests!
This change takes null counts that are already collected in table statistics, and propagates them through the different operators in statistics_builder. It also uses these null counts to generate selectivities which are then applied to the row count.

The expectation is that this change will lead to much better cardinality estimation, especially in workloads involving lots of null values that get filtered out by an operator or constraint. Previously, we treated null counts like any other distinct value.

Fixes cockroachdb#30289

Release note: None
Force-pushed from 95f6b9e to a8be7a6
pkg/sql/opt/memo/testdata/stats/project, line 260 at r17 (raw file):
Previously, rytaft wrote…
Can you also add a synthesized column that is `NULL`?
Done. Also fixed an oversight in `colStatProject` in the process.
congrats on pushing this one over the finish line!
Reviewed 2 of 2 files at r18.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
👎 Rejected by code reviews
bors r+
30827: opt: Propagate null counts through statsbuilder r=itsbilal a=itsbilal

NOTE: This is a large diff with mostly minor test changes. Implementation work is confined to just these files:

- [opt/memo/statistics_builder.go](pkg/sql/opt/memo/statistics_builder.go)
- [opt/metadata.go](pkg/sql/opt/metadata.go)
- [opt/props/statistics.go](pkg/sql/opt/props/statistics.go)

This change takes null counts that are already collected in table statistics, and propagates them through the different operators in statistics_builder. It also uses these null counts to generate selectivities which are then applied to the row count.

The expectation is that this change will lead to much better cardinality estimation, especially in workloads involving lots of null values that get filtered out by an operator or constraint. Previously, we treated null counts like any other distinct value.

Fixes #30289

Release note: None

Co-authored-by: Bilal Akhtar <me@itsbilal.com>
Build succeeded
NOTE: This is a large diff with mostly minor test changes. Implementation work is confined to just these files:

- opt/memo/statistics_builder.go
- opt/metadata.go
- opt/props/statistics.go
This change takes null counts that are already collected in table
statistics, and propagates them through the different operators in
statistics_builder. It also uses these null counts to generate
selectivities which are then applied to the row count.
The expectation is that this change will lead to much better cardinality
estimation, especially in workloads involving lots of null values that
get filtered out by an operator or constraint. Previously, we treated
null counts like any other distinct value.
Fixes #30289
Release note: None