opt: Propagate null counts through statsbuilder #30827

Merged
merged 1 commit into cockroachdb:master from null-counts-statbuilder on Oct 23, 2018

Conversation

Member

@itsbilal itsbilal commented Oct 1, 2018

NOTE: This is a large diff with mostly minor test changes. Implementation work is confined to just these files:

- opt/memo/statistics_builder.go
- opt/metadata.go
- opt/props/statistics.go

This change takes null counts that are already collected in table
statistics, and propagates them through the different operators in
statistics_builder. It also uses these null counts to generate
selectivities which are then applied to the row count.

The expectation is that this change will lead to much better cardinality
estimation, especially in workloads involving lots of null values that
get filtered out by an operator or constraint. Previously, we treated
null counts like any other distinct value.
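
As a rough, hypothetical sketch of the idea (this is not the actual statistics_builder code): a null-rejecting filter lets the optimizer discount the estimated NULL rows when scaling the row count.

```go
package main

import "fmt"

// selectivityFromNullCount is a simplified, hypothetical illustration: a
// null-rejecting predicate on a column keeps only the non-NULL rows, so its
// selectivity is the fraction of rows that are not NULL.
func selectivityFromNullCount(rowCount, nullCount float64) float64 {
	if rowCount <= 0 {
		return 1
	}
	return (rowCount - nullCount) / rowCount
}

func main() {
	rowCount, nullCount := 1000.0, 400.0 // made-up table statistics
	sel := selectivityFromNullCount(rowCount, nullCount)
	fmt.Printf("estimated rows after a null-rejecting filter: %.0f\n", rowCount*sel) // 600
}
```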

Fixes #30289

Release note: None

@itsbilal itsbilal self-assigned this Oct 1, 2018
@itsbilal itsbilal requested a review from a team as a code owner October 1, 2018 14:43
@cockroach-teamcity
Member

This change is Reviewable

@itsbilal itsbilal force-pushed the null-counts-statbuilder branch 2 times, most recently from e6f2e47 to fa3e9fb Compare October 1, 2018 14:48
Collaborator

@rytaft rytaft left a comment

This is a great start!

I did a first pass -- my main comments are:

  1. For multi-column stats, I believe we are determining null count based on whether at least one column in the column set is null. I think your code assumes that all columns must be null.
  2. We are not including null values in the distinct count. So if all values in a column are null, distinct count is 0.
  3. We've already determined which columns are not null in the logicalPropsBuilder. So you should be able to just use relProps.NotNullCols.
  4. There are a few places where you are setting NullCount to 0 where I think that might be a bit premature. I marked a couple of them, but it might be good to cross-reference against relProps.NotNullCols.

@RaduBerinde - can you confirm points 1 and 2?
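
To make points 1 and 2 concrete, here is a hypothetical toy helper (not code from this PR) showing the intended conventions for a multi-column set: NullCount counts rows where at least one column is NULL, and rows containing a NULL do not contribute to DistinctCount.

```go
package main

import "fmt"

// columnSetStats illustrates points 1 and 2 above. It is a toy helper over
// in-memory rows, not anything in the optimizer.
func columnSetStats(rows [][]interface{}) (distinctCount, nullCount int) {
	seen := map[string]bool{}
	for _, row := range rows {
		hasNull := false
		key := ""
		for _, v := range row {
			if v == nil {
				hasNull = true
				break
			}
			key += fmt.Sprint(v) + "|"
		}
		if hasNull {
			nullCount++ // at least one column in the set is NULL
			continue    // rows with a NULL are not counted as distinct values
		}
		if !seen[key] {
			seen[key] = true
			distinctCount++
		}
	}
	return distinctCount, nullCount
}

func main() {
	rows := [][]interface{}{{nil, 1}, {1, 2}, {1, 2}, {2, 3}}
	fmt.Println(columnSetStats(rows)) // 2 1
}
```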

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/metadata.go, line 169 at r1 (raw file):

	// notnull is whether this column is guaranteed to not hold nulls.
	notnull bool

I don't think we need to store this in the metadata. We are already storing it in the logical properties.


pkg/sql/opt/memo/statistics_builder.go, line 332 at r1 (raw file):

	// duplicates where all columns are NULL.
	if fd.ColsAreLaxKey(colSet) {
		colStat.DistinctCount = s.RowCount

I think you should update this to be:

colStat.DistinctCount = s.RowCount - colStat.NullCount

(and move it below calculation of colStat.NullCount)


pkg/sql/opt/memo/statistics_builder.go, line 333 at r1 (raw file):

	if fd.ColsAreLaxKey(colSet) {
		colStat.DistinctCount = s.RowCount
		colStat.NullCount = 1

Why is this equal to 1? I don't think we can make any assumptions about the null count for lax keys. You probably want to calculate this using unknownNullCountRatio.


pkg/sql/opt/memo/statistics_builder.go, line 337 at r1 (raw file):

	}

	if s.RowCount == 0 {

Are we ever hitting this case? I thought we handled this case elsewhere (but I'm probably mistaken). Either way, you should move this case up above, so we don't call ColsAreLaxKey if it's not necessary.


pkg/sql/opt/memo/statistics_builder.go, line 359 at r1 (raw file):

			colStatLeaf := sb.colStatLeaf(util.MakeFastIntSet(i), s, fd)
			distinctCount *= colStatLeaf.DistinctCount
			nullFactor *= colStatLeaf.NullCount / s.RowCount

I think we are defining null count to be at least one column in the column set is null. This seems to be calculating the null count assuming all columns must be null.
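
Under the "at least one column is NULL" definition, and assuming independence across columns, the per-column null fractions would combine as 1 minus the product of the non-null fractions, rather than as a product of the null fractions. A sketch of the adjusted calculation (hypothetical helper, not the final code in this PR):

```go
// nullCountForColumnSet sketches the combination of per-column null fractions
// under an independence assumption: the probability that at least one column
// is NULL is 1 - Π(1 - f_i), whereas Π f_i is the probability that all of
// them are NULL.
func nullCountForColumnSet(rowCount float64, nullFractions []float64) float64 {
	notNull := 1.0
	for _, f := range nullFractions { // f_i = colStatLeaf.NullCount / s.RowCount
		notNull *= 1 - f
	}
	return rowCount * (1 - notNull)
}
```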


pkg/sql/opt/memo/statistics_builder.go, line 447 at r1 (raw file):

		// Calculate row count and selectivity
		// -----------------------------------
		savedRowCount := s.RowCount

[nit] I think inputRowCount would be a bit more consistent with the naming in the rest of this file. You could also just use inputStats.RowCount to avoid creating a new variable.


pkg/sql/opt/memo/statistics_builder.go, line 841 at r1 (raw file):

			colStat.DistinctCount = s.RowCount
		}
		// Similarly, the null count should be no larger than (RowCount - DistinctCount + 1).

I believe by convention we are not treating NULL as part of the distinct count. So if all values are NULL, distinct count should be 0.

So this should be no larger than (RowCount - DistinctCount)


pkg/sql/opt/memo/statistics_builder.go, line 925 at r1 (raw file):

		// Assuming null columns are completely independent, calculate
		// the expected value of having nulls in both column sets.

See comment above - I think it should be nulls in either column set


pkg/sql/opt/memo/statistics_builder.go, line 933 at r1 (raw file):

		colStat.DistinctCount = s.RowCount
	}
	// Similarly, the null count should be no larger than (RowCount - DistinctCount + 1).

See comment above - I think it should be (RowCount - DistinctCount)


pkg/sql/opt/memo/statistics_builder.go, line 973 at r1 (raw file):

		colStat, _ := s.ColStats.Add(colSet)
		colStat.DistinctCount = 1
		colStat.NullCount = 0

It's possible that the output of the scalar group by could be NULL (feel free to just add a TODO here so we don't forget about it)


pkg/sql/opt/memo/statistics_builder.go, line 1048 at r1 (raw file):

	leftNullCount := min(1, leftColStat.NullCount)
	rightNullCount := min(1, rightColStat.NullCount)

Why are you reducing to 1 here? I think this is only valid for union, intersect and except, not for the ALL variants.


pkg/sql/opt/memo/statistics_builder.go, line 1055 at r1 (raw file):

	case opt.UnionOp, opt.UnionAllOp:
		colStat.DistinctCount = leftColStat.DistinctCount + rightColStat.DistinctCount
		colStat.NullCount = min(leftNullCount, rightNullCount)

I think this should be the sum here


pkg/sql/opt/memo/statistics_builder.go, line 1256 at r1 (raw file):

		// The ordinality column is a key, so every row is distinct.
		colStat.DistinctCount = s.RowCount
		colStat.NullCount = 0

Unless def.ColID is the only column, I'm not sure we can say NullCount is definitely zero


pkg/sql/opt/memo/statistics_builder.go, line 1593 at r1 (raw file):

// This function should be called before selectivityFromNullCounts.
//
func (sb *statisticsBuilder) updateNullCountsFromConstraint(

Is there any reason you can't just use the NotNullCols logical property? I don't think you need to look at the constraint at all (that was already done in logicalPropsBuilder).


pkg/sql/opt/memo/statistics_builder.go, line 1877 at r1 (raw file):

	// Short circuit if no rows
	if rowCount <= 0 {
		return selectivity

Do we ever hit this case?

Member Author

@itsbilal itsbilal left a comment

Thanks for the review! I should have addressed everything you pointed out. The logictest failure that existed in the first revision should also be fixed now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/metadata.go, line 169 at r1 (raw file):

Previously, rytaft wrote…

I don't think we need to store this in the metadata. We are already storing it in the logical properties.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 332 at r1 (raw file):

Previously, rytaft wrote…

I think you should update this to be:

colStat.DistinctCount = s.RowCount - colStat.NullCount

(and move it below calculation of colStat.NullCount)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 333 at r1 (raw file):

Previously, rytaft wrote…

Why is this equal to 1? I don't think we can make any assumptions about the null count for lax keys. You probably want to calculate this using unknownNullCountRatio.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 337 at r1 (raw file):

Previously, rytaft wrote…

Are we ever hitting this case? I thought we handled this case elsewhere (but I'm probably mistaken). Either way, you should move this case up above, so we don't call ColsAreLaxKey if it's not necessary.

Done. Removed it since we weren't hitting it.


pkg/sql/opt/memo/statistics_builder.go, line 359 at r1 (raw file):

Previously, rytaft wrote…

I think we are defining null count to be at least one column in the column set is null. This seems to be calculating the null count assuming all columns must be null.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 447 at r1 (raw file):

Previously, rytaft wrote…

[nit] I think inputRowCount would be a bit more consistent with the naming in the rest of this file. You could also just use inputStats.RowCount to avoid creating a new variable.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 841 at r1 (raw file):

Previously, rytaft wrote…

I believe by convention we are not treating NULL as part of the distinct count. So if all values are NULL, distinct count should be 0.

So this should be no larger than (RowCount - DistinctCount)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 925 at r1 (raw file):

Previously, rytaft wrote…

See comment above - I think it should be nulls in either column set

Done.


pkg/sql/opt/memo/statistics_builder.go, line 933 at r1 (raw file):

Previously, rytaft wrote…

See comment above - I think it should be (RowCount - DistinctCount)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 973 at r1 (raw file):

Previously, rytaft wrote…

It's possible that the output of the scalar group by could be NULL (feel free to just add a TODO here so we don't forget about it)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1048 at r1 (raw file):

Previously, rytaft wrote…

Why are you reducing to 1 here? I think this is only valid for union, intersect and except, not for the ALL variants.

Done. Special-cased the set and bag operations separately.


pkg/sql/opt/memo/statistics_builder.go, line 1055 at r1 (raw file):

Previously, rytaft wrote…

I think this should be the sum here

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1256 at r1 (raw file):

Previously, rytaft wrote…

Unless def.ColID is the only column, I'm not sure we can say NullCount is definitely zero

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1593 at r1 (raw file):

Previously, rytaft wrote…

Is there any reason you can't just use the NotNullCols logical property? I don't think you need to look at the constraint at all (that was already done in logicalPropsBuilder).

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1877 at r1 (raw file):

Previously, rytaft wrote…

Do we ever hit this case?

Yes, we do hit it in some cases it seems. And since we divide by rowCount here, it's best to short circuit it early on.

Collaborator

@rytaft rytaft left a comment

Still looking, but added a few comments to get started

Reviewed 2 of 37 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):


query TTTTT
EXPLAIN (TYPES) SELECT min(y) FROM xyz WHERE x = 1

Why did these plans change? It seems like the new plans are worse than the old ones.


pkg/sql/opt/memo/statistics_builder.go, line 329 at r2 (raw file):

	// If some of the columns are a lax key, the distinct count equals the row
	// count.

Update comment to mention null count.


pkg/sql/opt/memo/statistics_builder.go, line 330 at r2 (raw file):

	// If some of the columns are a lax key, the distinct count equals the row
	// count.
	if fd.ColsAreLaxKey(colSet) {

I think ColsAreLaxKey returns true for both strict and lax keys. So you should probably add a check if colSet.SubsetOf(notNullCols) { colStat.NullCount = 0 } (similar to below).


pkg/sql/opt/memo/statistics_builder.go, line 618 at r2 (raw file):

		// There are no columns in this expression, so it must be a constant.
		colStat.DistinctCount = 1
		colStat.NullCount = 0

Seems like this expression could also be NULL. Maybe you could check if it's a NullOp, and if so set colStat.NullCount to the row count?


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

			colStat, _ = s.ColStats.Add(colSet)
			colStat.DistinctCount = leftColStat.DistinctCount * rightColStat.DistinctCount
			colStat.NullCount = leftColStat.NullCount * rightColStat.NullCount

Is this right? I'm having a hard time convincing myself what the right value should be.


pkg/sql/opt/memo/statistics_builder.go, line 919 at r2 (raw file):

		// Assuming null columns are completely independent, calculate
		// the expected value of having nulls in either column sets.

[nit] column set


pkg/sql/opt/memo/statistics_builder.go, line 1123 at r2 (raw file):

	colStat.DistinctCount = float64(len(distinct))
	// We cannot read values, so just use the default
	// null count ratio here.

I think you can check if the operator is NullOp


pkg/sql/opt/memo/statistics_builder.go, line 1228 at r2 (raw file):

	colStat, _ := ev.Logical().Relational.Stats.ColStats.Add(colSet)
	colStat.DistinctCount = 1
	colStat.NullCount = 0

Another one I'm not convinced should be 0


pkg/sql/opt/memo/statistics_builder.go, line 1325 at r2 (raw file):

	if s.RowCount == 1 {
		colStat.DistinctCount = 1
		colStat.NullCount = 0

ditto


pkg/sql/opt/memo/statistics_builder.go, line 1328 at r2 (raw file):

	} else {
		colStat.DistinctCount = s.RowCount * unknownDistinctCountRatio
		colStat.NullCount = s.RowCount * math.Pow(unknownNullCountRatio, float64(colSet.Len()))

This needs to be updated


pkg/sql/opt/memo/statistics_builder.go, line 1554 at r2 (raw file):

	}

	sb.updateNullCountsFromProps(ev, relProps)

I think this should be called outside applyConstraint so it will get called if a scan is unconstrained.


pkg/sql/opt/memo/statistics_builder.go, line 1571 at r2 (raw file):

	numUnappliedConjuncts = 0
	for i := 0; i < cs.Length(); i++ {
		sb.updateNullCountsFromProps(ev, relProps)

same here - this should be moved out of applyConstraintSet


pkg/sql/opt/memo/statistics_builder.go, line 1588 at r2 (raw file):

// updateNullCountsFromProps zeroes null counts for columns that cannot
// have nulls in them, usually due to a column property or an application.
// of a null-excluding filter. The actual determinatino of non-nullable

determinatio -> determination


pkg/sql/opt/memo/statistics_builder.go, line 1589 at r2 (raw file):

// have nulls in them, usually due to a column property or an application.
// of a null-excluding filter. The actual determinatino of non-nullable
// columns is done in the logical prop builder.

[nit] logical props builder

Collaborator

@rytaft rytaft left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):

		})
		colStat.DistinctCount = min(distinctCount, s.RowCount)
		colStat.NullCount = min(nullCount, s.RowCount)

In other places you are setting this to min(nullCount, s.RowCount-colStat.DistinctCount). I don't have a strong opinion about which is better, but at least we should be consistent.


pkg/sql/opt/memo/statistics_builder.go, line 473 at r2 (raw file):

	}

	return colStat

It may be worth adding the following check at the end of every colStatXXX function: if colSet.SubsetOf(relProps.NotNullCols) { colStat.NullCount = 0 }. (See if it changes any test output -- alternatively, just check the code that's calculating NotNullCols inside logicalPropsBuilder, and see which operations can possibly change it from their input. Maybe these cases are already covered by the filters in Scan, Select and Join.)


pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):
If colSet contains a single column, the null count equals min(1, inputColStat.NullCount). For example:

> create table t (x int, y int);
> insert into t values (null, 1), (null, 1), (1, 2), (1, null), (1, 2);
> select count(*), x from t group by x;
  count |  x    
+-------+------+
      2 | NULL  
      3 |    1  
(2 rows)

Multi-column stats are more complicated, and I'm not really sure what the best approach is. See if you can come up with something that seems reasonable.... (happy to discuss ideas offline)


pkg/sql/opt/memo/statistics_builder.go, line 1058 at r2 (raw file):

	// (the non-ALL variants).
	leftNullCount := min(1, leftColStat.NullCount)
	rightNullCount := min(1, rightColStat.NullCount)

This only works if each side has one column. I think you should be consistent with whatever you come up with for multi-column stats for grouping columns and apply that here.


pkg/sql/opt/memo/statistics_builder.go, line 1062 at r2 (raw file):

	switch ev.Operator() {
	case opt.UnionOp:
		colStat.NullCount = max(leftNullCount, rightNullCount)

I think this should probably be leftNullCount + rightNullCount. I think you can actually use the same formulas for the regular and ALL variants, and put them in the switch above. The difference will be that for the ALL variants, use:

leftNullCount = leftColStat.NullCount
rightNullCount = rightColStat.NullCount

And for the regular variants, use the multi-column stats for grouping columns estimation.
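
A sketch of that suggestion (hypothetical helper that assumes the opt and math imports; the de-duplicating variants would ideally reuse whatever multi-column grouping estimate is settled on, approximated below with the single-column min(1, n) rule):

```go
// setOpNullCount sketches the suggested handling: the ALL variants keep the
// raw input null counts, while the de-duplicating variants first collapse
// NULLs the way grouping would, then combine per operator (sum for union,
// min for intersect, left side for except).
func setOpNullCount(op opt.Operator, leftNull, rightNull float64, isAllVariant bool) float64 {
	if !isAllVariant {
		leftNull = math.Min(1, leftNull)
		rightNull = math.Min(1, rightNull)
	}
	switch op {
	case opt.UnionOp, opt.UnionAllOp:
		return leftNull + rightNull
	case opt.IntersectOp, opt.IntersectAllOp:
		return math.Min(leftNull, rightNull)
	default: // ExceptOp / ExceptAllOp: just the left side, per the comment below
		return leftNull
	}
}
```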


pkg/sql/opt/memo/statistics_builder.go, line 1070 at r2 (raw file):

		colStat.NullCount = min(leftColStat.NullCount, rightColStat.NullCount)
	case opt.ExceptOp:
		colStat.NullCount = max(0, leftNullCount-rightNullCount)

Seems like you might be under-estimating null count here, especially in the case of more than one column. I'd probably just use leftNullCount, similar to the distinct count calculation above. It's definitely not perfect, but at least it's consistent.


pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):

	// This is the ratio of null column values to number of rows for nullable
	// columns, which is used in the absence of any real statistics for non-key

whether or not the column is a key is not relevant for null count - only whether the column is nullable.


pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):

) {
	relProps.NotNullCols.ForEach(func(col int) {
		colStat := sb.ensureColStat(util.MakeFastIntSet(col), math.MaxFloat64, ev, relProps)

There is a potential problem here, which is that if a particular column is not already in s.ColStats, we are just copying the distinct count from its input. The distinct count should probably be scaled based on the selectivity of the filter, like we do in colStatScan, colStatSelect, etc. This also means that we may need to move this function call after any call to selectivityFromDistinctCounts. (But that kind of breaks the flow of the calling functions -- play around with it and see if you can fix the logic without hurting the flow too much)


pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):

	// Find the minimum distinct and null counts for all columns in this equivalency
	// group.

I'm not sure it's necessary to update null counts here. Equivalency groups are created based on predicates like x = y, which are null-rejecting. So both x and y will have their null counts set to 0 inside updateNullCountsFromProps. Although maybe I'm missing an edge case where x and y can both be null... it would be good to confirm before you rip this out.


pkg/sql/opt/memo/statistics_builder.go, line 1788 at r2 (raw file):

		if inputStat.DistinctCount != 0 && colStat.DistinctCount < inputStat.DistinctCount {
			newDistinct := colStat.DistinctCount
			oldDistinct := inputStat.DistinctCount

[nit] Why don't you move these definitions outside the if statement so we can use them in the if statement as well.


pkg/sql/opt/memo/statistics_builder.go, line 1805 at r2 (raw file):

//              i in
//            {constrained
//               columns}

Nice formula!


pkg/sql/opt/memo/statistics_builder.go, line 1828 at r2 (raw file):

			panic("rowCount passed in was too small")
		}
		if inputStat.NullCount != 0 && colStat.NullCount < inputStat.NullCount {

Isn't it sufficient to check if colStat.NullCount < inputStat.NullCount?


pkg/sql/opt/memo/statistics_builder.go, line 1969 at r2 (raw file):

			// Cases of NULL in a constraint should be ignored. For example,
			// without knowledge of the data distribution, /a: (/NULL - /10] should
			// have the same estimated selectivity as /a: [/10 - ].

You should also mention that we are handling the selectivity of NULL constraints separately in selectivityFromNullCounts

Member

@RaduBerinde RaduBerinde left a comment

Yeah what Rebecca said above is correct.

[nit] Just a tip for future changes - I found that for bigger changes, it helps to separate the code changes (and new tests) from the existing test changes (in different commits). It's easier to review, and it's easier to rebase (if you have a conflict in a test file, you can just nuke the entire commit and re-run tests with -rewrite)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):

		// Assuming null columns are completely independent, calculate
		// the expected value of having nulls in either column sets.
		colStat.NullCount += lookupColStat.NullCount

I was looking at r1 and was going to suggest the formula rowcount * (f1 + f2 - f1 * f2) where f1,f2 are the two null/rowcount fractions.

@rytaft
Collaborator

rytaft commented Oct 3, 2018


pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):

Previously, RaduBerinde wrote…

I was looking at r1 and was going to suggest the formula rowcount * (f1 + f2 - f1 * f2) where f1,f2 are the two null/rowcount fractions.

Great idea!

Member Author

@itsbilal itsbilal left a comment

Thanks for all your help so far @rytaft and @RaduBerinde! I should have addressed most of the major points raised - the one big exception being the regression where indexed min(x) is no longer replaced with a limit, which I'm still looking into.

I've pushed my work so far but feel free to hold off on the review until I've addressed that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):

Previously, rytaft wrote…

Why did these plans change? It seems like the new plans are worse than the old ones.

Looking into this.


pkg/sql/opt/memo/statistics_builder.go, line 329 at r2 (raw file):

Previously, rytaft wrote…

Update comment to mention null count.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 330 at r2 (raw file):

Previously, rytaft wrote…

I think ColsAreLaxKey returns true for both strict and lax keys. So you should probably add a check if colSet.SubsetOf(notNullCols) { colStat.NullCount = 0 } (similar to below).

Done.


pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):

Previously, rytaft wrote…

In other places you are setting this to min(nullCount, s.RowCount-colStat.DistinctCount). I don't have a strong opinion about which is better, but at least we should be consistent.

Done - moved to RowCount only.


pkg/sql/opt/memo/statistics_builder.go, line 618 at r2 (raw file):

Previously, rytaft wrote…

Seems like this expression could also be NULL. Maybe you could check if it's a NullOp, and if so set colStat.NullCount to the row count?

Done.


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, rytaft wrote…

Is this right? I'm having a hard time convincing myself what the right value should be.

Any ideas for alternatives? It's hard to deduce this without understanding how the rows interleave - but there may be a decent heuristic.


pkg/sql/opt/memo/statistics_builder.go, line 919 at r2 (raw file):

Previously, rytaft wrote…

[nit] column set

Done.


pkg/sql/opt/memo/statistics_builder.go, line 920 at r2 (raw file):

Previously, rytaft wrote…

Great idea!

Done.


pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):

Previously, rytaft wrote…

If colSet contains a single column, the null count equals min(1, inputColStat.NullCount). For example:

> create table t (x int, y int);
> insert into t values (null, 1), (null, 1), (1, 2), (1, null), (1, 2);
> select count(*), x from t group by x;
  count |  x    
+-------+------+
      2 | NULL  
      3 |    1  
(2 rows)

Multi-column stats are more complicated, and I'm not really sure what the best approach is. See if you can come up with something that seems reasonable.... (happy to discuss ideas offline)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1058 at r2 (raw file):

Previously, rytaft wrote…

This only works if each side has one column. I think you should be consistent with whatever you come up with for multi-column stats for grouping columns and apply that here.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1062 at r2 (raw file):

Previously, rytaft wrote…

I think this should probably be leftNullCount + rightNullCount. I think you can actually use the same formulas for the regular and ALL variants, and put them in the switch above. The difference will be that for the ALL variants, use:

leftNullCount = leftColStat.NullCount
rightNullCount = rightColStat.NullCount

And for the regular variants, use the multi-column stats for grouping columns estimation.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1070 at r2 (raw file):

Previously, rytaft wrote…

Seems like you might be under-estimating null count here, especially in the case of more than one column. I'd probably just use leftNullCount, similar to the distinct count calculation above. It's definitely not perfect, but at least it's consistent.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1123 at r2 (raw file):

Previously, rytaft wrote…

I think you can check if the operator is NullOp

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1228 at r2 (raw file):

Previously, rytaft wrote…

Another one I'm not convinced should be 0

Done - went with unknownNullCountRatio


pkg/sql/opt/memo/statistics_builder.go, line 1325 at r2 (raw file):

Previously, rytaft wrote…

ditto

Done - went with unknownNullCountRatio


pkg/sql/opt/memo/statistics_builder.go, line 1328 at r2 (raw file):

Previously, rytaft wrote…

This needs to be updated

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):

Previously, rytaft wrote…

whether or not the column is a key is not relevant for null count - only whether the column is nullable.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1554 at r2 (raw file):

Previously, rytaft wrote…

I think this should be called outside applyConstraint so it will get called if a scan is unconstrained.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1571 at r2 (raw file):

Previously, rytaft wrote…

same here - this should be moved out of applyConstraintSet

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1588 at r2 (raw file):

Previously, rytaft wrote…

determinatio -> determination

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1589 at r2 (raw file):

Previously, rytaft wrote…

[nit] logical props builder

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):

Previously, rytaft wrote…

There is a potential problem here, which is that if a particular column is not already in s.ColStats, we are just copying the distinct count from its input. The distinct count should probably be scaled based on the selectivity of the filter, like we do in colStatScan, colStatSelect, etc. This also means that we may need to move this function call after any call to selectivityFromDistinctCounts. (But that kind of breaks the flow of the calling functions -- play around with it and see if you can fix the logic without hurting the flow too much)

Is this an issue though? It seems like we only call this function after we've already called applyConstraint or applyConstraintSet - so either the distinct count has already been updated to match the constraint, or that particular row is unconstrained and will later have its counts modified when we call ApplySelectivity. I may need to dive in deeper to be confident with this solution though.


pkg/sql/opt/memo/statistics_builder.go, line 1788 at r2 (raw file):

Previously, rytaft wrote…

[nit] Why don't you move these definitions outside the if statement so we can use them in the if statement as well.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1828 at r2 (raw file):

Previously, rytaft wrote…

Isn't it sufficient to check if colStat.NullCount < inputStat.NullCount?

Done.

Collaborator

@rytaft rytaft left a comment

Nice changes! I've added a few more comments inline, but my big comment now is that I think you need to add some more tests in memo/testdata/stats that explicitly exercise all the changes you've made. Also, take a look at the stats in the existing tests and make sure you can convince yourself that they make sense.

Reviewed 2 of 37 files at r1, 6 of 37 files at r2, 33 of 33 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 332 at r2 (raw file):

	if fd.ColsAreLaxKey(colSet) {
		colStat.NullCount = unknownNullCountRatio * s.RowCount
		colStat.DistinctCount = s.RowCount - colStat.NullCount

I think this old formula was better for this particular case (sorry if my comment below was confusing -- I didn't mean for you to change this one).


pkg/sql/opt/memo/statistics_builder.go, line 355 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done - moved to RowCount only.

This is probably fine, but maybe we should see which one produces better plans. It's possible we should actually be doing something in between these two extremes.... It would be good to add some tests that explicitly exercise this case.


pkg/sql/opt/memo/statistics_builder.go, line 978 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done.

Nice!


pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):
I'd just say:

// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics.

pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Is this an issue though? It seems like we only call this function after we've already called applyConstraint or applyConstraintSet - so either the distinct count has already been updated to match the constraint, or that particular row is unconstrained and will later have its counts modified when we call ApplySelectivity. I may need to dive in deeper to be confident with this solution though.

The problem is you're not calling ApplySelectivity on this particular colStat. Once it's in the cache it doesn't get updated again. So I think you need to add a call to ApplySelectivity here, but make sure that doesn't cause the final selectivity to change.


pkg/sql/opt/memo/statistics_builder.go, line 195 at r3 (raw file):

// statsFromChild retrieves the main statistics struct from a specific child
// of the given expression.
func (sb *statisticsBuilder) statFromChild(ev ExprView, childIdx int) *props.Statistics {

[nit] statFromChild -> statsFromChild


pkg/sql/opt/memo/statistics_builder.go, line 940 at r3 (raw file):

		// Assuming null columns are completely independent, calculate
		// the expected value of having nulls in either column set.
		f1 := lookupColStat.NullCount / tableStats.RowCount

Since ApplySelectivity updates null count, I think this denominator should be inputStats.RowCount


pkg/sql/opt/memo/statistics_builder.go, line 1106 at r3 (raw file):

	case opt.ExceptOp, opt.ExceptAllOp:
		colStat.DistinctCount = leftColStat.DistinctCount
		colStat.NullCount = leftColStat.NullCount

leftColStat.NullCount -> leftNullCount


pkg/sql/opt/memo/statistics_builder_test.go, line 177 at r3 (raw file):

	statsFunc(
		cs2,
		"[rows=3.33333333e+09, distinct(2)=500, null(2)=0]",

This is an example of the issue where the distinct count of (2) is just copied from the input as-is. It should be scaled with a call to ApplySelectivity.


pkg/sql/opt/memo/testdata/stats/select, line 199 at r3 (raw file):

# Hide stats because when run with testrace, more stats are populated than
# otherwise. See comment in MemoizeDenormExpr in memo.go.
opt format=hide-stats

One thing you can do to get around this is specifically request certain column stats using the colstats directive. I think that will allow you to ensure you get the same results for test and testrace.


pkg/sql/opt/props/statistics.go, line 84 at r3 (raw file):

		s.RowCount = 0
		for i, n := 0, s.ColStats.Count(); i < n; i++ {
			s.ColStats.Get(i).DistinctCount = 0

update null count here too


pkg/sql/opt/props/statistics.go, line 165 at r3 (raw file):

	c.DistinctCount = d - d*math.Pow(1-selectivity, n/d)

	// Since the null count is a more simpler count of all null rows, we can

[nit] more simpler -> simple
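
For context on the quoted distinct-count formula: with n input rows, d distinct values, and selectivity s, each distinct value covers roughly n/d rows, and it survives the filter if at least one of those rows does, so under independent row survival

\[
\text{DistinctCount}' = d\bigl(1 - (1 - s)^{n/d}\bigr) = d - d\,(1 - s)^{n/d},
\]

which matches c.DistinctCount = d - d*math.Pow(1-selectivity, n/d). The null count, by contrast, is a plain count of NULL rows, so (per the truncated comment above) it can presumably be scaled by the selectivity directly.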

Collaborator

@rytaft rytaft left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Any ideas of alternatives? It's hard to deduce this without understanding how the rows interleave - but there may be a decent heuristic.

I think we should follow a similar approach to how we calculated the row count. So there are 3 steps:

  1. Estimate the null count for the cross join
  2. Scale the null count based on the selectivity of the filter
  3. Adjust if needed for outer joins

For 1, I think the formula is something like:
crossJoinNullCount := leftColStat.NullCount * s.RowCount + rightColStat.NullCount * s.RowCount - leftColStat.NullCount * rightColStat.NullCount
For 2, you can just call colStat.ApplySelectivity
For 3, take a look at what we're doing in buildJoin. For example, for left join, you should adjust the null count like this:
colStat.NullCount = max(innerJoinNullCount, leftColStat.NullCount)


pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):

		} else {
			inputRowCount := sb.statFromChild(ev, 0 /* childIdx */).RowCount
			colStat.NullCount = ((colStat.DistinctCount + 1) / inputRowCount) * inputColStat.NullCount

Thinking about this more, I think this formula makes more sense if we include null values in the distinct count. I think that would make a lot of things here easier to reason about. @RaduBerinde, is there any reason we can't treat NULL as another distinct value? For example, consider the following data:

a    | b
-----+-----
NULL | 1
NULL | 1
NULL | NULL
1    | 2
3    | 4

I think we should treat this as having DistinctCount=4 and NullCount=3. Does that seem reasonable?

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):

Previously, rytaft wrote…

Thinking about this more, I think this formula makes more sense if we include null values in the distinct count. I think that would make a lot of things here easier to reason about. @RaduBerinde, is there any reason we can't treat NULL as another distinct value? For example, consider the following data:

a    | b
-----+-----
NULL | 1
NULL | 1
NULL | NULL
1    | 2
3    | 4

I think we should treat this as having DistinctCount=4 and NullCount=3. Does that seem reasonable?

Sure. We'll need to update the sampler.go to put the NULLs in the sketch too but it's a simple change.

We can also change what "null" means in the multi-column case if the other definition makes more sense. We aren't even calculating multi-column stats yet so only some comments would need to be updated.

@itsbilal itsbilal force-pushed the null-counts-statbuilder branch 2 times, most recently from 6383db0 to b90a309 Compare October 9, 2018 20:17
Member Author

@itsbilal itsbilal left a comment

I should have addressed all comments and issues pointed out so far. I'm still going through the test output for anything I may have missed, and will be writing more tests in opt/memo/testdata/stats next to exercise the non-zero null count cases. There's no observable difference in the pkg/sql/opt/bench benchmarks, at least after benchstat-ing 5 runs.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 332 at r2 (raw file):

Previously, rytaft wrote…

I think this old formula was better for this particular case (sorry if my comment below was confusing -- I didn't mean for you to change this one).

Done.


pkg/sql/opt/memo/statistics_builder.go, line 473 at r2 (raw file):

Previously, rytaft wrote…

It may be worth adding the following check at the end of every colStatXXX function: if colSet.SubsetOf(relProps.NotNullCols) { colStat.NullCount = 0 }. (See if it changes any test output -- alternatively, just check the code that's calculating NotNullCols inside logicalPropsBuilder, and see which operations can possibly change it from their input. Maybe these cases are already covered by the filters in Scan, Select and Join.)

Done.


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, rytaft wrote…

I think we should follow a similar approach to how we calculated the row count. So there are 3 steps:

  1. Estimate the null count for the cross join
  2. Scale the null count based on the selectivity of the filter
  3. Adjust if needed for outer joins

For 1, I think the formula is something like:
crossJoinNullCount := leftColStat.NullCount * s.RowCount + rightColStat.NullCount * s.RowCount - leftColStat.NullCount * rightColStat.NullCount
For 2, you can just call colStat.ApplySelectivity
For 3, take a look at what we're doing in buildJoin. For example, for left join, you should adjust the null count like this:
colStat.NullCount = max(innerJoinNullCount, leftColStat.NullCount)

Done - but feel free to suggest more changes since it's a fairly complex expression. The test stat output generated by it seems reasonable though.


pkg/sql/opt/memo/statistics_builder.go, line 1472 at r2 (raw file):

Previously, rytaft wrote…

I'd just say:

// This is the ratio of null column values to number of rows for nullable
// columns, which is used in the absence of any real statistics.

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1604 at r2 (raw file):

Previously, rytaft wrote…

The problem is you're not calling ApplySelectivity on this particular colStat. Once it's in the cache it doesn't get updated again. So I think you need to add a call to ApplySelectivity here, but make sure that doesn't cause the final selectivity to change.

Done. Required moving updateNullCountsFromProps after really every other selectivityFrom* call - not sure if that's an ideal idiom going forward but it works reliably.


pkg/sql/opt/memo/statistics_builder.go, line 1733 at r2 (raw file):

Previously, rytaft wrote…

I'm not sure it's necessary to update null counts here. Equivalency groups are created based on predicates like x = y, which are null-rejecting. So both x and y will have their null counts set to 0 inside updateNullCountsFromProps. Although maybe I'm missing an edge case where x and y can both be null... it would be good to confirm before you rip this out.

Would that still work for null counts? If I add a panic for the case where one of the equivGroup cols is not in NotNullCols, it does panic - so maybe we do need to propagate non-zero null counts in some cases?


pkg/sql/opt/memo/statistics_builder.go, line 1969 at r2 (raw file):

Previously, rytaft wrote…

You should also mention that we are handling the selectivity of NULL constraints separately in selectivityFromNullCounts

Done.


pkg/sql/opt/memo/statistics_builder.go, line 195 at r3 (raw file):

Previously, rytaft wrote…

[nit] statFromChild -> statsFromChild

Done.


pkg/sql/opt/memo/statistics_builder.go, line 940 at r3 (raw file):

Previously, rytaft wrote…

Since ApplySelectivity updates null count, I think this denominator should be inputStats.RowCount

Done.


pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):

Previously, RaduBerinde wrote…

Sure. We'll need to update the sampler.go to put the NULLs in the sketch too but it's a simple change.

We can also change what "null" means in the multi-column case if the other definition makes more sense. We aren't even calculating multi-column stats yet so only some comments would need to be updated.

Can this be part of a follow up change? It's not necessary for the introduction of null counts, and it can just be a simple reorganization once this behemoth is in.


pkg/sql/opt/memo/statistics_builder.go, line 1106 at r3 (raw file):

Previously, rytaft wrote…

leftColStat.NullCount -> leftNullCount

Done.


pkg/sql/opt/memo/statistics_builder_test.go, line 177 at r3 (raw file):

Previously, rytaft wrote…

This is an example of the issue where the distinct count of (2) is just copied from the input as-is. It should be scaled with a call to ApplySelectivity.

Done. The exact output here didn't change even after the fix because the distinct count is so much smaller than the row count, that the ApplySelectivity formula barely modifies it.


pkg/sql/opt/memo/testdata/stats/select, line 199 at r3 (raw file):

Previously, rytaft wrote…

One thing you can do to get around this is specifically request certain column stats using the colstats directive. I think that will allow you to ensure you get the same results for test and testrace.

That didn't really help, since that only works on the root expression. If exploration rules swap out a child expression, colstats has no impact on that output. I added an additional call to logPropsBuilder.buildProps in the tester that's run on every child to fix this.


pkg/sql/opt/props/statistics.go, line 84 at r3 (raw file):

Previously, rytaft wrote…

update null count here too

Done.


pkg/sql/opt/props/statistics.go, line 165 at r3 (raw file):

Previously, rytaft wrote…

[nit] more simpler -> simple

Done.

Member

@RaduBerinde RaduBerinde left a comment

Hi Bilal, great work on a tedious change! I have some comments, I'll do a final review after your next update.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/memo.go, line 433 at r4 (raw file):

// DeriveLogicalProps derives logical props for the specified expression,
// usually resulting in population of child stats in the test output tree.
func DeriveLogicalProps(evalCtx *tree.EvalContext, ev ExprView) {

[nit] The comment should explain why/when this should be used. It's a very narrow use-case and it should be explicit that it's just for testing.


pkg/sql/opt/memo/memo.go, line 507 at r4 (raw file):

			m.groups = append(m.groups, makeMemoGroup(tmpGroupID, denorm))
			ev := MakeNormExprView(m, tmpGroupID)
			// Building out logical props could lead to more lazy population of

One possibility is to set a flag in logPropsBuilder to skip the building of stats (this code doesn't care about stats).


pkg/sql/opt/memo/memo.go, line 511 at r4 (raw file):

			// when run with and without -race.
			//
			// TODO: Figure out a way to either run this for all tests, or

[nit] usually TODO(yourname) (btw the name isn't indicative of who is supposed to work on it, it's really indicative of who one should ask for clarifications)


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done - but feel free to suggest more changes since it's a fairly complex expression. The test stat output generated by it seems reasonable though.

I don't understand the formula (it doesn't look right to me).. Can you add some comments explaining it?

s.RowCount is the number of estimated rows of the join right? Say rightNullCount is 0 and leftNullCount is 100. We would have 100 times more nulls than rows? Or say s.RowCount is very small, this whole thing can easily turn negative.

I don't understand why we're multiplying s.RowCount with the count of the nulls (and not the fraction of rows that have nulls). The formula I would use would be similar to the one in colStatIndexJoin (s.RowCount * (f1 + f2 - f1 * f2)).

Also, what test stat output are you looking at specifically? This is only for multi-column stats which are unused unless we explicitly request them in tests using colstat=(1,2,3)


pkg/sql/opt/memo/statistics_builder.go, line 1009 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Can this be part of a follow up change? It's not necessary for the introduction of null counts, and it can just be a simple reorganization once this behemoth is in.

Absolutely. You can also file an issue if it's not something you want to work on right away.


pkg/sql/opt/props/statistics.go, line 135 at r4 (raw file):

	// DistinctCount is the estimated number of distinct values of this
	// set of columns for this expression.

[nit] Mention here how this interacts with NullCount (specifically mention that any rows that have at least a null don't contribute to this count). And maybe leave a TODO to consider changing things so DistinctCount counts these as well (as Becca suggested).


pkg/sql/opt/testutils/opt_tester.go, line 341 at r4 (raw file):

		// Derive logical props for all child expressions, even
		// denormalized ones - to make statistics look complete

[nit] This could use a bit more explanation, it's not enough for someone looking at this for the first time. Explain that we're not interested in the logical props, but calculating them (specifically the statistics) will trigger lazy calculation of column stats in child expressions.

@itsbilal itsbilal force-pushed the null-counts-statbuilder branch 2 times, most recently from 0e030ac to 31a73eb Compare October 10, 2018 22:05
Member Author

@itsbilal itsbilal left a comment

Thanks for the review Radu! I've added more tests that exercise many of the complex parts of this. Also addressed the panics we were seeing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, RaduBerinde wrote…

I don't understand the formula (it doesn't look right to me).. Can you add some comments explaining it?

s.RowCount is the number of estimated rows of the join right? Say rightNullCount is 0 and leftNullCount is 100. We would have 100 times more nulls than rows? Or say s.RowCount is very small, this whole thing can easily turn negative.

I don't understand why we're multiplying s.RowCount with the count of the nulls (and not the fraction of rows that have nulls). The formula I would use would be similar to the one in colStatIndexJoin (s.RowCount * (f1 + f2 - f1 * f2)).

Also, what test stat output are you looking at specifically? This is only for multi-column stats which are unused unless we explicitly request them in tests using colstat=(1,2,3)

That makes more sense, actually, to use the independent probability rule. I've added a simple explanation in the comments.

I've also added a basic test case with multi-col stats across join columns so you can see the output, but it doesn't seem right just yet. I'll keep looking into this tomorrow.

Collaborator

@rytaft rytaft left a comment

Reviewed 1 of 31 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

That makes more sense, actually, to use the independent probability rule. I've added a simple explanation in the comments.

I've also added a basic test case with multi-col stats across join columns so you can see the output, but it doesn't seem right just yet. I'll keep looking into this tomorrow.

Sorry, I screwed up the formula for the null count in the cross join in my earlier comment. I think it should be:
crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount
Does that make more sense? Maybe Radu's solution is still better - haven't had time to think about it much, but I think this new formula should also work.

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, rytaft wrote…

Sorry, I screwed up the formula for the null count in the cross join in my formula. I think it should be:
crossJoinNullCount := leftColStat.NullCount * rightRowCount + rightColStat.NullCount * leftRowCount - leftColStat.NullCount * rightColStat.NullCount
Does that make more sense? Maybe Radu's solution is still better - haven't had time to think about it much, but I think this new formula should also work.

Your formula is equivalent to leftRowCount * rightRowCount * (f1 + f2 - f1 * f2), which becomes s.RowCount * (f1 + f2 - f1 * f2) (my formula) after multiplying with Selectivity (right?) So either one works, just don't use my formula and then also apply Selectivity :)
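
Putting the corrected pieces together (note that leftNull*rightRows + rightNull*leftRows - leftNull*rightNull equals leftRows*rightRows*(f1 + f2 - f1*f2) when f1 = leftNull/leftRows and f2 = rightNull/rightRows), a hypothetical sketch of the three-step approach discussed earlier in this thread:

```go
// joinNullCount sketches the three steps for a join column set: (1) estimate
// the null count of the unfiltered cross join assuming the two sides' nulls
// are independent, (2) scale by the join predicate's selectivity exactly
// once, and (3) adjust for outer joins, where left rows are never dropped.
// This is an illustrative helper, not the code in this PR.
func joinNullCount(leftNull, rightNull, leftRows, rightRows, selectivity float64, leftOuter bool) float64 {
	f1, f2 := leftNull/leftRows, rightNull/rightRows
	crossJoinNullCount := leftRows * rightRows * (f1 + f2 - f1*f2) // step 1
	nullCount := crossJoinNullCount * selectivity                  // step 2
	if leftOuter && nullCount < leftNull {                         // step 3
		nullCount = leftNull
	}
	return nullCount
}
```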

@rytaft
Collaborator

rytaft commented Oct 11, 2018


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, RaduBerinde wrote…

Your formula is equivalent to leftRowCount * rightRowCount * (f1 + f2 - f1 * f2), which becomes s.RowCount * (f1 + f2 - f1 * f2) (my formula) after multiplying with Selectivity (right?) So either one works, just don't use my formula and then also apply Selectivity :)

You’re totally right - thanks, Radu! 👍

Member Author

@itsbilal itsbilal left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 828 at r2 (raw file):

Previously, rytaft wrote…

You’re totally right - thanks, Radu! 👍

Whoops, my bad about the double application of selectivity. Fixed - please take another look! The stats in test files (specifically the ones at the bottom of memo/testdata/stats/join) look better now.


pkg/sql/opt/testutils/opt_tester.go, line 341 at r4 (raw file):

Previously, RaduBerinde wrote…

[nit] This could use a bit more explanation, it's not enough for someone looking at this for the first time. Explain that we're not interested in the logical props, but calculating them (specifically the statistics) will trigger lazy calculation of column stats in child expressions.

Done


pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 356 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Looking into this.

Done. We were not processing selectivity for one of the nullable-but-null-rejected columns due to an input FD.

Member Author

@itsbilal itsbilal left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 821 at r11 (raw file):

Previously, rytaft wrote…

move this switch inside the loop (just above adjustNullCountsForOuterJoins) so you still call innerJoinNullCount for inner joins.

Done.

Collaborator

@rytaft rytaft left a comment

Reviewed 1 of 1 files at r13.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 818 at r13 (raw file):

	// Loop through all colSets added in this step, and adjust null counts
	// for outer join operations accordingly.

Update comment - this is not just updating outer joins anymore

Member Author

@itsbilal itsbilal left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/statistics_builder.go, line 818 at r13 (raw file):

Previously, rytaft wrote…

Update comment - this is not just updating outer joins anymore

Done.

Collaborator

@rytaft rytaft left a comment

Getting close now! I think you still need to fix up the tests a bit. I really like that you bumped up the null counts for some additional tests, but I think you need to be a bit more deliberate about which queries you choose to test with. Instead of copying a bunch of queries from the existing set of tests, try to think of a small number of new ones that will test some specific part of your code (or reuse a query if you can explain in a comment why it's interesting for null counts). Try to get as much coverage as possible for all the different edge cases you are considering.

I'm making a bit of a fuss about the tests because I think it's concerning that the test output in the join test file hasn't changed since r9 even though you've made several non-trivial changes to the code since then. Also it's difficult to review when there are a lot of new tests and it's not clear what they are testing.

Reviewed 1 of 22 files at r6, 1 of 7 files at r8, 1 of 8 files at r9, 1 of 1 files at r14.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/testdata/stats/scan, line 314 at r14 (raw file):

 └── scan a
      ├── columns: y:2(int) s:3(string)
      └── stats: [rows=3000, distinct(2,3)=1000, null(2,3)=2000]

null(2,3) should probably be less than 2000. I think you need to add some logic in colStatLeaf to subtract the expected number of collisions (or feel free to add a TODO)

Member Author

@itsbilal itsbilal left a comment

Thanks for the review! The emphasis on tests makes sense. I've added more tests (+ more colStats on existing tests) that should cover all the join cases, but I'll be sifting through it all more closely on Monday. Also need to resolve a 20-file merge conflict with master, now that the memo change is in.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/testdata/stats/scan, line 314 at r14 (raw file):

Previously, rytaft wrote…

null(2,3) should probably be less than 2000. I think you need to add some logic in colStatLeaf to subtract the expected number of collisions (or feel free to add a TODO)

Added a simple collision factor calculation that works well for 2-column cases but starts to underestimate collisions for larger sets of columns. I've added a TODO to think of a better general solution for n-way collisions.
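
For reference, one plausible two-column collision adjustment under an independence assumption (not necessarily the exact calculation added here) subtracts the expected overlap of the two columns' NULL rows:

```go
// twoColumnNullCount sketches an inclusion-exclusion adjustment: rows where
// both columns are NULL would otherwise be double-counted, so subtract the
// expected overlap n1*n2/rowCount (assuming the columns are independent).
func twoColumnNullCount(n1, n2, rowCount float64) float64 {
	return n1 + n2 - n1*n2/rowCount
}
```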

@itsbilal itsbilal force-pushed the null-counts-statbuilder branch 2 times, most recently from 0a3435b to a33c8b3 Compare October 22, 2018 18:36
@itsbilal
Member Author

Thanks for the reviews so far!

I've resolved the gigantic merge conflict with master - but in the process I had to comment out DeriveLogicalProps (due to a panic around the new form of representing Scans that I'm looking into). So testrace will fail due to mismatching opt_tester output again - but everything else should be ready for review.

Collaborator

@rytaft rytaft left a comment

Thanks for all your work on this -- that rebase was probably a pain. If you haven't already, please open an issue for changing distinct counts to include null values. Feel free to assign it to me.

You still need more tests -- I think at least one new test to cover each operator that you changed as part of this PR (e.g., are nulls getting counted correctly in VALUES? PROJECT?)

Reviewed 33 of 33 files at r16.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Member Author

@itsbilal itsbilal left a comment

Just added:

  • Tests for most missing operations

  • Fix for testrace failure due to extra logprop building (ended up going with Radu's suggestion to disable stat building in checkExpr, formerly in MemoizeDenormExpr). The bug ended up being a side-effect of the fact that, in the new memo, we preemptively delete references to logical prop builder and metadata once we know optimization is complete - so it's too late to build more logical props inside opt_tester.go after that point. Simply not building extra stats in testrace is the easier way out.

  • One minor fix in colStatSetNode uncovered by tests

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/memo.go, line 507 at r4 (raw file):

Previously, RaduBerinde wrote…

One possibility is to set a flag in logPropsBuilder to skip the building of stats (this code doesn't care about stats).

Done.

rytaft previously requested changes Oct 23, 2018

Collaborator

@rytaft rytaft left a comment

I like the new tests! I think you just need to update one of the tests for project.

Reviewed 10 of 10 files at r17.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/testdata/stats/project, line 260 at r17 (raw file):

# Test that the stats are calculated correctly for synthesized columns.
build
SELECT * FROM (SELECT concat(s, y::string) FROM a) AS q(v) WHERE v = 'foo'

Can you also add a synthesized column that is NULL?


pkg/sql/opt/memo/testdata/stats/set, line 629 at r17 (raw file):

]'
----

Nice tests!

Member Author

@itsbilal itsbilal left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/opt/memo/testdata/stats/project, line 260 at r17 (raw file):

Previously, rytaft wrote…

Can you also add a synthesized column that is NULL?

Done. Also fixed an oversight in colStatProject in the process.

Collaborator

@rytaft rytaft left a comment

:lgtm: congrats on pushing this one over the finish line!

Reviewed 2 of 2 files at r18.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@itsbilal
Member Author

Thanks so much for your reviews @rytaft ! I've filed #31746 for the distinct count followup change.

bors r+

@craig
Contributor

craig bot commented Oct 23, 2018

👎 Rejected by code reviews

@itsbilal itsbilal dismissed rytaft’s stale review October 23, 2018 15:59

LGTM'd on reviewable

@itsbilal
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Oct 23, 2018
30827: opt: Propagate null counts through statsbuilder r=itsbilal a=itsbilal

NOTE: This is a large diff with mostly minor test changes. Implementation work is confined to just these files:

- [opt/memo/statistics_builder.go](pkg/sql/opt/memo/statistics_builder.go)
- [opt/metadata.go](pkg/sql/opt/metadata.go)
- [opt/props/statistics.go](pkg/sql/opt/props/statistics.go)

This change takes null counts that are already collected in table
statistics, and propagates them through the different operators in
statistics_builder. It also uses these null counts to generate
selectivities which are then applied to the row count.

The expectation is that this change will lead to much better cardinality
estimation, especially in workloads involving lots of null values that
get filtered out by an operator or constraint. Previously, we treated
null counts like any other distinct value.

Fixes #30289

Release note: None

Co-authored-by: Bilal Akhtar <me@itsbilal.com>
@craig
Contributor

craig bot commented Oct 23, 2018

Build succeeded

@craig craig bot merged commit a8be7a6 into cockroachdb:master Oct 23, 2018
@itsbilal itsbilal deleted the null-counts-statbuilder branch October 23, 2018 16:47