Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: support `corr()` #44628

Merged
merged 2 commits into from Feb 21, 2020
Merged

sql: support `corr()` #44628

merged 2 commits into from Feb 21, 2020

Conversation

@hueypark
Copy link
Contributor

hueypark commented Feb 2, 2020

Support aggregate function corr(). It referred to the PostgreSQL implementation, but the implementation details follow SQL 2003. See: https://www.postgresql.org/docs/10/functions-aggregate.html

See #41274

Release note (sql change): support aggregate function corr()

@hueypark hueypark requested a review from cockroachdb/sql-opt-prs as a code owner Feb 2, 2020
@cockroach-teamcity

This comment has been minimized.

Copy link
Member

cockroach-teamcity commented Feb 2, 2020

This change is Reviewable

@yuzefovich

This comment has been minimized.

Copy link
Contributor

yuzefovich commented Feb 2, 2020

Thanks for working on this!

I briefly skimmed the PR, and I think it is missing one important part - when a new aggregate function is introduced, it should be added to pkg/sql/execinfrapb/processors_sql.proto so that we could serialize it correctly.

@yuzefovich yuzefovich requested review from RaduBerinde and yuzefovich Feb 2, 2020
@hueypark

This comment has been minimized.

Copy link
Contributor Author

hueypark commented Feb 2, 2020

Thank you for the review!

I briefly skimmed the PR, and I think it is missing one important part - when a new aggregate function is introduced, it should be added to pkg/sql/execinfrapb/processors_sql.proto so that we could serialize it correctly.

I already have corr in pkg/sql/execinfrapb/processors_sql.proto. What additional changes do we need?

@yuzefovich

This comment has been minimized.

Copy link
Contributor

yuzefovich commented Feb 2, 2020

Oops, my bad, indeed. Like I said, I briefly skimmed the PR, apparently too briefly :) I'll give a more thorough review tomorrow.

Copy link
Contributor

yuzefovich left a comment

Very nice work!

I have some comments and some nits.

Reviewed 15 of 15 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @hueypark and @RaduBerinde)


pkg/sql/opt_exec_factory.go, line 448 at r1 (raw file):

		agg := &aggregations[i]
		builtin := agg.Builtin
		var renderIdx []int

pkg/sql/walk.go, line 391 at r1 (raw file):

				} else {
					fmt.Fprintf(&buf, "%s(", agg.funcName)
					if 0 < len(agg.argRenderIdxs) {

[super nit]: it's slightly unusual to see the comparison against a number when the number is on the left-hand side.


pkg/sql/logictest/testdata/logic_test/aggregate, line 1244 at r1 (raw file):


query RR
SELECT corr(y, x)::decimal, corr(int_y, int_x)::decimal FROM statistics_agg_test;

nit: a semicolon is not needed in the single statement logic test.


pkg/sql/logictest/testdata/logic_test/aggregate, line 1247 at r1 (raw file):

----
0.9326733179802503 0.9326733179802503

It'd be nice to see an out of range case here as well.


pkg/sql/opt/exec/execbuilder/relational.go, line 922 at r1 (raw file):

		var filterOrd exec.ColumnOrdinal = -1

		switch item.Agg.Op() {

I think that special casing for CorrOp here is not necessary. Could you refactor what is now default case to support multiple children?


pkg/sql/opt/memo/extract.go, line 120 at r1 (raw file):

	}

	switch e.Op() {

ditto for no special casing.


pkg/sql/sem/builtins/aggregate_builtins.go, line 941 at r1 (raw file):

// Add implements tree.AggregateFunc interface.
func (a *corrAggregate) Add(_ context.Context, datumY tree.Datum, otherArgs ...tree.Datum) error {

Hm, the order of arguments seems suspicious to me - I'd expect datumX to be mentioned explicitly and datumY coming for the variable-args.


pkg/sql/sem/builtins/aggregate_builtins.go, line 973 at r1 (raw file):

		math.IsInf(a.sy2, 0) ||
		math.IsInf(a.sxy, 0) {
		return pgerror.New(pgcode.NumericValueOutOfRange, "float out of range")

nit: you could create this error once. Or you could export errFloatOutOfRange from pkg/sql/sem/tree/eval.go and reuse it here.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1011 at r1 (raw file):

// Reset implements tree.AggregateFunc interface.
func (a *corrAggregate) Reset(context.Context) {
	//a.count = 0

This seems incomplete.

@yuzefovich

This comment has been minimized.

Copy link
Contributor

yuzefovich commented Feb 3, 2020

The empty comment was:

  agg := &aggregations[i]
  builtin := agg.Builtin
  var renderIdx []int

(nit) s/renderIdx/renderIdxs/g.

Also, it might be beneficial to type out the formula for computing the correlation somewhere in corrAggregate.

Copy link
Contributor

andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @hueypark, @RaduBerinde, and @yuzefovich)


pkg/sql/logictest/testdata/logic_test/aggregate, line 1248 at r1 (raw file):

0.9326733179802503 0.9326733179802503

subtest string_agg

I think we need a wider range of tests. Some tests that come to mind:

  1. Correlation of -1, 0, 1
  2. NaN
  3. Out of range (as Yahor says)
  4. NULL values

Think hard about what other tests would be interesting, making sure you handle all the cases in the code you wrote.


pkg/sql/opt/exec/execbuilder/relational.go, line 922 at r1 (raw file):

Previously, yuzefovich wrote…

I think that special casing for CorrOp here is not necessary. Could you refactor what is now default case to support multiple children?

+1


pkg/sql/opt/exec/execbuilder/relational.go, line 928 at r1 (raw file):

				child := item.Agg.Child(j)

				if aggFilter, ok := child.(*memo.AggFilterExpr); ok {

AggFilterExpr and AggDistinctExpr should only occur once. We've never done multiple argument aggregates before, and I'm not sure they'll work correctly with this new function. +@RaduBerinde for his perspective on this.

Separately, this shows a need for having thorough tests for FILTER, DISTINCT, and ORDER BY used with the CORR aggregate function. For example:

select corr(distinct y, x order by x, y) filter (where x>1 and y>1) from statistics_agg_test;

The input values would have to be more interesting, containing duplicates and values both greater and less than 1 for x and yes. It's a tricky set of cases, and unfortunately I'm not sure we can handle these cases correctly today.


pkg/sql/opt/memo/extract.go, line 120 at r1 (raw file):

Previously, yuzefovich wrote…

ditto for no special casing.

+1

Copy link
Member

RaduBerinde left a comment

I also noticed that there is some optimizer builder code which assumes all but the first argument is a constant. We'll need to fix that too: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/opt/optbuilder/groupby.go#L358

@andy-kimball and myself are considering moving the AggDistinct and AggFilter modifiers up on top of the aggregate function (similar to how windowing modifiers work). They are currently under the function, on the input side, which makes it hard to extend things to multiple arguments. I think once we do that, it will be much easier to add corr().

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @hueypark, @RaduBerinde, and @yuzefovich)

andy-kimball added a commit to andy-kimball/cockroach that referenced this pull request Feb 7, 2020
PR cockroachdb#44628 is attempting to add the corr() aggregate function, which
will be the first true multi-argument aggregate in CRDB (besides
string_agg, which is special-cased). That PR runs into problems
when dealing with aggregate DISTINCT and FILTER clauses. The main
issue is that the optimizer currently adds the AggDistinct and
AggFilter operators as *inputs* to the aggregate function. This
doesn't make sense when there are multiple inputs, since it's now
ambiguous which input to use for that purpose.

This commit flips the situation by wrapping the aggregate function
with AggDistinct and AggFilter rather than having them wrap the
input to the aggregate function:

  (AggFilter (AggDistinct (Sum (Variable 1))))

instead of:

  (Sum (AggFilter (AggDistinct (Variable 1))))

As part of this change, it was also convenient to refactor the
EliminateAggDistinctForKeys rule and also to consolidate the
ReplaceScalarMinWithLimit and ReplaceScalarMaxWithLimit rules
into one rule.

Release note: None
andy-kimball added a commit to andy-kimball/cockroach that referenced this pull request Feb 8, 2020
PR cockroachdb#44628 is attempting to add the corr() aggregate function, which
will be the first true multi-argument aggregate in CRDB (besides
string_agg, which is special-cased). That PR runs into problems
when dealing with aggregate DISTINCT and FILTER clauses. The main
issue is that the optimizer currently adds the AggDistinct and
AggFilter operators as *inputs* to the aggregate function. This
doesn't make sense when there are multiple inputs, since it's now
ambiguous which input to use for that purpose.

This commit flips the situation by wrapping the aggregate function
with AggDistinct and AggFilter rather than having them wrap the
input to the aggregate function:

  (AggFilter (AggDistinct (Sum (Variable 1))))

instead of:

  (Sum (AggFilter (AggDistinct (Variable 1))))

As part of this change, it was also convenient to refactor the
EliminateAggDistinctForKeys rule and also to consolidate the
ReplaceScalarMinWithLimit and ReplaceScalarMaxWithLimit rules
into one rule.

Release note: None
andy-kimball added a commit to andy-kimball/cockroach that referenced this pull request Feb 8, 2020
PR cockroachdb#44628 is attempting to add the corr() aggregate function, which
will be the first true multi-argument aggregate in CRDB (besides
string_agg, which is special-cased). That PR runs into problems
when dealing with aggregate DISTINCT and FILTER clauses. The main
issue is that the optimizer currently adds the AggDistinct and
AggFilter operators as *inputs* to the aggregate function. This
doesn't make sense when there are multiple inputs, since it's now
ambiguous which input to use for that purpose.

This commit flips the situation by wrapping the aggregate function
with AggDistinct and AggFilter rather than having them wrap the
input to the aggregate function:

  (AggFilter (AggDistinct (Sum (Variable 1))))

instead of:

  (Sum (AggFilter (AggDistinct (Variable 1))))

As part of this change, it was also convenient to refactor the
EliminateAggDistinctForKeys rule and also to consolidate the
ReplaceScalarMinWithLimit and ReplaceScalarMaxWithLimit rules
into one rule.

Release note: None
andy-kimball added a commit to andy-kimball/cockroach that referenced this pull request Feb 8, 2020
PR cockroachdb#44628 is attempting to add the corr() aggregate function, which
will be the first true multi-argument aggregate in CRDB (besides
string_agg, which is special-cased). That PR runs into problems
when dealing with aggregate DISTINCT and FILTER clauses. The main
issue is that the optimizer currently adds the AggDistinct and
AggFilter operators as *inputs* to the aggregate function. This
doesn't make sense when there are multiple inputs, since it's now
ambiguous which input to use for that purpose.

This commit flips the situation by wrapping the aggregate function
with AggDistinct and AggFilter rather than having them wrap the
input to the aggregate function:

  (AggFilter (AggDistinct (Sum (Variable 1))))

instead of:

  (Sum (AggFilter (AggDistinct (Variable 1))))

As part of this change, it was also convenient to refactor the
EliminateAggDistinctForKeys rule and also to consolidate the
ReplaceScalarMinWithLimit and ReplaceScalarMaxWithLimit rules
into one rule.

Release note: None
@hueypark hueypark force-pushed the hueypark:support-corr branch from 37d4c61 to e3fb03e Feb 9, 2020
Copy link
Contributor Author

hueypark left a comment

Thank you for the review!

I also noticed that there is some optimizer builder code which assumes all but the first argument is a constant. We'll need to fix that too: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/opt/optbuilder/groupby.go#L358

Done.

@andy-kimball and myself are considering moving the AggDistinct and AggFilter modifiers up on top of the aggregate function (similar to how windowing modifiers work). They are currently under the function, on the input side, which makes it hard to extend things to multiple arguments. I think once we do that, it will be much easier to add corr().

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/opt_exec_factory.go, line 448 at r1 (raw file):

(nit) s/renderIdx/renderIdxs/g.
Done.

Also, it might be beneficial to type out the formula for computing the correlation somewhere in corrAggregate.
I don't know exactly what you mean.


pkg/sql/walk.go, line 391 at r1 (raw file):

Previously, yuzefovich wrote…

[super nit]: it's slightly unusual to see the comparison against a number when the number is on the left-hand side.

Done.


pkg/sql/logictest/testdata/logic_test/aggregate, line 1244 at r1 (raw file):

Previously, yuzefovich wrote…

[nit]: a semicolon is not needed in the single statement logic test.

Done.


pkg/sql/logictest/testdata/logic_test/aggregate, line 1247 at r1 (raw file):

Previously, yuzefovich wrote…

It'd be nice to see an out of range case here as well.

Done.


pkg/sql/logictest/testdata/logic_test/aggregate, line 1248 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I think we need a wider range of tests. Some tests that come to mind:

  1. Correlation of -1, 0, 1
  2. NaN
  3. Out of range (as Yahor says)
  4. NULL values

Think hard about what other tests would be interesting, making sure you handle all the cases in the code you wrote.

Thank you for your comment on tests. Done.


pkg/sql/opt/exec/execbuilder/relational.go, line 922 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

+1

Done.


pkg/sql/opt/exec/execbuilder/relational.go, line 928 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

AggFilterExpr and AggDistinctExpr should only occur once. We've never done multiple argument aggregates before, and I'm not sure they'll work correctly with this new function. +@RaduBerinde for his perspective on this.

Separately, this shows a need for having thorough tests for FILTER, DISTINCT, and ORDER BY used with the CORR aggregate function. For example:

select corr(distinct y, x order by x, y) filter (where x>1 and y>1) from statistics_agg_test;

The input values would have to be more interesting, containing duplicates and values both greater and less than 1 for x and yes. It's a tricky set of cases, and unfortunately I'm not sure we can handle these cases correctly today.

Partly done except aggregate with both DISTINCT and ORDER BY.

Maybe we can handle sql: aggregate with both DISTINCT and ORDER BY not supported later.


pkg/sql/opt/memo/extract.go, line 120 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

+1

Done.


pkg/sql/sem/builtins/aggregate_builtins.go, line 941 at r1 (raw file):

Hm, the order of arguments seems suspicious to me - I'd expect datumX to be mentioned explicitly and datumY coming for the variable-args.

At first, I was suspicious too. But I follow PostgreSQL and SQL 2003 rule.


pkg/sql/sem/builtins/aggregate_builtins.go, line 973 at r1 (raw file):

Previously, yuzefovich wrote…

[nit]: you could create this error once. Or you could export errFloatOutOfRange from pkg/sql/sem/tree/eval.go and reuse it here.

Done.


pkg/sql/sem/builtins/aggregate_builtins.go, line 1011 at r1 (raw file):

Previously, yuzefovich wrote…

This seems incomplete.

Done.

craig bot pushed a commit that referenced this pull request Feb 9, 2020
44839: opt: prepare to support multi-argument aggregate functions r=andy-kimball a=andy-kimball

PR #44628 is attempting to add the corr() aggregate function, which
will be the first true multi-argument aggregate in CRDB (besides
string_agg, which is special-cased). That PR runs into problems
when dealing with aggregate DISTINCT and FILTER clauses. The main
issue is that the optimizer currently adds the AggDistinct and
AggFilter operators as *inputs* to the aggregate function. This
doesn't make sense when there are multiple inputs, since it's now
ambiguous which input to use for that purpose.

This commit flips the situation by wrapping the aggregate function
with AggDistinct and AggFilter rather than having them wrap the
input to the aggregate function:

  (AggFilter (AggDistinct (Sum (Variable 1))))

instead of:

  (Sum (AggFilter (AggDistinct (Variable 1))))

As part of this change, it was also convenient to refactor the
EliminateAggDistinctForKeys rule and also to consolidate the
ReplaceScalarMinWithLimit and ReplaceScalarMaxWithLimit rules
into one rule.

Release note: None

Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
@andy-kimball

This comment has been minimized.

Copy link
Contributor

andy-kimball commented Feb 9, 2020

I just merged #44839, which prepares the optimizer for multi-argument aggregation functions.

@andy-kimball

This comment has been minimized.

Copy link
Contributor

andy-kimball commented Feb 9, 2020

One idea (but not required) is to split this work into two commits (or PR's):

  • The first would be to add support for multi-argument aggregate functions by enabling string_agg to have 2 variable arguments (today, the 2nd arg must be constant). This PR would include a big chunk of what you've done in this PR.
  • The second would be to add the corr() function using the framework you enabled in the first PR. This would contain the remaining chunk of work that is specific to corr.

The nice thing about this approach is that the 2nd PR would then become the "template" that others could follow when adding other aggregate functions in the future. As it is, this PR is really doing 2 things at once, and it can be challenging to separate what's needed for the 1st vs. what's needed for the 2nd.

@hueypark hueypark force-pushed the hueypark:support-corr branch from e3fb03e to a67c208 Feb 9, 2020
Copy link
Contributor Author

hueypark left a comment

Thank you for merge!

I splitted this work into two commits.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)

@hueypark hueypark force-pushed the hueypark:support-corr branch from a67c208 to dbcc32e Feb 10, 2020
Copy link
Contributor Author

hueypark left a comment

Hmm... some TeamCity build failed for no reason. Can anyone check this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)

@RaduBerinde

This comment has been minimized.

Copy link
Member

RaduBerinde commented Feb 10, 2020

I restarted it. I believe we have some known flaky roachtests.

Copy link
Contributor Author

hueypark left a comment

Thank you for checking.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)

Copy link
Contributor Author

hueypark left a comment

Can I ask when we are going to merge this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)

Copy link
Contributor

andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @hueypark, @RaduBerinde, and @yuzefovich)


pkg/sql/opt_exec_factory.go, line 450 at r3 (raw file):

		var aggFn func(*tree.EvalContext, tree.Datums) tree.AggregateFunc

		switch len(agg.ArgCols) {

Can you generalize this into a loop, rather than having one case per #ArgCols?


pkg/sql/walk.go, line 425 at r2 (raw file):

						}
						for _, idx := range agg.argRenderIdxs {
							buf.WriteString(inputCols[idx].Name)

You need to separate arguments with a comma, otherwise you get this:

root@:26257/defaultdb> explain select corr(x, y) from (values (1, 1), (2, 2)) t(x, y);
     tree     |    field    |     description
--------------+-------------+-----------------------
              | distributed | true
              | vectorized  | false
  group       |             |
   │          | aggregate 0 | corr(column1column2)
   │          | scalar      |
   └── values |             |
              | size        | 2 columns, 2 rows

Notice how column1 and column2 are smashed together. Also, make sure to add a test case for this in execbuilder/testdata/explain to make sure we handle EXPLAIN with multiple args.


pkg/sql/opt/operator.go, line 256 at r3 (raw file):

// AggregateIgnoresNulls returns true if the given aggregate operator has a
// single input, and if it always evaluates to the same result regardless of

CorrOp does not belong here, since it does not have a single input.


pkg/sql/opt/operator.go, line 269 at r3 (raw file):

// AggregateIsNullOnEmpty returns true if the given aggregate operator has a
// single input, and if it returns NULL when the input set contains no values.

Ditto.


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

		args := make([]opt.ScalarExpr, 0, 2)
		for i, arg := range agg.args {
			switch agg.def.Name {

What I had in mind here when I put the comment was to get rid of the special case for string_agg altogether. If you decide you want to do that, then it'd mean adding new string_agg related test cases in various places. If you'd rather let me do it, then leave in the comment as a marker, so I don't forget to fix string_agg.

Regardless of which way you go, we do need test cases for DISTINCT and FILTER, when used with multiple agg args. Those could be added in this commit, if you enable string_agg with multiple variable args, or they could go in the corr commit. We'd want those tests at multiple levels:

  opt/optbuilder/testdata/aggregate (for building the opt tree)
  opt/execbuilder/testdata/aggregate (for EXPLAIN case)
  opt/norm/testdata/rules/agg (for making sure that DISTINCT/FILTER rules work properly with aggs of multiple args)
  pkg/sql/logictest/testdata/logic_test/aggregate (to make sure multiple args work in regular execution)
  pkg/sql/logictest/testdata/logic_test/distsql_agg (to make sure multiple args work with distsql)
  

This is not easy, I know, but this is a pretty important change, and we want to make sure it's tested in depth, as it affects so many components in the query stack.

When I suggested breaking the PR into two commits, I had in mind using string_agg test cases as the way to test this commit. Once you've thoroughly tested multi-arg aggs by testing string_agg, you wouldn't need to do that same level of testing with corr. There, your focus would be on testing the implementation of corr rather than on the fact that it had multiple args (which should already be tested sufficiently by string_agg tests).


pkg/sql/opt/optbuilder/groupby.go, line 773 at r3 (raw file):

	case "concat_agg":
		return b.factory.ConstructConcatAgg(args[0])
	case "corr":

Anytime you change an optbuilder file, there should be a test for that functionality in the testdata directory. In this case, I'd expect a test of some kind in the aggregate file. In general, we try to unit test thoroughly at each layer.


pkg/sql/sem/builtins/aggregate_builtins.go, line 930 at r3 (raw file):

}

// corrAggregate represents SQL:2003 correlation coefficient.

Add to this comment, giving a description of the correlation coefficient, perhaps copied straight from the SQL 2003 spec.

Copy link
Contributor

andy-kimball left a comment

Sorry for the delay, a lot going on. Thanks for splitting the PR into two commits. There are still various issues to address, but this is coming along. Thanks for working on it!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @hueypark, @RaduBerinde, and @yuzefovich)

Copy link
Contributor Author

hueypark left a comment

Thank you for your review. I'll check the details this weekend.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @hueypark, @RaduBerinde, and @yuzefovich)

Copy link
Contributor Author

hueypark left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/opt_exec_factory.go, line 450 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Can you generalize this into a loop, rather than having one case per #ArgCols?

Done.


pkg/sql/walk.go, line 425 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

You need to separate arguments with a comma, otherwise you get this:

root@:26257/defaultdb> explain select corr(x, y) from (values (1, 1), (2, 2)) t(x, y);
     tree     |    field    |     description
--------------+-------------+-----------------------
              | distributed | true
              | vectorized  | false
  group       |             |
   │          | aggregate 0 | corr(column1column2)
   │          | scalar      |
   └── values |             |
              | size        | 2 columns, 2 rows

Notice how column1 and column2 are smashed together. Also, make sure to add a test case for this in execbuilder/testdata/explain to make sure we handle EXPLAIN with multiple args.

Done.


pkg/sql/opt/operator.go, line 256 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

CorrOp does not belong here, since it does not have a single input.

Done.


pkg/sql/opt/operator.go, line 269 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Ditto.

Done.


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

What I had in mind here when I put the comment was to get rid of the special case for string_agg altogether. If you decide you want to do that, then it'd mean adding new string_agg related test cases in various places. If you'd rather let me do it, then leave in the comment as a marker, so I don't forget to fix string_agg.

Regardless of which way you go, we do need test cases for DISTINCT and FILTER, when used with multiple agg args. Those could be added in this commit, if you enable string_agg with multiple variable args, or they could go in the corr commit. We'd want those tests at multiple levels:

  opt/optbuilder/testdata/aggregate (for building the opt tree)
  opt/execbuilder/testdata/aggregate (for EXPLAIN case)
  opt/norm/testdata/rules/agg (for making sure that DISTINCT/FILTER rules work properly with aggs of multiple args)
  pkg/sql/logictest/testdata/logic_test/aggregate (to make sure multiple args work in regular execution)
  pkg/sql/logictest/testdata/logic_test/distsql_agg (to make sure multiple args work with distsql)
  

This is not easy, I know, but this is a pretty important change, and we want to make sure it's tested in depth, as it affects so many components in the query stack.

When I suggested breaking the PR into two commits, I had in mind using string_agg test cases as the way to test this commit. Once you've thoroughly tested multi-arg aggs by testing string_agg, you wouldn't need to do that same level of testing with corr. There, your focus would be on testing the implementation of corr rather than on the fact that it had multiple args (which should already be tested sufficiently by string_agg tests).

Done.
If we want more tests. Please let me know.


pkg/sql/opt/optbuilder/groupby.go, line 773 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Anytime you change an optbuilder file, there should be a test for that functionality in the testdata directory. In this case, I'd expect a test of some kind in the aggregate file. In general, we try to unit test thoroughly at each layer.

Done.


pkg/sql/sem/builtins/aggregate_builtins.go, line 930 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Add to this comment, giving a description of the correlation coefficient, perhaps copied straight from the SQL 2003 spec.

Done.

@hueypark hueypark force-pushed the hueypark:support-corr branch from dbcc32e to b296e22 Feb 16, 2020
Copy link
Contributor

andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @hueypark, @RaduBerinde, and @yuzefovich)


pkg/sql/opt_exec_factory.go, line 447 at r5 (raw file):

		agg := &aggregations[i]
		builtin := agg.Builtin
		var renderIdxs []int

Should preallocate these to avoid unnecessary allocations during resizing:

renderIdxs := make([]int, len(agg.ArgCols));
params := make([]*types.T, len(agg.ArgCols));

pkg/sql/walk.go, line 429 at r5 (raw file):

							names = append(names, inputCols[idx].Name)
						}
						buf.WriteString(strings.Join(names, ", "))

NIT: Multiple unnecessary allocations here. Instead, always follow this pattern in cases like this:

for i, idx := range agg.argRenderIdxs {
  if i != 0 {
    buf.WriteString(", ")
  }
  buf.WriteString(inputCols[idx].Name)
}

pkg/sql/opt/operator.go, line 256 at r3 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Done.

After seeing the change you had to make to decorrelate_test.go, I looked into this more deeply. We should change this comment to:

// AggregateIgnoreNulls returns true if the given aggregate operator ignores
// rows where its first argument evaluates to NULL. In other words, it always
// evaluates to the same result even if those rows are filtered. For example:
//
//   SELECT string_agg(x, y)
//   FROM (VALUES ('foo', ','), ('bar', ','), (NULL, ',')) t(x, y)
//
// In this example, the NULL row can be removed from the input, and the
// string_agg function still returns the same result. Contrast this to the
// array_agg function:
//
//   SELECT array_agg(x)
//   FROM (VALUES ('foo'), (NULL), ('bar')) t(x)
//
// If the NULL row is removed here, array_agg returns {foo,bar} instead of
// {foo,NULL,bar}.

Then you can add CorrOp back into the list, since it does seem that it skips input rows where its first argument is NULL. You should also add a couple tests. Add this to opt/norm/testdata/rules/decorrelate:

# With a multi-argument aggregate.
norm expect=TryDecorrelateScalarGroupBy
SELECT k, (SELECT string_agg(a.s, ',') FROM a WHERE a.k = a2.i) FROM a AS a2
----

The expected result is that the correlated subquery is turned into a regular LEFT JOIN.

And this test to opt/norm/testdata/rules/reject_nulls:

# Use with multi-argument aggregate function.
norm expect=RejectNullsGroupBy
SELECT string_agg(s, ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s, ',')='foo'
----

The expected result is the LEFT JOIN gets converted into an INNER JOIN.


pkg/sql/opt/operator.go, line 269 at r3 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Done.

Change this comment to:

// AggregateIsNullOnEmpty returns true if the given aggregate operator returns
// NULL when the input set contains no values. This group of aggregates overlaps
// considerably with the AggregateIgnoresNulls group, with the notable exception
// of COUNT, which returns zero instead of NULL when its input is empty.

And then add back CorrOp.


pkg/sql/opt/norm/decorrelate_test.go, line 27 at r5 (raw file):

	for op := range opt.AggregateOpReverseMap {
		switch op {
		case opt.CorrOp, opt.CountRowsOp:

Revert this once the Aggregate boolean functions in operator.go are updated.


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Done.
If we want more tests. Please let me know.

This should just be:

  colID := argCols[0].id
  args = append(args, b.factory.ConstructVariable(colID))

We just want to make every argument a Variable. I think there's a subtle bug with the way you're doing it for cases like:

string_agg(s, 'foo', s, 'bar')

We have an assumption when building the execution operators that all constant operands are at the end of the argument list, but that'd be violated by this code.

After fixing this, make sure to add a couple interesting test cases (in logic_test/aggregate).


pkg/sql/sem/builtins/aggregate_builtins.go, line 930 at r3 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Done.

Very nice comment, thanks.

@hueypark hueypark force-pushed the hueypark:support-corr branch from b296e22 to cc7dd21 Feb 18, 2020
Copy link
Contributor Author

hueypark left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/opt_exec_factory.go, line 447 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Should preallocate these to avoid unnecessary allocations during resizing:

renderIdxs := make([]int, len(agg.ArgCols));
params := make([]*types.T, len(agg.ArgCols));

Done.


pkg/sql/walk.go, line 429 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

NIT: Multiple unnecessary allocations here. Instead, always follow this pattern in cases like this:

for i, idx := range agg.argRenderIdxs {
  if i != 0 {
    buf.WriteString(", ")
  }
  buf.WriteString(inputCols[idx].Name)
}

Done.


pkg/sql/opt/operator.go, line 256 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

After seeing the change you had to make to decorrelate_test.go, I looked into this more deeply. We should change this comment to:

// AggregateIgnoreNulls returns true if the given aggregate operator ignores
// rows where its first argument evaluates to NULL. In other words, it always
// evaluates to the same result even if those rows are filtered. For example:
//
//   SELECT string_agg(x, y)
//   FROM (VALUES ('foo', ','), ('bar', ','), (NULL, ',')) t(x, y)
//
// In this example, the NULL row can be removed from the input, and the
// string_agg function still returns the same result. Contrast this to the
// array_agg function:
//
//   SELECT array_agg(x)
//   FROM (VALUES ('foo'), (NULL), ('bar')) t(x)
//
// If the NULL row is removed here, array_agg returns {foo,bar} instead of
// {foo,NULL,bar}.

Then you can add CorrOp back into the list, since it does seem that it skips input rows where its first argument is NULL. You should also add a couple tests. Add this to opt/norm/testdata/rules/decorrelate:

# With a multi-argument aggregate.
norm expect=TryDecorrelateScalarGroupBy
SELECT k, (SELECT string_agg(a.s, ',') FROM a WHERE a.k = a2.i) FROM a AS a2
----

The expected result is that the correlated subquery is turned into a regular LEFT JOIN.

And this test to opt/norm/testdata/rules/reject_nulls:

# Use with multi-argument aggregate function.
norm expect=RejectNullsGroupBy
SELECT string_agg(s, ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s, ',')='foo'
----

The expected result is the LEFT JOIN gets converted into an INNER JOIN.

Done.


pkg/sql/opt/operator.go, line 269 at r3 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Change this comment to:

// AggregateIsNullOnEmpty returns true if the given aggregate operator returns
// NULL when the input set contains no values. This group of aggregates overlaps
// considerably with the AggregateIgnoresNulls group, with the notable exception
// of COUNT, which returns zero instead of NULL when its input is empty.

And then add back CorrOp.

Done.


pkg/sql/opt/norm/decorrelate_test.go, line 27 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Revert this once the Aggregate boolean functions in operator.go are updated.

Done.


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

This should just be:

  colID := argCols[0].id
  args = append(args, b.factory.ConstructVariable(colID))

We just want to make every argument a Variable. I think there's a subtle bug with the way you're doing it for cases like:

string_agg(s, 'foo', s, 'bar')

We have an assumption when building the execution operators that all constant operands are at the end of the argument list, but that'd be violated by this code.

After fixing this, make sure to add a couple interesting test cases (in logic_test/aggregate).

We should be able to treat the argument as a constant. If not, we can not support RejectNullsGroupBy for string_agg.

Right now we can't treat the first argument as a constant, so I left a comment.

Copy link
Contributor

andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @hueypark, @RaduBerinde, and @yuzefovich)


pkg/sql/walk.go, line 425 at r2 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Done.

This needs a test in exec/execbuilder/testdata/explain.


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, hueypark (Jaewan Park) wrote…

We should be able to treat the argument as a constant. If not, we can not support RejectNullsGroupBy for string_agg.

Right now we can't treat the first argument as a constant, so I left a comment.

I'm not sure I understand. Can you show me the problematic case that you're worried about? I believe this should just work:

for i, arg := range agg.args {
  colID := argCols[0].id
  args = append(args, b.factory.ConstructVariable(colID))

  // Skip past argCols that have been handled. There may be variable
  // number of them, so need to set up for next aggregate function.
  argCols = argCols[1:]
}

If you look at the plans that are generated, even constants are just projected input variables to the aggregates. Once we do this, we could theoretically remove support for aggregation argument constants during execution.

@hueypark hueypark force-pushed the hueypark:support-corr branch from cc7dd21 to e94c67a Feb 19, 2020
Copy link
Contributor Author

hueypark left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/walk.go, line 425 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

This needs a test in exec/execbuilder/testdata/explain.

Oh, I forgot this.
I added it now.


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I'm not sure I understand. Can you show me the problematic case that you're worried about? I believe this should just work:

for i, arg := range agg.args {
  colID := argCols[0].id
  args = append(args, b.factory.ConstructVariable(colID))

  // Skip past argCols that have been handled. There may be variable
  // number of them, so need to set up for next aggregate function.
  argCols = argCols[1:]
}

If you look at the plans that are generated, even constants are just projected input variables to the aggregates. Once we do this, we could theoretically remove support for aggregation argument constants during execution.

I created the third commit for a problematic case.

Pay attention to below test in /sql/opt/norm/testdata/rules/reject_nulls.

SELECT string_agg(s, ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s, ',')='foo'

I can't find a starting point for troubleshooting. Can you advise something?

Copy link
Contributor

andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, hueypark (Jaewan Park) wrote…

I created the third commit for a problematic case.

Pay attention to below test in /sql/opt/norm/testdata/rules/reject_nulls.

SELECT string_agg(s, ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s, ',')='foo'

I can't find a starting point for troubleshooting. Can you advise something?

OK, thanks, I see the problem. It looks like we don't yet have support for Project in DeriveRejectNullCols. To add that support, add these lines to the switch:

	case opt.ProjectOp:
		// Pass through all null-rejection columns that the Project passes through.
		// The PushSelectIntoProject rule is able to push the IS NOT NULL filter
		// below the Project for these columns.
		rejectNullCols := DeriveRejectNullCols(in.Child(0).(memo.RelExpr))
		relProps.Rule.RejectNullCols = relProps.OutputCols.Intersection(rejectNullCols)

Also, add this test to reject_nulls:

# Don't reject nulls when aggregate argument is a not a Project passthrough
# column.
norm expect-not=RejectNullsGroupBy
SELECT string_agg(s || 'bar', ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s || 'bar', ',')='foo'
----

If you're interested in more background information about null rejection, see the comment at the top of reject_nulls.opt and the comment on the props.Relational.Rule.RejectNullColls rule.

And here is a seminal academic paper that covers null rejection: https://www.researchgate.net/publication/220225172_Outerjoin_Simplification_and_Reordering_for_Query_Optimization

Copy link
Contributor

andy-kimball left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, hueypark (Jaewan Park) wrote…

I created the third commit for a problematic case.

Pay attention to below test in /sql/opt/norm/testdata/rules/reject_nulls.

SELECT string_agg(s, ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s, ',')='foo'

I can't find a starting point for troubleshooting. Can you advise something?

OK, thanks, I see the problem. It looks like we don't yet have support for Project in DeriveRejectNullCols. To add that support, add these lines to the switch:

	case opt.ProjectOp:
		// Pass through all null-rejection columns that the Project passes through.
		// The PushSelectIntoProject rule is able to push the IS NOT NULL filter
		// below the Project for these columns.
		rejectNullCols := DeriveRejectNullCols(in.Child(0).(memo.RelExpr))
		relProps.Rule.RejectNullCols = relProps.OutputCols.Intersection(rejectNullCols)

Also, add this test to reject_nulls:

# Don't reject nulls when aggregate argument is a not a Project passthrough
# column.
norm expect-not=RejectNullsGroupBy
SELECT string_agg(s || 'bar', ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s || 'bar', ',')='foo'
----

If you're interested in more background information about null rejection, see the comment at the top of reject_nulls.opt and the comment on the props.Relational.Rule.RejectNullColls rule.

And here is a seminal academic paper that covers null rejection: https://www.researchgate.net/publication/220225172_Outerjoin_Simplification_and_Reordering_for_Query_Optimization

@hueypark hueypark force-pushed the hueypark:support-corr branch from e94c67a to a30c578 Feb 19, 2020
Copy link
Contributor Author

hueypark left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)


pkg/sql/opt/optbuilder/groupby.go, line 354 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

OK, thanks, I see the problem. It looks like we don't yet have support for Project in DeriveRejectNullCols. To add that support, add these lines to the switch:

	case opt.ProjectOp:
		// Pass through all null-rejection columns that the Project passes through.
		// The PushSelectIntoProject rule is able to push the IS NOT NULL filter
		// below the Project for these columns.
		rejectNullCols := DeriveRejectNullCols(in.Child(0).(memo.RelExpr))
		relProps.Rule.RejectNullCols = relProps.OutputCols.Intersection(rejectNullCols)

Also, add this test to reject_nulls:

# Don't reject nulls when aggregate argument is a not a Project passthrough
# column.
norm expect-not=RejectNullsGroupBy
SELECT string_agg(s || 'bar', ',')
FROM (SELECT x FROM xy)
LEFT JOIN (SELECT k, s FROM a)
ON True
GROUP BY k
HAVING string_agg(s || 'bar', ',')='foo'
----

If you're interested in more background information about null rejection, see the comment at the top of reject_nulls.opt and the comment on the props.Relational.Rule.RejectNullColls rule.

And here is a seminal academic paper that covers null rejection: https://www.researchgate.net/publication/220225172_Outerjoin_Simplification_and_Reordering_for_Query_Optimization

Done.

Thank you for the detailed explanation.

Copy link
Contributor

andy-kimball left a comment

:lgtm:, thanks for all the work on this. Looks like we're now set up to add the other statistics-related aggregate functions.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)

Copy link
Contributor Author

hueypark left a comment

Thank you for the review!

Could we merge this immediately? In a sprint planned tomorrow, someone could implement aggregate functions.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @RaduBerinde, and @yuzefovich)

@RaduBerinde

This comment has been minimized.

Copy link
Member

RaduBerinde commented Feb 21, 2020

@hueypark there are some failing tests in the CI run (TestNormRules, TestBuilder/aggregate). Once they are addressed we can merge.

@RaduBerinde

This comment has been minimized.

Copy link
Member

RaduBerinde commented Feb 21, 2020

If it helps, I can also push this into a temporary branch that can be used for the sprint.

@hueypark

This comment has been minimized.

Copy link
Contributor Author

hueypark commented Feb 21, 2020

@hueypark there are some failing tests in the CI run (TestNormRules, TestBuilder/aggregate). Once they are addressed we can merge.

This is a little weird. Is there anything I missed?

If it helps, I can also push this into a temporary branch that can be used for the sprint.

No thanks. There are still other interesting issues.

@RaduBerinde

This comment has been minimized.

Copy link
Member

RaduBerinde commented Feb 21, 2020

You can repro with make testrace PKG=./pkg/sql/opt/norm

We have a bunch of assertions that only run in "race" mode. I think there's an assertion about string agg that is now incorrect (in memo.CheckExpr).

hueypark added 2 commits Feb 21, 2020
It is for issue #41274(sql: Support aggregate functions for statistics).

Release note: None
Support aggregate function `corr()`. It referred to the PostgreSQL implementation, but the implementation details follow SQL 2003. See: https://www.postgresql.org/docs/10/functions-aggregate.html

See #41274

Release note (sql change): Support aggregate function `corr()`
@hueypark hueypark force-pushed the hueypark:support-corr branch from a30c578 to 865e011 Feb 21, 2020
@hueypark

This comment has been minimized.

Copy link
Contributor Author

hueypark commented Feb 21, 2020

Thank you, I couldn't see testrace before you told me.

@hueypark

This comment has been minimized.

Copy link
Contributor Author

hueypark commented Feb 21, 2020

The test failed again with TestPrimaryKeyChangeWithOperations and something is strange. Can I get retry permission?

@RaduBerinde

This comment has been minimized.

Copy link
Member

RaduBerinde commented Feb 21, 2020

I doubt that test has anything to do with this change. Let's try to merge.

bors r+

@craig

This comment has been minimized.

Copy link

craig bot commented Feb 21, 2020

Build succeeded

@craig craig bot merged commit a7a7690 into cockroachdb:master Feb 21, 2020
3 of 4 checks passed
3 of 4 checks passed
GitHub CI (Cockroach) TeamCity build failed
Details
Jordan Test CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
@RaduBerinde

This comment has been minimized.

Copy link
Member

RaduBerinde commented Feb 22, 2020

Thanks for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.