xform: multiple MIN/MAX subqueries rewrite, filter support #70854

msirek · 2021-09-29T01:56:53Z

Previously, we only rewrote single-table queries with scalar
min and max aggregate expressions into separate subqueries
if there was no WHERE clause, for example:

    SELECT min(pk_col1), max(pk_col1) FROM t1;

becomes:

    SELECT (SELECT min(pk_col1) FROM t1),
           (SELECT max(pk_col1) FROM t1);

The purpose of this transformation was to allow each subquery
to utilize a limited scan via the primary key.

This was inadequate because it is of limited value.
Many user queries use a WHERE clause, so WHERE clause
support is crucial to allow this optimization to be
applied in more cases.

To address this, this patch adds a new rewrite rule named
ReplaceFilteredScalarMinMaxWithSubqueries to detect a Select
as input to two or more scalar MIN or MAX expressions. If present,
a new Select is built for each MIN or MAX expression with the same
Filters (WHERE clause) as the original Select, but with mapped column
IDs in a copy of the tree. Each new Select is used as input
to a newly constructed subquery and combined via a values clause.
The values clause replaces the original ScalarGroupByExpr.

Example:

    CREATE TABLE t1 (a INT, b INT, PRIMARY KEY(a));
    
    EXPLAIN (OPT) SELECT min(a), max(a) FROM t1 WHERE b < 6;
    ---------------------------------------------
      values
       └── tuple
            ├── subquery
            │    └── scalar-group-by
            │         ├── limit
            │         │    ├── select
            │         │    │    ├── scan t1
            │         │    │    └── filters
            │         │    │         └── b < 6
            │         │    └── 1
            │         └── aggregations
            │              └── const-agg
            │                   └── a
            └── subquery
                 └── scalar-group-by
                      ├── limit
                      │    ├── select
                      │    │    ├── scan t1,rev
                      │    │    └── filters
                      │    │         └── b < 6
                      │    └── 1
                      └── aggregations
                           └── const-agg
                                └── a

Release note (performance improvement): A SELECT query from
a single table with more than one MIN or MAX scalar aggregate
expression and a WHERE clause can now be performed with LIMITED SCANs,
one per aggregate expression, instead of a single FULL SCAN.

Note: No other aggregate function, such as SUM, may be present
in the query block in order for it to be eligible for
this transformation. This optimization should occur when
each MIN or MAX expression involves a leading index column,
so that a sort is not required for the limit operation, and
the resulting query plan will appear cheapest to the optimizer.

blathers-crl · 2021-09-29T01:56:55Z

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

cockroach-teamcity · 2021-09-29T01:57:00Z

This change is

rytaft

Looking good!

nit: in the PR message you can use backticks to format the code nicely (single backticks for inline code elements, or three backticks at the beginning and end for blocks of code). You can just update the PR message directly in GitHub.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek)

pkg/sql/opt/xform/groupby_funcs.go, line 57 at r1 (raw file):

		newScanExpr := c.e.f.ConstructScan(newScanPrivate)

		var inputExpr memo.RelExpr

nit: you could define this earlier (maybe outside the for loop) and then directly assign the output of ConstructScan to it.

pkg/sql/opt/xform/groupby_funcs.go, line 61 at r1 (raw file):

		// If the input to the scalar group by is a Select with filters, remap the
		// column IDs in the filters and use that to build a new Select.
		if grp.Child(0).Op() == opt.SelectOp {

instead of checking that the group is a Select, I would just check that len(filters) > 0

pkg/sql/opt/xform/rules/groupby.opt, line 21 at r1 (raw file):


# ReplaceFilteredScalarMinMaxWithSubqueries replaces a scalar group by from a
# SelectExpr having a Filter, a single ScanExpr and n > 1 MIN/MAX expressions

nit: all SelectExprs should have a filter -- I'd just say a "SelectExpr with a ScanExpr as input and...".

pkg/sql/opt/xform/testdata/rules/groupby, line 1187 at r1 (raw file):

           └── z:3

# We expect ReplaceScalarMinMaxWithScalarSubqueries not to be preferred here.

This comment isn't quite right. ReplaceScalarMinMaxWithScalarSubqueries cannot be applied because there is a filter. ReplaceFilteredScalarMinMaxWithSubqueries is applied, but the resulting plan is not chosen because the ordering of the scan does not match the aggregation column. (FYI, expect-not means the rule did not generate any new expressions in the memo. expect means the rule did generate a new expression in the memo, but it does not necessarily mean the new expression is chosen).

I would just leave the old comment as-is.

pkg/sql/opt/xform/testdata/rules/groupby, line 1333 at r1 (raw file):


# ReplaceFilteredScalarMinMaxWithSubqueries can be applied due to min/max on
# leading index column.

nit: the rule will be applied regardless -- the plan is actually chosen because the min/max is on the leading index column. I'd just remove this comment.

pkg/sql/opt/xform/testdata/rules/groupby, line 1402 at r1 (raw file):


# ReplaceFilteredScalarMinMaxWithSubqueries can be applied due to min/max on
# leading index column.

ditto

blathers-crl · 2021-09-30T00:08:55Z

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

msirek

Thanks for reviewing! Added the backticks.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)

pkg/sql/opt/xform/groupby_funcs.go, line 57 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: you could define this earlier (maybe outside the for loop) and then directly assign the output of ConstructScan to it.

Done.

pkg/sql/opt/xform/groupby_funcs.go, line 61 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

instead of checking that the group is a Select, I would just check that len(filters) > 0

Right, I guess this is after the rewrite phase, so the EliminateSelect rule probably wouldn't kick in. Made the change.

pkg/sql/opt/xform/rules/groupby.opt, line 21 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: all SelectExprs should have a filter -- I'd just say a "SelectExpr with a ScanExpr as input and...".

Done.

pkg/sql/opt/xform/testdata/rules/groupby, line 1187 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This comment isn't quite right. ReplaceScalarMinMaxWithScalarSubqueries cannot be applied because there is a filter. ReplaceFilteredScalarMinMaxWithSubqueries is applied, but the resulting plan is not chosen because the ordering of the scan does not match the aggregation column. (FYI, expect-not means the rule did not generate any new expressions in the memo. expect means the rule did generate a new expression in the memo, but it does not necessarily mean the new expression is chosen).

I would just leave the old comment as-is.

The old comment was: 'ReplaceScalarMinMaxWithScalarSubqueries can only be applied to canonical
scans', doesn't really describe this case any more. I cleaned up the comment to describe what's actually going on and moved this test down into the the ReplaceFilteredScalarMinMaxWithSubqueries section. The new comment is:

We expect the ReplaceFilteredScalarMinMaxWithSubqueries and

ReplaceScalarMinMaxWithLimit rules to fire, however, as we know nothing about

the ordering of y on the index zyx after a scan on xyz with z > 0, a sort is

required, making the new plan more expensive and therefore not selected.

opt expect=ReplaceFilteredScalarMinMaxWithSubqueries expect=ReplaceScalarMinMaxWithLimit
SELECT min(y), max(x) FROM xyz@zyx WHERE z > 0

pkg/sql/opt/xform/testdata/rules/groupby, line 1333 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: the rule will be applied regardless -- the plan is actually chosen because the min/max is on the leading index column. I'd just remove this comment.

Right, confusing statement. The leading index column just allows the sort requested in ReplaceScalarMinMaxWithLimit to be avoided. Removed these comments.

pkg/sql/opt/xform/testdata/rules/groupby, line 1402 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto

Removed.

mgartner

Great work!

Reviewed 2 of 5 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @rytaft)

-- commits, line 4 at r2:
nit: rewrite => rewrote

-- commits, line 6 at r2:
nit: is => was

-- commits, line 10 at r2:
nit: Adding empty lines around your examples here and below would make this easier to read.

-- commits, line 13 at r2:
nit: Make this a full sentence like "The purpose of this transformation was to allow each subquery to ...". It's a bit confusing to read this sentence extension after the examples.

-- commits, line 31 at r2:
This rule applies in more cases than described here. It applies for two or more min or max aggregates. This release note implies that the rule only works for min(a), max(b), but not min(a), min(b), or min(a), max(b), min(c). Maybe there's a way to make it more clear that these are optimized too?

-- commits, line 35 at r2:
nit: I'd move this example above the release note. I don't think an example in the release note is necessary.

pkg/sql/logictest/testdata/logic_test/aggregate, line 788 at r2 (raw file):


query FI
SELECT max(z), min(x) FROM xyz having (max(z), min(x)) = (select max(z), min(x) from xyz)

nit: capitalize keywords from, where, select, having in these 3 tests. I think we have a linter that will enforce this, but I'm not sure.

pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 909 at r2 (raw file):

# Filter on derived table is supported.
query T
EXPLAIN SELECT max(z), min(x) FROM (select x,y,z from xyz a) dt where dt.y > 0

nit: capitalize keywords

pkg/sql/opt/exec/execbuilder/testdata/aggregate, line 957 at r2 (raw file):

# Scalar subquery in filter is supported.
query T
EXPLAIN SELECT max(z), min(x) FROM xyz where (z,x) = (select max(z), min(x) from xyz)

nit: capitalize keywords

pkg/sql/opt/xform/groupby_funcs.go, line 61 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

Right, I guess this is after the rewrite phase, so the EliminateSelect rule probably wouldn't kick in. Made the change.

EliminateSelect should still kick in when you call c.e.f.ConstructSelect. If you look at the internals of that function you'll see the generated code for EliminateSelect. But you'll still want the if len(filters) > 0 { check because MapFilterCols will probably panic with nil filters. I think @rytaft's suggestion was for the sake of simplicity.

pkg/sql/opt/xform/groupby_funcs.go, line 133 at r2 (raw file):

		case opt.MinOp, opt.MaxOp:
			// The child of a min or max aggregation should always be a variable, but
			// we add a sanity check here anyway.

👍 Thanks for adding this!

pkg/sql/opt/xform/testdata/rules/groupby, line 1378 at r2 (raw file):


opt expect=ReplaceFilteredScalarMinMaxWithSubqueries
SELECT max(z), min(x) FROM xyz WHERE y is not null

nit: capitalize IS NOT NULL

pkg/sql/opt/xform/testdata/rules/groupby, line 1447 at r2 (raw file):

# expressions which are projections.
opt expect-not=ReplaceFilteredScalarMinMaxWithSubqueries
SELECT max(z+1), min(x) FROM xyz WHERE y is not null

nit: capitalize IS NOT NULL

pkg/sql/opt/xform/testdata/rules/groupby, line 1480 at r2 (raw file):

# ORDER BY should not disable ReplaceFilteredScalarMinMaxWithSubqueries.
opt expect=ReplaceFilteredScalarMinMaxWithSubqueries
SELECT max(z), min(x) FROM xyz WHERE y is not null order by 1

nit: capitalize IS NOT NULL ORDER BY

pkg/sql/opt/xform/testdata/rules/groupby, line 1550 at r2 (raw file):

# the ordering of y on the index zyx after a scan on xyz with z > 0, a sort is
# required, making the new plan more expensive and therefore not selected.
opt expect=ReplaceFilteredScalarMinMaxWithSubqueries expect=ReplaceScalarMinMaxWithLimit

To expect multipled rules, use expect=(FirstRule,SecondRule) rather than two expects.

rytaft

once you address comments from @mgartner. Nice work!

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @msirek)

Previously, we only rewrote single-table queries with scalar min and max aggregate expressions into separate subqueries if there was no WHERE clause, for example: SELECT min(pk_col1), max(pk_col1) FROM t1; becomes: SELECT (SELECT min(pk_col1) FROM t1), (SELECT max(pk_col1) FROM t1); The purpose of this transformation was to allow each subquery to utilize a limited scan via the primary key. This was inadequate because it is of limited value. Many user queries use a WHERE clause, so WHERE clause support is crucial to allow this optimization to be applied in more cases. To address this, this patch adds a new rewrite rule named ReplaceFilteredScalarMinMaxWithSubqueries to detect a Select as input to two or more scalar MIN or MAX expressions. If present, a new Select is built for each MIN or MAX expression with the same Filters (WHERE clause) as the original Select, but with mapped column IDs in a copy of the tree. Each new Select is used as input to a newly constructed subquery and combined via a values clause. The values clause replaces the original ScalarGroupByExpr. Example: CREATE TABLE t1 (a INT, b INT, PRIMARY KEY(a)); EXPLAIN (OPT) SELECT min(a), max(a) FROM t1 WHERE b < 6; --------------------------------------------- values └── tuple ├── subquery │ └── scalar-group-by │ ├── limit │ │ ├── select │ │ │ ├── scan t1 │ │ │ └── filters │ │ │ └── b < 6 │ │ └── 1 │ └── aggregations │ └── const-agg │ └── a └── subquery └── scalar-group-by ├── limit │ ├── select │ │ ├── scan t1,rev │ │ └── filters │ │ └── b < 6 │ └── 1 └── aggregations └── const-agg └── a Release note (performance improvement): A SELECT query from a single table with more than one MIN or MAX scalar aggregate expression and a WHERE clause can now be performed with LIMITED SCANs, one per aggregate expression, instead of a single FULL SCAN. Note: No other aggregate function, such as SUM, may be present in the query block in order for it to be eligible for this transformation. This optimization should occur when each MIN or MAX expression involves a leading index column, so that a sort is not required for the limit operation, and the resulting query plan will appear cheapest to the optimizer.

msirek

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

-- commits, line 4 at r2: