Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt/memo: extend OutputCols with VirtualCols in statistics builder #120668

Merged
merged 3 commits into from Mar 22, 2024

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Mar 18, 2024

opt: add props.Statistics.VirtualCols

As of #118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: #68254

Epic: CRDB-8949

Release note: None


sql: add optimizer_use_virtual_computed_column_stats session variable

Informs: #68254

Epic: CRDB-8949

Release note (sql): Add new session variable
optimizer_use_virtual_computed_column_stats. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.


opt/memo: extend OutputCols with VirtualCols in statistics builder

Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: #68254

Epic: CRDB-8949

Release note: None

This comment was marked as outdated.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2 michae2 force-pushed the buildvirtstats branch 2 times, most recently from e9ba8d4 to 05dabed Compare March 19, 2024 07:41
@michae2 michae2 changed the title Buildvirtstats opt/memo: extend OutputCols with VirtualCols in statistics builder Mar 19, 2024
@michae2 michae2 marked this pull request as ready for review March 19, 2024 07:45
@michae2 michae2 requested a review from a team as a code owner March 19, 2024 07:45
Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r1, 11 of 11 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 1766 at r1 (raw file):

	s.RowCount = inputStats.RowCount
	s.VirtualCols.UnionWith(inputStats.VirtualCols)

Does it make sense to include columns from the input, since the index join can't pass them through? (Does that even matter?)


pkg/sql/opt/memo/statistics_builder.go line 2682 at r1 (raw file):

	s.Available = bindingProps.Statistics().Available
	s.RowCount = bindingProps.Statistics().RowCount
	s.VirtualCols.UnionWith(bindingProps.Statistics().VirtualCols)

I think these columns will be keyed to the bound expression, rather than the WithScan output columns.


pkg/sql/opt/memo/statistics_builder.go line 256 at r3 (raw file):

// computed columns we have statistics on that are in scope.
func (sb *statisticsBuilder) colStatCols(props *props.Relational) opt.ColSet {
	if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats {

[nit] Since we've seen some stats code show up in profiles recently, could we do something like this?

if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats && !props.Statistics().VirtualCols.Empty() {

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Cool!

Reviewed 13 of 13 files at r1, 11 of 11 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 671 at r1 (raw file):

					// in higher groups we can decide whether to look up statistics on
					// virtual columns or on the columns used in their defining
					// expressions.

nit: This comment seems out-of-date.


pkg/sql/opt/memo/statistics_builder.go line 1766 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Does it make sense to include columns from the input, since the index join can't pass them through? (Does that even matter?)

What do you mean by an "index join can't pass [input columns] through"? What about something like:

  select
   ├── columns: a:1 b:2 c:3
   ├── stats: [rows=0.9108108, distinct(2)=0.910811, null(2)=0, distinct(3)=0.910811, null(3)=0, distinct(2,3)=0.910811, null(2,3)=0]
   ├── key: (1)
   ├── fd: ()-->(2,3)
   ├── prune: (1)
   ├── index-join t
   │    ├── columns: a:1 b:2 c:3
   │    ├── stats: [rows=10]
   │    ├── key: (1)
   │    ├── fd: ()-->(2), (1)-->(3)
   │    └── scan t@t_b_idx
   │         ├── columns: a:1 b:2
   │         ├── constraint: /2/1: [/1 - /1]
   │         ├── stats: [rows=10, distinct(2)=1, null(2)=0]
   │         ├── key: (1)
   │         ├── fd: ()-->(2)
   └── filters
        └── c:3 = 2 [outer=(3), constraints=(/3: [/2 - /2]; tight), fd=()-->(3)]

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 1766 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

What do you mean by an "index join can't pass [input columns] through"? What about something like:

  select
   ├── columns: a:1 b:2 c:3
   ├── stats: [rows=0.9108108, distinct(2)=0.910811, null(2)=0, distinct(3)=0.910811, null(3)=0, distinct(2,3)=0.910811, null(2,3)=0]
   ├── key: (1)
   ├── fd: ()-->(2,3)
   ├── prune: (1)
   ├── index-join t
   │    ├── columns: a:1 b:2 c:3
   │    ├── stats: [rows=10]
   │    ├── key: (1)
   │    ├── fd: ()-->(2), (1)-->(3)
   │    └── scan t@t_b_idx
   │         ├── columns: a:1 b:2
   │         ├── constraint: /2/1: [/1 - /1]
   │         ├── stats: [rows=10, distinct(2)=1, null(2)=0]
   │         ├── key: (1)
   │         ├── fd: ()-->(2)
   └── filters
        └── c:3 = 2 [outer=(3), constraints=(/3: [/2 - /2]; tight), fd=()-->(3)]

Good point. I confused the optimizer representation with what happens during execution. I think this is probably fine, then.

As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql): Add new session variable
`optimizer_use_virtual_computed_column_stats`. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.
Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @mgartner, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 671 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: This comment seems out-of-date.

This is from the first commit, and it's still true: we only add virtual columns when we have stats for them. I wanted to document it because it's a little subtle, not sure if this is the best spot.


pkg/sql/opt/memo/statistics_builder.go line 1766 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Good point. I confused the optimizer representation with what happens during execution. I think this is probably fine, then.

Yes, we do want to do this for index join. I added an index join testcase, and you can see the virtcolstats: (2) going up through the index join.


pkg/sql/opt/memo/statistics_builder.go line 2682 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think these columns will be keyed to the bound expression, rather than the WithScan output columns.

Oh! Great catch! This actually caused a panic when combined with my next PR.

I removed this (so now both set operations and with-scans are barriers for VirtualCols).

I also added a testcase for this that shows we don't panic (but the stats are inaccurate).


pkg/sql/opt/memo/statistics_builder.go line 256 at r3 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] Since we've seen some stats code show up in profiles recently, could we do something like this?

if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats && !props.Statistics().VirtualCols.Empty() {

Oh, good call! Done.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Great work!

Reviewed 11 of 11 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)

@michae2 michae2 added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Mar 22, 2024
@michae2
Copy link
Collaborator Author

michae2 commented Mar 22, 2024

Thanks Drew!

bors r=DrewKimball,mgartner

@craig
Copy link
Contributor

craig bot commented Mar 22, 2024

@craig craig bot merged commit 8dc82e5 into cockroachdb:master Mar 22, 2024
21 of 36 checks passed
Copy link

blathers-crl bot commented Mar 22, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 23ab815 to blathers/backport-release-23.1-120668: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 23ab815 to blathers/backport-release-23.2-120668: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@mgartner
Copy link
Collaborator

pkg/sql/opt/memo/statistics_builder.go line 671 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This is from the first commit, and it's still true: we only add virtual columns when we have stats for them. I wanted to document it because it's a little subtle, not sure if this is the best spot.

Oh sorry, I misunderstood the code here.

michae2 added a commit to michae2/cockroach that referenced this pull request Mar 23, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
@michae2 michae2 deleted the buildvirtstats branch March 23, 2024 05:05
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 26, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
craig bot pushed a commit that referenced this pull request Mar 26, 2024
120875: opt/memo: use virtual column stats for matching scalar expressions r=DrewKimball a=michae2

As of #120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: #68254
Fixes: #110146

Epic: CRDB-8949

Release note: None

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 26, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 29, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 29, 2024
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants