New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opt/memo: extend OutputCols with VirtualCols in statistics builder #120668
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
e9ba8d4
to
05dabed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r1, 11 of 11 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 1766 at r1 (raw file):
s.RowCount = inputStats.RowCount s.VirtualCols.UnionWith(inputStats.VirtualCols)
Does it make sense to include columns from the input, since the index join can't pass them through? (Does that even matter?)
pkg/sql/opt/memo/statistics_builder.go
line 2682 at r1 (raw file):
s.Available = bindingProps.Statistics().Available s.RowCount = bindingProps.Statistics().RowCount s.VirtualCols.UnionWith(bindingProps.Statistics().VirtualCols)
I think these columns will be keyed to the bound expression, rather than the WithScan
output columns.
pkg/sql/opt/memo/statistics_builder.go
line 256 at r3 (raw file):
// computed columns we have statistics on that are in scope. func (sb *statisticsBuilder) colStatCols(props *props.Relational) opt.ColSet { if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats {
[nit] Since we've seen some stats code show up in profiles recently, could we do something like this?
if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats && !props.Statistics().VirtualCols.Empty() {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r1, 11 of 11 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 671 at r1 (raw file):
// in higher groups we can decide whether to look up statistics on // virtual columns or on the columns used in their defining // expressions.
nit: This comment seems out-of-date.
pkg/sql/opt/memo/statistics_builder.go
line 1766 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Does it make sense to include columns from the input, since the index join can't pass them through? (Does that even matter?)
What do you mean by an "index join can't pass [input columns] through"? What about something like:
select
├── columns: a:1 b:2 c:3
├── stats: [rows=0.9108108, distinct(2)=0.910811, null(2)=0, distinct(3)=0.910811, null(3)=0, distinct(2,3)=0.910811, null(2,3)=0]
├── key: (1)
├── fd: ()-->(2,3)
├── prune: (1)
├── index-join t
│ ├── columns: a:1 b:2 c:3
│ ├── stats: [rows=10]
│ ├── key: (1)
│ ├── fd: ()-->(2), (1)-->(3)
│ └── scan t@t_b_idx
│ ├── columns: a:1 b:2
│ ├── constraint: /2/1: [/1 - /1]
│ ├── stats: [rows=10, distinct(2)=1, null(2)=0]
│ ├── key: (1)
│ ├── fd: ()-->(2)
└── filters
└── c:3 = 2 [outer=(3), constraints=(/3: [/2 - /2]; tight), fd=()-->(3)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 1766 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
What do you mean by an "index join can't pass [input columns] through"? What about something like:
select ├── columns: a:1 b:2 c:3 ├── stats: [rows=0.9108108, distinct(2)=0.910811, null(2)=0, distinct(3)=0.910811, null(3)=0, distinct(2,3)=0.910811, null(2,3)=0] ├── key: (1) ├── fd: ()-->(2,3) ├── prune: (1) ├── index-join t │ ├── columns: a:1 b:2 c:3 │ ├── stats: [rows=10] │ ├── key: (1) │ ├── fd: ()-->(2), (1)-->(3) │ └── scan t@t_b_idx │ ├── columns: a:1 b:2 │ ├── constraint: /2/1: [/1 - /1] │ ├── stats: [rows=10, distinct(2)=1, null(2)=0] │ ├── key: (1) │ ├── fd: ()-->(2) └── filters └── c:3 = 2 [outer=(3), constraints=(/3: [/2 - /2]; tight), fd=()-->(3)]
Good point. I confused the optimizer representation with what happens during execution. I think this is probably fine, then.
As of cockroachdb#118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Note that we cannot currently pass VirtualCols up through set operations or with-scans, due to the column ID translation they use. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
Informs: cockroachdb#68254 Epic: CRDB-8949 Release note (sql): Add new session variable `optimizer_use_virtual_computed_column_stats`. When this variable is enabled, the optimizer will make use of table statistics on virtual computed columns.
Throughout statistics builder we use OutputCols to determine which columns come from the input to an expression. We then typically call colStatXXX with those columns as part of statistics calculation. In order to use statistics on virtual computed columns, we need to call colStatXXX on any virtual columns that could come from our input, even if they are not passed upward through OutputCols. To do this we extend OutputCols with the VirtualCols set we built in a previous commit. This commit replaces almost all usages of OutputCols in statistics builder with a call to helper function colStatCols, which returns a union of OutputCols and VirtualCols. This is enough to get the optimizer to use statistics on virtual computed columns in some simple plans. More complex plans will require matching the virtual column scalar expressions, which will be in the next PR. I've left some TODOs marking spots where this next PR will touch. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @mgartner, and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 671 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: This comment seems out-of-date.
This is from the first commit, and it's still true: we only add virtual columns when we have stats for them. I wanted to document it because it's a little subtle, not sure if this is the best spot.
pkg/sql/opt/memo/statistics_builder.go
line 1766 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Good point. I confused the optimizer representation with what happens during execution. I think this is probably fine, then.
Yes, we do want to do this for index join. I added an index join testcase, and you can see the virtcolstats: (2)
going up through the index join.
pkg/sql/opt/memo/statistics_builder.go
line 2682 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
I think these columns will be keyed to the bound expression, rather than the
WithScan
output columns.
Oh! Great catch! This actually caused a panic when combined with my next PR.
I removed this (so now both set operations and with-scans are barriers for VirtualCols).
I also added a testcase for this that shows we don't panic (but the stats are inaccurate).
pkg/sql/opt/memo/statistics_builder.go
line 256 at r3 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] Since we've seen some stats code show up in profiles recently, could we do something like this?
if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats && !props.Statistics().VirtualCols.Empty() {
Oh, good call! Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 11 of 11 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @rytaft)
Thanks Drew! bors r=DrewKimball,mgartner |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 23ab815 to blathers/backport-release-23.1-120668: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. error creating merge commit from 23ab815 to blathers/backport-release-23.2-120668: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Previously, michae2 (Michael Erickson) wrote…
Oh sorry, I misunderstood the code here. |
As of cockroachdb#120668 we now use statistics on virtual computed columns in statistics builder. Simple queries that synthesize virtual columns in Project expressions already benefit, because they use the virtual column ID when synthesizing the virtual column. Other expressions, however, do not directly use the virtual column ID when synthesizing a virtual column. This includes Select expressions, joins, constrained scans, and some Project expressions. For example, consider a query like the following: ``` CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b)); SELECT * FROM ab WHERE a % 10 > 3; ``` Even though the filter condition is in terms of `a`, we'd like to use the statistics on virtual computed column `b` since the expression matches. In order to do this, we replace `a % 10` with `b` in a copy of the filter condition before doing any stats calculations. Then we perform our normal stats calculations, using `b`. Fixes: cockroachdb#68254 Fixes: cockroachdb#110146 Epic: CRDB-8949 Release note: None
As of cockroachdb#120668 we now use statistics on virtual computed columns in statistics builder. Simple queries that synthesize virtual columns in Project expressions already benefit, because they use the virtual column ID when synthesizing the virtual column. Other expressions, however, do not directly use the virtual column ID when synthesizing a virtual column. This includes Select expressions, joins, constrained scans, and some Project expressions. For example, consider a query like the following: ``` CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b)); SELECT * FROM ab WHERE a % 10 > 3; ``` Even though the filter condition is in terms of `a`, we'd like to use the statistics on virtual computed column `b` since the expression matches. In order to do this, we replace `a % 10` with `b` in a copy of the filter condition before doing any stats calculations. Then we perform our normal stats calculations, using `b`. Fixes: cockroachdb#68254 Fixes: cockroachdb#110146 Epic: CRDB-8949 Release note: None
120875: opt/memo: use virtual column stats for matching scalar expressions r=DrewKimball a=michae2 As of #120668 we now use statistics on virtual computed columns in statistics builder. Simple queries that synthesize virtual columns in Project expressions already benefit, because they use the virtual column ID when synthesizing the virtual column. Other expressions, however, do not directly use the virtual column ID when synthesizing a virtual column. This includes Select expressions, joins, constrained scans, and some Project expressions. For example, consider a query like the following: ``` CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b)); SELECT * FROM ab WHERE a % 10 > 3; ``` Even though the filter condition is in terms of `a`, we'd like to use the statistics on virtual computed column `b` since the expression matches. In order to do this, we replace `a % 10` with `b` in a copy of the filter condition before doing any stats calculations. Then we perform our normal stats calculations, using `b`. Fixes: #68254 Fixes: #110146 Epic: CRDB-8949 Release note: None Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
As of cockroachdb#120668 we now use statistics on virtual computed columns in statistics builder. Simple queries that synthesize virtual columns in Project expressions already benefit, because they use the virtual column ID when synthesizing the virtual column. Other expressions, however, do not directly use the virtual column ID when synthesizing a virtual column. This includes Select expressions, joins, constrained scans, and some Project expressions. For example, consider a query like the following: ``` CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b)); SELECT * FROM ab WHERE a % 10 > 3; ``` Even though the filter condition is in terms of `a`, we'd like to use the statistics on virtual computed column `b` since the expression matches. In order to do this, we replace `a % 10` with `b` in a copy of the filter condition before doing any stats calculations. Then we perform our normal stats calculations, using `b`. Fixes: cockroachdb#68254 Fixes: cockroachdb#110146 Epic: CRDB-8949 Release note: None
As of cockroachdb#120668 we now use statistics on virtual computed columns in statistics builder. Simple queries that synthesize virtual columns in Project expressions already benefit, because they use the virtual column ID when synthesizing the virtual column. Other expressions, however, do not directly use the virtual column ID when synthesizing a virtual column. This includes Select expressions, joins, constrained scans, and some Project expressions. For example, consider a query like the following: ``` CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b)); SELECT * FROM ab WHERE a % 10 > 3; ``` Even though the filter condition is in terms of `a`, we'd like to use the statistics on virtual computed column `b` since the expression matches. In order to do this, we replace `a % 10` with `b` in a copy of the filter condition before doing any stats calculations. Then we perform our normal stats calculations, using `b`. Fixes: cockroachdb#68254 Fixes: cockroachdb#110146 Epic: CRDB-8949 Release note: None
As of cockroachdb#120668 we now use statistics on virtual computed columns in statistics builder. Simple queries that synthesize virtual columns in Project expressions already benefit, because they use the virtual column ID when synthesizing the virtual column. Other expressions, however, do not directly use the virtual column ID when synthesizing a virtual column. This includes Select expressions, joins, constrained scans, and some Project expressions. For example, consider a query like the following: ``` CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b)); SELECT * FROM ab WHERE a % 10 > 3; ``` Even though the filter condition is in terms of `a`, we'd like to use the statistics on virtual computed column `b` since the expression matches. In order to do this, we replace `a % 10` with `b` in a copy of the filter condition before doing any stats calculations. Then we perform our normal stats calculations, using `b`. Fixes: cockroachdb#68254 Fixes: cockroachdb#110146 Epic: CRDB-8949 Release note: None
opt: add props.Statistics.VirtualCols
As of #118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.
To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.
This commit adds VirtualCols but does not yet use it.
Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.
Informs: #68254
Epic: CRDB-8949
Release note: None
sql: add optimizer_use_virtual_computed_column_stats session variable
Informs: #68254
Epic: CRDB-8949
Release note (sql): Add new session variable
optimizer_use_virtual_computed_column_stats
. When this variable isenabled, the optimizer will make use of table statistics on virtual
computed columns.
opt/memo: extend OutputCols with VirtualCols in statistics builder
Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.
In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.
This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.
Informs: #68254
Epic: CRDB-8949
Release note: None