New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: collect table statistics on virtual computed columns #118241
Conversation
This needs more tests, but the distsql changes are RFAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I have a few comments.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/distsql_plan_stats.go
line 363 at r1 (raw file):
} func (dsp *DistSQLPlanner) createStatsPlan(
Do we not need to apply similar changes to createPartialStatsPlan
function? Is it because currently partial statistics are only supported on non-virtual columns? Should we add an explicit error-check for that and a unit test?
pkg/sql/distsql_plan_stats.go
line 419 at r1 (raw file):
if !tableColSet.Contains(c) { if _, err = catalog.MustFindColumnByID(desc, c); err != nil { fmt.Println("could not find column!")
nit: this seems like a debugging leftover - should this be log.*
call? Or perhaps this could be a panic behind buildutil.CrdbTestBuild
?
pkg/sql/distsql_plan_stats.go
line 475 at r1 (raw file):
) if err != nil { fmt.Println("error from MakeComputedExprs:", err)
nit: ditto for logging / panic in test builds.
pkg/sql/distsql_plan_stats.go
line 483 at r1 (raw file):
resultCols := colinfo.ResultColumnsFromColumns(desc.GetID(), requestedCols) ivh := tree.MakeIndexedVarHelper(nil, len(scan.cols))
nit: add /* container */
.
pkg/sql/distsql_plan_stats.go
line 487 at r1 (raw file):
for i, col := range requestedCols { if col.IsVirtual() { if virtIdx > len(virtComputedExprs) {
Shouldn't we use >=
instead of >
in this if
condition (as well as two others below)?
pkg/sql/distsql_plan_stats.go
line 489 at r1 (raw file):
if virtIdx > len(virtComputedExprs) { return nil, errors.AssertionFailedf( "virtual computed column expressions do not match requested columns",
nit: consider including more details into the assertion errors.
pkg/sql/opt/exec/execbuilder/testdata/distsql_misc
line 219 at r1 (raw file):
· Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJzsl9Fu2zYUhu_3FMS5ajG6EinZUXjlzG0Bo0tS2EZvBqNgxBNXiExqJI00C_xYe4E92SApbm3VNix3XnsRXQjg4dHP_-h8FKhHcH_mIGAwenMxeUPGk4vJcDwZDsbEMXJ9RW4pQUoUeTu6viRKegkUtFF4JefoQPwBDChwoBABhRgodGFKobAmReeMLVMeqweG6jOIkEKmi4Uvw1MKqbEI4hF85nMEARN5k-MIpUIbhEBBoZdZXi1TLt0vbx-LO3wACgOTL-baCaIouaEkBQrjQpaBTsCI1IowYvwntEBhhFqhFaTPKOnzX5kQYng1SSiRXmquXvQj2mdCvP39-mKSvAQK7z4Qn81RkPCfv109To32qH1m9DdT1tw7ojA1CpUgLAzr8M2DR0csSiVIEobktzo8G70fkFTmuVvLLWRmV7m8Cl5-GAyI81iQ1Cy0Jy_wsw8y7V8KEgZfExDvdiVUtszCFwtfrzRdUqjHT-_feTlDEGytYcPXIMIlPbxnYzkvcrRBd7NfdXic_VW-qbI3XvqqAX1O-9FOK7xhpbvTylcHC22sQotqw8F0udPsxWxmcSa9sQEL_xPbUcM223yF7HDs2dHYB6wT8B8JPt8GPnvVI--yLejzbejHp0Cft0CftenbCv3eSdDvbVjhhyPEj0eId4LoGaHvQ4i36dsKobOTIHS2YSU6HKHoeISiThA_I_R9CEVt-rZCKDkJQsmGlfhwhOLjEYo7QfdHIhRtQ4i_irYjFG1DqHcKhKIWCMVt-rZC6PwkCJ23OU6O0BVGO2yc5bavFDZW6rDy0IdqhvUJ0ZmFTfG9NWmVWw-vK6EqoND5epbVg6FeTTlvUc6_nIbXldheJb6hxNaVuk0lvt9TG1PRXql4txJrKsVty5NVV0Cjvzf2rt5LDnX501H-LD6F671Ux5PqX2g1M0fn5OzLZAjTdX-9pr_uXn-93ZXyplLv56r0rOnvbK-_ZHelUVMp-bkqTZr-zvdvhHB3qfE3u3P_Rv_faz0vP0m3ubn_mCkQED5dnS231QXlA3Lmyu_i-JO5r8xOHoryq3Yrc4cULuUdvkaPdp7pzPksBeHtApfLX_4NAAD___EFm6M=
nit: maybe add a test where the computed expression cannot be evaluated in a distributed fashion (i.e. when the stats plan must be local)?
5fd31c8
to
c87fbda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments!
- Added checks and a test for un-distributable virtual column expressions (great catch!).
- Added a cluster setting.
- Added more tests.
- Partial statistics (USING EXTREMES) just work! Added a test.
I think this is RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)
pkg/sql/create_stats.go
line 390 at r2 (raw file):
if virtColEnabled { semaCtx := tree.MakeSemaContext() exprs, _, err := schemaexpr.MakeComputedExprs(
This is unfortunate but necessary to perform the distsql check below (checkExpr
) which relies on having a fully-typed expression.
Also unfortunate is that we're performing this work twice: both here and in createStatsPlan
. I'm doing this here to silently skip over un-distributable virtual column expressions, and in createStatsPlan
to build the plan (and to return an error if un-distributable virtual column expressions sneak in another way, e.g. from CREATE STATISTICS
with explicitly-requested columns). I'm not yet sure of an easy way to de-duplicate the work.
pkg/sql/distsql_plan_stats.go
line 363 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Do we not need to apply similar changes to
createPartialStatsPlan
function? Is it because currently partial statistics are only supported on non-virtual columns? Should we add an explicit error-check for that and a unit test?
As it turns out, with this change partial statistics (USING EXTREMES) already work on virtual computed columns. 🎉 This is because partial statistics are only supported for indexed columns, and if a virtual computed column is indexed it acts as a normal stored column in that secondary index. We don't need to do any extra rendering.
I added some tests for partial statistics on virtual computed columns to the end of pkg/sql/logictest/testdata/logic_test/distsql_stats
.
pkg/sql/distsql_plan_stats.go
line 419 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this seems like a debugging leftover - should this be
log.*
call? Or perhaps this could be a panic behindbuildutil.CrdbTestBuild
?
Gah, sorry. Thanks. Removed.
pkg/sql/distsql_plan_stats.go
line 475 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: ditto for logging / panic in test builds.
🤦 removed
pkg/sql/distsql_plan_stats.go
line 483 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: add
/* container */
.
Done.
pkg/sql/distsql_plan_stats.go
line 487 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Shouldn't we use
>=
instead of>
in thisif
condition (as well as two others below)?
Yes, thank you.
pkg/sql/distsql_plan_stats.go
line 489 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: consider including more details into the assertion errors.
Done.
pkg/sql/opt/exec/execbuilder/testdata/distsql_misc
line 219 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe add a test where the computed expression cannot be evaluated in a distributed fashion (i.e. when the stats plan must be local)?
Great catch! I was not looking for this, and was always executing the plan in a distributed fashion regardless of whether the expression supported it.
I think we probably want to skip collecting stats on un-distributable virtual column expressions, rather than run CREATE STATISTICS locally. (I think it would create a risk of performance impact to do a full table scan through distsender.)
I've added checks for un-distributable expressions to createStatsDefaultColumns
and createStatsPlan
, as well as tests. If the expression cannot be distributed we now skip collecting stats on the virtual column (or return an error if the column was explicitly requested).
Oh, I forgot to mention:
I think we could pretty easily do (2) but I have left it out of this PR for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a few nits
Reviewed 1 of 3 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
pkg/sql/create_stats.go
line 429 at r2 (raw file):
} // Do not collect stats for virtual computed columns.
nit: this comment is slightly misleading, since we do want to collect stats for some virtual computed columns
pkg/sql/create_stats.go
line 568 at r2 (raw file):
col := desc.PublicColumns()[i] // Do not collect stats for virtual computed columns.
nit: misleading comment
pkg/sql/distsql_plan_stats.go
line 417 at r2 (raw file):
refColIDs.ForEach(func(c descpb.ColumnID) { if !tableColSet.Contains(c) { if _, err = catalog.MustFindColumnByID(desc, c); err != nil {
why do you need this?
pkg/sql/importer/import_job.go
line 1044 at r2 (raw file):
// columns. multiColEnabled := false defaultHistogramBuckets := stats.GetDefaultHistogramBuckets(execCfg.SV(), desc)
you're removing the ability to set the histogram buckets here -- was this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2)
pkg/sql/create_stats.go
line 273 at r2 (raw file):
for i := range columns { if columns[i].IsVirtual() && !statsOnVirtualCols.Get(n.p.ExecCfg().SV()) { return nil, pgerror.Newf(
nit: maybe add an error hint to change the cluster setting to allow it?
pkg/sql/create_stats.go
line 390 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
This is unfortunate but necessary to perform the distsql check below (
checkExpr
) which relies on having a fully-typed expression.Also unfortunate is that we're performing this work twice: both here and in
createStatsPlan
. I'm doing this here to silently skip over un-distributable virtual column expressions, and increateStatsPlan
to build the plan (and to return an error if un-distributable virtual column expressions sneak in another way, e.g. fromCREATE STATISTICS
with explicitly-requested columns). I'm not yet sure of an easy way to de-duplicate the work.
This seems reasonable to me.
IIUC one way to avoid the duplicate work would be to extend jobspb.CreateStatsDetails
to somehow mark distributable expressions for virtual columns (e.g. list of column ids). Then we'd only check the distributability when creating this proto, and in createStatsPlan
we wouldn't have to do that. However, the current approach seems good.
pkg/sql/create_stats.go
line 403 at r2 (raw file):
} for i, col := range desc.PublicColumns() { cannotDistribute[i] = col.IsVirtual() && checkExpr(exprs[i]) != nil
nit: maybe do s/checkExpr/checkExprForDistSQL/
to clarify this helper?
pkg/sql/create_stats.go
line 430 at r2 (raw file):
// Do not collect stats for virtual computed columns. if col.IsVirtual() && (!virtColEnabled || cannotDistribute[col.Ordinal()]) {
nit: maybe extract a helper for this conditional?
pkg/sql/opt/exec/execbuilder/testdata/distsql_misc
line 219 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Great catch! I was not looking for this, and was always executing the plan in a distributed fashion regardless of whether the expression supported it.
I think we probably want to skip collecting stats on un-distributable virtual column expressions, rather than run CREATE STATISTICS locally. (I think it would create a risk of performance impact to do a full table scan through distsender.)
I've added checks for un-distributable expressions to
createStatsDefaultColumns
andcreateStatsPlan
, as well as tests. If the expression cannot be distributed we now skip collecting stats on the virtual column (or return an error if the column was explicitly requested).
Nice! Although I wonder whether we should have an escape hatch to allow stats collection even if the expr cannot be distributed, like controlled by a session variable (disabled by default). Perhaps if the virtual column was explicitly requested, this should always be allowed. I think the current approach makes sense, so maybe sprinkle some TODOs for this, and we'll address them once we have at least one user complain about it?
pkg/sql/opt/exec/execbuilder/testdata/distsql_misc
line 251 at r2 (raw file):
Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJzsV8Fu2zgQve9XEHNKsHQkUrLj8OSs2wJGN01hG70sgoIRp65gWdSSNNJs4M_aH9gvW0iqa0mxXAtpgQSoD4I4MxrOm3l8Bh_A_p2AgPH09eX8NZnNL-eT2XwynhHLyJvp9RVR0kmgkGqF7-QKLYi_gAEFDhQCoBAChT7cUMiMjtBabfKQh-KDifoCwqcQp9na5eYbCpE2COIBXOwSBAFzeZvgFKVC4_lAQaGTcVJsk289yh8fsyXeA4WxTtar1AoiKbmlJKJEAYVZJnNbz2NEpoowot1nNEBhiqlCI8iIUTLilIwCSkZh_v47E0JM3s2HlEgnU65ORgEdhUK8-fP6cj48BQpvPxAXr1AQ_79_bbmOdOowdbFOH7mMvrNEYaQVKkGY75fm23uHlhiUSpCh75M_SvNi-n5MIpkkthKbydhsY3lhvPowHhPrMCORXqeOnOAX58WpOxXE93YBiMu2gKIsvXbZ2pU73WwolOuv47BOLhAEq8xv8gqEv6HHj3AmV1mCxuvXx1eaZ_E_WOxdQJo56fKB7F55xUqrq6DuoFVD-MhHq7b-7nXQCpk3IPdbIe-QrlNtFBpUNaQ3m9amXC4WBhfSaeMx__j2kBPu--R2HS3R2dNKs9ocjdYdCqs38juR9ba2BQ_ICd-Z2xoeNBrO6iRjx-sEe4pOeKzn8eeiFHyfUrCzAXkb79EKvk8rwp-hFbyDVrAuY9xqxeBFa8WgBpkfz1z-JObynhf8Yu6PYy7vMsYtc89fNHPPa5CD45kbPIm5Qc8LfzH3xzE36DLGLXOHL5q5wxrk8Hjmhk9ibtjz-s-FucE-5vKzYD9zg33MHfwM5gYdmBt2GeOWuRcvmrkXXS5TU7SZTi02bhj7d_IbO_VYfhVBtcDy3mL12kT43uioiC2X10WiwqDQutLLysUk3bqsMyhX3-6C1UzsYCZey8SqmfrNTPxwTV2KCg6mCtszsWamsCs8WUwFUnR32izLM2sxLSQ9Z-7WUZ7a0rP7v9l6V2itXOwC_BBuqnUOmnX2D9Y5aEfMm5kGzxPxebPO84N1DtsRB81Mw-eJeNis8-LwAfHbIYePTu1hAeiEOWjFzM_C72DuNzBf5JL1KdF3H2MFAvyvv96ex_YH-QdyYXPdnH3Wd0XR8_ssV71PMrFI4Uou8RU6NKs4ja2LIxDOrHGz-e3_AAAA___c-GNe # Check that we also collect stats on the hidden expression index virt column.
nit: might be also interesting to see what happens when the column is made INVISIBLE.
Add rendering of virtual computed column expressions to the CREATE STATISTICS distsql plan. This rendering should always be added as post processors on the TableReader nodes that feed the Samplers. With this change we now collect table statistics on virtual computed columns. A future PR will make use of the statistsics in statistics builder. Collection of partial statistics (USING EXTREMES) on virtual computed columns simply works after this change because partial statistics collection utilizes secondary indexes, where virtual computed columns are regular stored columns. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note (sql change): Add a new cluster setting, `sql.stats.virtual_computed_columns.enabled`, which when set enables collection of table statistics on virtual computed columns.
This function checks whether an expression supports distributed execution. The name was a little ambiguous, especially in the giant sql package. Epic: None Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
After finishing the next part (statistics_builder) I will work on the backport.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft and @yuzefovich)
pkg/sql/create_stats.go
line 273 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe add an error hint to change the cluster setting to allow it?
Done.
pkg/sql/create_stats.go
line 403 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe do
s/checkExpr/checkExprForDistSQL/
to clarify this helper?
Done (in a second commit).
pkg/sql/create_stats.go
line 429 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: this comment is slightly misleading, since we do want to collect stats for some virtual computed columns
Rephrased.
pkg/sql/create_stats.go
line 430 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe extract a helper for this conditional?
Done.
pkg/sql/create_stats.go
line 568 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: misleading comment
Rephrased.
pkg/sql/distsql_plan_stats.go
line 417 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
why do you need this?
Ah, good catch, I don't. (It was a sanity check during development.) Removed.
pkg/sql/importer/import_job.go
line 1044 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
you're removing the ability to set the histogram buckets here -- was this intentional?
Yes: as it turns out, StubTableStats
doesn't do anything useful with the number of histogram buckets anyway. It always returns statistics with nil HistogramData. (And this is the only user of StubTableStats
.)
pkg/sql/opt/exec/execbuilder/testdata/distsql_misc
line 219 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Nice! Although I wonder whether we should have an escape hatch to allow stats collection even if the expr cannot be distributed, like controlled by a session variable (disabled by default). Perhaps if the virtual column was explicitly requested, this should always be allowed. I think the current approach makes sense, so maybe sprinkle some TODOs for this, and we'll address them once we have at least one user complain about it?
Good point, added some TODOs.
pkg/sql/opt/exec/execbuilder/testdata/distsql_misc
line 251 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: might be also interesting to see what happens when the column is made INVISIBLE.
Added a test with some hidden columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)
pkg/sql/importer/import_job.go
line 1044 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Yes: as it turns out,
StubTableStats
doesn't do anything useful with the number of histogram buckets anyway. It always returns statistics with nil HistogramData. (And this is the only user ofStubTableStats
.)
thanks for the explanation!
Thanks! bors r=yuzefovich,rytaft |
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done!
Reviewed 2 of 5 files at r2, 5 of 5 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
Note to self: when backporting, need to include #118930 |
As of cockroachdb#118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
As of cockroachdb#118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Note that we cannot currently pass VirtualCols up through set operations or with-scans, due to the column ID translation they use. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
120668: opt/memo: extend OutputCols with VirtualCols in statistics builder r=DrewKimball,mgartner a=michae2 **opt: add props.Statistics.VirtualCols** As of #118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Note that we cannot currently pass VirtualCols up through set operations or with-scans, due to the column ID translation they use. Informs: #68254 Epic: CRDB-8949 Release note: None --- **sql: add optimizer_use_virtual_computed_column_stats session variable** Informs: #68254 Epic: CRDB-8949 Release note (sql): Add new session variable `optimizer_use_virtual_computed_column_stats`. When this variable is enabled, the optimizer will make use of table statistics on virtual computed columns. --- **opt/memo: extend OutputCols with VirtualCols in statistics builder** Throughout statistics builder we use OutputCols to determine which columns come from the input to an expression. We then typically call colStatXXX with those columns as part of statistics calculation. In order to use statistics on virtual computed columns, we need to call colStatXXX on any virtual columns that could come from our input, even if they are not passed upward through OutputCols. To do this we extend OutputCols with the VirtualCols set we built in a previous commit. This commit replaces almost all usages of OutputCols in statistics builder with a call to helper function colStatCols, which returns a union of OutputCols and VirtualCols. This is enough to get the optimizer to use statistics on virtual computed columns in some simple plans. More complex plans will require matching the virtual column scalar expressions, which will be in the next PR. I've left some TODOs marking spots where this next PR will touch. Informs: #68254 Epic: CRDB-8949 Release note: None Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
As of cockroachdb#118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Note that we cannot currently pass VirtualCols up through set operations or with-scans, due to the column ID translation they use. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
As of cockroachdb#118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Note that we cannot currently pass VirtualCols up through set operations or with-scans, due to the column ID translation they use. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
As of cockroachdb#118241 we now collect table statistics on virtual computed columns, but do not yet use them in statistics builder. The difficulty with using these stats in statistics builder is that virtual computed columns are synthesized by various non-Scan expressions (Project, Select, etc). When calculating stats for these non-Scan expressions, we need to find the virtual column stats even though the virtual columns are not produced by the input to these expressions. To solve this, we add a VirtualCols set to props.Statistics which holds all of the virtual columns that could be produced by the input to a group. Expressions that could synthesize virtual columns will look in this set to discover whether there are statistics for any of the scalar expressions they render. If there are, they will call colStatXXX using the virtual column ID as if the virtual column had originated from their input. This commit adds VirtualCols but does not yet use it. Note that we cannot currently pass VirtualCols up through set operations or with-scans, due to the column ID translation they use. Informs: cockroachdb#68254 Epic: CRDB-8949 Release note: None
Add rendering of virtual computed column expressions to the CREATE STATISTICS distsql plan. This rendering should always be added as post processors on the TableReader nodes that feed the Samplers.
With this change we now collect table statistics on virtual computed columns. A future PR will make use of the statistsics in statistics builder.
Collection of partial statistics (USING EXTREMES) on virtual computed columns simply works after this change because partial statistics collection utilizes secondary indexes, where virtual computed columns are regular stored columns.
Informs: #68254
Epic: CRDB-8949
Release note (sql change): Add a new cluster setting,
sql.stats.virtual_computed_columns.enabled
, which when set enables collection of table statistics on virtual computed columns.