Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: collect table statistics on virtual computed columns #118241

Merged
merged 2 commits into from Jan 31, 2024

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Jan 23, 2024

Add rendering of virtual computed column expressions to the CREATE STATISTICS distsql plan. This rendering should always be added as post processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed columns. A future PR will make use of the statistsics in statistics builder.

Collection of partial statistics (USING EXTREMES) on virtual computed columns simply works after this change because partial statistics collection utilizes secondary indexes, where virtual computed columns are regular stored columns.

Informs: #68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting, sql.stats.virtual_computed_columns.enabled, which when set enables collection of table statistics on virtual computed columns.

@michae2 michae2 requested review from a team as code owners January 23, 2024 19:05
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2
Copy link
Collaborator Author

michae2 commented Jan 23, 2024

This needs more tests, but the distsql changes are RFAL.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I have a few comments.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/distsql_plan_stats.go line 363 at r1 (raw file):

}

func (dsp *DistSQLPlanner) createStatsPlan(

Do we not need to apply similar changes to createPartialStatsPlan function? Is it because currently partial statistics are only supported on non-virtual columns? Should we add an explicit error-check for that and a unit test?


pkg/sql/distsql_plan_stats.go line 419 at r1 (raw file):

				if !tableColSet.Contains(c) {
					if _, err = catalog.MustFindColumnByID(desc, c); err != nil {
						fmt.Println("could not find column!")

nit: this seems like a debugging leftover - should this be log.* call? Or perhaps this could be a panic behind buildutil.CrdbTestBuild?


pkg/sql/distsql_plan_stats.go line 475 at r1 (raw file):

		)
		if err != nil {
			fmt.Println("error from MakeComputedExprs:", err)

nit: ditto for logging / panic in test builds.


pkg/sql/distsql_plan_stats.go line 483 at r1 (raw file):

		resultCols := colinfo.ResultColumnsFromColumns(desc.GetID(), requestedCols)

		ivh := tree.MakeIndexedVarHelper(nil, len(scan.cols))

nit: add /* container */.


pkg/sql/distsql_plan_stats.go line 487 at r1 (raw file):

		for i, col := range requestedCols {
			if col.IsVirtual() {
				if virtIdx > len(virtComputedExprs) {

Shouldn't we use >= instead of > in this if condition (as well as two others below)?


pkg/sql/distsql_plan_stats.go line 489 at r1 (raw file):

				if virtIdx > len(virtComputedExprs) {
					return nil, errors.AssertionFailedf(
						"virtual computed column expressions do not match requested columns",

nit: consider including more details into the assertion errors.


pkg/sql/opt/exec/execbuilder/testdata/distsql_misc line 219 at r1 (raw file):

·
Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJzsl9Fu2zYUhu_3FMS5ajG6EinZUXjlzG0Bo0tS2EZvBqNgxBNXiExqJI00C_xYe4E92SApbm3VNix3XnsRXQjg4dHP_-h8FKhHcH_mIGAwenMxeUPGk4vJcDwZDsbEMXJ9RW4pQUoUeTu6viRKegkUtFF4JefoQPwBDChwoBABhRgodGFKobAmReeMLVMeqweG6jOIkEKmi4Uvw1MKqbEI4hF85nMEARN5k-MIpUIbhEBBoZdZXi1TLt0vbx-LO3wACgOTL-baCaIouaEkBQrjQpaBTsCI1IowYvwntEBhhFqhFaTPKOnzX5kQYng1SSiRXmquXvQj2mdCvP39-mKSvAQK7z4Qn81RkPCfv109To32qH1m9DdT1tw7ojA1CpUgLAzr8M2DR0csSiVIEobktzo8G70fkFTmuVvLLWRmV7m8Cl5-GAyI81iQ1Cy0Jy_wsw8y7V8KEgZfExDvdiVUtszCFwtfrzRdUqjHT-_feTlDEGytYcPXIMIlPbxnYzkvcrRBd7NfdXic_VW-qbI3XvqqAX1O-9FOK7xhpbvTylcHC22sQotqw8F0udPsxWxmcSa9sQEL_xPbUcM223yF7HDs2dHYB6wT8B8JPt8GPnvVI--yLejzbejHp0Cft0CftenbCv3eSdDvbVjhhyPEj0eId4LoGaHvQ4i36dsKobOTIHS2YSU6HKHoeISiThA_I_R9CEVt-rZCKDkJQsmGlfhwhOLjEYo7QfdHIhRtQ4i_irYjFG1DqHcKhKIWCMVt-rZC6PwkCJ23OU6O0BVGO2yc5bavFDZW6rDy0IdqhvUJ0ZmFTfG9NWmVWw-vK6EqoND5epbVg6FeTTlvUc6_nIbXldheJb6hxNaVuk0lvt9TG1PRXql4txJrKsVty5NVV0Cjvzf2rt5LDnX501H-LD6F671Ux5PqX2g1M0fn5OzLZAjTdX-9pr_uXn-93ZXyplLv56r0rOnvbK-_ZHelUVMp-bkqTZr-zvdvhHB3qfE3u3P_Rv_faz0vP0m3ubn_mCkQED5dnS231QXlA3Lmyu_i-JO5r8xOHoryq3Yrc4cULuUdvkaPdp7pzPksBeHtApfLX_4NAAD___EFm6M=

nit: maybe add a test where the computed expression cannot be evaluated in a distributed fashion (i.e. when the stats plan must be local)?

@michae2 michae2 force-pushed the virtcolstats branch 2 times, most recently from 5fd31c8 to c87fbda Compare January 30, 2024 01:47
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments!

  • Added checks and a test for un-distributable virtual column expressions (great catch!).
  • Added a cluster setting.
  • Added more tests.
  • Partial statistics (USING EXTREMES) just work! Added a test.

I think this is RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)


pkg/sql/create_stats.go line 390 at r2 (raw file):

	if virtColEnabled {
		semaCtx := tree.MakeSemaContext()
		exprs, _, err := schemaexpr.MakeComputedExprs(

This is unfortunate but necessary to perform the distsql check below (checkExpr) which relies on having a fully-typed expression.

Also unfortunate is that we're performing this work twice: both here and in createStatsPlan. I'm doing this here to silently skip over un-distributable virtual column expressions, and in createStatsPlan to build the plan (and to return an error if un-distributable virtual column expressions sneak in another way, e.g. from CREATE STATISTICS with explicitly-requested columns). I'm not yet sure of an easy way to de-duplicate the work.


pkg/sql/distsql_plan_stats.go line 363 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do we not need to apply similar changes to createPartialStatsPlan function? Is it because currently partial statistics are only supported on non-virtual columns? Should we add an explicit error-check for that and a unit test?

As it turns out, with this change partial statistics (USING EXTREMES) already work on virtual computed columns. 🎉 This is because partial statistics are only supported for indexed columns, and if a virtual computed column is indexed it acts as a normal stored column in that secondary index. We don't need to do any extra rendering.

I added some tests for partial statistics on virtual computed columns to the end of pkg/sql/logictest/testdata/logic_test/distsql_stats.


pkg/sql/distsql_plan_stats.go line 419 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this seems like a debugging leftover - should this be log.* call? Or perhaps this could be a panic behind buildutil.CrdbTestBuild?

Gah, sorry. Thanks. Removed.


pkg/sql/distsql_plan_stats.go line 475 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: ditto for logging / panic in test builds.

🤦 removed


pkg/sql/distsql_plan_stats.go line 483 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: add /* container */.

Done.


pkg/sql/distsql_plan_stats.go line 487 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Shouldn't we use >= instead of > in this if condition (as well as two others below)?

Yes, thank you.


pkg/sql/distsql_plan_stats.go line 489 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: consider including more details into the assertion errors.

Done.


pkg/sql/opt/exec/execbuilder/testdata/distsql_misc line 219 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe add a test where the computed expression cannot be evaluated in a distributed fashion (i.e. when the stats plan must be local)?

Great catch! I was not looking for this, and was always executing the plan in a distributed fashion regardless of whether the expression supported it.

I think we probably want to skip collecting stats on un-distributable virtual column expressions, rather than run CREATE STATISTICS locally. (I think it would create a risk of performance impact to do a full table scan through distsender.)

I've added checks for un-distributable expressions to createStatsDefaultColumns and createStatsPlan, as well as tests. If the expression cannot be distributed we now skip collecting stats on the virtual column (or return an error if the column was explicitly requested).

@michae2
Copy link
Collaborator Author

michae2 commented Jan 30, 2024

Oh, I forgot to mention:

  1. We do now collect stats on the hidden virtual column used by expression indexes.
  2. But we do not collect stats on partial index expressions.

I think we could pretty easily do (2) but I have left it out of this PR for now.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a few nits

Reviewed 1 of 3 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/create_stats.go line 429 at r2 (raw file):

		}

		// Do not collect stats for virtual computed columns.

nit: this comment is slightly misleading, since we do want to collect stats for some virtual computed columns


pkg/sql/create_stats.go line 568 at r2 (raw file):

		col := desc.PublicColumns()[i]

		// Do not collect stats for virtual computed columns.

nit: misleading comment


pkg/sql/distsql_plan_stats.go line 417 at r2 (raw file):

			refColIDs.ForEach(func(c descpb.ColumnID) {
				if !tableColSet.Contains(c) {
					if _, err = catalog.MustFindColumnByID(desc, c); err != nil {

why do you need this?


pkg/sql/importer/import_job.go line 1044 at r2 (raw file):

			// columns.
			multiColEnabled := false
			defaultHistogramBuckets := stats.GetDefaultHistogramBuckets(execCfg.SV(), desc)

you're removing the ability to set the histogram buckets here -- was this intentional?

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :lgtm:

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2)


pkg/sql/create_stats.go line 273 at r2 (raw file):

		for i := range columns {
			if columns[i].IsVirtual() && !statsOnVirtualCols.Get(n.p.ExecCfg().SV()) {
				return nil, pgerror.Newf(

nit: maybe add an error hint to change the cluster setting to allow it?


pkg/sql/create_stats.go line 390 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This is unfortunate but necessary to perform the distsql check below (checkExpr) which relies on having a fully-typed expression.

Also unfortunate is that we're performing this work twice: both here and in createStatsPlan. I'm doing this here to silently skip over un-distributable virtual column expressions, and in createStatsPlan to build the plan (and to return an error if un-distributable virtual column expressions sneak in another way, e.g. from CREATE STATISTICS with explicitly-requested columns). I'm not yet sure of an easy way to de-duplicate the work.

This seems reasonable to me.

IIUC one way to avoid the duplicate work would be to extend jobspb.CreateStatsDetails to somehow mark distributable expressions for virtual columns (e.g. list of column ids). Then we'd only check the distributability when creating this proto, and in createStatsPlan we wouldn't have to do that. However, the current approach seems good.


pkg/sql/create_stats.go line 403 at r2 (raw file):

		}
		for i, col := range desc.PublicColumns() {
			cannotDistribute[i] = col.IsVirtual() && checkExpr(exprs[i]) != nil

nit: maybe do s/checkExpr/checkExprForDistSQL/ to clarify this helper?


pkg/sql/create_stats.go line 430 at r2 (raw file):

		// Do not collect stats for virtual computed columns.
		if col.IsVirtual() && (!virtColEnabled || cannotDistribute[col.Ordinal()]) {

nit: maybe extract a helper for this conditional?


pkg/sql/opt/exec/execbuilder/testdata/distsql_misc line 219 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Great catch! I was not looking for this, and was always executing the plan in a distributed fashion regardless of whether the expression supported it.

I think we probably want to skip collecting stats on un-distributable virtual column expressions, rather than run CREATE STATISTICS locally. (I think it would create a risk of performance impact to do a full table scan through distsender.)

I've added checks for un-distributable expressions to createStatsDefaultColumns and createStatsPlan, as well as tests. If the expression cannot be distributed we now skip collecting stats on the virtual column (or return an error if the column was explicitly requested).

Nice! Although I wonder whether we should have an escape hatch to allow stats collection even if the expr cannot be distributed, like controlled by a session variable (disabled by default). Perhaps if the virtual column was explicitly requested, this should always be allowed. I think the current approach makes sense, so maybe sprinkle some TODOs for this, and we'll address them once we have at least one user complain about it?


pkg/sql/opt/exec/execbuilder/testdata/distsql_misc line 251 at r2 (raw file):

Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJzsV8Fu2zgQve9XEHNKsHQkUrLj8OSs2wJGN01hG70sgoIRp65gWdSSNNJs4M_aH9gvW0iqa0mxXAtpgQSoD4I4MxrOm3l8Bh_A_p2AgPH09eX8NZnNL-eT2XwynhHLyJvp9RVR0kmgkGqF7-QKLYi_gAEFDhQCoBAChT7cUMiMjtBabfKQh-KDifoCwqcQp9na5eYbCpE2COIBXOwSBAFzeZvgFKVC4_lAQaGTcVJsk289yh8fsyXeA4WxTtar1AoiKbmlJKJEAYVZJnNbz2NEpoowot1nNEBhiqlCI8iIUTLilIwCSkZh_v47E0JM3s2HlEgnU65ORgEdhUK8-fP6cj48BQpvPxAXr1AQ_79_bbmOdOowdbFOH7mMvrNEYaQVKkGY75fm23uHlhiUSpCh75M_SvNi-n5MIpkkthKbydhsY3lhvPowHhPrMCORXqeOnOAX58WpOxXE93YBiMu2gKIsvXbZ2pU73WwolOuv47BOLhAEq8xv8gqEv6HHj3AmV1mCxuvXx1eaZ_E_WOxdQJo56fKB7F55xUqrq6DuoFVD-MhHq7b-7nXQCpk3IPdbIe-QrlNtFBpUNaQ3m9amXC4WBhfSaeMx__j2kBPu--R2HS3R2dNKs9ocjdYdCqs38juR9ba2BQ_ICd-Z2xoeNBrO6iRjx-sEe4pOeKzn8eeiFHyfUrCzAXkb79EKvk8rwp-hFbyDVrAuY9xqxeBFa8WgBpkfz1z-JObynhf8Yu6PYy7vMsYtc89fNHPPa5CD45kbPIm5Qc8LfzH3xzE36DLGLXOHL5q5wxrk8Hjmhk9ibtjz-s-FucE-5vKzYD9zg33MHfwM5gYdmBt2GeOWuRcvmrkXXS5TU7SZTi02bhj7d_IbO_VYfhVBtcDy3mL12kT43uioiC2X10WiwqDQutLLysUk3bqsMyhX3-6C1UzsYCZey8SqmfrNTPxwTV2KCg6mCtszsWamsCs8WUwFUnR32izLM2sxLSQ9Z-7WUZ7a0rP7v9l6V2itXOwC_BBuqnUOmnX2D9Y5aEfMm5kGzxPxebPO84N1DtsRB81Mw-eJeNis8-LwAfHbIYePTu1hAeiEOWjFzM_C72DuNzBf5JL1KdF3H2MFAvyvv96ex_YH-QdyYXPdnH3Wd0XR8_ssV71PMrFI4Uou8RU6NKs4ja2LIxDOrHGz-e3_AAAA___c-GNe

# Check that we also collect stats on the hidden expression index virt column.

nit: might be also interesting to see what happens when the column is made INVISIBLE.

Add rendering of virtual computed column expressions to the CREATE
STATISTICS distsql plan. This rendering should always be added as post
processors on the TableReader nodes that feed the Samplers.

With this change we now collect table statistics on virtual computed
columns. A future PR will make use of the statistsics in statistics
builder.

Collection of partial statistics (USING EXTREMES) on virtual computed
columns simply works after this change because partial statistics
collection utilizes secondary indexes, where virtual computed columns
are regular stored columns.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql change): Add a new cluster setting,
`sql.stats.virtual_computed_columns.enabled`, which when set enables
collection of table statistics on virtual computed columns.
This function checks whether an expression supports distributed
execution. The name was a little ambiguous, especially in the giant
sql package.

Epic: None
Release note: None
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

After finishing the next part (statistics_builder) I will work on the backport.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft and @yuzefovich)


pkg/sql/create_stats.go line 273 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe add an error hint to change the cluster setting to allow it?

Done.


pkg/sql/create_stats.go line 403 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe do s/checkExpr/checkExprForDistSQL/ to clarify this helper?

Done (in a second commit).


pkg/sql/create_stats.go line 429 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: this comment is slightly misleading, since we do want to collect stats for some virtual computed columns

Rephrased.


pkg/sql/create_stats.go line 430 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe extract a helper for this conditional?

Done.


pkg/sql/create_stats.go line 568 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: misleading comment

Rephrased.


pkg/sql/distsql_plan_stats.go line 417 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why do you need this?

Ah, good catch, I don't. (It was a sanity check during development.) Removed.


pkg/sql/importer/import_job.go line 1044 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

you're removing the ability to set the histogram buckets here -- was this intentional?

Yes: as it turns out, StubTableStats doesn't do anything useful with the number of histogram buckets anyway. It always returns statistics with nil HistogramData. (And this is the only user of StubTableStats.)


pkg/sql/opt/exec/execbuilder/testdata/distsql_misc line 219 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Nice! Although I wonder whether we should have an escape hatch to allow stats collection even if the expr cannot be distributed, like controlled by a session variable (disabled by default). Perhaps if the virtual column was explicitly requested, this should always be allowed. I think the current approach makes sense, so maybe sprinkle some TODOs for this, and we'll address them once we have at least one user complain about it?

Good point, added some TODOs.


pkg/sql/opt/exec/execbuilder/testdata/distsql_misc line 251 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: might be also interesting to see what happens when the column is made INVISIBLE.

Added a test with some hidden columns.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)


pkg/sql/importer/import_job.go line 1044 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Yes: as it turns out, StubTableStats doesn't do anything useful with the number of histogram buckets anyway. It always returns statistics with nil HistogramData. (And this is the only user of StubTableStats.)

thanks for the explanation!

@michae2
Copy link
Collaborator Author

michae2 commented Jan 31, 2024

Thanks!

bors r=yuzefovich,rytaft

@craig
Copy link
Contributor

craig bot commented Jan 31, 2024

Build succeeded:

@craig craig bot merged commit 9a98c70 into cockroachdb:master Jan 31, 2024
8 of 9 checks passed
Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

Reviewed 2 of 5 files at r2, 5 of 5 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@michae2 michae2 deleted the virtcolstats branch February 8, 2024 00:28
@michae2
Copy link
Collaborator Author

michae2 commented Feb 8, 2024

Note to self: when backporting, need to include #118930

michae2 added a commit to michae2/cockroach that referenced this pull request Mar 19, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 21, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
craig bot pushed a commit that referenced this pull request Mar 22, 2024
120668: opt/memo: extend OutputCols with VirtualCols in statistics builder r=DrewKimball,mgartner a=michae2

**opt: add props.Statistics.VirtualCols**

As of #118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: #68254

Epic: CRDB-8949

Release note: None

---

**sql: add optimizer_use_virtual_computed_column_stats session variable**

Informs: #68254

Epic: CRDB-8949

Release note (sql): Add new session variable
`optimizer_use_virtual_computed_column_stats`. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.

---

**opt/memo: extend OutputCols with VirtualCols in statistics builder**

Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: #68254

Epic: CRDB-8949

Release note: None

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
@michae2 michae2 added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Mar 23, 2024
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 26, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 29, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
michae2 added a commit to michae2/cockroach that referenced this pull request Mar 29, 2024
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants