Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.2: opt/memo: use virtual column stats in statistics builder #121179

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Mar 27, 2024

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: high priority business need for the functionality which cannot wait until the next release.

Copy link

blathers-crl bot commented Mar 27, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL and one additional
    TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Mar 27, 2024
Copy link

blathers-crl bot commented Mar 27, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

mgartner and others added 6 commits March 28, 2024 22:51
This commit reduces allocation in `statisticsBuilder.colStatJoin`.
Previously, it was creating intersections of two sets, which, in some
cases, were only useful for checking their emptiness. Now we use the
`ColSet.Intersects` method which returns a boolean and does not
build a new set.

Release note: None
As of cockroachdb#118241 we now collect table statistics on virtual computed
columns, but do not yet use them in statistics builder. The difficulty
with using these stats in statistics builder is that virtual computed
columns are synthesized by various non-Scan expressions (Project,
Select, etc). When calculating stats for these non-Scan expressions, we
need to find the virtual column stats even though the virtual columns
are not produced by the input to these expressions.

To solve this, we add a VirtualCols set to props.Statistics which holds
all of the virtual columns that could be produced by the input to a
group. Expressions that could synthesize virtual columns will look in
this set to discover whether there are statistics for any of the scalar
expressions they render. If there are, they will call colStatXXX using
the virtual column ID as if the virtual column had originated from
their input.

This commit adds VirtualCols but does not yet use it.

Note that we cannot currently pass VirtualCols up through set operations
or with-scans, due to the column ID translation they use.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
Informs: cockroachdb#68254

Epic: CRDB-8949

Release note (sql): Add new session variable
`optimizer_use_virtual_computed_column_stats`. When this variable is
enabled, the optimizer will make use of table statistics on virtual
computed columns.
Throughout statistics builder we use OutputCols to determine which
columns come from the input to an expression. We then typically call
colStatXXX with those columns as part of statistics calculation.

In order to use statistics on virtual computed columns, we need to call
colStatXXX on any virtual columns that could come from our input, even
if they are not passed upward through OutputCols. To do this we extend
OutputCols with the VirtualCols set we built in a previous commit. This
commit replaces almost all usages of OutputCols in statistics builder
with a call to helper function colStatCols, which returns a union of
OutputCols and VirtualCols.

This is enough to get the optimizer to use statistics on virtual
computed columns in some simple plans. More complex plans will require
matching the virtual column scalar expressions, which will be in the
next PR. I've left some TODOs marking spots where this next PR will
touch.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
As of cockroachdb#120668 we now use statistics on virtual computed columns in
statistics builder. Simple queries that synthesize virtual columns in
Project expressions already benefit, because they use the virtual column
ID when synthesizing the virtual column.

Other expressions, however, do not directly use the virtual column ID
when synthesizing a virtual column. This includes Select expressions,
joins, constrained scans, and some Project expressions.

For example, consider a query like the following:

```
CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a % 10) VIRTUAL, INDEX (b));
SELECT * FROM ab WHERE a % 10 > 3;
```

Even though the filter condition is in terms of `a`, we'd like to use
the statistics on virtual computed column `b` since the expression
matches.

In order to do this, we replace `a % 10` with `b` in a copy of the
filter condition before doing any stats calculations. Then we perform
our normal stats calculations, using `b`.

Fixes: cockroachdb#68254
Fixes: cockroachdb#110146

Epic: CRDB-8949

Release note: None
With optimizer_use_virtual_computed_column_stats set to false,
constrained scans were still sometimes using stats on virtual computed
columns. This commit adds a check to makeTableStatistics which prevents
creation of any statistics referencing a virtual computed column, which
is a stronger check than existed before.

With this check, the VirtualCols sets will always be empty when
optimizer_use_virtual_computed_column_stats is false.

Informs: cockroachdb#68254

Epic: CRDB-8949

Release note: None
@michae2 michae2 force-pushed the backport23.2-114623-120668-120875-121171 branch from 91239ec to a4f2275 Compare March 29, 2024 06:10
@michae2
Copy link
Collaborator Author

michae2 commented Mar 29, 2024

Are all of the changes protected via a flag or option, new syntax, an environment variable or default-disabled cluster setting?

All of the changes are behind session variable optimizer_use_virtual_computed_column_stats which is disabled by default in these backports.

What are the risks to backporting this change? Can the risks of backporting be mitigated?

Risks:

  • increase in memo size due to VirtualCols leads to measurable increase in memory usage (seems unlikely)
  • with optimizer_use_virtual_computed_column_stats enabled, statistics calculation becomes less accurate for some queries
  • or alternatively, with it enabled, more accurate virtual column stats cause a plan change which turns out to be slower
  • with optimizer_use_virtual_computed_column_stats enabled, recursion in factorOutVirtualCols finds a re-entrancy bug in statistics builder

What are the risks to not backporting?

Statistics estimates for queries on indexed virtual computed columns will continue to be inaccurate, possibly prohibiting creation of indexes on virtual computed columns.

Does this change really need to be backported? Or can it reasonably wait until the next major release to be addressed?

It cannot wait.

@michae2
Copy link
Collaborator Author

michae2 commented Mar 29, 2024

The backport was fairly clean. Differences from master:

  • included first commit 18ecc9f to avoid merge skew in buildJoin and colStatJoin
  • optimizer_use_virtual_computed_column_stats is false by default
  • because of this, had to add set=optimizer_use_virtual_computed_column_stats=true to opttests
  • removed a couple of test cases in pkg/sql/logictest/testdata/logic_test/distsql_stats so that I could backport concurrently with release-23.2: sql: collect table statistics on virtual computed columns #120923

@mgartner
Copy link
Collaborator

Was opt: avoid ColSet allocations in statisticsBuilder.colStatJoin intentionally included?

@michae2
Copy link
Collaborator Author

michae2 commented Mar 29, 2024

Was opt: avoid ColSet allocations in statisticsBuilder.colStatJoin intentionally included?

Yes, the fourth commit touches most of the same lines as this commit, so I included it to avoid merge skew. (I would have ended up re-creating it anyway in order to fix the fourth commit, so it was cleaner this way.)

@michae2
Copy link
Collaborator Author

michae2 commented Mar 29, 2024

(Failures are known flake #120775 and what looks like an unrelated timeout under stressrace, btw.)

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I noted a few minor things I think I missed in the original review, but nothing to hold this up—just things to consider in the future.

Reviewed 1 of 1 files at r1, 12 of 12 files at r2, 8 of 8 files at r3, 2 of 2 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 258 at r4 (raw file):

	if sb.evalCtx.SessionData().OptimizerUseVirtualComputedColumnStats &&
		!props.Statistics().VirtualCols.Empty() {
		return props.OutputCols.Union(props.Statistics().VirtualCols)

I can see now one of the reasons you had this new set contain both output and virtual columns originally (IIRC): it could be used directly here instead of creating a new set with Union and potentially allocating. No need to change back, but something we can maybe reconsider if this ever shows up in a profile.


pkg/sql/opt/memo/statistics_builder.go line 4977 at r5 (raw file):

		expr  opt.Expr
	}
	virtExprs := make([]virtExpr, virtualCols.Len())

nit: We could probably store this in statisticsBuilder and reuse the allocated slice everytime the function is called.

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @rafiss, and @rytaft)


pkg/sql/opt/memo/statistics_builder.go line 4977 at r5 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: We could probably store this in statisticsBuilder and reuse the allocated slice everytime the function is called.

Drew had this feedback, too. It makes sense, but I'm a little wary, because I believe it's possible to re-enter this function during the replacement step. I haven't proven this, but during replacement we construct new expressions, and so it might be possible that we construct a new group which then has logical props and statistics built. While building stats we might call back into this function.

So just to play it safe, seems better to keep the slice local.

(Maybe I should leave this as a comment in the code?)

@michae2
Copy link
Collaborator Author

michae2 commented Mar 29, 2024

I'll go ahead and merge so we get some test coverage over the weekend.

@michae2 michae2 merged commit 66575ad into cockroachdb:release-23.2 Mar 29, 2024
5 of 6 checks passed
@michae2 michae2 deleted the backport23.2-114623-120668-120875-121171 branch March 29, 2024 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants