sql: introduce canary full stats rollout and stats_as_of session var#158704

Open
ZhouXing19 wants to merge 3 commits into cockroachdb:master from ZhouXing19:canary-main-1203

@ZhouXing19
Collaborator

@ZhouXing19 ZhouXing19 commented Dec 3, 2025

Informs: #150015

sql/opt: implement canary full statistics rollout with configurable window

This commit implements the core logic for canary statistics rollout,
allowing gradual deployment of newly collected full statistics.
Previously, all queries would immediately use the most recent full
statistics, which could cause performance regressions if the new full
statistics were inaccurate.

The implementation adds a StatsCanaryWindow field to table descriptors
and catalog interfaces to define the canary period, along with logic in
the statistics builder that skips "canary" statistics (the latest stats
within the canary window) when a query is not on the canary path. Each
query is assigned to the canary or stable path at random, with the odds
determined by the cluster setting sql.stats.canary_fraction.

Release note (sql change): implement the core logic for canary full
statistics rollout, which is configurable via the table-level storage
parameter (sql_stats_canary_window) and the cluster setting
sql.stats.canary_fraction.
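
To illustrate the selection described in this commit message, here is a minimal sketch. It is not the code in this PR: `fullStat`, `useCanaryPath`, `pickFullStat`, and `ts` are hypothetical names, and it assumes that the canary fraction is the probability of taking the canary path and that a stat counts as canary when it was created strictly after `now - window`.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const oneDay = 24 * time.Hour

// fullStat is a hypothetical, simplified stand-in for one full statistics
// collection on a table.
type fullStat struct {
	name      string
	createdAt time.Time
}

// ts is a small helper that builds a UTC timestamp.
func ts(year, month, day, hour int) time.Time {
	return time.Date(year, time.Month(month), day, hour, 0, 0, 0, time.UTC)
}

// useCanaryPath decides the path for one query. roll is a uniform random
// number in [0, 1). Treating canaryFraction as the probability of the canary
// path is an assumption of this sketch.
func useCanaryPath(roll, canaryFraction float64) bool {
	return roll < canaryFraction
}

// pickFullStat returns the newest full statistic the query may use. stats
// must be sorted newest first. On the stable path, statistics created within
// canaryWindow of now are skipped.
func pickFullStat(
	stats []fullStat, now time.Time, canaryWindow time.Duration, canary bool,
) (fullStat, bool) {
	cutoff := now.Add(-canaryWindow)
	for _, s := range stats {
		if !canary && s.createdAt.After(cutoff) {
			continue // still inside the canary window
		}
		return s, true
	}
	return fullStat{}, false
}

func main() {
	now := ts(2025, 12, 2, 12)
	stats := []fullStat{
		{name: "full 2025-12-02", createdAt: ts(2025, 12, 2, 6)},
		{name: "full 2025-11-01", createdAt: ts(2025, 11, 1, 0)},
	}
	canary := useCanaryPath(rand.Float64(), 0.1)
	if s, ok := pickFullStat(stats, now, oneDay, canary); ok {
		fmt.Printf("canary=%t uses %q\n", canary, s.name)
	}
}
```

With a one-day window, a stable query in this sketch falls back to the 2025-11-01 collection, while a canary query uses the one from 2025-12-02.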


sql: implement stats_as_of session variable for time-based statistics selection

This commit adds a new session variable stats_as_of that allows
controlling statistics selection based on a specific timestamp rather
than the current time. Previously, statistics selection was always
relative to the current wall clock time, making it difficult to get
consistent query plans for historical analysis or testing.

This feature is only for debugging and troubleshooting, and should not
be used in production.

The implementation is also integrated into the existing canary
statistics logic to respect the as-of timestamp when determining canary
window boundaries.

Release note (sql change): adds a new session variable stats_as_of
that allows controlling statistics selection based on a specific
timestamp rather than the current time.


sql/opt: add tests for using canary stats rollout in makeTableStatistics

This commit adds tests for usage of canary stats rollout in
makeTableStatistics(), which is the main entry point where statistics
are selected for query optimization. This commit focuses on unit
testing the makeTableStatistics() function and does not include
end-to-end logic tests, which would require additional changes to
Builder.maybeAnnotateWithEstimates() to support EXPLAIN ANALYZE
output showing which statistics were used during planning.

To enable testing, this commit adds:

- Handler in opttester for setting the canary window storage
  parameter
- Testing knob for controlling the canary fraction setting
- Three new test files covering basic canary stats, histogram canary
  stats, and multi-column canary stats scenarios

Release note: None

@blathers-crl

blathers-crl Bot commented Dec 3, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@ZhouXing19 ZhouXing19 changed the title from "Canary main 1203" to "sql: introduce canary full stats rollout and stats_as_of session var" Dec 3, 2025
@ZhouXing19 ZhouXing19 requested a review from michae2 December 4, 2025 09:14
@ZhouXing19 ZhouXing19 marked this pull request as ready for review December 4, 2025 09:14
@ZhouXing19 ZhouXing19 requested review from a team as code owners December 4, 2025 09:14
ZhouXing19 added a commit to ZhouXing19/cockroach that referenced this pull request Dec 4, 2025
Collaborator

@michae2 michae2 left a comment

Nice work! It's a good PR, but there's a conflict with forecasting and merging.

@michae2 reviewed 7 of 12 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ZhouXing19)


pkg/sql/opt/cat/utils.go line 378 at r1 (raw file):

		(!stat.IsMerged() || sd.OptimizerUseMergedPartialStatistics) &&
		(!stat.IsForecast() || sd.OptimizerUseForecasts) &&
		(createdAtTs.Less(olderThan) || (inclusive && createdAtTs.Equal(olderThan)))

This approach makes sense, but stats merging and stats forecasting throw a wrench in things. Suppose that we have the following series of stats collections (in descending time order, newest first):

  1. Automatic partial (USING EXTREMES) 2025-12-01
  2. Automatic full 2025-11-01
  3. Automatic full 2025-10-01
  4. Automatic full 2025-09-01

After this, the list returned by stats.(*TableStatisticsCache).GetTableStats will look like:

  1. Forecast 2026-01-01
  2. Merged 2025-12-01
  3. Automatic partial (USING EXTREMES) 2025-12-01
  4. Automatic full 2025-11-01
  5. Automatic full 2025-10-01
  6. Automatic full 2025-09-01

(1) and (2) are computed in the stats cache. The Forecast (1) will be based on items 2, 4, 5, and 6, and will be visible even if 2026-01-01 is in the future.

Now suppose the canary window is 1 day, today is 2025-12-02, and we automatically collected full stats earlier today on 2025-12-02 which should still be canary. GetTableStats will return:

  1. Forecast 2026-01-02
  2. Automatic full 2025-12-02 (canary)
  3. Automatic partial (USING EXTREMES) 2025-12-01
  4. Automatic full 2025-11-01
  5. Automatic full 2025-10-01
  6. Automatic full 2025-09-01

The Forecast here will be based on the new canary stats instead of the merged stats, and will be different from what it was before.

If our query is supposed to be using stable stats, FindLatestFullStat will return:

  1. Automatic full 2025-11-01
  2. Automatic full 2025-10-01
  3. Automatic full 2025-09-01

These are all the full stats collections created before 2025-12-01. But because the forecast is omitted, the query sees different stats than it did before the canary collection, even though it is supposed to be using stable stats.
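
To make this concrete, here is a minimal sketch that applies a predicate shaped like the quoted condition to the six-item list above. The types and helpers (`stat`, `sessionData`, `eligible`, `eligibleNames`, `date`) are simplified stand-ins invented for illustration, not the real catalog or session types, and the forecast's created-at time is assumed to be its forecast date.

```go
package main

import (
	"fmt"
	"time"
)

type statKind int

const (
	full statKind = iota
	partial
	merged
	forecast
)

// stat is a hypothetical, simplified table statistic.
type stat struct {
	name      string
	kind      statKind
	createdAt time.Time
}

// sessionData holds stand-ins for the two session settings in the condition.
type sessionData struct {
	useMergedPartialStatistics bool
	useForecasts               bool
}

func date(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

// eligible mirrors the shape of the quoted condition: merged and forecast
// stats need their session setting, and every stat must be older than (or,
// if inclusive, as old as) olderThan.
func eligible(s stat, sd sessionData, olderThan time.Time, inclusive bool) bool {
	return (s.kind != merged || sd.useMergedPartialStatistics) &&
		(s.kind != forecast || sd.useForecasts) &&
		(s.createdAt.Before(olderThan) || (inclusive && s.createdAt.Equal(olderThan)))
}

// exampleStats is the list from the scenario above, newest first.
func exampleStats() []stat {
	return []stat{
		{"Forecast 2026-01-02", forecast, date(2026, 1, 2)},
		{"Automatic full 2025-12-02 (canary)", full, date(2025, 12, 2)},
		{"Automatic partial (USING EXTREMES) 2025-12-01", partial, date(2025, 12, 1)},
		{"Automatic full 2025-11-01", full, date(2025, 11, 1)},
		{"Automatic full 2025-10-01", full, date(2025, 10, 1)},
		{"Automatic full 2025-09-01", full, date(2025, 9, 1)},
	}
}

// eligibleNames returns the names of all stats that pass the predicate.
func eligibleNames(stats []stat, sd sessionData, olderThan time.Time, inclusive bool) []string {
	var names []string
	for _, s := range stats {
		if eligible(s, sd, olderThan, inclusive) {
			names = append(names, s.name)
		}
	}
	return names
}

func main() {
	sd := sessionData{useMergedPartialStatistics: true, useForecasts: true}
	for _, n := range eligibleNames(exampleStats(), sd, date(2025, 12, 1), false) {
		fmt.Println(n)
	}
}
```

Even with forecasts enabled in the session, the forecast fails the time check because its timestamp is later than the cutoff, so the sketch prints only the three automatic full collections.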

We could change FindLatestFullStat to include the forecast, like:

  1. Forecast 2026-01-02
  2. Automatic full 2025-11-01
  3. Automatic full 2025-10-01
  4. Automatic full 2025-09-01

But this is still different from what the stats were before, because the forecast is different.

If we really want the stable stats to be exactly the same as they were before the canary stats were collected, I think we're going to have to push the created_at check down into the stats cache, before partial stats merging and forecasting happen. And then merging and forecasting will have to happen on either the stable or canary stats.

Similar to the memo caching problem we ran into in #156307, I think we'll run into a stats caching problem here. In that case we decided not to cache canary memos. In this case, it might be easier to add a "canary" argument to GetTableStats which then is plumbed all the way down, similar to the "forecast" argument that is passed down.
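
A rough sketch of that ordering, under heavy simplification: the created_at cutoff is applied first, and the derived stat is computed only from what survives. `getTableStats` here is a hypothetical stand-in and not the real stats cache method, `forecastFrom` is a toy linear extrapolation rather than the actual forecasting model, and partial stats merging is left out entirely.

```go
package main

import (
	"fmt"
	"time"
)

// stat is a hypothetical, simplified table statistic.
type stat struct {
	name      string
	createdAt time.Time
	rowCount  int
}

func date(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

// forecastFrom is a toy stand-in for forecasting: it extrapolates linearly
// from the two newest visible collections.
func forecastFrom(visible []stat) (stat, bool) {
	if len(visible) < 2 {
		return stat{}, false
	}
	newest, prev := visible[0], visible[1]
	return stat{
		name:      "forecast",
		createdAt: newest.createdAt.Add(newest.createdAt.Sub(prev.createdAt)),
		rowCount:  2*newest.rowCount - prev.rowCount,
	}, true
}

// getTableStats applies the cutoff BEFORE deriving the forecast, so stable
// callers never see a forecast influenced by canary stats. collected must be
// sorted newest first.
func getTableStats(collected []stat, canary bool, cutoff time.Time) []stat {
	var visible []stat
	for _, s := range collected {
		if canary || s.createdAt.Before(cutoff) {
			visible = append(visible, s)
		}
	}
	var out []stat
	if f, ok := forecastFrom(visible); ok {
		out = append(out, f)
	}
	return append(out, visible...)
}

// exampleBefore is the state before the canary collection (made-up row counts).
func exampleBefore() []stat {
	return []stat{
		{"full 2025-11-01", date(2025, 11, 1), 1100},
		{"full 2025-10-01", date(2025, 10, 1), 1000},
		{"full 2025-09-01", date(2025, 9, 1), 900},
	}
}

// exampleAfter adds a canary collection with a very different row count.
func exampleAfter() []stat {
	canary := stat{"full 2025-12-02 (canary)", date(2025, 12, 2), 5000}
	return append([]stat{canary}, exampleBefore()...)
}

func main() {
	cutoff := date(2025, 12, 1)
	fmt.Println("stable forecast before canary:", getTableStats(exampleBefore(), false, cutoff)[0].rowCount)
	fmt.Println("stable forecast after canary: ", getTableStats(exampleAfter(), false, cutoff)[0].rowCount)
	fmt.Println("canary forecast after canary: ", getTableStats(exampleAfter(), true, cutoff)[0].rowCount)
}
```

In this sketch the stable forecast is identical before and after the canary collection, and only canary callers see a forecast that incorporates the new stats. The cost is that the cache would have to hold two derived lists per table, one per path.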

@rafiss rafiss removed the request for review from a team December 19, 2025 20:23