sql: introduce canary full stats rollout and stats_as_of session var#158704
ZhouXing19 wants to merge 3 commits into cockroachdb:master
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
…indow This commit implements the core logic for canary statistics rollout, allowing gradual deployment of newly collected full statistics. Previously, all queries would immediately use the most recent full statistics, which could cause performance regressions if the new full statistics were inaccurate.
The implementation adds a `StatsCanaryWindow` field in table descriptors and catalog interfaces to define the canary period, along with logic in the statistics builder to skip "canary" statistics (the latest stats within the canary window) when not using the canary path. Whether a query uses the canary or the stable path is decided by a random roll whose odds are set by the cluster setting `sql.stats.canary_fraction`.
Release note (sql change): implement canary full statistics rollout core logic, which is configurable via the table-level storage parameter (`sql_stats_canary_window`) and the cluster setting `sql.stats.canary_fraction`.
… selection This commit adds a new session variable `stats_as_of` that allows controlling statistics selection based on a specific timestamp rather than the current time. Previously, statistics selection was always relative to the current wall clock time, making it difficult to get consistent query plans for historical analysis or testing.
This feature is only for debugging and troubleshooting, and should not be used in production.
The implementation is also integrated into the existing canary statistics logic to respect the as-of timestamp when determining canary window boundaries.
Release note (sql change): adds a new session variable `stats_as_of` that allows controlling statistics selection based on a specific timestamp rather than the current time.
This commit adds tests for usage of canary stats rollout in makeTableStatistics(), which is the main entry point where statistics are selected for query optimization. This commit focuses on unit testing the makeTableStatistics() function and does not include end-to-end logic tests, which would require additional changes to Builder.maybeAnnotateWithEstimates() to support EXPLAIN ANALYZE output showing which statistics were used during planning.
To enable testing, this commit adds:
- Handler in opttester for setting the canary window storage parameter
- Testing knob for controlling the canary fraction setting
- Three new test files covering basic canary stats, histogram canary stats, and multi-column canary stats scenarios
Release note: None
michae2 left a comment
Nice work! It's a good PR, but there's a conflict with forecasting and merging.
@michae2 reviewed 7 of 12 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ZhouXing19)
pkg/sql/opt/cat/utils.go line 378 at r1 (raw file):
(!stat.IsMerged() || sd.OptimizerUseMergedPartialStatistics) && (!stat.IsForecast() || sd.OptimizerUseForecasts) && (createdAtTs.Less(olderThan) || (inclusive && createdAtTs.Equal(olderThan)))
This approach makes sense, but stats merging and stats forecasting throw a wrench in things. Suppose that we have the following series of stats collections (in descending time order, newest first):
- Automatic partial (USING EXTREMES) 2025-12-01
- Automatic full 2025-11-01
- Automatic full 2025-10-01
- Automatic full 2025-09-01
After this, the list returned by stats.(*TableStatisticsCache).GetTableStats will look like:
1. Forecast 2026-01-01
2. Merged 2025-12-01
3. Automatic partial (USING EXTREMES) 2025-12-01
4. Automatic full 2025-11-01
5. Automatic full 2025-10-01
6. Automatic full 2025-09-01
(1) and (2) are computed in the stats cache. The Forecast (1) will be based on 2, 4, 5, and 6 and will be visible even if 2026-01-01 is in the future.
Now suppose the canary window is 1 day, today is 2025-12-02, and we automatically collected full stats earlier today on 2025-12-02 which should still be canary. GetTableStats will return:
- Forecast 2026-01-02
- Automatic full 2025-12-02 (canary)
- Automatic partial (USING EXTREMES) 2025-12-01
- Automatic full 2025-11-01
- Automatic full 2025-10-01
- Automatic full 2025-09-01
The Forecast here will be based on the new canary stats instead of the merged stats, and will be different from what it was before.
If our query is supposed to be using stable stats, FindLatestFullStat will return:
- Automatic full 2025-11-01
- Automatic full 2025-10-01
- Automatic full 2025-09-01
These are all the full stats collections < 2025-12-01, but by omitting the forecast we get different stats even when our query is supposed to be using stable stats.
We could change FindLatestFullStat to include the forecast, like:
- Forecast 2026-01-02
- Automatic full 2025-11-01
- Automatic full 2025-10-01
- Automatic full 2025-09-01
But this is still different than the stats were before, because the forecast is different.
If we really want the stable stats to be exactly the same as they were before the canary stats were collected, I think we're going to have to push the created_at check down into the stats cache, before partial stats merging and forecasting happen. And then merging and forecasting will have to happen on either the stable or canary stats.
Similar to the memo caching problem we ran into in #156307, I think we'll run into a stats caching problem here. In that case we decided not to cache canary memos. In this case, it might be easier to add a "canary" argument to GetTableStats which then is plumbed all the way down, similar to the "forecast" argument that is passed down.
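A rough sketch of the "filter first, then derive" ordering suggested above, under heavy assumptions: `collectedStat`, `statsForPath`, and `forecastForPath` are hypothetical, and `toyForecast` is a placeholder linear extrapolation, not CockroachDB's forecasting. Partial stats and merging are left out to keep the example small.

```go
package main

import (
	"fmt"
	"time"
)

// collectedStat is a hypothetical, simplified full stats collection.
type collectedStat struct {
	CreatedAt time.Time
	RowCount  int64
}

// toyForecast is a placeholder for forecasting: it extrapolates linearly from
// the two newest collections (stats are ordered newest first).
func toyForecast(stats []collectedStat) (int64, bool) {
	if len(stats) < 2 {
		return 0, false
	}
	return stats[0].RowCount + (stats[0].RowCount - stats[1].RowCount), true
}

// statsForPath applies the created_at check before any derived stats are
// computed: the stable path only sees collections older than cutoff.
func statsForPath(collected []collectedStat, canary bool, cutoff time.Time) []collectedStat {
	if canary {
		return collected
	}
	var stable []collectedStat
	for _, s := range collected {
		if s.CreatedAt.Before(cutoff) {
			stable = append(stable, s)
		}
	}
	return stable
}

// forecastForPath derives the forecast from the already-filtered input.
func forecastForPath(collected []collectedStat, canary bool, cutoff time.Time) (int64, bool) {
	return toyForecast(statsForPath(collected, canary, cutoff))
}

// exampleStats returns the collections before and after a canary collection,
// plus the cutoff for a 1-day window evaluated at 2025-12-02 12:00 UTC.
func exampleStats() (before, after []collectedStat, cutoff time.Time) {
	before = []collectedStat{
		{time.Date(2025, time.November, 1, 0, 0, 0, 0, time.UTC), 300},
		{time.Date(2025, time.October, 1, 0, 0, 0, 0, time.UTC), 200},
		{time.Date(2025, time.September, 1, 0, 0, 0, 0, time.UTC), 100},
	}
	canaryStat := collectedStat{time.Date(2025, time.December, 2, 6, 0, 0, 0, time.UTC), 1000}
	after = append([]collectedStat{canaryStat}, before...)
	cutoff = time.Date(2025, time.December, 1, 12, 0, 0, 0, time.UTC)
	return before, after, cutoff
}

func main() {
	before, after, cutoff := exampleStats()
	fmt.Println(forecastForPath(before, false, cutoff)) // stable, before canary collection
	fmt.Println(forecastForPath(after, false, cutoff))  // stable, after canary collection
	fmt.Println(forecastForPath(after, true, cutoff))   // canary path
}
```

Because the filter runs before derivation, the stable path's forecast is the same before and after the canary collection, while the canary path derives its own forecast from the new stats.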
Informs: #150015
sql/opt: implement canary full statistics rollout with configurable window
This commit implements the core logic for canary statistics rollout,
allowing gradual deployment of newly collected full statistics.
Previously, all queries would immediately use the most recent full
statistics, which could cause performance regressions if the new full
statistics were inaccurate.
The implementation adds a StatsCanaryWindow field in table descriptors
and catalog interfaces to define the canary period, along with logic in
the statistics builder to skip "canary" statistics (the latest stats
within the canary window) when not using the canary path. Whether a query
uses the canary or the stable path is decided by a random roll whose odds
are set by the cluster setting sql.stats.canary_fraction.
Release note (sql change): implement canary full statistics rollout core logic, which
is configurable via the table-level storage parameter
(sql_stats_canary_window) and the cluster setting
sql.stats.canary_fraction.
sql: implement stats_as_of session variable for time-based statistics selection
This commit adds a new session variable stats_as_of that allows
controlling statistics selection based on a specific timestamp rather
than the current time. Previously, statistics selection was always
relative to the current wall clock time, making it difficult to get
consistent query plans for historical analysis or testing.
This feature is only for debugging and troubleshooting, and should not
be used in production.
The implementation is also integrated into the existing canary
statistics logic to respect the as-of timestamp when determining canary
window boundaries.
Release note (sql change): adds a new session variable stats_as_of
that allows controlling statistics selection based on a specific
timestamp rather than the current time.
sql/opt: add tests for using canary stats rollout in makeTableStatistics
This commit adds tests for usage of canary stats rollout in
makeTableStatistics(), which is the main entry point where statistics
are selected for query optimization. This commit focuses on unit
testing the makeTableStatistics() function and does not include
end-to-end logic tests, which would require additional changes to
Builder.maybeAnnotateWithEstimates() to support EXPLAIN ANALYZE
output showing which statistics were used during planning.
To enable testing, this commit adds:
- Handler in opttester for setting the canary window storage
  parameter
- Testing knob for controlling the canary fraction setting
- Three new test files covering basic canary stats, histogram canary
  stats, and multi-column canary stats scenarios
Release note: None