-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: separate SQL statistics from being refreshed with telemetry #45082
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I like this approach.
You'll need to write some testing before this can merge though.
Also:
This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).
I don't believe this sentence is consistent with the rest of the text (and with the code). I'd recommend re-phrasing.
Reviewed 7 of 7 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained
Added a test, PTAL! |
504fa50
to
fde8374
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok for the test but now the behavior got me a little bit confused.
As I am reading the code now, I am imagining the following behavior:
- user opens admin UI, fiddles with the "reset" button to calibrate some experiment (they want to clear the stats, then start their app, then look at what happens)
- a few seconds after they start their experiment, the telemetry loop activates and flushes the sql stats
- user doesn't see all the data they wanted in the admin UI and gets confused
Was this intentional?
Somehow I was working under the impression that the SQL-side reset would be the only place that pushes data to telemetry. The telemetry loop could pick up data from what SQL has pushed away, without ever triggering the SQL push on its own.
Also, separately from the above I strongly suggest a maximum value on the reset interval, for example 24 hours.
Reviewed 6 of 6 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @rohany)
pkg/roachpb/app_stats.proto, line 102 at r2 (raw file):
// N.B. When this changes, make sure to update (*NumericStat).AlmostEqual // in app_stats.go.
💯
pkg/server/stats_test.go, line 49 at r2 (raw file):
} // Collect stats from the SQL Server and ensure our queries are present.
SQL server (no capital needed - "SQL Server" is a different product)
Where are you seeing this? The telemetry loop doesn't touch the SQL stats at all.
Theres a max reset interval in here for both the sql stats and reported stats of 2 hours -- i can bump this up though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are you seeing this? The telemetry loop doesn't touch the SQL stats at all.
Sorry I mis-read the code. It would help to show a test that does the telemetry at a high frequency with a fake server (say, every 20ms) and demonstrate that the app stats in SQL are preserved when this happens.
Theres a max reset interval in here for both the sql stats and reported stats of 2 hours -- i can bump this up though.
Yes I think admin UI users will be thankful to have the option to increase up to a day.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rohany)
| Yes I think admin UI users will be thankful to have the option to increase up to a day. Do we want this though? Most of the comments around this area seem to be wary of the memory build up of the stats with the 2 hour window. |
The default should remain 1 hour or less. |
Added a test. I think I misunderstood your comment re reset intervals earlier -- I made the maximum value for the reset intervals 24 hours. |
b84c115
to
59daf8d
Compare
Friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with nits
Reviewed 4 of 4 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @rohany)
pkg/server/stats_test.go, line 49 at r3 (raw file):
// Run a bunch of queries in a loop, and request telemetry in a loop. // At the end, ensure that SQL statistics are unaffected by telemetry. var wg sync.WaitGroup
As much as I love wait groups and goroutines, this test does not need them. You can interleave 4 calls as follows:
- insert
- maybe report diagnostics
- insert
- maybe report diagnostics
and check that you still have the 2 inserts at the end.
pkg/settings/duration.go, line 120 at r3 (raw file):
// RegisterPublicNonNegativeDurationSettingWithMaximum defines a new setting with // type duration, makes it public, and sets a maximum value. func RegisterPublicNonNegativeDurationSettingWithMaximum(
I don't dislike this new helper, but then I would request that you grep for all the other calls to Register.*DurationSetting
throughout the code and ensure that the other few that also use a maximum also use your new helper.
Or file an issue and tag it "easy" / "good first issue".
This PR fixes the issue that SQL statement statistics are refreshed when telemetry is reported. To do this, this PR introduces a new statistics pool for statistics to be reported to telemetry. Whenever the SQL stats are reset, they are first dumped into the reporting pool. The telemetry reporter then takes the statistics that have been aggregated so far and reports them whenever it executes. This has the effect of telemetry only reporting SQL stats at the rate of max(sql_refresh_rate, telemetry_report_rate). Release note (sql change): This PR introduces a new cluster setting to control the rate at which SQL statement statistics are refreshed, `diagnostics.sql_stat_reset.interval`. Additionally it renames the setting `diagnostics.forced_stat_reset.interval` to `diagnostics.forced_sql_stat_reset_interval`.
bors r=knz |
Build succeeded |
This PR fixes the issue that SQL statement statistics
are refreshed when telemetry is reported.
To do this, this PR introduces a new statistics
pool for statistics to be reported to telemetry.
Whenever the SQL stats are reset, they are first
dumped into the reporting pool. The telemetry
reporter then takes the statistics that have been
aggregated so far and reports them whenever it executes.
This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).
Release note (sql change): This PR introduces a new cluster
setting to control the rate at which SQL statement statistics
are refreshed,
diagnostics.sql_stat_reset.interval
. Additionallyit renames the setting
diagnostics.forced_stat_reset.interval
todiagnostics.forced_sql_stat_reset_interval
.