server: separate SQL statistics from being refreshed with telemetry #45082

rohany · 2020-02-13T16:28:06Z

This PR fixes the issue that SQL statement statistics
are refreshed when telemetry is reported.

To do this, this PR introduces a new statistics
pool for statistics to be reported to telemetry.
Whenever the SQL stats are reset, they are first
dumped into the reporting pool. The telemetry
reporter then takes the statistics that have been
aggregated so far and reports them whenever it executes.
This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).

Release note (sql change): This PR introduces a new cluster
setting to control the rate at which SQL statement statistics
are refreshed, diagnostics.sql_stat_reset.interval. Additionally
it renames the setting diagnostics.forced_stat_reset.interval to
diagnostics.forced_sql_stat_reset_interval.

cockroach-teamcity · 2020-02-13T16:28:15Z

This change is

knz

Cool I like this approach.

You'll need to write some testing before this can merge though.

Also:

This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).

I don't believe this sentence is consistent with the rest of the text (and with the code). I'd recommend re-phrasing.

Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained

rohany · 2020-02-13T21:21:08Z

Added a test, PTAL!

knz

ok for the test but now the behavior got me a little bit confused.

As I am reading the code now, I am imagining the following behavior:

user opens admin UI, fiddles with the "reset" button to calibrate some experiment (they want to clear the stats, then start their app, then look at what happens)
a few seconds after they start their experiment, the telemetry loop activates and flushes the sql stats
user doesn't see all the data they wanted in the admin UI and gets confused

Was this intentional?

Somehow I was working under the impression that the SQL-side reset would be the only place that pushes data to telemetry. The telemetry loop could pick up data from what SQL has pushed away, without ever triggering the SQL push on its own.

Also, separately from the above I strongly suggest a maximum value on the reset interval, for example 24 hours.

Reviewed 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rohany)

pkg/roachpb/app_stats.proto, line 102 at r2 (raw file):

// N.B. When this changes, make sure to update (*NumericStat).AlmostEqual
// in app_stats.go.

💯

pkg/server/stats_test.go, line 49 at r2 (raw file):

	}

	// Collect stats from the SQL Server and ensure our queries are present.

SQL server (no capital needed - "SQL Server" is a different product)

rohany · 2020-02-14T15:02:41Z

a few seconds after they start their experiment, the telemetry loop activates and flushes the sql stats

Where are you seeing this? The telemetry loop doesn't touch the SQL stats at all.

Also, separately from the above I strongly suggest a maximum value on the reset interval, for example 24 hours.

Theres a max reset interval in here for both the sql stats and reported stats of 2 hours -- i can bump this up though.

knz

Where are you seeing this? The telemetry loop doesn't touch the SQL stats at all.

Sorry I mis-read the code. It would help to show a test that does the telemetry at a high frequency with a fake server (say, every 20ms) and demonstrate that the app stats in SQL are preserved when this happens.

Theres a max reset interval in here for both the sql stats and reported stats of 2 hours -- i can bump this up though.

Yes I think admin UI users will be thankful to have the option to increase up to a day.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rohany)

rohany · 2020-02-14T15:12:05Z

| Yes I think admin UI users will be thankful to have the option to increase up to a day.

Do we want this though? Most of the comments around this area seem to be wary of the memory build up of the stats with the 2 hour window.

knz · 2020-02-14T15:15:42Z

The default should remain 1 hour or less.
However if the users has a lot of RAM maybe they want more. Remember this entire effort is to enable users to see more data in the web UI. If they truly want a lot of data, we don't have to block them.

rohany · 2020-02-14T16:12:21Z

Added a test. I think I misunderstood your comment re reset intervals earlier -- I made the maximum value for the reset intervals 24 hours.

rohany · 2020-02-24T15:10:50Z

Friendly ping.

knz

LGTM with nits

Reviewed 4 of 4 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rohany)

pkg/server/stats_test.go, line 49 at r3 (raw file):

	// Run a bunch of queries in a loop, and request telemetry in a loop.
	// At the end, ensure that SQL statistics are unaffected by telemetry.
	var wg sync.WaitGroup

As much as I love wait groups and goroutines, this test does not need them. You can interleave 4 calls as follows:

insert
maybe report diagnostics
insert
maybe report diagnostics

and check that you still have the 2 inserts at the end.

pkg/settings/duration.go, line 120 at r3 (raw file):

// RegisterPublicNonNegativeDurationSettingWithMaximum defines a new setting with
// type duration, makes it public, and sets a maximum value.
func RegisterPublicNonNegativeDurationSettingWithMaximum(

I don't dislike this new helper, but then I would request that you grep for all the other calls to Register.*DurationSetting throughout the code and ensure that the other few that also use a maximum also use your new helper.

Or file an issue and tag it "easy" / "good first issue".

This PR fixes the issue that SQL statement statistics are refreshed when telemetry is reported. To do this, this PR introduces a new statistics pool for statistics to be reported to telemetry. Whenever the SQL stats are reset, they are first dumped into the reporting pool. The telemetry reporter then takes the statistics that have been aggregated so far and reports them whenever it executes. This has the effect of telemetry only reporting SQL stats at the rate of max(sql_refresh_rate, telemetry_report_rate). Release note (sql change): This PR introduces a new cluster setting to control the rate at which SQL statement statistics are refreshed, `diagnostics.sql_stat_reset.interval`. Additionally it renames the setting `diagnostics.forced_stat_reset.interval` to `diagnostics.forced_sql_stat_reset_interval`.

rohany · 2020-02-24T20:46:31Z

bors r=knz

craig · 2020-02-24T21:41:32Z

Build succeeded

GitHub CI (Cockroach)

rohany requested a review from knz February 13, 2020 16:28

knz reviewed Feb 13, 2020

View reviewed changes

rohany force-pushed the stats branch from 959a205 to 3f05230 Compare February 13, 2020 21:20

rohany force-pushed the stats branch from 3f05230 to c639ece Compare February 13, 2020 21:49

rohany added this to Open PRs in SQL Features (Deprecated - use SQL Experience board) via automation Feb 13, 2020

rohany force-pushed the stats branch 2 times, most recently from 504fa50 to fde8374 Compare February 14, 2020 01:27

knz reviewed Feb 14, 2020

View reviewed changes

rohany force-pushed the stats branch from fde8374 to a0dc1c7 Compare February 14, 2020 16:11

rohany force-pushed the stats branch 3 times, most recently from b84c115 to 59daf8d Compare February 17, 2020 19:27

knz approved these changes Feb 24, 2020

View reviewed changes

rohany force-pushed the stats branch from 59daf8d to ac3c723 Compare February 24, 2020 19:54

piyush-singh mentioned this pull request Feb 24, 2020

sql,ui: provide a feature to reset the statements page immediately #33315

Closed

craig bot merged commit 7e5cbcc into cockroachdb:master Feb 24, 2020

SQL Features (Deprecated - use SQL Experience board) automation moved this from Open PRs to Done Feb 24, 2020

jseldess mentioned this pull request Mar 19, 2020

server: separate SQL statistics from being refreshed with telemetry cockroachdb/docs#6902

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: separate SQL statistics from being refreshed with telemetry #45082

server: separate SQL statistics from being refreshed with telemetry #45082

rohany commented Feb 13, 2020

cockroach-teamcity commented Feb 13, 2020

knz left a comment

rohany commented Feb 13, 2020

knz left a comment

rohany commented Feb 14, 2020 •

edited

Loading

knz left a comment

rohany commented Feb 14, 2020

knz commented Feb 14, 2020

rohany commented Feb 14, 2020

rohany commented Feb 24, 2020

knz left a comment

rohany commented Feb 24, 2020

craig bot commented Feb 24, 2020

server: separate SQL statistics from being refreshed with telemetry #45082

server: separate SQL statistics from being refreshed with telemetry #45082

Conversation

rohany commented Feb 13, 2020

cockroach-teamcity commented Feb 13, 2020

knz left a comment

Choose a reason for hiding this comment

rohany commented Feb 13, 2020

knz left a comment

Choose a reason for hiding this comment

rohany commented Feb 14, 2020 • edited Loading

knz left a comment

Choose a reason for hiding this comment

rohany commented Feb 14, 2020

knz commented Feb 14, 2020

rohany commented Feb 14, 2020

rohany commented Feb 24, 2020

knz left a comment

Choose a reason for hiding this comment

rohany commented Feb 24, 2020

craig bot commented Feb 24, 2020

Build succeeded

rohany commented Feb 14, 2020 •

edited

Loading