Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: separate SQL statistics from being refreshed with telemetry #45082

Merged
merged 1 commit into from
Feb 24, 2020

Conversation

rohany
Copy link
Contributor

@rohany rohany commented Feb 13, 2020

This PR fixes the issue that SQL statement statistics
are refreshed when telemetry is reported.

To do this, this PR introduces a new statistics
pool for statistics to be reported to telemetry.
Whenever the SQL stats are reset, they are first
dumped into the reporting pool. The telemetry
reporter then takes the statistics that have been
aggregated so far and reports them whenever it executes.
This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).

Release note (sql change): This PR introduces a new cluster
setting to control the rate at which SQL statement statistics
are refreshed, diagnostics.sql_stat_reset.interval. Additionally
it renames the setting diagnostics.forced_stat_reset.interval to
diagnostics.forced_sql_stat_reset_interval.

@rohany rohany requested a review from knz February 13, 2020 16:28
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I like this approach.

You'll need to write some testing before this can merge though.

Also:

This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).

I don't believe this sentence is consistent with the rest of the text (and with the code). I'd recommend re-phrasing.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@rohany
Copy link
Contributor Author

rohany commented Feb 13, 2020

Added a test, PTAL!

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok for the test but now the behavior got me a little bit confused.

As I am reading the code now, I am imagining the following behavior:

  • user opens admin UI, fiddles with the "reset" button to calibrate some experiment (they want to clear the stats, then start their app, then look at what happens)
  • a few seconds after they start their experiment, the telemetry loop activates and flushes the sql stats
  • user doesn't see all the data they wanted in the admin UI and gets confused

Was this intentional?

Somehow I was working under the impression that the SQL-side reset would be the only place that pushes data to telemetry. The telemetry loop could pick up data from what SQL has pushed away, without ever triggering the SQL push on its own.

Also, separately from the above I strongly suggest a maximum value on the reset interval, for example 24 hours.

Reviewed 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rohany)


pkg/roachpb/app_stats.proto, line 102 at r2 (raw file):

// N.B. When this changes, make sure to update (*NumericStat).AlmostEqual
// in app_stats.go.

💯


pkg/server/stats_test.go, line 49 at r2 (raw file):

	}

	// Collect stats from the SQL Server and ensure our queries are present.

SQL server (no capital needed - "SQL Server" is a different product)

@rohany
Copy link
Contributor Author

rohany commented Feb 14, 2020

a few seconds after they start their experiment, the telemetry loop activates and flushes the sql stats

Where are you seeing this? The telemetry loop doesn't touch the SQL stats at all.

Also, separately from the above I strongly suggest a maximum value on the reset interval, for example 24 hours.

Theres a max reset interval in here for both the sql stats and reported stats of 2 hours -- i can bump this up though.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are you seeing this? The telemetry loop doesn't touch the SQL stats at all.

Sorry I mis-read the code. It would help to show a test that does the telemetry at a high frequency with a fake server (say, every 20ms) and demonstrate that the app stats in SQL are preserved when this happens.

Theres a max reset interval in here for both the sql stats and reported stats of 2 hours -- i can bump this up though.

Yes I think admin UI users will be thankful to have the option to increase up to a day.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rohany)

@rohany
Copy link
Contributor Author

rohany commented Feb 14, 2020

| Yes I think admin UI users will be thankful to have the option to increase up to a day.

Do we want this though? Most of the comments around this area seem to be wary of the memory build up of the stats with the 2 hour window.

@knz
Copy link
Contributor

knz commented Feb 14, 2020

The default should remain 1 hour or less.
However if the users has a lot of RAM maybe they want more. Remember this entire effort is to enable users to see more data in the web UI. If they truly want a lot of data, we don't have to block them.

@rohany
Copy link
Contributor Author

rohany commented Feb 14, 2020

Added a test. I think I misunderstood your comment re reset intervals earlier -- I made the maximum value for the reset intervals 24 hours.

@rohany rohany force-pushed the stats branch 3 times, most recently from b84c115 to 59daf8d Compare February 17, 2020 19:27
@rohany
Copy link
Contributor Author

rohany commented Feb 24, 2020

Friendly ping.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits

Reviewed 4 of 4 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rohany)


pkg/server/stats_test.go, line 49 at r3 (raw file):

	// Run a bunch of queries in a loop, and request telemetry in a loop.
	// At the end, ensure that SQL statistics are unaffected by telemetry.
	var wg sync.WaitGroup

As much as I love wait groups and goroutines, this test does not need them. You can interleave 4 calls as follows:

  1. insert
  2. maybe report diagnostics
  3. insert
  4. maybe report diagnostics

and check that you still have the 2 inserts at the end.


pkg/settings/duration.go, line 120 at r3 (raw file):

// RegisterPublicNonNegativeDurationSettingWithMaximum defines a new setting with
// type duration, makes it public, and sets a maximum value.
func RegisterPublicNonNegativeDurationSettingWithMaximum(

I don't dislike this new helper, but then I would request that you grep for all the other calls to Register.*DurationSetting throughout the code and ensure that the other few that also use a maximum also use your new helper.

Or file an issue and tag it "easy" / "good first issue".

This PR fixes the issue that SQL statement statistics
are refreshed when telemetry is reported.

To do this, this PR introduces a new statistics
pool for statistics to be reported to telemetry.
Whenever the SQL stats are reset, they are first
dumped into the reporting pool. The telemetry
reporter then takes the statistics that have been
aggregated so far and reports them whenever it executes.
This has the effect of telemetry only reporting SQL
stats at the rate of max(sql_refresh_rate, telemetry_report_rate).

Release note (sql change): This PR introduces a new cluster
setting to control the rate at which SQL statement statistics
are refreshed, `diagnostics.sql_stat_reset.interval`. Additionally
it renames the setting `diagnostics.forced_stat_reset.interval` to
`diagnostics.forced_sql_stat_reset_interval`.
@rohany
Copy link
Contributor Author

rohany commented Feb 24, 2020

bors r=knz

@craig
Copy link
Contributor

craig bot commented Feb 24, 2020

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants