stats: adaptive throttling of auto stats #35698
Conversation
Force-pushed from ae54dc3 to e2ca090
Reviewed 9 of 9 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @RaduBerinde)
pkg/sql/distsqlpb/processors.proto, line 776 at r1 (raw file):
// Currently, this field is set only for automatic statistics based on the
// value of the cluster setting
// sql.stats.experimental_automatic_collection.fraction_idle.
perhaps we should update the name of the cluster setting too?
pkg/sql/distsqlrun/sampler.go, line 218 at r1 (raw file):
if s.maxFractionIdle > 0 {
	// Look at CRDB's average CPU usage in the last 10 seconds (0 to 1):
[nit] does (0 to 1) refer to the range of possible CPU usage values? It's a bit unclear as-is.
pkg/sql/stats/automatic_stats_manual_test.go, line 58 at r1 (raw file):
// automatic_stats_manual_test.go:72: Create stats took 15.886412857s
func TestAdaptiveThrottling(t *testing.T) {
Nice test! Isn't this the sort of thing that roachtests are used for? Maybe it could be run on a nightly or weekly basis?
Switched to using a
Force-pushed from e2ca090 to 6bba1e6
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @rytaft)
pkg/sql/distsqlpb/processors.proto, line 776 at r1 (raw file):
Previously, rytaft wrote…
perhaps we should update the name of the cluster setting too?
Done.
pkg/sql/distsqlrun/sampler.go, line 218 at r1 (raw file):
Previously, rytaft wrote…
[nit] does (0 to 1) refer to the range of possible CPU usage values? It's a bit unclear as-is.
Removed, since the scale is not relevant in this code; it's only relevant for the values of the constants (0.25, 0.75), where it's already clear.
pkg/sql/stats/automatic_stats_manual_test.go, line 58 at r1 (raw file):
Previously, rytaft wrote…
Nice test! Isn't this the sort of thing that roachtests are used for? Maybe it could be run on a nightly or weekly basis?
It would need to have a passing criterion if we want to run it periodically, but it kind of has to be eyeballed. The "synthetic" CPU usage would be difficult to achieve with roachtests (here we are inside the cockroach process).
Reviewed 4 of 4 files at r2, 6 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball)
pkg/sql/stats/automatic_stats_manual_test.go, line 58 at r1 (raw file):
Previously, RaduBerinde wrote…
It would need to have a passing criterion if we want to run it periodically, but it kind of has to be eyeballed. The "synthetic" CPU usage would be difficult to achieve with roachtests (here we are inside the cockroach process).
ok!
We currently throttle statistics by doing a bit of work and then sleeping 9x longer than it took to do the work; this is done to avoid impacting running workloads.

There are two problems with this:
- the stats can take a very long time even if the cluster is idle. This is counter-intuitive and on large datasets it can take many hours until all the stats are populated.
- even worse, the throttling mechanism is very inefficient when CPU load is low; by sleeping for relatively long periods, we allow the CPU to go into a deeper power-saving state which makes it run slower for a little while after it wakes up (all the caches are gone, and other things like that). This effect is more significant than it sounds: we see stat runs on idle clusters that take 10 minutes without throttling but 2-3 hours with throttling (instead of 90 minutes).

This change makes the throttling adaptive. Throughout the run, we look at the current value of the `sys.cpu.combined.percent-normalized` metric, which is the average CPU usage of the CRDB process over all cores in the last 10 seconds (it is updated every 10 seconds). When this usage is under 25%, we don't throttle; when it is over 75%, we throttle to the maximum configured value; in-between we interpolate linearly.

This is not something that can be tested reliably in an automated fashion. I wrote a manual test that can be run with a special flag. The intention is that we will run this test manually if we make any changes to the throttling logic.

Fixes cockroachdb#35346.

Release note (bug fix): Automatic statistic jobs should be much faster on clusters with low load.
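The adaptive throttling described above (no throttling under 25% CPU, maximum throttling over 75%, linear interpolation in between) can be sketched as follows. This is a minimal illustration, not the actual CockroachDB code; the names `fractionIdle`, `lowCPU`, and `highCPU` are hypothetical.

```go
package main

import "fmt"

const (
	// Below lowCPU usage we don't throttle at all; above highCPU we
	// throttle to the maximum configured value.
	lowCPU  = 0.25
	highCPU = 0.75
)

// fractionIdle returns the fraction of time the stats sampler should
// spend sleeping, given the recent normalized CPU usage (0 to 1) and
// the configured maximum idle fraction.
func fractionIdle(cpuUsage, maxFractionIdle float64) float64 {
	switch {
	case cpuUsage < lowCPU:
		return 0
	case cpuUsage > highCPU:
		return maxFractionIdle
	default:
		// Interpolate linearly between 0 at lowCPU and
		// maxFractionIdle at highCPU.
		return maxFractionIdle * (cpuUsage - lowCPU) / (highCPU - lowCPU)
	}
}

func main() {
	fmt.Println(fractionIdle(0.1, 0.9)) // idle cluster: 0 (no throttling)
	fmt.Println(fractionIdle(0.5, 0.9)) // midpoint: 0.45
	fmt.Println(fractionIdle(0.9, 0.9)) // busy cluster: 0.9 (max throttling)
}
```

With the old scheme (sleep 9x the work time, i.e. an idle fraction of 0.9), an idle cluster would still pay the full 10x slowdown; with this interpolation it pays none.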
Force-pushed from 6bba1e6 to e47cb3e
bors r+
Build failed
Filed #35747 for the test flake. bors r+
Build failed (retrying...)
Build failed
Hit #35549. bors r+
Build failed
Hit #35758. bors r+
Build failed (retrying...)
35698: stats: adaptive throttling of auto stats r=RaduBerinde a=RaduBerinde
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Build succeeded