Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: adaptive throttling of auto stats #35698

Merged
merged 1 commit into from Mar 14, 2019

Conversation

RaduBerinde
Copy link
Member

We currently throttle statistics by doing a bit of work and then
sleeping 9x longer than it took to do the work; this is done to avoid
impacting running workloads.

There are two problems with this:

  • the stats can take a very long time even if the cluster is idle.
    This is counter-intuitive and on large datasets it can take many
    hours until all the stats are populated.

  • even worse, the throttling mechanism is very inefficient when CPU
    load is low; by sleeping for relatively long periods, we allow the
    CPU to go into a deeper power-saving state which makes it run
    slower for a little while after it wakes up (all the caches are
    gone, and other things like that). This effect is more significant
    than it sounds: we see stat runs on idle clusters that take 10
    minutes without throttling but 2-3 hours with throttling (instead
    of 90 minutes).

This change makes the throttling adaptive. Throughout the run, we look
at the current value of the sys.cpu.combined.percent-normalized
metric, which is the average CPU usage of the CRDB process over all
cores in the last 10 seconds (it is updated every 10 seconds). When
this usage is under 25%, we don't throttle; when it is over 75%, we
throttle to the maximum configured value; in-between we interpolate
linearly.

This is not something that can be tested reliably in an automated
fashion. I wrote a manual test that can be run with a special flag.
The intention is that we will run this test manually if we make any
changes to the throttling logic.

Fixes #35346.

Release note (bug fix): Automatic statistic jobs should be much faster
on clusters with low load.

@RaduBerinde RaduBerinde requested review from rytaft, a team and andy-kimball March 13, 2019 17:44
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :lgtm:

Reviewed 9 of 9 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @RaduBerinde)


pkg/sql/distsqlpb/processors.proto, line 776 at r1 (raw file):

  // Currently, this field is set only for automatic statistics based on the
  // value of the cluster setting
  // sql.stats.experimental_automatic_collection.fraction_idle.

perhaps we should update the name of the cluster setting too?


pkg/sql/distsqlrun/sampler.go, line 218 at r1 (raw file):

			if s.maxFractionIdle > 0 {
				// Look at CRDB's average CPU usage in the last 10 seconds (0 to 1):

[nit] does (0 to 1) refer to the range of possible CPU usage values? It's a bit unclear as-is.


pkg/sql/stats/automatic_stats_manual_test.go, line 58 at r1 (raw file):

//     automatic_stats_manual_test.go:72: Create stats took 15.886412857s
//
func TestAdaptiveThrottling(t *testing.T) {

Nice test! Isn't this the sort of thing that roachtests are used for? Maybe it could be run on a nightly or weekly basis?

@RaduBerinde
Copy link
Member Author

Switched to using a RuntimeStats interface so we don't directly use server/status from sql.

Copy link
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @rytaft)


pkg/sql/distsqlpb/processors.proto, line 776 at r1 (raw file):

Previously, rytaft wrote…

perhaps we should update the name of the cluster setting too?

Done.


pkg/sql/distsqlrun/sampler.go, line 218 at r1 (raw file):

Previously, rytaft wrote…

[nit] does (0 to 1) refer to the range of possible CPU usage values? It's a bit unclear as-is.

Removed since the scale is not relevant in this code, it's only relevant for the values of the constants (0.25, 0.75) where it's already clear.


pkg/sql/stats/automatic_stats_manual_test.go, line 58 at r1 (raw file):

Previously, rytaft wrote…

Nice test! Isn't this the sort of thing that roachtests are used for? Maybe it could be run on a nightly or weekly basis?

It would need to have a passing criteria if we want to run them periodically but it kind of has to be eyeballed. The "synthetic" cpu usage would be difficult to achieve with roachtests (here we are inside the cockroach process).

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2, 6 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball)


pkg/sql/stats/automatic_stats_manual_test.go, line 58 at r1 (raw file):

Previously, RaduBerinde wrote…

It would need to have a passing criteria if we want to run them periodically but it kind of has to be eyeballed. The "synthetic" cpu usage would be difficult to achieve with roachtests (here we are inside the cockroach process).

ok!

We currently throttle statistics by doing a bit of work and then
sleeping 9x longer than it took to do the work; this is done to avoid
impacting running workloads.

There are two problems with this:

 - the stats can take a very long time even if the cluster is idle.
   This is counter-intuitive and on large datasets it can take many
   hours until all the stats are populated.

 - even worse, the throttling mechanism is very inefficient when CPU
   load is low; by sleeping for relatively long periods, we allow the
   CPU to go into a deeper power-saving state which makes it run
   slower for a little while after it wakes up (all the caches are
   gone, and other things like that). This effect is more significant
   than it sounds: we see stat runs on idle clusters that take 10
   minutes without throttling but 2-3 hours with throttling (instead
   of 90 minutes).

This change makes the throttling adaptive. Throughout the run, we look
at the current value of the `sys.cpu.combined.percent-normalized`
metric, which is the average CPU usage of the CRDB process over all
cores in the last 10 seconds (it is updated every 10 seconds). When
this usage is under 25%, we don't throttle; when it is over 75%, we
throttle to the maximum configured value; in-between we interpolate
linearly.

This is not something that can be tested reliably in an automated
fashion. I wrote a manual test that can be run with a special flag.
The intention is that we will run this test manually if we make any
changes to the throttling logic.

Fixes cockroachdb#35346.

Release note (bug fix): Automatic statistic jobs should be much faster
on clusters with low load.
@RaduBerinde
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build failed

@RaduBerinde
Copy link
Member Author

Filed #35747 for the test flake.

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build failed

@RaduBerinde
Copy link
Member Author

Hit #35549.

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build failed

@RaduBerinde
Copy link
Member Author

Hit #35758.

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request Mar 14, 2019
35698: stats: adaptive throttling of auto stats r=RaduBerinde a=RaduBerinde

We currently throttle statistics by doing a bit of work and then
sleeping 9x longer than it took to do the work; this is done to avoid
impacting running workloads.

There are two problems with this:

 - the stats can take a very long time even if the cluster is idle.
   This is counter-intuitive and on large datasets it can take many
   hours until all the stats are populated.

 - even worse, the throttling mechanism is very inefficient when CPU
   load is low; by sleeping for relatively long periods, we allow the
   CPU to go into a deeper power-saving state which makes it run
   slower for a little while after it wakes up (all the caches are
   gone, and other things like that). This effect is more significant
   than it sounds: we see stat runs on idle clusters that take 10
   minutes without throttling but 2-3 hours with throttling (instead
   of 90 minutes).

This change makes the throttling adaptive. Throughout the run, we look
at the current value of the `sys.cpu.combined.percent-normalized`
metric, which is the average CPU usage of the CRDB process over all
cores in the last 10 seconds (it is updated every 10 seconds). When
this usage is under 25%, we don't throttle; when it is over 75%, we
throttle to the maximum configured value; in-between we interpolate
linearly.

This is not something that can be tested reliably in an automated
fashion. I wrote a manual test that can be run with a special flag.
The intention is that we will run this test manually if we make any
changes to the throttling logic.

Fixes #35346.

Release note (bug fix): Automatic statistic jobs should be much faster
on clusters with low load.

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build succeeded

@craig craig bot merged commit e47cb3e into cockroachdb:master Mar 14, 2019
@RaduBerinde RaduBerinde deleted the adaptive-throttle branch March 15, 2019 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opt: Create statistics jobs taking too long
3 participants