
fix(daemon): avoid acquiring state lock in health check endpoint #369

Merged
merged 1 commit into canonical:master from health-no-state-lock on Feb 28, 2024

Conversation

benhoyt (Contributor) commented Feb 27, 2024

This PR avoids acquiring the state lock in the health check endpoint, a localised fix for the issue described in https://bugs.launchpad.net/juju/+bug/2052517 and #366.

I don't love the addition of the (exported) State.LockCount method, but I couldn't think of another way to ensure that State.Lock was not being called, apart from significant surgery allowing us to swap out the state lock with an interface.
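
For context, the LockCount idea is just a counter bumped on every Lock call, which a test can read before and after hitting the endpoint. A minimal sketch of that shape (not necessarily the exact code in this PR; the field names here are assumptions):

package state // sketch only

import (
	"sync"
	"sync/atomic"
)

// State sketch: just the pieces relevant to the LockCount idea; the real
// Pebble State holds much more.
type State struct {
	mu        sync.Mutex
	lockCount int32
}

// Lock acquires the state lock and records that it was taken.
func (s *State) Lock() {
	s.mu.Lock()
	atomic.AddInt32(&s.lockCount, 1)
}

// Unlock releases the state lock.
func (s *State) Unlock() {
	s.mu.Unlock()
}

// LockCount reports how many times Lock has been called (intended for tests,
// for example asserting the health handler never took the lock).
func (s *State) LockCount() int {
	return int(atomic.LoadInt32(&s.lockCount))
}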

In addition, I'm not sure we need the maintenance/warnings stuff at all in Pebble, as they're not used, but leaving that for a separate discussion and PR.

I tested by saving the script https://go.dev/play/p/LJDLEaXeBpd to cmd/hithealth/main.go, running a Pebble server, and then running a task to heavily load my CPU cores (though the latter doesn't seem to make much of a difference!).

Without the fix, GET /v1/health response times are all above 20ms, and commonly in the 100-300ms range. In other words, very long for a do-nothing health check endpoint. This is because the other clients (in the hithealth script) are starting services, which modifies state and holds the state lock for a relatively long time.

$ go run ./cmd/hithealth/
     0 > 20ms: 0.028897s
     1 > 20ms: 0.192796s
     2 > 20ms: 0.268904s
     3 > 20ms: 0.082985s
     4 > 20ms: 0.030554s
     5 > 20ms: 0.156912s
     6 > 20ms: 0.245212s
     7 > 20ms: 0.047436s
     8 > 20ms: 0.099474s
     9 > 20ms: 0.070440s
    10 > 20ms: 0.060641s
    11 > 20ms: 0.202535s
    12 > 20ms: 0.206226s
    ...

With the fix, GET /v1/health doesn't hold the state lock at all. Response times only exceed 20ms once every 10,000 or so requests, and never go above 100ms even under heavy load:

$ go run ./cmd/hithealth/
 13891 > 20ms: 0.023703s
 15923 > 20ms: 0.024769s
 21915 > 20ms: 0.076423s
 ...
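
For reference, a minimal sketch of what the timing loop in such a script might look like (not the exact hithealth code: the real script also starts services concurrently to load the state lock, and the socket path below is an assumption):

// hithealth sketch: repeatedly hit GET /v1/health and report slow responses.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Assumed default Pebble socket path; adjust for your $PEBBLE directory.
	socket := "/var/lib/pebble/default/.pebble.socket"
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}
	for i := 0; ; i++ {
		start := time.Now()
		resp, err := client.Get("http://localhost/v1/health")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		resp.Body.Close()
		if elapsed := time.Since(start); elapsed > 20*time.Millisecond {
			fmt.Printf("%6d > 20ms: %.6fs\n", i, elapsed.Seconds())
		}
	}
}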

flotter self-requested a review February 27, 2024 19:20
flotter (Contributor) commented Feb 27, 2024

@benhoyt I only have a general question. The worst-case latency on acquiring the state lock for the health request (~200ms) seems quite bad. I wonder if the WaitChange() that is also part of the service-starting loop (10x concurrently) in your test script is also contributing to state lock contention. I am probably missing some data on why this happens, but I would be surprised if any one state lock user holds the lock for a very long time, so even if a lot of state lock operations are interleaved, I am still a bit surprised that a single state lock request has to wait that long (even the average feels very long).

flotter removed their request for review February 27, 2024 19:46
benhoyt (Contributor, Author) commented Feb 27, 2024

@flotter Thanks for the review. Yes, the WaitChange() could well be part of the contention, but I think that's okay -- it just adds "load" to the system and to the state lock. Here it's mainly the health check endpoint we're concerned about making super-fast. It's the State.Unlock() method that is slow, but only when state has been modified and needs to be written.

I did some further timing tests of the different slow parts of State.unlock on my emulated "loaded system", recording the mean and max times for the JSON marshal and for the checkpoint (which uses osutil.AtomicWrite to write the bytes to disk and calls fsync):

with fsync: avg marshal=0.011758 checkpoint=0.015349
  no fsync: avg marshal=0.010221 checkpoint=0.000594
with fsync: max marshal=0.068619 checkpoint=0.032137
  no fsync: max marshal=0.067646 checkpoint=0.000805

Two things to note: 1) It's quite slow just to do the JSON marshalling (this is because state has grown quite large, like several hundred KB of JSON, with quite a few Changes being recorded rapidly), and 2) The checkpoint/save is slow, not due to the file writing, but due to the fsync.
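
For illustration, here is a rough standalone way to get a feel for those two costs (not the actual Pebble code or osutil.AtomicWrite): marshal a few hundred KB of JSON, then write it to a file and fsync it, timing each step separately.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

func main() {
	// Build a payload of a few hundred KB, roughly the size described above.
	data := make(map[string]string, 5000)
	for i := 0; i < 5000; i++ {
		data[fmt.Sprintf("change-%d", i)] = "padding to make the state a realistic size for this measurement"
	}

	start := time.Now()
	b, err := json.Marshal(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("marshal:    %.6fs (%d bytes)\n", time.Since(start).Seconds(), len(b))

	f, err := os.CreateTemp("", "state-*.json")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	start = time.Now()
	if _, err := f.Write(b); err != nil {
		panic(err)
	}
	if err := f.Sync(); err != nil { // the fsync is the expensive part
		panic(err)
	}
	f.Close()
	fmt.Printf("checkpoint: %.6fs\n", time.Since(start).Seconds())
}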

I'd like to consider improving either or both of these separately, for example not doing the fsync (maybe a bad idea?) or only saving every (say) 100ms at most. Or moving to a real database like SQLite... :-)
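
To make the "save at most every 100ms" idea concrete, here is a purely illustrative sketch of coalescing checkpoints behind a ticker. This is not something this PR (or Pebble) implements, and note that it deliberately trades durability for latency: a crash can lose up to one interval's worth of state changes.

package main

import (
	"log"
	"sync"
	"time"
)

// debouncedSaver coalesces checkpoint requests so the expensive
// marshal + atomic write + fsync runs at most once per interval.
type debouncedSaver struct {
	mu    sync.Mutex
	dirty bool
	save  func() error // performs the actual checkpoint
}

// markDirty records that state has changed and needs saving.
func (d *debouncedSaver) markDirty() {
	d.mu.Lock()
	d.dirty = true
	d.mu.Unlock()
}

// run checkpoints at most once per interval, and only if state is dirty.
func (d *debouncedSaver) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			d.mu.Lock()
			needSave := d.dirty
			d.dirty = false
			d.mu.Unlock()
			if needSave {
				if err := d.save(); err != nil {
					log.Printf("cannot save state: %v", err)
				}
			}
		case <-stop:
			return
		}
	}
}

func main() {
	saver := &debouncedSaver{save: func() error {
		log.Println("checkpoint")
		return nil
	}}
	stop := make(chan struct{})
	go saver.run(100*time.Millisecond, stop)

	// Simulate many rapid state mutations; only ~10 checkpoints per second run.
	for i := 0; i < 500; i++ {
		saver.markDirty()
		time.Sleep(time.Millisecond)
	}
	close(stop)
}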

benhoyt (Contributor, Author) commented Feb 28, 2024

Going to merge this as is, seeing as this is a fairly high-priority issue -- we can always improve on this implementation detail later.

benhoyt merged commit 80d4435 into canonical:master Feb 28, 2024
15 checks passed
benhoyt deleted the health-no-state-lock branch February 28, 2024 03:36
benhoyt added a commit that referenced this pull request Feb 28, 2024
benhoyt added a commit to benhoyt/juju that referenced this pull request Feb 28, 2024
This Pebble version adds the following since v1.7.1:

- canonical/pebble#368: replace "select=all" with "users=all". Strictly
speaking this is a breaking change; however, it was a design mistake and
we want to fix it as soon as possible. It's almost certain that no one
is using this feature yet.
- canonical/pebble#369: avoid acquiring state lock in health check
endpoint. This fixes (or is one aspect of the fix for) the issue
described in https://bugs.launchpad.net/juju/+bug/2052517, so that the
GET /v1/health endpoint responds much more quickly even when under load.

In addition, this has the side effect of bumping up a few of the
golang.org/x packages, as Pebble depends on higher versions.
jujubot pushed a commit to juju/juju that referenced this pull request Feb 28, 2024
jujubot added a commit to juju/juju that referenced this pull request Feb 28, 2024
#16981

jujubot added a commit to juju/juju that referenced this pull request Feb 28, 2024
#16982

hpidcock (Member) left a comment


Approving retroactively; it seems fine. Thank you for trying to get to the bottom of this.

//
// - https://github.com/canonical/pebble/issues/366
// - https://bugs.launchpad.net/juju/+bug/2052517
func (s *apiSuite) TestStateLockNotHeld(c *C) {
hpidcock (Member) commented:

Rather than adding the LockCount, I would have just acquired the lock, then made the call to the health endpoint. If it times out, then it depends on the lock. But at least the lock count lets you find the point you should start testing from.
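
For illustration, that suggested approach might look roughly like the following. The helper names startTestDaemon and healthGet are hypothetical stand-ins for the daemon package's test scaffolding, and this is not the test that eventually landed in #373.

// Sketch only: hold the state lock, then call GET /v1/health with a deadline.
// If the handler needed the lock, the request would block and the test fails.
func (s *apiSuite) TestHealthDoesNotTakeStateLock(c *C) {
	st := s.startTestDaemon(c).State() // hypothetical helper returning the daemon's state
	st.Lock()                          // hold the state lock for the whole request
	defer st.Unlock()

	done := make(chan int, 1)
	go func() {
		status, _ := s.healthGet() // hypothetical helper: GET /v1/health, returns status code
		done <- status
	}()

	select {
	case status := <-done:
		c.Check(status, Equals, 200) // returned promptly: the handler never needed the lock
	case <-time.After(5 * time.Second):
		c.Fatalf("GET /v1/health blocked, so it must be waiting on the state lock")
	}
}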

benhoyt (Contributor, Author) replied:

Ah, that's a very good idea! And if I do it that way it doesn't actually matter where I start testing from: it makes no difference whether the test acquires the lock before or after the overlord startup goroutine does. Fixed in #373 (and while I was in there, I found that the /v1/health error case still acquired the state lock, so I fixed that too).

benhoyt added a commit to benhoyt/pebble that referenced this pull request Feb 29, 2024
As Harry pointed out at
canonical#369 (comment),
there's a much simpler way to test this without needing a new State
method like LockCount. Just acquire the state lock, then call the
endpoint. If it times out, we know it was trying to acquire the lock.

In addition, fix an issue where the health endpoint would still hold
the state lock if it returned an error, and add a test for that too.
benhoyt added a commit that referenced this pull request Mar 28, 2024