
fix(daemon): avoid acquiring state lock in health check endpoint #369

Merged
merged 1 commit into canonical:master from health-no-state-lock on Feb 28, 2024

Conversation

benhoyt (Contributor) commented Feb 27, 2024

This PR avoids acquiring the state lock in the health check endpoint, a localised fix for the issue described in https://bugs.launchpad.net/juju/+bug/2052517 and #366.

I don't love the addition of the (exported) State.LockCount method, but I couldn't think of another way to ensure that State.Lock was not being called, apart from significant surgery allowing us to swap out the state lock with an interface.
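
For context, the LockCount idea is just a counter bumped on every Lock call, which a test can read before and after hitting the endpoint. A minimal sketch of that shape (not necessarily the exact code in this PR; the field names here are assumptions):

package state // sketch only

import (
	"sync"
	"sync/atomic"
)

// State sketch: just the pieces relevant to the LockCount idea; the real
// Pebble State holds much more.
type State struct {
	mu        sync.Mutex
	lockCount int32
}

// Lock acquires the state lock and records that it was taken.
func (s *State) Lock() {
	s.mu.Lock()
	atomic.AddInt32(&s.lockCount, 1)
}

// Unlock releases the state lock.
func (s *State) Unlock() {
	s.mu.Unlock()
}

// LockCount reports how many times Lock has been called (intended for tests,
// for example asserting the health handler never took the lock).
func (s *State) LockCount() int {
	return int(atomic.LoadInt32(&s.lockCount))
}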

In addition, I'm not sure we need the maintenance/warnings stuff at all in Pebble, as they're not used, but leaving that for a separate discussion and PR.

I tested by saving the script https://go.dev/play/p/LJDLEaXeBpd to cmd/hithealth/main.go, running a Pebble server, and then running a task to heavily load my CPU cores (though the latter doesn't seem to make much of a difference!).

Without the fix, GET /v1/health response times are all above 20ms, and commonly in the 100-300ms range. In other words, very long for a do-nothing health check endpoint. This is because the other clients (in the hithealth script) are starting services, which modifies state and holds the state lock for a relatively long time.

$ go run ./cmd/hithealth/
     0 > 20ms: 0.028897s
     1 > 20ms: 0.192796s
     2 > 20ms: 0.268904s
     3 > 20ms: 0.082985s
     4 > 20ms: 0.030554s
     5 > 20ms: 0.156912s
     6 > 20ms: 0.245212s
     7 > 20ms: 0.047436s
     8 > 20ms: 0.099474s
     9 > 20ms: 0.070440s
    10 > 20ms: 0.060641s
    11 > 20ms: 0.202535s
    12 > 20ms: 0.206226s
    ...

With the fix, GET /v1/health doesn't hold the state lock at all. Response times only exceed 20ms once every 10,000 or so requests, and never go above 100ms even under heavy load:

$ go run ./cmd/hithealth/
 13891 > 20ms: 0.023703s
 15923 > 20ms: 0.024769s
 21915 > 20ms: 0.076423s
 ...
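
For reference, a minimal sketch of what the timing loop in such a script might look like (not the exact hithealth code: the real script also starts services concurrently to load the state lock, and the socket path below is an assumption):

// hithealth sketch: repeatedly hit GET /v1/health and report slow responses.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Assumed default Pebble socket path; adjust for your $PEBBLE directory.
	socket := "/var/lib/pebble/default/.pebble.socket"
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}
	for i := 0; ; i++ {
		start := time.Now()
		resp, err := client.Get("http://localhost/v1/health")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		resp.Body.Close()
		if elapsed := time.Since(start); elapsed > 20*time.Millisecond {
			fmt.Printf("%6d > 20ms: %.6fs\n", i, elapsed.Seconds())
		}
	}
}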

flotter self-requested a review February 27, 2024 19:20
flotter (Contributor) commented Feb 27, 2024

@benhoyt I only have a general question. The worst-case latency on acquiring the state lock for the health request (~200ms) seems quite bad. I wonder if the WaitChange() that is also part of the service-starting loop (10x concurrently) in your test script is also contributing to state lock contention. I am probably missing some data on why this happens, but I would be surprised if any one state lock user holds the lock for a very long time, so even if a lot of state lock operations are interleaved, I am still a bit surprised that a single state lock request has to wait that long (even the average feels very long).

flotter removed their request for review February 27, 2024 19:46
benhoyt (Contributor, Author) commented Feb 27, 2024

@flotter Thanks for the review. Yes, the WaitChange() could well be part of the contention, but I think that's okay -- it just adds "load" to the system and to the state lock. Here it's mainly the health check endpoint we're concerned about making super-fast. It's the State.Unlock() method that is slow, but only when state has been modified and needs to be written.

I did some further timing tests of the different slow parts of State.unlock on my emulated "loaded system", recording the mean and max times for the JSON marshal and for the checkpoint (which uses osutil.AtomicWrite to write the bytes to disk and calls fsync):

with fsync: avg marshal=0.011758 checkpoint=0.015349
  no fsync: avg marshal=0.010221 checkpoint=0.000594
with fsync: max marshal=0.068619 checkpoint=0.032137
  no fsync: max marshal=0.067646 checkpoint=0.000805

Two things to note: 1) It's quite slow just to do the JSON marshalling (this is because state has grown quite large, like several hundred KB of JSON, with quite a few Changes being recorded rapidly), and 2) The checkpoint/save is slow, not due to the file writing, but due to the fsync.
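
For illustration, here is a rough standalone way to get a feel for those two costs (not the actual Pebble code or osutil.AtomicWrite): marshal a few hundred KB of JSON, then write it to a file and fsync it, timing each step separately.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

func main() {
	// Build a payload of a few hundred KB, roughly the size described above.
	data := make(map[string]string, 5000)
	for i := 0; i < 5000; i++ {
		data[fmt.Sprintf("change-%d", i)] = "padding to make the state a realistic size for this measurement"
	}

	start := time.Now()
	b, err := json.Marshal(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("marshal:    %.6fs (%d bytes)\n", time.Since(start).Seconds(), len(b))

	f, err := os.CreateTemp("", "state-*.json")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	start = time.Now()
	if _, err := f.Write(b); err != nil {
		panic(err)
	}
	if err := f.Sync(); err != nil { // the fsync is the expensive part
		panic(err)
	}
	f.Close()
	fmt.Printf("checkpoint: %.6fs\n", time.Since(start).Seconds())
}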

I'd like to consider improving either or both of these separately, for example not doing the fsync (maybe a bad idea?) or only saving every (say) 100ms at most. Or moving to a real database like SQLite... :-)
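
To make the "save at most every 100ms" idea concrete, here is a purely illustrative sketch of coalescing checkpoints behind a ticker. This is not something this PR (or Pebble) implements, and note that it deliberately trades durability for latency: a crash can lose up to one interval's worth of state changes.

package main

import (
	"log"
	"sync"
	"time"
)

// debouncedSaver coalesces checkpoint requests so the expensive
// marshal + atomic write + fsync runs at most once per interval.
type debouncedSaver struct {
	mu    sync.Mutex
	dirty bool
	save  func() error // performs the actual checkpoint
}

// markDirty records that state has changed and needs saving.
func (d *debouncedSaver) markDirty() {
	d.mu.Lock()
	d.dirty = true
	d.mu.Unlock()
}

// run checkpoints at most once per interval, and only if state is dirty.
func (d *debouncedSaver) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			d.mu.Lock()
			needSave := d.dirty
			d.dirty = false
			d.mu.Unlock()
			if needSave {
				if err := d.save(); err != nil {
					log.Printf("cannot save state: %v", err)
				}
			}
		case <-stop:
			return
		}
	}
}

func main() {
	saver := &debouncedSaver{save: func() error {
		log.Println("checkpoint")
		return nil
	}}
	stop := make(chan struct{})
	go saver.run(100*time.Millisecond, stop)

	// Simulate many rapid state mutations; only ~10 checkpoints per second run.
	for i := 0; i < 500; i++ {
		saver.markDirty()
		time.Sleep(time.Millisecond)
	}
	close(stop)
}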

benhoyt (Contributor, Author) commented Feb 28, 2024

Going to merge this as is, seeing as this is a fairly high-priority issue -- we can always improve on this implementation detail later.

benhoyt merged commit 80d4435 into canonical:master Feb 28, 2024
15 checks passed
benhoyt deleted the health-no-state-lock branch February 28, 2024 03:36
benhoyt added a commit that referenced this pull request Feb 28, 2024
benhoyt added a commit to benhoyt/juju that referenced this pull request Feb 28, 2024
This Pebble version adds the following since v1.7.1:

- canonical/pebble#368: replace "select=all" with "users=all". Strictly
speaking this is a breaking change; however, it was a design mistake and
we want to fix it as soon as possible. It's almost certain that no one
is using this feature yet.
- canonical/pebble#369: avoid acquiring state lock in health check
endpoint. This fixes (or is one aspect of the fix for) the issue
described in https://bugs.launchpad.net/juju/+bug/2052517, so that the
GET /v1/health endpoint responds much more quickly even when under load.

In addition, this has the side effect of bumping up a few of the
golang.org/x packages, as Pebble depends on higher versions.
jujubot pushed a commit to juju/juju that referenced this pull request Feb 28, 2024
jujubot added a commit to juju/juju that referenced this pull request Feb 28, 2024
#16981

jujubot added a commit to juju/juju that referenced this pull request Feb 28, 2024
#16982

hpidcock (Member) left a comment


Approving retroactively; it seems fine. Thank you for trying to get to the bottom of this.

//
// - https://github.com/canonical/pebble/issues/366
// - https://bugs.launchpad.net/juju/+bug/2052517
func (s *apiSuite) TestStateLockNotHeld(c *C) {
hpidcock (Member) commented:

Rather than adding the LockCount, I would have just acquired the lock, then made the call to the health endpoint. If it times out, then it depends on the lock. But at least the lock count lets you find the point you should start testing from.
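
For illustration, that suggested approach might look roughly like the following. The helper names startTestDaemon and healthGet are hypothetical stand-ins for the daemon package's test scaffolding, and this is not the test that eventually landed in #373.

// Sketch only: hold the state lock, then call GET /v1/health with a deadline.
// If the handler needed the lock, the request would block and the test fails.
func (s *apiSuite) TestHealthDoesNotTakeStateLock(c *C) {
	st := s.startTestDaemon(c).State() // hypothetical helper returning the daemon's state
	st.Lock()                          // hold the state lock for the whole request
	defer st.Unlock()

	done := make(chan int, 1)
	go func() {
		status, _ := s.healthGet() // hypothetical helper: GET /v1/health, returns status code
		done <- status
	}()

	select {
	case status := <-done:
		c.Check(status, Equals, 200) // returned promptly: the handler never needed the lock
	case <-time.After(5 * time.Second):
		c.Fatalf("GET /v1/health blocked, so it must be waiting on the state lock")
	}
}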

benhoyt (Contributor, Author) replied:

Ah, that's a very good idea! And if I do it that way it doesn't actually matter where I start testing from: it makes no difference whether the test acquires the lock before or after the overlord startup goroutine does. Fixed in #373 (and while I was in there, I found that the /v1/health error case still acquired the state lock, so I fixed that too).

benhoyt added a commit to benhoyt/pebble that referenced this pull request Feb 29, 2024
As Harry pointed out at
canonical#369 (comment),
there's a much simpler way to test this without needing a new State
method like LockCount. Just acquire the state lock, then call the
endpoint. If it times out, we know it was trying to acquire the lock.

In addition, fix an issue where the health endpoint would still hold
the state lock if it returned an error, and add a test for that too.
benhoyt added a commit that referenced this pull request Mar 28, 2024