-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Sidekiq worker resiliency #74
Conversation
PostcodesCollectionWorker was designed to loop through the entire database (which will be 1.8 million postcodes), and iterate over these continually. But [Sidekiq is not designed for never-ending jobs](https://github.com/mperham/sidekiq/wiki/Ent-Rolling-Restarts), and we're seeing related errors appear in the Sidekiq logs: ``` {"@timestamp":"2022-05-12T08:40:49Z","@fields":{"pid":8091,"tid":"TID-1eqe0","context":" PostcodesCollectionWorker JID-325d45d1323979ec6b7b69d3","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"done","@Severity":"INFO","@run_time":1.014,"@message":"done: 1.014 sec"} {"@timestamp":"2022-05-12T08:40:50Z","@fields":{"pid":8091,"tid":"TID-1eqe0","context":" PostcodesCollectionWorker JID-8dc5d575d278d5d9694a605c","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"} {"@timestamp":"2022-05-12T08:40:51Z","@fields":{"pid":8091,"tid":"TID-1eqe0","context":" PostcodesCollectionWorker JID-8dc5d575d278d5d9694a605c","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"done","@Severity":"INFO","@run_time":1.012,"@message":"done: 1.012 sec"} {"@timestamp":"2022-05-12T09:28:11Z","@fields":{"pid":26319,"tid":"TID-1ap4","context":" PostcodesCollectionWorker JID-f70ee78bb8addfb1ebef5418","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"} {"@timestamp":"2022-05-12T09:33:01Z","@fields":{"pid":26319,"tid":"TID-18xs","context":"","program_name":null,"worker":null},"@type":"sidekiq","@status":null,"@Severity":"WARN","@run_time":null,"@message":"Work still in progress [#\u003cstruct Sidekiq::BasicFetch::UnitOfWork queue=\"queue:default\", job=\"{\\\"class\\\":\\\"PostcodesCollectionWorker\\\",\\\"args\\\":[true],\\\"retry\\\":true,\\\"queue\\\":\\\"default\\\",\\\"jid\\\":\\\"f70ee78bb8addfb1ebef5418\\\",\\\"created_at\\\":1652347691.4233317,\\\"enqueued_at\\\":1652347691.4233935}\"\u003e]"} {"@timestamp":"2022-05-12T09:51:55Z","@fields":{"pid":29417,"tid":"TID-1ap4","context":" PostcodesCollectionWorker JID-64f0598e28fe452cb4a5d348","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"} {"@timestamp":"2022-05-12T09:55:42Z","@fields":{"pid":29417,"tid":"TID-18xs","context":"","program_name":null,"worker":null},"@type":"sidekiq","@status":null,"@Severity":"WARN","@run_time":null,"@message":"Work still in progress [#\u003cstruct Sidekiq::BasicFetch::UnitOfWork queue=\"queue:default\", job=\"{\\\"class\\\":\\\"PostcodesCollectionWorker\\\",\\\"args\\\":[true],\\\"retry\\\":true,\\\"queue\\\":\\\"default\\\",\\\"jid\\\":\\\"64f0598e28fe452cb4a5d348\\\",\\\"created_at\\\":1652347982.81896,\\\"enqueued_at\\\":1652347982.8189998}\"\u003e]"} {"@timestamp":"2022-05-12T10:02:34Z","@fields":{"pid":30523,"tid":"TID-1b80","context":" PostcodesCollectionWorker JID-64f0598e28fe452cb4a5d348","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"} {"@timestamp":"2022-05-12T11:52:47Z","@fields":{"pid":30523,"tid":"TID-18xs","context":"","program_name":null,"worker":null},"@type":"sidekiq","@status":null,"@Severity":"WARN","@run_time":null,"@message":"Work still in progress [#\u003cstruct Sidekiq::BasicFetch::UnitOfWork queue=\"queue:default\", job=\"{\\\"class\\\":\\\"PostcodesCollectionWorker\\\",\\\"args\\\":[true],\\\"retry\\\":true,\\\"queue\\\":\\\"default\\\",\\\"jid\\\":\\\"64f0598e28fe452cb4a5d348\\\",\\\"created_at\\\":1652347982.81896,\\\"enqueued_at\\\":1652347982.8189998}\"\u003e]"} {"@timestamp":"2022-05-12T11:52:47Z","@fields":{"pid":30523,"tid":"TID-1b80","context":" PostcodesCollectionWorker JID-64f0598e28fe452cb4a5d348","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"fail","@Severity":"INFO","@run_time":6613.069,"@message":"fail: 6613.069 sec"} ``` Rather than battle with the system and try to keep one job alive forever, this commit re-architects PostcodesCollectionWorker so that it is designed to be short-lived, and takes advantage of Sidekiq's concurrency. PostcodesCollectionWorker now spawns three ProcessPostcodeWorker workers at a time, sleeps for a second, and then spawns another PostcodesCollectionWorker before completing. We now concurrently process 3 postcodes per second, and queue a new batch of postcodes via a new PostcodesCollectionWorker every second. The choice of 1 per second was arbitrary. In the original architectural proposal, we don't specify what the rate limit should be - just noting that [OS Places API has a rate limit of 600 per minute](https://docs.google.com/document/d/1p29g1SQgi2obQnPPsl9amx7xmrJonyH4tRbVMAhXy-A/edit#heading=h.67v3c75qtis3), and that that could update all 1.8 million postcodes "in a few days". At 1 per second, it would take 21 days to update every postcode. At 10 per second (the theoretical maximum allowed by OS Places API), we'd risk breaching the limit due to Sidekiq running on multiple EC2 instances and different workers invoking other workers, i.e. the rate-limit we set is only approximate. A limit of 3 means an update time of around a week for the whole database, and allows some wiggle room for scaling. NB, I've removed the "sleep for 10 seconds if DB empty" rule; the 1 second delay between jobs is enough to avoid spamming the queue.
We had originally planned to have one long-running PostcodesCollectionWorker, and process one postcode worker at a time (spawned from the above). But we actually now want to process 3 postcodes at a time (see previous commit). Whilst we could set a concurrency of 3 to enforce this, that does mean changing the value in multiple places if we were ever to change this number in the future. By setting it to 10, we've introduced a safeguard of sorts (OS Places API has a rate limit of 600 per minute, i.e. 10 per second), and our use of `ApplicationRecord`'s `with_advisory_lock` should prevent us ever spawning more than one active `PostcodesCollectionWorker` at a time, which then itself spawns the number of workers corresponding to `POSTCODES_PER_SECOND` (currently `3`). So even with a concurrency set to 10, we shouldn't expect to see more than around 4 workers at any one time, because of how we've architected the workers themselves. We can easily boost the concurrency numbers by changing the `POSTCODES_PER_SECOND` in future (provided it is less than 10).
We were seeing failures in the workers, but nothing in Sentry. I've also back-filled a missing test for the 'postcode update' happy path.
Previously, the postcode collections work and the postcode update work would be given equal priority, which can lead to lots of PostcodesCollectionWorker workers being invoked, when we really only want one. By putting the ProcessPostcodeWorker workers in a significantly higher priority queue, in theory all ProcessPostcodeWorker workers should be processed before the next PostcodesCollectionWorker is processed, which should mean we're updating postcodes at a consistent rate (and well below OS Places API's rate limits).
I wonder if using an exponential backoff strategy would be more resilient and future-proof, in case their API rate-limiting gets more strict. |
The workers already use exponential backoff by default, except that we've set max_retries to 1, so no job will attempt to retry more than once. This PR is about improving resiliency of the workers to our environment, i.e. datasyncs etc which cause workers to die. |
I know that's how Sidekiq works by default, was thinking of errors from OS Places API, but you're right that the app will not stop working and we'll get lots of alerts so should be easy to fix if it happens. |
In #74, we added a manual call to `GovukError.notify` when an exception was raised in Sidekiq. This was because Sentry would not receive any errors from Sidekiq. However, govuk_app_config does support Sidekiq Sentry error reporting - it just needs the `sentry-sidekiq` gem adding. This is explicitly warned about as of version 4.6.0: https://github.com/alphagov/govuk_app_config/blob/main/CHANGELOG.md#460 By adding this dependency, we should be able to remove the duplicate config.
In #74, we added a manual call to `GovukError.notify` when an exception was raised in Sidekiq. This was because Sentry would not receive any errors from Sidekiq. However, govuk_app_config does support Sidekiq Sentry error reporting - it just needs the `sentry-sidekiq` gem adding. This is explicitly warned about as of version 4.6.0: https://github.com/alphagov/govuk_app_config/blob/main/CHANGELOG.md#460 By adding this dependency, we should be able to remove the duplicate config.
We were seeing the
PostcodesCollectionWorker
die after running for a few hours (possibly related to the nightly datasync).Rather than have one long-running worker iterate through the entire 1.8 million postcodes in the database, we've now re-architected it to be short-lived, fetching the 3 oldest postcodes and triggering jobs to update those.
This also bumps our postcode updating rate from around 1 per second to around 3 per second, which cuts the total update time from around 21 days to around 7 days. We could in theory go even closer to the OS Places API 10-requests-per-second rate limit, but as Sidekiq is running across multiple instances and this figure is only approximate, it is safer set to the lower value of 3.
Finally, this PR splits the Sidekiq worker queues into two, so that we can allocate more priority to the updating of postcodes.
Trello: https://trello.com/c/jE4Qghdd/2902-fix-sidekiq-workers-locations-api