
Improve Sidekiq worker resiliency #74

Merged · 4 commits merged into main from improve-worker-resiliency on May 17, 2022

Conversation

ChrisBAshton
Contributor

We were seeing the PostcodesCollectionWorker die after running for a few hours (possibly related to the nightly datasync).

Rather than have one long-running worker iterate through all 1.8 million postcodes in the database, we've re-architected it to be short-lived: it fetches the 3 oldest postcodes and triggers jobs to update them.

This also bumps our postcode updating rate from around 1 per second to around 3 per second, which cuts the total update time from around 21 days to around 7 days. We could in theory go closer to the OS Places API 10-requests-per-second rate limit, but as Sidekiq runs across multiple instances and this figure is only approximate, it is safer to stick to the lower value of 3.

Finally, this PR splits the Sidekiq worker queues into two, so that we can allocate more priority to the updating of postcodes.

Trello: https://trello.com/c/jE4Qghdd/2902-fix-sidekiq-workers-locations-api

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

PostcodesCollectionWorker was designed to loop through the entire
database (which will be 1.8 million postcodes), and iterate over
these continually. But [Sidekiq is not designed for never-ending
jobs](https://github.com/mperham/sidekiq/wiki/Ent-Rolling-Restarts),
and we're seeing related errors appear in the Sidekiq logs:

```
{"@timestamp":"2022-05-12T08:40:49Z","@fields":{"pid":8091,"tid":"TID-1eqe0","context":" PostcodesCollectionWorker JID-325d45d1323979ec6b7b69d3","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"done","@Severity":"INFO","@run_time":1.014,"@message":"done: 1.014 sec"}
{"@timestamp":"2022-05-12T08:40:50Z","@fields":{"pid":8091,"tid":"TID-1eqe0","context":" PostcodesCollectionWorker JID-8dc5d575d278d5d9694a605c","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"}
{"@timestamp":"2022-05-12T08:40:51Z","@fields":{"pid":8091,"tid":"TID-1eqe0","context":" PostcodesCollectionWorker JID-8dc5d575d278d5d9694a605c","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"done","@Severity":"INFO","@run_time":1.012,"@message":"done: 1.012 sec"}
{"@timestamp":"2022-05-12T09:28:11Z","@fields":{"pid":26319,"tid":"TID-1ap4","context":" PostcodesCollectionWorker JID-f70ee78bb8addfb1ebef5418","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"}
{"@timestamp":"2022-05-12T09:33:01Z","@fields":{"pid":26319,"tid":"TID-18xs","context":"","program_name":null,"worker":null},"@type":"sidekiq","@status":null,"@Severity":"WARN","@run_time":null,"@message":"Work still in progress [#\u003cstruct Sidekiq::BasicFetch::UnitOfWork queue=\"queue:default\", job=\"{\\\"class\\\":\\\"PostcodesCollectionWorker\\\",\\\"args\\\":[true],\\\"retry\\\":true,\\\"queue\\\":\\\"default\\\",\\\"jid\\\":\\\"f70ee78bb8addfb1ebef5418\\\",\\\"created_at\\\":1652347691.4233317,\\\"enqueued_at\\\":1652347691.4233935}\"\u003e]"}
{"@timestamp":"2022-05-12T09:51:55Z","@fields":{"pid":29417,"tid":"TID-1ap4","context":" PostcodesCollectionWorker JID-64f0598e28fe452cb4a5d348","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"}
{"@timestamp":"2022-05-12T09:55:42Z","@fields":{"pid":29417,"tid":"TID-18xs","context":"","program_name":null,"worker":null},"@type":"sidekiq","@status":null,"@Severity":"WARN","@run_time":null,"@message":"Work still in progress [#\u003cstruct Sidekiq::BasicFetch::UnitOfWork queue=\"queue:default\", job=\"{\\\"class\\\":\\\"PostcodesCollectionWorker\\\",\\\"args\\\":[true],\\\"retry\\\":true,\\\"queue\\\":\\\"default\\\",\\\"jid\\\":\\\"64f0598e28fe452cb4a5d348\\\",\\\"created_at\\\":1652347982.81896,\\\"enqueued_at\\\":1652347982.8189998}\"\u003e]"}
{"@timestamp":"2022-05-12T10:02:34Z","@fields":{"pid":30523,"tid":"TID-1b80","context":" PostcodesCollectionWorker JID-64f0598e28fe452cb4a5d348","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"start","@Severity":"INFO","@run_time":null,"@message":"start"}
{"@timestamp":"2022-05-12T11:52:47Z","@fields":{"pid":30523,"tid":"TID-18xs","context":"","program_name":null,"worker":null},"@type":"sidekiq","@status":null,"@Severity":"WARN","@run_time":null,"@message":"Work still in progress [#\u003cstruct Sidekiq::BasicFetch::UnitOfWork queue=\"queue:default\", job=\"{\\\"class\\\":\\\"PostcodesCollectionWorker\\\",\\\"args\\\":[true],\\\"retry\\\":true,\\\"queue\\\":\\\"default\\\",\\\"jid\\\":\\\"64f0598e28fe452cb4a5d348\\\",\\\"created_at\\\":1652347982.81896,\\\"enqueued_at\\\":1652347982.8189998}\"\u003e]"}
{"@timestamp":"2022-05-12T11:52:47Z","@fields":{"pid":30523,"tid":"TID-1b80","context":" PostcodesCollectionWorker JID-64f0598e28fe452cb4a5d348","program_name":null,"worker":"PostcodesCollectionWorker"},"@type":"sidekiq","@status":"fail","@Severity":"INFO","@run_time":6613.069,"@message":"fail: 6613.069 sec"}
```

Rather than battle with the system and try to keep one job alive
forever, this commit re-architects PostcodesCollectionWorker so
that it is designed to be short-lived, and takes advantage of
Sidekiq's concurrency. PostcodesCollectionWorker now spawns three
ProcessPostcodeWorker workers at a time, sleeps for a second, and
then spawns another PostcodesCollectionWorker before completing.

We now concurrently process 3 postcodes per second, and queue a
new batch of postcodes via a new PostcodesCollectionWorker every
second.
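For illustration, here's a minimal sketch of that loop. The worker
class names and `POSTCODES_PER_SECOND` come from this PR; the
`Postcode` model and its columns are illustrative, not the repo's
actual code.

```ruby
class PostcodesCollectionWorker
  include Sidekiq::Worker

  POSTCODES_PER_SECOND = 3

  def perform
    # Enqueue one update job for each of the least-recently-updated
    # postcodes (illustrative model/column names).
    Postcode.order(:updated_at)
            .limit(POSTCODES_PER_SECOND)
            .pluck(:postcode)
            .each { |postcode| ProcessPostcodeWorker.perform_async(postcode) }

    # Throttle to roughly POSTCODES_PER_SECOND updates per second, then
    # hand over to a fresh, short-lived collection job rather than
    # looping forever inside this one.
    sleep 1
    self.class.perform_async
  end
end
```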

The choice of 1 per second was arbitrary. The original
architectural proposal doesn't specify what the rate limit should
be - it just notes that [OS Places API has a rate limit of 600 per
minute](https://docs.google.com/document/d/1p29g1SQgi2obQnPPsl9amx7xmrJonyH4tRbVMAhXy-A/edit#heading=h.67v3c75qtis3),
and that at that rate we could update all 1.8 million postcodes "in a few days".

At 1 per second, it would take 21 days to update every postcode.
At 10 per second (the theoretical maximum allowed by OS Places API),
we'd risk breaching the limit due to Sidekiq running on multiple EC2
instances and different workers invoking other workers, i.e. the
rate-limit we set is only approximate.

A limit of 3 means an update time of around a week for the whole
database, and allows some wiggle room for scaling.
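For reference, the arithmetic behind those estimates (1.8 million
postcodes, ignoring downtime):

```
1,800,000 postcodes ÷  1 per second = 1,800,000 s ≈ 20.8 days
1,800,000 postcodes ÷  3 per second =   600,000 s ≈  6.9 days
1,800,000 postcodes ÷ 10 per second =   180,000 s ≈  2.1 days
```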

NB, I've removed the "sleep for 10 seconds if DB empty" rule;
the 1 second delay between jobs is enough to avoid spamming the
queue.

We had originally planned to have one long-running PostcodesCollectionWorker,
and process one postcode worker at a time (spawned from the above).
But we actually now want to process 3 postcodes at a time (see
previous commit).

Whilst we could set a concurrency of 3 to enforce this, that does
mean changing the value in multiple places if we were ever to change
this number in the future. By setting it to 10, we've introduced
a safeguard of sorts (OS Places API has a rate limit of 600 per
minute, i.e. 10 per second), and our use of `ApplicationRecord`'s
`with_advisory_lock` should prevent us ever spawning more than
one active `PostcodesCollectionWorker` at a time, which then
itself spawns the number of workers corresponding to
`POSTCODES_PER_SECOND` (currently `3`).

So even with a concurrency set to 10, we shouldn't expect to see
more than around 4 workers at any one time, because of how we've
architected the workers themselves. We can easily boost the
concurrency by changing `POSTCODES_PER_SECOND` in future (provided
it stays below 10).
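As a sketch of that guard (the lock name is illustrative; the
helper's default behaviour is to wait for the lock, so a second
collection worker simply queues up behind the first rather than
duplicating work):

```ruby
class PostcodesCollectionWorker
  include Sidekiq::Worker

  def perform
    # Only one collection run can hold this database advisory lock at
    # a time; any concurrent PostcodesCollectionWorker waits here.
    ApplicationRecord.with_advisory_lock("postcodes-collection-worker") do
      # ... fetch the oldest postcodes, enqueue ProcessPostcodeWorker
      # jobs, sleep, and re-enqueue self (see the earlier sketch) ...
    end
  end
end
```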
We were seeing failures in the workers, but nothing in Sentry, so
this commit adds a manual call to `GovukError.notify` when an
exception is raised in a worker. I've also back-filled a missing
test for the 'postcode update' happy path.
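For illustration, a hedged sketch of that manual reporting (the
exact place the repo hooks this in may differ):

```ruby
class ProcessPostcodeWorker
  include Sidekiq::Worker

  def perform(postcode)
    # ... look the postcode up against OS Places API and update it ...
  rescue StandardError => e
    # Sidekiq failures weren't reaching Sentry, so report explicitly
    # before re-raising for Sidekiq's own retry handling.
    GovukError.notify(e)
    raise
  end
end
```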
Previously, the postcode collection work and the postcode update
work were given equal priority, which could lead to lots of
PostcodesCollectionWorker workers being invoked when we really
only want one.

By putting the ProcessPostcodeWorker workers in a significantly
higher priority queue, in theory all ProcessPostcodeWorker workers
should be processed before the next PostcodesCollectionWorker is
processed, which should mean we're updating postcodes at a
consistent rate (and well below OS Places API's rate limits).
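A hedged sketch of how that split might look at the worker level
(queue names are illustrative; the relative weighting between the
two queues is set wherever the Sidekiq server's queue list is
configured):

```ruby
class ProcessPostcodeWorker
  include Sidekiq::Worker
  # Postcode updates go on their own, higher-priority queue so they
  # drain before the next PostcodesCollectionWorker run is picked up.
  sidekiq_options queue: :process_postcode, retry: 1
end

class PostcodesCollectionWorker
  include Sidekiq::Worker
  sidekiq_options queue: :default, retry: 1
end
```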
@MuriloDalRi
Contributor

I wonder if using an exponential backoff strategy would be more resilient and future-proof, in case their API rate-limiting gets more strict.

@ChrisBAshton
Contributor Author

The workers already use exponential backoff by default, except that we've set max_retries to 1, so no job will attempt to retry more than once.

This PR is about improving the resiliency of the workers to our environment, i.e. the datasyncs etc. which cause workers to die.
I think what you're suggesting is a separate change to make our workers more resilient to OS Places API exceptions, which is fair, but probably something for a future PR. For now, if the OS Places API limits change, we'll see lots of failed workers and get a Slack notification, but nothing will fall over and Locations API will carry on working.

@MuriloDalRi
Contributor

I know that's how Sidekiq works by default; I was thinking of errors from OS Places API. But you're right that the app will not stop working, and we'll get lots of alerts, so it should be easy to fix if it happens.

@ChrisBAshton ChrisBAshton merged commit 9780900 into main May 17, 2022
@ChrisBAshton ChrisBAshton deleted the improve-worker-resiliency branch May 17, 2022 11:49
ChrisBAshton added a commit that referenced this pull request May 20, 2022
In #74, we added a manual call to `GovukError.notify` when an
exception was raised in Sidekiq. This was because Sentry would
not receive any errors from Sidekiq.

However, govuk_app_config does support Sidekiq Sentry error
reporting - it just needs the `sentry-sidekiq` gem adding.
This is explicitly warned about as of version 4.6.0:
https://github.com/alphagov/govuk_app_config/blob/main/CHANGELOG.md#460

By adding this dependency, we should be able to remove the
duplicate config.
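A minimal sketch of that follow-up, assuming the standard Gemfile
layout:

```ruby
# Gemfile
gem "govuk_app_config"
gem "sentry-sidekiq" # lets govuk_app_config report Sidekiq errors to Sentry
```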