This repository has been archived by the owner on Jun 6, 2018. It is now read-only.

Discord update_all_groups/update_all_nicknames running out of retries #1064

Open
mcmilj opened this issue May 14, 2018 · 6 comments

@mcmilj

mcmilj commented May 14, 2018

I've noticed that, if you have a large number of Discord members, running this task produces a lot of failures (which, strangely, do not appear in the admin notifications).

For example:

[2018-05-14 12:50:40,219: ERROR/ForkPoolWorker-1] Task discord.update_nickname[aaaaaaaa] raised unexpected: MaxRetriesExceededError("Can't retry discord.update_nickname[aaaaaaaa] args:(127,) kwargs:{}",)
Traceback (most recent call last):
  File "/home/allianceauth/venv/lib/python3.5/site-packages/allianceauth/services/modules/discord/manager.py", line 115, in decorated
    global_ratelimit=bool(existing_global_backoff)
allianceauth.services.modules.discord.manager.api_backoff.<locals>.PerformBackoff

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/allianceauth/venv/lib/python3.5/site-packages/allianceauth/services/modules/discord/tasks.py", line 112, in update_nickname
    DiscordOAuthManager.update_nickname(user.discord.uid, DiscordTasks.get_nickname(user))
  File "/home/allianceauth/venv/lib/python3.5/site-packages/allianceauth/services/modules/discord/manager.py", line 149, in decorated
    raise DiscordApiBackoff(retry_after=bo.retry_after, global_ratelimit=bo.global_ratelimit)
allianceauth.services.modules.discord.manager.DiscordApiBackoff

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/allianceauth/venv/lib/python3.5/site-packages/celery/app/trace.py", line 374, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/allianceauth/venv/lib/python3.5/site-packages/celery_once/tasks.py", line 67, in __call__
    return super(QueueOnce, self).__call__(*args, **kwargs)
  File "/home/allianceauth/venv/lib/python3.5/site-packages/celery/app/trace.py", line 629, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/allianceauth/venv/lib/python3.5/site-packages/allianceauth/services/modules/discord/tasks.py", line 116, in update_nickname
    raise self.retry(countdown=bo.retry_after_seconds)
  File "/home/allianceauth/venv/lib/python3.5/site-packages/celery/app/task.py", line 672, in retry
    self.name, request.id, S.args, S.kwargs))
celery.exceptions.MaxRetriesExceededError: Can't retry discord.update_nickname[aaaaaaaa] args:(127,) kwargs:{}

This seems to happen because each nickname/group update is dispatched as its own task and they all run in parallel. When the rate limit is reached, every task receives the same backoff time, so they all retry at the same moment and hit the rate limit again. This quickly exhausts the maximum number of retries.

I don't know a huge amount about Celery, but this should be fixable by somehow getting these updates to run sequentially.
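For illustration only, one mitigation (not from this thread, and not the actual AllianceAuth code) would be to add random jitter to the retry countdown so the rate-limited tasks don't all wake up at the same instant. A rough sketch, reusing names visible in the traceback above; the helper function is hypothetical:

import random
from celery import shared_task
from allianceauth.services.modules.discord.manager import DiscordApiBackoff

@shared_task(bind=True, max_retries=10)
def update_nickname(self, user_pk):
    try:
        _do_discord_nickname_update(user_pk)  # hypothetical stand-in for the real API call
    except DiscordApiBackoff as bo:
        # Spread the retries over a window instead of retrying in lockstep.
        jitter = random.uniform(0, bo.retry_after_seconds)
        raise self.retry(countdown=bo.retry_after_seconds + jitter)

This doesn't make the updates sequential, but it would at least stop every task from hammering the API at the same moment.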

@Adarnof
Member

Adarnof commented May 16, 2018

Rate limiting in Celery is a total PITA. I've looked into it before and the internet says there's no really good way to do it. The nature of group updates prevents auth from knowing to queue and process them sequentially.

The only way I could see this being more appropriately handled is if we routed all discord-related tasks to a single queue and rate limited a single worker pulling from it. That would be a big documentation change but not too much code work. I just don't trust users to read the docs.
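For reference, that dedicated-queue setup would look roughly like this, assuming a standard Celery 4 / Django configuration (the queue name, project name and worker options are illustrative only):

# settings.py: route every Discord service task to its own queue
CELERY_TASK_ROUTES = {
    'discord.*': {'queue': 'discord'},
}

# then run exactly one single-process worker for that queue, e.g.:
#   celery -A myauth worker -Q discord --concurrency=1

With only one worker consuming that queue, the Discord API calls are effectively serialized, which is what the backoff handling needs.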

@mcmilj
Author

mcmilj commented May 16, 2018

What about running the updates as a single task? Or is that a similarly awkward change?

@Adarnof
Member

Adarnof commented May 16, 2018

It's the "as a single task" part which is problematic. Whever someone's group changes on auth it triggers a service group update. These service group updates don't apply to all users, only the user who had a group change. As auth can't predict when groups will change it can't bundle these requests together to perform them as a single task.

Unfortunately for large organizations the way we currently handle Discord is definitely not optimal.

A proposed solution has been to write our own "worker" of sorts based on an async Discord client package that already handles rate limiting properly, and implement our own message broker to send the requests across. That's a lot of work but would probably fix it. I'd poked around at it a few months ago, but work stalled when the XML API deadline got close.

I'm wondering if a different approach would be even better: instead of writing a web API to send requests across, why not create a celery "worker" that pulls from a normal queue (so I can skip this whole messaging broker problem) and knows to stall processing of specific types of tasks when it hits a rate limit. I'll have to do some more reading but I think that should be possible, and as I type this out I'm beginning to think that's the better answer.

@basraah
Member

basraah commented May 16, 2018

why not create a celery "worker" that pulls from a normal queue (so I can skip this whole messaging broker problem) and knows to stall processing of specific types of tasks when it hits a rate limit.

Why not switch to a task worker that can handle rate limiting properly instead of hacky workarounds?

https://dramatiq.io/cookbook.html#rate-limiting
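The cookbook example boils down to roughly this (a sketch only; the actor, limiter name and limit are illustrative, and it assumes a Redis backend for the shared rate-limit state):

import dramatiq
from dramatiq.rate_limits import ConcurrentRateLimiter
from dramatiq.rate_limits.backends import RedisBackend

# Shared backend so the limit applies across all worker processes.
backend = RedisBackend()

# At most one in-flight Discord API call at a time.
DISCORD_MUTEX = ConcurrentRateLimiter(backend, "discord-api", limit=1)

@dramatiq.actor(max_retries=10)
def update_nickname(user_pk):
    # If the slot is taken, acquire() raises and the message is retried later.
    with DISCORD_MUTEX.acquire():
        ...  # call the Discord API here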

@Adarnof
Member

Adarnof commented May 18, 2018

Celery can (almost) handle it properly through bootsteps. This sample code almost does what we want by scaling rate limits by the number of workers. Unfortunately Discord has dynamic rate limits, so this specific example wouldn't work for us, but the idea is there.

@k9by

k9by commented May 24, 2018

Maybe I am missing something, but wouldn't this Celery task option work if it were exposed at the task level?

http://docs.celeryproject.org/en/latest/userguide/tasks.html

Task.rate_limit
Set the rate limit for this task type (limits the number of tasks that can be run in a given time frame). Tasks will still complete when a rate limit is in effect, but it may take some time before it’s allowed to start.

If this is None no rate limit is in effect. If it is an integer or float, it is interpreted as “tasks per second”.

The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.

Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600ms between starting two tasks on the same worker instance.

Default is the task_default_rate_limit setting: if not specified means rate limiting for tasks is disabled by default.

Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
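For what it's worth, exposing that option would be roughly a one-line change per task, something like (the limit value is illustrative):

from celery import shared_task

@shared_task(bind=True, rate_limit='1/s')  # per worker instance, not global
def update_nickname(self, user_pk):
    ...

But as the last paragraph of the quoted docs notes, it's a per-worker-instance limit, so it only approximates Discord's global limit if all Discord tasks are restricted to a single worker/queue.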
