Revoked tasks storage & communication improvements #6075

Open · 4 tasks done
jheld opened this issue May 6, 2020 · 9 comments

Comments

@jheld
Contributor

jheld commented May 6, 2020

Checklist

  • I have checked the issues list
    for similar or identical enhancement to an existing feature.
  • I have checked the pull requests list
    for existing proposed enhancements.
  • I have checked the commit log
    to find out if the same enhancement was already implemented in the
    master branch.
  • I have included all related issues and possible duplicate issues in this issue
    (If there are none, check this box anyway).

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

  • None

Brief Summary

Part of this is a question about the current implementation of revoked task storage and communication; depending on the discussion, the rest is about possible improvements (including documentation).

This conversation is mostly about the redis broker (for brevity & consistency the implementation discussion sticks to it), but I imagine the same issue can be seen on other brokers.

When a worker starts up, it asks (via mingle) all of the workers in the system for their revoked task lists, and it sends its own in that same message. It then receives the revoked task lists from all workers that were able to respond and merges those lists, one by one, into its own.

For a fairly healthy celery cluster, this means the new worker sends its revoked tasks to all other workers (if persistent mode is disabled there will be no revoked tasks yet), and it then receives what will likely be 99%-identical revoked task lists from every other worker in the cluster (other than the ones that are also just coming up, like this worker).

If, for instance, you have the default 500k revoked tasks list fully saturated (or really, just any large number), that is a lot of messages and data flying through the system at roughly the same time. In redis' case, depending on the number of existing workers in the cluster (whether they're being killed or not), the number of new workers coming up, and how many revoked tasks each existing, fully started worker holds, we risk a redis OOM error if redis isn't provisioned with at least several GB of memory (my company has seen this many times, and I'm now working through extensive debugging, tests, and improvements). This can cripple a celery cluster, along with any other process that relies on celery or uses the same broker.
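
To make the scale concrete, here is a rough back-of-envelope sketch; the worker count is an assumed example, and 500k is simply the saturated default mentioned above:

```python
# Back-of-envelope estimate of the startup mingle traffic described above.
# The worker count is an assumed example, not a measurement.
uuid_len = 36                   # a task id is a 36-character UUID string
revoked_per_worker = 500_000    # a saturated revoked list per worker
existing_workers = 20           # assumed cluster size

# Each existing worker replies to the new worker's hello with its full revoked
# list, so the broker briefly holds roughly:
reply_bytes = existing_workers * revoked_per_worker * uuid_len
print(f"~{reply_bytes / 1e9:.1f} GB of near-identical revoked ids in flight")
# ~0.4 GB for a single starting worker; restart several workers at once and a
# modestly sized redis instance can hit its memory limit.
```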

So, part of this is a question in implementation, and the other part is, assuming we wish to make alternative implementations, the discussion of those.

Design

Architectural Considerations

None

I don't think persistent mode should be discussed here (at least in its current standard implementation); I believe, and can verify of course, that it still sends the hello revoked & clock-sync messaging at start up (it may actually have revoked IDs to send in that case).

If the general assumption (unless this is considered a non-starter) is that a healthy cluster will have each worker sharing the same or very similar revoked task IDs, then, as above, we're doing a lot of work for very little (if any) gain.

Celery has leader election. Could we instead use the leader as the process (or let it delegate a random worker on the fly) that responds back with its revoked task IDs?

We would still have each worker hold its local copy of the revoked task IDs, but the transmission/communication via mingle will shrink down in complexity.

That's assuming at least 2 things:

  • we can rely on a healthy cluster (or the notion of a leader/delegator)
  • we can use the leader system -- I'm not entirely sure what the leader system is designed to do yet, but I'm very interested in it (I'm hoping to look more into it).
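
To illustrate the delegation idea, here is a rough sketch of directing the startup hello at a single node rather than broadcasting it to everyone. The leader nodename is a hard-coded placeholder, and the inspect.hello() call and the shape of its reply are assumptions based on what mingle does internally, so treat this as a sketch rather than working code:

```python
from celery import Celery
from celery.worker import state as worker_state

app = Celery("proj", broker="redis://localhost:6379/0")  # assumed broker URL

def elect_leader():
    # Placeholder: a real implementation would get this from Celery's
    # election machinery; hard-coded here purely for illustration.
    return "celery@leader-host"

def mingle_with_leader(our_hostname, our_revoked):
    # Scope the hello to one destination instead of every worker in the cluster.
    inspect = app.control.inspect(destination=[elect_leader()], timeout=1.0)
    replies = inspect.hello(our_hostname, revoked=our_revoked) or {}
    for nodename, reply in replies.items():
        # One reply to merge instead of N near-identical ones.
        for task_id in (reply.get("revoked") or {}):
            worker_state.revoked.add(task_id)
```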

Another approach for storing these IDs (and their times) would be to support different backends (e.g. redis, dynamodb, elasticsearch, etc.), so the IDs are stored there, and TTL/expiration (and probably also max length) could be supported the same as ever -- and better than celery currently does it.

Having it stored elsewhere would generally be much better in terms of data storage, and it would also remove the need to keep the data locally. That said, data locality is not necessarily a bad thing, so I could see workers syncing against the backend on startup and also sending revokes to the backend as they go through. This should (I think?) get rid of the startup thundering-herd issue.
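
A minimal sketch of what that could look like with redis as the backend, using a sorted set scored by revocation time so expiry and max length are enforced server-side. None of these names are existing Celery APIs; the key name, expiry window, and connection settings are assumptions, and the maxlen just mirrors the figure discussed above:

```python
import time

import redis

r = redis.Redis()               # assumed connection settings
KEY = "celery:revoked"          # hypothetical key name
EXPIRES = 3 * 60 * 60           # assumed expiry window, in seconds
MAXLEN = 500_000                # mirrors the revoked-list size discussed above

def revoke(task_id, now=None):
    now = now or time.time()
    r.zadd(KEY, {task_id: now})
    # Drop entries older than the expiry window (TTL)...
    r.zremrangebyscore(KEY, 0, now - EXPIRES)
    # ...and trim to MAXLEN by evicting the oldest scores first.
    r.zremrangebyrank(KEY, 0, -(MAXLEN + 1))

def is_revoked(task_id):
    return r.zscore(KEY, task_id) is not None
```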

Yes, it's true that projects should try to write their code and tasks in such a way that revokes don't happen too often. That is a worthy goal -- reducing their bugs, changing their retry policies, etc. -- but the tooling should also try to improve itself where it can. This feels almost like the tool is broken (it's not), but it has that side effect. Deleting pidbox queues is not a good policy either -- when workers die/stop listening to the broker because it responds with OOM, you then have orphaned pidbox [reply] queues just... sitting there. That's more of a non-amqp issue, but it's a huge use case.

I may try to trim down this initial issue -- there's a lot of info here -- but so far I haven't seen any "simple" solution to this, and there's a lot of context at play.

Proposed Behavior

Proposed UI/UX

Diagrams

N/A

Alternatives

None

@jheld
Contributor Author

jheld commented May 8, 2020

Note: I have opened a separate issue regarding what appears to be a bug in the revoke store (LimitedSet). It's a bug in terms of the documentation (the original PR has comments suggesting the data logic was implemented that way to remain performant). I'm unsure whether those performance concerns are still relevant in py3 (or whether other choices exist that we could switch to).

@auvipy
Member

auvipy commented May 15, 2020

Would be glad if you can push it forward

@jheld
Contributor Author

jheld commented May 16, 2020

Small update.

I've added a set of custom control commands to my project for revoked info on workers (including purging & clearing).

I don't know that I've run it against the 500k default, but I have certainly run it against a set somewhere in the higher range. It seems fast!

I was looking into leveraging sortedcollections' ValueSortedDict (since the underlying data stores have impressive reported benchmarks against other data structures/implementations). In my testing it is slower than our standard LimitedSet (I created a LimitedSortedSet based on it). At 500k it's definitely slower than celery's (often 2 or more times slower). It's not "slow" so much as not nearly as fast as I was expecting in a test env, given the project I was trying to leverage.

But still, both were quite fast for add/update/purge. I purged all values and ran another 500k onto already-filled data. Those operations (at least for LimitedSet) appear to complete within a second (some well below 1s).

I may run some timeit benchmarks (rather than my several one-off tests).

So, mostly what I'm saying right now is that although I trust the implementer(s) on the current LimitedSet design, I'm not sure how large a data set they were testing against. 500k is totally fine for modern hardware -- you shouldn't really notice any human-perceptible latency. I'm considering adding a purge_on_add=PURGE_ON_ADD_BOOL setting on the __init__ so that instead of lazily calling purge on add (only when maxlen has been reached), the user may request a purge on every add (a sketch follows below). For data sizes that aren't enormous (500k certainly is not), I think this would be a fine option. It would then be up to the calling projects to determine what behavior they want.

Edit: adding a 5M data set took 9s for ours and 17s for my new implementation.
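
As an illustrative sketch of the purge_on_add idea (this subclass is not part of Celery; the add/purge signatures are assumed from celery.utils.collections.LimitedSet):

```python
from celery.utils.collections import LimitedSet

class EagerLimitedSet(LimitedSet):
    """LimitedSet variant that purges expired entries on every add."""

    def __init__(self, *args, purge_on_add=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.purge_on_add = purge_on_add

    def add(self, item, now=None):
        super().add(item, now)
        if self.purge_on_add:
            # Purge unconditionally instead of waiting for maxlen to be hit.
            self.purge()
```

The eager path could then be compared against the lazy default with timeit, to see how much the unconditional purge costs at 500k and 5M entries.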

@auvipy
Member

auvipy commented May 16, 2020

I would say you should keep using and improving it. Later, optional cythonization can be applied if needed.

@jheld
Contributor Author

jheld commented May 16, 2020

@auvipy thank you for reading through so far!

I do concur re: adding non-lazy purging options and/or cythonization/other speed-ups; I was hoping we could improve the speed even well above 500k, so that non-lazy purging becomes viable and we can get around some of the storage & communication issues.

But, knowing that most projects probably won't see/use my github gist for purge_revoked, we still have a process & tech issue (especially for redis, but probably for RabbitMQ too) with all of this data flying through the broker on worker startup.

For instance, if we introduced a plugin for an alternative "persistent" revoked backend (not using the shelve file) and called it RedisPersistentRevoked (or whatever), we could store the source of truth in redis as a sorted set; as task IDs are added to the revoked storage, we would discard element ranges based on the sort score. I'm unsure exactly how we would communicate expiration to the downstream workers if they keep their own copy (which I'm not 100% sold on at this time), but I'm sure there are ways to do it. The workers could still (maybe they should!) hold onto their own local copy during runtime, but on startup they could simply fetch the members from redis: 1 copy, as opposed to N copies (N being the number of already-started workers). A rough sketch of that startup sync follows below.
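
Something like this, where the key name and connection settings are assumptions and nothing here is an existing Celery API; since the proposal is a sorted set, the startup fetch would be a ZRANGE rather than SMEMBERS:

```python
import redis

from celery.utils.collections import LimitedSet

r = redis.Redis()               # assumed connection settings
KEY = "celery:revoked"          # hypothetical key used by RedisPersistentRevoked

def load_revoked_on_startup(maxlen=500_000):
    # One round trip to redis replaces N near-identical mingle replies.
    revoked = LimitedSet(maxlen=maxlen)
    for task_id in r.zrange(KEY, 0, -1):
        revoked.add(task_id.decode())
    return revoked
```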

Currently that feels to me like a much neater & more scalable approach, just in terms of reducing the issues around mingle's implementation.

I am interested in the leader election/framework, but I don't really know much about it. I'd like to learn more. I have looked through some of the docs but haven't necessarily found what I'm looking for -- if you happen to know/wish to elaborate on the leader system, please do/send over doc/code links. I do think it could be really cool/useful to leverage it, if it's an appropriate solution.

@auvipy
Member

auvipy commented May 16, 2020

I have some sparse knowledge of them but will review and come back with a compiled collection. I need some time though, as I have a backlog to clear in celery-related projects. In this time of global crisis I can dedicate much more time to celery than before, and I guess that will hold for the next few months. But you shouldn't stop your trial-and-error R&D on these topics.

auvipy added this to the 4.5 milestone May 16, 2020
@auvipy
Member

auvipy commented May 16, 2020

You can check this issue for some insight: #6067

@caffeinatedMike

Pinging this because it would be amazing to have a redis option for statedb. Been googling for a bit and found this issue.

@auvipy
Member

auvipy commented Apr 13, 2022

we need champions for the effort
