Add "How does it work?" section with a little explanation #69
Comments
No, the tasks are queued in the celery workers; it doesn't matter whether they're sent from the same or different clients. In your example, the tasks are sent from the gunicorn process(es) into the broker, as normal with Celery. The celery worker consuming from that queue then queues them in memory until `flush_interval` or `flush_every` is met, and then executes the task.
@clokep The Flash! Oh, I see - the plugin sends the Task to the broker immediately, then a worker gets the task from the queue but doesn't execute the "business" code until `flush_interval` or `flush_every` is met?
The statement is correct, but celery-batches doesn't do anything special when sending the task; this is all standard Celery behavior.
The following assumes you're using the standard "prefork" configuration of a Celery worker (although the explanation doesn't change too much if you're using gevent, eventlet, or threads):

Celery workers have a "main" process which fetches tasks from the broker. By default it fetches whatever "prefetch multiplier x concurrency" is (so if your prefetch multiplier is 100 and your concurrency is 4, it attempts to pull up to 400 items from the broker's queue). Once it has those in memory, it deserializes them and hands each task to the processing pool individually.

What celery-batches changes is that the "main" celery worker process will queue tasks in memory until `flush_interval` or `flush_every` is reached, and then send that entire set of tasks to a worker in the processing pool together.

(None of this is really explained well in the Celery documentation, as far as I know... celery-batches should explain the difference it makes better, but that is hard without something else to refer to which discusses the "normal" way of doing things.)
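To make the difference concrete, here is a toy model of the buffering done in the worker's "main" process. This is a sketch, not the celery-batches implementation: the `BatchBuffer` class and its method names are invented, `handler` plays the role of the pool worker, and the `flush_interval` timer is modeled as an explicit final `flush()` call.

```python
# Toy model: requests accumulate in an in-memory buffer and are handed
# to the pool as ONE list once flush_every is reached; a real worker
# would also flush on a flush_interval timer.
from typing import Any, Callable


class BatchBuffer:
    def __init__(self, flush_every: int, handler: Callable[[list], Any]):
        self.flush_every = flush_every
        self.handler = handler      # stand-in for the processing pool
        self._buffer: list = []

    def add(self, request: Any) -> None:
        self._buffer.append(request)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        # Called when flush_every is hit, or by a timer at flush_interval.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self.handler(batch)


batches = []
buf = BatchBuffer(flush_every=3, handler=batches.append)
for i in range(7):
    buf.add(i)
buf.flush()                 # the "flush_interval expired" path
print(batches)              # [[0, 1, 2], [3, 4, 5], [6]]
```

The key point the sketch shows: the task body runs once per *batch* (the handler receives a list), not once per request.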
Great explanation, I didn't know these details about Celery! One more question left: will the flow be broken when we run three "main" workers on different physical nodes, or three workers on the same node, as the documentation shows? https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker

Or will it be handled by the broker somehow?
Each worker will have its own in-memory queue, so you need to think carefully about what you want the prefetch multiplier (and concurrency) to be in order to keep your resources well occupied. It will work fine, but it might not work at maximum efficiency.
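The per-worker queues can be illustrated with a toy calculation (all names invented): if the broker round-robins requests across two workers, each worker's private buffer fills at half the rate, so a `flush_every` sized for one worker may never be reached before `flush_interval` fires.

```python
# Toy illustration: flush_every is counted PER worker, so round-robin
# delivery of 10 requests across two workers with flush_every=10 leaves
# each worker holding a partial batch of 5; nothing flushes until the
# flush_interval timer fires.
flush_every = 10
workers = [[], []]              # each worker's private in-memory buffer
for i in range(10):
    workers[i % 2].append(i)    # broker round-robins the requests

flushed = [len(buf) >= flush_every for buf in workers]
print([len(buf) for buf in workers], flushed)   # [5, 5] [False, False]
```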
@clokep thank you for the detailed explanation! I think we can add a "How does it work?" section to README.md and add a link to this issue :D It explains everything.
I think this should get distilled and added to the documentation, yes. |
Hello, I agree that this is a great module and more examples/docs would be awesome. I made my own system to bulk-insert data into ES with Celery/RabbitMQ, but I encountered a lot of problems that I'm trying to solve, hopefully with this package. For example, it seems (thanks to your answers here) that this package can be used with Celery in its default pool, "prefork". My solution can't: I keep documents in memory until they reach a certain amount, then bulk-insert the data. But I also added a thread timer to check every X minutes whether there are documents left over, in order to push them (for instance when we don't receive new tasks).

The problem is that when Celery receives a SIGTERM (from Kubernetes for instance), in prefork it doesn't wait for the thread timer to execute before terminating. Therefore, if I had documents waiting to get pushed, they got lost. I had to use the "solo" pool for that, which carefully waits for the timer to execute before terminating. Unfortunately, Celery slowly increases its RAM usage and I can't make use of the

So I have a few questions:
Thank you very much for your answers!
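The buffer-plus-timer pattern described a few comments up can be sketched as follows. This is a generic illustration under my own naming (`BulkBuffer`, `sink`, etc. are all invented; `sink` stands in for an ES bulk-insert call), not the commenter's actual code:

```python
# Sketch: buffer documents and flush either when the buffer is full or
# when a periodic timer fires, so stragglers are not stuck forever.
import threading


class BulkBuffer:
    def __init__(self, max_docs: int, interval_s: float, sink):
        self.max_docs = max_docs
        self.interval_s = interval_s
        self.sink = sink                 # e.g. an ES bulk-insert call
        self.docs: list = []
        self.lock = threading.Lock()

    def add(self, doc) -> None:
        with self.lock:
            self.docs.append(doc)
            if len(self.docs) >= self.max_docs:
                self._flush_locked()

    def _flush_locked(self) -> None:
        if self.docs:
            batch, self.docs = self.docs, []
            self.sink(batch)

    def timer_tick(self) -> None:        # rescheduled via threading.Timer
        with self.lock:
            self._flush_locked()
        t = threading.Timer(self.interval_s, self.timer_tick)
        t.daemon = True                  # dies with the process: this is
        t.start()                        # the SIGTERM problem from above


flushes = []
buf = BulkBuffer(max_docs=2, interval_s=60.0, sink=flushes.append)
buf.add("a")
buf.add("b")                             # size-triggered flush
buf.add("c")                             # left waiting for the timer
print(flushes, buf.docs)                 # [['a', 'b']] ['c']
```

The daemon timer thread is exactly the weak point the commenter hit: on SIGTERM under prefork, `"c"` would still be sitting in memory when the process dies.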
I haven't run Celery under Kubernetes, but it should wait for any running tasks to finish.
Do you have
I think you're just asking "what happens if I restarted the worker while using this package?" -- I suspect whatever is in the queue (so a maximum of
Yes.
It should, and it does. But a thread timer is not a task per se; it is something alongside the task. Therefore, Celery thinks everything is done and terminates.
Yes, I have
What do you mean by "dropped"? I will test, yes. My question was more: does the package respect these options and consider a whole batch as a single "task" (just as it would if we were using Celery normally)? In that case Celery, and your package, would wait for the full batch to be executed before terminating/stopping listening to the queue, which would be great.

I'm just concerned about the same problem I have above with my solution: I want to be sure that when the Celery worker restarts because of these Celery options (either number of tasks or max memory reached), all the documents in the batch will get treated, meaning no loss. Right now, using these options alone, I don't have the control to trigger a bulk insert when these limits are reached. This is the whole problem. If I could detect that a Celery worker is about to restart because it has reached one of these settings' values, I could force a bulk insert and stop right there, with no loss. If you have any ideas (sorry for the long message), please share. Thanks!
I did some light testing and it seems that the enqueued requests do get picked back up after a worker restart. (Alternately, they should get picked up by a different worker if one exists.) If there's a running batch task, then yes, that gets treated as a single task and Celery will wait during shutdown for it to finish.

I can't really answer questions about how this compares to your current solution or whether it will work for your workload, sorry.
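The pick-up behavior described above can be modeled with a toy broker (this is not Celery internals; names are invented): messages that a worker has fetched but not finished remain unacknowledged, so if the worker dies before flushing, the broker redelivers them to the next worker.

```python
# Toy model of broker redelivery: unacknowledged messages go back on
# the queue when a worker's connection is lost.
from collections import deque

broker = deque(["t1", "t2", "t3"])
unacked = []

# Worker A prefetches everything, buffers it, then crashes pre-flush.
while broker:
    unacked.append(broker.popleft())
broker.extend(unacked)      # broker redelivers on the lost connection
unacked.clear()

# Worker B (or the restarted worker) picks the same requests back up.
picked_up = list(broker)
print(picked_up)            # ['t1', 't2', 't3']
```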
Hi! I was looking at plugins for Celery on GitHub and found celery-batches, but didn't quite get how it actually works.
It'd be great to add a "How does it work?" section to README.md, like the one at https://github.com/steinitzu/celery-singleton#how-does-it-work
For instance, I have a usual HTTP API setup:

- `gunicorn` that runs `django` with `celery`'s app inside (not a worker, just a publisher). `gunicorn` runs 5 processes of the `django` http app, and `django` is the only code that calls `celery`'s tasks, like this: `last_seen_user.delay(user=user_id, when=utcnow())`.
- `celery`'s workers are run with `celery --concurrency=4`.

How does `celery-batches` work in this setup? The documentation says:

Does it mean that Task requests are buffered inside one of the 5 processes of the `django` app that processed the request? I mean, if the user sends two HTTP requests and gunicorn sends them to different processes, it means that we have two local (in-memory) queues for this `BatchTask`, even for the same user, and after `flush_interval` or `flush_every` (which we count only for THIS process, not for all of them) we actually send a Task to the celery broker; in our case, we send two tasks even if it was the same user.