Celery not using all available worker processes due to unreleased LaxBoundedSemaphore #3434
Comments
Same issue: Celery worker processes are not picking up tasks and tasks accumulate. Same behavior after 8-10 hours.
I ran into the same issue here as well, but have found the cause to be prefork pool prefetching in our case. We first noticed it when one of our Django tasks ran a bad query that started taking 30 min to 1 hr, and we have since been able to reproduce it using the following two tasks:

```python
@app.task()
def inf_loop():
    while True:
        pass


@app.task()
def null_task():
    pass
```

If you issue the […]

I have actually traced this to be because the prefetching consumes a slot in the semaphore despite the task actually being in a queued state. Essentially, the following happens: […]
Therefore, eventually, all the semaphore slots will be allocated to the one process that is executing the long-running task, with the allocation accounted for by all the tasks that are waiting behind the currently executing task. At this point, all new tasks are pushed to the waiting […]
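The starvation mechanism described above can be sketched with a toy model (this is not Celery's actual dispatch code; the `Pool` class, its `dispatch` method, and the task names are all hypothetical, chosen only to illustrate how prefetched tasks consume slots while sitting in a busy child's buffer):

```python
from collections import deque

# Toy model of prefork prefetching: each dispatched task consumes a
# semaphore slot even while it only sits in a busy child's local queue,
# so idle children can be starved of work.

class Pool:
    def __init__(self, n):
        self.children = [deque() for _ in range(n)]  # per-child task buffers
        self.slots = n  # semaphore value: one slot per child process

    def dispatch(self, task, child):
        """Prefetch-style dispatch: take a slot, then buffer on a chosen child."""
        if self.slots > 0:
            self.slots -= 1
            self.children[child].append(task)
            return True
        return False  # no slot free: the task must wait globally

pool = Pool(2)
# Child 0 is stuck on a long-running task, yet prefetching keeps
# routing queued work to it, consuming every slot:
pool.dispatch("long_task", 0)
pool.dispatch("queued_task", 0)
# Both slots are now held by child 0's backlog; child 1 sits idle,
# but a new task cannot be dispatched to it:
print(pool.dispatch("new_task", 1))  # False: the idle child is starved
```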
I verified the issue still exists with Celery 4 when using the […]
Could anyone verify it against master again?
Will reopen if anyone verifies this exists on Celery master.
I'm seeing an issue on Celery 3.1.23 where I have a single worker node with 20 allocated worker processes. After some amount of time (~8 hrs for me) the worker will eventually start running fewer and fewer tasks concurrently, using fewer and fewer worker processes. This doesn't affect individual task performance, but it does lead to problems such as tasks being revoked as they queue up and hit time limits. Eventually the Celery node becomes inactive and stops consuming tasks.
I looked deeper and found that this appears to be an issue with the `LaxBoundedSemaphore` that is used by `celery.worker.WorkController._process_task_sem` to call `req.execute_using_pool`. Watching this semaphore object, it looks like `release` isn't called enough times over time. This leads to the semaphore value bottoming out at 0 and not allowing any more tasks to execute.
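To make the failure mode concrete, here is a minimal self-contained sketch of the `LaxBoundedSemaphore` contract (modeled on kombu's class of the same name, but reimplemented here so the example runs standalone; the task names are hypothetical): `acquire(callback)` runs the callback immediately while the value is above 0 and queues it otherwise, and each `release()` either hands the slot to the oldest waiter or restores the value. Every `release()` that is never called therefore leaves one more callback stranded, which is exactly the bottoming-out at 0 described above.

```python
from collections import deque

class LaxBoundedSemaphore:
    """Simplified model: callbacks run immediately while slots remain,
    otherwise they wait until someone calls release()."""

    def __init__(self, value):
        self.initial_value = self.value = value
        self._waiting = deque()

    def acquire(self, callback, *args):
        if self.value <= 0:
            self._waiting.append((callback, args))  # stranded until a release
            return False
        self.value -= 1
        callback(*args)
        return True

    def release(self):
        if self._waiting:
            callback, args = self._waiting.popleft()
            callback(*args)  # slot passes directly to the oldest waiter
        else:
            self.value = min(self.value + 1, self.initial_value)

executed = []
sem = LaxBoundedSemaphore(2)
sem.acquire(executed.append, "task-1")  # runs immediately, value -> 1
sem.acquire(executed.append, "task-2")  # runs immediately, value -> 0
sem.acquire(executed.append, "task-3")  # no slots left: queued
sem.release()                           # task-1 finishes, task-3 runs
# If the release for task-2 is lost, every later acquire queues forever:
sem.acquire(executed.append, "task-4")
print(executed)  # ['task-1', 'task-2', 'task-3'] -- task-4 never runs
```

This mirrors the observed symptom: each lost `release()` permanently retires one worker slot, so concurrency decays until the value hits 0 and the node stops executing tasks entirely.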
I'm not yet sure what's causing this behavior in 3.1.23. I'm going to give this same thing a try with 4.0.0rc3 and see if the issue reoccurs.