Celery not using all available worker processes due to unreleased LaxBoundedSemaphore #3434

Closed
colinmcintosh opened this issue Sep 8, 2016 · 5 comments

@colinmcintosh
Contributor

I'm seeing an issue on Celery 3.1.23 where I have a single worker node with 20 allocated worker processes. After some amount of time (~8 hours for me) the worker will start running fewer and fewer tasks concurrently, using fewer and fewer worker processes. This doesn't affect individual task performance, but it leads to problems such as tasks being revoked as they queue up and hit time limits. Eventually the Celery node becomes inactive and stops consuming tasks.

I looked deeper and found that this appears to be an issue with the LaxBoundedSemaphore used by celery.worker.WorkController._process_task_sem to call req.execute_using_pool. Watching this semaphore object, it looks like release isn't called enough times over time. This leads to the semaphore value bottoming out at 0 and not allowing any more tasks to execute.

# with a logging statement added to celery.worker.WorkController._process_task_sem
12704/MainThread      2016-09-08 04:13:01,008               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:3 waiting:0>
12704/MainThread      2016-09-08 04:13:02,015               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:2 waiting:0>
12704/MainThread      2016-09-08 04:13:02,042               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:1 waiting:0>
12704/MainThread      2016-09-08 04:13:03,008               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:1 waiting:0>
12704/MainThread      2016-09-08 04:13:03,022               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:0>
12704/MainThread      2016-09-08 04:13:04,005               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:1>
12704/MainThread      2016-09-08 04:13:04,012               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:2>
12704/MainThread      2016-09-08 04:13:04,021               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:3>
12704/MainThread      2016-09-08 04:13:07,036               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:4>
12704/MainThread      2016-09-08 04:13:07,040               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:5>
12704/MainThread      2016-09-08 04:13:07,051               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:6>
12704/MainThread      2016-09-08 04:13:07,058               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:7>
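
For context, here is a minimal sketch of the acquire/release behaviour behind those repr lines. It assumes a recent kombu where the class lives in kombu.asynchronous.semaphore (the kombu 3.x used by Celery 3.1 exposes it as kombu.async.semaphore instead); the pool size of 2 and the task names are made up for the example.

# minimal sketch, not actual Celery worker code
from kombu.asynchronous.semaphore import LaxBoundedSemaphore

def start(name):
    print("executing", name)

sem = LaxBoundedSemaphore(2)      # e.g. a pool of 2 processes

sem.acquire(start, "task-1")      # callback runs immediately, value 2 -> 1
sem.acquire(start, "task-2")      # callback runs immediately, value 1 -> 0
sem.acquire(start, "task-3")      # value is 0: callback is parked, waiting grows to 1
print(sem)                        # <LaxBoundedSemaphore at 0x... value:0 waiting:1>

sem.release()                     # a pool process reported READY: task-3 runs now
sem.release()                     # no waiters left: value goes back up to 1

Every task handed to the pool acquires a slot, and slots only come back when a pool process reports READY. If releases go missing, the value bottoms out at 0 and every new task lands in the waiting list, which is exactly what the log above shows.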

I'm not yet sure what's causing this behavior in 3.1.23. I'm going to try the same thing with 4.0.0rc3 and see if the issue recurs.

@mohit-tm

Same issue here: Celery worker processes are not picking up tasks and tasks accumulate. This also happens after 8-10 hours.

@yorinasub17

yorinasub17 commented Dec 9, 2016

I ran into the same issue here as well, but found the cause to be prefork pool prefetching in our case. We first noticed it when one of our Django tasks ran a bad query that started taking 30 minutes to 1 hour, and we have since been able to reproduce it with the following two tasks:

@app.task()
def inf_loop():
    while True:
        pass

@app.task()
def null_task():
    pass

If you issue the inf_loop task and then a bunch of null_tasks, we eventually reach the same situation where the worker stops allocating tasks despite having idle processes. In fact, you can issue as many null_tasks as you want at this point and they will never be allocated to the other processes.
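
Issuing them looks roughly like this (a sketch: it assumes the two tasks above live in a module named tasks, and that a worker is running with several pool processes under the -Ofast scheduling strategy):

# hypothetical reproduction script
from tasks import inf_loop, null_task

inf_loop.delay()          # ties up one pool process indefinitely
for _ in range(1000):
    null_task.delay()     # these drain quickly at first, but eventually stop
                          # being allocated even though most of the pool is idle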

I have traced this to the fact that a prefetched task consumes a slot in the semaphore despite still being in a queued state. Essentially, the following happens:

  • Each time a task is allocated, the pool acquires the semaphore, decrementing the available value by one, even if the target process did not actually start the task.
  • Each time a pool process becomes ready, it releases the semaphore, incrementing the available value.
  • With -Ofast, tasks are allocated in round-robin fashion, so tasks can be allocated to processes that are still executing a previously allocated task.
  • If a long-running task is allocated to a process, additional tasks allocated to that process are blocked behind it. These, despite not running, still consume a semaphore slot.

Therefore, eventually all the semaphore slots end up allocated to the one process that is executing the long-running task, with the slots accounted for by the tasks waiting behind the currently executing task.

At this point, all new tasks are pushed to the semaphore's waiting deque because the semaphore is exhausted, and they are effectively blocked until the process running the long task releases the slots it is holding. No further slots get released, because the other, idle processes never transition to the READY state (which is when the release method is called).
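
To make the bookkeeping concrete, here is a toy model of that end state (plain Python, not Celery's scheduler; the pool size of 3 and the task names are invented for the example):

# toy model of the slot accounting described above, not Celery code
from collections import deque

free_slots = 3                                   # semaphore value for a 3-process pool
waiting = deque()                                # the semaphore's waiting deque
buffers = {0: deque(), 1: deque(), 2: deque()}   # per-process prefetch buffers (-Ofast)

def send_to(proc, task):
    """Hand task to proc: a slot is consumed even if proc is busy."""
    global free_slots
    if free_slots == 0:
        waiting.append(task)                     # value bottomed out: task is parked
    else:
        free_slots -= 1
        buffers[proc].append(task)

def ready(proc):
    """proc finished its current task, reports READY and releases one slot."""
    global free_slots
    buffers[proc].popleft()
    free_slots += 1

send_to(0, "inf_loop")              # process 0 never reports READY for this one
send_to(1, "null-a"); ready(1)      # idle processes keep finishing and releasing...
send_to(0, "null-b")                # ...but round-robin keeps feeding process 0
send_to(2, "null-c"); ready(2)
send_to(0, "null-d")                # last free slot now also held behind inf_loop
send_to(1, "null-e")                # value is 0, so this lands in `waiting` forever

print(free_slots, list(waiting), buffers[0])
# 0 ['null-e'] deque(['inf_loop', 'null-b', 'null-d'])

All three slots end up held by tasks sitting behind the stuck process, and null-e can only start once process 0 itself releases a slot.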

Using -Ofair solved the issue, which makes sense because it prevents pushing tasks to a process that is already running one, so semaphore slots are no longer consumed by tasks that cannot actually start.
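
For reference, the fair strategy is just a worker flag (proj here is a placeholder application name, and the concurrency is only an example):

celery -A proj worker --concurrency=20 -Ofair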

@yorinasub17

I verified the issue still exists with Celery 4 when using the -Ofast option.

@auvipy
Member

auvipy commented Dec 20, 2017

could anyone verify it against master again?

@auvipy auvipy added this to the v5.0.0 milestone Dec 20, 2017
@auvipy auvipy modified the milestones: v5.0.0, v4.3 May 27, 2018
@auvipy
Member

auvipy commented Aug 12, 2018

will reopen if anyone verifies this exists on celery master

@auvipy auvipy closed this as completed Aug 12, 2018