Celery not using all available worker processes due to unreleased LaxBoundedSemaphore #3434

Closed
colinmcintosh opened this issue Sep 8, 2016 · 5 comments

@colinmcintosh
Contributor

I'm seeing an issue on Celery 3.1.23 where I have a single worker node with 20 allocated worker processes. After some amount of time (~8 hours for me) the worker will start running fewer and fewer tasks concurrently, using fewer and fewer worker processes. This doesn't affect individual task performance, but it leads to problems such as tasks being revoked as they queue up and hit time limits. Eventually the Celery node becomes inactive and stops consuming tasks.

I looked deeper and found that this appears to be an issue with the LaxBoundedSemaphore used by celery.worker.WorkController._process_task_sem to call req.execute_using_pool. Watching this semaphore object, it looks like release isn't called enough times over time. This leads to the semaphore value bottoming out at 0 and not allowing any more tasks to execute.

# with a logging statement added to celery.worker.WorkController._process_task_sem
12704/MainThread      2016-09-08 04:13:01,008               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:3 waiting:0>
12704/MainThread      2016-09-08 04:13:02,015               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:2 waiting:0>
12704/MainThread      2016-09-08 04:13:02,042               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:1 waiting:0>
12704/MainThread      2016-09-08 04:13:03,008               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:1 waiting:0>
12704/MainThread      2016-09-08 04:13:03,022               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:0>
12704/MainThread      2016-09-08 04:13:04,005               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:1>
12704/MainThread      2016-09-08 04:13:04,012               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:2>
12704/MainThread      2016-09-08 04:13:04,021               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:3>
12704/MainThread      2016-09-08 04:13:07,036               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:4>
12704/MainThread      2016-09-08 04:13:07,040               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:5>
12704/MainThread      2016-09-08 04:13:07,051               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:6>
12704/MainThread      2016-09-08 04:13:07,058               celery.worker:_process_task_sem        INFO: Worker semaphore: <LaxBoundedSemaphore at 0x7f95c1406690 value:0 waiting:7>
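
For context, here is a minimal sketch of the acquire/release behaviour behind those repr lines. It assumes a recent kombu where the class lives in kombu.asynchronous.semaphore (the kombu 3.x used by Celery 3.1 exposes it as kombu.async.semaphore instead); the pool size of 2 and the task names are made up for the example.

# minimal sketch, not actual Celery worker code
from kombu.asynchronous.semaphore import LaxBoundedSemaphore

def start(name):
    print("executing", name)

sem = LaxBoundedSemaphore(2)      # e.g. a pool of 2 processes

sem.acquire(start, "task-1")      # callback runs immediately, value 2 -> 1
sem.acquire(start, "task-2")      # callback runs immediately, value 1 -> 0
sem.acquire(start, "task-3")      # value is 0: callback is parked, waiting grows to 1
print(sem)                        # <LaxBoundedSemaphore at 0x... value:0 waiting:1>

sem.release()                     # a pool process reported READY: task-3 runs now
sem.release()                     # no waiters left: value goes back up to 1

Every task handed to the pool acquires a slot, and slots only come back when a pool process reports READY. If releases go missing, the value bottoms out at 0 and every new task lands in the waiting list, which is exactly what the log above shows.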

I'm not yet sure what's causing this behavior in 3.1.23. I'm going to try the same thing with 4.0.0rc3 and see if the issue recurs.

@mohit-tm

Same issue here: Celery worker processes are not picking up tasks and tasks accumulate. This also happens after 8-10 hours.

@yorinasub17

yorinasub17 commented Dec 9, 2016

I ran into the same issue here as well, but found the cause to be prefork pool prefetching in our case. We first noticed it when one of our Django tasks ran a bad query that started taking 30 minutes to 1 hour, and we have since been able to reproduce it with the following two tasks:

@app.task()
def inf_loop():
    while True:
        pass

@app.task()
def null_task():
    pass

If you issue the inf_loop task and then a bunch of null_tasks, we eventually reach the same situation where the worker stops allocating tasks despite having idle processes. In fact, you can issue as many null_tasks as you want at this point and they will never be allocated to the other processes.
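
Issuing them looks roughly like this (a sketch: it assumes the two tasks above live in a module named tasks, and that a worker is running with several pool processes under the -Ofast scheduling strategy):

# hypothetical reproduction script
from tasks import inf_loop, null_task

inf_loop.delay()          # ties up one pool process indefinitely
for _ in range(1000):
    null_task.delay()     # these drain quickly at first, but eventually stop
                          # being allocated even though most of the pool is idle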

I have traced this to the fact that a prefetched task consumes a slot in the semaphore despite still being in a queued state. Essentially, the following happens:

  • Each time a task is allocated, the pool acquires the semaphore, decrementing the available value by one, even if the target process did not actually start the task.
  • Each time a pool process becomes ready, it releases the semaphore, incrementing the available value.
  • With -Ofast, tasks are allocated in round-robin fashion, so tasks can be allocated to processes that are still executing a previously allocated task.
  • If a long-running task is allocated to a process, additional tasks allocated to that process are blocked behind it. These, despite not running, still consume a semaphore slot.

Therefore, eventually all the semaphore slots end up allocated to the one process that is executing the long-running task, with the slots accounted for by the tasks waiting behind the currently executing task.

At this point, all new tasks are pushed to the semaphore's waiting deque because the semaphore is exhausted, and they are effectively blocked until the process running the long task releases the slots it is holding. No further slots get released, because the other, idle processes never transition to the READY state (which is when the release method is called).
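
To make the bookkeeping concrete, here is a toy model of that end state (plain Python, not Celery's scheduler; the pool size of 3 and the task names are invented for the example):

# toy model of the slot accounting described above, not Celery code
from collections import deque

free_slots = 3                                   # semaphore value for a 3-process pool
waiting = deque()                                # the semaphore's waiting deque
buffers = {0: deque(), 1: deque(), 2: deque()}   # per-process prefetch buffers (-Ofast)

def send_to(proc, task):
    """Hand task to proc: a slot is consumed even if proc is busy."""
    global free_slots
    if free_slots == 0:
        waiting.append(task)                     # value bottomed out: task is parked
    else:
        free_slots -= 1
        buffers[proc].append(task)

def ready(proc):
    """proc finished its current task, reports READY and releases one slot."""
    global free_slots
    buffers[proc].popleft()
    free_slots += 1

send_to(0, "inf_loop")              # process 0 never reports READY for this one
send_to(1, "null-a"); ready(1)      # idle processes keep finishing and releasing...
send_to(0, "null-b")                # ...but round-robin keeps feeding process 0
send_to(2, "null-c"); ready(2)
send_to(0, "null-d")                # last free slot now also held behind inf_loop
send_to(1, "null-e")                # value is 0, so this lands in `waiting` forever

print(free_slots, list(waiting), buffers[0])
# 0 ['null-e'] deque(['inf_loop', 'null-b', 'null-d'])

All three slots end up held by tasks sitting behind the stuck process, and null-e can only start once process 0 itself releases a slot.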

Using -Ofair solved the issue, which makes sense because it prevents pushing tasks to a process that is already running one, so semaphore slots are no longer consumed by tasks that cannot actually start.
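
For reference, the fair strategy is just a worker flag (proj here is a placeholder application name, and the concurrency is only an example):

celery -A proj worker --concurrency=20 -Ofair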

@yorinasub17

I verified the issue still exists with Celery 4 when using the -Ofast option.

@auvipy
Member

auvipy commented Dec 20, 2017

could anyone verify it against master again?

@auvipy auvipy added this to the v5.0.0 milestone Dec 20, 2017
@auvipy auvipy modified the milestones: v5.0.0, v4.3 May 27, 2018
@auvipy
Member

auvipy commented Aug 12, 2018

will reopen if anyone verifies this exists on celery master

@auvipy auvipy closed this as completed Aug 12, 2018