
Celery worker stuck with unack mutex and high network usage with Kombu & Redis #1816

MathieuLamiot opened this issue Nov 2, 2023 · 1 comment
MathieuLamiot commented Nov 2, 2023

Context
We have an app running on Django and Celery 5.3.1, with Redis as the broker. The producers submit jobs (basically an image and some context) to the workers, which compress and/or convert the image and upload it to S3 object storage.
Here are our Django settings for Celery:

BROKER_URL = config("BROKER_URL")
CELERY_RESULT_BACKEND = config("CELERY_RESULT_BACKEND")
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERY_ACCEPT_CONTENT = ["json"]
CELERY_RESULT_SERIALIZER = "json"
CELERY_TASK_SERIALIZER = "json"
CELERY_ALWAYS_EAGER = False
CELERY_TASK_RESULT_EXPIRES = 600  # 600 seconds that's 10 minutes
LARGE_QUEUE_THRESHOLD_BYTES = config("LARGE_QUEUE_THRESHOLD_BYTES", cast=int, default=512_000)
CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL = 20  # Not sure this is taken into account
CELERYD_PREFETCH_MULTIPLIER = 1

We start the workers with the following CLI arguments:

              "--single-child",

              "--",
              "celery",
              "--app",
              "imagify",
              "worker",
              "--loglevel",
              "warning",
              "--max-memory-per-child",
              "262144", # 256 Mb in Kb
              "--max-tasks-per-child",
              "50",
              "--concurrency",
              "4",
              "--queues",
              "pro",
              "--heartbeat-interval",
              "30",

Investigation
We noticed very high network usage between the workers and the broker from time to time, ultimately leading to congestion (jobs not processed, producers stuck, Redis restarting). We investigated the issue:

  • The network usage always starts increasing 1 hour after a worker crashed (typically because of OOM).
  • The network usage increases every 5 minutes: every 5 minutes, another worker "gets stuck" with high network usage and stops logging. Its last log lines say a job has been completed; nothing after that.
  • The issue is perfectly correlated with the `unacked_mutex` key being set, expiring 5 minutes later, being set again, and so on.
  • When the issue occurs, the Redis command `hget` is used heavily; it is never used otherwise.
  • When the issue occurs, the Redis command `zrevrangebyscore` is executed once every 5 minutes.
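To make this easier to check on a live broker, here is a minimal diagnostic sketch using redis-py. It assumes kombu's default key names (unacked, unacked_index, unacked_mutex), no key prefix, the default 1-hour visibility timeout, and a placeholder broker URL, so adjust it to your setup:

# Quick look at kombu's unacked bookkeeping in Redis (diagnostic only).
import time
import redis

BROKER_URL = "redis://localhost:6379/0"  # assumption: replace with your BROKER_URL
r = redis.Redis.from_url(BROKER_URL)

# Hash of unacked payloads (this is what HGET reads during restoration).
unacked_count = r.hlen("unacked")

# Sorted set of delivery tags scored by delivery time
# (this is what ZREVRANGEBYSCORE scans on every restore cycle).
overdue = r.zcount("unacked_index", 0, time.time() - 3600)  # past the default 1 h visibility timeout

# Mutex that serialises restoration across workers (default TTL: 300 s).
mutex_ttl = r.ttl("unacked_mutex")

print(f"unacked payloads       : {unacked_count}")
print(f"tags older than 1 hour : {overdue}")
print(f"unacked_mutex TTL      : {mutex_ttl} s (-2 means the key is not set)")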

Issue
We think that workers try to restore unacked jobs once the visibility timeout is over but, for some reason, never manage to do it and get stuck. The `unacked_mutex` expiration of 5 minutes then lets another worker try, and get stuck in turn, every 5 minutes.
Based on the Redis commands executed, I would say the issue appears around here, in restore_visible in kombu's Redis transport:

for tag, score in visible or []:

Since we can tolerate a job occasionally not being processed, we changed the mutex TTL to 5 hours, which makes the issue propagate much more slowly and lets us keep the service running. It worked as expected. A more stable mitigation would probably be to set the visibility timeout ridiculously high (sketched below).
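For reference, a minimal sketch of how those knobs could be expressed through broker transport options, using the same old-style setting names as above. visibility_timeout is a documented Redis transport option; unacked_mutex_expire is an assumption about kombu's Redis transport (it appears to control the TTL of the unacked_mutex key), so treat both values as illustrative rather than as what we actually deployed:

# Illustrative mitigation sketch, not the exact change we deployed.
BROKER_TRANSPORT_OPTIONS = {
    "visibility_timeout": 7 * 24 * 3600,  # "ridiculously high": 7 days (default is 3600 s)
    "unacked_mutex_expire": 5 * 3600,     # assumption: forwarded to kombu; default is 300 s
}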
However, we would like to understand the actual root cause and have a proper fix for this.

Let me know if some more information would be useful to investigate this issue and/or suggest a proper fix. Thank you.

auvipy commented Nov 15, 2023

If you have more information to share to help find the root cause, please feel free :)
