
Celery worker stuck with unack mutex and high network usage with Kombu & Redis #1816

MathieuLamiot opened this issue Nov 2, 2023 · 1 comment
MathieuLamiot commented Nov 2, 2023

Context
We have an app running on Django and Celery 5.3.1, with Redis as the broker. The producers submit jobs (basically an image and some context) to the workers, which compress and/or convert the image and upload it to S3 object storage.
Here are our Django settings for Celery:

BROKER_URL = config("BROKER_URL")
CELERY_RESULT_BACKEND = config("CELERY_RESULT_BACKEND")
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERY_ACCEPT_CONTENT = ["json"]
CELERY_RESULT_SERIALIZER = "json"
CELERY_TASK_SERIALIZER = "json"
CELERY_ALWAYS_EAGER = False
CELERY_TASK_RESULT_EXPIRES = 600  # 600 seconds that's 10 minutes
LARGE_QUEUE_THRESHOLD_BYTES = config("LARGE_QUEUE_THRESHOLD_BYTES", cast=int, default=512_000)
CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL = 20  # Not sure this is taken into account
CELERYD_PREFETCH_MULTIPLIER = 1

We start the workers with the following CLI arguments:

              "--single-child",

              "--",
              "celery",
              "--app",
              "imagify",
              "worker",
              "--loglevel",
              "warning",
              "--max-memory-per-child",
              "262144", # 256 Mb in Kb
              "--max-tasks-per-child",
              "50",
              "--concurrency",
              "4",
              "--queues",
              "pro",
              "--heartbeat-interval",
              "30",

Investigation
We noticed very high network usage between the workers and the broker from time to time, ultimately leading to congestion (jobs not processed, producers stuck, Redis restarting). We investigated the issue:

  • The network usage always starts increasing 1 hour after a worker crashed (typically because of OOM).
  • The network usage increases every 5 minutes: every 5 minutes, another worker "gets stuck" with high network usage and stops logging. Its last log lines say a job has been completed; nothing after that.
  • The issue is perfectly correlated with the `unacked_mutex` key being set, expiring 5 minutes later, being set again, and so on.
  • When the issue occurs, the Redis command `hget` is used heavily; it is never used otherwise.
  • When the issue occurs, the Redis command `zrevrangebyscore` is executed once every 5 minutes.
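To make this easier to check on a live broker, here is a minimal diagnostic sketch using redis-py. It assumes kombu's default key names (unacked, unacked_index, unacked_mutex), no key prefix, the default 1-hour visibility timeout, and a placeholder broker URL, so adjust it to your setup:

# Quick look at kombu's unacked bookkeeping in Redis (diagnostic only).
import time
import redis

BROKER_URL = "redis://localhost:6379/0"  # assumption: replace with your BROKER_URL
r = redis.Redis.from_url(BROKER_URL)

# Hash of unacked payloads (this is what HGET reads during restoration).
unacked_count = r.hlen("unacked")

# Sorted set of delivery tags scored by delivery time
# (this is what ZREVRANGEBYSCORE scans on every restore cycle).
overdue = r.zcount("unacked_index", 0, time.time() - 3600)  # past the default 1 h visibility timeout

# Mutex that serialises restoration across workers (default TTL: 300 s).
mutex_ttl = r.ttl("unacked_mutex")

print(f"unacked payloads       : {unacked_count}")
print(f"tags older than 1 hour : {overdue}")
print(f"unacked_mutex TTL      : {mutex_ttl} s (-2 means the key is not set)")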

Issue
We think that workers try to restore unacked jobs once the visibility timeout is over but, for some reason, never manage to do it and get stuck. The `unacked_mutex` expiration of 5 minutes then lets another worker try, and get stuck in turn, every 5 minutes.
Based on the Redis commands executed, I would say the issue appears around here, in restore_visible in kombu's Redis transport:

for tag, score in visible or []:

Since we can tolerate a job occasionally not being processed, we changed the mutex TTL to 5 hours, which makes the issue propagate much more slowly and lets us keep the service running. It worked as expected. A more stable mitigation would probably be to set the visibility timeout ridiculously high (sketched below).
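For reference, a minimal sketch of how those knobs could be expressed through broker transport options, using the same old-style setting names as above. visibility_timeout is a documented Redis transport option; unacked_mutex_expire is an assumption about kombu's Redis transport (it appears to control the TTL of the unacked_mutex key), so treat both values as illustrative rather than as what we actually deployed:

# Illustrative mitigation sketch, not the exact change we deployed.
BROKER_TRANSPORT_OPTIONS = {
    "visibility_timeout": 7 * 24 * 3600,  # "ridiculously high": 7 days (default is 3600 s)
    "unacked_mutex_expire": 5 * 3600,     # assumption: forwarded to kombu; default is 300 s
}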
However, we would like to understand the actual root cause and have a proper fix for this.

Let me know if some more information would be useful to investigate this issue and/or suggest a proper fix. Thank you.

auvipy commented Nov 15, 2023

If you have more information to share to help find the root cause, please feel free :)
