Celery pool with Redis broker freezes silently or crashes & freezes when the hiredis package is installed #3898

Closed
2 tasks done
nigma opened this issue Mar 9, 2017 · 16 comments · Fixed by celery/kombu#954

Comments

@nigma

nigma commented Mar 9, 2017

Overview

Celery workers freeze silently, or crash and then freeze, after executing worker_max_tasks_per_child tasks when using the Redis broker and the hiredis (https://github.com/redis/hiredis-py) package is installed.

redis-py is the library Celery uses to connect to the Redis broker, and it ships with two parser classes, the PythonParser and the HiredisParser. By default, redis-py will attempt to use the HiredisParser if the hiredis module is installed, and will fall back to the PythonParser otherwise.
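For reference, a minimal sketch of how the parser gets selected and how to pin the pure-Python parser while diagnosing. This assumes redis-py 2.10.x, where Connection accepts a parser_class keyword; it does not change what Celery/kombu do internally, it only helps confirm whether the hiredis parser is involved:

import redis
from redis.connection import DefaultParser, PythonParser

# DefaultParser resolves to HiredisParser when the hiredis module is importable,
# and to PythonParser otherwise.
print(DefaultParser.__name__)

# Pin the pure-Python parser for a direct client as a diagnostic step.
pool = redis.ConnectionPool.from_url('redis://redis:6379/1',
                                     parser_class=PythonParser)
client = redis.StrictRedis(connection_pool=pool)
client.ping()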

Celery is not able to recover from this error and restart the broker connection, which causes a permanent deadlock.

Note: this may also be a problem with the redis-py and hiredis packages. I'm filing this issue here because it took me some time to reproduce and narrow down the problem (previous report in #2464 (comment)), and I see a couple of similar issues reported that may be caused by this conflict rather than by an AWS Elastic Beanstalk / AWS ElastiCache Redis setup.

Checklist

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:3.6.0
            billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://redis:6379/2

accept_content: ['json']
broker_connection_max_retries: 100
broker_connection_retry: True
broker_connection_timeout: 4
broker_heartbeat: 20
broker_pool_limit: 10
broker_transport_options: {
 'socket_keepalive': True, 'socket_timeout': 90}
broker_url: 'redis://redis:6379/1'
redis_max_connections: 10
redis_socket_connect_timeout: None
redis_socket_timeout: 10
result_backend: 'redis://redis:6379/2'
result_expires: datetime.timedelta(1)
task_acks_late: True
task_compression: 'gzip'
task_ignore_result: True
task_publish_retry: True
task_serializer: 'json'
task_time_limit: 600.0
task_track_started: True
worker_concurrency: 5
worker_disable_rate_limits: False
worker_hijack_root_logger: True
worker_lost_wait: 10.0
worker_max_memory_per_child: None
worker_max_tasks_per_child: 10
worker_pool: 'prefork'
worker_prefetch_multiplier: 1
worker_send_task_events: True
  • I have verified that the issue exists against the master branch of Celery.

Environment

  1. Celery 4.0.2 from release or master branch (see requirements.txt)
  2. Redis 3.2.4 or 3.2.8 or AWS ElastiCache Redis
  3. redis (redis-py) package 2.10.5 or 8474249 (current master)
  4. hiredis package 0.2.0 or 5cfdf41 (current master)
  5. Configuration - https://github.com/nigma/celery-worker-deadlock-debug/blob/master/main.py
  6. Command line:
celery worker --app main --hostname default@%h --events --loglevel info -O fair
  7. requirements.txt
#celery[redis,tblib]==4.0.2

git+https://github.com/celery/celery.git@144f88b4e1be21780e737d4be5a734b19f1cf511
git+https://github.com/mher/flower.git@3c5f1f15dc6db018c36cf80df443a4c11f4cd149

kombu==4.0.2
billiard==3.5.0.2
anyjson==0.3.3
amqp==2.1.4
vine==1.1.3
setproctitle==1.1.10
tblib==1.3.0

redis==2.10.5
#git+https://github.com/andymccurdy/redis-py.git@84742495fd952f878fa6d0725d03d867ec39cd5

hiredis==0.2.0
#git+https://github.com/redis/hiredis-py.git

Steps to reproduce

  1. Open the sample project from https://github.com/nigma/celery-worker-deadlock-debug
  2. Build and run the project using docker-compose to reproduce the exact environment and behaviour. Execute make clean && make run.
  3. Repeat several times to observe several kinds of freezes, crashes and stack tracebacks.

Expected behavior

Celery pool workers do not freeze or crash when redis-py uses the hiredis parser to connect to the Redis broker.

Actual behavior

  1. Celery worker pool freezes and goes offline without any error message after reaching worker_max_tasks_per_child processed tasks.
  2. Celery freezes with 100% CPU usage when workers try to exit after reaching worker_max_tasks_per_child processed tasks.
    • redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.
  3. Worker pool goes offline after workers reach worker_max_tasks_per_child processed tasks, resumes after the broker connection timeout, and then crashes after the next batch of tasks:
    • Connection to broker lost / BrokenPipeError
    • billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 155.
    • redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
    • redis.exceptions.TimeoutError: Timeout reading from socket
    • redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.
    • Kombu: Unrecoverable error: AttributeError("'NoneType' object has no attribute 'fileno'",)

The above problems are not observed when the hiredis package is not present on the system.

@canassa

canassa commented Apr 21, 2017

I am having a similar issue, but I am using gevent + redis 2.10.5. I ran strace on the stuck Celery process and it's looping on this:

sendto(194, "*6\r\n$4\r\nSCAN\r\n$6\r\n109882\r\n$5\r\nMA"..., 85, 0, NULL, 0) = 85
recvfrom(194, 0x8c97470, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

@AndreCimander

We are also seeing this issue on Celery 3.1.x and 4.0.x (including master). This is maddening; we have already wasted four days experimenting with settings and Celery versions, we even tried upgrading to Celery 4 and switching the broker to RabbitMQ...

Our eventlet workers just decide to idle for several minutes while the process is stuck in a "Resource temporarily unavailable" loop. We can't switch to prefork because we use several hundred threads per worker process, as we are massively I/O (web API) bound.

We are currently running with RabbitMQ as the message broker and Redis as the result backend (with almost all tasks set to ignore_result=True):

recvfrom(7, 0x7f4ce3bb4714, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(18, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=7, u64=13598036333682491399}}) = 0
epoll_wait(18, [], 1023, 209)           = 0
epoll_wait(18, [], 1023, 790)           = 0

As we require near real-time data, this is currently fracking our production services...

@thedrow
Member

thedrow commented Jul 9, 2017

@AndreCimander If this doesn't work with the RabbitMQ broker as well it sounds like a different issue. The analysis provided here says that using the Redis broker in conjunction with worker_max_tasks_per_child triggers this bug.
If this isn't the case for you as well, please open a new issue about it.
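For quick reference, a condensed sketch of that triggering combination, illustrative only; it mirrors the settings from the report above rather than the OP's actual main.py, and the app/task names are made up:

from celery import Celery

app = Celery('repro', broker='redis://redis:6379/1')
app.conf.update(
    worker_max_tasks_per_child=10,   # the freeze is reported when children are recycled
    worker_prefetch_multiplier=1,
    task_acks_late=True,
)

@app.task
def noop():
    return None

According to the report above, the freeze only appears when the hiredis package is also installed.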

@thedrow
Member

thedrow commented Jul 9, 2017

@canassa Can you please provide more information like the OP did? Does the problem go away when worker_max_tasks_per_child isn't used?

@wolph
Contributor

wolph commented Jun 18, 2018

I've just encountered a similar issue; removing the worker_max_tasks_per_child setting doesn't fix it for me. I should note that I use RabbitMQ as a backend (but I've tried Redis as well).

I do have a few Redis connections, so it could still be a Redis thing. It's definitely driving me crazy... I have no idea how to solve it, but it seems something broke recently for me.

@SHxKM

SHxKM commented Sep 22, 2018

I can confirm this exists for me on Celery 4.2.1, kombu==4.2.1, and hiredis 0.2.0. Removing hiredis "solves" this issue.

@leveryd

leveryd commented Sep 26, 2018

celery==4.1.0, with Redis as the broker.
I encounter this problem too.

@seocam

seocam commented Nov 19, 2018

I'm suffering from exactly this problem. In my case hiredis was installed as a dependency of Django Channels, and then the problem started. As a temporary alternative I'm going to try removing worker_max_tasks_per_child and using worker_max_memory_per_child in the meantime.
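A minimal sketch of that temporary workaround (the broker URL and the memory value are illustrative, not from my project):

from celery import Celery

app = Celery('proj', broker='redis://redis:6379/1')

app.conf.update(
    # worker_max_tasks_per_child deliberately left unset (the reported trigger);
    # recycle children by memory usage instead. The value is in kilobytes (~200 MB).
    worker_max_memory_per_child=200000,
)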

@seocam

seocam commented Nov 19, 2018

Just to let everybody know: the problem also happens when using worker_max_memory_per_child even without worker_max_tasks_per_child set.

@popravich

Can anyone here check whether this PR (celery/kombu#954) makes a difference?

@seocam

seocam commented Nov 21, 2018

@popravich I'm deploying that in our testing servers today. In a few hours I can let you know if that worked for us.

@seocam

seocam commented Nov 21, 2018

@popravich PR on Kombu (celery/kombu#954) fixes the issue in my case.

@liniribeiro

This problem has started to happen again.

@abdulla9012

abdulla9012 commented Mar 1, 2021

redis.exceptions.ResponseError: MOVED 3367 172.31.31.176:6379

What is the meaning of this exception from Redis (AWS ElastiCache) with Django? This IP address is not even related to my EC2 instance. Can anybody help?

settings.py for the ec2 production

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://trupinion-redis.32pre0.clustercfg.use2.cache.amazonaws.com:6379/0",
        "OPTIONS": {
             "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
        "KEY_PREFIX": "example"
    }
}

CELERY_BROKER_URL = 'redis://trupinion-redis.32pre0.clustercfg.use2.cache.amazonaws.com:6379/0'
CELERY_RESULT_BACKEND = 'django-db'
CELERY_CACHE_BACKEND = 'django-cache'
CELERY_ENABLE_UTC = False
CELERY_TIMEZONE = "Asia/Kolkata"

@auvipy
Member

auvipy commented Mar 2, 2021

This problem has started to happen again.

What problem are you facing? Can you elaborate more on your setup?

@abdulla9012

This problem has started to happen again.

What problem are you facing? Can you elaborate more on your setup?

Hello, thanks for your response.
The real problem occurs when I start the worker against AWS ElastiCache Redis with either of these Celery commands:
celery -A projname worker -l info
celery -A projname worker --loglevel=debug --concurrency=4

Either of the above works fine in the development environment,
but in AWS EC2 production it throws a Redis exception:

 -------------- celery@ip-.us-east-2.compute.internal v4.4.7 (cliffs)
--- ***** -----
-- ******* ---- Linux-4.14.209-160.339.amzn2.x86_64-x86_64-with-glibc2.2.5 2021-03-02 12:38:17
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         trupinion:0x7f3095e4ab90
- ** ---------- .> transport:   redis://.cache.amazonaws.com:6379/0
- ** ---------- .> results:
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
. emailTemplate.services.mailer.send_email
. emailTemplate.services.mailer.send_emails
. emailTemplate.tasks.check_queued_task
. survey.tasks.update_overall_survey_ratings

[2021-03-02 12:38:17,230: CRITICAL/MainProcess] Unrecoverable error: ResponseError('MOVED 14663 172.31.22.150:6379')
Traceback (most recent call last):
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/worker/worker.py", line 208, in start
self.blueprint.start(self)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/worker/consumer/connection.py", line 23, in start
c.connection = c.connect()
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 407, in connect
conn.transport.register_with_event_loop(conn.connection, self.hub)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/kombu/transport/redis.py", line 1057, in register_with_event_loop
cycle.on_poll_init(loop.poller)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/kombu/transport/redis.py", line 332, in on_poll_init
num=channel.unacked_restore_limit,
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/kombu/transport/redis.py", line 197, in restore_visible
self.unacked_mutex_expire):
File "/usr/lib64/python3.7/contextlib.py", line 112, in enter
return next(self.gen)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/kombu/transport/redis.py", line 117, in Mutex
lock_acquired = lock.acquire(blocking=False)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/redis/lock.py", line 187, in acquire
if self.do_acquire(token):
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/redis/lock.py", line 203, in do_acquire
if self.redis.set(self.name, token, nx=True, px=timeout):
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/redis/client.py", line 1801, in set
return self.execute_command('SET', *pieces)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/redis/client.py", line 901, in execute_command
return self.parse_response(conn, command_name, **options)
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/redis/client.py", line 915, in parse_response
response = connection.read_response()
File "/home/ec2-user/Project/Trupinion_Backend/env/lib/python3.7/site-packages/redis/connection.py", line 756, in read_response
raise response
redis.exceptions.ResponseError: MOVED 14663 172.31.22.150:6379
