
dockerized worker with redis gives broken pipe at every startup #5272

Closed
WisdomPill opened this issue Jan 8, 2019 · 11 comments

Comments

@WisdomPill

WisdomPill commented Jan 8, 2019

The health check fails after a few tries, in about 5 minutes, and when I ping all the workers using

celery inspect ping -A project

I don't find the worker, though I do find the other ones.
I have three dockerized workers on Amazon ECS. Two of them have light work to do and don't depend on a VPN, so they run on Fargate; the one that gives me problems runs on EC2 in host networking mode, on a server with a VPN. The other two workers don't have the same startup problem.
This worker has a beat schedule that makes it update some sensor statuses every minute. It uses group calls to do the work in parallel, roughly as in the sketch below.
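
For reference, the group call looks roughly like this (a minimal sketch; the task name is taken from the start log below, but the helper and arguments are hypothetical stand-ins, not the real code):

from celery import group

from sensor.tasks.statuses import update_sensor_status  # task name as it appears in the start log

def update_statuses():
    # fan out one update task per sensor and run them in parallel
    sensor_ids = get_sensor_ids()  # hypothetical helper returning the sensors to refresh
    group([update_sensor_status.s(sensor_id) for sensor_id in sensor_ids]).apply_async()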

Report with part of the Django settings

software -> celery:4.2.1 (windowlicker) kombu:4.2.1 py:3.6.8
            billiard:3.5.0.4 redis:2.10.6
platform -> system:Linux arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

CELERY_ACCEPT_CONTENT: ['application/json']
CELERY_BEAT_SCHEDULE: {
    'curator_indexes_migration': {   'schedule': <crontab: 0 4 * * * (m/h/d/dM/MY)>,
                                     'task': 'sensor.tasks.statuses.curator_indexes_migration'},
    'fetch_all_configurations_backup': {   'schedule': <crontab: 0 9 * * * (m/h/d/dM/MY)>,
                                           'task': 'sensor.tasks.statuses.fetch_all_configurations_backup'},
    'update_images': {   'schedule': <crontab: 0 * * * * (m/h/d/dM/MY)>,
                         'task': 'sensor.tasks.statuses.update_images'},
    'update_statuses': {   'schedule': <crontab: * * * * * (m/h/d/dM/MY)>,
                           'task': 'sensor.tasks.statuses.update_statuses'}}
CELERY_BROKER_URL: 'redis://safety-staging.h1q2ld.0001.euw1.cache.amazonaws.com:6379/1'
CELERY_RESULT_BACKEND: None
CELERY_RESULT_SERIALIZER: 'json'
CELERY_TASK_ROUTES: {
    'notification.tasks.*': {'queue': 'notification'},
    'sensor.tasks.*': {'queue': 'sensor'}}
CELERY_TASK_SERIALIZER: 'json'
CELERY_TIMEZONE: 'Europe/Rome'

Part of requirements.txt concerning Celery and Redis

aioredis==1.2.0
celery==4.2.1
Django==1.11
hiredis==0.2.0
kombu==4.2.1
redis==2.10.6

Start script

#!/usr/bin/env bash
# generate a random 32-character suffix so each container gets a unique node name
CELERY_DOMAIN=$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 32 | head -n 1)
# persist it so the health check script can address this exact worker
echo "${CELERY_DOMAIN}" > /home/user/celery_domain.txt
printf "Celery domain is %s\n" "$CELERY_DOMAIN"
celery worker -A centroservizi -l INFO --autoscale=4,2 -Q sensor -n "sensor@${CELERY_DOMAIN}"

Health check script

#!/usr/bin/env bash
# read back the node name written by the start script
CELERY_DOMAIN=$(cat /home/user/celery_domain.txt)
printf "Celery domain is %s\n" "$CELERY_DOMAIN"
# ping only this worker, by its node name
celery inspect ping -A centroservizi -d "sensor@${CELERY_DOMAIN}"
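
The same ping can also be done from Python (a sketch; the import path is a guess based on the -A centroservizi argument above):

# hypothetical import path; adjust to wherever the Celery app object lives
from centroservizi.celery import app

domain = open('/home/user/celery_domain.txt').read().strip()
replies = app.control.ping(destination=['sensor@' + domain], timeout=5.0)
print(replies)  # e.g. [{'sensor@...': {'ok': 'pong'}}] when the worker is reachable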

Start log

Celery domain is G9ehGHbxIlsYFPb1JeW6Pdlar0gMIUue
[2019-01-08 07:14:57,184: INFO/MainProcess] Connected to redis://<elastic cache>.0001.euw1.cache.amazonaws.com:6379/1
[2019-01-08 07:14:57,200: INFO/MainProcess] mingle: searching for neighbors
[2019-01-08 07:14:58,306: INFO/MainProcess] mingle: sync with 2 nodes
[2019-01-08 07:14:58,307: INFO/MainProcess] mingle: sync complete
[2019-01-08 07:14:58,332: INFO/MainProcess] sensor@G9ehGHbxIlsYFPb1JeW6Pdlar0gMIUue ready.
[2019-01-08 07:14:59,026: INFO/MainProcess] Received task: sensor.tasks.statuses.update_sensor_status[2fa6e725-0a24-42c9-9171-572d7db3d31e]
[2019-01-08 07:14:59,033: INFO/MainProcess] Received task: sensor.tasks.statuses.update_sensor_status[296aaeb7-926a-4d1b-b62d-cec1783dbc70]
[2019-01-08 07:14:59,036: INFO/MainProcess] Received task: sensor.tasks.statuses.update_sensor_status[43cd23b1-5304-41df-ac82-1168cf53da60]
[2019-01-08 07:14:59,036: INFO/MainProcess] Scaling up 1 processes.
[2019-01-08 07:14:59,204: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 590, in send_packed_command
    self._sock.sendall(item)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 317, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 593, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 276, in create_loop
    tick_callback()
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 1033, in on_poll_start
    cycle_poll_start()
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 315, in on_poll_start
    self._register_BRPOP(channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 301, in _register_BRPOP
    channel._brpop_start()
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 707, in _brpop_start
    self.client.connection.send_command('BRPOP', *keys)
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 603, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

If I can give you any other info on the issue, just ask.

@WisdomPill
Author

I changed the worker start script: I'm not using autoscale anymore, I'm using the concurrency parameter directly.

#!/usr/bin/env bash
CELERY_DOMAIN=$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 32 | head -n 1)
echo "${CELERY_DOMAIN}" > /home/user/celery_domain.txt
printf "Celery domain is %s\n" "$CELERY_DOMAIN"
celery worker -A centroservizi -l INFO --concurrency=4 -Q sensor -n "sensor@${CELERY_DOMAIN}"

@WisdomPill
Author

WisdomPill commented Jan 8, 2019

Found a related issue in kombu: https://github.com/celery/kombu/issues/681

@thedrow
Member

thedrow commented Jan 8, 2019

Could you please attach the Redis log from when the connection is severed? It could be that we're sending a large amount of data; see redis/redis-py#986, redis/redis#4888 and https://stackoverflow.com/questions/43204496/broken-pipe-error-redis.
It could be that you are creating a canvas large enough to generate a payload that Redis rejects (a sketch for checking the relevant buffer limits follows below).
Could this be a duplicate of celery/kombu#681?
Have you tried using RabbitMQ to see if this is Redis specific?
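
If you have access to the broker, you could check the configured output buffer limits, which are one common cause of Redis dropping a client mid-write (a sketch; note that ElastiCache does not expose CONFIG, so this only works against a self-managed Redis):

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=1)  # point this at your broker
# clients that exceed these limits are disconnected by the server,
# which shows up on the client side as a broken pipe
print(r.config_get('client-output-buffer-limit'))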

@thedrow
Member

thedrow commented Jan 8, 2019

How is this related to #4363?

@WisdomPill
Author

Sorry, my bad, I confused the issue subjects. I will remove the comment; does that also remove the reference on GitHub?

@WisdomPill
Author

@thedrow no, I haven't tried it with RabbitMQ yet. I use ElastiCache on AWS, so I can't get the Redis logs. I will try to reproduce it locally, maybe with the same data in order to have the same load.

@WisdomPill
Author

I'm quite sure it's not a size issue: all the parameters that I pass are either booleans, strings, numbers or really small dictionaries (one key, sometimes two). A rough size check is below.
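
A rough way to back that up (a sketch: it builds the same shape of canvas as the real code, with stand-in arguments, so it only approximates what kombu actually puts on the wire):

import json

from celery import group
from sensor.tasks.statuses import update_sensor_status  # task name from the start log

# hypothetical: same shape of group as the real code, with stand-in sensor ids
job = group([update_sensor_status.s(sensor_id) for sensor_id in range(100)])
print(len(json.dumps(job)), 'bytes')  # a small number here argues against the size hypothesis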

@thedrow
Member

thedrow commented Jan 8, 2019

Sorry, my bad, I confused the issue subjects. I will remove the comment; does that also remove the reference on GitHub?

Yes.

@WisdomPill
Author

Seems like with the new Celery version, 4.3.0rc2, this does not happen anymore.

I will confirm that in the following days, when I try autoscale in production.

@WisdomPill
Author

I confirm that I don't see that error message anymore; you can close it. @thedrow, thanks for fixing it.

@thedrow thedrow closed this as completed Mar 8, 2019
@thedrow thedrow added this to the v4.3 milestone Mar 8, 2019
@thedrow
Member

thedrow commented Mar 8, 2019

Sure, no problem.
