
Celery: worker connection timed out #2991

Open
shiraiyuki opened this issue Jan 8, 2016 · 24 comments

shiraiyuki commented Jan 8, 2016

I'm using amqp (1.4.6), celery (3.1.14) and kombu (3.0.22). Recently I created a worker that connects to the rabbitmq-server (broker) over the internet. Sometimes the worker gets a [Errno 110] Connection timed out error. The following is the worker's log:

[2016-01-07 15:44:02,001: WARNING/MainProcess] Traceback (most recent call last):
[2016-01-07 15:44:02,001: WARNING/MainProcess] File "../lib/python2.7/site-packages/eventlet/hubs/poll.py", line 115, in wait
[2016-01-07 15:44:02,036: WARNING/MainProcess] listener.cb(fileno)
[2016-01-07 15:44:02,036: WARNING/MainProcess] File "../lib/python2.7/site-packages/celery/worker/pidbox.py", line 112, in loop
[2016-01-07 15:44:02,063: WARNING/MainProcess] connection.drain_events(timeout=1.0)
[2016-01-07 15:44:02,063: WARNING/MainProcess] File "../lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
[2016-01-07 15:44:02,074: WARNING/MainProcess] return self.transport.drain_events(self.connection, **kwargs)
[2016-01-07 15:44:02,075: WARNING/MainProcess] File "../lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
[2016-01-07 15:44:02,092: WARNING/MainProcess] return connection.drain_events(**kwargs)
[2016-01-07 15:44:02,093: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/connection.py", line 302, in drain_events
[2016-01-07 15:44:02,107: WARNING/MainProcess] chanmap, None, timeout=timeout,
[2016-01-07 15:44:02,107: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/connection.py", line 365, in _wait_multiple
[2016-01-07 15:44:02,107: WARNING/MainProcess] channel, method_sig, args, content = read_timeout(timeout)
[2016-01-07 15:44:02,107: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/connection.py", line 336, in read_timeout
[2016-01-07 15:44:02,107: WARNING/MainProcess] return self.method_reader.read_method()
[2016-01-07 15:44:02,107: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
[2016-01-07 15:44:02,108: WARNING/MainProcess] raise m
[2016-01-07 15:44:02,108: WARNING/MainProcess] error: [Errno 110] Connection timed out
[2016-01-07 15:44:02,108: WARNING/MainProcess] Removing descriptor: 6
[2016-01-07 15:44:17,609: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...

After this error message, when the worker reconnects to the broker, we get another error message:
consumer: Cannot connect to amqp://guest:**@***:5672//: [Errno 104] Connection reset by peer.

Then, after a long time, the worker can sometimes reconnect to the broker. But when I assign some jobs to the worker, it doesn't run normally: the worker receives the jobs but never reports success back to the broker.

Here is my celery setting:

app.conf.update(
    CELERY_IGNORE_RESULT=True,
    CELERY_RESULT_BACKEND='rpc://',
    CELERY_DISABLE_RATE_LIMITS=True,
    # CELERY_DEFAULT_DELIVERY_MODE = True,
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
    CELERY_TIMEZONE='Asia/Taipei',
    # if the heartbeat is 10.0 and the rate is the default 2.0, the check will be performed every 5 seconds
    # BROKER_HEARTBEAT=10.0,
    # BROKER_HEARTBEAT_CHECKRATE=2.0,
    BROKER_CONNECTION_MAX_RETRIES=None,
)

Are there any settings that I need to change?
And if the timeout occurs, is there a way to detect that the worker is in this abnormal state and restart it automatically?

Thanks
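
For the auto-restart question, a common approach (not specific to this issue) is an external watchdog that pings the worker over the broker and restarts it when it stops responding. A minimal sketch, assuming the worker runs under systemd as a unit named celery-worker; the app module, node name, unit name, and timings are all illustrative, and it relies on celery inspect ping exiting with a non-zero status when the node does not reply:

import subprocess
import time

APP = "proj"                  # hypothetical Celery app module
WORKER = "celery@myworker"    # hypothetical worker node name
UNIT = "celery-worker"        # hypothetical systemd unit name

def worker_responds(timeout=10):
    # Ask the broker-side control channel whether the worker replies.
    result = subprocess.run(
        ["celery", "-A", APP, "inspect", "ping",
         "-d", WORKER, "--timeout", str(timeout)],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

while True:
    if not worker_responds():
        # The worker is unreachable over the broker (the "abnormal"
        # state described above); restart it.
        subprocess.run(["systemctl", "restart", UNIT])
    time.sleep(60)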

ask (Contributor) commented Jan 13, 2016

Are there no remote control related errors above in the log, before this happens?

socket.timeout should be ignored in this code, so is this exception different from socket.timeout?
https://github.com/celery/celery/blob/3.1/celery/worker/pidbox.py#L111-L114
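
For context, a paraphrased sketch of the control-channel loop referenced above (celery 3.1, celery/worker/pidbox.py); this is not the verbatim source, just the shape of the drain-and-ignore-timeouts pattern:

import socket

def drain_control_messages(connection, shutdown_event):
    # connection is the kombu broker connection used by the pidbox node;
    # shutdown_event is a threading.Event-like flag (hypothetical names).
    while not shutdown_event.is_set():
        try:
            connection.drain_events(timeout=1.0)
        except socket.timeout:
            # Expected about once per second while no control messages
            # arrive; deliberately ignored.
            pass
        # A socket.error such as [Errno 110] Connection timed out is a
        # different class from socket.timeout, so it is not swallowed
        # here and propagates out of the loop, matching the traceback
        # in the original report.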

Glueon commented Jan 17, 2016

Are you getting this when you run more tasks than the concurrency level allows?

shiraiyuki (Author) commented:

In my controller's rabbitmq log, when the worker gets the error, I see an entry like:
=ERROR REPORT==== 20-Jan-2016::10:57:14 ===
closing AMQP connection <0.2631.0> (worker ip:59989 -> controller ip:5672):
{inet_error,etimedout}

After that log entry, the worker gets the "[Errno 110] Connection timed out" error, and when it tries to reconnect to the broker it keeps getting timeouts and "[Errno 104] Connection reset by peer" errors until I restart the worker.

And sometimes my rabbitmq log shows:
closing AMQP connection <0.201.11> (127.0.0.1:49464 -> 127.0.0.1:5672):
connection_closed_abruptly

In celery we configure the worker with 100 coroutines, and the controller sends about 100 tasks, but each task has some subtasks that do my jobs.

Another problem is that sometimes the worker doesn't get a timeout error, but the connection to the controller is abnormal: I can't reach the worker with celery inspect ping, and the worker can receive the jobs but doesn't send the success message back to the controller.

In the controller's rabbitmq log:

closing AMQP connection <0.17632.37> (workerip:42754 -> controllerip:5672):
{handshake_timeout,handshake}
or
closing AMQP connection <0.16301.37> (workerip:42557 -> controllerip:5672):
{handshake_timeout,frame_header}

Djarnis commented Feb 4, 2016

I was having similar issues using a rabbitmq cluster behind an Elastic Load Balancer. What solved the issue for me was setting BROKER_POOL_LIMIT = None, disabling connection pooling. The downside is all the opening and closing of connections. These are my other settings: https://gist.github.com/Djarnis/c0d0df15656810123f8a#file-settings-py
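
In the settings style used earlier in this thread, that workaround would look roughly like the following. This is a sketch of the workaround described above, not a guaranteed fix; the heartbeat values are illustrative (the original report had them commented out):

app.conf.update(
    # Disable the broker connection pool entirely; every connection is
    # opened and closed on demand, which avoids reusing connections that
    # a load balancer may have silently dropped.
    BROKER_POOL_LIMIT=None,
    # Optionally enable AMQP heartbeats (pyamqp transport) so dead TCP
    # connections are detected sooner.
    BROKER_HEARTBEAT=10.0,
    BROKER_HEARTBEAT_CHECKRATE=2.0,
)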

ask (Contributor) commented Jun 24, 2016

Closing this, as we don't have the resources to complete this task.

May be fixed in master; let's see if it comes back after the 4.0 release.

vivekanand1101 (Contributor) commented Jul 19, 2017

I am using Celery version 4.0.2 (latentcall).
I still get the Connection timed out error. Here is the rabbitmq log file for the same:
https://paste.fedoraproject.org/paste/hf2lbqkyK5ZQ5bBpg0u-eQ

vivekanand1101 (Contributor) commented:

Makes me #wannacry:

=INFO REPORT==== 20-Jul-2017::12:40:55 ===
accepting AMQP connection <0.773.0> (127.0.0.1:59154 -> 127.0.0.1:5672)

=ERROR REPORT==== 20-Jul-2017::12:41:05 ===
closing AMQP connection <0.773.0> (127.0.0.1:59154 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

=INFO REPORT==== 20-Jul-2017::12:41:07 ===
accepting AMQP connection <0.776.0> (127.0.0.1:59174 -> 127.0.0.1:5672)

=ERROR REPORT==== 20-Jul-2017::12:41:17 ===
closing AMQP connection <0.776.0> (127.0.0.1:59174 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

vivekanand1101 (Contributor) commented:

So, I know exactly how to fix this. Removed my ethernet cable and it worked -_-

auvipy reopened this Jul 20, 2017
aayushgoel92 commented Sep 17, 2017

@vivekanand1101 You are a life saver. Any ideas why this is happening?

I'm on v4.1.0

vivekanand1101 (Contributor) commented:

@aayushgoel92 nope :(

auvipy (Member) commented Dec 19, 2017

Could you check the master branch with the latest versions of the dependencies?

auvipy added this to the v5.0.0 milestone Dec 19, 2017
alanjds added a commit to alanjds/celery-serverless that referenced this issue Jun 19, 2018
auvipy removed this from the v5.0.0 milestone Aug 10, 2018
auvipy (Member) commented Aug 10, 2018

@alanjds is this still an issue on 4.2? Should we close this?

alanjds (Contributor) commented Aug 10, 2018

I do not remember having seen this before, @auvipy.

It has been waiting for feedback for 8 months. What about pinging @aayushgoel92 and @shiraiyuki and waiting a couple of weeks more?

Then close it :/

auvipy (Member) commented Aug 12, 2018

sure!

vsag96 commented Oct 23, 2018

Yes, this is still an issue. I am using 4.2.0 with a rabbitmq 3.6 docker image. If you need more information please let me know. Which parts of the codebase should be read for fixing this? I'll be glad to help. This issue has popped up again recently in #4980.

auvipy (Member) commented Nov 18, 2018

@vsag96 could you please install all the packages from GitHub master and check?

hp685 commented Nov 26, 2018

@auvipy I have also observed this recently using master with rabbitmq 3.6:
Detail from rabbitmq log:
=ERROR REPORT==== 26-Nov-2018::09:36:49 ===
closing AMQP connection <0.5823.374> (127.0.0.1:41877 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

I also increased the handshake_timeout from 10s to 60s, but that did not help.
Additionally, I don't see "Connection reset by peer" or "Connection timed out" in my case, but I do see the client block indefinitely after having sent a request.

sudhishvnair commented:

I am also facing this issue with v4.3. Please find the trace below.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 431, in _reraise_as_library_errors
yield
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 510, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/usr/local/lib/python3.6/dist-packages/amqp/channel.py", line 1771, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/usr/local/lib/python3.6/dist-packages/amqp/abstract_channel.py", line 51, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.6/dist-packages/amqp/method_framing.py", line 144, in write_frame
frame, 0xce))
File "/usr/local/lib/python3.6/dist-packages/amqp/transport.py", line 288, in write
self._write(s)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 648, in protected_call
return self.run(*args, **kwargs)
File "/root/nlp_engine/rcm_ai_service/job_man/tasks.py", line 15, in nlp_layered_task
celery_task_function_pdf_to_json(dParams)
File "/root/nlp_engine/rcm_ai_service/src/pdf/views.py", line 401, in celery_task_function_pdf_to_json
dMappedJson=context, dFileInfo=dFileInfo, lDbJsonSet=None)
File "/usr/local/lib/python3.6/dist-packages/celery/app/task.py", line 427, in delay
return self.apply_async(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/celery/app/task.py", line 570, in apply_async
**options
File "/usr/local/lib/python3.6/dist-packages/celery/app/base.py", line 756, in send_task
amqp.send_task_message(P, name, message, **options)
File "/usr/local/lib/python3.6/dist-packages/celery/app/amqp.py", line 552, in send_task_message
**properties
File "/usr/local/lib/python3.6/dist-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 543, in _ensured
errback and errback(exc, 0)
File "/usr/lib/python3.6/contextlib.py", line 99, in exit
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 436, in _reraise_as_library_errors
sys.exc_info()[2])
File "/usr/local/lib/python3.6/dist-packages/vine/five.py", line 194, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 431, in _reraise_as_library_errors
yield
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 510, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/usr/local/lib/python3.6/dist-packages/amqp/channel.py", line 1771, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/usr/local/lib/python3.6/dist-packages/amqp/abstract_channel.py", line 51, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.6/dist-packages/amqp/method_framing.py", line 144, in write_frame
frame, 0xce))
File "/usr/local/lib/python3.6/dist-packages/amqp/transport.py", line 288, in write
self._write(s)
kombu.exceptions.OperationalError: [Errno 110] Connection timed out
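
That traceback is on the publishing side (an apply_async/delay call made from inside a task), so one mitigation, independent of the worker-side problem, is to let kombu retry the publish on recoverable connection errors and to handle OperationalError explicitly. A sketch, where some_task and payload are hypothetical placeholders and the retry values are illustrative:

from kombu.exceptions import OperationalError

try:
    # retry/retry_policy make kombu re-establish the connection and
    # retry the publish before giving up.
    some_task.apply_async(
        args=(payload,),
        retry=True,
        retry_policy={
            'max_retries': 3,
            'interval_start': 0,
            'interval_step': 2,
            'interval_max': 10,
        },
    )
except OperationalError as exc:
    # Still failed after the retries above (e.g. the connection kept
    # timing out as in the traceback); log it and decide how to recover.
    print('publish failed: %r' % (exc,))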

auvipy modified the milestones: 4.4.0, 4.5 May 7, 2019
pymonger (Contributor) commented:

Setting BROKER_POOL_LIMIT = None fixed this issue for me in celery v4.4.0rc3.
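
For newer Celery versions using the lowercase settings names, the equivalent would be the following one-line sketch of the same workaround:

app.conf.broker_pool_limit = None  # disable the broker connection pool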

auvipy modified the milestones: 4.5, 4.4.0 Aug 30, 2019
auvipy (Member) commented Aug 30, 2019

thanks for verifying!

auvipy closed this as completed Aug 30, 2019
thedrow (Member) commented Oct 2, 2019

@auvipy That's a workaround, not a fix.
Celery should be working with eventlet and a BROKER_POOL_LIMIT.

thedrow reopened this Oct 2, 2019
auvipy modified the milestones: 4.4.0, 4.5 Oct 2, 2019
auvipy modified the milestones: 4.5, 4.4.x Dec 16, 2019
auvipy modified the milestones: 4.4.x, 5.1.0 Dec 30, 2020
auvipy modified the milestones: 5.1.0, 5.2 Feb 18, 2021
auvipy modified the milestones: 5.2, 5.3 Oct 30, 2021
tahamr83 commented Apr 18, 2022

This still persists for me. We run on Kubernetes, and when this occurs our task revokes stop working; the revokes never reach that worker, since I couldn't find the *.pidbox queue for that worker in our rabbitmq instance.

tahamr83 commented May 6, 2022

Could it be that, for some reason, it's a different exception than socket.timeout here (the except socket.timeout: handler) that causes the consumer to simply shut down, effectively disabling all Control commands?
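
One way to test that hypothesis without patching celery itself is to reproduce the drain loop against the same broker and log exactly which exception class escapes the socket.timeout handler. A diagnostic sketch; the broker URL is a placeholder:

import socket

from kombu import Connection

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    conn.connect()
    while True:
        try:
            conn.drain_events(timeout=1.0)
        except socket.timeout:
            pass  # the only exception the pidbox loop is meant to ignore
        except Exception as exc:
            # Anything that lands here (e.g. an OSError with errno 110)
            # would also escape the worker's handler and take the
            # control consumer down.
            print('escaped the timeout handler: %r (%s)'
                  % (exc, type(exc).__name__))
            break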

auvipy modified the milestones: 5.3, 5.3.x Dec 2, 2022
auvipy (Member) commented Dec 2, 2022

Could it be that, for some reason, it's a different exception than socket.timeout here (the except socket.timeout: handler) that causes the consumer to simply shut down, effectively disabling all Control commands?

can you open a PR and investigate?

auvipy modified the milestones: 5.3.x, 5.4.x Nov 13, 2023