
Celery: worker connection timed out #2991

Open
shiraiyuki opened this issue Jan 8, 2016 · 24 comments

shiraiyuki commented Jan 8, 2016

I'm using amqp (1.4.6), celery (3.1.14) and kombu (3.0.22). Recently I created a worker that connects to the rabbitmq-server (broker) over the internet. Sometimes the worker gets a [Errno 110] Connection timed out error. The following is the worker's log:

[2016-01-07 15:44:02,001: WARNING/MainProcess] Traceback (most recent call last):
[2016-01-07 15:44:02,001: WARNING/MainProcess] File "../lib/python2.7/site-packages/eventlet/hubs/poll.py", line 115, in wait
[2016-01-07 15:44:02,036: WARNING/MainProcess] listener.cb(fileno)
[2016-01-07 15:44:02,036: WARNING/MainProcess] File "../lib/python2.7/site-packages/celery/worker/pidbox.py", line 112, in loop
[2016-01-07 15:44:02,063: WARNING/MainProcess] connection.drain_events(timeout=1.0)
[2016-01-07 15:44:02,063: WARNING/MainProcess] File "../lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
[2016-01-07 15:44:02,074: WARNING/MainProcess] return self.transport.drain_events(self.connection, **kwargs)
[2016-01-07 15:44:02,075: WARNING/MainProcess] File "../lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
[2016-01-07 15:44:02,092: WARNING/MainProcess] return connection.drain_events(**kwargs)
[2016-01-07 15:44:02,093: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/connection.py", line 302, in drain_events
[2016-01-07 15:44:02,107: WARNING/MainProcess] chanmap, None, timeout=timeout,
[2016-01-07 15:44:02,107: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/connection.py", line 365, in _wait_multiple
[2016-01-07 15:44:02,107: WARNING/MainProcess] channel, method_sig, args, content = read_timeout(timeout)
[2016-01-07 15:44:02,107: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/connection.py", line 336, in read_timeout
[2016-01-07 15:44:02,107: WARNING/MainProcess] return self.method_reader.read_method()
[2016-01-07 15:44:02,107: WARNING/MainProcess] File "../lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
[2016-01-07 15:44:02,108: WARNING/MainProcess] raise m
[2016-01-07 15:44:02,108: WARNING/MainProcess] error: [Errno 110] Connection timed out
[2016-01-07 15:44:02,108: WARNING/MainProcess] Removing descriptor: 6
[2016-01-07 15:44:17,609: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...

After this error message, when the worker reconnects to the broker, we get another error message:
consumer: Cannot connect to amqp://guest:**@***:5672//: [Errno 104] Connection reset by peer.

Then, after a long time, the worker can sometimes reconnect to the broker. But when I assign some jobs to the worker, it doesn't run normally: the worker receives the jobs but never reports success back to the broker.

Here is my celery setting:

app.conf.update(
    CELERY_IGNORE_RESULT=True,
    CELERY_RESULT_BACKEND='rpc://',
    CELERY_DISABLE_RATE_LIMITS=True,
    # CELERY_DEFAULT_DELIVERY_MODE = True,
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
    CELERY_TIMEZONE='Asia/Taipei',
    # if the heartbeat is 10.0 and the rate is the default 2.0, the check will be performed every 5 seconds
    # BROKER_HEARTBEAT=10.0,
    # BROKER_HEARTBEAT_CHECKRATE=2.0,
    BROKER_CONNECTION_MAX_RETRIES=None,
)

Are there any settings that I need to change?
And if the timeout occurs, is there a way to detect that the worker is in this abnormal state and restart it automatically?

Thanks
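
For the auto-restart question, a common approach (not specific to this issue) is an external watchdog that pings the worker over the broker and restarts it when it stops responding. A minimal sketch, assuming the worker runs under systemd as a unit named celery-worker; the app module, node name, unit name, and timings are all illustrative, and it relies on celery inspect ping exiting with a non-zero status when the node does not reply:

import subprocess
import time

APP = "proj"                  # hypothetical Celery app module
WORKER = "celery@myworker"    # hypothetical worker node name
UNIT = "celery-worker"        # hypothetical systemd unit name

def worker_responds(timeout=10):
    # Ask the broker-side control channel whether the worker replies.
    result = subprocess.run(
        ["celery", "-A", APP, "inspect", "ping",
         "-d", WORKER, "--timeout", str(timeout)],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

while True:
    if not worker_responds():
        # The worker is unreachable over the broker (the "abnormal"
        # state described above); restart it.
        subprocess.run(["systemctl", "restart", UNIT])
    time.sleep(60)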

ask (Contributor) commented Jan 13, 2016

Are there no remote control related errors above in the log, before this happens?

socket.timeout should be ignored in this code, so is this exception different from socket.timeout?
https://github.com/celery/celery/blob/3.1/celery/worker/pidbox.py#L111-L114
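
For context, a paraphrased sketch of the control-channel loop referenced above (celery 3.1, celery/worker/pidbox.py); this is not the verbatim source, just the shape of the drain-and-ignore-timeouts pattern:

import socket

def drain_control_messages(connection, shutdown_event):
    # connection is the kombu broker connection used by the pidbox node;
    # shutdown_event is a threading.Event-like flag (hypothetical names).
    while not shutdown_event.is_set():
        try:
            connection.drain_events(timeout=1.0)
        except socket.timeout:
            # Expected about once per second while no control messages
            # arrive; deliberately ignored.
            pass
        # A socket.error such as [Errno 110] Connection timed out is a
        # different class from socket.timeout, so it is not swallowed
        # here and propagates out of the loop, matching the traceback
        # in the original report.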

Glueon commented Jan 17, 2016

Are you getting this when you run more tasks than the concurrency level allows?

shiraiyuki (Author) commented:

In my controller's rabbitmq log, when the worker gets the error, I see an entry like:
=ERROR REPORT==== 20-Jan-2016::10:57:14 ===
closing AMQP connection <0.2631.0> (worker ip:59989 -> controller ip:5672):
{inet_error,etimedout}

After that log entry, the worker gets the "[Errno 110] Connection timed out" error, and when it tries to reconnect to the broker it keeps getting timeouts and "[Errno 104] Connection reset by peer" errors until I restart the worker.

And sometimes my rabbitmq log shows:
closing AMQP connection <0.201.11> (127.0.0.1:49464 -> 127.0.0.1:5672):
connection_closed_abruptly

In celery we configure the worker with 100 coroutines, and the controller sends about 100 tasks, but each task has some subtasks that do my jobs.

Another problem is that sometimes the worker doesn't get a timeout error, but the connection to the controller is abnormal: I can't reach the worker with celery inspect ping, and the worker can receive the jobs but doesn't send the success message back to the controller.

In the controller's rabbitmq log:

closing AMQP connection <0.17632.37> (workerip:42754 -> controllerip:5672):
{handshake_timeout,handshake}
or
closing AMQP connection <0.16301.37> (workerip:42557 -> controllerip:5672):
{handshake_timeout,frame_header}

Djarnis commented Feb 4, 2016

I was having similar issues using a rabbitmq cluster behind an Elastic Load Balancer. What solved the issue for me was setting BROKER_POOL_LIMIT = None, disabling connection pooling. The downside is all the opening and closing of connections. These are my other settings: https://gist.github.com/Djarnis/c0d0df15656810123f8a#file-settings-py
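
In the settings style used earlier in this thread, that workaround would look roughly like the following. This is a sketch of the workaround described above, not a guaranteed fix; the heartbeat values are illustrative (the original report had them commented out):

app.conf.update(
    # Disable the broker connection pool entirely; every connection is
    # opened and closed on demand, which avoids reusing connections that
    # a load balancer may have silently dropped.
    BROKER_POOL_LIMIT=None,
    # Optionally enable AMQP heartbeats (pyamqp transport) so dead TCP
    # connections are detected sooner.
    BROKER_HEARTBEAT=10.0,
    BROKER_HEARTBEAT_CHECKRATE=2.0,
)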

ask (Contributor) commented Jun 24, 2016

Closing this, as we don't have the resources to complete this task.

May be fixed in master; let's see if it comes back after the 4.0 release.

vivekanand1101 (Contributor) commented Jul 19, 2017

I am using Celery version 4.0.2 (latentcall).
I still get the Connection timed out error. Here is the rabbitmq log file for the same:
https://paste.fedoraproject.org/paste/hf2lbqkyK5ZQ5bBpg0u-eQ

vivekanand1101 (Contributor) commented:

Makes me #wannacry:

=INFO REPORT==== 20-Jul-2017::12:40:55 ===
accepting AMQP connection <0.773.0> (127.0.0.1:59154 -> 127.0.0.1:5672)

=ERROR REPORT==== 20-Jul-2017::12:41:05 ===
closing AMQP connection <0.773.0> (127.0.0.1:59154 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

=INFO REPORT==== 20-Jul-2017::12:41:07 ===
accepting AMQP connection <0.776.0> (127.0.0.1:59174 -> 127.0.0.1:5672)

=ERROR REPORT==== 20-Jul-2017::12:41:17 ===
closing AMQP connection <0.776.0> (127.0.0.1:59174 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

vivekanand1101 (Contributor) commented:

So, I know exactly how to fix this. Removed my ethernet cable and it worked -_-

auvipy reopened this Jul 20, 2017
aayushgoel92 commented Sep 17, 2017

@vivekanand1101 You are a life saver. Any ideas why this is happening?

I'm on v4.1.0

vivekanand1101 (Contributor) commented:

@aayushgoel92 nope :(

auvipy (Member) commented Dec 19, 2017

Could you check the master branch with the latest versions of the dependencies?

auvipy added this to the v5.0.0 milestone Dec 19, 2017
alanjds added a commit to alanjds/celery-serverless that referenced this issue Jun 19, 2018
auvipy removed this from the v5.0.0 milestone Aug 10, 2018
auvipy (Member) commented Aug 10, 2018

@alanjds is this still an issue on 4.2? Should we close this?

alanjds (Contributor) commented Aug 10, 2018

I do not remember having seen this before, @auvipy.

It has been waiting for feedback for 8 months. What about pinging @aayushgoel92 and @shiraiyuki and waiting a couple of weeks more?

Then close it :/

auvipy (Member) commented Aug 12, 2018

sure!

vsag96 commented Oct 23, 2018

Yes, this is still an issue. I am using 4.2.0 with a rabbitmq 3.6 docker image. If you need more information please let me know. Which parts of the codebase should be read for fixing this? I'll be glad to help. This issue has popped up again recently in #4980.

auvipy (Member) commented Nov 18, 2018

@vsag96 could you please install all the packages from GitHub master and check?

hp685 commented Nov 26, 2018

@auvipy I have also observed this recently using master with rabbitmq 3.6:
Detail from rabbitmq log:
=ERROR REPORT==== 26-Nov-2018::09:36:49 ===
closing AMQP connection <0.5823.374> (127.0.0.1:41877 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

I also increased the handshake_timeout from 10s to 60s, but that did not help.
Additionally, I don't see "Connection reset by peer" or "Connection timed out" in my case, but I do see the client block indefinitely after having sent a request.

sudhishvnair commented:

I am also facing this issue with v4.3. Please find the trace below.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 431, in _reraise_as_library_errors
yield
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 510, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/usr/local/lib/python3.6/dist-packages/amqp/channel.py", line 1771, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/usr/local/lib/python3.6/dist-packages/amqp/abstract_channel.py", line 51, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.6/dist-packages/amqp/method_framing.py", line 144, in write_frame
frame, 0xce))
File "/usr/local/lib/python3.6/dist-packages/amqp/transport.py", line 288, in write
self._write(s)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 648, in protected_call
return self.run(*args, **kwargs)
File "/root/nlp_engine/rcm_ai_service/job_man/tasks.py", line 15, in nlp_layered_task
celery_task_function_pdf_to_json(dParams)
File "/root/nlp_engine/rcm_ai_service/src/pdf/views.py", line 401, in celery_task_function_pdf_to_json
dMappedJson=context, dFileInfo=dFileInfo, lDbJsonSet=None)
File "/usr/local/lib/python3.6/dist-packages/celery/app/task.py", line 427, in delay
return self.apply_async(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/celery/app/task.py", line 570, in apply_async
**options
File "/usr/local/lib/python3.6/dist-packages/celery/app/base.py", line 756, in send_task
amqp.send_task_message(P, name, message, **options)
File "/usr/local/lib/python3.6/dist-packages/celery/app/amqp.py", line 552, in send_task_message
**properties
File "/usr/local/lib/python3.6/dist-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 543, in _ensured
errback and errback(exc, 0)
File "/usr/lib/python3.6/contextlib.py", line 99, in exit
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 436, in _reraise_as_library_errors
sys.exc_info()[2])
File "/usr/local/lib/python3.6/dist-packages/vine/five.py", line 194, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 431, in _reraise_as_library_errors
yield
File "/usr/local/lib/python3.6/dist-packages/kombu/connection.py", line 510, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/usr/local/lib/python3.6/dist-packages/amqp/channel.py", line 1771, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/usr/local/lib/python3.6/dist-packages/amqp/abstract_channel.py", line 51, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.6/dist-packages/amqp/method_framing.py", line 144, in write_frame
frame, 0xce))
File "/usr/local/lib/python3.6/dist-packages/amqp/transport.py", line 288, in write
self._write(s)
kombu.exceptions.OperationalError: [Errno 110] Connection timed out
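
That traceback is on the publishing side (an apply_async/delay call made from inside a task), so one mitigation, independent of the worker-side problem, is to let kombu retry the publish on recoverable connection errors and to handle OperationalError explicitly. A sketch, where some_task and payload are hypothetical placeholders and the retry values are illustrative:

from kombu.exceptions import OperationalError

try:
    # retry/retry_policy make kombu re-establish the connection and
    # retry the publish before giving up.
    some_task.apply_async(
        args=(payload,),
        retry=True,
        retry_policy={
            'max_retries': 3,
            'interval_start': 0,
            'interval_step': 2,
            'interval_max': 10,
        },
    )
except OperationalError as exc:
    # Still failed after the retries above (e.g. the connection kept
    # timing out as in the traceback); log it and decide how to recover.
    print('publish failed: %r' % (exc,))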

auvipy modified the milestones: 4.4.0, 4.5 May 7, 2019
pymonger (Contributor) commented:

Setting BROKER_POOL_LIMIT = None fixed this issue for me in celery v4.4.0rc3.
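
For newer Celery versions using the lowercase settings names, the equivalent would be the following one-line sketch of the same workaround:

app.conf.broker_pool_limit = None  # disable the broker connection pool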

auvipy modified the milestones: 4.5, 4.4.0 Aug 30, 2019
auvipy (Member) commented Aug 30, 2019

thanks for verifying!

auvipy closed this as completed Aug 30, 2019
thedrow (Member) commented Oct 2, 2019

@auvipy That's a workaround, not a fix.
Celery should be working with eventlet and a BROKER_POOL_LIMIT.

thedrow reopened this Oct 2, 2019
auvipy modified the milestones: 4.4.0, 4.5 Oct 2, 2019
auvipy modified the milestones: 4.5, 4.4.x Dec 16, 2019
auvipy modified the milestones: 4.4.x, 5.1.0 Dec 30, 2020
auvipy modified the milestones: 5.1.0, 5.2 Feb 18, 2021
auvipy modified the milestones: 5.2, 5.3 Oct 30, 2021
tahamr83 commented Apr 18, 2022

This still persists for me. We run on Kubernetes, and when this occurs our task revokes stop working; the revokes never reach that worker, since I couldn't find the *.pidbox queue for that worker in our rabbitmq instance.

tahamr83 commented May 6, 2022

Could it be that, for some reason, it's a different exception than socket.timeout here (the except socket.timeout: handler) that causes the consumer to simply shut down, effectively disabling all Control commands?
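
One way to test that hypothesis without patching celery itself is to reproduce the drain loop against the same broker and log exactly which exception class escapes the socket.timeout handler. A diagnostic sketch; the broker URL is a placeholder:

import socket

from kombu import Connection

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    conn.connect()
    while True:
        try:
            conn.drain_events(timeout=1.0)
        except socket.timeout:
            pass  # the only exception the pidbox loop is meant to ignore
        except Exception as exc:
            # Anything that lands here (e.g. an OSError with errno 110)
            # would also escape the worker's handler and take the
            # control consumer down.
            print('escaped the timeout handler: %r (%s)'
                  % (exc, type(exc).__name__))
            break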

auvipy modified the milestones: 5.3, 5.3.x Dec 2, 2022
auvipy (Member) commented Dec 2, 2022

Could it be that, for some reason, it's a different exception than socket.timeout here (the except socket.timeout: handler) that causes the consumer to simply shut down, effectively disabling all Control commands?

can you open a PR and investigate?

auvipy modified the milestones: 5.3.x, 5.4.x Nov 13, 2023