
celery failover with rabbitmq cluster hangs #4075

Closed
bpereto opened this issue Jun 7, 2017 · 6 comments

@bpereto
Contributor

bpereto commented Jun 7, 2017

We have a rabbitmq HA cluster as broker behind a loadbalancer.
In my failover tests I encountered some issues I want to address/discuss.

If I force-poweroff one node of the rabbitmq cluster, the celery workers don't reconnect through the loadbalancer to the second node and instead fail with the error: [Errno 104] Connection reset by peer

Setup

software -> celery:3.1.25 (Cipater) kombu:3.0.37 py:2.7.5
billiard:3.3.0.23 py-amqp:1.4.9
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:amqp results:redis

Broker

rabbitmq HA cluster as Broker.
SSL is used.

policy
ha-sync-mode: automatic
ha-mode: all
pattern: .*

Worker

BROKER_URL = 'amqp://user:pw@mybrokerurl/appvhost'
BROKER_USE_SSL = {
    'ca_certs': '/etc/pki/myca/CA/certs/ca.pem'
}
BROKER_POOL_LIMIT = None

# my test to set a socket timeout
BROKER_TRANSPORT_OPTIONS = {'confirm_publish': True,
                            'socket_timeout': 60}
BROKER_HEARTBEAT = 10
BROKER_CONNECTION_MAX_RETRIES = None

CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TIMEZONE = 'Europe/Zurich'
CELERY_ENABLE_UTC = True
CELERY_TRACK_STARTED = True
CELERY_EVENT_QUEUE_EXPIRES = 60 # Will delete all celeryev. queues without consumers after 1 minute.
CELERYD_PREFETCH_MULTIPLIER = 1

Steps to reproduce

  • rabbitmqctl stop_app on node1 to let workers reconnect to node2. reconnect works fine.
  • rabbitmqctl start_app on node1 to be HA protected again
  • Force Poweroff node2 or ifdown eth0

Expected behavior

Celery workers reconnect after a few seconds to the next broker node

Actual behavior

Celery workers are stuck for about 15 minutes, until a socket timeout occurs, and only then reconnect to the broker.

[2017-06-07 11:37:33,307: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start
    blueprint.start(self)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start
    c.loop(*c.loop_args())
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/worker/loops.py", line 76, in asynloop
    next(loop)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/hub.py", line 281, in create_loop
    poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/hub.py", line 140, in fire_timers
    entry()
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/timer.py", line 64, in __call__
    return self.fun(*self.args, **self.kwargs)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/timer.py", line 132, in _reschedules
    return fun(*args, **kwargs)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/connection.py", line 277, in heartbeat_check
    return self.transport.heartbeat_check(self.connection, rate=rate)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 135, in heartbeat_check
    return connection.heartbeat_tick(rate=rate)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/connection.py", line 907, in heartbeat_tick
    self.send_heartbeat()
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/connection.py", line 884, in send_heartbeat
    self.transport.write_frame(8, 0, bytes())
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/transport.py", line 254, in _write
    n = write(s)
  File "/usr/lib64/python2.7/ssl.py", line 671, in write
    return self._sslobj.write(data)
error: [Errno 104] Connection reset by peer

I haven't tested against master yet, because that would require a major upgrade of our software project and settings.

@georgepsarakis
Contributor

Would you mind taking a look at #3921? It seems related.

@bpereto
Contributor Author

bpereto commented Jun 7, 2017

I will look into it.

My search revealed #1859. Do you think there is a correlation?

@georgepsarakis
Contributor

Could be, I will read the comments there too.

@bpereto
Contributor Author

bpereto commented Jun 8, 2017

I looked into the related issue you suggested.
The --without-mingle option and celery/kombu#724 do not fix the problem. Additionally, I tested the same steps over amqp without SSL and got the same error.

I also ran tests with a BROKER_URL list to exercise the setup without the loadbalancer (a minimal sketch of that configuration follows below). Still the same issue.
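For reference, a minimal sketch of that loadbalancer-free configuration, assuming hypothetical node hostnames (rabbit-node1/rabbit-node2); Celery 3.1 accepts a list of broker URLs (or a single semicolon-separated string) and fails over between them:

# Sketch only; the hostnames are placeholders, not our real setup.
BROKER_URL = [
    'amqp://user:pw@rabbit-node1/appvhost',
    'amqp://user:pw@rabbit-node2/appvhost',
]
# Optional: strategy for picking the next URL on failover ('round-robin' or 'shuffle').
BROKER_FAILOVER_STRATEGY = 'round-robin'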

strace shows:

Process 27606 attached
recvfrom(100, 

@bpereto
Contributor Author

bpereto commented Jun 8, 2017

This mitigates the problem: sysctl net.ipv4.tcp_retries2=5

The default value of 15 (~924 s ≈ 15.4 min) matches the observed timeout (see the calculation sketch after the kernel doc excerpt below).

I don't know whether setting this value to 5 is good or bad.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_retries2 - INTEGER
	This value influences the timeout of an alive TCP connection,
	when RTO retransmissions remain unacknowledged.
	Given a value of N, a hypothetical TCP connection following
	exponential backoff with an initial RTO of TCP_RTO_MIN would
	retransmit N times before killing the connection at the (N+1)th RTO.

	The default value of 15 yields a hypothetical timeout of 924.6
	seconds and is a lower bound for the effective timeout.
	TCP will effectively time out at the first RTO which exceeds the
	hypothetical timeout.

	RFC 1122 recommends at least 100 seconds for the timeout,
	which corresponds to a value of at least 8.
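
To make the numbers concrete, here is a rough back-of-the-envelope calculation of that hypothetical timeout, assuming the typical Linux constants TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s:

# Rough estimate of the kernel's "hypothetical timeout" for tcp_retries2;
# assumes TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s (typical Linux values).
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds

def hypothetical_timeout(tcp_retries2):
    rto, total = TCP_RTO_MIN, 0.0
    # N retransmissions, then the connection is killed at the (N+1)th RTO.
    for _ in range(tcp_retries2 + 1):
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX)
    return total

print(hypothetical_timeout(15))  # ~924.6 s (~15.4 min), the kernel default
print(hypothetical_timeout(5))   # ~12.6 s with the mitigated value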

@auvipy
Member

auvipy commented Jan 9, 2018

Closing to continue the discussion on the older issue.

@auvipy auvipy closed this as completed Jan 9, 2018