
celery failover with rabbitmq cluster hangs #4075

Closed
bpereto opened this issue Jun 7, 2017 · 6 comments

@bpereto
Contributor

bpereto commented Jun 7, 2017

We have a rabbitmq HA cluster as broker behind a loadbalancer.
In my failover tests I encountered some issues I want to address/discuss.

If I force-poweroff one node of the rabbitmq cluster, the celery workers don't reconnect through the loadbalancer to the second node and instead fail with the error: [Errno 104] Connection reset by peer

Setup

software -> celery:3.1.25 (Cipater) kombu:3.0.37 py:2.7.5
billiard:3.3.0.23 py-amqp:1.4.9
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:amqp results:redis

Broker

rabbitmq HA cluster as Broker.
SSL is used.

policy
ha-sync-mode: automatic
ha-mode: all
pattern: .*

Worker

BROKER_URL = 'amqp://user:pw@mybrokerurl/appvhost'
BROKER_USE_SSL = {
    'ca_certs': '/etc/pki/myca/CA/certs/ca.pem'
}
BROKER_POOL_LIMIT = None

# my test to set a socket timeout
BROKER_TRANSPORT_OPTIONS = {'confirm_publish': True,
                            'socket_timeout': 60}
BROKER_HEARTBEAT = 10
BROKER_CONNECTION_MAX_RETRIES = None

CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TIMEZONE = 'Europe/Zurich'
CELERY_ENABLE_UTC = True
CELERY_TRACK_STARTED = True
CELERY_EVENT_QUEUE_EXPIRES = 60 # Will delete all celeryev. queues without consumers after 1 minute.
CELERYD_PREFETCH_MULTIPLIER = 1

Steps to reproduce

  • rabbitmqctl stop_app on node1 to let workers reconnect to node2. reconnect works fine.
  • rabbitmqctl start_app on node1 to be HA protected again
  • Force Poweroff node2 or ifdown eth0

Expected behavior

Celery workers reconnect after a few seconds to the next broker node

Actual behavior

Celery workers are stuck for about 15 minutes, until a socket timeout occurs, and only then reconnect to the broker.

[2017-06-07 11:37:33,307: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start
    blueprint.start(self)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start
    c.loop(*c.loop_args())
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/celery/worker/loops.py", line 76, in asynloop
    next(loop)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/hub.py", line 281, in create_loop
    poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/hub.py", line 140, in fire_timers
    entry()
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/timer.py", line 64, in __call__
    return self.fun(*self.args, **self.kwargs)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/async/timer.py", line 132, in _reschedules
    return fun(*args, **kwargs)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/connection.py", line 277, in heartbeat_check
    return self.transport.heartbeat_check(self.connection, rate=rate)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 135, in heartbeat_check
    return connection.heartbeat_tick(rate=rate)
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/connection.py", line 907, in heartbeat_tick
    self.send_heartbeat()
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/connection.py", line 884, in send_heartbeat
    self.transport.write_frame(8, 0, bytes())
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/app/myapp/.venv/lib/python2.7/site-packages/amqp/transport.py", line 254, in _write
    n = write(s)
  File "/usr/lib64/python2.7/ssl.py", line 671, in write
    return self._sslobj.write(data)
error: [Errno 104] Connection reset by peer

I haven't tested against master yet, because that would require a major upgrade of our software project and settings.

@georgepsarakis
Contributor

Would you mind taking a look at #3921? It seems related.

@bpereto
Contributor Author

bpereto commented Jun 7, 2017

I will look into it.

My search revealed #1859. Do you think there is a correlation?

@georgepsarakis
Contributor

Could be, I will read the comments there too.

@bpereto
Contributor Author

bpereto commented Jun 8, 2017

I looked into the related issue you suggested.
The --without-mingle option and celery/kombu#724 do not fix the problem. Additionally, I tested the same steps over amqp without SSL and got the same error.

I also ran tests with a BROKER_URL list to exercise the setup without the loadbalancer (a minimal sketch of that configuration follows below). Still the same issue.
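For reference, a minimal sketch of that loadbalancer-free configuration, assuming hypothetical node hostnames (rabbit-node1/rabbit-node2); Celery 3.1 accepts a list of broker URLs (or a single semicolon-separated string) and fails over between them:

# Sketch only; the hostnames are placeholders, not our real setup.
BROKER_URL = [
    'amqp://user:pw@rabbit-node1/appvhost',
    'amqp://user:pw@rabbit-node2/appvhost',
]
# Optional: strategy for picking the next URL on failover ('round-robin' or 'shuffle').
BROKER_FAILOVER_STRATEGY = 'round-robin'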

strace shows:

Process 27606 attached
recvfrom(100, 

@bpereto
Contributor Author

bpereto commented Jun 8, 2017

This mitigates the problem: sysctl net.ipv4.tcp_retries2=5

The default value of 15 (~924 s ≈ 15.4 min) matches the observed timeout (see the calculation sketch after the kernel doc excerpt below).

I don't know whether setting this value to 5 is good or bad.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_retries2 - INTEGER
	This value influences the timeout of an alive TCP connection,
	when RTO retransmissions remain unacknowledged.
	Given a value of N, a hypothetical TCP connection following
	exponential backoff with an initial RTO of TCP_RTO_MIN would
	retransmit N times before killing the connection at the (N+1)th RTO.

	The default value of 15 yields a hypothetical timeout of 924.6
	seconds and is a lower bound for the effective timeout.
	TCP will effectively time out at the first RTO which exceeds the
	hypothetical timeout.

	RFC 1122 recommends at least 100 seconds for the timeout,
	which corresponds to a value of at least 8.
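
To make the numbers concrete, here is a rough back-of-the-envelope calculation of that hypothetical timeout, assuming the typical Linux constants TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s:

# Rough estimate of the kernel's "hypothetical timeout" for tcp_retries2;
# assumes TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s (typical Linux values).
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds

def hypothetical_timeout(tcp_retries2):
    rto, total = TCP_RTO_MIN, 0.0
    # N retransmissions, then the connection is killed at the (N+1)th RTO.
    for _ in range(tcp_retries2 + 1):
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX)
    return total

print(hypothetical_timeout(15))  # ~924.6 s (~15.4 min), the kernel default
print(hypothetical_timeout(5))   # ~12.6 s with the mitigated value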

@auvipy
Member

auvipy commented Jan 9, 2018

Closing to continue the discussion on the older issue.

@auvipy auvipy closed this as completed Jan 9, 2018