celery failover with rabbitmq cluster hangs #4075
Would you mind taking a look at #3921? It seems related.
I will look into it. My search turned up #1859. Do you think there is a correlation?
Could be; I will read the comments there too.
I looked into the related issue you suggested. I also ran tests with a BROKER_URL list, to try the setup without the load balancer. Still the same issue. strace shows:
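For reference, a broker failover list in Celery 3.1-style settings looks roughly like the sketch below. The hostnames, credentials, and port are placeholders, not values from this issue; with a list, kombu tries the next URL when a connection fails, instead of going through a load balancer.

```python
# Celery 3.1-style settings sketch (placeholder hosts/credentials):
# a list of broker URLs enables client-side failover between nodes.
BROKER_URL = [
    'amqp://guest:guest@rabbit-node1:5672//',
    'amqp://guest:guest@rabbit-node2:5672//',
]
```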
This mitigates the problem: the default value of 15 (~924 s = 15.4 min) matches the timeout value we observed. I don't know whether 5 is a good or bad value for this setting. https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
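The ~924 s figure is consistent with the kernel's `net.ipv4.tcp_retries2` sysctl (my reading of which setting the linked ip-sysctl document refers to, since its default is 15): the retransmission timeout roughly doubles from about 200 ms up to a 120 s cap. A back-of-the-envelope check:

```python
# Approximate worst-case time before the kernel abandons an unacked
# TCP segment, given tcp_retries2 retransmissions. The 200 ms / 120 s
# bounds are the usual kernel defaults (TCP_RTO_MIN / TCP_RTO_MAX).
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds

def worst_case_timeout(tcp_retries2):
    rto, total = TCP_RTO_MIN, 0.0
    for _ in range(tcp_retries2 + 1):  # initial send plus each retry
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX)  # exponential backoff, capped
    return total

print(round(worst_case_timeout(15), 1))  # → 924.6, i.e. the ~15.4 min observed
```

Lowering the sysctl (e.g. to 5, as in the comment above) shrinks this bound dramatically, at the cost of giving up on genuinely slow connections sooner.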
Closing to continue the discussion on the older issue.
We have a RabbitMQ HA cluster as broker, behind a load balancer.
During my failover tests I ran into some issues I want to address/discuss.
If I force a poweroff of one node of the RabbitMQ cluster, the celery workers don't reconnect through the load balancer to the second node, and instead fail with:
error: [Errno 104] Connection reset by peer
Setup
software -> celery:3.1.25 (Cipater) kombu:3.0.37 py:2.7.5
billiard:3.3.0.23 py-amqp:1.4.9
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:amqp results:redis
Broker
RabbitMQ HA cluster as broker; SSL is used.
policy
ha-sync-mode: automatic
ha-mode: all
pattern: .*
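The mirroring policy above corresponds roughly to the following `rabbitmqctl` invocation (the policy name `ha-all` is illustrative, not taken from this issue):

```shell
# Mirror all queues across all cluster nodes, with automatic sync.
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```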
Worker
Steps to reproduce
1. rabbitmqctl stop_app on node1 to let workers reconnect to node2. Reconnect works fine.
2. rabbitmqctl start_app on node1 to be HA protected again.
Expected behavior
Celery workers reconnect after a few seconds to the next broker node
Actual behavior
Celery workers are stuck for 15 minutes, until a socket timeout occurs, before reconnecting to the broker.
I haven't tested against master yet, because that would require a major upgrade of our software project and settings.