Celery 4 worker can't connect to RabbitMQ broker failover #3921
I have this same issue.
@csterk or @draskomikic, could you try the following patch please? Change this line of the …
If this works, we can then start working on a proper fix for this bug.
Hi George, I have tried to set … I have also tried to set the Celery configuration parameter … I have also added one line in the amqp/transport.py library to print out the node …; the line is added before … According to the logs, it seems that Celery has a conflict in connecting to … You can see the Celery log output when I stop the rabbitmq1 RabbitMQ node:
@draskomikic did you try starting the worker with the …?
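The flag in question is presumably Celery's `--without-mingle` worker option, given the later mention of mingle in this thread; a hypothetical invocation for testing failover without the startup mingle step (`proj` is a placeholder app name):

```
celery -A proj worker --loglevel=info --without-mingle
```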
Hi George, I have run some basic tests using … Thank you very much for all your help; let me know if I can help with this bug fix in any way.
@draskomikic great 😄, happy to help! Yes, I believe we need to solve this. I will give it some thought and let you know.
@georgepsarakis Great job troubleshooting this. Have you thought about how to fix it? Btw, can you use the labels to triage issues? It's easier for others to filter them based on their expertise.
@thedrow you mean adding the … label? I haven't yet concluded how we can fix this, as it goes beyond my understanding of the project internals. I will have another look and let you know if I have any suggestions.
I have exactly the same problem, and …
Looks like it's a bit deeper. The same issue exists with the task producer: if it's connected to a rabbit node and that node dies, the producer is unable to establish a stable connection to the failover node.
And on the failover rabbit node: …
And the mingle stuff has nothing to do with the producer, right?
@draskomikic @monitorius can you please try this patch: celery/kombu#724?
My kombu/connection.py looks like this now: …
And I still have reconnects: …
With a debugger I can see that the infinite repeats come from here: … The stack leads to:
https://github.com/celery/py-amqp/blob/v2.1.4/amqp/transport.py#L275
@monitorius are you using the RPC result backend? If so, could you also try another backend?
Here is the problem: … Literally, we have a new, alive channel, but we keep trying to send messages to the dead channel just because … As a proof of concept I inserted this dirty hack here: …, changing … I'm new to the kombu code, but it looks like a serious problem, because …
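To make the failure mode concrete, here is a rough illustrative sketch (not kombu's actual code): a publisher caches its channel at construction time, so after a failover it keeps writing to the dead channel. The dirty-hack idea is to re-resolve `connection.default_channel` on every call instead:

```python
from kombu import Producer


class CachedChannelPublisher:
    """Hypothetical sketch of the stale-channel pattern described above."""

    def __init__(self, connection):
        self.connection = connection
        # Cached once; after a broker failover this channel object is dead.
        self.channel = connection.default_channel

    def publish(self, body):
        # Dirty-hack idea: re-resolve the default channel on every call.
        # kombu re-creates default_channel after a reconnect, so this
        # always targets a live channel instead of the cached self.channel.
        producer = Producer(self.connection.default_channel)
        producer.publish(body, routing_key='tasks')
```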
@monitorius great work debugging this. I may have an alternative. What happens if you change the line to: …
If you mean …
With the RPC backend disabled, failover works fine: …
Sorry for the late response, guys. I did a quick test with the patch celery/kombu#724 and it seems to be working. I am not using the RPC result backend. I will do more thorough testing in a few hours. UPDATE: When I include …
Oh, right. In fact, we are talking about two different cases (with the same symptoms) here: …
I tested the patch celery/kombu#724 only for case 2), and my last messages are about the producer. I've just tested case 1), and @draskomikic is right: it works fine with workers.
@monitorius correct. Now that I think of it, I believe this happens because the exception type that is raised is not included in the list of …
I believe it is safe to change the line to: …
What do you think?
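For context on those error lists: kombu's retry helpers only retry exception types that the connection declares recoverable. A minimal sketch of a failover-aware producer using kombu's public API (broker hostnames and the queue name are placeholders):

```python
from kombu import Connection, Exchange, Producer, Queue

# Semicolon-separated URLs give kombu alternate brokers to fail over to.
conn = Connection(
    'amqp://rabbitmq1//;amqp://rabbitmq2//',
    failover_strategy='round-robin',
)
task_queue = Queue('tasks', Exchange('tasks'), routing_key='tasks')

producer = Producer(conn)
# ensure() retries publish only on exceptions found in
# conn.recoverable_connection_errors / conn.recoverable_channel_errors;
# an exception type missing from those lists escapes the retry loop,
# which is the behaviour being discussed above.
safe_publish = conn.ensure(producer, producer.publish, max_retries=3)
safe_publish({'hello': 'world'}, routing_key='tasks', declare=[task_queue])
```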
@monitorius I still haven't figured out how to fully fix the issue with the RPC backend. After the connection retries, it seems that the endless loop is transferred to kombu …
@georgepsarakis So celery/kombu#724 is only a partial fix? |
As verified above by @monitorius and @draskomikic, the worker issue is resolved. However, I unfortunately could not get the RPC backend to work, hence my latest comment.
@georgepsarakis Thanks for all the hard work and prompt responses on this issue.
Is this issue resolved? I've tried the dirty fix from @monitorius with …
As far as we can tell, there's only a partial fix.
The full fix won't make it into 4.1.0, unfortunately.
@draskomikic I tried to fix this issue by proxying default_channel (which will automatically replace the default channel in case of a reconnect). Can you please test whether it works for you (before I create a pull request)? https://github.com/mirasrael/kombu/tree/fix-reconnect-for-send-message
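The idea, roughly (a hypothetical sketch, not the code in the linked branch): never hand out a raw channel object; hand out a proxy that resolves `connection.default_channel` on every attribute access, so a post-reconnect channel is picked up automatically.

```python
class DefaultChannelProxy:
    """Hypothetical sketch: delegate to the connection's current default channel."""

    def __init__(self, connection):
        self._connection = connection

    def __getattr__(self, name):
        # default_channel re-establishes the connection if necessary, so
        # after a failover this transparently targets the new channel.
        return getattr(self._connection.default_channel, name)
```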
Can you look into the last comment on issue #4075?
@mirasrael please send a PR with a test.
Also encountered this issue (I think) using Celery 4.1.0; the rabbitmq log was filling up with thousands of messages like: …
This happened after restarting the device running the rabbitmq server and worker. The client calling the tasks would hammer rabbitmq once that server and worker came back online. Removing the …
Is 4.2 coming any time soon, then?
Can anyone confirm this issue with the latest release?
I have 3 RabbitMQ nodes in a cluster in HA mode. Each node runs in a separate Docker container.
I have used this command to set HA policy:
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
Celery config looks like this: …
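A minimal illustrative configuration for a three-node failover setup could look like this (credentials and hostnames are placeholders, not the reporter's actual settings):

```python
# Illustrative placeholder config, not the reporter's actual settings.
broker_url = [
    'amqp://guest:guest@rabbitmq1:5672//',
    'amqp://guest:guest@rabbitmq2:5672//',
    'amqp://guest:guest@rabbitmq3:5672//',
]
# Cycle through the brokers above when the active one becomes unreachable.
broker_failover_strategy = 'round-robin'
```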
Everything works fine until I stop the master RabbitMQ application in order to test Celery's failover feature, using this command:
rabbitmqctl stop_app
Immediately after the RabbitMQ application is stopped, I start seeing the errors in the log below. The frequency of the log messages is very high, and it does not slow down with the number of attempts.
According to the logs, Celery tries to reconnect using the next failover node but gets interrupted by another attempt to reconnect to the node that was stopped. The same thing happens over and over, as if in an infinite loop.
pip list: …