Allow recovery from rabbitmq restart #324

Merged: 6 commits, Apr 21, 2023

mvdbeek commented Apr 20, 2023

ConnectionForced seems very deliberate, and handling it may not fix all of #316, but a controlled restart of a RabbitMQ server results in:

```
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]: Exception in thread consume-kill-pyamqp://pulsar_au:********@gat-4.eu.galaxy.training:5671//pulsar/pulsar_au?ssl=1:
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]: Traceback (most recent call last):
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     self.run()
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/usr/lib/python3.10/threading.py", line 953, in run
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     self._target(*self._args, **self._kwargs)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/pulsar/messaging/bind_amqp.py", line 52, in drain
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     __drain(name, queue_state, pulsar_exchange, callback)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/pulsar/messaging/bind_amqp.py", line 100, in __drain
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     pulsar_exchange.consume(name, callback=callback, check=queue_state)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/pulsar/client/amqp_exchange.py", line 119, in consume
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     connection.drain_events(timeout=self.__timeout)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/kombu/connection.py", line 318, in drain_events
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     return self.transport.drain_events(self.connection, **kwargs)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/kombu/transport/pyamqp.py", line 101, in drain_events
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     return connection.drain_events(**kwargs)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/amqp/connection.py", line 522, in drain_events
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     while not self.blocking_read(timeout):
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/amqp/connection.py", line 528, in blocking_read
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     return self.on_inbound_frame(frame)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/amqp/method_framing.py", line 53, in on_frame
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     callback(channel, method_sig, buf, None)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/amqp/connection.py", line 534, in on_inbound_method
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     return self.channels[channel_id].dispatch_method(
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/amqp/abstract_channel.py", line 143, in dispatch_method
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     listener(*args)
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:   File "/mnt/pulsar/venv/lib/python3.10/site-packages/amqp/connection.py", line 664, in _on_close
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]:     raise error_for_code(reply_code, reply_text,
Apr 20 09:20:12 gat-24.oz.galaxy.training pulsar[71616]: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
```

and this fixes that.
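
For readers who want the general shape of such a fix, here is a minimal sketch of a consumer loop that survives a forced connection closure rather than letting the thread die. It is not the diff from this PR. `consume_with_recovery`, `drain_once`, and `should_continue` are hypothetical names, and the `ConnectionForced` class below is a local stand-in for `amqp.exceptions.ConnectionForced` so the sketch runs without the amqp package installed.

```python
import time


class ConnectionForced(Exception):
    """Local stand-in for amqp.exceptions.ConnectionForced (assumption for this sketch)."""


def consume_with_recovery(drain_once, should_continue,
                          recoverable=(ConnectionForced,),
                          delay=1.0, sleep=time.sleep):
    """Call drain_once() while should_continue() is true.

    A recoverable error does not end the loop: the function waits `delay`
    seconds and tries again. Returns the number of recoveries performed.
    """
    recoveries = 0
    while should_continue():
        try:
            drain_once()
        except recoverable:
            # Broker went away (e.g. a controlled restart); retry later
            # instead of propagating the exception out of the thread.
            recoveries += 1
            sleep(delay)
    return recoveries
```

In a real consumer, `drain_once` would open a fresh connection before draining events, so that each retry reconnects to the restarted broker.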

mvdbeek force-pushed the dont_die_on_rabbitmq_restart branch from 33724c0 to 73caeee on April 20, 2023 10:10
Should fix
```
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]: pulsar.client.manager ERROR 2023-04-20 10:35:34,753 [pN:handler_0,p:1034676,tN:pulsar_client__default__kill_ack] Exception while handling kill acknowledgement messages, this shouldn't really happen. Handler should be restarted.
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]: Traceback (most recent call last):
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 446, in _reraise_as_library_errors
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     yield
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 433, in _ensure_connection
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     return retry_over_time(
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     return fun(*args, **kwargs)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 877, in _connection_factory
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self._connection = self._establish_connection()
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 812, in _establish_connection
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     conn = self.transport.establish_connection()
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     conn.connect()
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/amqp/connection.py", line 323, in connect
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self.transport.connect()
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/amqp/transport.py", line 129, in connect
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self._connect(self.host, self.port, self.connect_timeout)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/amqp/transport.py", line 184, in _connect
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self.sock.connect(sa)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]: ConnectionRefusedError: [Errno 111] Connection refused
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]: The above exception was the direct cause of the following exception:
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]: Traceback (most recent call last):
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/pulsar/client/manager.py", line 191, in ack_consumer
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self.exchange.consume(queue_name + '_ack', None, check=self)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/pulsar/client/amqp_exchange.py", line 124, in consume
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     with kombu.Consumer(connection, queues=[queue], callbacks=callbacks, accept=['json']):
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/messaging.py", line 387, in __init__
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self.revive(self.channel)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/messaging.py", line 400, in revive
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     channel = self.channel = maybe_channel(channel)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 1052, in maybe_channel
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     return channel.default_channel
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 895, in default_channel
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self._ensure_connection(**conn_opts)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 432, in _ensure_connection
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     with ctx():
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     self.gen.throw(typ, value, traceback)
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:   File "/srv/galaxy/venv/lib/python3.10/site-packages/kombu/connection.py", line 450, in _reraise_as_library_errors
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]:     raise ConnectionError(str(exc)) from exc
Apr 20 10:35:34 gat-4.eu.galaxy.training galaxyctl[1034676]: kombu.exceptions.OperationalError: [Errno 111] Connection refused
```

on the client side (= Galaxy) when restarting RabbitMQ.
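
While the broker is still down, every reconnect attempt on the client fails with "Connection refused", so a retry loop benefits from spacing out its attempts. The sketch below shows one way to do that with a capped exponential backoff. It is an illustration only, not the code merged here. `run_ack_consumer`, `backoff_delays`, `consume`, and `is_active` are hypothetical names, and `OperationalError` is a local stand-in for `kombu.exceptions.OperationalError`.

```python
import time


class OperationalError(Exception):
    """Local stand-in for kombu.exceptions.OperationalError (assumption for this sketch)."""


def backoff_delays(initial=1.0, factor=2.0, maximum=30.0):
    """Yield retry delays that grow by `factor` and are capped at `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def run_ack_consumer(consume, is_active, sleep=time.sleep):
    """Keep calling consume() while is_active() is true.

    An OperationalError (for example, connection refused while the broker
    restarts) triggers a backoff sleep instead of ending the loop.
    """
    delays = backoff_delays()
    while is_active():
        try:
            consume()
            # A clean return means the broker is reachable again.
            delays = backoff_delays()
        except OperationalError:
            sleep(next(delays))
```

The cap keeps the worst-case reconnect latency bounded after a long outage, at the cost of a few extra failed attempts while the broker is down.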
mvdbeek marked this pull request as ready for review April 20, 2023 12:14
jmchilton merged commit e487494 into galaxyproject:master Apr 21, 2023
12 of 14 checks passed