wsrelay does not re-establish connection after db outage #15030

Closed
5 of 11 tasks
TheRealHaoLiu opened this issue Mar 25, 2024 · 1 comment

Comments

@TheRealHaoLiu
Member

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

After an (unspecified) period of DB outage, wsrelay does not correctly re-establish its connection, which results in websockets no longer working.

AWX version

24.0.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

  • deploy awx on kube
  • scale down awx-operator
  • scale down the postgres statefulset for ~60 seconds
  • scale the postgres statefulset back up

Expected results

websocket (like job log live update) works in the UI

Actual results

websockets stop working

Additional information

```
future: <Task finished name='Task-9' coro=<WebSocketRelayManager.on_ws_heartbeat() done, defined at /var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/wsrelay.py:221> exception=OperationalError('consuming input failed: server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.')>
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/wsrelay.py", line 223, in on_ws_heartbeat
    async for notif in conn.notifies():
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/connection_async.py", line 315, in notifies
    raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
```

We see that the recent change to add TCP keepalive does cause the pg listen connection to terminate correctly after the database has been down for > 25 seconds (yay).

After some digging with @jbradberry, we found that the task created by `event_loop.create_task(self.on_ws_heartbeat(async_conn))` terminated, but the main loop outside it is still running (since it doesn't access the database).
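
To illustrate the failure mode, here is a minimal standalone sketch (not the actual wsrelay code). `failing_heartbeat` and `main_loop` are hypothetical names, and `RuntimeError` stands in for `psycopg.OperationalError`. It shows that a task started with `create_task` can finish with an exception while the loop that spawned it keeps running, unaware:

```python
import asyncio


async def failing_heartbeat():
    # Hypothetical stand-in for on_ws_heartbeat losing its db connection.
    raise RuntimeError("server closed the connection unexpectedly")


async def main_loop(ticks=3):
    # Fire-and-forget: nothing awaits this task, so its failure goes unnoticed.
    task = asyncio.get_running_loop().create_task(failing_heartbeat())
    completed_ticks = 0
    for _ in range(ticks):
        # The outer loop never touches the database, so it keeps going.
        await asyncio.sleep(0)
        completed_ticks += 1
    # Only an explicit check reveals that the task is already dead.
    return completed_ticks, task.done(), type(task.exception()).__name__


result = asyncio.run(main_loop())
```

If nothing ever inspects the task, asyncio only logs the exception when the task object is garbage collected, which would be consistent with the `future: <Task finished ...>` line above.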

We determined that we need to re-establish the db connection and restart the `on_ws_heartbeat` task after the db connection has been lost.

The end goal: the wsrelay process should not terminate even when there's a db connection problem; it should keep retrying to establish a connection forever.
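
A rough sketch of what such a retry loop could look like, for discussion only (this is not the actual fix). `run_forever_with_reconnect`, `connect`, `listen`, `ConnectionLost`, and `max_cycles` are all hypothetical names: `ConnectionLost` stands in for `psycopg.OperationalError`, and `max_cycles` exists only so the sketch can be exercised without looping forever.

```python
import asyncio


class ConnectionLost(Exception):
    """Stand-in for psycopg.OperationalError in this sketch."""


async def run_forever_with_reconnect(connect, listen, retry_delay=1.0, max_cycles=None):
    """Keep a listener running, reconnecting whenever the connection drops.

    connect:    hypothetical async callable that returns a connection
    listen:     hypothetical async callable that consumes notifications
                until the connection fails
    max_cycles: for demos and tests only; None means retry forever
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        try:
            conn = await connect()
            await listen(conn)
        except ConnectionLost:
            # Connecting failed or the connection dropped: wait, then start over.
            await asyncio.sleep(retry_delay)
    return cycles
```

In wsrelay terms, `listen` would roughly correspond to `on_ws_heartbeat`, and `connect` to whatever opens the async pg connection. A backoff on `retry_delay` might be worth considering so a long outage doesn't produce a tight reconnect loop.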
@TheRealHaoLiu
Member Author

fixed by #15031
