PostgreSQL reconnection in case of DB failover or idle or unstable connection #13505
Comments
Hi. Will #13507 be continued, or is there another solution?
I hope that @AlanCoding can check it for the next release.
ansible/awx-operator#1393: I think this PR resolves the problem, please verify.
@tanganellilore did you have a chance to verify whether the fix works?
@TheRealHaoLiu I tested today, and the new params with the keepalive check work great. Thanks a lot for the solution!
Hi. I tested this now from our end with AWX 22.3.0 and got the following behavior with our pgpool based HA postgres setup:
I tried the default settings for keepalive and also experimented with the values, but I always got the same result.
In my case I see errors in awx-task, both on a manual failover (with Patroni) and on a real DR (shutting down the master PostgreSQL).
Hi, I tested again with different settings in our K8s-based Postgres setup and found that the ClusterIP service in Kubernetes is probably the reason jobs cannot be scheduled after the database DR switchover. I created another, headless service, as in the standalone DB setup, and with it the problem of unschedulable jobs does not occur. Here we seem to get a hard cut in the connection, which leads to a reconnect and to an AWX web/task service that is unreachable for a few seconds. @tanganellilore Can you also see this behavior? I can also create another issue for this if we want to handle it separately.
Do you mean jobs that finish at the same time as the DB switchover, or a long-running job that is running during the DB switchover?
The jobs that finish during the period in which the switchover happens and we have no DB connection.
OK, I don't know how to test this edge case. Is your switchover fast, or does it take some time, like 20-30 seconds? You can probably customize the settings and keepalive to cover this edge case. To be honest, this is an HA solution, not a BC solution, so I expect that some edge cases (like this one) are not covered well. Everything depends on your configuration and replication mode.
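To make the keepalive tuning discussed in the comments above more concrete, here is a minimal sketch of how the libpq TCP keepalive parameters relate to how long a dead connection can go unnoticed. The parameter names (`keepalives`, `keepalives_idle`, `keepalives_interval`, `keepalives_count`) are standard libpq connection parameters. The helper functions and the numeric values are hypothetical and are not taken from AWX or the awx-operator defaults. The detection time is an approximation for an idle connection.

```python
def keepalive_options(idle: int = 5, interval: int = 5, count: int = 5) -> dict:
    """Build libpq TCP keepalive parameters (values here are illustrative only)."""
    return {
        "keepalives": 1,                  # enable TCP keepalives on the client socket
        "keepalives_idle": idle,          # seconds of inactivity before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before the connection is dropped
    }


def worst_case_detection_seconds(options: dict) -> int:
    """Approximate time for an idle connection to be declared dead."""
    return (
        options["keepalives_idle"]
        + options["keepalives_interval"] * options["keepalives_count"]
    )


if __name__ == "__main__":
    opts = keepalive_options(idle=5, interval=5, count=5)
    print(opts)
    print(worst_case_detection_seconds(opts), "seconds until a dead peer is noticed")
```

A dict like this could be passed as keyword arguments to a libpq-based driver, or placed under `OPTIONS` in a Django database configuration. Whether the resulting detection time is shorter than the switchover time is what decides if the client reconnects promptly.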
Please confirm the following
Bug Summary
Hi team,
as discussed on Matrix with @TheRealHaoLiu and @shanemcd, I noticed that with an unstable connection to PostgreSQL, or on failover (in a PostgreSQL HA configuration), awx.main.dispatcher loses its connection to the DB and does not reconnect.
I tried multiple use cases, but reconnection currently works only for a very short connection drop (less than 1 second), and not every time.
I noticed that a similar issue, #12683, was already opened but closed for the wrong reason.
I opened a PR to try to fix this issue with a simple connection check with retries in the pg_bus_conn function (by default 40 times, every 4 seconds).
Once the connection is restored, we need to kill the dispatcher process in the awx-task container so that it restarts.
I don't know whether a different approach is possible, but this way we are sure that all processes restart correctly.
From the UI perspective, or in the awx-web logs, I don't see any drop; everything works well.
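The retry idea described in the summary above can be sketched roughly as follows. This is not the code from the PR and not AWX's pg_bus_conn implementation. `connect_with_retry` and its parameters are hypothetical names, and only the defaults (40 attempts, 4 seconds apart) come from the description above. The `connect` argument stands in for whatever callable opens the database connection.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    retries: int = 40,
    delay: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `retries` attempts are used up."""
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except Exception as exc:
            # Assumption: real code would catch only the driver's connection error.
            last_error = exc
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    raise ConnectionError(
        f"database unreachable after {retries} attempts"
    ) from last_error
```

With the defaults this keeps trying for a little over two and a half minutes (39 waits of 4 seconds) before giving up, which is the window in which a failover would have to complete.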
AWX version
21.10.2
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Use AWX normally, then restart PostgreSQL or detach its network, and check the awx-task log.
The connection will be dropped and all jobs will stay in pending status.
Expected results
If the PostgreSQL connection drops, it should be re-established.
Actual results
Error on awx-task
Additional information
No response