fix PostgreSQL reconnection in case of DB failover or idle or unstable connection #13507
Conversation
Issue: #13505 |
@AlanCoding I used a different approach after a lot of testing and trying different use cases. I adopted the usage of a "ping" in Postgres, so by simply adding one line of code we can catch the issue :-) Obviously, I run a "query" every 5 seconds in case there is no notification to be processed, but I think this is acceptable. A sketch of the idea follows. |
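A minimal sketch of the "ping" idea, under assumptions: the function name and channel are hypothetical, and the actual PR code may differ. The point is that when the 5-second wait for notifications times out, a trivial query on the listening connection makes a dead socket raise an error instead of going unnoticed forever.

```python
import select

import psycopg2


def listen_with_ping(dsn, channel='dispatcher', timeout=5):
    """Yield pg_notify events; 'ping' the server on every idle timeout."""
    conn = psycopg2.connect(dsn)
    conn.set_session(autocommit=True)  # required for LISTEN/NOTIFY
    with conn.cursor() as cur:
        cur.execute(f'LISTEN {channel};')
    while True:
        if select.select([conn], [], [], timeout) == ([], [], []):
            # Timed out with nothing to process: run a trivial query so a
            # dropped connection surfaces as an OperationalError here.
            with conn.cursor() as cur:
                cur.execute('SELECT 1')
        else:
            conn.poll()
            while conn.notifies:
                yield conn.notifies.pop(0)
```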
@tanganellilore Could you provide more detail about the error you experienced and what you saw? Screenshots, logs, or anything else you can share would be very helpful. Thank you for your time and assistance with this matter! |
Specifically, as you're adding another class of error to be caught, it would be really relevant to see what tracebacks were produced in the failover events that you had. |
Good morning @AlanCoding, this PR tries to catch the reconnection error also in the main notify loop. Some screenshots of the error:

Simulating a failover by detaching the VIP address from the master PostgreSQL instance, with the current release:
- Failover started at 08:16:20
- On the AWX UI all jobs are pending: (screenshot)
- Inside awx-task, the awx-manage command: (screenshot)

Simulating a failover by detaching the VIP address from the master PostgreSQL instance, with this PR:
- Failover started at 08:33:40
- awx-task logs: (screenshot)
- On the AWX UI all jobs are running, because we get a reconnection at 08:34:15: (screenshot)
- Inside awx-task, the awx-manage command: (screenshot) |
Thanks for the reply. I really want to understand the mechanics of what you've seen here. If I go to shell_plus and run:

```python
In [1]: from awx.main.dispatch import pg_bus_conn

In [2]: with pg_bus_conn(new_connection=True) as conn:
   ...:     for e in conn.events(yield_timeouts=True):
   ...:         print(e)
   ...:
None
None
None
None
```

Then I will continue getting a traceback: (traceback)
At first glance, this seems like the same sort of traceback you had, although these are still indirect logs. The log I have, seen from the dispatcher listener's perspective, is: (log)
Again, consistent with my shell_plus testing. This is showing that it falls within the expected error handling. |
Thanks @AlanCoding for the reply. I understand what you mean, but I probably caught why you are not able to reproduce it. One of the most standard solutions for a failover is to move the VIP IP from the master to one of the slaves, or to use HAProxy. What I tried to understand is why, in case of failover, I receive an exception from awx.conf.settings instead of from the main loop process, and this was my answer. |
Here are some logs: (logs) You can see that there isn't any "connection" drop. So I confirm that this happens only on a single instance, or in an HA solution where all the PostgreSQL nodes go down at the same time. If we have HA with Patroni, pgbouncer and vip-manager, or HAProxy with a VIP address, old connections go down, but it seems this is not detected by the psycopg2 library (probably because it happens in a very short time). In any case, I noticed that nothing checks whether the listening connection is still alive. |
Good morning @AlanCoding. And this is normal for me, because the VIP address will be moved to one of the slaves, and nobody tells the main loop that this connection was dropped (it doesn't receive an ack, a drop status or anything like that, because the master was shut down non-gracefully). This is my lab (created with https://github.com/vitabaks/postgresql_cluster): (diagram)

What I tried on the VIP address side:
- Failover via Patroni, and failover via detaching the network from the VM or shutting it down non-gracefully: KO, psycopg2 does not catch that the connection went down and back up in such a short time |
Thanks for digging and for the information you gave here. I can now much better see why there isn't an underlying traceback associated with the problem, and that the problem is a bit lower-level, on the network side. I brought this up at the community meeting #13563.

Engineering-wise, it might be obvious that I don't like adding another periodic SELECT, particularly in such a low-level, fast 5-second loop. But you have good points that suggest it may be unavoidable, in which case increasing the timeout might be a good counter-balance.

We already run tasks in the main loop on a periodic basis, and these can already talk to the database (and other systems). The weird and non-obvious point to know is that this process maintains 2 database connections - one for pg_notify listening and the other for Django ORM actions. The latter connection is closed for at least scale-up events. If the connection might ever be closed, then it is no good for listening to pg_notify, because it would lose messages while closed. The entire logic for this is suspicious, and I don't feel like both connections should be necessary. If those could be consolidated, my preference would be to test the connection in normal dispatcher periodic tasks (a sketch of that idea follows). This would be a much larger change that risks unexpected fallout, but I am thinking about it. |
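For reference, a minimal sketch of what testing the connection inside a periodic task could look like; the task name is hypothetical, but `is_usable()` is stock Django and issues a trivial statement on the wire:

```python
from django.db import connection


def check_db_connection():
    # Hypothetical periodic dispatcher task, not the actual AWX code.
    # If an ORM connection exists but has gone stale, close it so the
    # next ORM query transparently opens a fresh one.
    if connection.connection is not None and not connection.is_usable():
        connection.close()
```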
I wasn't able to join the community meeting, so I will answer here. Consider that in some of my debug sessions, I noticed that on an AWX instance under normal use, None events are very, very rare, so this counter can be very low, like 10/12. So in the worst case we stay without a DB connection for 1 minute, with one single check per minute. |
Maybe! It will not hit the timeout every loop, because it exits early if it gets a message. We don't need to pretend that we don't know anything about the event load coming in. Every single AWX node has periodic tasks running that are submitted to the local queue. See: Line 452 in 1147559
So, since there is essentially a cron job submitting a task every 20 seconds, if you hit 4 timeouts (5s x 4), then something's wrong and it's time to panic. It could be because the fork submitting periodic tasks had an error, but it could also be due to this problem, a problem with the listening connection. |
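Sketching that heuristic (hypothetical wrapper, assuming a pg_bus_conn-style object whose `events()` yields None on each 5-second timeout, as in the shell_plus test above; the threshold of 4 comes from the 5s timeout times the ~20s periodic task cadence):

```python
def events_with_health_check(conn, max_idle_timeouts=4):
    """Yield pg_notify events; verify the connection after ~20s of silence."""
    idle = 0
    for event in conn.events(yield_timeouts=True):
        if event is None:
            idle += 1
            if idle >= max_idle_timeouts:
                # Total silence for ~20s even though periodic tasks are
                # submitted every 20 seconds: assume the listening connection
                # may be stale, and let a trivial query raise if it is dead.
                with conn.cursor() as cur:  # hypothetical accessor
                    cur.execute('SELECT 1')
                idle = 0
        else:
            idle = 0
            yield event
```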
Hi @AlanCoding, any update on merging the fix for the unstable Postgres DB connections above?

`[-] awx.conf.settings Database settings are not available, using defaults. error: connection already closed` |
Hi @AlanCoding, any update on merging the fix for the unstable Postgres DB connections above? |
I'm glad that you put it on the community meeting schedule. @tanganellilore has done excellent work here, and I'm on board with the latest state of our discussion - that we need to SELECT something with the listening connection periodically, to detect the case where it has gone stale. In the meantime, other changes have been landing in the devel branch. |
@AlanCoding I tried to update my fork with a new branch from devel, and tried the different solution that we discussed (check the connection after X iterations). I know that with the web/task split, |
Finally got around to pulling down your change and testing it... apologies for the amount of time that it took.

Setup:

Result:
|
Hi team, from your tests it seems that it does not work on RDS, but I also remember some issues with RDS in general. When I'm ready I will ping you and we can look at the output. |
@tanganellilore here's a PR that we are currently testing to solve the problem: ansible/awx-operator#1393. It seems to be working. |
@tanganellilore have you tested out ansible/awx-operator#1393 to see if it works for you? If it does, let's close out this PR. |
As per my comment on #13505, thanks for the solution!! |
SUMMARY
I noticed that in case of an unstable connection with PostgreSQL, or a failover (in a PostgreSQL HA configuration), awx.main.dispatcher loses the connection with the DB and does not reconnect.
I opened a PR to try to fix this issue with a simple connection check and some retries in the pg_bus_conn function (by default, 40 times every 4 seconds).
If the connection is restored, we kill the dispatcher process in the awx-task container and wait for it to restart.
I don't know if a different approach is possible, but this way we are sure that all processes restart correctly. A sketch of the retry loop is below.
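A hedged sketch of the retry loop described above; the function name is hypothetical, and the 40 retries every 4 seconds are the defaults quoted in this summary:

```python
import time

import psycopg2


def connect_with_retry(dsn, retries=40, delay=4):
    """Try to (re)establish the listening connection, per the PR defaults.

    If every attempt fails, re-raise so the dispatcher process dies and
    the awx-task container restarts it cleanly.
    """
    for attempt in range(1, retries + 1):
        try:
            conn = psycopg2.connect(dsn)
            conn.set_session(autocommit=True)  # needed for LISTEN/NOTIFY
            return conn
        except psycopg2.OperationalError:
            if attempt == retries:
                raise
            time.sleep(delay)
```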