
Fix PostgreSQL reconnection in case of DB failover or idle/unstable connection #13507

Closed

Conversation

tanganellilore
Contributor

SUMMARY

I noticed that in case of an unstable connection with PostgreSQL, or a failover (in a PostgreSQL HA configuration), awx.main.dispatcher loses the connection to the DB and does not reconnect.

I opened a PR that tries to fix this with a simple connection check plus retries in the pg_bus_conn function (by default 40 retries, every 4 seconds). If the connection is restored we need to kill the dispatcher process in the awx-task container and wait for it to restart.

I don't know whether a different approach is possible, but this way we are sure that all processes restart correctly.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • Other
AWX VERSION
AWX 20.10.2
ADDITIONAL INFORMATION

@tanganellilore
Contributor Author

Issue: #13505

@tanganellilore tanganellilore changed the title fix #13505 fix postgreSQL reconnection in case of DB failover or idle or instable connection Feb 1, 2023
@tanganellilore
Contributor Author

@AlanCoding I used a different approach after a lot of testing and trying different use cases.

I adopted a Postgres "ping": a SELECT 1 in the event loop, issued only when yield_timeouts is true.
That way, if the pg_bus_conn connection set up in the dispatcher's first loop hangs in the main thread (due to a network change on the Postgres side, a connection timeout, or something similar), this simple ping raises an exception immediately and we drop into the retry loop that is already in place.

So by simply adding one line of code we can catch the issue :-)

Obviously, this issues a query every 5 seconds when there is no notification to process, but I think this is acceptable.
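
Roughly, the change looks like this (a simplified sketch, not the exact diff: the method shape and the NOT_READY sentinel follow the traceback quoted later in this thread, and the cursor handling here is illustrative):

# Sketch only: ping the listening connection whenever select() times out, so a
# silently-dropped socket raises psycopg2.OperationalError and falls into the
# retry logic already in place. Not the exact AWX code.
import select

NOT_READY = ([], [], [])  # assumed value of the sentinel used in awx/main/dispatch/__init__.py

class PubSub:
    def __init__(self, conn):
        self.conn = conn  # psycopg2 connection in autocommit mode, LISTEN already issued

    def events(self, select_timeout=5, yield_timeouts=False):
        while True:
            if select.select([self.conn], [], [], select_timeout) == NOT_READY:
                if yield_timeouts:
                    # the added "ping": a dead connection raises OperationalError here
                    with self.conn.cursor() as cur:
                        cur.execute('SELECT 1')
                    yield None
            else:
                self.conn.poll()
                while self.conn.notifies:
                    yield self.conn.notifies.pop(0)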

@djyasin
Member

djyasin commented Feb 9, 2023

@tanganellilore
Thank you for providing this additional information! We are still trying to identify exactly what is occurring here.

Could you provide us with additional information regarding the error you experienced and what you saw? Screenshots, additional logs, or any additional information you can provide would be very helpful.

Thank you for your time and assistance with this matter!

@AlanCoding
Member

Specifically, you're adding another class of error to be caught, which makes it really relevant to be able to see what tracebacks were produced in the failover events that you had.

@tanganellilore
Contributor Author

Good morning @AlanCoding,
I removed the db.Error except clause, since it is not required in this case.

This PR also tries to catch the reconnection error for the main notify loop.
Currently the main loop in /awx/main/dispatch/worker/base.py (line 177) does not catch the connection drop, because it simply creates a connection once and then listens on the queues only once. So if the connection to Postgres drops, for example because of network lag or a Postgres failover, nothing restarts the main loop and re-subscribes to all the queues.
By going through the exception path, we simply restart the main loop, as long as the retry happens before the pg_max_wait time.
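
To make the shape of that retry path concrete, here is an illustrative sketch (not the actual base.py code: pg_max_wait, the 4-second retry delay, and the logger name come from this thread, conn.listen is assumed from how the dispatcher subscribes to its queues, and queues / process_notification are placeholders):

# Illustrative retry wrapper around the listening loop; a sketch, not the real dispatcher code.
import logging
import time

import psycopg2
from django import db

from awx.main.dispatch import pg_bus_conn

logger = logging.getLogger('awx.main.dispatch')

def run_listener(queues, process_notification, pg_max_wait=40, retry_delay=4):
    last_ok = time.monotonic()
    while True:
        try:
            with pg_bus_conn(new_connection=True) as conn:
                for queue in queues:
                    conn.listen(queue)      # assumed subscription call
                last_ok = time.monotonic()  # connected again, reset the retry window
                for e in conn.events(yield_timeouts=True):
                    if e is not None:
                        process_notification(e)
        except (db.DatabaseError, psycopg2.OperationalError):
            if time.monotonic() - last_ok > pg_max_wait:
                logger.exception('Unable to reconnect to postgres within %s s, giving up', pg_max_wait)
                raise
            logger.exception('Error consuming new events from postgres, will retry for %s s', pg_max_wait)
            time.sleep(retry_delay)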

Some screenshots of the error:


Simulating a failover by detaching the VIP address from the master PostgreSQL instance, on the current release:

Failover started at 08:16:20

On awx-task logs:
[screenshot]

In the AWX UI all jobs are pending:
[screenshot]

Inside awx-task, the awx-manage command:

[screenshot]
(all workers are idle)

[screenshot]
(error waiting on the listener)


Simulating a failover by detaching the VIP address from the master PostgreSQL instance, with this PR:

Failover started at 08:33:40

On awx-task logs:
[screenshot]
(we go into the exception)

In the AWX UI all jobs are running because the connection was re-established at 08:34:15:
[screenshot]

Inside awx-task, the awx-manage command:

[screenshot]
(workers now run commands and sync)

[screenshot]
(the listener works)

@AlanCoding
Member

Thanks for the reply. I really want to understand the mechanics of what you've seen here. If I go to awx-manage shell_plus and then run this:

In [1]: from awx.main.dispatch import pg_bus_conn

In [2]: with pg_bus_conn(new_connection=True) as conn:
   ...:     for e in conn.events(yield_timeouts=True):
   ...:         print(e)
   ...: 
None
None
None
None

Then I will continue getting a None every 5 seconds as long as the database is healthy. Now if I essentially shut postgres down (pretty normally I think, it logs about terminating connections then "LOG: shutting down") then it will stop printing those and give:

None
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
Cell In[2], line 2
      1 with pg_bus_conn(new_connection=True) as conn:
----> 2     for e in conn.events(yield_timeouts=True):
      3         print(e)

File /awx_devel/awx/main/dispatch/__init__.py:42, in PubSub.events(self, select_timeout, yield_timeouts)
     40         yield None
     41 else:
---> 42     self.conn.poll()
     43     while self.conn.notifies:
     44         yield self.conn.notifies.pop(0)

OperationalError: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

At first glance, this seems like the same sort of traceback you had, although these are still indirect logs. The log I have, seen from the dispatcher listener's perspective, is:

tools_awx_1 | 2023-02-13 21:18:21,113 ERROR    [a73e7ff7] awx.main.dispatch Error consuming new events from postgres, will retry for 40 s
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 182, in run
tools_awx_1 |     for e in conn.events(yield_timeouts=True):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/__init__.py", line 42, in events
tools_awx_1 |     self.conn.poll()
tools_awx_1 | psycopg2.OperationalError: server closed the connection unexpectedly
tools_awx_1 | 	This probably means the server terminated abnormally
tools_awx_1 | 	before or while processing the request.

Again, consistent with my shell_plus testing. This shows that it falls within the (db.DatabaseError, psycopg2.OperationalError) exceptions being caught. This already sounds like it's doing what you propose, and I am interested in closing any cases of database flake that the current logic does not catch.

@tanganellilore
Contributor Author

Thanks @AlanCoding for the reply. I understand what you mean, but I think I see why you are not able to reproduce it.
The use case is a little different from the one you just shared.
In an HA Postgres solution you have 3 Postgres instances, 1 master and 2 slaves, accessible via a VIP.

One of the most standard solutions for a failover is to move the VIP from the master to one of the slaves, or to use HAProxy.
What I suppose is that the main difference between this case and your example is the connection itself.
In your case you drop all connections: shutting down Postgres takes all the pg_notify listeners down with it.
In the failover case the behavior is different: it seems the main connection is partially broken, or partially restored in some way (or the switch is very fast), but it does not work properly.

What I tried to understand is why, in case of a failover, I receive an exception from awx.conf.settings instead of from the main loop process, and this was my answer:
put a "ping" during the "None" message to check whether the connection is healthy, because we only detect an unhealthy connection when we receive a notify. If we don't receive a notify, we never check whether the connection is healthy...
We assume this is implicit with a single instance, and that's true, but it is not the same with HA Postgres.

@tanganellilore
Contributor Author

Here are some logs:
Failover at 02/14/2023, 7:53:11:

From awx-manage shell_plus:
[screenshot]

You can see that there isn't any connection drop.

So I can confirm that this happens only with a single instance, or in an HA setup where all the PostgreSQL instances go down at the same time.

If we have HA with Patroni, pgbouncer and vip-manager, or HAProxy with a VIP address, the old connections go down, but this does not seem to be detected by the psycopg2 library (probably because the switch happens in a very short time).

In any case, I notice that nothing checks whether self.conn is valid and working when we are in the if select.select([self.conn], [], [], select_timeout) == NOT_READY: branch, so if the connection is reset by Postgres we risk being left with a non-working connection on the AWX side.

@tanganellilore
Contributor Author

Good morning @AlanCoding,
I tried different configurations for Postgres HA, but in some cases the failure is not caught by the exception in the main loop.
To be more precise, in all cases where there is a real failover (detaching the network from the master, or a hard shutdown of the master) it is not caught.

And to me this is expected, because the VIP address is moved to one of the slaves and nothing tells the main loop that its connection was dropped (it does not receive an ack, a disconnect, or anything similar, because the master was not shut down gracefully).
The strange thing is that the error is caught by awx.conf.settings every minute:
[screenshot]

This is my lab (created with https://github.com/vitabaks/postgresql_cluster):
postgresql01: 10.10.1.2 - patroni, vip-manager, etcd, pgbouncer
postgresql02: 10.10.1.3 - patroni, vip-manager, etcd, pgbouncer
etcdarbiter: 10.10.1.4 - etcd
vip: 10.10.1.5

What I tried:
Server side:

  • with pgbouncer (using port 6432)
  • without pgbouncer (using port 5432)

Address side:

  • with a DNS name that resolves to the VIP
  • without DNS (directly the IP address)

Failover via Patroni (patronictl failover cluster) or a normal shutdown of the Postgres master: success (expected, because a manual failover or reboot drops all connections and probably sends something that is caught by psycopg2).

Failover via detaching the network from the VM or a non-graceful shutdown: KO, psycopg2 does not catch that the connection dropped and came back up within a short time.

@AlanCoding
Member

Thanks for digging and for the information you gave here. I now see much more clearly why there isn't an underlying traceback associated with the problem, and that the problem is a bit lower-level: select.select does not realize that a failover has happened and made the socket it was using defunct.

I brought this up at the community meeting #13563

Engineering-wise, it might be obvious that I don't like adding another periodic SELECT, particularly in such a low-level, fast 5-second loop. But you have good points suggesting it may be unavoidable, in which case increasing the timeout might be a good counter-balance.

We already run tasks in the main loop on a periodic basis, and these can already talk to the database (and other systems). The weird and non-obvious point to know is that this process maintains 2 database connections - one for pg_notify listening and the other for Django ORM actions. The latter connection is closed for at least scaleup events. If a connection might ever be closed, then it is no good for listening to pg_notify, because it would lose messages while closed.

The entire logic for this is suspicious, and I don't feel like both connections should be necessary. If those could be consolidated, my preference would be to test the connection in the normal dispatcher periodic tasks. This would be a much larger change that risks unexpected fallout, but I am thinking about it.

@tanganellilore
Contributor Author

I wasn't able to join the community meeting, so I will answer here.
I have the same concern about a SELECT every 5 seconds, although we can consider that the connection to the DB is already in place, so a SELECT would not be that heavy.
To me, probably the faster and more affordable solution, until a complete rework of the dispatching (or perhaps a rework of pg_notify 😄), would be to add a counter and check the connection after X times that we receive the None event, as sketched below.

Consider that in some of my debug sessions I noticed that on an AWX instance under normal use the None events are very, very rare, so this counter can be quite low, like 10 or 12.
So in that case you check the connection about every minute, on a connection that is already in place.

So in the worst case, we stay without a DB connection for about 1 minute, with a single check per minute.
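
Roughly what I have in mind (a sketch with illustrative names: conn is the object returned by pg_bus_conn, conn.conn is the underlying psycopg2 connection seen in the traceback above, and process_notification stands in for the normal dispatch handling):

# Sketch of the counter idea: only probe the connection after N consecutive
# timeouts (None events), instead of on every 5-second loop.
IDLE_TIMEOUTS_BEFORE_CHECK = 12  # at ~5 s per timeout this is roughly one probe per minute

idle_timeouts = 0
for e in conn.events(yield_timeouts=True):
    if e is None:
        idle_timeouts += 1
        if idle_timeouts >= IDLE_TIMEOUTS_BEFORE_CHECK:
            # cheap health probe on the connection already in place; a stale
            # socket raises OperationalError here and drops us into the retry path
            with conn.conn.cursor() as cur:
                cur.execute('SELECT 1')
            idle_timeouts = 0
    else:
        idle_timeouts = 0
        process_notification(e)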

@AlanCoding
Member

after X times that we receive the None event.

Maybe!

It will not hit the timeout every loop, because it exits early if it gets a message. We don't need to pretend that we don't know anything about the event load coming in. Every single AWX node has periodic tasks running that are submitted to the local queue. See:

CELERYBEAT_SCHEDULE = {

So, since there is essentially a cron job submitting a task every 20 seconds, if you hit 4 timeouts (5 s × 4...) then something's wrong and it's time to panic. It could be because the fork submitting periodic tasks had an error, but it could also be due to this problem, a problem with the listening connection.
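
In code, that heuristic might look something like this (a sketch only: the 20-second cadence and the 5-second select timeout are the figures above, process_notification is a placeholder, and raising OperationalError by hand is just one way to fall into the existing except path):

# Sketch: with local periodic tasks submitted roughly every 20 s, more than
# ~20 s of total silence on the listener is a strong hint the connection is stale.
import time

import psycopg2

PANIC_AFTER = 4 * 5  # four 5-second select timeouts with no traffic at all

last_event = time.monotonic()
for e in conn.events(yield_timeouts=True):
    if e is not None:
        last_event = time.monotonic()
        process_notification(e)
    elif time.monotonic() - last_event > PANIC_AFTER:
        # even the periodic tasks have gone missing: treat the listening
        # connection as suspect and force the reconnect/retry path
        raise psycopg2.OperationalError('no pg_notify traffic within the expected periodic-task window')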

@CMPluto

CMPluto commented Mar 21, 2023

Hi @AlanCoding, any update on merging the fix above for the unstable Postgres DB connections?
We still see this in 21.12 as well.

[-] awx.conf.settings Database settings are not available, using defaults. error: connection already closed\n","stream":"stderr"

@CMPluto

CMPluto commented Mar 27, 2023

Hi @AlanCoding, any update on merging the fix above for the unstable Postgres DB connections?

@AlanCoding
Member

I'm glad that you put it on the community meeting schedule. @tanganellilore has done excellent work here, and I'm on-board with the latest state of our discussion - that we need to SELECT something with the listening connection periodically to detect this case where it has gone stale.

In the meantime, other changes have been landing in the devel branch. Now we have expanded the use of this "listening" connection to other services to enable the web/task pod scaling split in OCP. Whatever arguments applied to this will now apply to those services as well, so we need to discuss those and come to a solution.

@tanganellilore
Contributor Author

tanganellilore commented Apr 11, 2023

@AlanCoding I updated my fork with a new branch from devel and tried the different solution that we discussed (checking the connection after X iterations).
It's a draft, but you can check it here: https://github.com/tanganellilore/awx/commits/reconnect_dispatch_v2

I know that with the web/task split, conn.events is called in different places than before, but this is just to show a different approach from checking the connection every time.
We can discuss it at today's meeting.

@TheRealHaoLiu
Member

I finally got around to pulling down your change and testing it... apologies for the amount of time that it took.

setup:

  • multi-AZ RDS; triggered a forced failover

result:

  • not good...
  • all pg listener connections die, so no pg_notify is working
  • cluster_healthcheck fails, which triggers peer judgement repeatedly, and all instances but 1 are reaped

@TheRealHaoLiu TheRealHaoLiu self-assigned this Apr 28, 2023
@tanganellilore
Contributor Author

Hi team,
I'm happy that you were able to reproduce the issue, and I need to be clear: this implementation was tested on a release before the task/web split, so I don't know whether it still works there and haven't tested it.

From your tests it seems it does not work on RDS, but I also remember some issues with RDS in general.
I need to recreate a Postgres cluster to retest the new AWX, but I need some time to do it.

When I'm ready I will ping you and we can look at the output.

@TheRealHaoLiu
Member

@tanganellilore ansible/awx-operator#1393 is a PR that we are currently testing to solve the problem.

It seems to be working.

@TheRealHaoLiu
Member

@tanganellilore have you tested out ansible/awx-operator#1393 to see if it works for you? If it does, let's close out this PR.

@tanganellilore
Contributor Author

As per the comment on #13505:

Thanks for the solution!!
