
Fix PostgreSQL reconnection in case of DB failover or idle/unstable connection #13507

Closed

Conversation

tanganellilore
Contributor

SUMMARY

I noticed that in case of an unstable connection with PostgreSQL, or a failover (in a PostgreSQL HA configuration), awx.main.dispatcher loses the connection to the DB and does not reconnect.

I opened a PR that tries to fix this with a simple connection check plus retries in the pg_bus_conn function (by default 40 retries, every 4 seconds). If the connection is restored we need to kill the dispatcher process in the awx-task container and wait for it to restart.

I don't know whether a different approach is possible, but this way we are sure that all processes restart correctly.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • Other
AWX VERSION
AWX 20.10.2
ADDITIONAL INFORMATION

@tanganellilore
Contributor Author

Issue: #13505

@tanganellilore tanganellilore changed the title fix #13505 fix postgreSQL reconnection in case of DB failover or idle or instable connection Feb 1, 2023
@tanganellilore
Contributor Author

@AlanCoding I used a different approach after a lot of testing and trying different use cases.

I adopted a Postgres "ping": a SELECT 1 in the event loop, issued only when yield_timeouts is true.
That way, if the pg_bus_conn connection set up in the dispatcher's first loop hangs in the main thread (due to a network change on the Postgres side, a connection timeout, or something similar), this simple ping raises an exception immediately and we drop into the retry loop that is already in place.

So by simply adding one line of code we can catch the issue :-)

Obviously, this issues a query every 5 seconds when there is no notification to process, but I think this is acceptable.
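
Roughly, the change looks like this (a simplified sketch, not the exact diff: the method shape and the NOT_READY sentinel follow the traceback quoted later in this thread, and the cursor handling here is illustrative):

# Sketch only: ping the listening connection whenever select() times out, so a
# silently-dropped socket raises psycopg2.OperationalError and falls into the
# retry logic already in place. Not the exact AWX code.
import select

NOT_READY = ([], [], [])  # assumed value of the sentinel used in awx/main/dispatch/__init__.py

class PubSub:
    def __init__(self, conn):
        self.conn = conn  # psycopg2 connection in autocommit mode, LISTEN already issued

    def events(self, select_timeout=5, yield_timeouts=False):
        while True:
            if select.select([self.conn], [], [], select_timeout) == NOT_READY:
                if yield_timeouts:
                    # the added "ping": a dead connection raises OperationalError here
                    with self.conn.cursor() as cur:
                        cur.execute('SELECT 1')
                    yield None
            else:
                self.conn.poll()
                while self.conn.notifies:
                    yield self.conn.notifies.pop(0)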

@djyasin
Member

djyasin commented Feb 9, 2023

@tanganellilore
Thank you for providing this additional information! We are still trying to identify exactly what is occurring here.

Could you provide us with additional information regarding the error you experienced and what you saw? Screenshots, additional logs, or any additional information you can provide would be very helpful.

Thank you for your time and assistance with this matter!

@AlanCoding
Member

Specifically, you're adding another class of error to be caught, which makes it really relevant to be able to see what tracebacks were produced in the failover events that you had.

@tanganellilore
Contributor Author

Good morning @AlanCoding,
I removed the db.Error except clause, since it is not required in this case.

This PR also tries to catch the reconnection error for the main notify loop.
Currently the main loop in /awx/main/dispatch/worker/base.py (line 177) does not catch the connection drop, because it simply creates a connection once and then listens on the queues only once. So if the connection to Postgres drops, for example because of network lag or a Postgres failover, nothing restarts the main loop and re-subscribes to all the queues.
By going through the exception path, we simply restart the main loop, as long as the retry happens before the pg_max_wait time.
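
To make the shape of that retry path concrete, here is an illustrative sketch (not the actual base.py code: pg_max_wait, the 4-second retry delay, and the logger name come from this thread, conn.listen is assumed from how the dispatcher subscribes to its queues, and queues / process_notification are placeholders):

# Illustrative retry wrapper around the listening loop; a sketch, not the real dispatcher code.
import logging
import time

import psycopg2
from django import db

from awx.main.dispatch import pg_bus_conn

logger = logging.getLogger('awx.main.dispatch')

def run_listener(queues, process_notification, pg_max_wait=40, retry_delay=4):
    last_ok = time.monotonic()
    while True:
        try:
            with pg_bus_conn(new_connection=True) as conn:
                for queue in queues:
                    conn.listen(queue)      # assumed subscription call
                last_ok = time.monotonic()  # connected again, reset the retry window
                for e in conn.events(yield_timeouts=True):
                    if e is not None:
                        process_notification(e)
        except (db.DatabaseError, psycopg2.OperationalError):
            if time.monotonic() - last_ok > pg_max_wait:
                logger.exception('Unable to reconnect to postgres within %s s, giving up', pg_max_wait)
                raise
            logger.exception('Error consuming new events from postgres, will retry for %s s', pg_max_wait)
            time.sleep(retry_delay)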

Some screenshots of the error:


Simulating a failover by detaching the VIP address from the master PostgreSQL instance, on the current release:

Failover started at 08:16:20

On awx-task logs:
[screenshot]

In the AWX UI all jobs are pending:
[screenshot]

Inside awx-task, the awx-manage command:

[screenshot]
(all workers are idle)

[screenshot]
(error waiting on the listener)


Simulating a failover by detaching the VIP address from the master PostgreSQL instance, with this PR:

Failover started at 08:33:40

On awx-task logs:
[screenshot]
(we go into the exception)

In the AWX UI all jobs are running because the connection was re-established at 08:34:15:
[screenshot]

Inside awx-task, the awx-manage command:

[screenshot]
(workers now run commands and sync)

[screenshot]
(the listener works)

@AlanCoding
Member

Thanks for the reply. I really want to understand the mechanics of what you've seen here. If I go to awx-manage shell_plus and then run this:

In [1]: from awx.main.dispatch import pg_bus_conn

In [2]: with pg_bus_conn(new_connection=True) as conn:
   ...:     for e in conn.events(yield_timeouts=True):
   ...:         print(e)
   ...: 
None
None
None
None

Then I will continue getting a None every 5 seconds as long as the database is healthy. Now if I essentially shut postgres down (pretty normally I think, it logs about terminating connections then "LOG: shutting down") then it will stop printing those and give:

None
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
Cell In[2], line 2
      1 with pg_bus_conn(new_connection=True) as conn:
----> 2     for e in conn.events(yield_timeouts=True):
      3         print(e)

File /awx_devel/awx/main/dispatch/__init__.py:42, in PubSub.events(self, select_timeout, yield_timeouts)
     40         yield None
     41 else:
---> 42     self.conn.poll()
     43     while self.conn.notifies:
     44         yield self.conn.notifies.pop(0)

OperationalError: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

At first glance, this seems like the same sort of traceback you had, although these are still indirect logs. The log I have, seen from the dispatcher listener's perspective, is:

tools_awx_1 | 2023-02-13 21:18:21,113 ERROR    [a73e7ff7] awx.main.dispatch Error consuming new events from postgres, will retry for 40 s
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 182, in run
tools_awx_1 |     for e in conn.events(yield_timeouts=True):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/__init__.py", line 42, in events
tools_awx_1 |     self.conn.poll()
tools_awx_1 | psycopg2.OperationalError: server closed the connection unexpectedly
tools_awx_1 | 	This probably means the server terminated abnormally
tools_awx_1 | 	before or while processing the request.

Again, consistent with my shell_plus testing. This shows that it falls within the (db.DatabaseError, psycopg2.OperationalError) exceptions being caught. This already sounds like it's doing what you propose, and I am interested in closing any cases of database flake that the current logic does not catch.

@tanganellilore
Contributor Author

Thanks @AlanCoding for the reply. I understand what you mean, but I think I see why you are not able to reproduce it.
The use case is a little different from the one you just shared.
In an HA Postgres solution you have 3 Postgres instances, 1 master and 2 slaves, accessible via a VIP.

One of the most standard solutions for a failover is to move the VIP from the master to one of the slaves, or to use HAProxy.
What I suppose is that the main difference between this case and your example is the connection itself.
In your case you drop all connections: shutting down Postgres takes all the pg_notify listeners down with it.
In the failover case the behavior is different: it seems the main connection is partially broken, or partially restored in some way (or the switch is very fast), but it does not work properly.

What I tried to understand is why, in case of a failover, I receive an exception from awx.conf.settings instead of from the main loop process, and this was my answer:
put a "ping" during the "None" message to check whether the connection is healthy, because we only detect an unhealthy connection when we receive a notify. If we don't receive a notify, we never check whether the connection is healthy...
We assume this is implicit with a single instance, and that's true, but it is not the same with HA Postgres.

@tanganellilore
Contributor Author

Here are some logs:
Failover at 02/14/2023, 7:53:11:

From awx-manage shell_plus:
[screenshot]

You can see that there isn't any connection drop.

So I can confirm that this happens only with a single instance, or in an HA setup where all the PostgreSQL instances go down at the same time.

If we have HA with Patroni, pgbouncer and vip-manager, or HAProxy with a VIP address, the old connections go down, but this does not seem to be detected by the psycopg2 library (probably because the switch happens in a very short time).

In any case, I notice that nothing checks whether self.conn is valid and working when we are in the if select.select([self.conn], [], [], select_timeout) == NOT_READY: branch, so if the connection is reset by Postgres we risk being left with a non-working connection on the AWX side.

@tanganellilore
Contributor Author

Good morning @AlanCoding,
I tried different configurations for Postgres HA, but in some cases the failure is not caught by the exception in the main loop.
To be more precise, in all cases where there is a real failover (detaching the network from the master, or a hard shutdown of the master) it is not caught.

And to me this is expected, because the VIP address is moved to one of the slaves and nothing tells the main loop that its connection was dropped (it does not receive an ack, a disconnect, or anything similar, because the master was not shut down gracefully).
The strange thing is that the error is caught by awx.conf.settings every minute:
[screenshot]

This is my lab (created with https://github.com/vitabaks/postgresql_cluster):
postgresql01: 10.10.1.2 - patroni, vip-manager, etcd, pgbouncer
postgresql02: 10.10.1.3 - patroni, vip-manager, etcd, pgbouncer
etcdarbiter: 10.10.1.4 - etcd
vip: 10.10.1.5

What I tried:
Server side:

  • with pgbouncer (using port 6432)
  • without pgbouncer (using port 5432)

Address side:

  • with a DNS name that resolves to the VIP
  • without DNS (directly the IP address)

Failover via Patroni (patronictl failover cluster) or a normal shutdown of the Postgres master: success (expected, because a manual failover or reboot drops all connections and probably sends something that is caught by psycopg2).

Failover via detaching the network from the VM or a non-graceful shutdown: KO, psycopg2 does not catch that the connection dropped and came back up within a short time.

@AlanCoding
Member

Thanks for digging and for the information you gave here. I now see much more clearly why there isn't an underlying traceback associated with the problem, and that the problem is a bit lower-level: select.select does not realize that a failover has happened and made the socket it was using defunct.

I brought this up at the community meeting #13563

Engineering-wise, it might be obvious that I don't like adding another periodic SELECT, particularly in such a low-level, fast 5-second loop. But you have good points suggesting it may be unavoidable, in which case increasing the timeout might be a good counter-balance.

We already run tasks in the main loop on a periodic basis, and these can already talk to the database (and other systems). The weird and non-obvious point to know is that this process maintains 2 database connections - one for pg_notify listening and the other for Django ORM actions. The latter connection is closed for at least scaleup events. If a connection might ever be closed, then it is no good for listening to pg_notify, because it would lose messages while closed.

The entire logic for this is suspicious, and I don't feel like both connections should be necessary. If those could be consolidated, my preference would be to test the connection in the normal dispatcher periodic tasks. This would be a much larger change that risks unexpected fallout, but I am thinking about it.

@tanganellilore
Contributor Author

I wasn't able to join the community meeting, so I will answer here.
I have the same concern about a SELECT every 5 seconds, although we can consider that the connection to the DB is already in place, so a SELECT would not be that heavy.
To me, probably the faster and more affordable solution, until a complete rework of the dispatching (or perhaps a rework of pg_notify 😄), would be to add a counter and check the connection after X times that we receive the None event, as sketched below.

Consider that in some of my debug sessions I noticed that on an AWX instance under normal use the None events are very, very rare, so this counter can be quite low, like 10 or 12.
So in that case you check the connection about every minute, on a connection that is already in place.

So in the worst case, we stay without a DB connection for about 1 minute, with a single check per minute.
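
Roughly what I have in mind (a sketch with illustrative names: conn is the object returned by pg_bus_conn, conn.conn is the underlying psycopg2 connection seen in the traceback above, and process_notification stands in for the normal dispatch handling):

# Sketch of the counter idea: only probe the connection after N consecutive
# timeouts (None events), instead of on every 5-second loop.
IDLE_TIMEOUTS_BEFORE_CHECK = 12  # at ~5 s per timeout this is roughly one probe per minute

idle_timeouts = 0
for e in conn.events(yield_timeouts=True):
    if e is None:
        idle_timeouts += 1
        if idle_timeouts >= IDLE_TIMEOUTS_BEFORE_CHECK:
            # cheap health probe on the connection already in place; a stale
            # socket raises OperationalError here and drops us into the retry path
            with conn.conn.cursor() as cur:
                cur.execute('SELECT 1')
            idle_timeouts = 0
    else:
        idle_timeouts = 0
        process_notification(e)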

@AlanCoding
Member

after X times that we receive the None event.

Maybe!

It will not hit the timeout every loop, because it exits early if it gets a message. We don't need to pretend that we don't know anything about the event load coming in. Every single AWX node has periodic tasks running that are submitted to the local queue. See:

CELERYBEAT_SCHEDULE = {

So, since there is essentially a cron job submitting a task every 20 seconds, if you hit 4 timeouts (5 s × 4...) then something's wrong and it's time to panic. It could be because the fork submitting periodic tasks had an error, but it could also be due to this problem, a problem with the listening connection.
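
In code, that heuristic might look something like this (a sketch only: the 20-second cadence and the 5-second select timeout are the figures above, process_notification is a placeholder, and raising OperationalError by hand is just one way to fall into the existing except path):

# Sketch: with local periodic tasks submitted roughly every 20 s, more than
# ~20 s of total silence on the listener is a strong hint the connection is stale.
import time

import psycopg2

PANIC_AFTER = 4 * 5  # four 5-second select timeouts with no traffic at all

last_event = time.monotonic()
for e in conn.events(yield_timeouts=True):
    if e is not None:
        last_event = time.monotonic()
        process_notification(e)
    elif time.monotonic() - last_event > PANIC_AFTER:
        # even the periodic tasks have gone missing: treat the listening
        # connection as suspect and force the reconnect/retry path
        raise psycopg2.OperationalError('no pg_notify traffic within the expected periodic-task window')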

@CMPluto

CMPluto commented Mar 21, 2023

Hi @AlanCoding, any update on merging the fix above for the unstable Postgres DB connections?
We still see this in 21.12 as well.

[-] awx.conf.settings Database settings are not available, using defaults. error: connection already closed\n","stream":"stderr"

@CMPluto

CMPluto commented Mar 27, 2023

Hi @AlanCoding, any update on merging the fix above for the unstable Postgres DB connections?

@AlanCoding
Member

I'm glad that you put it on the community meeting schedule. @tanganellilore has done excellent work here, and I'm on-board with the latest state of our discussion - that we need to SELECT something with the listening connection periodically to detect this case where it has gone stale.

In the meantime, other changes have been landing in the devel branch. Now we have expanded the use of this "listening" connection to other services to enable the web/task pod scaling split in OCP. Whatever arguments applied to this will now apply to those services as well, so we need to discuss those and come to a solution.

@tanganellilore
Contributor Author

tanganellilore commented Apr 11, 2023

@AlanCoding I updated my fork with a new branch from devel and tried the different solution that we discussed (checking the connection after X iterations).
It's a draft, but you can check it here: https://github.com/tanganellilore/awx/commits/reconnect_dispatch_v2

I know that with the web/task split, conn.events is called in different places than before, but this is just to show a different approach from checking the connection every time.
We can discuss it at today's meeting.

@TheRealHaoLiu
Member

I finally got around to pulling down your change and testing it... apologies for the amount of time that it took.

setup:

  • multi-AZ RDS; triggered a forced failover

result:

  • not good...
  • all pg listener connections die, so no pg_notify is working
  • cluster_healthcheck fails, which triggers peer judgement repeatedly, and all instances but 1 are reaped

@TheRealHaoLiu TheRealHaoLiu self-assigned this Apr 28, 2023
@tanganellilore
Contributor Author

Hi team,
I'm happy that you were able to reproduce the issue, and I need to be clear: this implementation was tested on a release before the task/web split, so I don't know whether it still works there and haven't tested it.

From your tests it seems it does not work on RDS, but I also remember some issues with RDS in general.
I need to recreate a Postgres cluster to retest the new AWX, but I need some time to do it.

When I'm ready I will ping you and we can look at the output.

@TheRealHaoLiu
Member

@tanganellilore ansible/awx-operator#1393 is a PR that we are currently testing to solve the problem.

It seems to be working.

@TheRealHaoLiu
Member

@tanganellilore have you tested out ansible/awx-operator#1393 to see if it works for you? If it does, let's close out this PR.

@tanganellilore
Contributor Author

As per the comment on #13505:

Thanks for the solution!!
