PostgreSQL reconnection in case of DB failover or idle or unstable connection #13505

tanganellilore · 2023-02-01T20:49:43Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi team,
as discussed on Matrix with @TheRealHaoLiu and @shanemcd, I noticed that with an unstable PostgreSQL connection or a failover (in a PostgreSQL HA configuration), awx.main.dispatcher loses its connection to the DB and does not reconnect.

I tried multiple use cases, but reconnection currently works only for a very short connection drop (less than 1 second), and even then not every time.

I noticed that a similar issue was already opened but closed for the wrong reason: #12683.

I opened a PR to try to fix this issue with a simple connection check with retries in the pg_bus_conn function (by default 40 times, every 4 seconds).
Once the connection is restored, we need to kill the dispatcher process in the awx-task container so that it restarts.
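
To illustrate the retry idea, here is a minimal sketch. It is not the code from the PR: `connect_with_retry`, its parameters and the use of `ConnectionError` as a stand-in for the driver's error are all assumptions for illustration. Only the defaults (40 attempts, 4 seconds apart) come from the description above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_retries: int = 40,
    delay_seconds: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_retries attempts have failed.

    Hypothetical helper: `connect` is any zero-argument callable that
    returns a connection or raises ConnectionError (a stand-in for the
    database driver's own connection error).
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the last one.
            if attempt < max_retries:
                sleep(delay_seconds)
    raise ConnectionError(
        f"database unreachable after {max_retries} attempts"
    ) from last_error
```

With the defaults this keeps trying for roughly 40 x 4 = 160 seconds before giving up, which is the window in which a failover has to complete.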

I don't know if a different approach is possible, but this way we are sure that all processes restart correctly.

From the UI perspective or in the awx-web logs I don't see any drop; everything works well.

AWX version

21.10.2

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Use AWX normally, then restart PostgreSQL or detach its network and check the awx-task log.

The connection will be dropped and all jobs will stay in pending status.

Expected results

If the PostgreSQL connection drops, it should be re-established.

Actual results

Error on awx-task

Additional information

No response

Cl0udius · 2023-03-07T18:07:57Z

Hi. Will #13507 be continued, or is there another solution?

tanganellilore · 2023-03-07T19:17:06Z

I hope that @AlanCoding can check it for next release....

TheRealHaoLiu · 2023-05-22T20:12:00Z

ansible/awx-operator#1393 I think this PR resolves the problem, please verify.

TheRealHaoLiu · 2023-06-05T16:15:03Z

@tanganellilore did you have a chance to verify whether the fix works?

tanganellilore · 2023-06-07T13:11:01Z

@TheRealHaoLiu I tested today, and the new parameters with the keepalive check work great.
If the keepalive maximum is reached, the container restarts.
In case of a fast failover, the connection is re-established automatically without any container restart.
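
For readers unfamiliar with these settings, the sketch below shows what client-side TCP keepalive options look like at the connection level. The four `keepalives*` keys are standard libpq connection parameters. The values, the database name and host, and the helper names `keepalive_options` and `worst_case_detection_seconds` are illustrative assumptions, not what the operator PR ships.

```python
def keepalive_options(idle: int = 5, interval: int = 5, count: int = 5) -> dict:
    """Build libpq keepalive connection parameters (values are illustrative)."""
    return {
        "keepalives": 1,                  # enable client-side TCP keepalives
        "keepalives_idle": idle,          # idle seconds before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before giving up
    }


def worst_case_detection_seconds(options: dict) -> int:
    """Approximate time before an idle, dead connection is declared lost."""
    return (
        options["keepalives_idle"]
        + options["keepalives_interval"] * options["keepalives_count"]
    )


# Hypothetical Django-style database settings; NAME and HOST are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "awx",
        "HOST": "postgres.example.internal",
        "OPTIONS": keepalive_options(),
    }
}
```

With these example values a dead idle connection would be noticed after about 5 + 5 x 5 = 30 seconds. A failover that finishes sooner than that can be picked up without hitting the limit, which seems to match the fast-failover behavior described above.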

Thanks a lot for the solution!!!

Cl0udius · 2023-06-09T13:10:27Z

Hi.

I tested this now from our end with AWX 22.3.0 and got the following behavior with our pgpool based HA postgres setup:

Already started jobs finished normally without any error.
Jobs which are started after the takeover get stuck in pending state forever until the task pods get restarted.
I do not receive an error message regarding a lost DB connection in the logs.

I tried the default settings for keepalive and also experimented with the values, but I always got the same result.
Can someone reproduce this on their setup too?

tanganellilore · 2023-06-09T19:01:22Z

In my case I see errors in awx-task, both for a manual failover (with Patroni) and for a real DR (shutting down the master PostgreSQL).

Cl0udius · 2023-06-20T08:23:10Z

Hi,

I tested again with different settings in our K8S based Postgres setup and found out that the clusterIP service in Kubernetes is probably the reason the jobs cannot be scheduled after the database DR switchover.
With a clusterIP defined in the service, the connections do not get recreated and the task pod has to be restarted.

I created another headless service, like in the standalone DB setup, and with it the problem of unschedulable jobs does not occur.

Here we seem to have a hard cut in the connection, which leads to a reconnect and to an AWX web/task service that is not reachable for some seconds.
This was the good news.
The bad news is that we lose jobs which finish while the database switchover happens.
We receive the following error message on such jobs: "Task was marked as running but was not present in the job queue, so it has been marked as failed."

@tanganellilore Can you also see this behavior?

I can also create another issue for this if we want to handle it separately.

tanganellilore · 2023-06-20T08:37:15Z

Do you mean jobs that finish at the same time as the DB switchover?

Or do you mean a long-running job that is running during the DB switchover?

Cl0udius · 2023-06-20T08:40:00Z

The jobs which finish during the period when the switchover happens and we do not have a DB connection.
The long-running jobs seem to be stable.

tanganellilore · 2023-06-20T12:20:01Z

Ok, I don't know how to test this edge case, but is your switchover fast, or does it take some time, like 20-30 seconds?

Because you can probably tune the settings and keepalive to cover this edge case.

To be honest, this is an HA solution, not a BC solution, so I expect that some edge cases (like this one) are not covered well.
Consider also that PostgreSQL in an HA configuration is generally set up with asynchronous replication, so nobody guarantees that a transaction completed on the first instance has been propagated to the other one.
You would need to configure HA in synchronous mode, but that means that if a secondary PostgreSQL goes down or is unavailable for some time, you will not be able to complete transactions even if the primary works well.

So everything depends on your configuration and replication mode.
The solution proposed here is for the HA use case (so some failing jobs are acceptable to me), not for BC.

github-actions bot added needs_triage type:bug community labels Feb 1, 2023

tanganellilore mentioned this issue Feb 1, 2023

fix postgreSQL reconnection in case of DB failover or idle or instable connection #13507

Closed

fosterseth removed the needs_triage label Feb 8, 2023

stanislav-zaprudskiy mentioned this issue Feb 22, 2023

awx.conf.settings Database settings are not available #12683

Closed

9 tasks

TheRealHaoLiu assigned TheRealHaoLiu and AlanCoding Apr 28, 2023

tanganellilore closed this as completed Jun 7, 2023

shanemcd changed the title ~~PostgreSQL reconnection in case of DB failover or idle or instable connection~~ PostgreSQL reconnection in case of DB failover or idle or unstable connection Jun 20, 2023

sylvain-de-fuster mentioned this issue Apr 29, 2024

Fake crash of external postgres leader - Can't launch new job and instance healthcheck hanging ansible/awx-operator#1844

Open

3 tasks
