PostgreSQL reconnection in case of DB failover or idle or unstable connection #13505

Closed
4 of 9 tasks
tanganellilore opened this issue Feb 1, 2023 · 11 comments

Comments

@tanganellilore
Contributor

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi team,
as discussed on Matrix with @TheRealHaoLiu and @shanemcd, I noticed that with an unstable PostgreSQL connection, or a failover (in a PostgreSQL HA configuration), awx.main.dispatcher loses its connection to the DB and does not reconnect.

I tried multiple use cases, but reconnection currently works only when the connection drops very briefly (less than 1 second), and even then not every time.

I noticed that a similar issue, #12683, was already opened but closed for the wrong reason.

I opened a PR that tries to fix this with a simple connection check and some retries in the pg_bus_conn function (by default, 40 times, every 4 seconds).
If the connection is restored, we need to kill the dispatcher process in the awx-task container so that it restarts.

I don't know if a different approach is possible, but this way we are sure that all processes restart correctly.
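
For readers who want a picture of the retry idea described above, here is a minimal sketch. It is not the code from the PR: `connect_with_retry` and its parameters are hypothetical names, `ConnectionError` stands in for whatever exception the database driver raises, and only the defaults (40 attempts, 4 seconds apart) come from the description in this issue.

```python
import time


def connect_with_retry(connect, max_retries=40, delay=4.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_retries attempts have failed.

    connect is any zero-argument callable that returns a connection or
    raises ConnectionError (a stand-in for the driver's own error type).
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_retries:
                sleep(delay)
    raise RuntimeError(
        f"database unreachable after {max_retries} attempts"
    ) from last_error
```

With these defaults the loop gives up after roughly 156 seconds of waiting (39 pauses of 4 seconds), plus the time spent in the connection attempts themselves.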

From the UI or the awx-web logs I don't see any drop; everything works well.

AWX version

21.10.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Use AWX normally, then restart PostgreSQL or detach its network, and check the awx-task log.

The connection will be dropped and all jobs will stay in pending status.

Expected results

If the PostgreSQL connection drops, it should be re-established.

Actual results

Error on awx-task

Additional information

No response

@Cl0udius

Cl0udius commented Mar 7, 2023

Hi. Will #13507 be continued, or is there another solution?

@tanganellilore
Contributor Author

I hope that @AlanCoding can check it for the next release...

@TheRealHaoLiu
Member

I think ansible/awx-operator#1393 resolves the problem, please verify.

@TheRealHaoLiu
Member

@tanganellilore did you have a chance to verify whether the fix works?

@tanganellilore
Contributor Author

@TheRealHaoLiu I tested today, and the new parameters with the keepalive check work great.
If the keepalive limit is reached, the container restarts.
With a fast failover, the connection is re-established automatically without any container restart.
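
As a rough illustration of what such keepalive parameters look like at the connection level, the sketch below builds the standard libpq keepalive options and estimates how long a dead connection can go unnoticed. The function names and the values are illustrative assumptions, not the defaults shipped by AWX or awx-operator; only the four `keepalives*` keys are real libpq connection parameters.

```python
def keepalive_options(idle=5, interval=5, count=5):
    """Build libpq TCP keepalive parameters (values here are illustrative)."""
    return {
        "keepalives": 1,                  # enable TCP keepalives on the socket
        "keepalives_idle": idle,          # seconds of inactivity before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before the link is declared dead
    }


def worst_case_detection_seconds(options):
    """Approximate time until a silently dead connection is noticed."""
    return (
        options["keepalives_idle"]
        + options["keepalives_interval"] * options["keepalives_count"]
    )
```

A dictionary like this could, for example, be placed under `OPTIONS` in a Django `DATABASES` entry, since the PostgreSQL backend forwards those keys to the driver. With the illustrative values above, a dead connection would be detected after about 30 seconds, so a failover that completes faster than that may never hit the limit.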

Thanks a lot for the solution!!!

@Cl0udius

Cl0udius commented Jun 9, 2023

Hi.

I tested this now from our end with AWX 22.3.0 and got the following behavior with our pgpool based HA postgres setup:

  • Already started jobs finished normally without any error.
  • Jobs which are started after the takeover get stuck in pending state forever until the task pods get restarted.
  • I do not receive an error message regarding a lost DB connection in the logs.

I tried the default settings for keepalive and also experimented with the values, but I always got the same result.
Can someone reproduce this on their setup too?

@tanganellilore
Contributor Author

In my case I see errors in awx-task, both with a manual failover (with Patroni) and with a real DR (shutting down the master PostgreSQL).

@Cl0udius

Hi,

I tested again with different settings in our K8s-based Postgres setup and found that the ClusterIP service in Kubernetes is probably the reason jobs cannot be scheduled after the database DR switchover.
With a clusterIP defined in the service, the connections are not recreated and the task pod has to be restarted.

I created another headless service, as in the standalone DB setup, and with it the problem of unschedulable jobs does not occur.
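
One way to see the difference between the two service types from inside a pod is to look at what the service name resolves to. In Kubernetes, a headless service name normally resolves straight to the backing pod IPs, while a ClusterIP service resolves to a single virtual IP. The helper below is a small sketch for checking this; `resolve_service_ips` is a hypothetical name, and whether this difference fully explains the behavior described here is an assumption.

```python
import socket


def resolve_service_ips(hostname, port=5432):
    """Return the sorted, de-duplicated IP addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling it with the database service name used in the AWX configuration, before and after a switchover, shows whether clients are handed a changing pod IP or the same virtual IP both times.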

Here we seem to get a hard cut of the connection, which leads to a reconnect and to the AWX web/task service being unreachable for a few seconds.
That was the good news.
The bad news is that we lose jobs that finish while the database switchover is happening.
We receive the following error message on such jobs: "Task was marked as running but was not present in the job queue, so it has been marked as failed."

@tanganellilore Can you also see this behavior?

I can also create another issue for this if we want to handle it separately.

@tanganellilore
Contributor Author

Do you mean jobs that finish at the same time as the DB switchover?

Or do you mean a long-running job that is running during the DB switchover?

@Cl0udius

Cl0udius commented Jun 20, 2023

The jobs that finish during the period when the switchover happens and we have no DB connection.
The long-running jobs seem to be stable.

@tanganellilore
Contributor Author

OK, I don't know how to test this edge case, but is your switchover fast, or does it take some time, like 20-30 seconds?

Because you can probably tune the settings and keepalive values to catch this edge case.

To be honest, this is an HA solution, not a BC solution, so I expect that some edge cases (like this one) are not covered well.
Consider also that PostgreSQL in an HA configuration is generally set up with asynchronous replication, so nothing guarantees that a transaction completed on the first instance has been propagated to the other one.
You would need to configure HA in synchronous mode, but then if a secondary PostgreSQL goes down or is unavailable for some time, you will not be able to complete transactions even if the primary works well.

So everything depends on your configuration and replication mode.
The solution proposed here is for the HA use case (so some failing jobs are acceptable to me), not for BC.

@shanemcd shanemcd changed the title PostgreSQL reconnection in case of DB failover or idle or instable connection PostgreSQL reconnection in case of DB failover or idle or unstable connection Jun 20, 2023