Closed
Labels
backport-requested ◀️ (This pull request should be backported to all supported releases), bug 🐛 (Something isn't working)
Description
Is there an existing issue already for this bug?
- I have searched for an existing issue, and could not find anything. I believe this is a new bug.
I have read the troubleshooting guide
- I have read the troubleshooting guide and I think this is a new bug.
I am running a supported version of CloudNativePG
- I have read the troubleshooting guide and I think this is a new bug.
Contact Details
Version
1.25 (latest patch)
What version of Kubernetes are you using?
1.29
What is your Kubernetes environment?
Cloud: Azure AKS
How did you install the operator?
YAML manifest
What happened?
When three scheduled volume snapshots fail in a row, the superuser connection pool in the instance manager becomes exhausted. Every endpoint except /healthz then fails, which causes a near-total failure of CNPG.
Impact:
- All scheduled backups start to fail (due to endpoints hanging)
- You run out of WAL space
- Metrics are no longer collected
- Hibernate fails
- cnpg status hangs
- All HTTP endpoints except /healthz hang
- Instance manager can no longer be updated in-place
Depending on when you catch it, the real cause of the volume snapshot failure will be buried under "context cancelled: deadline exceeded" errors, because only the first 2-3 attempts can start a backup; the later ones hang while getting the cluster status.
This PR has the beginnings of a fix, but there's more to be done:
jmealo#1
Big shout outs to the Gophers in the Go Slack:
@farhaven and nbraun
Cluster resource
Not relevant
Relevant log output
postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
pid | age | usename | query
------+-----------------+----------+--------------------------------------------------
26 | | |
27 | | |
28 | | |
204 | | |
3886 | 00:03:55.430776 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
2518 | 01:04:35.711006 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
1112 | 02:06:57.669102 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
(7 rows)
postgres=#
Code of Conduct
- I agree to follow this project's Code of Conduct