[Bug]: If 3 volume snapshots fail in a row, the instance manager stops functioning (connection pool exhausted)

### Is there an existing issue already for this bug?

- [x] I have searched for an existing issue, and could not find anything. I believe this is a new bug.

### I have read the troubleshooting guide

- [x] I have read the troubleshooting guide and I think this is a new bug.

### I am running a supported version of CloudNativePG

- [x] I have read the troubleshooting guide and I think this is a new bug.

### Contact Details

jeffm@gmail.com

### Version

1.25 (latest patch)

### What version of Kubernetes are you using?

1.29

### What is your Kubernetes environment?

Cloud: Azure AKS

### How did you install the operator?

YAML manifest

### What happened?

When 3 scheduled volume snapshots fail in a row. The super user connection pool becomes exhausted in the instance manger. This causes all endpoints except `/healthz` to fail. This causes a near total failure of CNPG.

**Impact:**
- All scheduled backups start to fail (due to endpoints hanging)
- You run out of WAL space
- Metrics are no longer collected
- Hibernate fails
- `cnpg status` hangs
- All HTTP endpoints except /healthz hang
- Instance manager can no longer be updated in-place

Depending on when you catch it, the real reason for the volume snapshot failure will be buried by `context cancelled: deadline exceeded` (because only the first 2-3 attempts can start a backup, the other ones hang getting the cluster status).

This PR has the beginnings of a fix, but there's more to be done:
https://github.com/jmealo/cloudnative-pg/pull/1

# Big shout outs to the Gophers in the Go Slack:
## @farhaven and nbraun

### Cluster resource

```shell
Not relevant
```

### Relevant log output

```shell
postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' 
ORDER BY query_start desc;
 pid  |       age       | usename  |                      query                       
------+-----------------+----------+--------------------------------------------------
   26 |                 |          | 
   27 |                 |          | 
   28 |                 |          | 
  204 |                 |          | 
 3886 | 00:03:55.430776 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
 2518 | 01:04:35.711006 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
 1112 | 02:06:57.669102 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
(7 rows)

postgres=# 
```

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: If 3 volume snapshots fail in a row, the instance manager stops functioning (connection pool exhausted) #6761

Is there an existing issue already for this bug?

I have read the troubleshooting guide

I am running a supported version of CloudNativePG

Contact Details

Version

What version of Kubernetes are you using?

What is your Kubernetes environment?

How did you install the operator?

What happened?

Big shout outs to the Gophers in the Go Slack:

@farhaven and nbraun

Cluster resource

Relevant log output

Code of Conduct

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Bug]: If 3 volume snapshots fail in a row, the instance manager stops functioning (connection pool exhausted) #6761

Description

Is there an existing issue already for this bug?

I have read the troubleshooting guide

I am running a supported version of CloudNativePG

Contact Details

Version

What version of Kubernetes are you using?

What is your Kubernetes environment?

How did you install the operator?

What happened?

Big shout outs to the Gophers in the Go Slack:

@farhaven and nbraun

Cluster resource

Relevant log output

Code of Conduct

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions