[Bug]: If 3 volume snapshots fail in a row, the instance manager stops functioning (connection pool exhausted) #6761

@jmealo

Description

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I have read the troubleshooting guide and I think this is a new bug.

Contact Details

jeffm@gmail.com

Version

1.25 (latest patch)

What version of Kubernetes are you using?

1.29

What is your Kubernetes environment?

Cloud: Azure AKS

How did you install the operator?

YAML manifest

What happened?

When 3 scheduled volume snapshots fail in a row, the superuser connection pool in the instance manager becomes exhausted. Every endpoint except /healthz then fails, which causes a near-total failure of CNPG.

Impact:

  • All scheduled backups start to fail (due to endpoints hanging)
  • You run out of WAL space
  • Metrics are no longer collected
  • Hibernate fails
  • cnpg status hangs
  • All HTTP endpoints except /healthz hang
  • Instance manager can no longer be updated in-place

Depending on when you catch it, the real reason for the volume snapshot failure will be buried under "context cancelled: deadline exceeded" errors. Only the first 2-3 attempts can start a backup; the later ones hang while getting the cluster status.

This PR has the beginnings of a fix, but there's more to be done:
jmealo#1

Big shout outs to the Gophers in the Go Slack:

@farhaven and nbraun

Cluster resource

Not relevant

Relevant log output

postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' 
ORDER BY query_start desc;
 pid  |       age       | usename  |                      query                       
------+-----------------+----------+--------------------------------------------------
   26 |                 |          | 
   27 |                 |          | 
   28 |                 |          | 
  204 |                 |          | 
 3886 | 00:03:55.430776 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
 2518 | 01:04:35.711006 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
 1112 | 02:06:57.669102 | postgres | SELECT pg_backup_start(label => $1, fast => $2);
(7 rows)

postgres=# 

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • backport-requested ◀️ (This pull request should be backported to all supported releases)
  • bug 🐛 (Something isn't working)

Status

Done
