Cluster gets stuck after creating a new replica if there's no primary #1915
Closed
Description
We're assessing CloudNativePG for running a production Postgres instance and I encountered something strange.
When the primary instance gets shut down while replica creation is in progress, the new replica will be stuck forever, since it can't connect to the primary:
➜ ~ k get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
app hub 16h Creating a new replica hub-4
➜ ~ k logs -f hub-5-join-gfx7t
Defaulted container "join" out of: join, bootstrap-controller (init)
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server.crt","secret":"hub-server"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server.key","secret":"hub-server"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/streaming_replica.crt","secret":"hub-replication"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/streaming_replica.key","secret":"hub-replication"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/client-ca.crt","secret":"hub-ca"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server-ca.crt","secret":"hub-ca"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Waiting for server to be available","logging_pod":"hub-5-join","connectionString":"host=hub-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=hub-5-join sslmode=verify-ca dbname=postgres connect_timeout=5 replication=1"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"DB not available, will retry","logging_pod":"hub-5-join","err":"failed to connect to `host=hub-rw user=streaming_replica database=postgres`: dial error (dial tcp 172.20.8.3:5432: connect: connection refused)"}
{"level":"info","ts":"2023-04-12T08:54:02Z","msg":"DB not available, will retry","logging_pod":"hub-5-join","err":"failed to connect to `host=hub-rw user=streaming_replica database=postgres`: dial error (dial tcp 172.20.8.3:5432: connect: connection refused)"}
... the same lines repeat over and over again
And the primary also won't get recreated. I'm not sure how to fix this, because I can't create a primary, and I can't remove the broken new replica instance. The only way out of this seems to be creating a new cluster. Am I wrong here, though?
How to reproduce:
- Create a cluster with 1 instance.
- Scale up to 2 instances.
- While replica creation is running, delete the primary instance (which is already initialized) by running kubectl delete pod.
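The steps above can be sketched with kubectl, assuming the cluster is named hub in namespace app as in the manifest further down; the pod name hub-1 is only the typical name of the first instance and may differ in practice:

```shell
# 1. Create the cluster with a single instance
kubectl apply -f cluster.yaml

# 2. Scale up to 2 instances
kubectl patch cluster hub -n app --type merge -p '{"spec":{"instances":2}}'

# 3. While the join job for the new replica is still running,
#    delete the primary pod (hub-1 is an assumed name)
kubectl delete pod hub-1 -n app

# The cluster then stays in "Creating a new replica" indefinitely
kubectl get cluster -n app
```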
What it should do instead is realize that there's no primary and try to re-create it.
I have a setup like this:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: hub
namespace: app
spec:
# Start with a single instance
instances: 1
affinity:
nodeSelector:
workload-type: database
tolerations:
- key: "DatabaseWorkloads"
operator: "Exists"
effect: "PreferNoSchedule"
backup:
barmanObjectStore:
destinationPath: "gs://hub-pg-backups-staging"
googleCredentials:
gkeEnvironment: true
serviceAccountTemplate:
metadata:
annotations:
iam.gke.io/gcp-service-account: pg-backup-producer@absolute-nuance-268810.iam.gserviceaccount.com
# Sets the resources for Guaranteed QoS
resources:
requests:
memory: "1Gi"
cpu: 100m
limits:
memory: "4Gi"
cpu: 2
# Sets the 50GB storage for PGDATA
# This volume will also be used by the import
# process to temporarily store the custom format
# dump coming from the source database
storage:
size: 50Gi
#
# # Initializes the cluster from scratch (initdb bootstrap)
bootstrap:
initdb:
# Enables data checksums
dataChecksums: true
# Sets WAL segment size to 32MB
walSegmentSize: 32
import:
type: microservice
databases:
- hub
source:
externalCluster: imported-cluster
#
# # Defines the cluster-pg10 external cluster
# # by providing information on how to connect to the Postgres
# # instance, including user and password (contained in a
# # separate secret that you need to create).
externalClusters:
- name: imported-cluster
connectionParameters:
host: postgres-backup.default.svc.cluster.local
port: "5432"
user: youthink
dbname: hub
password:
name: imported-cluster-credentials
key: password