Cluster gets stuck after creating a new replica if there's no primary #1915

@blitss

Description

We're assessing CloudNativePG for running a production PostgreSQL instance and I encountered something strange.
When the primary instance gets shut down while replica creation is in progress, the cluster gets stuck forever, since the join job can't connect to the primary:

➜  ~ k get cluster -A
NAMESPACE   NAME   AGE   INSTANCES   READY   STATUS                   PRIMARY
app         hub    16h                       Creating a new replica   hub-4
➜  ~ k logs -f hub-5-join-gfx7t                   
Defaulted container "join" out of: join, bootstrap-controller (init)
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server.crt","secret":"hub-server"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server.key","secret":"hub-server"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/streaming_replica.crt","secret":"hub-replication"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/streaming_replica.key","secret":"hub-replication"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/client-ca.crt","secret":"hub-ca"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server-ca.crt","secret":"hub-ca"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Waiting for server to be available","logging_pod":"hub-5-join","connectionString":"host=hub-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=hub-5-join sslmode=verify-ca dbname=postgres connect_timeout=5 replication=1"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"DB not available, will retry","logging_pod":"hub-5-join","err":"failed to connect to `host=hub-rw user=streaming_replica database=postgres`: dial error (dial tcp 172.20.8.3:5432: connect: connection refused)"}
{"level":"info","ts":"2023-04-12T08:54:02Z","msg":"DB not available, will retry","logging_pod":"hub-5-join","err":"failed to connect to `host=hub-rw user=streaming_replica database=postgres`: dial error (dial tcp 172.20.8.3:5432: connect: connection refused)"}
... the same lines repeat over and over

And the primary won't get recreated either. I'm not sure how to fix this: I can't create a primary, and I can't remove the new broken replica instance. The only way out of this seems to be creating a new cluster. Am I wrong here, though?

How to reproduce:

  1. Create cluster with 1 instance.
  2. Scale up to 2 instances.
  3. Delete the already-initialized primary instance with kubectl delete pod while replica creation is still running.
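For concreteness, the steps above can be sketched with kubectl (a sketch only: the manifest filename and the primary pod name `hub-1` are assumptions based on my setup below):

```shell
# 1. Create a cluster with a single instance (manifest below)
kubectl apply -f cluster.yaml

# 2. Scale up to 2 instances
kubectl patch cluster hub -n app --type merge -p '{"spec":{"instances":2}}'

# 3. While the <cluster>-<n>-join pod is still running, delete the
#    primary pod (hub-1 is an assumed name here)
kubectl delete pod hub-1 -n app

# Watch the cluster stay stuck in "Creating a new replica"
kubectl get cluster -n app -w
```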

Shouldn't it instead realize that there's no primary and try to re-create it?
I have a setup like this:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hub
  namespace: app
spec:
  # A single instance (scaled up to 2 later to reproduce the issue)
  instances: 1

  affinity:
    nodeSelector:
      workload-type: database

    tolerations:
      - key: "DatabaseWorkloads"
        operator: "Exists"
        effect: "PreferNoSchedule"

  backup:
    barmanObjectStore:
      destinationPath: "gs://hub-pg-backups-staging"
      googleCredentials:
        gkeEnvironment: true

  serviceAccountTemplate:
    metadata:
      annotations:
        iam.gke.io/gcp-service-account: pg-backup-producer@absolute-nuance-268810.iam.gserviceaccount.com

  # Sets the resources for Guaranteed QoS
  resources:
    requests:
      memory: "1Gi"
      cpu: 100m
    limits:
      memory: "4Gi"
      cpu: 2

  # Sets the 50GB storage for PGDATA
  # This volume will also be used by the import
  # process to temporarily store the custom format
  # dump coming from the source database
  storage:
    size: 50Gi
  # Initializes the cluster from scratch (initdb bootstrap)
  bootstrap:
    initdb:
      # Enables data checksums
      dataChecksums: true
      # Sets WAL segment size to 32MB
      walSegmentSize: 32
      import:
        type: microservice
        databases:
          - hub
        source:
          externalCluster: imported-cluster
  # Defines the imported-cluster external cluster
  # by providing information on how to connect to the Postgres
  # instance, including user and password (contained in a
  # separate secret that you need to create).
  externalClusters:
    - name: imported-cluster
      connectionParameters:
        host: postgres-backup.default.svc.cluster.local
        port: "5432"
        user: youthink
        dbname: hub
      password:
        name: imported-cluster-credentials
        key: password
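For reference, the imported-cluster-credentials secret referenced above is created separately; a minimal sketch (the password value is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: imported-cluster-credentials
  namespace: app
type: Opaque
stringData:
  password: "<source-database-password>"
```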
