Cluster gets stuck after creating a new replica if there's no primary #1915

@blitss

Description

We're assessing CloudNativePG for running a production PostgreSQL instance and I encountered something strange.
When the primary instance gets shut down while replica creation is in progress, the cluster gets stuck forever, since the join job can't connect to the primary:

➜  ~ k get cluster -A
NAMESPACE   NAME   AGE   INSTANCES   READY   STATUS                   PRIMARY
app         hub    16h                       Creating a new replica   hub-4
➜  ~ k logs -f hub-5-join-gfx7t                   
Defaulted container "join" out of: join, bootstrap-controller (init)
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server.crt","secret":"hub-server"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server.key","secret":"hub-server"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/streaming_replica.crt","secret":"hub-replication"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/streaming_replica.key","secret":"hub-replication"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/client-ca.crt","secret":"hub-ca"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Refreshed configuration file","logging_pod":"hub-5-join","filename":"/controller/certificates/server-ca.crt","secret":"hub-ca"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"Waiting for server to be available","logging_pod":"hub-5-join","connectionString":"host=hub-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=hub-5-join sslmode=verify-ca dbname=postgres connect_timeout=5 replication=1"}
{"level":"info","ts":"2023-04-12T08:53:57Z","msg":"DB not available, will retry","logging_pod":"hub-5-join","err":"failed to connect to `host=hub-rw user=streaming_replica database=postgres`: dial error (dial tcp 172.20.8.3:5432: connect: connection refused)"}
{"level":"info","ts":"2023-04-12T08:54:02Z","msg":"DB not available, will retry","logging_pod":"hub-5-join","err":"failed to connect to `host=hub-rw user=streaming_replica database=postgres`: dial error (dial tcp 172.20.8.3:5432: connect: connection refused)"}
... the same lines repeat over and over

And the primary won't get recreated either. I'm not sure how to fix this: I can't create a primary, and I can't remove the new broken replica instance. The only way out of this seems to be creating a new cluster. Am I wrong here, though?

How to reproduce:

  1. Create cluster with 1 instance.
  2. Scale up to 2 instances.
  3. Delete the already-initialized primary instance with kubectl delete pod while replica creation is still running.
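For concreteness, the steps above can be sketched with kubectl (a sketch only: the manifest filename and the primary pod name `hub-1` are assumptions based on my setup below):

```shell
# 1. Create a cluster with a single instance (manifest below)
kubectl apply -f cluster.yaml

# 2. Scale up to 2 instances
kubectl patch cluster hub -n app --type merge -p '{"spec":{"instances":2}}'

# 3. While the <cluster>-<n>-join pod is still running, delete the
#    primary pod (hub-1 is an assumed name here)
kubectl delete pod hub-1 -n app

# Watch the cluster stay stuck in "Creating a new replica"
kubectl get cluster -n app -w
```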

Shouldn't it instead realize that there's no primary and try to re-create it?
I have a setup like this:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: hub
  namespace: app
spec:
  # A single instance (scaled up to 2 later to reproduce the issue)
  instances: 1

  affinity:
    nodeSelector:
      workload-type: database

    tolerations:
      - key: "DatabaseWorkloads"
        operator: "Exists"
        effect: "PreferNoSchedule"

  backup:
    barmanObjectStore:
      destinationPath: "gs://hub-pg-backups-staging"
      googleCredentials:
        gkeEnvironment: true

  serviceAccountTemplate:
    metadata:
      annotations:
        iam.gke.io/gcp-service-account: pg-backup-producer@absolute-nuance-268810.iam.gserviceaccount.com

  # Sets the resources for Guaranteed QoS
  resources:
    requests:
      memory: "1Gi"
      cpu: 100m
    limits:
      memory: "4Gi"
      cpu: 2

  # Sets the 50GB storage for PGDATA
  # This volume will also be used by the import
  # process to temporarily store the custom format
  # dump coming from the source database
  storage:
    size: 50Gi
  # Initializes the cluster from scratch (initdb bootstrap)
  bootstrap:
    initdb:
      # Enables data checksums
      dataChecksums: true
      # Sets WAL segment size to 32MB
      walSegmentSize: 32
      import:
        type: microservice
        databases:
          - hub
        source:
          externalCluster: imported-cluster
  # Defines the imported-cluster external cluster
  # by providing information on how to connect to the Postgres
  # instance, including user and password (contained in a
  # separate secret that you need to create).
  externalClusters:
    - name: imported-cluster
      connectionParameters:
        host: postgres-backup.default.svc.cluster.local
        port: "5432"
        user: youthink
        dbname: hub
      password:
        name: imported-cluster-credentials
        key: password
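For reference, the imported-cluster-credentials secret referenced above is created separately; a minimal sketch (the password value is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: imported-cluster-credentials
  namespace: app
type: Opaque
stringData:
  password: "<source-database-password>"
```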
