[Bug]: Primary PostgreSQL Instance Not Gracefully Shutting Down and Missing WAL File #3680
Comments
From what I can see, it looks like an issue with the underlying file system (data corruption of that file). You need to recreate the PVC of that instance.
It has happened two days in a row. We are using EC2 spot instances, and we managed to resolve it by deleting the problematic PVC (gp3), but manually recreating the PVC every day is probably not the desired behavior.
This might be related to #3698. Give us some time to investigate.
I am facing the same issue in multiple clusters within the same k8s cluster. I think it occurs when a network error happens. The only way to resolve it is to delete the PVC of the failing pod and then remove the pod; the operator then recreates the failing replica. Logs:
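For reference, a minimal sketch of this delete-and-recreate workaround, assuming a namespace `my-namespace` and a failing instance `pg-cluster-2` (both names are hypothetical placeholders; CNPG normally names the data PVC after the instance pod):

```shell
# Identify the failing instance's PVC (CNPG names the data PVC after the pod).
kubectl get pvc -n my-namespace

# Delete the PVC first, then the pod. The PVC stays in Terminating state
# until the pod that mounts it is gone (pvc-protection finalizer), so
# --wait=false lets the first command return immediately.
kubectl delete pvc pg-cluster-2 -n my-namespace --wait=false
kubectl delete pod pg-cluster-2 -n my-namespace

# With both gone, the operator re-provisions storage and re-clones the
# replica from the current primary.
```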
We are facing the same problem on Google GKE clusters. After each node upgrade, random instances have problems recovering. The only workaround we found is removing the PVC of one of the failing instances.
Just encountered the same issue in GKE.
I keep running into this every other day.
Also just discovered this. Is the CNPG team aware of it? Whom should we ping about this issue? The only working solution was to delete the failing Pod and PVC so that a new instance is created to join the cluster.
I am facing a similar issue. Is any fix available for the issue below? I tried deleting the pod, but it did not help.
@gbartolini is it possible to modify the health check behavior of the pods managed by CNPG?
@gbartolini Any updates on this would be greatly appreciated 🙏🏿
The only way I've been able to recover from this is to delete both the PVC and the Pod.
Same issue here. Luckily, our production cluster survived because one of the three replicas was still working. Deleting both the PVC and the Pod solved the issue, but it is not a permanent solution. We use GKE.
I'm facing the same issue using the CNPG community addon for MicroK8s, i.e. CNPG v1.22.0.
Seems to be gone with the latest v1.23. Same for you?
Indeed, it seems good for my cluster too.
I updated the operator to the latest version yesterday and restored the database; today it's the same again. It makes me nervous and angry. I expanded the PVC and everything is fixed now, but there are no related logs!
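For anyone who wants to try the same expansion workaround: a hedged sketch of growing the volume through the Cluster resource rather than editing the PVC directly (cluster name, namespace, and target size are placeholders; this only works if the storage class has `allowVolumeExpansion: true`):

```shell
# Increase spec.storage.size on the CNPG Cluster; the operator propagates
# the new size to the instance PVCs when the storage class supports
# online volume expansion.
kubectl patch clusters.postgresql.cnpg.io pg-cluster -n my-namespace \
  --type merge \
  -p '{"spec":{"storage":{"size":"20Gi"}}}'

# Watch the PVCs until the new capacity is reflected.
kubectl get pvc -n my-namespace -w
```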
I think I am facing a similar issue. I am running CNPG on AWS Spot Instances provisioned through the Karpenter cluster autoscaler. Currently only one of my three instances is working; the other two don't come up. I am using CNPG Operator v1.23.1. I am seeing error messages like this on the instance that won't come up:

```
{"level":"info","ts":"2024-08-06T10:12:20Z","logger":"postgres","msg":"record","logging_pod":"pg-cluster-3","record":{"log_time":"2024-08-06 10:12:20.737 UTC","process_id":"18434","session_id":"66b1f704.4802","session_line_num":"2","session_start_time":"2024-08-06 10:12:20 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"XX000","message":"could not receive data from WAL stream: ERROR: requested WAL segment 0000000A000001BB00000044 has already been removed","backend_type":"walreceiver"}}
{"level":"info","ts":"2024-08-06T10:12:20Z","logger":"postgres","msg":"record","logging_pod":"pg-cluster-3","record":{"log_time":"2024-08-06 10:12:20.913 UTC","user_name":"postgres","database_name":"postgres","process_id":"18441","connection_from":"[local]","session_id":"66b1f704.4809","session_line_num":"1","session_start_time":"2024-08-06 10:12:20 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend"}}
```

and

```
{"level":"info","ts":"2024-08-06T10:12:21Z","msg":"DB not available, will retry","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-cluster","namespace":"xxx-development"},"namespace":"xxx-development","name":"pg-cluster","reconcileID":"f6812cba-7a90-41fe-a115-01cc8edac2d5","logging_pod":"pg-cluster-3","err":"failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"}
```

Update: I managed to get the cluster back into a healthy state. While the CRD events were indicating a communication issue (saying something like nodes not reporting a healthy status), it turned out the problem was actually corrupted storage. After I deleted the PVCs before deleting the pods, new cluster instances were created, which after a short while showed up as healthy in CNPG's Grafana dashboard.
Is there an existing issue already for this bug?
- [x] I have read the troubleshooting guide
- [x] I am running a supported version of CloudNativePG
Contact Details
oren@aperio.ai
Version
1.22.0
What version of Kubernetes are you using?
1.27
What is your Kubernetes environment?
Cloud: Amazon EKS
How did you install the operator?
Helm
What happened?
We deployed the operator using a Helm chart and the PostgreSQL cluster resource using a separate Helm chart. Three instances of PostgreSQL were deployed on EC2 Spot instances. After a few hours, it appears that the primary instance (postgres-1) did not gracefully shut down, leading to the absence of the relevant Write-Ahead Logging (WAL) file.
Subsequently, attempts to use pg_rewind failed, with the process encountering an empty file. The file "pg_xact/0000" is also empty, contributing to the inability to resolve the issue and leaving the old primary instance stuck.
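To illustrate the state described above, a sketch of how the suspect files can be inspected from inside the stuck instance's pod (the PGDATA path is CNPG's usual default and the namespace is a placeholder; adjust both as needed):

```shell
# CNPG's default data directory inside an instance pod (assumption).
PGDATA=/var/lib/postgresql/data/pgdata

# Check the size of the commit-status file pg_rewind trips over;
# on the stuck former primary it reports 0 bytes.
kubectl exec postgres-1 -n my-namespace -- ls -l "$PGDATA/pg_xact/0000"

# List the WAL directory to see which segments survived the shutdown.
kubectl exec postgres-1 -n my-namespace -- ls -l "$PGDATA/pg_wal"
```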
Cluster resource
Relevant log output
Code of Conduct