[Bug]: Primary PostgreSQL Instance Not Gracefully Shutting Down and Missing WAL File #3680

Open
orenzah opened this issue Jan 10, 2024 · 12 comments
Labels: triage Pending triage

orenzah commented Jan 10, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

oren@aperio.ai

Version

1.22.0

What version of Kubernetes are you using?

1.27

What is your Kubernetes environment?

Cloud: Amazon EKS

How did you install the operator?

Helm

What happened?

We deployed the operator using a Helm chart and the PostgreSQL Cluster resource using a separate Helm chart. Three PostgreSQL instances were deployed on EC2 Spot instances. After a few hours, it appears that the primary instance (postgres-1) did not shut down gracefully, leaving the relevant Write-Ahead Log (WAL) file missing.

Subsequently, attempts to run pg_rewind failed because the process encountered an empty WAL file. The file "pg_xact/0000" is also empty, which prevents recovery and leaves the old primary instance stuck.
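
For reference, a minimal sketch of how the affected files can be inspected from inside the instance pod (assuming the default CNPG "postgres" container name, the pod and namespace names used in this report, and the data directory path and WAL segment name taken from the logs below):

# Check the size of the WAL segment that pg_rewind reports as unreadable.
kubectl exec -n stg postgres-1 -c postgres -- \
  ls -l /var/lib/postgresql/data/pgdata/pg_wal/0000000100000002000000A8

# Check the transaction status file that the startup process fails to read.
kubectl exec -n stg postgres-1 -c postgres -- \
  ls -l /var/lib/postgresql/data/pgdata/pg_xact/0000

# Confirm the recorded cluster state and latest checkpoint location.
kubectl exec -n stg postgres-1 -c postgres -- \
  pg_controldata /var/lib/postgresql/data/pgdata

Both files are reported as empty here, consistent with the "read 0 of 8192" and "read too few bytes" errors in the logs.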

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    meta.helm.sh/release-name: postgres-cn
    meta.helm.sh/release-namespace: stg
  creationTimestamp: "2024-01-09T12:06:49Z"
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: postgres
  namespace: stg
  resourceVersion: "346234286"
  uid: 69ddea55-1c72-4e9f-abf6-aa2e43a32db4
spec:
  affinity:
    podAntiAffinityType: preferred
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      import:
        databases:
        - webapi
        source:
          externalCluster: xxxxxx
        type: microservice
      localeCType: C
      localeCollate: C
      owner: app
  enableSuperuserAccess: true
  externalClusters:
  - connectionParameters:
      dbname: webapi
      host: xxxxxxxxxxxx
      user: postgres
    name: xxxxxxx
    password:
      key: postgres-password
      name: xxxxxxxxxxxxxx
  failoverDelay: 0
  imageName: xxxxxxx/postgresql:16.1-xxxxxxxxx
  inheritedMetadata:
    labels:
      app.kubernetes.io/name: postgres
      app.kubernetes.io/version: "16.1"
  instances: 3
  logLevel: info
  maxSyncReplicas: 2
  minSyncReplicas: 1
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: false
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      wal_keep_size: 512MB
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: restart
  primaryUpdateStrategy: unsupervised
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    updateInterval: 30
  resources: {}
  smartShutdownTimeout: 180
  startDelay: 3600
  stopDelay: 1800
  storage:
    resizeInUseVolumes: true
    size: 30Gi
    storageClass: gp3
  superuserSecret:
    name: postgres-superuser
  switchoverDelay: 3600
status:
  certificates:
    clientCASecret: postgres-ca
    expirations:
      postgres-ca: 2024-04-08 12:01:49 +0000 UTC
      postgres-replication: 2024-04-08 12:01:49 +0000 UTC
      postgres-server: 2024-04-08 12:01:49 +0000 UTC
    replicationTLSSecret: postgres-replication
    serverAltDNSNames:
    - postgres-rw
    - postgres-rw.stg
    - postgres-rw.stg.svc
    - postgres-r
    - postgres-r.stg
    - postgres-r.stg.svc
    - postgres-ro
    - postgres-ro.stg
    - postgres-ro.stg.svc
    serverCASecret: postgres-ca
    serverTLSSecret: postgres-server
  cloudNativePGCommitHash: 86b9dc80
  cloudNativePGOperatorHash: 411a67468c5e1357d387221ce5b11d906bd33d62324846a8c776d206b6c94162
  conditions:
  - lastTransitionTime: "2024-01-10T08:20:16Z"
    message: Cluster Is Not Ready
    reason: ClusterIsNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-01-10T12:18:14Z"
    message: Continuous archiving is working
    reason: ContinuousArchivingSuccess
    status: "True"
    type: ContinuousArchiving
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "345484626"
  currentPrimary: postgres-2
  currentPrimaryTimestamp: "2024-01-10T08:21:59.663283Z"
  healthyPVC:
  - postgres-1
  - postgres-2
  - postgres-3
  instanceNames:
  - postgres-1
  - postgres-2
  - postgres-3
  instances: 3
  instancesReportedState:
    postgres-1:
      isPrimary: false
    postgres-2:
      isPrimary: true
      timeLineID: 2
    postgres-3:
      isPrimary: false
      timeLineID: 2
  instancesStatus:
    healthy:
    - postgres-2
    - postgres-3
    replicating:
    - postgres-1
  latestGeneratedNode: 3
  managedRolesStatus: {}
  phase: Waiting for the instances to become active
  phaseReason: Some instances are not yet active. Please wait.
  poolerIntegrations:
    pgBouncerIntegration: {}
  pvcCount: 3
  readService: postgres-r
  readyInstances: 2
  secretsResourceVersion:
    applicationSecretVersion: "345484596"
    clientCaSecretVersion: "345484593"
    externalClusterSecretVersion:
      stg-postgres-postgresql: "231319709"
    replicationSecretVersion: "345484595"
    serverCaSecretVersion: "345484593"
    serverSecretVersion: "345484594"
    superuserSecretVersion: "341642273"
  targetPrimary: postgres-2
  targetPrimaryTimestamp: "2024-01-10T08:20:16.701057Z"
  timelineID: 2
  topology:
    instances:
      postgres-1: {}
      postgres-2: {}
      postgres-3: {}
    nodesUsed: 2
    successfullyExtracted: true
  writeService: postgres-rw

Relevant log output

{"level":"error","ts":"2024-01-10T10:58:19Z","msg":"Reconciler error","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"9b64cee9-7e71-4e05-9bb0-e1cbaee76c9d","error":"while activating instance: error starting PostgreSQL instance: exit status 1","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Cluster status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","currentPrimary":"postgres-2","targetPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"This is an old primary instance, waiting for the switchover to finish","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","currentPrimary":"postgres-2","targetPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Switchover completed","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","targetPrimary":"postgres-2","currentPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Waiting for the new primary to be available","logging_pod":"postgres-1","primaryConnInfo":"host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres connect_timeout=5"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Extracting pg_controldata information","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","reason":"before pg_rewind"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_controldata","msg":"pg_control version number:            1300\nCatalog version number:               202307071\nDatabase system identifier:           7322068936411099156\nDatabase cluster state:               shut down\npg_control last modified:             Wed 10 Jan 2024 08:19:35 AM UTC\nLatest checkpoint location:           2/AA000028\nLatest checkpoint's REDO location:    2/AA000028\nLatest checkpoint's REDO WAL file:    0000000100000002000000AA\nLatest checkpoint's TimeLineID:       1\nLatest checkpoint's PrevTimeLineID:   1\nLatest checkpoint's full_page_writes: on\nLatest checkpoint's NextXID:          0:36923\nLatest checkpoint's NextOID:          25092\nLatest checkpoint's NextMultiXactId:  1\nLatest checkpoint's NextMultiOffset:  0\nLatest checkpoint's oldestXID:        722\nLatest checkpoint's oldestXID's DB:   1\nLatest checkpoint's oldestActiveXID:  0\nLatest checkpoint's oldestMultiXid:   1\nLatest checkpoint's oldestMulti's DB: 1\nLatest checkpoint's oldestCommitTsXid:0\nLatest checkpoint's newestCommitTsXid:0\nTime of latest checkpoint:            Wed 10 Jan 2024 08:19:35 AM UTC\nFake LSN counter for unlogged rels:   0/3E8\nMinimum recovery ending location:     0/0\nMin recovery ending loc's timeline:   0\nBackup start location:                0/0\nBackup end location:                  0/0\nEnd-of-backup record required:        no\nwal_level setting:                    logical\nwal_log_hints setting:                on\nmax_connections setting:              100\nmax_worker_processes setting:         32\nmax_wal_senders setting:              10\nmax_prepared_xacts setting:           0\nmax_locks_per_xact setting:           64\ntrack_commit_timestamp setting:       off\nMaximum data alignment:               8\nDatabase block size:                  8192\nBlocks per segment of large relation: 131072\nWAL block size:                       8192\nBytes per WAL segment:                16777216\nMaximum length of identifiers:        64\nMaximum columns in an index:          32\nMaximum size of a TOAST chunk:        1996\nSize of a large-object chunk:         2048\nDate/time type storage:               64-bit integers\nFloat8 argument passing:              by value\nData page checksum version:           0\nMock authentication nonce:            4e9ac5de19e20369e90184039bdcf5c867fa08bd811982f2edb29d1f11d8fd7b\n","pipe":"stdout","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Starting up pg_rewind","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["-P","--source-server","host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"]}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_rewind","msg":"pg_rewind: connected to server","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_rewind","msg":"pg_rewind: servers diverged at WAL location 2/A9000860 on timeline 1","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_rewind","msg":"pg_rewind: error: could not read file \"/var/lib/postgresql/data/pgdata/pg_wal/0000000100000002000000A8\": read 0 of 8192","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_rewind","msg":"pg_rewind: error: could not find previous WAL record at 2/A8241C48","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"error","ts":"2024-01-10T10:58:28Z","msg":"Failed to execute pg_rewind","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","options":["-P","--source-server","host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"],"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:128\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/postgres.(*Instance).Rewind\n\tpkg/management/postgres/instance.go:972\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).verifyPgDataCoherenceForPrimary\n\tinternal/management/controller/instance_startup.go:235\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).initialize\n\tinternal/management/controller/instance_controller.go:342\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).Reconcile\n\tinternal/management/controller/instance_controller.go:141\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"pg_rewind failed, starting the server to complete the crash recovery","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","uuid":"2f587c83-afa7-11ee-9aef-929973a308f2","logging_pod":"postgres-1","err":"error executing pg_rewind: exit status 1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Waiting for server to complete crash recovery","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Starting up instance","logging_pod":"postgres-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["start","-w","-D","/var/lib/postgresql/data/pgdata","-o","-c port=5432 -c unix_socket_directories=/controller/run","-t 40000000"]}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_ctl","msg":"waiting for server to start....2024-01-10 10:58:28.483 UTC [348] LOG:  redirecting log output to logging collector process","pipe":"stdout","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_ctl","msg":"2024-01-10 10:58:28.483 UTC [348] HINT:  Future log output will appear in directory \"/controller/log\".","pipe":"stdout","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"2024-01-10 10:58:28.484 UTC [348] LOG:  ending log output to stderr","source":"/controller/log/postgres","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"2024-01-10 10:58:28.484 UTC [348] HINT:  Future log output will go to log destination \"csvlog\".","source":"/controller/log/postgres","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.484 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"1","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"ending log output to stderr","hint":"Future log output will go to log destination \"csvlog\".","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.484 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"2","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"starting PostgreSQL 16.1 (Debian 16.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.484 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"3","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv4 address \"0.0.0.0\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.484 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"4","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv6 address \"::\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.488 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"5","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on Unix socket \"/controller/run/.s.PGSQL.5432\"","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.495 UTC","process_id":"352","session_id":"659e7854.160","session_line_num":"1","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system was shut down at 2024-01-10 08:19:35 UTC","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.495 UTC","process_id":"352","session_id":"659e7854.160","session_line_num":"2","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"XX000","message":"could not access status of transaction 36923","detail":"Could not read from file \"pg_xact/0000\" at offset 8192: read too few bytes.","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.496 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"6","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"startup process (PID 352) exited with exit code 1","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.497 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"7","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"aborting startup due to startup process failure","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"postgres","msg":"record","logging_pod":"postgres-1","record":{"log_time":"2024-01-10 10:58:28.497 UTC","process_id":"348","session_id":"659e7854.15c","session_line_num":"8","session_start_time":"2024-01-10 10:58:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system is shut down","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_ctl","msg":" stopped waiting","pipe":"stdout","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_ctl","msg":"pg_ctl: could not start server","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","logger":"pg_ctl","msg":"Examine the log output.","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-10T10:58:28Z","msg":"Exited log pipe","fileName":"/controller/log/postgres.csv","logging_pod":"postgres-1"}
{"level":"error","ts":"2024-01-10T10:58:28Z","msg":"Reconciler error","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"stg"},"namespace":"stg","name":"postgres","reconcileID":"b6dbc5ac-aab1-4528-bc0c-f67aa8dac9b1","error":"while activating instance: error starting PostgreSQL instance: exit status 1","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

Code of Conduct

  • I agree to follow this project's Code of Conduct
orenzah added the triage Pending triage label on Jan 10, 2024
@gbartolini (Contributor)

From what I can see, it looks like an issue with the underlying file system (data corruption of that file). You need to recreate the PVC of that instance.

orenzah commented Jan 11, 2024

It has happened two days in a row. We are using EC2 Spot instances, and we managed to work around it by deleting the problematic PVC (gp3), but having to recreate the PVC manually every day is probably not the desired behavior.

@gbartolini (Contributor)

This might be related to #3698. Give us some time to investigate.

janosmiko commented Feb 29, 2024

I'm facing the same issue in multiple PostgreSQL clusters within the same Kubernetes cluster. I think it occurs when a network error happens.

The only way to resolve it is to delete the PVC of the failing pod and then remove the pod; the operator then recreates the failing replica.
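
For clarity, a minimal sketch of that workaround (assuming the failing instance is harbor-postgres-3 in the harbor namespace, as in the logs below, that the PVC carries the same name as the instance, and that kubectl access is available; adjust names to your cluster):

# Delete the PVC of the failing instance; it stays in Terminating state
# until the pod that mounts it is gone.
kubectl delete pvc harbor-postgres-3 -n harbor --wait=false

# Delete the failing pod so that the PVC deletion can complete.
kubectl delete pod harbor-postgres-3 -n harbor

# The operator notices the missing PVC and pod, recreates the instance,
# and re-clones it from the current primary.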

Logs:

{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"setup","msg":"Starting CloudNativePG Instance Manager","logging_pod":"harbor-postgres-3","version":"1.22.1","build":{"Version":"1.22.1","Commit":"c7be872e","Date":"2024-02-02"}}
{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"setup","msg":"starting tablespace manager","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"setup","msg":"starting external server manager","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"setup","msg":"starting controller-runtime manager","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"roles_reconciler","msg":"starting up the runnable","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"roles_reconciler","msg":"skipping the RoleSynchronizer in replicas","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:28Z","logger":"roles_reconciler","msg":"setting up RoleSynchronizer loop","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting webserver","logging_pod":"harbor-postgres-3","address":":9187"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting webserver","logging_pod":"harbor-postgres-3","address":"localhost:8010"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting webserver","logging_pod":"harbor-postgres-3","address":":8000"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-02-29T07:12:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-02-29T07:12:29Z","msg":"Installed configuration file","logging_pod":"harbor-postgres-3","pgdata":"/var/lib/postgresql/data/pgdata","filename":"pg_ident.conf"}
{"level":"info","ts":"2024-02-29T07:12:29Z","msg":"Updated replication settings","logging_pod":"harbor-postgres-3","filename":"override.conf"}
{"level":"info","ts":"2024-02-29T07:12:29Z","msg":"Found previous run flag","logging_pod":"harbor-postgres-3","filename":"/var/lib/postgresql/data/pgdata/cnpg_initialized-harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","msg":"Extracting pg_controldata information","logging_pod":"harbor-postgres-3","reason":"postmaster start up"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"pg_controldata","msg":"pg_control version number:            1300\nCatalog version number:               202307071\nDatabase system identifier:           7321087858539200534\nDatabase cluster state:               shut down in recovery\npg_control last modified:             Thu 29 Feb 2024 06:43:08 AM UTC\nLatest checkpoint location:           E/10000028\nLatest checkpoint's REDO location:    E/10000028\nLatest checkpoint's REDO WAL file:    0000000C0000000E00000010\nLatest checkpoint's TimeLineID:       12\nLatest checkpoint's PrevTimeLineID:   12\nLatest checkpoint's full_page_writes: on\nLatest checkpoint's NextXID:          0:93004\nLatest checkpoint's NextOID:          25537\nLatest checkpoint's NextMultiXactId:  2\nLatest checkpoint's NextMultiOffset:  3\nLatest checkpoint's oldestXID:        722\nLatest checkpoint's oldestXID's DB:   1\nLatest checkpoint's oldestActiveXID:  0\nLatest checkpoint's oldestMultiXid:   1\nLatest checkpoint's oldestMulti's DB: 1\nLatest checkpoint's oldestCommitTsXid:0\nLatest checkpoint's newestCommitTsXid:0\nTime of latest checkpoint:            Wed 28 Feb 2024 09:53:23 PM UTC\nFake LSN counter for unlogged rels:   0/3E8\nMinimum recovery ending location:     E/100000A0\nMin recovery ending loc's timeline:   12\nBackup start location:                0/0\nBackup end location:                  0/0\nEnd-of-backup record required:        no\nwal_level setting:                    logical\nwal_log_hints setting:                on\nmax_connections setting:              100\nmax_worker_processes setting:         32\nmax_wal_senders setting:              10\nmax_prepared_xacts setting:           0\nmax_locks_per_xact setting:           64\ntrack_commit_timestamp setting:       off\nMaximum data alignment:               8\nDatabase block size:                  8192\nBlocks per segment of large relation: 131072\nWAL block size:                       8192\nBytes per WAL segment:                16777216\nMaximum length of identifiers:        64\nMaximum columns in an index:          32\nMaximum size of a TOAST chunk:        1996\nSize of a large-object chunk:         2048\nDate/time type storage:               64-bit integers\nFloat8 argument passing:              by value\nData page checksum version:           0\nMock authentication nonce:            431926e09764e6c576b934232250a417e85da4f29af6306f5db1e88e1660bc4f\n","pipe":"stdout","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","msg":"Instance is still down, will retry in 1 second","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"harbor-postgres","namespace":"harbor"},"namespace":"harbor","name":"harbor-postgres","reconcileID":"0846c8b9-adae-4fa0-9085-82724a409769","uuid":"e5f15169-d6d1-11ee-83aa-4ade4de24c82","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"2024-02-29 07:12:29.099 UTC [24] LOG:  pgaudit extension initialized","pipe":"stderr","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"2024-02-29 07:12:29.152 UTC [24] LOG:  redirecting log output to logging collector process","pipe":"stderr","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"2024-02-29 07:12:29.152 UTC [24] HINT:  Future log output will appear in directory \"/controller/log\".","pipe":"stderr","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.152 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"1","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"ending log output to stderr","hint":"Future log output will go to log destination \"csvlog\".","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.152 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"2","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"starting PostgreSQL 16.1 (Debian 16.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.152 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"3","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv4 address \"0.0.0.0\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.152 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"4","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv6 address \"::\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"2024-02-29 07:12:29.152 UTC [24] LOG:  ending log output to stderr","source":"/controller/log/postgres","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"2024-02-29 07:12:29.152 UTC [24] HINT:  Future log output will go to log destination \"csvlog\".","source":"/controller/log/postgres","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.167 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"5","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on Unix socket \"/controller/run/.s.PGSQL.5432\"","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.198 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"1","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system was shut down in recovery at 2024-02-29 06:43:08 UTC","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"wal-restore","msg":"Restored WAL file","logging_pod":"harbor-postgres-3","walName":"0000000D.history","startTime":"2024-02-29T07:12:29Z","endTime":"2024-02-29T07:12:29Z","elapsedWalTime":0.633726065}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"wal-restore","msg":"WAL restore command completed (parallel)","logging_pod":"harbor-postgres-3","walName":"0000000D.history","maxParallel":1,"successfulWalRestore":1,"failedWalRestore":0,"endOfWALStream":false,"startTime":"2024-02-29T07:12:29Z","downloadStartTime":"2024-02-29T07:12:29Z","downloadTotalTime":0.634063937,"totalTime":0.734779978}
{"level":"info","ts":"2024-02-29T07:12:29Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:29.964 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"2","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000D.history\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.066 UTC","user_name":"postgres","database_name":"postgres","process_id":"52","connection_from":"[local]","session_id":"65e02e5e.34","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","msg":"Updated replication settings","logging_pod":"harbor-postgres-3","filename":"override.conf"}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.222 UTC","user_name":"postgres","database_name":"postgres","process_id":"55","connection_from":"[local]","session_id":"65e02e5e.37","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.245 UTC","user_name":"postgres","database_name":"postgres","process_id":"56","connection_from":"[local]","session_id":"65e02e5e.38","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.246 UTC","user_name":"postgres","database_name":"postgres","process_id":"57","connection_from":"[local]","session_id":"65e02e5e.39","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","msg":"DB not available, will retry","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"harbor-postgres","namespace":"harbor"},"namespace":"harbor","name":"harbor-postgres","reconcileID":"fbc281dc-1514-4b0a-9c39-0689e3a17f0a","uuid":"e6a01d01-d6d1-11ee-83aa-4ade4de24c82","logging_pod":"harbor-postgres-3","err":"failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.250 UTC","user_name":"postgres","database_name":"postgres","process_id":"58","connection_from":"[local]","session_id":"65e02e5e.3a","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.305 UTC","user_name":"postgres","database_name":"postgres","process_id":"59","connection_from":"[local]","session_id":"65e02e5e.3b","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.571 UTC","user_name":"postgres","database_name":"postgres","process_id":"60","connection_from":"[local]","session_id":"65e02e5e.3c","session_line_num":"1","session_start_time":"2024-02-29 07:12:30 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"harbor-postgres-3","walName":"0000000E.history","options":["--endpoint-url","https://s3.eu-central-1.amazonaws.com","--cloud-provider","aws-s3","s3://TRUNCATED/","harbor-postgres"],"startTime":"2024-02-29T07:12:30Z","endTime":"2024-02-29T07:12:30Z","elapsedWalTime":0.555448566}
{"level":"info","ts":"2024-02-29T07:12:30Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:30.751 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"3","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"entering standby mode","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:31Z","msg":"Updated replication settings","logging_pod":"harbor-postgres-3","filename":"override.conf"}
{"level":"info","ts":"2024-02-29T07:12:31Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:31.397 UTC","user_name":"postgres","database_name":"postgres","process_id":"74","connection_from":"[local]","session_id":"65e02e5f.4a","session_line_num":"1","session_start_time":"2024-02-29 07:12:31 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:31Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:31.399 UTC","user_name":"postgres","database_name":"postgres","process_id":"75","connection_from":"[local]","session_id":"65e02e5f.4b","session_line_num":"1","session_start_time":"2024-02-29 07:12:31 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:31Z","msg":"DB not available, will retry","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"harbor-postgres","namespace":"harbor"},"namespace":"harbor","name":"harbor-postgres","reconcileID":"70ee1fc8-5c19-43f4-a7d1-3bc8df2a218d","uuid":"e75275e9-d6d1-11ee-83aa-4ade4de24c82","logging_pod":"harbor-postgres-3","err":"failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"}
{"level":"info","ts":"2024-02-29T07:12:31Z","logger":"wal-restore","msg":"Restored WAL file","logging_pod":"harbor-postgres-3","walName":"0000000D.history","startTime":"2024-02-29T07:12:30Z","endTime":"2024-02-29T07:12:31Z","elapsedWalTime":0.524848024}
{"level":"info","ts":"2024-02-29T07:12:31Z","logger":"wal-restore","msg":"WAL restore command completed (parallel)","logging_pod":"harbor-postgres-3","walName":"0000000D.history","maxParallel":1,"successfulWalRestore":1,"failedWalRestore":0,"endOfWALStream":false,"startTime":"2024-02-29T07:12:30Z","downloadStartTime":"2024-02-29T07:12:30Z","downloadTotalTime":0.525037538,"totalTime":0.633526301}
{"level":"info","ts":"2024-02-29T07:12:31Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:31.418 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"4","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000D.history\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:31Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:31.911 UTC","user_name":"postgres","database_name":"postgres","process_id":"88","connection_from":"[local]","session_id":"65e02e5f.58","session_line_num":"1","session_start_time":"2024-02-29 07:12:31 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.059 UTC","user_name":"postgres","database_name":"postgres","process_id":"89","connection_from":"[local]","session_id":"65e02e60.59","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.072 UTC","user_name":"postgres","database_name":"postgres","process_id":"90","connection_from":"[local]","session_id":"65e02e60.5a","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.129 UTC","user_name":"postgres","database_name":"postgres","process_id":"91","connection_from":"[local]","session_id":"65e02e60.5b","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"wal-restore","msg":"Restored WAL file","logging_pod":"harbor-postgres-3","walName":"0000000D0000000E00000010","startTime":"2024-02-29T07:12:31Z","endTime":"2024-02-29T07:12:32Z","elapsedWalTime":0.614393645}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"wal-restore","msg":"WAL restore command completed (parallel)","logging_pod":"harbor-postgres-3","walName":"0000000D0000000E00000010","maxParallel":1,"successfulWalRestore":1,"failedWalRestore":0,"endOfWALStream":false,"startTime":"2024-02-29T07:12:31Z","downloadStartTime":"2024-02-29T07:12:31Z","downloadTotalTime":0.61485673,"totalTime":0.708725379}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.200 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"5","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000D0000000E00000010\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.392 UTC","user_name":"postgres","database_name":"postgres","process_id":"93","connection_from":"[local]","session_id":"65e02e60.5d","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.416 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"6","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"invalid resource manager ID in checkpoint record","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.416 UTC","process_id":"28","session_id":"65e02e5d.1c","session_line_num":"7","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"PANIC","sql_state_code":"XX000","message":"could not locate a valid checkpoint record","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.434 UTC","user_name":"postgres","database_name":"postgres","process_id":"94","connection_from":"[local]","session_id":"65e02e60.5e","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.494 UTC","user_name":"postgres","database_name":"postgres","process_id":"96","connection_from":"[local]","session_id":"65e02e60.60","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","msg":"DB not available, will retry","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"harbor-postgres","namespace":"harbor"},"namespace":"harbor","name":"harbor-postgres","reconcileID":"d5a0ece4-2194-454f-80c1-1c370fda5f19","uuid":"e801d87a-d6d1-11ee-83aa-4ade4de24c82","logging_pod":"harbor-postgres-3","err":"failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.496 UTC","user_name":"postgres","database_name":"postgres","process_id":"97","connection_from":"[local]","session_id":"65e02e60.61","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:32Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:32.634 UTC","user_name":"postgres","database_name":"postgres","process_id":"98","connection_from":"[local]","session_id":"65e02e60.62","session_line_num":"1","session_start_time":"2024-02-29 07:12:32 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:33Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:33.283 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"6","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"startup process (PID 28) was terminated by signal 6: Aborted","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:33Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:33.283 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"7","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"aborting startup due to startup process failure","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:33Z","logger":"postgres","msg":"record","logging_pod":"harbor-postgres-3","record":{"log_time":"2024-02-29 07:12:33.285 UTC","process_id":"24","session_id":"65e02e5d.18","session_line_num":"8","session_start_time":"2024-02-29 07:12:29 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system is shut down","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Extracting pg_controldata information","logging_pod":"harbor-postgres-3","reason":"postmaster has exited"}
{"level":"error","ts":"2024-02-29T07:12:33Z","msg":"PostgreSQL process exited with errors","logging_pod":"harbor-postgres-3","error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:128\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/instance/run/lifecycle.(*PostgresLifecycle).Start\n\tinternal/cmd/manager/instance/run/lifecycle/lifecycle.go:98\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"error","ts":"2024-02-29T07:12:33Z","msg":"error received after stop sequence was engaged","error":"exit status 1","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/internal.go:490"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Webserver exited","logging_pod":"harbor-postgres-3","address":":9187"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Webserver exited","logging_pod":"harbor-postgres-3","address":":8000"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Exited log pipe","fileName":"/controller/log/postgres.json","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Exited log pipe","fileName":"/controller/log/postgres","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"All workers finished","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Webserver exited","logging_pod":"harbor-postgres-3","address":"localhost:8010"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"All workers finished","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"All workers finished","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Exited log pipe","fileName":"/controller/log/postgres.csv","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2024-02-29T07:12:33Z","logger":"roles_reconciler","msg":"Terminated RoleSynchronizer loop","logging_pod":"harbor-postgres-3"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2024-02-29T07:12:33Z","msg":"Wait completed, proceeding to shutdown the manager"}

ensdhvi commented Mar 28, 2024

We are facing the same problem on Google GKE clusters. After each node upgrade, random instances have problems recovering. The only workaround we have found is removing the failing instance's PVC.

movergan commented Apr 1, 2024

Just encountered the same issue in GKE.

suleimi commented Apr 20, 2024

I keep running into this every other day.

maaft commented Apr 22, 2024

Also just discovered this. Is the CNPG team aware of this issue? Whom should we ping about it?

The only working solution was to delete the failing Pod and PVC so that a new instance is created to join the cluster.

@oberai07

I am facing a similar issue. Is there any fix available for the error below? I tried deleting the pod, but it did not help.

{"level":"error","ts":"2024-02-29T07:12:33Z","msg":"PostgreSQL process exited with errors","logging_pod":"harbor-postgres-3","error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:128\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/instance/run/lifecycle.(*PostgresLifecycle).Start\n\tinternal/cmd/manager/instance/run/lifecycle/lifecycle.go:98\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223"}

@janosmiko

@gbartolini is it possible to modify the health check behavior of the pods managed by CNPG?

suleimi commented Apr 26, 2024

> This might be related to #3698. Give us some time to investigate.

@gbartolini Any updates on this would be greatly appreciated 🙏🏿

@brettinternet

The only way I've been able to recover from this is to delete both the PVC and the Pod.
