[Bug]: pg_rewind: error: restore_command is not set in the target cluster #3698

Closed
4 tasks done
buroa opened this issue Jan 15, 2024 · 11 comments · Fixed by #3728
Labels: bug 🐛 Something isn't working

Comments

@buroa

buroa commented Jan 15, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

skre@skre.me

Version

Helm 0.20.0

What version of Kubernetes are you using?

K8s 1.29.0

What is your Kubernetes environment?

Talos Linux

How did you install the operator?

HelmRelease: https://github.com/buroa/k8s-gitops/blob/master/kubernetes/apps/databases/cloudnative-pg/app/helmrelease.yaml

Cluster: https://github.com/buroa/k8s-gitops/blob/master/kubernetes/apps/databases/cloudnative-pg/cluster/cluster.yaml

What happened?

After rebooting the node that hosts the primary instance, the instance fails to come back up with the error in the title.

Cluster resource

No response

Relevant log output

{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Cluster status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","currentPrimary":"postgres-2","targetPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"This is an old primary instance, waiting for the switchover to finish","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","currentPrimary":"postgres-2","targetPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Switchover completed","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","targetPrimary":"postgres-2","currentPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Waiting for the new primary to be available","logging_pod":"postgres-1","primaryConnInfo":"host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres connect_timeout=5"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Extracting pg_controldata information","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","reason":"before pg_rewind"}
{"level":"info","ts":"2024-01-15T18:09:18Z","logger":"pg_controldata","msg":"pg_control version number:            1300\nCatalog version number:               202307071\nDatabase system identifier:           7293656391259123739\nDatabase cluster state:               shut down\npg_control last modified:             Mon 15 Jan 2024 06:08:32 PM UTC\nLatest checkpoint location:           63/9D000028\nLatest checkpoint's REDO location:    63/9D000028\nLatest checkpoint's REDO WAL file:    0000003A000000630000009D\nLatest checkpoint's TimeLineID:       58\nLatest checkpoint's PrevTimeLineID:   58\nLatest checkpoint's full_page_writes: on\nLatest checkpoint's NextXID:          0:13381695\nLatest checkpoint's NextOID:          406136\nLatest checkpoint's NextMultiXactId:  623\nLatest checkpoint's NextMultiOffset:  1305\nLatest checkpoint's oldestXID:        722\nLatest checkpoint's oldestXID's DB:   1\nLatest checkpoint's oldestActiveXID:  0\nLatest checkpoint's oldestMultiXid:   1\nLatest checkpoint's oldestMulti's DB: 1\nLatest checkpoint's oldestCommitTsXid:0\nLatest checkpoint's newestCommitTsXid:0\nTime of latest checkpoint:            Mon 15 Jan 2024 06:08:32 PM UTC\nFake LSN counter for unlogged rels:   0/3E8\nMinimum recovery ending location:     0/0\nMin recovery ending loc's timeline:   0\nBackup start location:                0/0\nBackup end location:                  0/0\nEnd-of-backup record required:        no\nwal_level setting:                    logical\nwal_log_hints setting:                on\nmax_connections setting:              600\nmax_worker_processes setting:         32\nmax_wal_senders setting:              10\nmax_prepared_xacts setting:           0\nmax_locks_per_xact setting:           64\ntrack_commit_timestamp setting:       off\nMaximum data alignment:               8\nDatabase block size:                  8192\nBlocks per segment of large relation: 131072\nWAL block size:                       8192\nBytes per WAL segment:                16777216\nMaximum length of identifiers:        64\nMaximum columns in an index:          32\nMaximum size of a TOAST chunk:        1996\nSize of a large-object chunk:         2048\nDate/time type storage:               64-bit integers\nFloat8 argument passing:              by value\nData page checksum version:           0\nMock authentication nonce:            d9a7ef70134dad18b34d8794e2b1048ed290244459677435e847bceecbb9c2ef\n","pipe":"stdout","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Starting up pg_rewind","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["-P","--source-server","host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"]}
{"level":"info","ts":"2024-01-15T18:09:18Z","logger":"pg_rewind","msg":"pg_rewind: error: restore_command is not set in the target cluster","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"error","ts":"2024-01-15T18:09:18Z","msg":"Failed to execute pg_rewind","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","options":["-P","--source-server","host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"],"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:128\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/postgres.(*Instance).Rewind\n\tpkg/management/postgres/instance.go:972\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).verifyPgDataCoherenceForPrimary\n\tinternal/management/controller/instance_startup.go:235\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).initialize\n\tinternal/management/controller/instance_controller.go:342\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).Reconcile\n\tinternal/management/controller/instance_controller.go:141\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"pg_rewind failed, starting the server to complete the crash recovery","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","err":"error executing pg_rewind: exit status 1"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Waiting for server to complete crash recovery","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Starting up instance","logging_pod":"postgres-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["start","-w","-D","/var/lib/postgresql/data/pgdata","-o","-c port=5432 -c unix_socket_directories=/controller/run","-t 40000000"]}

Code of Conduct

  • I agree to follow this project's Code of Conduct
buroa added the `triage` (Pending triage) label on Jan 15, 2024
@gbartolini
Contributor

Can you please share the content of:

  • custom.conf
  • override.conf
  • postgresql.auto.conf

(They are inside PGDATA).

It strikes me that you are running with synchronous_commit set to off, which is not recommended for self-healing within a Kubernetes cluster unless you can live with some data loss.
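A quick way to pull those three files out of the pod, as a sketch assuming the pod, container, and data-directory names from the logs in this thread:

    # Dump the three configuration files from the instance's PGDATA
    for f in custom.conf override.conf postgresql.auto.conf; do
      echo "--- $f ---"
      kubectl exec -n databases postgres-1 -c postgres -- \
        cat "/var/lib/postgresql/data/pgdata/$f"
    done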

@gbartolini
Contributor

I think this is related to #3680

@buroa
Author

buroa commented Jan 17, 2024

> Can you please share the content of:
>
>   • custom.conf
>   • override.conf
>   • postgresql.auto.conf
>
> (They are inside PGDATA).
>
> It strikes me that you are running with synchronous_commit set to off, which is not recommended for self-healing within a Kubernetes cluster unless you can live with some data loss.

I'm fully aware of what synchronous_commit does. Here are the files you requested:

custom.conf
override.conf
postgresql.auto.conf

@gbartolini
Contributor

And what about the former primary?

@buroa
Author

buroa commented Jan 17, 2024

> And what about the former primary?

They look quite similar. I should note that nothing is particularly custom about my setup.

custom.conf
override.conf
postgresql.auto.conf

@clrxbl

clrxbl commented Jan 23, 2024

I have the same issue after deleting the PostgreSQL pods to update a service account: one of the two instances can no longer boot back up. Downgrading CloudNativePG does not solve the problem.

@leverege-jphelps

Seeing the same with a custom image (TimescaleDB + jsquery on a PostgreSQL 16 base / GKE 1.26 / CNPG 1.22). The system comes up and runs fine, but once we start failover testing there seems to be a 50/50 chance that a pod becomes unrecoverable.
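For anyone trying to reproduce this, a switchover can be triggered deliberately so the demoted primary has to go through the pg_rewind path; a sketch using the cnpg kubectl plugin, assuming the cluster and instance names from this thread:

    # Promote a replica; the demoted primary must then pg_rewind
    kubectl cnpg promote postgres postgres-1 -n databases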

mnencia added the `bug 🐛 Something isn't working` label and removed the `triage` label on Jan 26, 2024
@mnencia
Member

mnencia commented Jan 26, 2024

This bug is a regression caused by a bad interaction between #3535 and #3545, combined with a bug in pg_rewind's behavior.

The patch in #3728 will fix it.

We will prepare a patch release next week to address the issue.

@gbartolini
Contributor

As a workaround until their systems are upgraded, can users enable the ALTER SYSTEM command for now, @mnencia?
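For reference, ALTER SYSTEM support is governed by the Cluster resource's `spec.postgresql.enableAlterSystem` field; a sketch of flipping it with a merge patch, assuming the cluster name and namespace from this thread:

    # Temporary workaround: allow ALTER SYSTEM, which keeps
    # postgresql.auto.conf writable so the operator can set
    # restore_command there before running pg_rewind
    kubectl patch clusters.postgresql.cnpg.io postgres -n databases \
      --type merge -p '{"spec":{"postgresql":{"enableAlterSystem":true}}}'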

@mnencia
Member

mnencia commented Jan 26, 2024

@gbartolini, this is correct, but only if the bug has not yet hit. Once you see the error, the only simple recovery procedure is to delete the node.
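In instance terms, a sketch of that recovery, assuming the names from this thread. Deleting the PVC destroys that replica's local data, after which the operator re-creates the instance from the current primary:

    # Recover an instance that has already hit the error: remove its
    # PVC and pod so the operator re-clones it from the current primary
    # (with a separate walStorage there is also a postgres-1-wal PVC)
    kubectl delete pvc/postgres-1 pod/postgres-1 -n databases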

leonardoce added a commit that referenced this issue Jan 29, 2024
pg_rewind needs to be able to write all the files in the PostgreSQL data directory. For this reason, we always set the `postgresql.auto.conf` mode to 600 before running it.

After the PostgreSQL data directory is ready to be used, we revert the permissions so they are consistent with what the user specified in the `enableAlterSystem` configuration parameter.

Closes: #3698

Signed-off-by: Tao Li <tao.li@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
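Roughly, the manual equivalent of what the patch automates, as a sketch with the paths from the logs above and the source-server string abbreviated:

    PGDATA=/var/lib/postgresql/data/pgdata
    # Make postgresql.auto.conf writable so pg_rewind (and the operator,
    # which writes restore_command there) can modify it
    chmod 600 "$PGDATA/postgresql.auto.conf"
    pg_rewind -P --target-pgdata "$PGDATA" --restore-target-wal \
      --source-server 'host=postgres-rw user=streaming_replica ...'
    # Afterwards, revert to the mode implied by enableAlterSystem
    # (read-only when ALTER SYSTEM is disabled)
    chmod 400 "$PGDATA/postgresql.auto.conf"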
cnpg-bot pushed three commits that referenced this issue Jan 29, 2024: the same fix cherry-picked from commit 9ff7942 onto the release branches.
@buroa
Author

buroa commented Jan 29, 2024

Thank you! @leonardoce @gbartolini @mnencia
