[Bug]: pg_rewind: error: restore_command is not set in the target cluster #3698

Closed
4 tasks done
buroa opened this issue Jan 15, 2024 · 11 comments · Fixed by #3728
Labels: bug 🐛 Something isn't working

Comments

@buroa

buroa commented Jan 15, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

skre@skre.me

Version

Helm 0.20.0

What version of Kubernetes are you using?

K8s 1.29.0

What is your Kubernetes environment?

Talos Linux

How did you install the operator?

HelmRelease: https://github.com/buroa/k8s-gitops/blob/master/kubernetes/apps/databases/cloudnative-pg/app/helmrelease.yaml

Cluster: https://github.com/buroa/k8s-gitops/blob/master/kubernetes/apps/databases/cloudnative-pg/cluster/cluster.yaml

What happened?

After rebooting the node that hosts the primary instance, the instance fails to come back up with the error in the title.

Cluster resource

No response

Relevant log output

{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Cluster status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","currentPrimary":"postgres-2","targetPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"This is an old primary instance, waiting for the switchover to finish","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","currentPrimary":"postgres-2","targetPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Switchover completed","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","targetPrimary":"postgres-2","currentPrimary":"postgres-2"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Waiting for the new primary to be available","logging_pod":"postgres-1","primaryConnInfo":"host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres connect_timeout=5"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Extracting pg_controldata information","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","reason":"before pg_rewind"}
{"level":"info","ts":"2024-01-15T18:09:18Z","logger":"pg_controldata","msg":"pg_control version number:            1300\nCatalog version number:               202307071\nDatabase system identifier:           7293656391259123739\nDatabase cluster state:               shut down\npg_control last modified:             Mon 15 Jan 2024 06:08:32 PM UTC\nLatest checkpoint location:           63/9D000028\nLatest checkpoint's REDO location:    63/9D000028\nLatest checkpoint's REDO WAL file:    0000003A000000630000009D\nLatest checkpoint's TimeLineID:       58\nLatest checkpoint's PrevTimeLineID:   58\nLatest checkpoint's full_page_writes: on\nLatest checkpoint's NextXID:          0:13381695\nLatest checkpoint's NextOID:          406136\nLatest checkpoint's NextMultiXactId:  623\nLatest checkpoint's NextMultiOffset:  1305\nLatest checkpoint's oldestXID:        722\nLatest checkpoint's oldestXID's DB:   1\nLatest checkpoint's oldestActiveXID:  0\nLatest checkpoint's oldestMultiXid:   1\nLatest checkpoint's oldestMulti's DB: 1\nLatest checkpoint's oldestCommitTsXid:0\nLatest checkpoint's newestCommitTsXid:0\nTime of latest checkpoint:            Mon 15 Jan 2024 06:08:32 PM UTC\nFake LSN counter for unlogged rels:   0/3E8\nMinimum recovery ending location:     0/0\nMin recovery ending loc's timeline:   0\nBackup start location:                0/0\nBackup end location:                  0/0\nEnd-of-backup record required:        no\nwal_level setting:                    logical\nwal_log_hints setting:                on\nmax_connections setting:              600\nmax_worker_processes setting:         32\nmax_wal_senders setting:              10\nmax_prepared_xacts setting:           0\nmax_locks_per_xact setting:           64\ntrack_commit_timestamp setting:       off\nMaximum data alignment:               8\nDatabase block size:                  8192\nBlocks per segment of large relation: 131072\nWAL block size:                       8192\nBytes per WAL segment:                16777216\nMaximum length of identifiers:        64\nMaximum columns in an index:          32\nMaximum size of a TOAST chunk:        1996\nSize of a large-object chunk:         2048\nDate/time type storage:               64-bit integers\nFloat8 argument passing:              by value\nData page checksum version:           0\nMock authentication nonce:            d9a7ef70134dad18b34d8794e2b1048ed290244459677435e847bceecbb9c2ef\n","pipe":"stdout","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Starting up pg_rewind","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["-P","--source-server","host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"]}
{"level":"info","ts":"2024-01-15T18:09:18Z","logger":"pg_rewind","msg":"pg_rewind: error: restore_command is not set in the target cluster","pipe":"stderr","logging_pod":"postgres-1"}
{"level":"error","ts":"2024-01-15T18:09:18Z","msg":"Failed to execute pg_rewind","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","options":["-P","--source-server","host=postgres-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=postgres-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"],"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:128\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/postgres.(*Instance).Rewind\n\tpkg/management/postgres/instance.go:972\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).verifyPgDataCoherenceForPrimary\n\tinternal/management/controller/instance_startup.go:235\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).initialize\n\tinternal/management/controller/instance_controller.go:342\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).Reconcile\n\tinternal/management/controller/instance_controller.go:141\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"pg_rewind failed, starting the server to complete the crash recovery","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgres","namespace":"databases"},"namespace":"databases","name":"postgres","reconcileID":"07f6cdf8-9e3c-4e18-9920-3d4c4b9b24ad","uuid":"3356ab4d-b3d1-11ee-9999-86245bf254b5","logging_pod":"postgres-1","err":"error executing pg_rewind: exit status 1"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Waiting for server to complete crash recovery","logging_pod":"postgres-1"}
{"level":"info","ts":"2024-01-15T18:09:18Z","msg":"Starting up instance","logging_pod":"postgres-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["start","-w","-D","/var/lib/postgresql/data/pgdata","-o","-c port=5432 -c unix_socket_directories=/controller/run","-t 40000000"]}

Code of Conduct

  • I agree to follow this project's Code of Conduct
buroa added the `triage` (Pending triage) label on Jan 15, 2024
@gbartolini
Contributor

Can you please share the content of:

  • custom.conf
  • override.conf
  • postgresql.auto.conf

(They are inside PGDATA).

It strikes me that you are running with synchronous_commit set to off, which is not recommended for self-healing within a Kubernetes cluster unless you can live with some data loss.
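A quick way to pull those three files out of the pod, as a sketch assuming the pod, container, and data-directory names from the logs in this thread:

    # Dump the three configuration files from the instance's PGDATA
    for f in custom.conf override.conf postgresql.auto.conf; do
      echo "--- $f ---"
      kubectl exec -n databases postgres-1 -c postgres -- \
        cat "/var/lib/postgresql/data/pgdata/$f"
    done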

@gbartolini
Contributor

I think this is related to #3680

@buroa
Author

buroa commented Jan 17, 2024

> Can you please share the content of:
>
>   • custom.conf
>   • override.conf
>   • postgresql.auto.conf
>
> (They are inside PGDATA).
>
> It strikes me that you are running with synchronous_commit set to off, which is not recommended for self-healing within a Kubernetes cluster unless you can live with some data loss.

I'm fully aware of what synchronous_commit does. Here are the files you requested:

custom.conf
override.conf
postgresql.auto.conf

@gbartolini
Contributor

And what about the former primary?

@buroa
Author

buroa commented Jan 17, 2024

> And what about the former primary?

They look quite similar. I should note that nothing is particularly custom about my setup.

custom.conf
override.conf
postgresql.auto.conf

@clrxbl

clrxbl commented Jan 23, 2024

I have the same issue after deleting the PostgreSQL pods to update a service account: one of the two instances can no longer boot back up. Downgrading CloudNativePG does not solve the problem.

@leverege-jphelps

Seeing the same with a custom image (TimescaleDB + jsquery on a PostgreSQL 16 base / GKE 1.26 / CNPG 1.22). The system comes up and runs fine, but once we start failover testing there seems to be a 50/50 chance that a pod becomes unrecoverable.
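For anyone trying to reproduce this, a switchover can be triggered deliberately so the demoted primary has to go through the pg_rewind path; a sketch using the cnpg kubectl plugin, assuming the cluster and instance names from this thread:

    # Promote a replica; the demoted primary must then pg_rewind
    kubectl cnpg promote postgres postgres-1 -n databases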

mnencia added the `bug 🐛 Something isn't working` label and removed the `triage` label on Jan 26, 2024
@mnencia
Member

mnencia commented Jan 26, 2024

This bug is a regression caused by a bad interaction between #3535 and #3545, combined with a bug in pg_rewind's behavior.

The patch in #3728 will fix it.

We will prepare a patch release next week to address the issue.

@gbartolini
Contributor

As a workaround until their systems are upgraded, can users enable the ALTER SYSTEM command for now, @mnencia?
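For reference, ALTER SYSTEM support is governed by the Cluster resource's `spec.postgresql.enableAlterSystem` field; a sketch of flipping it with a merge patch, assuming the cluster name and namespace from this thread:

    # Temporary workaround: allow ALTER SYSTEM, which keeps
    # postgresql.auto.conf writable so the operator can set
    # restore_command there before running pg_rewind
    kubectl patch clusters.postgresql.cnpg.io postgres -n databases \
      --type merge -p '{"spec":{"postgresql":{"enableAlterSystem":true}}}'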

@mnencia
Member

mnencia commented Jan 26, 2024

@gbartolini, this is correct, but only if the bug has not yet hit. Once you see the error, the only simple recovery procedure is to delete the node.
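In instance terms, a sketch of that recovery, assuming the names from this thread. Deleting the PVC destroys that replica's local data, after which the operator re-creates the instance from the current primary:

    # Recover an instance that has already hit the error: remove its
    # PVC and pod so the operator re-clones it from the current primary
    # (with a separate walStorage there is also a postgres-1-wal PVC)
    kubectl delete pvc/postgres-1 pod/postgres-1 -n databases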

leonardoce added a commit that referenced this issue Jan 29, 2024
pg_rewind needs to be able to write all the files in the PostgreSQL data directory. For this reason, we always set the `postgresql.auto.conf` mode to 600 before running it.

After the PostgreSQL data directory is ready to be used, we revert the permissions so they are consistent with what the user specified in the `enableAlterSystem` configuration parameter.

Closes: #3698

Signed-off-by: Tao Li <tao.li@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
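Roughly, the manual equivalent of what the patch automates, as a sketch with the paths from the logs above and the source-server string abbreviated:

    PGDATA=/var/lib/postgresql/data/pgdata
    # Make postgresql.auto.conf writable so pg_rewind (and the operator,
    # which writes restore_command there) can modify it
    chmod 600 "$PGDATA/postgresql.auto.conf"
    pg_rewind -P --target-pgdata "$PGDATA" --restore-target-wal \
      --source-server 'host=postgres-rw user=streaming_replica ...'
    # Afterwards, revert to the mode implied by enableAlterSystem
    # (read-only when ALTER SYSTEM is disabled)
    chmod 400 "$PGDATA/postgresql.auto.conf"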
cnpg-bot pushed three commits that referenced this issue Jan 29, 2024: the same fix cherry-picked from commit 9ff7942 onto the release branches.
@buroa
Author

buroa commented Jan 29, 2024

Thank you! @leonardoce @gbartolini @mnencia
