fix: set wal_sender_timeout to 5min when joining through pg_basebackup #3586

Merged
merged 4 commits into from
Dec 20, 2023

Conversation

@fcanovai fcanovai commented Dec 18, 2023

We explicitly set a high-enough wal_sender_timeout for join-related pg_basebackup executions. A short timeout might not be enough when the instance is slow to send data, for example when I/O is overloaded.

Fixes #3337
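
As a rough illustration of the idea described above, the sketch below appends a libpq `options` keyword to a keyword/value connection string and passes the result to pg_basebackup through `-d`. The helper names (`withWalSenderTimeout`, `baseBackupArgs`), the target directory, and the exact connection string are assumptions made for this example, not the code in this PR. It also assumes the input connection string carries no `options` keyword of its own.

```go
package main

import (
	"fmt"
	"strings"
)

// withWalSenderTimeout (hypothetical helper) appends a libpq "options"
// keyword so that the server applies the given wal_sender_timeout to the
// session opened with this connection string.
// Assumption: connInfo is in keyword/value form and has no "options" yet.
func withWalSenderTimeout(connInfo, timeout string) string {
	return fmt.Sprintf("%s options='-c wal_sender_timeout=%s'",
		strings.TrimSpace(connInfo), timeout)
}

// baseBackupArgs (hypothetical helper) builds a minimal pg_basebackup
// argument list: -D is the target data directory, -d the connection string.
func baseBackupArgs(connInfo, targetDir string) []string {
	return []string{"-D", targetDir, "-d", connInfo}
}

func main() {
	// Host and user are taken from the log line in this thread; the
	// data directory is an illustrative value.
	conn := withWalSenderTimeout(
		"host=cluster-example-rw user=streaming_replica dbname=postgres", "5min")
	args := baseBackupArgs(conn, "/var/lib/postgresql/data/pgdata")
	fmt.Println("pg_basebackup", strings.Join(args, " "))
}
```

Passing `"0"` instead of `"5min"` would disable the timeout for that connection, since 0 turns the wal_sender_timeout mechanism off.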

@github-actions github-actions bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.20 release-1.21 labels Dec 18, 2023
❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

leonardoce commented Dec 19, 2023

Oops... this doesn't work with PostgreSQL 11:

{"level":"info","ts":"2023-12-19T13:28:38Z","msg":"DB not available, will retry","logging_pod":"cluster-example-2-join","err":"failed to connect to `host=cluster-example-rw user=streaming_replica database=postgres`: server error (FATAL: parameter \"wal_sender_timeout\" cannot be changed now (SQLSTATE 55P02))"}
❯ k exec -ti cluster-example-1 -- psql
Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
psql (11.22 (Debian 11.22-1.pgdg110+1))
Type "help" for help.

postgres=# select version();
                                                            version                                                            
-------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.22 (Debian 11.22-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

postgres=# \x
Expanded display is on.
postgres=# select * from pg_settings where name  = 'wal_sender_timeout';
-[ RECORD 1 ]---+---------------------------------------------------
name            | wal_sender_timeout
setting         | 5000
unit            | ms
category        | Replication / Sending Servers
short_desc      | Sets the maximum time to wait for WAL replication.
extra_desc      | 
context         | sighup
vartype         | integer
source          | configuration file
min_val         | 0
max_val         | 2147483647
enumvals        | 
boot_val        | 60000
reset_val       | 5000
sourcefile      | /var/lib/postgresql/data/pgdata/custom.conf
sourceline      | 31
pending_restart | f

postgres=# 
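
The `context` column in the session above explains the failure: on this PostgreSQL 11 server wal_sender_timeout is `sighup`, so it can only be changed in the configuration file followed by a reload. The PostgreSQL 12 release notes state that the parameter can be set per connection from that version on. The sketch below shows how the `context` value from `pg_settings` could be used to decide whether a parameter may be passed in connection startup options. The function name `settableAtConnect` is hypothetical, and this is not necessarily how the PR handles PostgreSQL 11.

```go
package main

import "fmt"

// settableAtConnect (hypothetical helper) reports whether a parameter with
// the given pg_settings "context" can be supplied in the connection startup
// options. The "superuser" and "superuser-backend" contexts additionally
// require suitable privileges, which this sketch does not check.
func settableAtConnect(context string) bool {
	switch context {
	case "user", "superuser", "backend", "superuser-backend":
		return true
	default:
		// "internal", "postmaster" and "sighup" cannot be changed by a
		// client connection; "sighup" needs a config change and a reload.
		return false
	}
}

func main() {
	for _, c := range []string{"sighup", "user"} {
		fmt.Printf("context=%s settable at connect: %t\n", c, settableAtConnect(c))
	}
}
```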

armru commented Dec 19, 2023

Would it make sense to make this parameter configurable?

@leonardoce

> Would it make sense to make this parameter configurable?

Probably yes, but if you can't get an fsync done in 5 minutes, you're in big trouble, bigger than not being able to clone a replica.
@fcanovai what do you think about it?

@leonardoce

We changed it to 0, which disables wal_sender_timeout for the pg_basebackup connection. Let's see whether the E2E tests pass.
If they are, we can merge this.

fcanovai and others added 4 commits December 20, 2023 09:20
We explicitly set a high-enough wal_sender_timeout for join-related
pg_basebackup executions. A short timeout might not be enough when
the instance is slow to send data, for example when I/O is overloaded.

Fixes cloudnative-pg#3337

Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
@leonardoce
/ok-to-merge e2e tests are fine

@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label Dec 20, 2023
@leonardoce leonardoce merged commit e052a1e into cloudnative-pg:main Dec 20, 2023
26 of 27 checks passed
cnpg-bot pushed a commit that referenced this pull request Dec 20, 2023
…3586)

We explicitly disable wal_sender_timeout for join-related pg_basebackup
executions. A short timeout might not be enough if
the instance is slow to send data, for example when I/O is overloaded.

Fixes #3337

Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit e052a1e)
cnpg-bot pushed a commit that referenced this pull request Dec 20, 2023
…3586)

We explicitly disable wal_sender_timeout for join-related pg_basebackup
executions. A short timeout might not be enough if
the instance is slow to send data, for example when I/O is overloaded.

Fixes #3337

Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit e052a1e)
mnencia pushed a commit that referenced this pull request Mar 13, 2024
Removing _wal_sender_timeout_ on pgbasebackup joining cluster. This PR
is similar to #3586 but fixes the problem for nodes joining a cluster.

Signed-off-by: Augusto Ribeiro Silva <ars@unsilo.com>
Signed-off-by: Augusto Ribeiro Silva <augusto.mcc@gmail.com>
Co-authored-by: Augusto Ribeiro Silva <ars@unsilo.com>
cnpg-bot pushed a commit that referenced this pull request Mar 13, 2024
Removing _wal_sender_timeout_ on pgbasebackup joining cluster. This PR
is similar to #3586 but fixes the problem for nodes joining a cluster.

Signed-off-by: Augusto Ribeiro Silva <ars@unsilo.com>
Signed-off-by: Augusto Ribeiro Silva <augusto.mcc@gmail.com>
Co-authored-by: Augusto Ribeiro Silva <ars@unsilo.com>
(cherry picked from commit 58eaf1f)
cnpg-bot pushed a commit that referenced this pull request Mar 13, 2024
Removing _wal_sender_timeout_ on pgbasebackup joining cluster. This PR
is similar to #3586 but fixes the problem for nodes joining a cluster.

Signed-off-by: Augusto Ribeiro Silva <ars@unsilo.com>
Signed-off-by: Augusto Ribeiro Silva <augusto.mcc@gmail.com>
Co-authored-by: Augusto Ribeiro Silva <ars@unsilo.com>
(cherry picked from commit 58eaf1f)
Labels
backport-requested ◀️ This pull request should be backported to all supported releases ok to merge 👌 This PR can be merged release-1.20 release-1.21
Development

Successfully merging this pull request may close these issues.

[Bug]: Issue with New Postgres Replica Failing to Join Cluster After Sync Completion
6 participants