fix: set wal_sender_timeout to 5min when joining through pg_basebackup #3586

fcanovai · 2023-12-18T08:28:33Z

We explicitly set a sufficiently high wal_sender_timeout for join-related pg_basebackup executions. A short timeout may not be enough when the instance is slow to send data, for example when the I/O is overloaded.

Fixes #3337
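
For readers following along, here is a minimal sketch of one way a pg_basebackup invocation could request a longer wal_sender_timeout for its own replication connection. It is not the code in this PR: the function name and its arguments are hypothetical, and it relies on libpq reading the PGOPTIONS environment variable. The server only accepts the setting if the parameter can be changed per connection.

```go
package main

import (
	"os"
	"os/exec"
)

// baseBackupCommand builds (but does not run) a pg_basebackup command whose
// connection asks the server for a custom wal_sender_timeout.
// Hypothetical helper for illustration only, not the operator's actual API.
func baseBackupCommand(connInfo, targetDir, timeout string) *exec.Cmd {
	// -D is the target data directory, -d accepts a libpq connection string.
	cmd := exec.Command("pg_basebackup", "-D", targetDir, "-d", connInfo)

	// libpq uses PGOPTIONS as the default for the "options" connection
	// parameter, so the setting is sent to the server at connection start.
	// An explicit options=... in connInfo would take precedence over it.
	cmd.Env = append(os.Environ(), "PGOPTIONS=-c wal_sender_timeout="+timeout)
	return cmd
}
```

Calling `baseBackupCommand("host=cluster-example-rw user=streaming_replica", "/tmp/pgdata", "5min")` returns a command that can be started with `Run()`; the timeout applies to that session only and leaves the server-wide configuration untouched.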

github-actions · 2023-12-18T08:28:46Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this PR, remove the label backport-requested ◀️ or add the label 'do not backport'.
To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

leonardoce · 2023-12-19T09:45:44Z

E2e tests: https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/7260051781

leonardoce · 2023-12-19T13:29:11Z

Oops... this doesn't work with PostgreSQL 11:

{"level":"info","ts":"2023-12-19T13:28:38Z","msg":"DB not available, will retry","logging_pod":"cluster-example-2-join","err":"failed to connect to `host=cluster-example-rw user=streaming_replica database=postgres`: server error (FATAL: parameter \"wal_sender_timeout\" cannot be changed now (SQLSTATE 55P02))"}

❯ k exec -ti cluster-example-1 -- psql
Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
psql (11.22 (Debian 11.22-1.pgdg110+1))
Type "help" for help.

postgres=# select version();
                                                            version                                                            
-------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.22 (Debian 11.22-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

postgres=# \x
Expanded display is on.
postgres=# select * from pg_settings where name  = 'wal_sender_timeout';
-[ RECORD 1 ]---+---------------------------------------------------
name            | wal_sender_timeout
setting         | 5000
unit            | ms
category        | Replication / Sending Servers
short_desc      | Sets the maximum time to wait for WAL replication.
extra_desc      | 
context         | sighup
vartype         | integer
source          | configuration file
min_val         | 0
max_val         | 2147483647
enumvals        | 
boot_val        | 60000
reset_val       | 5000
sourcefile      | /var/lib/postgresql/data/pgdata/custom.conf
sourceline      | 31
pending_restart | f

postgres=#
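
The `context` column above is the relevant detail: on PostgreSQL 11 it reads `sighup`, so the parameter can only change through the configuration file plus a reload, and a value passed at connection start is rejected with SQLSTATE 55P02. A small sketch of that rule, assuming the standard pg_settings context values (the function name is hypothetical, and privilege checks for superuser-only settings are ignored):

```go
package main

// settableAtConnectionStart reports whether a parameter whose
// pg_settings.context has the given value can be supplied by a client
// while it connects (for example through the libpq "options" keyword).
// Hypothetical helper for illustration; it does not check privileges.
func settableAtConnectionStart(context string) bool {
	switch context {
	case "user", "superuser", "backend", "superuser-backend":
		// These contexts accept a value sent in the startup packet.
		return true
	default:
		// "sighup", "postmaster" and "internal" cannot be changed by a session.
		return false
	}
}
```

With the value shown in the output above, `settableAtConnectionStart("sighup")` returns false, which matches the FATAL error in the join log.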

leonardoce · 2023-12-19T13:51:05Z

E2e: https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/7262754322

armru · 2023-12-19T15:53:54Z

Would it make sense to make this parameter configurable?

leonardoce · 2023-12-19T15:56:09Z

> Would it make sense to make this parameter configurable?

Probably yes, but if you can't complete an fsync in 5 minutes, you have bigger problems than being unable to clone a replica.
@fcanovai what do you think about it?

leonardoce · 2023-12-19T16:42:48Z

E2e: https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/7264867924

leonardoce · 2023-12-19T16:43:57Z

We changed it to 0, meaning that we disable wal_sender_timeout for pg_basebackup. Let's see if the E2e tests are fine.
If they are, we can merge this.
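
A rough sketch of the idea, not necessarily how the merged code does it: append the option to the connection string only when the server can accept it per connection. The function name is hypothetical, and the version check assumes that PostgreSQL 12 is the first release where wal_sender_timeout can be set per connection, which is consistent with the PostgreSQL 11 failure reported earlier in this thread.

```go
package main

import "fmt"

// disabledWalSenderTimeout is the value that turns the timeout off:
// PostgreSQL treats wal_sender_timeout = 0 as "no timeout".
const disabledWalSenderTimeout = 0

// joinConnInfo returns the connection string to use for a join-related
// pg_basebackup. Hypothetical helper for illustration only.
func joinConnInfo(base string, pgMajor int) string {
	// Assumption: before PostgreSQL 12 the parameter cannot be changed by a
	// session, so the connection string is left untouched.
	if pgMajor < 12 {
		return base
	}
	return fmt.Sprintf("%s options='-c wal_sender_timeout=%d'", base, disabledWalSenderTimeout)
}
```

On a PostgreSQL 11 cluster this returns the original string and the join falls back to the server-wide timeout; on newer versions the clone connection is the only session that runs without a timeout.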

We explicitly set a high-enough wal_sender_timeout for join-related pg_basebackup executions. A short timeout could not be enough in case the instance is slow to send data, like when the I/O is overloaded.

Fixes cloudnative-pg#3337

Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

leonardoce · 2023-12-20T08:21:26Z

/ok-to-merge e2e tests are fine

…3586)

We explicitly disable wal_sender_timeout for join-related pg_basebackup executions. A short timeout could not be enough if the instance is slow to send data, like when the I/O is overloaded.

Fixes #3337

Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit e052a1e)

Removing _wal_sender_timeout_ on pgbasebackup joining cluster. This PR is similar to #3586 but fixes the problem for nodes joining a cluster.

Signed-off-by: Augusto Ribeiro Silva <ars@unsilo.com>
Signed-off-by: Augusto Ribeiro Silva <augusto.mcc@gmail.com>
Co-authored-by: Augusto Ribeiro Silva <ars@unsilo.com>

Removing _wal_sender_timeout_ on pgbasebackup joining cluster. This PR is similar to #3586 but fixes the problem for nodes joining a cluster.

Signed-off-by: Augusto Ribeiro Silva <ars@unsilo.com>
Signed-off-by: Augusto Ribeiro Silva <augusto.mcc@gmail.com>
Co-authored-by: Augusto Ribeiro Silva <ars@unsilo.com>
(cherry picked from commit 58eaf1f)

fcanovai requested review from gbartolini, leonardoce, mnencia, phisco, sxd and armru as code owners December 18, 2023 08:28

github-actions bot added the backport-requested ◀️, release-1.20 and release-1.21 labels Dec 18, 2023

fcanovai force-pushed the dev/walsendertimeout branch from e900587 to 67f7502 Compare December 18, 2023 08:29

leonardoce approved these changes Dec 19, 2023

mnencia approved these changes Dec 19, 2023

leonardoce force-pushed the dev/walsendertimeout branch from 67f7502 to 29f0516 Compare December 19, 2023 09:44

leonardoce self-assigned this Dec 19, 2023

leonardoce force-pushed the dev/walsendertimeout branch from 29f0516 to 34211c2 Compare December 19, 2023 13:16

leonardoce force-pushed the dev/walsendertimeout branch from 34211c2 to 93274b0 Compare December 19, 2023 13:49

sxd approved these changes Dec 19, 2023

leonardoce force-pushed the dev/walsendertimeout branch from 781cca1 to f9a9823 Compare December 19, 2023 16:41

fcanovai and others added 4 commits December 20, 2023 09:20

chore: review

319675f

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

chore: improve error message

e8d9795

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

chore: set wal_sender_timeout to 0, meaning infinity

9537a28

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

leonardoce force-pushed the dev/walsendertimeout branch from f9a9823 to 9537a28 Compare December 20, 2023 08:20

cnpg-bot added the ok to merge 👌 label Dec 20, 2023

leonardoce merged commit e052a1e into cloudnative-pg:main Dec 20, 2023
26 of 27 checks passed

augustoribeiro mentioned this pull request Feb 26, 2024

fix: pgbasebackup cluster join wal_sender_timeout configuration #3947

Merged
