fix: prevent stuck cluster when primary database is down but pod is up #2966

jsilvela · 2023-10-02T17:46:01Z

Fixes a condition that can occur when the primary database
is down but its pod is still up.
In that case, the reconciliation of replication slots or managed roles
could fail with an error and prevent the instance manager from shutting
down the primary, leaving the cluster unable to fail over.
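
For readers unfamiliar with this failure mode, here is a minimal sketch of how an early return in a reconcile loop can block a later shutdown step. It is not the code changed in this PR: the `instance` type and the functions `reconcileStuck`, `reconcileFixed`, and `shutdownIfDatabaseDown` are hypothetical names made up for illustration, and running the shutdown check first is only one possible way to avoid the problem.

```go
package main

import (
	"errors"
	"fmt"
)

// instance is a hypothetical stand-in for what the instance manager observes.
type instance struct {
	databaseUp   bool // whether the database accepts connections
	shutdownDone bool // whether the primary has been shut down
}

var errDatabaseDown = errors.New("database is not reachable")

// The two functions below stand in for reconciliation steps that need a
// live database connection and therefore fail when the database is down.
func reconcileReplicationSlots(i *instance) error {
	if !i.databaseUp {
		return errDatabaseDown
	}
	return nil
}

func reconcileManagedRoles(i *instance) error {
	if !i.databaseUp {
		return errDatabaseDown
	}
	return nil
}

// shutdownIfDatabaseDown stands in for the step that stops the primary
// so that a failover can proceed.
func shutdownIfDatabaseDown(i *instance) {
	if !i.databaseUp {
		i.shutdownDone = true
	}
}

// reconcileStuck returns on the first error, so the shutdown step is
// never reached while the database is down.
func reconcileStuck(i *instance) error {
	if err := reconcileReplicationSlots(i); err != nil {
		return err
	}
	if err := reconcileManagedRoles(i); err != nil {
		return err
	}
	shutdownIfDatabaseDown(i)
	return nil
}

// reconcileFixed runs the shutdown check before the steps that need the
// database (an assumed ordering, chosen for this sketch only).
func reconcileFixed(i *instance) error {
	shutdownIfDatabaseDown(i)
	if i.shutdownDone {
		return nil
	}
	if err := reconcileReplicationSlots(i); err != nil {
		return err
	}
	return reconcileManagedRoles(i)
}

func main() {
	stuck := &instance{databaseUp: false}
	fmt.Println(reconcileStuck(stuck), stuck.shutdownDone)

	fixed := &instance{databaseUp: false}
	fmt.Println(reconcileFixed(fixed), fixed.shutdownDone)
}
```

With the database down, the first variant returns an error on every pass and never shuts the primary down, while the second shuts it down before touching the database.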

github-actions · 2023-10-02T17:46:13Z

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

github-actions · 2023-10-02T17:46:14Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

github-actions · 2023-10-02T17:46:44Z

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

jsilvela · 2023-10-02T17:53:06Z

/test

github-actions · 2023-10-02T17:53:19Z

@jsilvela, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/6383672533

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

leonardoce · 2023-10-03T09:42:20Z

/test

github-actions · 2023-10-03T09:42:34Z

@leonardoce, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/6391306590

#2966)

Fixes a condition that could result in the primary database being down while the corresponding pod is still up. When this happens, the reconciliation of replication slots and managed roles will error out, preventing the instance manager from shutting down the primary and preventing switch-overs and fail-overs from finishing correctly.

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 5e36af8)

#2966)

Fixes a condition that could result in the primary database being down while the corresponding pod is still up. When this happens, the reconciliation of replication slots and managed roles will error out, preventing the instance manager from shutting down the primary and preventing switch-overs and fail-overs from finishing correctly.

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

jsilvela requested review from fcanovai, gbartolini, leonardoce, mnencia, phisco, sxd and armru as code owners October 2, 2023 17:46

github-actions bot added the backport-requested ◀️, release-1.19, and release-1.20 labels Oct 2, 2023

fix: prevent deadlock when database is down

42f9366

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

jsilvela force-pushed the dev/cnp-4198 branch from 65d9741 to 42f9366 Compare October 2, 2023 17:46

jsilvela added the no-issue label Oct 2, 2023

jsilvela changed the title ~~fix: prevent deadlock when database is down~~ fix: prevent deadlock when primary database is down Oct 2, 2023

jsilvela changed the title ~~fix: prevent deadlock when primary database is down~~ fix: prevent stuck cluster when primary database is down but pod is up Oct 3, 2023

chore: review

91cbd7c

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

leonardoce force-pushed the dev/cnp-4198 branch from 593520e to 91cbd7c Compare October 3, 2023 09:12

leonardoce approved these changes Oct 3, 2023

mnencia approved these changes Oct 3, 2023

leonardoce merged commit 5e36af8 into main Oct 3, 2023
21 of 22 checks passed

leonardoce deleted the dev/cnp-4198 branch October 3, 2023 12:15

github-actions bot mentioned this pull request Oct 3, 2023

Backport failure for pull request 2966 #2972

Closed
