Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Make sure pvcs have correct value of the label instanceRole and 'Role' #3930

Merged
merged 10 commits into from
Mar 8, 2024

Conversation

YanniHu1996
Copy link
Contributor

@YanniHu1996 YanniHu1996 commented Feb 26, 2024

closes: #3810

@github-actions github-actions bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.21 release-1.22 labels Feb 26, 2024
Copy link
Contributor

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@YanniHu1996 YanniHu1996 changed the title fix: Reconcile instanceRole in pvc based on curren primary fix: Make sure pvcs have correct value of the label instanceRole Feb 26, 2024
@YanniHu1996 YanniHu1996 force-pushed the dev/3810 branch 4 times, most recently from 11ffc26 to 57730a2 Compare February 26, 2024 14:00
@YanniHu1996 YanniHu1996 changed the title fix: Make sure pvcs have correct value of the label instanceRole fix: Make sure pvcs have correct value of the label instanceRole and 'Role' Feb 27, 2024
@YanniHu1996
Copy link
Contributor Author

@armru
Copy link
Member

armru commented Mar 4, 2024

/test limit=local

Copy link
Contributor

github-actions bot commented Mar 4, 2024

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/8140875763

@mnencia mnencia force-pushed the dev/3810 branch 2 times, most recently from 618e7d9 to df99cb7 Compare March 5, 2024 15:57
@mnencia
Copy link
Member

mnencia commented Mar 7, 2024

/test

Copy link
Contributor

github-actions bot commented Mar 7, 2024

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/8185064309

YanniHu1996 and others added 10 commits March 8, 2024 18:25
Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
…iation in two different processes

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
@mnencia
Copy link
Member

mnencia commented Mar 8, 2024

/ok-to-merge

@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label Mar 8, 2024
@mnencia
Copy link
Member

mnencia commented Mar 8, 2024

This patch solves the PVC label issue. However, there is still a way the cluster could fail when you delete all the instances in sequence. If the timing is just right, you can end up in a situation where only one not-ready pod remains (a former primary that needs to run pg_rewind).

The cluster remains stuck in a phase like Instance Status Extraction Error: HTTP communication issue, and the operator logs the following message over and over:

{"level":"info","ts":"2024-03-08T18:00:54Z","msg":"Cannot update target primary: operation cannot be fulfilled. An immediate retry will be scheduled","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"cluster-example","namespace":"default"},"namespace":"default","name":"cluster-example","reconcileID":"4e0c2cca-d287-45ff-87c5-edd28b494b53","uuid":"cde5207a-dd75-11ee-9130-62304a1b1e3f","error":"unable to evaluate failover logic, unable to fetch the instances status"}

However, the situation is easy to recover: if you delete the not-ready pod, the operator recreates the primary first, and the cluster becomes healthy again.

Given that this patch improves the system's resiliency, I will merge it and open a new issue to address this corner case.

@mnencia mnencia merged commit 2ed27a9 into cloudnative-pg:main Mar 8, 2024
26 of 27 checks passed
@mnencia mnencia deleted the dev/3810 branch March 8, 2024 18:30
cnpg-bot pushed a commit that referenced this pull request Mar 8, 2024
…d 'Role' (#3930)

This patch makes sure that the PVCs labels are always synchronized with the
labels on the Pods. This is important when all the pods are deleted and the operator
needs to decide which Pod recreate first.

Closes: #3810

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 2ed27a9)
mnencia pushed a commit that referenced this pull request Mar 11, 2024
…d 'Role' (#3930)

This patch makes sure that the PVCs labels are always synchronized with the
labels on the Pods. This is important when all the pods are deleted and the operator
needs to decide which Pod recreate first.

Closes: #3810

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 2ed27a9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-requested ◀️ This pull request should be backported to all supported releases ok to merge 👌 This PR can be merged release-1.21 release-1.22
Development

Successfully merging this pull request may close these issues.

[Bug]: Deleting all the pods in sequence may break the cluster
4 participants