fix: Make sure pvcs have correct value of the label `instanceRole` and 'Role' #3930

YanniHu1996 · 2024-02-26T07:21:51Z

closes: #3810

github-actions · 2024-02-26T07:22:03Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

YanniHu1996 · 2024-02-27T06:37:59Z

E2E run from the fork:
https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/8059176715

armru · 2024-03-04T13:19:08Z

/test limit=local

github-actions · 2024-03-04T13:19:22Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/8140875763

mnencia · 2024-03-07T08:24:49Z

/test

github-actions · 2024-03-07T08:25:03Z

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/8185064309

mnencia · 2024-03-08T18:12:15Z

/ok-to-merge

mnencia · 2024-03-08T18:20:02Z

This patch solves the PVC label issue. However, there is still a way the cluster could fail when you delete all the instances in sequence. If the timing is just right, you can end up in a situation where only one not-ready pod remains (a former primary that needs to run pg_rewind).

The cluster remains stuck in a phase like "Instance Status Extraction Error: HTTP communication issue", and the operator logs the following message over and over:

{"level":"info","ts":"2024-03-08T18:00:54Z","msg":"Cannot update target primary: operation cannot be fulfilled. An immediate retry will be scheduled","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"cluster-example","namespace":"default"},"namespace":"default","name":"cluster-example","reconcileID":"4e0c2cca-d287-45ff-87c5-edd28b494b53","uuid":"cde5207a-dd75-11ee-9130-62304a1b1e3f","error":"unable to evaluate failover logic, unable to fetch the instances status"}

However, the situation is easy to recover from: if you delete the not-ready pod, the operator recreates the primary first, and the cluster becomes healthy again.

Given that this patch improves the system's resiliency, I will merge it and open a new issue to address this corner case.
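To illustrate why the role label on the PVCs matters for the recovery described above, here is a minimal sketch of how a recreation order could be derived from PVC labels when no pod is left. It is not the operator's code: the `pvc` type, the `recreationOrder` function, and the label key `cnpg.io/instanceRole` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Assumed label key; the PR title only calls the label `instanceRole`.
const instanceRoleLabel = "cnpg.io/instanceRole"

// pvc is a hypothetical stand-in for a PersistentVolumeClaim that
// belongs to one instance of the cluster.
type pvc struct {
	Instance string
	Labels   map[string]string
}

// recreationOrder returns the instance names with the one labelled
// "primary" first, followed by the remaining instances sorted by name.
func recreationOrder(pvcs []pvc) []string {
	names := make([]string, 0, len(pvcs))
	isPrimary := make(map[string]bool, len(pvcs))
	for _, p := range pvcs {
		names = append(names, p.Instance)
		isPrimary[p.Instance] = p.Labels[instanceRoleLabel] == "primary"
	}
	sort.SliceStable(names, func(i, j int) bool {
		pi, pj := isPrimary[names[i]], isPrimary[names[j]]
		if pi != pj {
			return pi // the primary sorts before any replica
		}
		return names[i] < names[j]
	})
	return names
}

func main() {
	order := recreationOrder([]pvc{
		{Instance: "cluster-example-1", Labels: map[string]string{instanceRoleLabel: "replica"}},
		{Instance: "cluster-example-2", Labels: map[string]string{instanceRoleLabel: "primary"}},
		{Instance: "cluster-example-3", Labels: map[string]string{instanceRoleLabel: "replica"}},
	})
	fmt.Println(order)
}
```

If the label on a PVC is stale (for example, still `primary` on a former primary), an ordering like this would pick the wrong instance, which is the kind of problem the patch aims to prevent.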

…d 'Role' (#3930)

This patch makes sure that the PVC labels are always synchronized with the labels on the Pods. This is important when all the pods are deleted and the operator needs to decide which Pod to recreate first.

Closes: #3810

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 2ed27a9)
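The synchronization described in the commit message can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: the `object` type, the `syncRoleLabels` function, and the label keys `cnpg.io/instanceRole` and `role` are assumptions chosen for the example.

```go
package main

import "fmt"

// Assumed label keys; the PR only refers to them as `instanceRole` and 'Role'.
const (
	instanceRoleLabel = "cnpg.io/instanceRole"
	legacyRoleLabel   = "role"
)

// object is a hypothetical stand-in for a Pod or a PVC: a name plus labels.
type object struct {
	Name   string
	Labels map[string]string
}

// syncRoleLabels copies the role labels from the pod onto the PVC and
// reports whether anything changed, so the caller knows whether the PVC
// needs to be updated.
func syncRoleLabels(pod object, claim *object) bool {
	changed := false
	if claim.Labels == nil {
		claim.Labels = map[string]string{}
	}
	for _, key := range []string{instanceRoleLabel, legacyRoleLabel} {
		want, ok := pod.Labels[key]
		if !ok {
			continue // the pod carries no role yet; leave the PVC untouched
		}
		if claim.Labels[key] != want {
			claim.Labels[key] = want
			changed = true
		}
	}
	return changed
}

func main() {
	pod := object{
		Name:   "cluster-example-2",
		Labels: map[string]string{instanceRoleLabel: "primary", legacyRoleLabel: "primary"},
	}
	claim := object{
		Name:   "cluster-example-2",
		Labels: map[string]string{instanceRoleLabel: "replica", legacyRoleLabel: "replica"},
	}
	// The first call fixes the stale labels; the second finds nothing to do.
	fmt.Println(syncRoleLabels(pod, &claim), claim.Labels[instanceRoleLabel])
	fmt.Println(syncRoleLabels(pod, &claim))
}
```

Returning a "changed" flag keeps the sketch idempotent: calling it repeatedly in a reconcile loop would only trigger an update when the labels have actually drifted.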

YanniHu1996 requested review from fcanovai, gbartolini, leonardoce, mnencia, phisco, sxd and armru as code owners February 26, 2024 07:21

github-actions bot added the backport-requested ◀️, release-1.21, and release-1.22 labels Feb 26, 2024

YanniHu1996 changed the title ~~fix: Reconcile instanceRole in pvc based on curren primary~~ fix: Make sure pvcs have correct value of the label instanceRole Feb 26, 2024

YanniHu1996 force-pushed the dev/3810 branch 4 times, most recently from 11ffc26 to 57730a2 Compare February 26, 2024 14:00

YanniHu1996 changed the title ~~fix: Make sure pvcs have correct value of the label instanceRole~~ fix: Make sure pvcs have correct value of the label instanceRole and 'Role' Feb 27, 2024

YanniHu1996 force-pushed the dev/3810 branch from 0a04516 to bac6107 Compare February 27, 2024 03:17

YanniHu1996 force-pushed the dev/3810 branch from bac6107 to 7eeb847 Compare February 27, 2024 07:53

armru force-pushed the dev/3810 branch 3 times, most recently from 8212a7b to 61efe34 Compare March 4, 2024 13:19

armru approved these changes Mar 5, 2024

mnencia force-pushed the dev/3810 branch 2 times, most recently from 618e7d9 to df99cb7 Compare March 5, 2024 15:57

YanniHu1996 and others added 10 commits March 8, 2024 18:25

fix: Reconcile instanceRole in pvc based on curren primary

7a72d35

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>

test: fix unit test

2a059f3

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>

fix fix golint

1975ce3

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>

fix: removed reconciling serial of pvc

79a7acd

Signed-off-by: YanniHu1996 <yantian.hu@enterprisedb.com>

chore: add back serial reconciler, fix tests

c201ac4

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

refactor: split serial reconciliation and the other metadata reconcil…

59178db

…iation in two different processes Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: better func name

7eb7a6a

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: avoid declaring var

fef108b

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: style

5acbb73

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

chore: fix unit tests I broke with the last commit

be96209

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

mnencia force-pushed the dev/3810 branch from 53d00ab to be96209 Compare March 8, 2024 17:32

cnpg-bot added the "ok to merge 👌" label Mar 8, 2024

mnencia approved these changes Mar 8, 2024

mnencia mentioned this pull request Mar 8, 2024

[Bug]: Recreate the primary pod if no pod is ready #4049

Closed

mnencia merged commit 2ed27a9 into cloudnative-pg:main Mar 8, 2024
26 of 27 checks passed

mnencia deleted the dev/3810 branch March 8, 2024 18:30

github-actions bot mentioned this pull request Mar 8, 2024

Backport failure for pull request 3930 #4050

Closed

mnencia mentioned this pull request Apr 15, 2024

[Bug]: One pod stuck, cannot reconcile #3974

Closed

4 tasks

jsilvela mentioned this pull request May 16, 2024

Fix faulty logic on expected instances/PVCs based on cnpg.io/nodeSerial #3466

Closed
