Fix stale references to old nodes during health probe #29566

christarazi · 2023-12-02T07:39:33Z

node/manager: Add info logs for added and deleted nodes
health/server: Fix stale references to old nodes during health probe

Related: #28382

Fix bug where deleted nodes would reappear in the cilium_node_connectivity_* metrics

Just as log messages are useful when endpoints are created and deleted, this log is useful for understanding when nodes are added and deleted in production clusters.

Signed-off-by: Chris Tarazi <chris@isovalent.com>

Given the order of operations in prober.OnIdle, it is possible for the health probe to hold a stale reference to a deleted node. When that occurs, node connectivity metrics which were previously deleted [1] would be brought back, causing confusion. If users defined alerts on the node connectivity health check metrics (see example below), those alerts would trigger erroneously because the old nodes would appear in the metric labels as failing health checks.

Example given deletion of "kind-worker2" node:

```
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_node_type="remote_intra_cluster" type="endpoint" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_node_type="remote_intra_cluster" type="node" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type="local_node" type="endpoint" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type="local_node" type="node" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker2" target_node_type="remote_intra_cluster" type="endpoint" 0.000000
```

Fixes: d9e1ff8 ("cilium-health: Remove unnecessary goroutine")

[1]: e9f97cd ("Ensures prometheus metrics associated with a deleted node are no longer reported.")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
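To make the failure mode described in this commit message concrete, here is a minimal single-goroutine sketch. It is not the Cilium code: the `prober` type, its `snapshot`, `removeNode` and `publish` methods, and the `checkLive` guard are all hypothetical names, and the guard is only one plausible way to avoid the problem, not necessarily what this PR does. It only shows how a probe round that captured its targets before a node deletion can write the deleted node's series back.

```go
package main

import "fmt"

// prober is a hypothetical, simplified stand-in for the health prober.
type prober struct {
	nodes   map[string]struct{} // nodes currently known to the node manager
	metrics map[string]float64  // stand-in for cilium_node_connectivity_status, keyed by target node
}

func newProber(names ...string) *prober {
	p := &prober{nodes: map[string]struct{}{}, metrics: map[string]float64{}}
	for _, n := range names {
		p.nodes[n] = struct{}{}
	}
	return p
}

// snapshot returns the probe targets for one round.
func (p *prober) snapshot() []string {
	out := make([]string, 0, len(p.nodes))
	for n := range p.nodes {
		out = append(out, n)
	}
	return out
}

// removeNode drops the node and its metric series.
func (p *prober) removeNode(name string) {
	delete(p.nodes, name)
	delete(p.metrics, name)
}

// publish writes probe results into the metrics. With checkLive set, results
// for nodes that were deleted since the snapshot are skipped (assumed guard).
func (p *prober) publish(results map[string]float64, checkLive bool) {
	for name, v := range results {
		if _, live := p.nodes[name]; checkLive && !live {
			continue
		}
		p.metrics[name] = v
	}
}

// runRound simulates one probe round in which "kind-worker2" is deleted
// after the snapshot was taken but before the results are published.
func runRound(checkLive bool) map[string]float64 {
	p := newProber("kind-control-plane", "kind-worker2")
	targets := p.snapshot()
	p.removeNode("kind-worker2")

	results := map[string]float64{}
	for _, n := range targets {
		if n == "kind-worker2" {
			results[n] = 0 // the probe fails because the node is gone
		} else {
			results[n] = 1
		}
	}
	p.publish(results, checkLive)
	return p.metrics
}

func main() {
	_, stale := runRound(false)["kind-worker2"]
	_, guarded := runRound(true)["kind-worker2"]
	fmt.Println("deleted node reappears without guard:", stale)
	fmt.Println("deleted node reappears with guard:", guarded)
}
```

Without the guard, the deleted node comes back with a value of 0, which is the same shape as the `kind-worker2` line in the example output above.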

christarazi · 2023-12-02T07:41:21Z

cc @derailed @tommyp1ckles JFYI

christarazi · 2023-12-02T07:41:25Z

/test

marseel

Thanks, looks good to me. But I guess it only removes a single probe which could potentially happen after the node was removed and not any prolonged reappearance?

As a side note, I do not really like how the GetClusterNodes API call underneath getNodes is implemented, as it is not stateless. If something errors out on the client side, a retry will return a different set of added/removed nodes, so the client and server can become out of sync.
In case of error, we could probably zero out clientID and clean up the prober state to initiate a full relist from the API.
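A rough sketch of the idea in this comment, under stated assumptions: `fakeServer`, `delta`, `client` and `sync` are hypothetical names and do not mirror the real GetClusterNodes handler or the health server. The fake server hands out deltas per client ID; when the client hits an error after the server has already advanced its view, the client zeroes its ID and clears local state so that the next call returns a full list.

```go
package main

import (
	"errors"
	"fmt"
)

type set map[string]struct{}

func (s set) clone() set {
	out := set{}
	for k := range s {
		out[k] = struct{}{}
	}
	return out
}

// delta is the reply of the hypothetical stateful list call.
type delta struct {
	clientID       int64
	added, removed []string
}

// fakeServer remembers, per client ID, which nodes it already reported.
type fakeServer struct {
	nodes  set
	seen   map[int64]set
	nextID int64
}

// getClusterNodes returns the changes since the caller's previous call.
// Client ID 0 (or an unknown ID) gets a fresh ID and the full node list.
func (s *fakeServer) getClusterNodes(clientID int64) delta {
	known, ok := s.seen[clientID]
	if clientID == 0 || !ok {
		s.nextID++
		clientID, known = s.nextID, set{}
	}
	d := delta{clientID: clientID}
	for n := range s.nodes {
		if _, ok := known[n]; !ok {
			d.added = append(d.added, n)
		}
	}
	for n := range known {
		if _, ok := s.nodes[n]; !ok {
			d.removed = append(d.removed, n)
		}
	}
	s.seen[clientID] = s.nodes.clone()
	return d
}

var errApply = errors.New("client failed to apply delta")

type client struct {
	clientID int64
	nodes    set
}

// sync fetches and applies one delta. fail simulates a client-side error
// that happens after the server has already advanced its per-client state.
func (c *client) sync(s *fakeServer, fail bool) error {
	d := s.getClusterNodes(c.clientID)
	if fail {
		// Reset so the next call triggers a full relist.
		c.clientID, c.nodes = 0, set{}
		return errApply
	}
	c.clientID = d.clientID
	for _, n := range d.added {
		c.nodes[n] = struct{}{}
	}
	for _, n := range d.removed {
		delete(c.nodes, n)
	}
	return nil
}

// demo loses one delta on the client side and then recovers via a relist.
func demo() set {
	s := &fakeServer{nodes: set{"a": {}, "b": {}}, seen: map[int64]set{}}
	c := &client{nodes: set{}}
	_ = c.sync(s, false) // initial full list: a, b
	delete(s.nodes, "b")
	s.nodes["c"] = struct{}{}
	_ = c.sync(s, true)  // delta is lost on the client side
	_ = c.sync(s, false) // clientID is 0 again, so this is a full relist
	return c.nodes
}

func main() {
	fmt.Println(demo())
}
```

If the client kept its old ID and state after the failed call, the server would consider the lost delta delivered and the client would keep "b" and never learn about "c".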

pkg/health/server/server.go

asauber

I agree with @marseel that this deserves a refactor to remove additional side effect conditions on error. I think this is worth merging as-is to improve the current behavior.

Fixes cilium#29566

There were three issues with health-reporting/probing:
- Whenever a node was updated, it was received in nodesAdded and was overriding the ICMP result, reporting the node as unreachable.
- If the ICMP probe stopped working and there were no node updates, it was reporting the node as healthy even though the probe was failing.
- The HTTP prober was not triggered at the start, only after probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

[ upstream commit 100818f ]

Fixes #29566

There were three issues with health-reporting/probing:
- Whenever a node was updated, it was received in nodesAdded and was overriding the ICMP result, reporting the node as unreachable.
- If the ICMP probe stopped working and there were no node updates, it was reporting the node as healthy even though the probe was failing.
- The HTTP prober was not triggered at the start, only after probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
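The first of the three issues listed in this commit message (a node update arriving in nodesAdded and overriding the ICMP result) can be illustrated with a small sketch. The `status` and `results` types and both `applyAdded*` functions are hypothetical and much simpler than the real prober. The point is only the difference between resetting an entry on every "added" event and keeping an existing result when the node is already known.

```go
package main

import "fmt"

// status is a hypothetical per-node probe result.
type status struct {
	probed bool // at least one probe has completed
	icmpOK bool // the last ICMP probe succeeded
}

type results map[string]status

// applyAddedNaive resets the entry for every node in added, including nodes
// that were merely updated. A healthy node then looks unreachable until the
// next probe completes.
func applyAddedNaive(r results, added []string) {
	for _, n := range added {
		r[n] = status{}
	}
}

// applyAddedKeep only creates entries for nodes that are not tracked yet and
// leaves the last known result of updated nodes untouched.
func applyAddedKeep(r results, added []string) {
	for _, n := range added {
		if _, known := r[n]; known {
			continue
		}
		r[n] = status{}
	}
}

func main() {
	naive := results{"node-1": {probed: true, icmpOK: true}}
	applyAddedNaive(naive, []string{"node-1"}) // node-1 was only updated
	fmt.Println("naive keeps ICMP result:", naive["node-1"].icmpOK)

	keep := results{"node-1": {probed: true, icmpOK: true}}
	applyAddedKeep(keep, []string{"node-1", "node-2"})
	fmt.Println("keep preserves ICMP result:", keep["node-1"].icmpOK)
}
```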

christarazi requested a review from a team as a code owner December 2, 2023 07:39

christarazi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/health Relates to the cilium-health component labels Dec 2, 2023

christarazi requested a review from marseel December 2, 2023 07:39

christarazi added 2 commits December 1, 2023 23:39

node/manager: Add info logs for added and deleted nodes

ef1200f

christarazi force-pushed the pr/christarazi/fix-phantom-node-metric branch from 78e77ec to 1adb2d2 Compare December 2, 2023 07:39

christarazi added needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 2, 2023

maintainer-s-little-helper bot added this to Needs backport from main in 1.14.5 Dec 2, 2023

maintainer-s-little-helper bot added this to Needs backport from main in 1.13.10 Dec 2, 2023

maintainer-s-little-helper bot added this to Needs backport from main in 1.12.17 Dec 2, 2023

christarazi changed the title ~~pr/christarazi/fix phantom node metric~~ Fix stale references to old nodes during health probe Dec 2, 2023

marseel approved these changes Dec 4, 2023

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 4, 2023

marseel reviewed Dec 4, 2023

pkg/health/server/server.go

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 4, 2023

ti-mo enabled auto-merge December 4, 2023 15:40

asauber approved these changes Dec 4, 2023

ti-mo added this pull request to the merge queue Dec 4, 2023

Merged via the queue into cilium:main with commit 7c7b723 Dec 4, 2023
61 checks passed

nbusseneau removed the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label Dec 5, 2023

maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.10 Dec 5, 2023

nbusseneau mentioned this pull request Dec 5, 2023

v1.14 Backports 2023-12-05 #29641

Merged

10 tasks

nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 5, 2023

github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Dec 6, 2023

maintainer-s-little-helper bot moved this from Needs backport from main to Backport done to v1.14 in 1.14.5 Dec 6, 2023

nbusseneau added backport-pending/1.12 and removed needs-backport/1.12 labels Dec 6, 2023

maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.17 Dec 6, 2023

github-actions bot added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Dec 6, 2023

maintainer-s-little-helper bot removed this from Backport pending to v1.12 in 1.12.17 Dec 6, 2023

github-actions bot removed the backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. label Dec 6, 2023

maintainer-s-little-helper bot removed this from Backport pending to v1.13 in 1.13.10 Dec 6, 2023

maintainer-s-little-helper bot added this to Backport done to v1.12 in 1.12.17 Dec 6, 2023

github-actions bot added the backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. label Dec 6, 2023

This was referenced Dec 11, 2023

Prepare for release v1.14.5 #29787

Merged

Prepare for release v1.13.10 #29789

Merged

Prepare for release v1.12.17 #29791

Merged

joestringer mentioned this pull request Dec 14, 2023

Prepare for release v1.15.0-rc.0 #29883

Merged

marseel mentioned this pull request Jan 29, 2024

Fix various health-server probing bugs. #30504

Merged

marseel mentioned this pull request Feb 22, 2024

health-server: Do not cleanup health checking result on node updates. #30917

Merged
