Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stale references to old nodes during health probe #29566

Merged

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented Dec 2, 2023

  • node/manager: Add info logs for added and deleted nodes
  • health/server: Fix stale references to old nodes during health probe

Related: #28382

Fix bug where deleted nodes would reappear in the cilium_node_connectivity_* metrics

@christarazi christarazi requested a review from a team as a code owner December 2, 2023 07:39
@christarazi christarazi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/health Relates to the cilium-health component labels Dec 2, 2023
Similar to how useful log msgs are when endpoints created and deleted,
this log is useful for understanding when nodes are added and deleted in
production clusters.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Given the order of operations in prober.OnIdle, it is possible for the
health probe to have a stale references to a deleted nodes. When that
occurs, node connectivity metrics which were previously deleted [1]
would be brought back, causing confusion. If users defined alerts for
node connectivity health checks metrics (see example below), then this
would erroneously trigger because the old nodes would appear in the
metric labels as a failing health check.

Example given deletion of "kind-worker2" node:

```
cilium_node_connectivity_status                          source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_nod
e_type="remote_intra_cluster" type="endpoint"                                                                        1.000000
cilium_node_connectivity_status                          source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_nod
e_type="remote_intra_cluster" type="node"                                                                            1.000000
cilium_node_connectivity_status                          source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type=
"local_node" type="endpoint"                                                                                         1.000000
cilium_node_connectivity_status                          source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type=
"local_node" type="node"                                                                                             1.000000

cilium_node_connectivity_status                          source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker2" target_node_type
="remote_intra_cluster" type="endpoint"                                                                              0.000000
```

Fixes: d9e1ff8 ("cilium-health: Remove unnecessary goroutine")

[1]: e9f97cd ("Ensures prometheus metrics associated with a deleted
node are no longer reported.")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-phantom-node-metric branch from 78e77ec to 1adb2d2 Compare December 2, 2023 07:39
@christarazi christarazi added needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 2, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.5 Dec 2, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.10 Dec 2, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.17 Dec 2, 2023
@christarazi
Copy link
Member Author

cc @derailed @tommyp1ckles JFYI

@christarazi
Copy link
Member Author

/test

@christarazi christarazi changed the title pr/christarazi/fix phantom node metric Fix stale references to old nodes during health probe Dec 2, 2023
Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me. But I guess it only removes a single probe which could potentially happen after the node was removed and not any prolonged reappearance?

As a side note, I do not really like how GetClusterNodes API call that is underneath getNodes is implemented as it's not stateless. If something errors out on the client side, retry will return a different set of nodes added/removed later on so the client and server can become out of sync.
In case of error, we could probably zero out clientID here
and clean up prober state in code
to kind of initiate a full relist from API.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 4, 2023
@marseel marseel added dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. labels Dec 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 4, 2023
@ti-mo ti-mo enabled auto-merge December 4, 2023 15:40
Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @marseel that this this deserves a refactor to remove additional side effect conditions on error. I think this is worth merging as-is to improve the current behavior.

@ti-mo ti-mo added this pull request to the merge queue Dec 4, 2023
Merged via the queue into cilium:main with commit 7c7b723 Dec 4, 2023
61 checks passed
@nbusseneau nbusseneau removed the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label Dec 5, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.10 Dec 5, 2023
@nbusseneau nbusseneau mentioned this pull request Dec 5, 2023
10 tasks
@nbusseneau nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 5, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Dec 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport done to v1.14 in 1.14.5 Dec 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.17 Dec 6, 2023
@github-actions github-actions bot added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Dec 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.12 in 1.12.17 Dec 6, 2023
@github-actions github-actions bot removed the backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. label Dec 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.13 in 1.13.10 Dec 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.12 in 1.12.17 Dec 6, 2023
@github-actions github-actions bot added the backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. label Dec 6, 2023
marseel added a commit to marseel/cilium that referenced this pull request Jan 30, 2024
Fixes cilium#29566

There were three issues with health-reporting/probing:
- Whenever node was updated, it was received in nodesAdded
and was overriding icmp result reporting node as unreachable
- If Icmp probe stopped working and there were no node updates,
it was reporting node as healthy even though probe was failing.
- Http prober was not triggered at the start and only after
  probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Jan 30, 2024
Fixes #29566

There were three issues with health-reporting/probing:
- Whenever node was updated, it was received in nodesAdded
and was overriding icmp result reporting node as unreachable
- If Icmp probe stopped working and there were no node updates,
it was reporting node as healthy even though probe was failing.
- Http prober was not triggered at the start and only after
  probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
joamaki pushed a commit that referenced this pull request Jan 30, 2024
[ upstream commit 100818f ]

Fixes #29566

There were three issues with health-reporting/probing:
- Whenever node was updated, it was received in nodesAdded
and was overriding icmp result reporting node as unreachable
- If Icmp probe stopped working and there were no node updates,
it was reporting node as healthy even though probe was failing.
- Http prober was not triggered at the start and only after
  probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
aanm pushed a commit that referenced this pull request Jan 31, 2024
[ upstream commit 100818f ]

Fixes #29566

There were three issues with health-reporting/probing:
- Whenever node was updated, it was received in nodesAdded
and was overriding icmp result reporting node as unreachable
- If Icmp probe stopped working and there were no node updates,
it was reporting node as healthy even though probe was failing.
- Http prober was not triggered at the start and only after
  probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
joamaki pushed a commit that referenced this pull request Jan 31, 2024
[ upstream commit 100818f ]

Fixes #29566

There were three issues with health-reporting/probing:
- Whenever node was updated, it was received in nodesAdded
and was overriding icmp result reporting node as unreachable
- If Icmp probe stopped working and there were no node updates,
it was reporting node as healthy even though probe was failing.
- Http prober was not triggered at the start and only after
  probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
joamaki pushed a commit that referenced this pull request Jan 31, 2024
[ upstream commit 100818f ]

Fixes #29566

There were three issues with health-reporting/probing:
- Whenever node was updated, it was received in nodesAdded
and was overriding icmp result reporting node as unreachable
- If Icmp probe stopped working and there were no node updates,
it was reporting node as healthy even though probe was failing.
- Http prober was not triggered at the start and only after
  probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/health Relates to the cilium-health component area/metrics Impacts statistics / metrics gathering, eg via Prometheus. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.12.17
Backport done to v1.12
1.14.5
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

5 participants