health/server: Fix stale references to old nodes during health probe
[ upstream commit 7c7b723 ]

Given the order of operations in prober.OnIdle, it is possible for the
health probe to hold stale references to deleted nodes. When that
occurs, node connectivity metrics that were previously deleted [1]
are brought back, causing confusion. If users have defined alerts on
node connectivity health check metrics (see example below), those
alerts fire erroneously because the old nodes appear in the
metric labels as failing health checks.

Example given deletion of "kind-worker2" node:

```
cilium_node_connectivity_status  source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_node_type="remote_intra_cluster" type="endpoint"  1.000000
cilium_node_connectivity_status  source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_node_type="remote_intra_cluster" type="node"  1.000000
cilium_node_connectivity_status  source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type="local_node" type="endpoint"  1.000000
cilium_node_connectivity_status  source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type="local_node" type="node"  1.000000

cilium_node_connectivity_status  source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker2" target_node_type="remote_intra_cluster" type="endpoint"  0.000000
```

Fixes: d9e1ff8 ("cilium-health: Remove unnecessary goroutine")

[1]: e9f97cd ("Ensures prometheus metrics associated with a deleted
node are no longer reported.")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
christarazi authored and nbusseneau committed Dec 6, 2023
1 parent ac0e18c commit 46d5b5a
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions pkg/health/server/server.go
@@ -336,14 +336,14 @@ func (s *Server) runActiveServices() error {
 	prober := newProber(s, nodesAdded)
 	prober.MaxRTT = s.ProbeInterval
 	prober.OnIdle = func() {
-		// Fetch results and update set of nodes to probe every
-		// ProbeInterval
-		s.updateCluster(prober.getResults())
+		// Update set of nodes to probe every ProbeInterval and then fetch
+		// results
 		if nodesAdded, nodesRemoved, err := s.getNodes(); err != nil {
 			log.WithError(err).Error("unable to get cluster nodes")
 		} else {
 			prober.setNodes(nodesAdded, nodesRemoved)
 		}
+		s.updateCluster(prober.getResults())
 	}
 	prober.RunLoop()
 	defer prober.Stop()
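
The effect of the reordering can be illustrated with a minimal, self-contained sketch. It is not the Cilium implementation: the `prober` type, `newProber`, `onIdleOld`, and `onIdleNew` below are hypothetical stand-ins that only model a node set and a per-node result list, assuming that results are produced for every node the prober still tracks.

```go
package main

import (
	"fmt"
	"sort"
)

// prober is a hypothetical stand-in for the health prober: it tracks the
// set of nodes to probe and yields one result per node it still knows.
type prober struct {
	nodes map[string]bool
}

// newProber starts with the three nodes from the example in this commit.
func newProber() *prober {
	return &prober{nodes: map[string]bool{
		"kind-control-plane": true,
		"kind-worker":        true,
		"kind-worker2":       true,
	}}
}

// setNodes applies additions and removals to the tracked node set.
func (p *prober) setNodes(added, removed []string) {
	for _, n := range added {
		p.nodes[n] = true
	}
	for _, n := range removed {
		delete(p.nodes, n)
	}
}

// getResults returns the sorted names of the nodes that would be reported.
func (p *prober) getResults() []string {
	out := make([]string, 0, len(p.nodes))
	for n := range p.nodes {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

// onIdleOld models the pre-fix order: fetch results, then update nodes.
// A node removed since the last interval is still reported once more.
func onIdleOld(p *prober, removed []string) []string {
	reported := p.getResults()
	p.setNodes(nil, removed)
	return reported
}

// onIdleNew models the fixed order: update nodes, then fetch results.
func onIdleNew(p *prober, removed []string) []string {
	p.setNodes(nil, removed)
	return p.getResults()
}

func main() {
	removed := []string{"kind-worker2"}
	fmt.Println("old order:", onIdleOld(newProber(), removed))
	fmt.Println("new order:", onIdleNew(newProber(), removed))
}
```

In this model the old order still reports `kind-worker2` in the interval in which it was removed, while the new order does not, which corresponds to the stale metric label described in the commit message.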