Fix stale references to old nodes during health probe #29566
Similar to how useful log messages are when endpoints are created and deleted, this log is useful for understanding when nodes are added and deleted in production clusters.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Given the order of operations in prober.OnIdle, it is possible for the health probe to hold stale references to deleted nodes. When that occurs, node connectivity metrics that were previously deleted [1] are brought back, causing confusion.

If users have defined alerts on the node connectivity health check metrics (see the example below), those alerts would fire erroneously because the old nodes appear in the metric labels with a failing health check.

Example given deletion of the "kind-worker2" node:

```
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_node_type="remote_intra_cluster" type="endpoint" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-control-plane" target_node_type="remote_intra_cluster" type="node" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type="local_node" type="endpoint" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker" target_node_type="local_node" type="node" 1.000000
cilium_node_connectivity_status source_cluster="kind-kind" source_node_name="kind-worker" target_cluster="kind-kind" target_node_name="kind-worker2" target_node_type="remote_intra_cluster" type="endpoint" 0.000000
```

Fixes: d9e1ff8 ("cilium-health: Remove unnecessary goroutine")

[1]: e9f97cd ("Ensures prometheus metrics associated with a deleted node are no longer reported.")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
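To make the failure mode concrete, here is a minimal sketch of how a probe run that snapshots its targets before a node deletion can write a result back for the deleted node, and how a membership check on write avoids that. This is not the change made in this PR: the `prober` type, its fields, and `simulate` are hypothetical names invented for illustration, and the real prober.OnIdle logic is more involved.

```go
package main

import "fmt"

// prober is a hypothetical, simplified stand-in for the health prober.
type prober struct {
	nodes   map[string]struct{} // nodes currently known to the cluster
	results map[string]bool     // last probe result per node, exported as metrics
}

func newProber(names ...string) *prober {
	p := &prober{nodes: map[string]struct{}{}, results: map[string]bool{}}
	for _, n := range names {
		p.nodes[n] = struct{}{}
	}
	return p
}

// snapshot copies the node names that a probe run is about to target.
func (p *prober) snapshot() []string {
	out := make([]string, 0, len(p.nodes))
	for n := range p.nodes {
		out = append(out, n)
	}
	return out
}

// removeNode deletes the node together with its exported result.
func (p *prober) removeNode(name string) {
	delete(p.nodes, name)
	delete(p.results, name)
}

// record stores a probe result. With dropStale set, results for nodes that
// are no longer known are discarded instead of being re-created.
func (p *prober) record(name string, ok bool, dropStale bool) {
	if dropStale {
		if _, known := p.nodes[name]; !known {
			return
		}
	}
	p.results[name] = ok
}

// simulate runs one probe cycle in which a node is deleted after the
// targets were snapshotted but before the results are written back.
func simulate(dropStale bool) map[string]bool {
	p := newProber("kind-control-plane", "kind-worker2")
	targets := p.snapshot()
	p.removeNode("kind-worker2")
	for _, n := range targets {
		_, alive := p.nodes[n] // a deleted node fails its probe
		p.record(n, alive, dropStale)
	}
	return p.results
}

func main() {
	fmt.Println("without check:", simulate(false))
	fmt.Println("with check:   ", simulate(true))
}
```

Without the check, the result map again contains a failing entry for "kind-worker2", which mirrors the `0.000000` metric line in the example above.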
78e77ec to 1adb2d2
cc @derailed @tommyp1ckles JFYI
/test
Thanks, looks good to me. But I guess it only removes a single probe which could potentially happen after the node was removed and not any prolonged reappearance?
As a side note, I do not really like how the GetClusterNodes API call that is underneath getNodes is implemented, as it's not stateless. If something errors out on the client side, a retry will return a different set of nodes added/removed later on, so the client and server can become out of sync.
In case of error, we could probably zero out clientID here and clean up prober state in code to kind of initiate a full relist from the API.
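The suggestion above could look roughly like the following sketch. Everything here is an assumption made for illustration: the `nodeSource` interface, the `delta` shape, and the convention that a client ID of 0 requests a full list are hypothetical and not taken from the Cilium API.

```go
package main

import (
	"errors"
	"fmt"
)

// delta is a hypothetical response: the ID assigned to the client plus the
// nodes added and removed since that client's previous call.
type delta struct {
	clientID int64
	added    []string
	removed  []string
}

// nodeSource is a hypothetical view of a stateful node-listing API.
// Assumption: clientID 0 means "new client, send the full list".
type nodeSource interface {
	ClusterNodes(clientID int64) (delta, error)
}

type client struct {
	src      nodeSource
	clientID int64
	nodes    map[string]struct{}
}

// sync applies one delta. On error it zeroes clientID and clears local
// state so that the next call performs a full relist.
func (c *client) sync() error {
	d, err := c.src.ClusterNodes(c.clientID)
	if err != nil {
		c.clientID = 0
		c.nodes = map[string]struct{}{}
		return err
	}
	c.clientID = d.clientID
	for _, n := range d.added {
		c.nodes[n] = struct{}{}
	}
	for _, n := range d.removed {
		delete(c.nodes, n)
	}
	return nil
}

// fakeSource fails on one chosen call, which loses the pending delta.
type fakeSource struct {
	all      map[string]struct{}
	calls    int
	failCall int // 1-based index of the call that returns an error
}

func (f *fakeSource) ClusterNodes(clientID int64) (delta, error) {
	f.calls++
	if f.calls == f.failCall {
		return delta{}, errors.New("transient error")
	}
	if clientID == 0 {
		d := delta{clientID: 1}
		for n := range f.all {
			d.added = append(d.added, n)
		}
		return d, nil
	}
	// Known client: the earlier delta was already consumed, nothing pending.
	return delta{clientID: clientID}, nil
}

// run loses a "node removed" delta to an error and then resynchronizes.
func run() map[string]struct{} {
	src := &fakeSource{
		all:      map[string]struct{}{"kind-worker": {}, "kind-worker2": {}},
		failCall: 2,
	}
	c := &client{src: src, nodes: map[string]struct{}{}}
	_ = c.sync()                    // full list: both nodes
	delete(src.all, "kind-worker2") // node removed on the server side
	_ = c.sync()                    // fails: delta is lost, client resets
	_ = c.sync()                    // clientID is 0 again: full relist
	return c.nodes
}

func main() {
	fmt.Println(len(run()), "node(s) after resync")
}
```

Had the client kept its old ID after the failed call, the third call would have returned an empty delta and "kind-worker2" would have stayed in the local set. The trade-off is that every error costs a full relist.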
I agree with @marseel that this deserves a refactor to remove the additional side-effect conditions on error. I think this is worth merging as-is to improve the current behavior.
Fixes cilium#29566

There were three issues with health reporting/probing:
- Whenever a node was updated, it was received in nodesAdded and overrode the ICMP result, reporting the node as unreachable.
- If the ICMP probe stopped working and there were no node updates, the node was reported as healthy even though the probe was failing.
- The HTTP prober was not triggered at the start, only after probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
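The first issue in that commit message can be illustrated with a minimal sketch in which a node update either replaces the tracked entry (losing the ICMP result) or is merged into it. This is not the code from the referenced commit: `tracker`, `status`, `upsert`, and `demo` are hypothetical names, and the sketch assumes the update does not change the address being probed.

```go
package main

import "fmt"

// status is a hypothetical per-node record kept by the prober.
type status struct {
	ip        string
	reachable bool // last ICMP result
}

type tracker struct {
	nodes map[string]*status
}

// upsert handles a node add or update event. With naive set, the entry is
// replaced wholesale, so an update resets the node to "unreachable".
// Otherwise an existing entry keeps its last probe result.
func (t *tracker) upsert(name, ip string, naive bool) {
	cur, ok := t.nodes[name]
	if naive || !ok {
		t.nodes[name] = &status{ip: ip}
		return
	}
	cur.ip = ip
}

// recordICMP stores the outcome of an ICMP probe for a tracked node.
func (t *tracker) recordICMP(name string, ok bool) {
	if s, found := t.nodes[name]; found {
		s.reachable = ok
	}
}

// demo adds a node, records a successful probe, delivers a node update,
// and returns whether the node is still reported as reachable.
func demo(naive bool) bool {
	t := &tracker{nodes: map[string]*status{}}
	t.upsert("kind-worker", "10.0.0.2", naive)
	t.recordICMP("kind-worker", true)
	t.upsert("kind-worker", "10.0.0.2", naive) // update event, same address
	return t.nodes["kind-worker"].reachable
}

func main() {
	fmt.Println("replace on update keeps result:", demo(true))
	fmt.Println("merge on update keeps result:  ", demo(false))
}
```

If an update does change the node's address, a real implementation would likely need to probe again rather than carry the old result over.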
[ upstream commit 100818f ]

Fixes #29566

There were three issues with health reporting/probing:
- Whenever a node was updated, it was received in nodesAdded and overrode the ICMP result, reporting the node as unreachable.
- If the ICMP probe stopped working and there were no node updates, the node was reported as healthy even though the probe was failing.
- The HTTP prober was not triggered at the start, only after probeInterval.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
Related: #28382