health-server prober bugfixes #29745
The health-server prober maintains a cache of the nodes in the cluster to avoid requesting the entire list of cluster nodes on each interval. When calling the daemon API's /cluster/nodes endpoint, it subscribes as a client so that it only receives a diff since the last request, which it applies to the cache.

If a client-side error is encountered when making this request, the cache is not updated and the events in that diff are lost, because the server is unaware of the failure.

This commit flushes the prober's node cache if any error occurs and sets its clientID to 0. On the subsequent request, the prober is subscribed as a new client and receives the full list of nodes.

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>
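The reset-on-error behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the Cilium code: `nodeDiff`, `fetchFunc`, `prober.sync`, `fullOrEmpty`, and `failing` are hypothetical names, and the fake server only assumes that a clientID of 0 yields the full node list.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeDiff is a hypothetical stand-in for the /cluster/nodes response.
type nodeDiff struct {
	ClientID     int64
	NodesAdded   []string
	NodesRemoved []string
}

// fetchFunc is a hypothetical stand-in for the API call.
// Assumption: clientID 0 means "new subscriber, send the full list".
type fetchFunc func(clientID int64) (nodeDiff, error)

type prober struct {
	clientID int64
	nodes    map[string]struct{}
}

func newProber() *prober { return &prober{nodes: map[string]struct{}{}} }

// sync requests a diff and applies it to the cache. On any error it
// flushes the cache and resets clientID so the next call resubscribes.
func (p *prober) sync(fetch fetchFunc) error {
	diff, err := fetch(p.clientID)
	if err != nil {
		p.nodes = map[string]struct{}{}
		p.clientID = 0
		return err
	}
	p.clientID = diff.ClientID
	for _, n := range diff.NodesAdded {
		p.nodes[n] = struct{}{}
	}
	for _, n := range diff.NodesRemoved {
		delete(p.nodes, n)
	}
	return nil
}

// fullOrEmpty fakes a server: full list for new clients, empty diff otherwise.
func fullOrEmpty(all []string) fetchFunc {
	return func(clientID int64) (nodeDiff, error) {
		if clientID == 0 {
			return nodeDiff{ClientID: 1, NodesAdded: all}, nil
		}
		return nodeDiff{ClientID: clientID}, nil
	}
}

// failing fakes a client-side error such as a timeout.
func failing(int64) (nodeDiff, error) {
	return nodeDiff{}, errors.New("client-side timeout")
}

func main() {
	p := newProber()
	server := fullOrEmpty([]string{"node-a", "node-b"})

	_ = p.sync(server)  // first call: subscribed, full list cached
	_ = p.sync(failing) // error: cache flushed, clientID reset
	fmt.Println("after error:", len(p.nodes), p.clientID)

	_ = p.sync(server) // resubscribed as a new client, full list again
	fmt.Println("after retry:", len(p.nodes), p.clientID)
}
```

Without the reset, the retry in this sketch would reuse the old clientID, receive an empty diff, and never learn about changes that happened during the failed request.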
The logic in the prober.setNodes function should skip adding nodes to the prober's cache if the IP address is empty. This depends on resolveIP() returning a nil address when it is unable to resolve the IP, but that only happens if net.ResolveIPAddr returns an error. If the address is an empty string, net.ResolveIPAddr returns a pointer to an empty net.IPAddr{} and a nil error. This allows a node with no IP addresses to be added to the cache.

server.resolveIP() has been updated to skip the address if it is empty. This prevents the prober from trying to probe and report on the empty addresses.

The following is the output of `cilium-health status` for a faux CiliumNode with no addresses before and after this commit (snipped for brevity):

Before:
---
root@kind-worker:/home/cilium# cilium-health status
Probe time:   2023-12-08T14:58:11Z
Nodes:
  kind-kind/kind-worker2:
    Host connectivity to :
      ICMP to stack:   Connection timed out
    Secondary connectivity to :
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 10.244.1.202:
      ICMP to stack:   Connection timed out
    Secondary connectivity to fd00:10:244:1::c879:
      ICMP to stack:   Connection timed out
---

After:
---
root@kind-control-plane:/home/cilium# cilium-health status
Probe time:   2023-12-08T14:44:35Z
Nodes:
  kind-kind/kind-worker2:
    Endpoint connectivity to 10.244.1.202:
      ICMP to stack:   Connection timed out
    Secondary connectivity to fd00:10:244:1::c879:
      ICMP to stack:   Connection timed out
---

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>
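The standard-library behavior this commit works around can be reproduced in isolation. The sketch below is illustrative only: `resolveIP` and `stdlibAcceptsEmpty` are hypothetical helpers written for this example and are not the Cilium functions; only `net.ResolveIPAddr` is a real API.

```go
package main

import (
	"fmt"
	"net"
)

// stdlibAcceptsEmpty reports whether net.ResolveIPAddr returns a non-nil,
// empty address with a nil error for an empty string, as described above.
func stdlibAcceptsEmpty() bool {
	ip, err := net.ResolveIPAddr("ip", "")
	return err == nil && ip != nil && len(ip.IP) == 0
}

// resolveIP is a hypothetical guarded resolver: it returns nil for an
// empty address instead of passing it through to net.ResolveIPAddr.
func resolveIP(addr string) *net.IPAddr {
	if addr == "" {
		return nil // nothing to resolve, so callers can skip this address
	}
	ip, err := net.ResolveIPAddr("ip", addr)
	if err != nil {
		return nil
	}
	return ip
}

func main() {
	fmt.Println("stdlib accepts empty address:", stdlibAcceptsEmpty())
	fmt.Println("guarded resolveIP(\"\") is nil:", resolveIP("") == nil)
	fmt.Println("guarded resolveIP(\"10.244.1.202\"):", resolveIP("10.244.1.202"))
}
```

Because an IP literal is parsed without a DNS lookup, the last call does not depend on network access.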
/test
LGTM, Thanks!
What was a bit confusing for me was that the response contains "NodesAdded" and "NodesRemoved", but it seems that "NodesAdded" also carries updated nodes:
Line 494 in 7df3194
for i, added := range c.NodesAdded {
So if a node is missing its IP, we should receive the updated node later on.
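The point in this review comment can be illustrated with a small sketch, assuming (as the reviewer suggests) that an updated node arrives again in the added list. The types and `applyDiff` are hypothetical, the processing order is an assumption, and the node IP is a made-up documentation address.

```go
package main

import "fmt"

// node and diff are hypothetical, simplified stand-ins for the API models.
type node struct {
	Name string
	IP   string
}

type diff struct {
	NodesAdded   []node // assumption: also carries updated nodes
	NodesRemoved []node
}

// applyDiff upserts added nodes (skipping those without an IP) and then
// deletes removed ones. A node skipped now is cached once an update with
// an IP shows up in a later diff.
func applyDiff(cache map[string]node, d diff) {
	for _, n := range d.NodesAdded {
		if n.IP == "" {
			continue // no address to probe yet
		}
		cache[n.Name] = n
	}
	for _, n := range d.NodesRemoved {
		delete(cache, n.Name)
	}
}

func main() {
	cache := map[string]node{}

	// First diff: the node has no IP yet, so it is not cached.
	applyDiff(cache, diff{NodesAdded: []node{{Name: "kind-worker2"}}})
	fmt.Println("cached after first diff:", len(cache))

	// Later diff: the same node arrives again in NodesAdded, now with an IP.
	applyDiff(cache, diff{NodesAdded: []node{{Name: "kind-worker2", IP: "192.0.2.10"}}})
	fmt.Println("cached after update:", cache["kind-worker2"].IP)
}
```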
All seems good except for one unresolved conversation. Is it blocking? https://github.com/cilium/cilium/pull/29745/files#r1431802011
@dylandreimerink thanks for pointing that out. Responded/resolved 👍
This PR fixes 2 separate bugs in the health-server prober:
The 1st commit resets the cache if a client-side error is encountered.
The 2nd commit ensures we actually skip adding empty IP addresses to the prober's node cache.