Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health-server prober bugfixes #29745

Merged
merged 2 commits into from
Jan 10, 2024
Merged

Conversation

thorn3r
Copy link
Contributor

@thorn3r thorn3r commented Dec 8, 2023

This PR fixes 2 separate bugs in the health-server prober:

The 1st commit resets the cache if a client-side error is encountered:

    reset health-server's node cache on error
    
    The health-server prober maintains a cache of the nodes in the cluster
    to avoid requesting the entire list of cluster nodes on each interval.
    When calling the daemon API's /cluster/nodes endpoint, it subscribes as
    a client so that it only receives a diff since the last request, which
    it applies to the cache.
    
    If a client-side error is encountered when making this request, the
    cache is not updated and events are lost since the server is unaware.
    
    This commit flushes the prober's node cache if any error occurs and sets
    its clientID to 0. On the subsequent request, the prober will be
    subscribed as a new client and receive the full list of nodes.

The 2nd commit ensures we actually skip adding empty IP addresses to the prober's node cache:

    prevent adding empty IPs to health-server cache
    
    The logic in the prober.setNodes function should skip adding nodes to
    the prober's cache if the IP address is empty. This is dependant on
    resolveIP() returning a nil address if it is unable to resolve the IP,
    however this will only happen if net.ResolveIPAddr returns an error. In
    the case that the address is an empty string, net.ResolveIPAddr will
    return a pointer to an empty net.IPAddr{} and error will be nil. This
    allows a node with no IP addresses to be added to the cache.
    
    server.resolveIP() has been updated to skip the address if it is empty.
    This prevents the prober from trying to probe and report on the empty
    addresses.
    
    The following is the output of `cilium-health status` for a faux
    CiliumNode with no addresses before and after this commit (snipped for
    brevity):
    
    Before:
    ---
    root@kind-worker:/home/cilium# cilium-health status
    Probe time:   2023-12-08T14:58:11Z
    Nodes:
      kind-kind/kind-worker2:
        Host connectivity to :
          ICMP to stack:   Connection timed out
          Secondary connectivity to :
            ICMP to stack:   Connection timed out
        Endpoint connectivity to 10.244.1.202:
          ICMP to stack:   Connection timed out
          Secondary connectivity to fd00:10:244:1::c879:
            ICMP to stack:   Connection timed out
    ---
    
    After:
    ---
    root@kind-control-plane:/home/cilium# cilium-health status
    Probe time:   2023-12-08T14:44:35Z
    Nodes:
      kind-kind/kind-worker2:
        Endpoint connectivity to 10.244.1.202:
          ICMP to stack:   Connection timed out
          Secondary connectivity to fd00:10:244:1::c879:
            ICMP to stack:   Connection timed out
    ---
Fix bugs in health-server that cause the state in the prober's cache to drift and allow nodes with empty IP addresses to be added.

The health-server prober maintains a cache of the nodes in the cluster
to avoid requesting the entire list of cluster nodes on each interval.
When calling the daemon API's /cluster/nodes endpoint, it subscribes as
a client so that it only receives a diff since the last request, which
it applies to the cache.

If a client-side error is encountered when making this request, the
cache is not updated and events are lost since the server is unaware.

This commit flushes the prober's node cache if any error occurs and sets
its clientID to 0. On the subsequent request, the prober will be
subscribed as a new client and receive the full list of nodes.

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 8, 2023
@thorn3r thorn3r added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Dec 8, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 8, 2023
The logic in the prober.setNodes function should skip adding nodes to
the prober's cache if the IP address is empty. This is dependent on
resolveIP() returning a nil address if it is unable to resolve the IP,
however this will only happen if net.ResolveIPAddr returns an error. In
the case that the address is an empty string, net.ResolveIPAddr will
return a pointer to an empty net.IPAddr{} and error will be nil. This
allows a node with no IP addresses to be added to the cache.

server.resolveIP() has been updated to skip the address if it is empty.
This prevents the prober from trying to probe and report on the empty
addresses.

The following is the output of `cilium-health status` for a faux
CiliumNode with no addresses before and after this commit (snipped for
brevity):

Before:
---
root@kind-worker:/home/cilium# cilium-health status
Probe time:   2023-12-08T14:58:11Z
Nodes:
  kind-kind/kind-worker2:
    Host connectivity to :
      ICMP to stack:   Connection timed out
      Secondary connectivity to :
        ICMP to stack:   Connection timed out
    Endpoint connectivity to 10.244.1.202:
      ICMP to stack:   Connection timed out
      Secondary connectivity to fd00:10:244:1::c879:
        ICMP to stack:   Connection timed out
---

After:
---
root@kind-control-plane:/home/cilium# cilium-health status
Probe time:   2023-12-08T14:44:35Z
Nodes:
  kind-kind/kind-worker2:
    Endpoint connectivity to 10.244.1.202:
      ICMP to stack:   Connection timed out
      Secondary connectivity to fd00:10:244:1::c879:
        ICMP to stack:   Connection timed out
---

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>
@thorn3r
Copy link
Contributor Author

thorn3r commented Dec 8, 2023

/test

@thorn3r thorn3r marked this pull request as ready for review December 8, 2023 18:13
@thorn3r thorn3r requested a review from a team as a code owner December 8, 2023 18:13
@joestringer joestringer added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Dec 14, 2023
Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks!

What was a bit confusing for me was that response contains "NodesAdded" and "NodesRemoved", but it seems like "NodesAdded" also contains "NodesUpdated":

for i, added := range c.NodesAdded {
🤯
so in case we are missing IP we should receive later on updated node

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Dec 19, 2023
@dylandreimerink
Copy link
Member

All seems good except for one unresolved conversation, is it blocking? https://github.com/cilium/cilium/pull/29745/files#r1431802011

@thorn3r
Copy link
Contributor Author

thorn3r commented Jan 9, 2024

@dylandreimerink thanks for pointing that out. Responded/resolved 👍

@thorn3r thorn3r added affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Jan 9, 2024
@dylandreimerink dylandreimerink added this pull request to the merge queue Jan 10, 2024
Merged via the queue into cilium:main with commit bf8d26a Jan 10, 2024
61 checks passed
@jibi jibi mentioned this pull request Jan 12, 2024
32 tasks
@jibi jibi added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Jan 12, 2024
@giorio94 giorio94 added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jan 29, 2024
@marseel marseel mentioned this pull request Jan 31, 2024
17 tasks
@nbusseneau nbusseneau added the affects/v1.12 This issue affects v1.12 branch label Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants