
Inconsistent view of nodes across agents #11532

Closed
diversario opened this issue May 14, 2020 · 7 comments · Fixed by #12989
Labels
area/health Relates to the cilium-health component kind/community-report This was reported by a user in the Cilium community, eg via Slack.

@diversario (Contributor)

Bug report

  • Cilium version (run cilium version)
Client: 1.7.3 952090308 2020-04-29T15:29:53-07:00 go version go1.13.10 linux/amd64
Daemon: 1.7.3 952090308 2020-04-29T15:29:53-07:00 go version go1.13.10 linux/amd64
  • Kernel version (run uname -a)
Linux ip-10-101-8-142 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, Mesos, ...)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.9", GitCommit:"2e808b7cb054ee242b68e62455323aa783991f03", GitTreeState:"clean", BuildDate:"2020-01-18T23:24:23Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
  • Upload a system dump

Will be passed to a maintainer privately.

How to reproduce the issue

I don't know the exact steps, but what likely led to this was a constantly rolling cluster: nodes were continuously being replaced by kops rolling-update over the course of 7 days (we were running a resiliency test). After a couple of days I noticed that one agent was reporting cilium_unreachable_nodes=1. On investigation, I found that this agent was still listing a node that had been terminated more than 9 hours earlier, both in cilium-health status and in the node list:

root@ip-10-101-12-162:~# cilium-health status | grep -A4 'ip-10-101-20-58.ec2.internal'
  ip-10-101-20-58.ec2.internal:
    Endpoint connectivity to 100.67.34.21:
      ICMP to stack:   Connection timed out
      HTTP to agent:   Get http://100.67.34.21:4240/hello: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  ip-10-101-21-110.ec2.internal:
root@ip-10-101-12-162:~# cilium node list | grep '58'
ip-10-101-22-225.ec2.internal   10.101.22.225   100.67.58.0/24
aanm (Member) commented May 15, 2020

This seems a little bit similar to #11511. @seanpowell-f5, is the node with connectivity issues still alive in the cluster?

aanm added the area/health and kind/community-report labels on May 15, 2020
diversario (Contributor, Author) commented May 22, 2020

This seems to be related to high node churn. In a cluster that just went through large node churn (30 -> 700 -> 260 nodes) and large pod churn (3000 -> 27000 -> 3000 pods), many Cilium agents report more unreachable nodes than the actual in-cluster node count:

503 unreachable nodes [metrics screenshot omitted], but only 258 nodes in the cluster [screenshot omitted].

I went to get some more information about what the Cilium agents think:

➜ kubectl exec -it cilium-mtbjj -- bash
root@ip-10-101-8-221:~# cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.16 (v1.16.9) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Partial   []
Cilium:                 Ok        OK
NodeMonitor:            Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPAM:                   IPv4: 7/255 allocated from 100.65.218.0/24,
Controller Status:      34/34 healthy
Proxy Status:           OK, ip 100.65.218.10, 0 redirects active on ports 10000-20000
Cluster health:                              1/503 reachable   (2020-05-22T10:36:40Z)
  Name                                       IP                Reachable   Endpoints reachable
  ip-10-101-8-221.ec2.internal (localhost)   10.101.8.221      false       false
  ip-10-101-0-11.ec2.internal                10.101.0.11       false       false
  ip-10-101-0-112.ec2.internal               10.101.0.112      false       false
  ip-10-101-0-115.ec2.internal               10.101.0.115      false       false
  ip-10-101-0-120.ec2.internal               10.101.0.120      false       false
  ip-10-101-0-139.ec2.internal               10.101.0.139      false       false
  ip-10-101-0-140.ec2.internal               10.101.0.140      false       false
  ip-10-101-0-165.ec2.internal               10.101.0.165      false       false
  ip-10-101-0-166.ec2.internal               10.101.0.166      false       false
  ip-10-101-0-17.ec2.internal                10.101.0.17       false       false
  ip-10-101-0-179.ec2.internal               10.101.0.179      false       false
  ...
root@ip-10-101-8-221:~#
root@ip-10-101-8-221:~# cilium-health status | head -n 100
Probe time:   2020-05-22T10:36:40Z
Nodes:
  ip-10-101-8-221.ec2.internal (localhost):
    Host connectivity to 10.101.8.221:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.65.218.93:
      ICMP to stack:   Connection timed out
  ip-10-101-0-11.ec2.internal:
    Host connectivity to 10.101.0.11:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.112.39:
      ICMP to stack:   Connection timed out
  ip-10-101-0-112.ec2.internal:
    Host connectivity to 10.101.0.112:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.65.123.137:
      ICMP to stack:   Connection timed out
  ip-10-101-0-115.ec2.internal:
    Host connectivity to 10.101.0.115:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.65.91.161:
      ICMP to stack:   Connection timed out
  ip-10-101-0-120.ec2.internal:
    Host connectivity to 10.101.0.120:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.225.93:
      ICMP to stack:   Connection timed out
  ip-10-101-0-139.ec2.internal:
    Host connectivity to 10.101.0.139:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.99.41:
      ICMP to stack:   OK, RTT=18.827537ms
  ip-10-101-0-140.ec2.internal:
    Host connectivity to 10.101.0.140:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.40.88:
      ICMP to stack:   Connection timed out
  ip-10-101-0-165.ec2.internal:
    Host connectivity to 10.101.0.165:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.56.174:
      ICMP to stack:   Connection timed out
  ip-10-101-0-166.ec2.internal:
    Host connectivity to 10.101.0.166:
      ICMP to stack:   OK, RTT=19.466825ms
    Endpoint connectivity to 100.64.117.45:
      ICMP to stack:   Connection timed out
  ip-10-101-0-17.ec2.internal:
    Host connectivity to 10.101.0.17:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.65.26.64:
      ICMP to stack:   Connection timed out
  ip-10-101-0-179.ec2.internal:
    Host connectivity to 10.101.0.179:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.9.28:
      ICMP to stack:   Connection timed out
  ip-10-101-0-181.ec2.internal:
    Host connectivity to 10.101.0.181:
      ICMP to stack:   OK, RTT=17.876896ms
    Endpoint connectivity to 100.65.48.219:
      ICMP to stack:   OK, RTT=18.540522ms
  ip-10-101-0-183.ec2.internal:
    Host connectivity to 10.101.0.183:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.23.55:
      ICMP to stack:   Connection timed out
  ip-10-101-0-184.ec2.internal:
    Host connectivity to 10.101.0.184:
      ICMP to stack:   OK, RTT=18.836192ms
    Endpoint connectivity to 100.64.78.200:
      ICMP to stack:   OK, RTT=19.608865ms
...

Looking up instances listed in the above output:

root@ip-10-101-8-221:~# cilium node list | egrep 'ip-10-101-8-221.ec2.internal|ip-10-101-0-11.ec2.internal|ip-10-101-0-112.ec2.internal|ip-10-101-0-115.ec2.internal|ip-10-101-0-120.ec2.internal|ip-10-101-0-139.ec2.internal|ip-10-101-0-140.ec2.internal|ip-10-101-0-165.ec2.internal|ip-10-101-0-166.ec2.internal|ip-10-101-0-17.ec2.internal|ip-10-101-0-179.ec2.internal|ip-10-101-0-181.ec2.internal|ip-10-101-0-183.ec2.internal|ip-10-101-0-184.ec2.internal'
ip-10-101-0-17.ec2.internal     10.101.0.17     100.65.26.0/24
ip-10-101-0-181.ec2.internal    10.101.0.181    100.65.48.0/24
ip-10-101-0-183.ec2.internal    10.101.0.183    100.64.23.0/24
ip-10-101-8-221.ec2.internal    10.101.8.221    100.65.218.0/24
➜ kubectl get ciliumnode ip-10-101-8-221.ec2.internal ip-10-101-0-11.ec2.internal ip-10-101-0-112.ec2.internal ip-10-101-0-115.ec2.internal ip-10-101-0-120.ec2.internal ip-10-101-0-139.ec2.internal ip-10-101-0-140.ec2.internal ip-10-101-0-165.ec2.internal ip-10-101-0-166.ec2.internal ip-10-101-0-17.ec2.internal ip-10-101-0-179.ec2.internal ip-10-101-0-181.ec2.internal ip-10-101-0-183.ec2.internal ip-10-101-0-184.ec2.internal
NAME                           AGE
ip-10-101-8-221.ec2.internal   3h42m
ip-10-101-0-17.ec2.internal    3h40m
ip-10-101-0-181.ec2.internal   3h40m
ip-10-101-0-183.ec2.internal   4h4m
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-11.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-112.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-115.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-120.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-139.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-140.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-165.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-166.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-179.ec2.internal" not found
Error from server (NotFound): ciliumnodes.cilium.io "ip-10-101-0-184.ec2.internal" not found
➜ kubectl get node ip-10-101-8-221.ec2.internal ip-10-101-0-11.ec2.internal ip-10-101-0-112.ec2.internal ip-10-101-0-115.ec2.internal ip-10-101-0-120.ec2.internal ip-10-101-0-139.ec2.internal ip-10-101-0-140.ec2.internal ip-10-101-0-165.ec2.internal ip-10-101-0-166.ec2.internal ip-10-101-0-17.ec2.internal ip-10-101-0-179.ec2.internal ip-10-101-0-181.ec2.internal ip-10-101-0-183.ec2.internal ip-10-101-0-184.ec2.internal
NAME                           STATUS   ROLES   AGE     VERSION
ip-10-101-8-221.ec2.internal   Ready    node    3h41m   v1.16.9
ip-10-101-0-17.ec2.internal    Ready    node    4h1m    v1.16.9
ip-10-101-0-181.ec2.internal   Ready    node    4h1m    v1.16.9
ip-10-101-0-183.ec2.internal   Ready    node    4h3m    v1.16.9
Error from server (NotFound): nodes "ip-10-101-0-11.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-112.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-115.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-120.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-139.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-140.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-165.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-166.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-179.ec2.internal" not found
Error from server (NotFound): nodes "ip-10-101-0-184.ec2.internal" not found

The instances that do appear in ciliumnodes and in kubectl get node do indeed exist in EC2.

What's really bizarre here is that cilium-health status shows some dead instances as responding OK:

  ip-10-101-0-139.ec2.internal:
    Host connectivity to 10.101.0.139:
      ICMP to stack:   Connection timed out
    Endpoint connectivity to 100.64.99.41:
      ICMP to stack:   OK, RTT=18.827537ms

ip-10-101-0-139.ec2.internal was double-checked and confirmed to no longer exist.

Also:

root@ip-10-101-8-221:~# cilium-health status | grep '.ec2.internal' | wc -l
503
root@ip-10-101-8-221:~# cilium node list | wc -l
259

@diversario (Contributor, Author)

The agent from the previous post appears to have fixed itself [metrics screenshot omitted].

This reconciliation took ~3.5 hours.

stale bot commented Jul 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

The stale bot added the stale label on Jul 23, 2020.

trynity commented Jul 28, 2020

Both this and the other linked issue have been, or are about to be, marked as stale. Is there any further insight we could get on this, @aanm?

The stale bot removed the stale label on Jul 28, 2020.

aanm commented Jul 28, 2020

@trynity It would be helpful to have steps to replicate the issue.

tgraf commented Jul 29, 2020

@trynity What version of Cilium did you experience this with?

kkourt added a commit to kkourt/cilium that referenced this issue Aug 27, 2020
NodeAdd and NodeUpdate update the node state for clients so that they
can return the changes when a client requests them. If a node was added
and then updated, both its old and new versions would be on the added
list and its old version on the removed list. Instead, we can just
update the node in place on the added list.

Note that the setNodes() function in pkg/health/server/prober.go first
deletes the removed nodes and then adds the new ones, which means that
the old version of the node would be re-added and remain stale on the
health server.

This was found while investigating inconsistent health reports when
nodes are added to or removed from the cluster (e.g., cilium#11532), and
it seems to fix the inconsistencies observed in a small-scale test I did
to reproduce the issue.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
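
To make the mechanism described above concrete, here is a minimal, self-contained Go sketch. It is not the actual Cilium code: the node type, the keying scheme, the setNodes signature, and the example new IP are illustrative assumptions, but the delete-then-add ordering and the duplicated entry on the added list follow the behaviour the commit message describes.

package main

import "fmt"

// node is a simplified stand-in for the health server's per-node record;
// the real Cilium structures differ.
type node struct {
    name string
    ip   string
}

// key builds the map key. Keying by name+IP makes a leftover old version
// show up as a separate stale entry, which is the symptom reported above.
func key(n node) string { return n.name + "/" + n.ip }

// setNodes mimics the ordering described in the commit message: removed
// nodes are deleted first, then added nodes are inserted.
func setNodes(state map[string]node, added, removed []node) {
    for _, n := range removed {
        delete(state, key(n))
    }
    for _, n := range added {
        state[key(n)] = n
    }
}

func main() {
    oldN := node{name: "ip-10-101-0-139.ec2.internal", ip: "10.101.0.139"}
    newN := node{name: "ip-10-101-0-139.ec2.internal", ip: "10.101.0.200"} // hypothetical new IP

    // Buggy change set: after an update, both the old and new versions sit
    // on the added list, while only the old version is on the removed list.
    buggy := map[string]node{}
    setNodes(buggy, []node{oldN}, nil)                // initial add
    setNodes(buggy, []node{oldN, newN}, []node{oldN}) // update
    fmt.Printf("buggy state has %d entries\n", len(buggy)) // 2: the old entry is stale

    // With the described fix, the update replaces the node on the added
    // list, so only the new version is (re)inserted and nothing lingers.
    fixed := map[string]node{}
    setNodes(fixed, []node{oldN}, nil)
    setNodes(fixed, []node{newN}, []node{oldN})
    fmt.Printf("fixed state has %d entries\n", len(fixed)) // 1
}

Running the sketch prints 2 entries for the buggy change set and 1 for the fixed one, which mirrors how a terminated node's old entry could linger in cilium-health status while cilium node list had already dropped it.
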
joestringer pushed a commit that referenced this issue Aug 28, 2020
kaworu pushed a commit that referenced this issue Aug 28, 2020 (upstream commit 5550c0f)
joestringer pushed a commit that referenced this issue Aug 28, 2020 (upstream commit 5550c0f)
christarazi pushed a commit that referenced this issue Sep 3, 2020 (upstream commit 5550c0f)
joestringer pushed a commit that referenced this issue Sep 3, 2020 (upstream commit 5550c0f)