Fix bug where cilium-health reports connectivity failures to stale IPs #12989

kkourt · 2020-08-27T14:33:22Z

NodeAdd and NodeUpdate update the node state kept for clients so that
the changes can be returned when a client requests them. If a node was
added and then updated, both its old and new versions would be on the
added list, and its old version on the removed list. Instead, we can
just update the node on the added list.
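
To make the bookkeeping concrete, here is a minimal sketch of the two update strategies. It is not the actual code in daemon/cmd/status.go: the `node` and `clientState` types and all method names are hypothetical, and a node is reduced to a name and a single IP.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for a cluster node.
type node struct {
	Name string
	IP   string
}

// clientState holds the changes accumulated for one client since its
// last request (hypothetical type, for illustration only).
type clientState struct {
	added   []node
	removed []node
}

// nodeAdd records a newly added node.
func (c *clientState) nodeAdd(n node) {
	c.added = append(c.added, n)
}

// nodeUpdateNaive records an update as "add new, remove old". If the old
// version is still pending on the added list, both versions end up there.
func (c *clientState) nodeUpdateNaive(oldNode, newNode node) {
	c.added = append(c.added, newNode)
	c.removed = append(c.removed, oldNode)
}

// nodeUpdateInPlace overwrites a pending entry on the added list if one
// exists, and only otherwise falls back to "add new, remove old".
func (c *clientState) nodeUpdateInPlace(oldNode, newNode node) {
	for i := range c.added {
		if c.added[i].Name == newNode.Name {
			c.added[i] = newNode
			return
		}
	}
	c.added = append(c.added, newNode)
	c.removed = append(c.removed, oldNode)
}

func main() {
	v1 := node{Name: "node-a", IP: "10.0.0.1"}
	v2 := node{Name: "node-a", IP: "10.0.0.2"}

	naive := &clientState{}
	naive.nodeAdd(v1)
	naive.nodeUpdateNaive(v1, v2)
	fmt.Println("naive:   ", naive.added, naive.removed)

	fixed := &clientState{}
	fixed.nodeAdd(v1)
	fixed.nodeUpdateInPlace(v1, v2)
	fmt.Println("in place:", fixed.added, fixed.removed)
}
```

In the naive variant the added list carries both versions of node-a and the removed list carries the old one; in the in-place variant only the new version is pending and nothing is marked as removed.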

Note that the setNodes() function in pkg/health/server/prober.go first
deletes the removed nodes and then adds the new ones, which means that
the old version of the node would be added and remain stale on the
health server.
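
The effect of that ordering can be sketched as follows. This is an illustration under assumptions, not the real setNodes(): `applyDelta` is a hypothetical helper, and keying the probe targets by IP is an assumption made for the example.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for a cluster node.
type node struct {
	Name string
	IP   string
}

// applyDelta applies a delta in the order described above: removed nodes
// are deleted first, then added nodes are inserted. The map of probe
// targets is keyed by IP (an assumption of this sketch).
func applyDelta(targets map[string]string, added, removed []node) {
	for _, n := range removed {
		delete(targets, n.IP)
	}
	for _, n := range added {
		targets[n.IP] = n.Name
	}
}

func main() {
	oldVersion := node{Name: "node-a", IP: "10.0.0.1"}
	newVersion := node{Name: "node-a", IP: "10.0.0.2"}

	// Delta with both versions on the added list and the old one on
	// the removed list: the old IP is deleted, then added right back.
	stale := map[string]string{}
	applyDelta(stale, []node{oldVersion, newVersion}, []node{oldVersion})
	fmt.Println("both versions added:", len(stale), "targets")

	// Delta with only the updated version on the added list.
	clean := map[string]string{}
	applyDelta(clean, []node{newVersion}, nil)
	fmt.Println("updated in place:   ", len(clean), "targets")
}
```

With the first delta the old IP survives as a probe target even though it appears on the removed list, because the deletion runs before the insertion. With the second delta only the current IP is probed.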

This was found while investigating inconsistent health reports when
nodes are added to or removed from the cluster (e.g., #11532), and it
seems to fix the inconsistencies observed in a small-scale test I did
to reproduce the issue.

Fixes #11532.

daemon/cmd/status.go

NodeAdd and NodeUpdate update the node state kept for clients so that the changes can be returned when a client requests them. If a node was added and then updated, both its old and new versions would be on the added list, and its old version on the removed list. Instead, we can just update the node on the added list.

Note that the setNodes() function in pkg/health/server/prober.go first deletes the removed nodes and then adds the new ones, which means that the old version of the node would be added and remain stale on the health server.

This was found while investigating inconsistent health reports when nodes are added to or removed from the cluster (e.g., cilium#11532), and it seems to fix the inconsistencies observed in a small-scale test I did to reproduce the issue.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>

christarazi · 2020-08-27T16:55:41Z

test-me-please

joestringer

Thanks for digging into this! I have some questions below. Even if this addresses some use cases, I don't understand how we avoid causing regressions in other areas with the current version. It would be good to get a clearer outline of the exact set of steps we believe are happening and how this addresses that case, and to make sure we're not missing other cases in the fix for this one.

I think this also highlights that if we can come up with some better ways to automate testing in this area, we would benefit greatly.

daemon/cmd/status.go

kkourt · 2020-08-27T19:05:33Z

Just re-read through a few times and I get it now :-)

I also wrote a more detailed explanation, hope it helps.

pchaigno · 2020-08-27T19:07:33Z

The 4.9 pipeline failed with a bunch of seemingly unrelated errors; restarting to see if they are flakes.
Previous run available here: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.9/303/testReport/.
retest-4.9

joestringer · 2020-08-27T19:09:40Z

Filed #12993 to follow up on 4.9 failures, seems like an unrelated issue.

joestringer · 2020-08-28T01:12:09Z

K8s-1.12-Kernel-netnext CI failure "Suite-k8s-1.12.K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing and DSR" is a known flaky test. Can retry, but I'm guessing it'll pass.

joestringer · 2020-08-28T01:12:16Z

retest-net-next

joestringer · 2020-08-28T01:13:05Z

4.9 kernel CI failures got worse this time, up from 22 failures to 71 🤔:
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.9/304/testReport/

EDIT: Oh, I see. This time the K8sIstioTest Istio Bookinfo Demo test ran first out of the whole suite, so when it failed (& failed to clean up), it likely caused even more of the subsequent tests to fail this time.

joestringer · 2020-08-28T01:47:42Z

Observed that this is not the only PR affected by the 4.9 issues, so I do not plan to have them block this PR. They also look completely unrelated to the code changes here.

The only other failure is in the net-next job, which is a known failure: #12994

kkourt requested a review from a team August 27, 2020 14:33

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 27, 2020

kkourt changed the title ~~health incosistencies fix~~ Health incosistencies fix during node addition removal Aug 27, 2020

kkourt changed the title ~~Health incosistencies fix during node addition removal~~ Health incosistencies fix during nodes addition/removal Aug 27, 2020

pchaigno added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 27, 2020

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 27, 2020

pchaigno added area/health Relates to the cilium-health component kind/bug This is a bug in the Cilium logic. needs-backport/1.7 labels Aug 27, 2020

maintainer-s-little-helper bot added this to Needs backport from master in 1.7.8 Aug 27, 2020

maintainer-s-little-helper bot added this to Needs backport from master in 1.8.3 Aug 27, 2020

pchaigno reviewed Aug 27, 2020

daemon/cmd/status.go (outdated, resolved)

pchaigno reviewed Aug 27, 2020

daemon/cmd/status.go (outdated, resolved)

kkourt force-pushed the pr/kkourt/health-incosistencies-fix branch from bd1862f to 2622587 Compare August 27, 2020 16:25

kkourt force-pushed the pr/kkourt/health-incosistencies-fix branch from 2622587 to 0d5f890 Compare August 27, 2020 16:26

christarazi approved these changes Aug 27, 2020

fristonio approved these changes Aug 27, 2020

fristonio requested a review from pchaigno August 27, 2020 17:30

joestringer requested changes Aug 27, 2020

daemon/cmd/status.go (resolved)

daemon/cmd/status.go (resolved)

pchaigno approved these changes Aug 27, 2020

joestringer approved these changes Aug 27, 2020

daemon/cmd/status.go (resolved)

joestringer mentioned this pull request Aug 27, 2020

CI: K8sHealthTest checks cilium-health status between nodes: Cilium cannot be installed: rendered manifests contain a resource that already exists. #12993

Closed

joestringer changed the title ~~Health incosistencies fix during nodes addition/removal~~ Fix bug where cilium-health reports connectivity failures to stale IPs Aug 27, 2020

joestringer merged commit 5550c0f into cilium:master Aug 28, 2020

kkourt deleted the pr/kkourt/health-incosistencies-fix branch August 28, 2020 06:09

kaworu added backport-pending/1.7 and removed needs-backport/1.7 labels Aug 28, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.8 Aug 28, 2020

kaworu mentioned this pull request Aug 28, 2020

v1.7 backports 2020-08-28 #13002

Merged

joestringer added backport-done/1.7 and removed backport-pending/1.7 labels Aug 28, 2020

maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.8 Aug 28, 2020

christarazi mentioned this pull request Sep 3, 2020

v1.8 backports 2020-09-02 #13060

Merged

christarazi added backport-pending/1.8 and removed needs-backport/1.8 labels Sep 3, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.3 Sep 3, 2020

joestringer added backport-done/1.8 and removed backport-pending/1.8 labels Sep 4, 2020

maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.3 Sep 4, 2020
