Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where cilium-health reports connectivity failures to stale IPs #12989

Merged

Conversation

kkourt
Copy link
Contributor

@kkourt kkourt commented Aug 27, 2020

NodeAdd and NodeUpdate update the node state for clients so that they
can return the changes when client requests so. If a node was added and
then updated, its old and new version would be on the added list and its
old on the removed list. Instead, we can just update the node on the
added list.

Note that the setNodes() function on pkg/health/server/prober.go first
deletes the removed nodes and then adds the new ones, which means that
the old version of the node would be added and remain as stale on the
health server.

This was found during investigation of issues with inconsistent health
reports when nodes are added/removed from the cluster (e.g., #11532),
and it seems to fix inconsistencies observed a small-scale test I did to
reproduce the issue.

Fixes #11532.

@kkourt kkourt requested a review from a team August 27, 2020 14:33
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 27, 2020
@kkourt kkourt changed the title health incosistencies fix Health incosistencies fix during node addition removal Aug 27, 2020
@kkourt kkourt changed the title Health incosistencies fix during node addition removal Health incosistencies fix during nodes addition/removal Aug 27, 2020
@pchaigno pchaigno added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 27, 2020
@pchaigno pchaigno added area/health Relates to the cilium-health component kind/bug This is a bug in the Cilium logic. needs-backport/1.7 labels Aug 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.8 Aug 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.3 Aug 27, 2020
daemon/cmd/status.go Outdated Show resolved Hide resolved
daemon/cmd/status.go Outdated Show resolved Hide resolved
@kkourt kkourt force-pushed the pr/kkourt/health-incosistencies-fix branch from bd1862f to 2622587 Compare August 27, 2020 16:25
NodeAdd and NodeUpdate update the node state for clients so that they
can return the changes when client requests so. If a node was added and
then updated, its old and new version would be on the added list and its
old on the removed list. Instead, we can just update the node on the
added list.

Note that the setNodes() function on pkg/health/server/prober.go first
deletes the removed nodes and then adds the new ones, which means that
the old version of the node would be added and remain as stale on the
health server.

This was found during investigation of issues with inconsistent health
reports when nodes are added/removed from the cluster (e.g., cilium#11532),
and it seems to fix inconsistencies observed a small-scale test I did to
reproduce the issue.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
@kkourt kkourt force-pushed the pr/kkourt/health-incosistencies-fix branch from 2622587 to 0d5f890 Compare August 27, 2020 16:26
@christarazi
Copy link
Member

test-me-please

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this! I have some questions below, even if this may address some use cases I don't understand how we don't cause regressions in other areas with the current version. Would be good to get a clearer outline of the exact set of steps we believe are happening and how this addresses that case, and make sure we're not missing other cases in the fix for this case.

I think this also highlights that if we can come up with some better ways to automate testing in this area, we would benefit greatly.

daemon/cmd/status.go Show resolved Hide resolved
daemon/cmd/status.go Show resolved Hide resolved
daemon/cmd/status.go Show resolved Hide resolved
@kkourt
Copy link
Contributor Author

kkourt commented Aug 27, 2020

Just re-read through a few times and I get it now :-)

I also wrote a more detailed explanation, hope it helps.

@pchaigno
Copy link
Member

pchaigno commented Aug 27, 2020

4.9 pipeline failed with a bunch of seemingly-unrelated errors, restarting to see if they are flakes.
Previous run available here: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.9/303/testReport/.
retest-4.9

@joestringer
Copy link
Member

Filed #12993 to follow up on 4.9 failures, seems like an unrelated issue.

@joestringer joestringer changed the title Health incosistencies fix during nodes addition/removal Fix bug where cilium-health reports connectivity failures to stale IPs Aug 27, 2020
@joestringer
Copy link
Member

K8s-1.12-Kernel-netnext CI failure Suite-k8s-1.12.K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing and DSR is a known flaky test, can retry but I'm guessing it'll pass

@joestringer
Copy link
Member

retest-net-next

@joestringer
Copy link
Member

joestringer commented Aug 28, 2020

4.9 kernel CI failures got worse this time, Up from 22 failures to 71 🤔 :
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.9/304/testReport/

EDIT: Oh, I see. This time the K8sIstioTest Istio Bookinfo Demo test ran first out of the whole suite, so when it failed (& failed to clean up), it likely caused even more of the subsequent tests to fail this time.

@joestringer
Copy link
Member

joestringer commented Aug 28, 2020

Observed that this is not the only PR to be affected by the 4.9 issues so do not plan to have that block this PR. They also look completely unrelated to the code changes here.

Only other failure is in net-next job, which is known failure #12994

@joestringer joestringer merged commit 5550c0f into cilium:master Aug 28, 2020
@kkourt kkourt deleted the pr/kkourt/health-incosistencies-fix branch August 28, 2020 06:09
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.8 Aug 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.8 Aug 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.3 Sep 3, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.3 Sep 3, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.3 Sep 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/health Relates to the cilium-health component kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.7.8
Backport done to v1.7
1.8.3
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

Inconsistent view of nodes across agents
6 participants