Fix bug where cilium-health reports connectivity failures to stale IPs #12989
NodeAdd and NodeUpdate update the node state kept for clients so that the changes can be returned when a client requests them. If a node was added and then updated, both its old and new versions would be on the added list, and its old version on the removed list. Instead, we can just update the node on the added list.

Note that the setNodes() function in pkg/health/server/prober.go first deletes the removed nodes and then adds the new ones, which means that the old version of the node would be added and remain as a stale entry on the health server.

This was found during investigation of issues with inconsistent health reports when nodes are added to or removed from the cluster (e.g., cilium#11532), and it seems to fix the inconsistencies observed in a small-scale test I did to reproduce the issue.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
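The sequence described above can be sketched roughly as follows. This is a simplified illustration, not the actual Cilium code: the `Node` and `clientState` types, the `updateNaive`, `updateInPlace`, `applyDelta`, and `replay` functions, and the choice to key probe targets by IP are all assumptions made for the example. Only the ordering (removals applied before additions) is taken from the description of setNodes().

```go
package main

// Node is a hypothetical, simplified stand-in for a cluster node record.
type Node struct {
	Name string
	IP   string
}

// clientState is a hypothetical stand-in for the per-client bookkeeping:
// the changes accumulated since the client last asked for them.
type clientState struct {
	added   []Node
	removed []Node
}

// updateNaive records an update as "remove the old version, add the new one",
// which is the behaviour the commit message describes as problematic.
func (c *clientState) updateNaive(oldNode, newNode Node) {
	c.removed = append(c.removed, oldNode)
	c.added = append(c.added, newNode)
}

// updateInPlace replaces a still-pending entry on the added list instead.
// It falls back to remove+add when the old version was already handed out.
func (c *clientState) updateInPlace(oldNode, newNode Node) {
	for i, n := range c.added {
		if n.Name == oldNode.Name {
			c.added[i] = newNode
			return
		}
	}
	c.removed = append(c.removed, oldNode)
	c.added = append(c.added, newNode)
}

// applyDelta mimics the ordering described for setNodes(): removals first,
// then additions. Keying the probe targets by IP is an assumption.
func applyDelta(probed map[string]string, c *clientState) {
	for _, n := range c.removed {
		delete(probed, n.IP)
	}
	for _, n := range c.added {
		probed[n.IP] = n.Name
	}
}

// replay adds a node, updates its IP before the client fetches the delta,
// and returns the resulting set of probe targets.
func replay(inPlace bool) map[string]string {
	oldNode := Node{Name: "node-a", IP: "10.0.0.1"}
	newNode := Node{Name: "node-a", IP: "10.0.0.2"}

	c := &clientState{}
	c.added = append(c.added, oldNode) // node added
	if inPlace {
		c.updateInPlace(oldNode, newNode)
	} else {
		c.updateNaive(oldNode, newNode)
	}

	probed := map[string]string{}
	applyDelta(probed, c)
	return probed
}
```

In this sketch, `replay(false)` leaves both 10.0.0.1 and 10.0.0.2 as probe targets, because the removal of the old version runs before that old version is added. `replay(true)` leaves only 10.0.0.2.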
test-me-please
Thanks for digging into this! I have some questions below. Even if this addresses some use cases, I don't understand how the current version avoids causing regressions in other areas. It would be good to get a clearer outline of the exact sequence of steps we believe is happening and how this change addresses it, and to make sure the fix doesn't miss other cases.
I think this also highlights that if we can come up with some better ways to automate testing in this area, we would benefit greatly.
I also wrote a more detailed explanation, hope it helps.
The 4.9 pipeline failed with a bunch of seemingly unrelated errors; restarting to see if they are flakes.
Filed #12993 to follow up on the 4.9 failures; it seems like an unrelated issue.
K8s-1.12-Kernel-netnext CI failure
retest-net-next
4.9 kernel CI failures got worse this time, up from 22 failures to 71 🤔 : EDIT: Oh, I see. This time the
Observed that this is not the only PR affected by the 4.9 issues, so I do not plan to have them block this PR. They also look completely unrelated to the code changes here. The only other failure is in the net-next job, which is the known failure #12994.
Fixes #11532.