Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Redeploy DNS after endpointRoutes reconfiguration #16767

Merged
merged 1 commit into from Jul 6, 2021

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Jul 2, 2021

In general up until now, Cilium has expected endpointRoutes mode to be
set to exactly one value upon deployment and for that value to stay the
same for the remainder of operation. Toggling it can lead to a mix of
endpoints in different datapath modes which is not well covered in CI.

In Github issue #16717 we observed that if the testsuite toggles this
setting then we can end up with kubedns pods remaining in endpoint
routes mode, even though the rest of the daemon (and other pods) are not
configured in this mode. This can lead to connectivity issues in DNS,
and a range of test failures in subsequent tests because DNS is broken.

Longer term to resolve this, we could improve on Cilium to ensure that
users can successfully toggle this setting on or off at runtime and
properly handle this case, or alternatively shift all logic over to
endpoint-routes mode by default and disable the other option.

Given that CI for the master branch is in a poor state due to this issue
today, and that part of the issue is CI reconfiguring the datapath state
of Cilium during the test setup in an unsupported manner, this commit
proposes to force DNS pod redeployment as part of setup any time a test
reconfigures the endpointRoutes mode. This should mitigate the testing
side issue while we mull over the right longer-term solution.

Fixes: #16717
Fixes: #16227
Related: #14955

@joestringer joestringer requested a review from a team July 2, 2021 18:36
@joestringer joestringer requested a review from a team as a code owner July 2, 2021 18:36
@joestringer joestringer added the release-note/ci This PR makes changes to the CI. label Jul 2, 2021
@joestringer joestringer requested review from jibi and nebril July 2, 2021 18:36
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 2, 2021
@joestringer joestringer requested a review from pchaigno July 2, 2021 18:36
@joestringer
Copy link
Member Author

joestringer commented Jul 2, 2021

test-me-please

EDIT: All green.

test/k8sT/DatapathConfiguration.go Outdated Show resolved Hide resolved
In general up until now, Cilium has expected endpointRoutes mode to be
set to exactly one value upon deployment and for that value to stay the
same for the remainder of operation. Toggling it can lead to a mix of
endpoints in different datapath modes which is not well covered in CI.

In Github issue #16717 we observed that if the testsuite toggles this
setting then we can end up with kubedns pods remaining in endpoint
routes mode, even though the rest of the daemon (and other pods) are not
configured in this mode. This can lead to connectivity issues in DNS,
and a range of test failures in subsequent tests because DNS is broken.

Longer term to resolve this, we could improve on Cilium to ensure that
users can successfully toggle this setting on or off at runtime and
properly handle this case, or alternatively shift all logic over to
endpoint-routes mode by default and disable the other option.

Given that CI for the master branch is in a poor state due to this issue
today, and that part of the issue is CI reconfiguring the datapath state
of Cilium during the test setup in an unsupported manner, this commit
proposes to force DNS pod redeployment as part of setup any time a test
reconfigures the endpointRoutes mode. This should mitigate the testing
side issue while we mull over the right longer-term solution.

Signed-off-by: Joe Stringer <joe@cilium.io>
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.3 Jul 6, 2021
@joestringer
Copy link
Member Author

test-me-please

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version LGTM as well.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.9 Jul 6, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.9.9 Jul 6, 2021
@joestringer
Copy link
Member Author

joestringer commented Jul 6, 2021

On the fence about v1.9 backport, from out-of-band discussion with @pchaigno this flake is likely caused by #16227 which was backported to v1.9. I briefly looked through scheduled CI job runs for v1.9 and didn't observe the issue at all. We'll hold off for now. Now that we know how to detect this problem, it's easy to figure out if we see this particular issue. If we see it on v1.9 we can backport it.

@joestringer
Copy link
Member Author

Hit #16813, which seems at first glance like maybe #15459 resurfacing. Given that this is addressing one of the most common flakes in the tree today, bypassing the failing build to merge.

CI-3.0 not necessary to pass since this only changes ginkgo test code.

@nbusseneau
Copy link
Member

Marking for backport to 1.9 following discussion here: #18031 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/ci This PR makes changes to the CI.
Projects
No open projects
1.10.3
Backport done to v1.10
Development

Successfully merging this pull request may close these issues.

CI: K8sFQDNTest Restart Cilium validate that FQDN is still working: Error reaching kube-dns before test
7 participants