CI: K8sFQDNTest Restart Cilium validate that FQDN is still working: Error reaching kube-dns before test #16717
Symptoms
All DNS connections from [...] We can trace one request, for example. On the source node:
On the destination node:
Monitor aggregation is enabled, so only [...]
Datapath Analysis
Endpoint routes are disabled, so these [...] We can first get the [...]
We can then check the BPF programs attached to the node:
Here, we see that, contrary to other containers, the CoreDNS pod actually has two BPF programs attached, one at ingress and one at egress. That should be the case only when endpoint routes are enabled. Similarly, in the routes, we can see that it's the only endpoint with a route:
Therefore the DNS packet is sent to the stack. It flows through netfilter and hits the
Since endpoint routes are disabled in the agent, rules installed in [...]
Root Cause
This is actually a known limitation of Cilium since #16227: changing the status of endpoint routes on an existing Cilium installation is not supported, even though we do it in CI. If endpoint routes are enabled or disabled in the agent, the setting is not reflected in existing endpoints (including the CoreDNS endpoint in our case). We usually work around it by deleting existing pods so that their routes are reinstalled from scratch. That would be a short-term solution here. |
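To make the workaround concrete, here is a minimal sketch of the check it implies: compare each endpoint's endpoint-route mode against the agent's current setting and list the pods that would need to be deleted and recreated. The `Endpoint` type, the `stale_endpoints` helper, and the pod names are hypothetical and only model the reasoning above; they are not Cilium APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    pod: str
    endpoint_route: bool  # mode the endpoint was created with (assumed to be known)


def stale_endpoints(agent_endpoint_routes, endpoints):
    """Return pods whose endpoint-route mode differs from the agent's current setting."""
    return [ep.pod for ep in endpoints if ep.endpoint_route != agent_endpoint_routes]


# Hypothetical cluster state: CoreDNS was created while endpoint routes were enabled.
endpoints = [
    Endpoint("kube-system/coredns", endpoint_route=True),
    Endpoint("default/app1", endpoint_route=False),
]

# The agent now runs with endpoint routes disabled.
stale = stale_endpoints(False, endpoints)
print(stale)
```

Deleting the listed pods would let them come back in the agent's current mode, which is the workaround described above.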
This would only happen in our 4.9 CI job because of the following condition: Lines 1149 to 1156 in e6f34c3
So if ENABLE_NODEPORT is defined (it is only undefined on 4.9), we redirect the packet to the lxc device instead of passing it to the stack, and we therefore skip the FORWARD-filter table.
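The branch described above can be summarized as a small decision model. This is only a sketch of the behaviour as stated in this comment, not the actual datapath condition (the referenced C lines are not reproduced here); the `Delivery` enum and function names are made up for illustration.

```python
from enum import Enum


class Delivery(Enum):
    REDIRECT_TO_LXC = "redirect to lxc device"
    PASS_TO_STACK = "pass to stack"


def delivery_for_packet(enable_nodeport):
    """Model of the described branch: with ENABLE_NODEPORT defined the packet is
    redirected to the lxc device, otherwise it is handed to the kernel stack."""
    return Delivery.REDIRECT_TO_LXC if enable_nodeport else Delivery.PASS_TO_STACK


def traverses_forward_chain(enable_nodeport):
    """Only packets passed to the stack go through the netfilter FORWARD chain."""
    return delivery_for_packet(enable_nodeport) is Delivery.PASS_TO_STACK


# On the 4.9 job ENABLE_NODEPORT is undefined, so the FORWARD chain is hit.
print(traverses_forward_chain(enable_nodeport=False))
```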
|
I believe that K8sDatapathConfig is the only suite which may switch up endpoint routes mode on/off, and in this particular failure case the K8sDatapathConfig tests ran immediately prior to K8sFQDN. I wonder if we're just not cleaning up the environment properly enough in the AfterAll / BeforeAll steps of one of these two contexts. This could explain why we don't see the failure more often - it requires particular groups of tests to be run in a particular order. |
Yep, but checking if DNS resolves is the first thing we do after any Cilium deployment AFAIK. So any test without endpoint routes running after a test with endpoint routes should fail. |
Seems like that should be established as part of this path: Line 76 in eb9a5c4
... cilium/test/k8sT/assertionHelpers.go Line 177 in eb9a5c4
... cilium/test/helpers/kubectl.go Line 2048 in eb9a5c4
... cilium/test/helpers/kubectl.go Line 1973 in eb9a5c4
However the failing line is later than this, so whatever check we did above was functionally different from the actual DNS lookup. Line 89 in eb9a5c4
EDIT: Yep, we validate DNS from one of the hosts, not from pods: cilium/test/helpers/kubectl.go Lines 1747 to 1748 in eb9a5c4
I don't know off-hand how different host DNS resolution is but it may provide some hints here. |
Hm. DNS resolution from a hostns pod should be the same as long as it's on a different node than the CoreDNS pod. Once the request reaches the destination node via the tunnel it's basically indistinguishable from a request from a pod (except from policy point of view, but we're not concerned with that here). |
I've looked into the following failures and also observed that there are per-endpoint routes for the DNS pod:
https://jenkins.cilium.io/job/cilium-master-k8s-1.21-kernel-4.9/544/testReport/junit/Suite-k8s-1/21/K8sDemosTest_Tests_Star_Wars_Demo/
Here's a one-liner I've been using to establish which endpoints are configured with endpoint-routes mode in a CI sysdump:
EDIT: Oh and here's some useful pointers to get the repro above to work: https://github.com/tomnomnom/gron
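The original one-liner is not preserved in this thread. As a rough stand-in, the sketch below does the kind of search that gron enables: it flattens a JSON document into paths and reports every occurrence of a given key. The key name `install-endpoint-route`, the `datapath-configuration` wrapper, and the sample document are assumptions about what an endpoint dump might contain, not a confirmed sysdump layout.

```python
import json


def find_key(node, key, path="json"):
    """Yield (path, value) for every occurrence of `key`, using gron-like paths."""
    if isinstance(node, dict):
        for k, v in node.items():
            # Keys that are not plain identifiers are quoted, as gron does.
            child = f"{path}.{k}" if k.isidentifier() else f"{path}[{json.dumps(k)}]"
            if k == key:
                yield child, v
            yield from find_key(v, key, child)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from find_key(v, key, f"{path}[{i}]")


# Hypothetical endpoint dump; real sysdump files may be structured differently.
sample = json.loads(
    '[{"id": 1, "datapath-configuration": {"install-endpoint-route": true}},'
    ' {"id": 2, "datapath-configuration": {"install-endpoint-route": false}}]'
)

matches = list(find_key(sample, "install-endpoint-route"))
for path, value in matches:
    print(f"{path} = {json.dumps(value)};")
```

Passing the key as a parameter keeps the sketch usable if the field turns out to be named differently in a given dump.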
|
In general up until now, Cilium has expected endpointRoutes mode to be set to exactly one value upon deployment and for that value to stay the same for the remainder of operation. Toggling it can lead to a mix of endpoints in different datapath modes which is not well covered in CI.
In Github issue #16717 we observed that if the testsuite toggles this setting then we can end up with kubedns pods remaining in endpoint routes mode, even though the rest of the daemon (and other pods) are not configured in this mode. This can lead to connectivity issues in DNS, and a range of test failures in subsequent tests because DNS is broken.
Longer term to resolve this, we could improve on Cilium to ensure that users can successfully toggle this setting on or off at runtime and properly handle this case, or alternatively shift all logic over to endpoint-routes mode by default and disable the other option.
Given that CI for the master branch is in a poor state due to this issue today, and that part of the issue is CI reconfiguring the datapath state of Cilium during the test setup in an unsupported manner, this commit proposes to force DNS pod redeployment as part of setup any time a test reconfigures the endpointRoutes mode. This should mitigate the testing side issue while we mull over the right longer-term solution.
Signed-off-by: Joe Stringer <joe@cilium.io>
Commit a0e7712 ("test: Redeploy DNS after changing endpointRoutes") didn't go quite far enough: It ensured that between individual tests in a given file, the DNS pods would be redeployed during the next run if there were significant enough datapath changes. However, the way it did this was by storing state within the 'kubectl' variable, which is recreated in each test file. So if the last test in one CI run enabled endpoint routes mode, then the DNS pods would not be redeployed to disable endpoint routes mode as part of the next test.
Fix it by redeploying DNS after removing Cilium from the cluster. Kubernetes will remove the current DNS pods and reschedule them, but they will not launch until the next test deploys a new version of Cilium.
Reported-by: Chris Tarazi <chris@isovalent.com>
Fixes: 0e77127dcd7 ("test: Redeploy DNS after changing endpointRoutes")
Related: cilium#16717
Signed-off-by: Joe Stringer <joe@cilium.io>
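The gap described in this commit message can be reproduced with a toy model. The sketch below is not the Go test helper: `Cluster`, `Kubectl`, and `run_two_files` are hypothetical stand-ins, and the assumption that DNS endpoints keep the mode of the agent that was running when they were created is taken from the analysis in this issue.

```python
class Cluster:
    """Toy cluster: DNS endpoints keep the mode they were created with."""

    def __init__(self):
        self.agent_mode = None  # endpointRoutes setting of the running agent
        self.dns_mode = None    # None means the DNS pods are pending

    def install_cilium(self, endpoint_routes):
        self.agent_mode = endpoint_routes
        if self.dns_mode is None:
            # Pending DNS pods launch now and pick up the agent's mode.
            self.dns_mode = endpoint_routes

    def uninstall_cilium(self, redeploy_dns):
        self.agent_mode = None
        if redeploy_dns:
            # Second fix: DNS pods are deleted and wait for the next install.
            self.dns_mode = None


class Kubectl:
    """Toy per-test-file helper; its tracking state dies with the instance."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.last_mode = None

    def deploy(self, endpoint_routes):
        if self.last_mode is not None and self.last_mode != endpoint_routes:
            # First fix: redeploy DNS when this helper saw a mode change.
            self.cluster.dns_mode = None
        self.last_mode = endpoint_routes
        self.cluster.install_cilium(endpoint_routes)


def run_two_files(redeploy_on_uninstall):
    """Return True if DNS endpoints match the agent mode after two test files."""
    cluster = Cluster()
    Kubectl(cluster).deploy(True)                # last test of the first file
    cluster.uninstall_cilium(redeploy_on_uninstall)
    Kubectl(cluster).deploy(False)               # fresh helper in the next file
    return cluster.dns_mode == cluster.agent_mode


print(run_two_files(redeploy_on_uninstall=False))
print(run_two_files(redeploy_on_uninstall=True))
```

Tying the redeploy to Cilium removal rather than to helper state makes the cleanup independent of which test file ran previously.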
[ upstream commit a0e7712 ] In general up until now, Cilium has expected endpointRoutes mode to be set to exactly one value upon deployment and for that value to stay the same for the remainder of operation. Toggling it can lead to a mix of endpoints in different datapath modes which is not well covered in CI. In Github issue cilium#16717 we observed that if the testsuite toggles this setting then we can end up with kubedns pods remaining in endpoint routes mode, even though the rest of the daemon (and other pods) are not configured in this mode. This can lead to connectivity issues in DNS, and a range of test failures in subsequent tests because DNS is broken. Longer term to resolve this, we could improve on Cilium to ensure that users can successfully toggle this setting on or off at runtime and properly handle this case, or alternatively shift all logic over to endpoint-routes mode by default and disable the other option. Given that CI for the master branch is in a poor state due to this issue today, and that part of the issue is CI reconfiguring the datapath state of Cilium during the test setup in an unsupported manner, this commit proposes to force DNS pod redeployment as part of setup any time a test reconfigures the endpointRoutes mode. This should mitigate the testing side issue while we mull over the right longer-term solution. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c18cfc8 ] Commit a0e7712 ("test: Redeploy DNS after changing endpointRoutes") didn't go quite far enough: It ensured that between individual tests in a given file, the DNS pods would be redeployed during the next run if there were significant enough datapath changes. However, the way it did this was by storing state within the 'kubectl' variable, which is recreated in each test file. So if the last test in one CI run enabled endpoint routes mode, then the DNS pods would not be redeployed to disable endpoint routes mode as part of the next test. Fix it by redeploying DNS after removing Cilium from the cluster. Kubernetes will remove the current DNS pods and reschedule them, but they will not launch until the next test deploys a new version of Cilium. Reported-by: Chris Tarazi <chris@isovalent.com> Fixes: 0e77127dcd7 ("test: Redeploy DNS after changing endpointRoutes") Related: #16717 Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: André Martins <andre@cilium.io>
[ upstream commit a0e7712 ] In general up until now, Cilium has expected endpointRoutes mode to be set to exactly one value upon deployment and for that value to stay the same for the remainder of operation. Toggling it can lead to a mix of endpoints in different datapath modes which is not well covered in CI. In Github issue #16717 we observed that if the testsuite toggles this setting then we can end up with kubedns pods remaining in endpoint routes mode, even though the rest of the daemon (and other pods) are not configured in this mode. This can lead to connectivity issues in DNS, and a range of test failures in subsequent tests because DNS is broken. Longer term to resolve this, we could improve on Cilium to ensure that users can successfully toggle this setting on or off at runtime and properly handle this case, or alternatively shift all logic over to endpoint-routes mode by default and disable the other option. Given that CI for the master branch is in a poor state due to this issue today, and that part of the issue is CI reconfiguring the datapath state of Cilium during the test setup in an unsupported manner, this commit proposes to force DNS pod redeployment as part of setup any time a test reconfigures the endpointRoutes mode. This should mitigate the testing side issue while we mull over the right longer-term solution. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit c18cfc8 ] Commit a0e7712 ("test: Redeploy DNS after changing endpointRoutes") didn't go quite far enough: It ensured that between individual tests in a given file, the DNS pods would be redeployed during the next run if there were significant enough datapath changes. However, the way it did this was by storing state within the 'kubectl' variable, which is recreated in each test file. So if the last test in one CI run enabled endpoint routes mode, then the DNS pods would not be redeployed to disable endpoint routes mode as part of the next test. Fix it by redeploying DNS after removing Cilium from the cluster. Kubernetes will remove the current DNS pods and reschedule them, but they will not launch until the next test deploys a new version of Cilium. Reported-by: Chris Tarazi <chris@isovalent.com> Fixes: 0e77127dcd7 ("test: Redeploy DNS after changing endpointRoutes") Related: #16717 Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
https://jenkins.cilium.io/job/cilium-master-k8s-1.17-kernel-4.9/132/testReport/Suite-k8s-1/17/K8sFQDNTest_Restart_Cilium_validate_that_FQDN_is_still_working/
3205f837_K8sFQDNTest_Restart_Cilium_validate_that_FQDN_is_still_working.zip
Stacktrace
Standard Output
Standard Error