AntreaProxy failed to create ovs flows and groups for K8s service #2127
Comments
@dantingl thanks for reporting it.
And the flows for Service and Proxy cannot be reinstalled successfully.
It seems AntreaProxy cannot handle reconnection correctly. We should definitely improve this. @dantingl did you see the antrea-ovs container restart? If you still have the setup, could you check ovs-vswitch.log under "/var/log/antrea/openvswitch/" on that Node to see whether the process restarted around 00:47:38? I'd like to understand in which situation the reconnection happened.
Here is ovs-vswitch.log around 00:47:38
Thanks @dantingl. I can see there was only a single restart. Clearly this case was not handled properly. @hongliangl have you started fixing it? If not, I could work on it. The issue seems serious to me: once antrea-ovs is restarted (which could be due to a liveness probe failure or the OOM killer), all Pods on the Node basically lose connectivity because they cannot access any Services, including the cluster DNS Service. cc @antoninbas, we may want to backport this too.
Sorry, I have not started fixing it.
I talked to @tnqn. Since no one has started working on this, I can take a stab at fixing this today.
I found the issue ( |
The Group objects were not reset correctly when attempting to replay them, leading to confusing error log messages and invalid datapath state. We fix the implementation of Reset() for groups and we ensure that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more comprehensive: instead of just checking Pod-to-Pod connectivity after a replay, we ensure that the number of OVS flows / groups is the same before and after a restart / replay. We confirmed that the updated test fails when the patch is not applied.

Fixes antrea-io#2127
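To make the failure mode described in the commit message concrete, here is a minimal Go sketch. It is not Antrea's actual code: `fakeSwitch`, `group`, and `replay` are hypothetical names, and the assumption that a stale client-side "installed" flag turns a replayed add into a modify is only one plausible way a missing reset could break replay. The sketch also mirrors the idea behind the updated test by comparing the number of groups before and after a restart and replay.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSwitch is a hypothetical stand-in for an OVS bridge that only tracks group IDs.
type fakeSwitch struct {
	groups map[uint32]bool
}

func newFakeSwitch() *fakeSwitch { return &fakeSwitch{groups: map[uint32]bool{}} }

// restart models the switch process restarting: all programmed groups are lost.
func (s *fakeSwitch) restart() { s.groups = map[uint32]bool{} }

func (s *fakeSwitch) addGroup(id uint32) error {
	if s.groups[id] {
		return errors.New("group already exists")
	}
	s.groups[id] = true
	return nil
}

func (s *fakeSwitch) modifyGroup(id uint32) error {
	if !s.groups[id] {
		return errors.New("unknown group")
	}
	return nil
}

// group is a hypothetical client-side object that caches whether it was installed.
type group struct {
	id        uint32
	installed bool
}

// Add sends an "add" the first time and a "modify" once the group is believed installed.
func (g *group) Add(s *fakeSwitch) error {
	if g.installed {
		return s.modifyGroup(g.id)
	}
	if err := s.addGroup(g.id); err != nil {
		return err
	}
	g.installed = true
	return nil
}

// Reset clears the cached state so that the next Add is a real "add".
func (g *group) Reset() { g.installed = false }

// replay re-installs every cached group. callReset selects the fixed behavior.
func replay(s *fakeSwitch, cache []*group, callReset bool) error {
	for _, g := range cache {
		if callReset {
			g.Reset()
		}
		if err := g.Add(s); err != nil {
			return fmt.Errorf("replaying group %d: %w", g.id, err)
		}
	}
	return nil
}

func main() {
	s := newFakeSwitch()
	cache := []*group{{id: 1}, {id: 2}}
	for _, g := range cache {
		if err := g.Add(s); err != nil {
			panic(err)
		}
	}
	before := len(s.groups)

	// Without Reset, the stale "installed" flag makes the replay fail.
	s.restart()
	fmt.Println("replay without Reset:", replay(s, cache, false))

	// With Reset, the group count matches the count before the restart.
	s.restart()
	if err := replay(s, cache, true); err != nil {
		panic(err)
	}
	fmt.Printf("groups before=%d after=%d\n", before, len(s.groups))
}
```

Comparing counts rather than only probing connectivity catches the case where some traffic still works while part of the datapath state is missing.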
Describe the bug
A Pod failed to resolve names with nslookup.
To Reproduce
I hit this issue while running the K8s conformance tests; I am not sure how to reproduce it.
Expected
$ nslookup google.com
Server: 172.30.0.10
Address: 172.30.0.10#53
Non-authoritative answer:
Name: google.com
Address: 172.217.5.110
Name: google.com
Address: 2607:f8b0:4005:808::200e
Actual behavior
$ cat /etc/resolv.conf
search dns-8064.svc.cluster.local svc.cluster.local cluster.local upi.openshift.test
nameserver 172.30.0.10
options ndots:5
$ nslookup google.com
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; connection timed out; no servers could be reached
Versions:
antrea: 0.13.1
$ oc version
Client Version: 4.7.5
Server Version: 4.5.6
Kubernetes Version: v1.18.3+002a51f
$ uname -r
4.18.0-193.14.3.el8_2.x86_64
Additional context
antrea-agent log: https://gist.github.com/dantingl/9a1759abebb5d34b6c62b62ce82318eb
Restarting antrea-agent works around the issue.