
AntreaProxy failed to create ovs flows and groups for K8s service #2127

Closed
dantingl opened this issue Apr 26, 2021 · 6 comments · Fixed by #2134
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@dantingl

Describe the bug
A Pod failed to do nslookup.

To Reproduce
Hit this issue while running the K8s conformance tests; not sure how to reproduce it.

Expected
$ nslookup google.com
Server: 172.30.0.10
Address: 172.30.0.10#53

Non-authoritative answer:
Name: google.com
Address: 172.217.5.110
Name: google.com
Address: 2607:f8b0:4005:808::200e

Actual behavior
$ cat /etc/resolv.conf
search dns-8064.svc.cluster.local svc.cluster.local cluster.local upi.openshift.test
nameserver 172.30.0.10
options ndots:5
$ nslookup google.com
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; connection timed out; no servers could be reached

Versions:
antrea: 0.13.1

oc version
Client Version: 4.7.5
Server Version: 4.5.6
Kubernetes Version: v1.18.3+002a51f

uname -r
4.18.0-193.14.3.el8_2.x86_64

Additional context
antrea-agent log: https://gist.github.com/dantingl/9a1759abebb5d34b6c62b62ce82318eb
Restarting antrea-agent works around the issue.

@dantingl dantingl added the kind/bug Categorizes issue or PR as related to a bug. label Apr 26, 2021
@hongliangl hongliangl self-assigned this Apr 26, 2021
@tnqn
Member

tnqn commented Apr 28, 2021

@dantingl thanks for reporting it.
From the logs, the agent lost connection with ovs-vswitchd:

W0426 00:47:38.247751       1 entry.go:359] InboundError EOF
I0426 00:47:38.249500       1 entry.go:314] Closing OpenFlow message stream.
W0426 00:47:38.249562       1 entry.go:314] Received ERROR message from switch 00:00:7a:38:f8:88:27:49. Err: EOF
I0426 00:47:38.249609       1 ofctrl_bridge.go:240] OFSwitch is disconnected: 00:00:7a:38:f8:88:27:49
I0426 00:47:38.249638       1 entry.go:314] Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt.
I0426 00:47:39.249819       1 entry.go:314] Connected to socket /var/run/openvswitch/br-int.mgmt
I0426 00:47:39.249957       1 entry.go:359] New connection..
I0426 00:47:39.249992       1 entry.go:314] Send hello with OF version: 4
I0426 00:47:39.250420       1 entry.go:359] Received Openflow 1.3 Hello message
I0426 00:47:39.254422       1 entry.go:314] Received ofp1.3 Switch feature response: {Header:{Version:4 Type:6 Length:32 Xid:21524} DPID:00:00:7a:38:f8:88:27:49 Buffers:0 NumTables:254 AuxilaryId:0 pad:[0 0] Capabilities:79 Actions:0 Ports:[]}
I0426 00:47:39.254454       1 entry.go:359] Openflow Connection for new switch: 00:00:7a:38:f8:88:27:49
I0426 00:47:39.254853       1 ofctrl_bridge.go:220] OFSwitch is connected: 00:00:7a:38:f8:88:27:49

And the Service flows and groups installed by AntreaProxy could not be reinstalled successfully:

I0426 00:47:39.254969       1 agent.go:365] Replaying OF flows to OVS bridge
E0426 00:47:39.275588       1 client.go:688] Error when replaying cached group 1: message is canceled because of disconnection from the Switch
E0426 00:47:39.276873       1 client.go:688] Error when replaying cached group 50: message is canceled because of disconnection from the Switch
E0426 00:47:39.276890       1 client.go:688] Error when replaying cached group 53: message is canceled because of disconnection from the Switch
E0426 00:47:39.276902       1 client.go:688] Error when replaying cached group 20: message is canceled because of disconnection from the Switch
...
E0426 00:47:39.296232       1 entry.go:314] Received Vendor error: OFPBFC_MSG_FAILED on ONFT_BUNDLE_CONTROL message
E0426 00:47:39.296290       1 client.go:681] Error when replaying cached flows: one message in bundle failed
E0426 00:47:39.296326       1 entry.go:314] Received OpenFlow1.3 error: OFPBAC_BAD_OUT_GROUP on message OFPT_EXPERIMENTER
E0426 00:47:39.300134       1 entry.go:314] Received Vendor error: OFPBFC_MSG_FAILED on ONFT_BUNDLE_CONTROL message
...
E0426 00:47:39.528599       1 client.go:681] Error when replaying cached flows: one message in bundle failed
I0426 00:47:39.528621       1 agent.go:367] Flow replay completed
E0426 00:47:39.528685       1 entry.go:314] Received OpenFlow1.3 error: OFPBAC_BAD_OUT_GROUP on message OFPT_EXPERIMENTER
I0426 00:47:39.530647       1 agent.go:439] Cleaning up flow-restore-wait config
I0426 00:47:39.531103       1 agent.go:452] Cleaned up flow-restore-wait config

It seems AntreaProxy does not handle reconnection correctly. We should definitely fix this.
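
For reference, here is a minimal sketch (hypothetical types and names, not Antrea's actual code) of the ordering constraint the replay has to respect: OpenFlow groups must be installed before the flows that output to them, which is why, once the group replay failed above, the flow bundles referencing those groups were also rejected with OFPBAC_BAD_OUT_GROUP.

// Sketch only: illustrates the group-before-flow ordering during a replay
// after an OVS reconnection. All types and names here are hypothetical.
package ofreplay

import "log"

type ofEntry interface {
	Install() error // send the cached OpenFlow message to the switch
}

type replayer struct {
	groups map[uint32]ofEntry // keyed by OpenFlow group ID
	flows  []ofEntry          // may reference groups via "output to group" actions
}

func (r *replayer) replayAll() {
	// Groups first: a flow that outputs to a group unknown to the switch is
	// rejected with OFPBAC_BAD_OUT_GROUP.
	for id, g := range r.groups {
		if err := g.Install(); err != nil {
			log.Printf("error replaying cached group %d: %v", id, err)
		}
	}
	for _, f := range r.flows {
		if err := f.Install(); err != nil {
			log.Printf("error replaying cached flow: %v", err)
		}
	}
}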

@dantingl did you see the antrea-ovs container restart? If you still have the setup, could you help check ovs-vswitchd.log under "/var/log/antrea/openvswitch/" on that Node to see if the process restarted around 00:47:38? I'd like to understand in which situation the reconnection happened.

@tnqn tnqn added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 28, 2021
@tnqn tnqn added this to the Antrea v1.1 release milestone Apr 28, 2021
@dantingl
Author

Here is ovs-vswitchd.log around 00:47:38:
https://gist.github.com/dantingl/e2a087f12641110239ddff65748ff17c

@tnqn
Member

tnqn commented Apr 28, 2021

Thanks @dantingl. I can see there was only a single restart. Obviously this case was not handled properly.

@hongliangl Have you started fixing it? If not, I could work on it. The issue seems serious to me: once antrea-ovs is restarted (which could happen due to a liveness probe failure or the OOM killer), all Pods on the Node basically lose connectivity, as they cannot access any Services, including the cluster DNS Service.

cc @antoninbas, we may want to backport this too.

@hongliangl
Contributor

Sorry, I have not started fixing it.

@antoninbas
Contributor

I talked to @tnqn. Since no one has started working on this, I can take a stab at fixing this today.

@antoninbas antoninbas self-assigned this Apr 28, 2021
@antoninbas
Contributor

I found the issue (the Reset method for groups is not called during flow replay, and its implementation is incorrect). I will work on improving the TestOVSFlowReplay e2e test and submit a PR.
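
For readers following along, a rough sketch (hypothetical and heavily simplified, not the actual ofctrl code or the eventual patch) of what calling Reset() during replay has to achieve: each cached group must drop the state tied to the previous, now-closed connection so it can be sent again as a fresh add, instead of being "canceled because of disconnection from the Switch".

// Hypothetical sketch of the fix described above, not the actual patch:
// every cached group is reset against the new switch connection before
// being re-installed.
package ofreplay

type group struct {
	// cached buckets, group ID, a reference to the switch connection, etc.
	installed bool
}

// Reset clears state tied to the previous (now closed) connection so the
// group can be re-added on the new one.
func (g *group) Reset() {
	g.installed = false
}

// Add builds and sends the group-mod message on the current connection.
func (g *group) Add() error {
	g.installed = true
	return nil
}

func replayGroups(groups []*group) error {
	for _, g := range groups {
		g.Reset() // the step that was missing (and broken) before the fix
		if err := g.Add(); err != nil {
			return err
		}
	}
	return nil
}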

antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 28, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
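
As a rough illustration of the test change described in this commit message (hypothetical helper names, not the actual TestOVSFlowReplay code), the idea is to count OVS flows and groups on br-int before the restart and compare the counts after the replay:

// Hypothetical sketch of the before/after comparison; the real e2e test
// would run the dump commands inside the antrea-ovs container of the Node.
package e2e

import (
	"fmt"
	"os/exec"
	"strings"
)

// countEntries runs an ovs-ofctl dump command and returns the number of
// entries, skipping the one-line reply header.
func countEntries(args ...string) (int, error) {
	out, err := exec.Command("ovs-ofctl", args...).Output()
	if err != nil {
		return 0, err
	}
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	return len(lines) - 1, nil
}

func checkReplay(restartOVS func() error) error {
	flowsBefore, err := countEntries("-O", "OpenFlow13", "dump-flows", "br-int")
	if err != nil {
		return err
	}
	groupsBefore, err := countEntries("-O", "OpenFlow13", "dump-groups", "br-int")
	if err != nil {
		return err
	}
	// Restart antrea-ovs and wait for the agent to reconnect and replay.
	if err := restartOVS(); err != nil {
		return err
	}
	flowsAfter, err := countEntries("-O", "OpenFlow13", "dump-flows", "br-int")
	if err != nil {
		return err
	}
	groupsAfter, err := countEntries("-O", "OpenFlow13", "dump-groups", "br-int")
	if err != nil {
		return err
	}
	if flowsBefore != flowsAfter || groupsBefore != groupsAfter {
		return fmt.Errorf("flow/group count mismatch after replay: flows %d -> %d, groups %d -> %d",
			flowsBefore, flowsAfter, groupsBefore, groupsAfter)
	}
	return nil
}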
@antoninbas antoninbas added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 28, 2021
antoninbas added a commit that referenced this issue Apr 29, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 29, 2021
antoninbas added a commit that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit that referenced this issue May 1, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue May 1, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue May 1, 2021
antoninbas added a commit that referenced this issue May 3, 2021