
AntreaProxy failed to create ovs flows and groups for K8s service #2127

Closed
dantingl opened this issue Apr 26, 2021 · 6 comments · Fixed by #2134
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@dantingl

Describe the bug
A Pod failed to do nslookup.

To Reproduce
Hit this issue while running the K8s conformance tests; not sure how to reproduce it.

Expected
$ nslookup google.com
Server: 172.30.0.10
Address: 172.30.0.10#53

Non-authoritative answer:
Name: google.com
Address: 172.217.5.110
Name: google.com
Address: 2607:f8b0:4005:808::200e

Actual behavior
$ cat /etc/resolv.conf
search dns-8064.svc.cluster.local svc.cluster.local cluster.local upi.openshift.test
nameserver 172.30.0.10
options ndots:5
$ nslookup google.com
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; reply from unexpected source: 10.0.0.87#5353, expected 172.30.0.10#53
;; connection timed out; no servers could be reached

Versions:
antrea: 0.13.1

oc version
Client Version: 4.7.5
Server Version: 4.5.6
Kubernetes Version: v1.18.3+002a51f

uname -r
4.18.0-193.14.3.el8_2.x86_64

Additional context
antrea-agent log: https://gist.github.com/dantingl/9a1759abebb5d34b6c62b62ce82318eb
Restarting antrea-agent works around the issue.

@dantingl dantingl added the kind/bug Categorizes issue or PR as related to a bug. label Apr 26, 2021
@hongliangl hongliangl self-assigned this Apr 26, 2021
@tnqn
Member

tnqn commented Apr 28, 2021

@dantingl thanks for reporting it.
From the logs, the agent lost connection with ovs-vswitchd:

W0426 00:47:38.247751       1 entry.go:359] InboundError EOF
I0426 00:47:38.249500       1 entry.go:314] Closing OpenFlow message stream.
W0426 00:47:38.249562       1 entry.go:314] Received ERROR message from switch 00:00:7a:38:f8:88:27:49. Err: EOF
I0426 00:47:38.249609       1 ofctrl_bridge.go:240] OFSwitch is disconnected: 00:00:7a:38:f8:88:27:49
I0426 00:47:38.249638       1 entry.go:314] Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt.
I0426 00:47:39.249819       1 entry.go:314] Connected to socket /var/run/openvswitch/br-int.mgmt
I0426 00:47:39.249957       1 entry.go:359] New connection..
I0426 00:47:39.249992       1 entry.go:314] Send hello with OF version: 4
I0426 00:47:39.250420       1 entry.go:359] Received Openflow 1.3 Hello message
I0426 00:47:39.254422       1 entry.go:314] Received ofp1.3 Switch feature response: {Header:{Version:4 Type:6 Length:32 Xid:21524} DPID:00:00:7a:38:f8:88:27:49 Buffers:0 NumTables:254 AuxilaryId:0 pad:[0 0] Capabilities:79 Actions:0 Ports:[]}
I0426 00:47:39.254454       1 entry.go:359] Openflow Connection for new switch: 00:00:7a:38:f8:88:27:49
I0426 00:47:39.254853       1 ofctrl_bridge.go:220] OFSwitch is connected: 00:00:7a:38:f8:88:27:49

And the Service flows and groups installed by AntreaProxy could not be reinstalled successfully:

I0426 00:47:39.254969       1 agent.go:365] Replaying OF flows to OVS bridge
E0426 00:47:39.275588       1 client.go:688] Error when replaying cached group 1: message is canceled because of disconnection from the Switch
E0426 00:47:39.276873       1 client.go:688] Error when replaying cached group 50: message is canceled because of disconnection from the Switch
E0426 00:47:39.276890       1 client.go:688] Error when replaying cached group 53: message is canceled because of disconnection from the Switch
E0426 00:47:39.276902       1 client.go:688] Error when replaying cached group 20: message is canceled because of disconnection from the Switch
...
E0426 00:47:39.296232       1 entry.go:314] Received Vendor error: OFPBFC_MSG_FAILED on ONFT_BUNDLE_CONTROL message
E0426 00:47:39.296290       1 client.go:681] Error when replaying cached flows: one message in bundle failed
E0426 00:47:39.296326       1 entry.go:314] Received OpenFlow1.3 error: OFPBAC_BAD_OUT_GROUP on message OFPT_EXPERIMENTER
E0426 00:47:39.300134       1 entry.go:314] Received Vendor error: OFPBFC_MSG_FAILED on ONFT_BUNDLE_CONTROL message
...
E0426 00:47:39.528599       1 client.go:681] Error when replaying cached flows: one message in bundle failed
I0426 00:47:39.528621       1 agent.go:367] Flow replay completed
E0426 00:47:39.528685       1 entry.go:314] Received OpenFlow1.3 error: OFPBAC_BAD_OUT_GROUP on message OFPT_EXPERIMENTER
I0426 00:47:39.530647       1 agent.go:439] Cleaning up flow-restore-wait config
I0426 00:47:39.531103       1 agent.go:452] Cleaned up flow-restore-wait config

It seems AntreaProxy does not handle reconnection correctly. We should definitely fix this.
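
For reference, here is a minimal sketch (hypothetical types and names, not Antrea's actual code) of the ordering constraint the replay has to respect: OpenFlow groups must be installed before the flows that output to them, which is why, once the group replay failed above, the flow bundles referencing those groups were also rejected with OFPBAC_BAD_OUT_GROUP.

// Sketch only: illustrates the group-before-flow ordering during a replay
// after an OVS reconnection. All types and names here are hypothetical.
package ofreplay

import "log"

type ofEntry interface {
	Install() error // send the cached OpenFlow message to the switch
}

type replayer struct {
	groups map[uint32]ofEntry // keyed by OpenFlow group ID
	flows  []ofEntry          // may reference groups via "output to group" actions
}

func (r *replayer) replayAll() {
	// Groups first: a flow that outputs to a group unknown to the switch is
	// rejected with OFPBAC_BAD_OUT_GROUP.
	for id, g := range r.groups {
		if err := g.Install(); err != nil {
			log.Printf("error replaying cached group %d: %v", id, err)
		}
	}
	for _, f := range r.flows {
		if err := f.Install(); err != nil {
			log.Printf("error replaying cached flow: %v", err)
		}
	}
}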

@dantingl did you see the antrea-ovs container restart? If you still have the setup, could you help check ovs-vswitchd.log under "/var/log/antrea/openvswitch/" on that Node to see if the process restarted around 00:47:38? I'd like to understand in which situation the reconnection happened.

@tnqn tnqn added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 28, 2021
@tnqn tnqn added this to the Antrea v1.1 release milestone Apr 28, 2021
@dantingl
Author

Here is ovs-vswitchd.log around 00:47:38:
https://gist.github.com/dantingl/e2a087f12641110239ddff65748ff17c

@tnqn
Member

tnqn commented Apr 28, 2021

Thanks @dantingl. I can see there was only a single restart. Obviously this case was not handled properly.

@hongliangl Have you started fixing it? If not, I could work on it. The issue seems serious to me: once antrea-ovs is restarted (which could happen due to a liveness probe failure or the OOM killer), all Pods on the Node basically lose connectivity, as they cannot access any Services, including the cluster DNS Service.

cc @antoninbas, we may want to backport this too.

@hongliangl
Contributor

Sorry, I have not started fixing it.

@antoninbas
Contributor

I talked to @tnqn. Since no one has started working on this, I can take a stab at fixing this today.

@antoninbas antoninbas self-assigned this Apr 28, 2021
@antoninbas
Contributor

I found the issue (the Reset method for groups is not called during flow replay, and its implementation is incorrect). I will work on improving the TestOVSFlowReplay e2e test and submit a PR.
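
For readers following along, a rough sketch (hypothetical and heavily simplified, not the actual ofctrl code or the eventual patch) of what calling Reset() during replay has to achieve: each cached group must drop the state tied to the previous, now-closed connection so it can be sent again as a fresh add, instead of being "canceled because of disconnection from the Switch".

// Hypothetical sketch of the fix described above, not the actual patch:
// every cached group is reset against the new switch connection before
// being re-installed.
package ofreplay

type group struct {
	// cached buckets, group ID, a reference to the switch connection, etc.
	installed bool
}

// Reset clears state tied to the previous (now closed) connection so the
// group can be re-added on the new one.
func (g *group) Reset() {
	g.installed = false
}

// Add builds and sends the group-mod message on the current connection.
func (g *group) Add() error {
	g.installed = true
	return nil
}

func replayGroups(groups []*group) error {
	for _, g := range groups {
		g.Reset() // the step that was missing (and broken) before the fix
		if err := g.Add(); err != nil {
			return err
		}
	}
	return nil
}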

antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 28, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
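
As a rough illustration of the test change described in this commit message (hypothetical helper names, not the actual TestOVSFlowReplay code), the idea is to count OVS flows and groups on br-int before the restart and compare the counts after the replay:

// Hypothetical sketch of the before/after comparison; the real e2e test
// would run the dump commands inside the antrea-ovs container of the Node.
package e2e

import (
	"fmt"
	"os/exec"
	"strings"
)

// countEntries runs an ovs-ofctl dump command and returns the number of
// entries, skipping the one-line reply header.
func countEntries(args ...string) (int, error) {
	out, err := exec.Command("ovs-ofctl", args...).Output()
	if err != nil {
		return 0, err
	}
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	return len(lines) - 1, nil
}

func checkReplay(restartOVS func() error) error {
	flowsBefore, err := countEntries("-O", "OpenFlow13", "dump-flows", "br-int")
	if err != nil {
		return err
	}
	groupsBefore, err := countEntries("-O", "OpenFlow13", "dump-groups", "br-int")
	if err != nil {
		return err
	}
	// Restart antrea-ovs and wait for the agent to reconnect and replay.
	if err := restartOVS(); err != nil {
		return err
	}
	flowsAfter, err := countEntries("-O", "OpenFlow13", "dump-flows", "br-int")
	if err != nil {
		return err
	}
	groupsAfter, err := countEntries("-O", "OpenFlow13", "dump-groups", "br-int")
	if err != nil {
		return err
	}
	if flowsBefore != flowsAfter || groupsBefore != groupsAfter {
		return fmt.Errorf("flow/group count mismatch after replay: flows %d -> %d, groups %d -> %d",
			flowsBefore, flowsAfter, groupsBefore, groupsAfter)
	}
	return nil
}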
@antoninbas antoninbas added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 28, 2021
antoninbas added a commit that referenced this issue Apr 29, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 29, 2021
antoninbas added a commit that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit that referenced this issue Apr 30, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 30, 2021
antoninbas added a commit that referenced this issue May 1, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue May 1, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue May 1, 2021
antoninbas added a commit that referenced this issue May 3, 2021