Flaky Antrea-upgrade tests #1608

Closed
Dyanngg opened this issue Dec 3, 2020 · 7 comments · Fixed by #1643
Assignees
tnqn
Labels
kind/bug Categorizes issue or PR as related to a bug.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

Dyanngg (Contributor) commented Dec 3, 2020

Upgrade tests from N-1/N-2 have been pretty flaky in recent PR pre-check runs (https://github.com/vmware-tanzu/antrea/actions?query=workflow%3A%22Antrea+upgrade%22) as well as in post-checkin runs on ToT.
Specifically, the NetworkPolicy test changed by #1586 has been failing at test/e2e/networkpolicy_test.go#L263.
It appears that the NetworkPolicy created before upgrading no longer works properly.
It might be related to the following comment: https://github.com/vmware-tanzu/antrea/pull/1586/files#r527796591
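
For readers not familiar with the test, the failing assertion boils down to a reachability probe between two Pods that the NetworkPolicy created before the upgrade should isolate. Below is a minimal, self-contained sketch of that kind of check; the Pod IP, port, and probe helper are illustrative placeholders, not the actual harness used in test/e2e/networkpolicy_test.go.

// Minimal sketch of the kind of reachability check the upgrade test performs.
// All names and addresses are placeholders; the real test in
// test/e2e/networkpolicy_test.go uses the Antrea e2e test harness instead of
// a raw TCP dial.
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP reports whether a TCP connection to addr can be established
// within the timeout, i.e. whether the traffic is allowed through.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Hypothetical server Pod IP and port; the NetworkPolicy created before
	// the upgrade is expected to keep dropping this traffic after the upgrade.
	serverAddr := "10.10.2.44:80"
	if probeTCP(serverAddr, 2*time.Second) {
		fmt.Println("FAIL: connection succeeded although the policy should drop it")
	} else {
		fmt.Println("OK: connection was dropped as expected")
	}
}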

Dyanngg added the kind/bug and priority/important-soon labels on Dec 3, 2020
Dyanngg (Contributor, Author) commented Dec 3, 2020

/cc @tnqn

tnqn (Member) commented Dec 3, 2020

@Dyanngg Thanks for filing it. I noticed this flake as well after changing the PodSelector from Everything to MatchExpressions; however, it recovered that day after I added 2 seconds of latency, and I have not been able to reproduce it even once on local testbeds, with or without the sleep.

I checked the logs from a CI failure: the agent received the policy almost at the same time the controller processed it, and we never expected a 2-second delay to realize a policy.

I will investigate more to find the root cause.

tnqn self-assigned this on Dec 3, 2020
antoninbas (Contributor) commented

I'm running into this issue a lot as well. It seems to happen more than 50% of the time.

tnqn (Member) commented Dec 10, 2020

After digging into the problem, I think the failure was caused by a netdev datapath implementation bug:
some datapath flow cache entries were not flushed after new flows that drop new connections were installed. For example:

Before installing the drop flows, 10.10.1.43 could connect to 10.10.2.44:

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=100;
 cookie=0x15000000000000, duration=1800.006s, table=100, n_packets=14, n_bytes=1036, priority=200,reg1=0x5 actions=drop
 cookie=0x15000000000000, duration=1800.006s, table=100, n_packets=15, n_bytes=1110, priority=200,reg1=0x7 actions=drop
 cookie=0x15000000000000, duration=1800.354s, table=100, n_packets=11, n_bytes=814, priority=0 actions=resubmit(,101)

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-appctl dpctl/dump-flows
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:3, bytes:198, used:1.571s, flags:F., actions:ct(zone=65520,nat),recirc(0x31)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),ct_state(+new-est-inv+trk),ct_mark(0),ct_label(0/0xffffffff),recirc_id(0x31),in_port(1),packet_type(ns=0,id=0),eth(src=7e:b9:a1:8b:f6:c1,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(src=10.10.1.43/255.255.254.0,dst=10.10.2.44,proto=6,ttl=63,frag=no),tcp(dst=80), packets:0, bytes:0, used:never, actions:set(eth(src=86:a0:34:80:2d:31,dst=96:d6:69:77:8c:8c)),set(ipv4(ttl=62)),ct(commit,zone=65520),recirc(0x32)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0x32),in_port(1),packet_type(ns=0,id=0),eth(dst=96:d6:69:77:8c:8c),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:8

After installing the drop flows, the corresponding dp flows were not flushed:

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=100;
 cookie=0x15000000000000, duration=1813.653s, table=100, n_packets=14, n_bytes=1036, priority=200,reg1=0x5 actions=drop
 cookie=0x15000000000000, duration=1813.653s, table=100, n_packets=15, n_bytes=1110, priority=200,reg1=0x7 actions=drop
 cookie=0x15000000000000, duration=1.740s, table=100, n_packets=0, n_bytes=0, priority=200,reg1=0x3 actions=drop
 cookie=0x15000000000000, duration=1814.001s, table=100, n_packets=12, n_bytes=888, priority=0 actions=resubmit(,101)

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-appctl dpctl/dump-flows
flow-dump from the main thread:
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:3, bytes:198, used:8.333s, flags:F., actions:ct(zone=65520,nat),recirc(0x31)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),ct_state(+new-est-inv+trk),ct_mark(0),ct_label(0/0xffffffff),recirc_id(0x31),in_port(1),packet_type(ns=0,id=0),eth(src=7e:b9:a1:8b:f6:c1,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(src=10.10.1.43/255.255.254.0,dst=10.10.2.44,proto=6,ttl=63,frag=no),tcp(dst=80), packets:0, bytes:0, used:never, actions:set(eth(src=86:a0:34:80:2d:31,dst=96:d6:69:77:8c:8c)),set(ipv4(ttl=62)),ct(commit,zone=65520),recirc(0x32)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0x32),in_port(1),packet_type(ns=0,id=0),eth(dst=96:d6:69:77:8c:8c),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:8

Then 10.10.1.43 could still connect to 10.10.2.44 because of the cache. Note that the stats of both the OpenFlow rule and the dp flow increased by 1 packet, even though the OpenFlow action was drop while the dp flow's action was to output to port 8.

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=100;
 cookie=0x15000000000000, duration=1820.491s, table=100, n_packets=14, n_bytes=1036, priority=200,reg1=0x5 actions=drop
 cookie=0x15000000000000, duration=1820.491s, table=100, n_packets=15, n_bytes=1110, priority=200,reg1=0x7 actions=drop
 cookie=0x15000000000000, duration=8.578s, table=100, n_packets=1, n_bytes=74, priority=200,reg1=0x3 actions=drop
 cookie=0x15000000000000, duration=1820.839s, table=100, n_packets=12, n_bytes=888, priority=0 actions=resubmit(,101)

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-appctl dpctl/dump-flows
flow-dump from the main thread:
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:7, bytes:470, used:2.652s, flags:SF., actions:ct(zone=65520,nat),recirc(0x31)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),ct_state(+new-est-inv+trk),ct_mark(0),ct_label(0/0xffffffff),recirc_id(0x31),in_port(1),packet_type(ns=0,id=0),eth(src=7e:b9:a1:8b:f6:c1,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(src=10.10.1.43/255.255.254.0,dst=10.10.2.44,proto=6,ttl=63,frag=no),tcp(dst=80), packets:1, bytes:74, used:2.655s, flags:S, actions:set(eth(src=86:a0:34:80:2d:31,dst=96:d6:69:77:8c:8c)),set(ipv4(ttl=62)),ct(commit,zone=65520),recirc(0x32)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0x32),in_port(1),packet_type(ns=0,id=0),eth(dst=96:d6:69:77:8c:8c),eth_type(0x0800),ipv4(frag=no), packets:1, bytes:74, used:2.655s, flags:S, actions:8

The bug only happened when the source and destination were on different Nodes, which may be why it didn't consistently fail or succeed. I'm trying to find a way to reproduce it without the K8s context so that it can be reproduced easily with just some flows.
But for Antrea's upgrade test, I guess we could work around it by waiting 10 seconds after pre-checking the traffic so that the cache can expire, as sketched below.
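
Below is a minimal sketch of that proposed workaround, reusing the same kind of raw TCP probe as the sketch earlier in this thread; the address and the 12-second sleep are illustrative assumptions, and the actual test fix went in via #1643. The idea is simply to wait longer than the OVS datapath flow idle timeout (roughly 10 seconds by default) after the traffic pre-check, so that the cached megaflows expire before the drop flows are exercised.

// Sketch of the proposed workaround, not the actual change in #1643.
// The probe below stands in for the real e2e connectivity checks.
package main

import (
	"fmt"
	"net"
	"time"
)

// connects reports whether a TCP connection to addr succeeds within 2 seconds.
func connects(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	serverAddr := "10.10.2.44:80" // placeholder Pod IP and port

	// 1. Pre-check the traffic. This is what populates the OVS datapath
	//    (megaflow) cache with entries that forward the connection.
	fmt.Println("pre-check reachable:", connects(serverAddr))

	// 2. Workaround: wait longer than the datapath flow idle timeout
	//    (roughly 10 seconds by default) so those cached entries expire
	//    and cannot mask drop flows installed later.
	time.Sleep(12 * time.Second)

	// 3. Once the drop flows are in place, the same probe should now fail
	//    instead of being short-circuited by a stale cached datapath flow.
	fmt.Println("post-check reachable:", connects(serverAddr))
}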

srikartati (Member) commented

Hi @tnqn, is this with a Geneve tunnel? From ./ci/kind/test-upgrade-antrea.sh it seems so.
If the tunnel is Geneve, this might be the same issue as #897.

antoninbas (Contributor) commented

@tnqn yes, that does seem like a duplicate of #897, which only affects Geneve tunnels. We already reached out to the OVS team a while ago with steps to reproduce it without K8s / Antrea, but they haven't gotten back to us with a fix. Once #1643 is closed, we can close this issue in favor of #897.

tnqn (Member) commented Dec 11, 2020

Thanks @srikartati and @antoninbas for the information! After looking at the symptoms of #897, I think it's the same issue.
