Flaky Antrea-upgrade tests #1608

Closed
Dyanngg opened this issue Dec 3, 2020 · 7 comments · Fixed by #1643
Assignees
tnqn
Labels
kind/bug Categorizes issue or PR as related to a bug.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

Dyanngg (Contributor) commented Dec 3, 2020

Upgrade tests from N-1/N-2 have been pretty flaky in recent PR pre-check runs (https://github.com/vmware-tanzu/antrea/actions?query=workflow%3A%22Antrea+upgrade%22) as well as in post-checkin runs on ToT.
Specifically, the NetworkPolicy test changed by #1586 has been failing at test/e2e/networkpolicy_test.go#L263.
It appears that the NetworkPolicy created before upgrading no longer works properly.
It might be related to the following comment: https://github.com/vmware-tanzu/antrea/pull/1586/files#r527796591
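
For readers not familiar with the test, the failing assertion boils down to a reachability probe between two Pods that the NetworkPolicy created before the upgrade should isolate. Below is a minimal, self-contained sketch of that kind of check; the Pod IP, port, and probe helper are illustrative placeholders, not the actual harness used in test/e2e/networkpolicy_test.go.

// Minimal sketch of the kind of reachability check the upgrade test performs.
// All names and addresses are placeholders; the real test in
// test/e2e/networkpolicy_test.go uses the Antrea e2e test harness instead of
// a raw TCP dial.
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP reports whether a TCP connection to addr can be established
// within the timeout, i.e. whether the traffic is allowed through.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Hypothetical server Pod IP and port; the NetworkPolicy created before
	// the upgrade is expected to keep dropping this traffic after the upgrade.
	serverAddr := "10.10.2.44:80"
	if probeTCP(serverAddr, 2*time.Second) {
		fmt.Println("FAIL: connection succeeded although the policy should drop it")
	} else {
		fmt.Println("OK: connection was dropped as expected")
	}
}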

Dyanngg added the kind/bug and priority/important-soon labels on Dec 3, 2020
Dyanngg (Contributor, Author) commented Dec 3, 2020

/cc @tnqn

tnqn (Member) commented Dec 3, 2020

@Dyanngg Thanks for filing it. I noticed this flake as well after changing the PodSelector from Everything to MatchExpressions; however, it recovered that day after I added 2 seconds of latency, and I have not been able to reproduce it even once on local testbeds, with or without the sleep.

I checked the logs from a CI failure: the agent received the policy almost at the same time the controller processed it, and we never expected a 2-second delay to realize a policy.

I will investigate more to find the root cause.

tnqn self-assigned this on Dec 3, 2020
antoninbas (Contributor) commented

I'm running into this issue a lot as well. It seems to happen more than 50% of the time.

tnqn (Member) commented Dec 10, 2020

After digging into the problem, I think the failure was caused by a netdev datapath implementation bug:
some datapath flow cache entries were not flushed after new flows that drop new connections were installed. For example:

Before installing the drop flows, 10.10.1.43 could connect to 10.10.2.44:

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=100;
 cookie=0x15000000000000, duration=1800.006s, table=100, n_packets=14, n_bytes=1036, priority=200,reg1=0x5 actions=drop
 cookie=0x15000000000000, duration=1800.006s, table=100, n_packets=15, n_bytes=1110, priority=200,reg1=0x7 actions=drop
 cookie=0x15000000000000, duration=1800.354s, table=100, n_packets=11, n_bytes=814, priority=0 actions=resubmit(,101)

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-appctl dpctl/dump-flows
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:3, bytes:198, used:1.571s, flags:F., actions:ct(zone=65520,nat),recirc(0x31)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),ct_state(+new-est-inv+trk),ct_mark(0),ct_label(0/0xffffffff),recirc_id(0x31),in_port(1),packet_type(ns=0,id=0),eth(src=7e:b9:a1:8b:f6:c1,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(src=10.10.1.43/255.255.254.0,dst=10.10.2.44,proto=6,ttl=63,frag=no),tcp(dst=80), packets:0, bytes:0, used:never, actions:set(eth(src=86:a0:34:80:2d:31,dst=96:d6:69:77:8c:8c)),set(ipv4(ttl=62)),ct(commit,zone=65520),recirc(0x32)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0x32),in_port(1),packet_type(ns=0,id=0),eth(dst=96:d6:69:77:8c:8c),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:8

After installing the drop flows, the corresponding dp flows were not flushed:

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=100;
 cookie=0x15000000000000, duration=1813.653s, table=100, n_packets=14, n_bytes=1036, priority=200,reg1=0x5 actions=drop
 cookie=0x15000000000000, duration=1813.653s, table=100, n_packets=15, n_bytes=1110, priority=200,reg1=0x7 actions=drop
 cookie=0x15000000000000, duration=1.740s, table=100, n_packets=0, n_bytes=0, priority=200,reg1=0x3 actions=drop
 cookie=0x15000000000000, duration=1814.001s, table=100, n_packets=12, n_bytes=888, priority=0 actions=resubmit(,101)

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-appctl dpctl/dump-flows
flow-dump from the main thread:
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:3, bytes:198, used:8.333s, flags:F., actions:ct(zone=65520,nat),recirc(0x31)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),ct_state(+new-est-inv+trk),ct_mark(0),ct_label(0/0xffffffff),recirc_id(0x31),in_port(1),packet_type(ns=0,id=0),eth(src=7e:b9:a1:8b:f6:c1,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(src=10.10.1.43/255.255.254.0,dst=10.10.2.44,proto=6,ttl=63,frag=no),tcp(dst=80), packets:0, bytes:0, used:never, actions:set(eth(src=86:a0:34:80:2d:31,dst=96:d6:69:77:8c:8c)),set(ipv4(ttl=62)),ct(commit,zone=65520),recirc(0x32)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0x32),in_port(1),packet_type(ns=0,id=0),eth(dst=96:d6:69:77:8c:8c),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:8

Then 10.10.1.43 could still connect to 10.10.2.44 because of the cache. Note that the stats of both the OpenFlow rule and the dp flow increased by 1 packet, even though the OpenFlow action was drop while the dp flow's action was to output to port 8.

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=100;
 cookie=0x15000000000000, duration=1820.491s, table=100, n_packets=14, n_bytes=1036, priority=200,reg1=0x5 actions=drop
 cookie=0x15000000000000, duration=1820.491s, table=100, n_packets=15, n_bytes=1110, priority=200,reg1=0x7 actions=drop
 cookie=0x15000000000000, duration=8.578s, table=100, n_packets=1, n_bytes=74, priority=200,reg1=0x3 actions=drop
 cookie=0x15000000000000, duration=1820.839s, table=100, n_packets=12, n_bytes=888, priority=0 actions=resubmit(,101)

root@k8s-03:~# kubectl exec -it $agent -n kube-system -c antrea-ovs -- ovs-appctl dpctl/dump-flows
flow-dump from the main thread:
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:7, bytes:470, used:2.652s, flags:SF., actions:ct(zone=65520,nat),recirc(0x31)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),ct_state(+new-est-inv+trk),ct_mark(0),ct_label(0/0xffffffff),recirc_id(0x31),in_port(1),packet_type(ns=0,id=0),eth(src=7e:b9:a1:8b:f6:c1,dst=aa:bb:cc:dd:ee:ff),eth_type(0x0800),ipv4(src=10.10.1.43/255.255.254.0,dst=10.10.2.44,proto=6,ttl=63,frag=no),tcp(dst=80), packets:1, bytes:74, used:2.655s, flags:S, actions:set(eth(src=86:a0:34:80:2d:31,dst=96:d6:69:77:8c:8c)),set(ipv4(ttl=62)),ct(commit,zone=65520),recirc(0x32)
tunnel(tun_id=0x0,src=172.18.0.4,dst=172.18.0.3,flags(-df+csum+key)),recirc_id(0x32),in_port(1),packet_type(ns=0,id=0),eth(dst=96:d6:69:77:8c:8c),eth_type(0x0800),ipv4(frag=no), packets:1, bytes:74, used:2.655s, flags:S, actions:8

The bug only happened when the source and destination were on different Nodes, which may be why it didn't consistently fail or succeed. I'm trying to find a way to reproduce it without the K8s context so that it can be reproduced easily with just some flows.
But for Antrea's upgrade test, I guess we could work around it by waiting 10 seconds after pre-checking the traffic so that the cache can expire, as sketched below.
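
Below is a minimal sketch of that proposed workaround, reusing the same kind of raw TCP probe as the sketch earlier in this thread; the address and the 12-second sleep are illustrative assumptions, and the actual test fix went in via #1643. The idea is simply to wait longer than the OVS datapath flow idle timeout (roughly 10 seconds by default) after the traffic pre-check, so that the cached megaflows expire before the drop flows are exercised.

// Sketch of the proposed workaround, not the actual change in #1643.
// The probe below stands in for the real e2e connectivity checks.
package main

import (
	"fmt"
	"net"
	"time"
)

// connects reports whether a TCP connection to addr succeeds within 2 seconds.
func connects(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	serverAddr := "10.10.2.44:80" // placeholder Pod IP and port

	// 1. Pre-check the traffic. This is what populates the OVS datapath
	//    (megaflow) cache with entries that forward the connection.
	fmt.Println("pre-check reachable:", connects(serverAddr))

	// 2. Workaround: wait longer than the datapath flow idle timeout
	//    (roughly 10 seconds by default) so those cached entries expire
	//    and cannot mask drop flows installed later.
	time.Sleep(12 * time.Second)

	// 3. Once the drop flows are in place, the same probe should now fail
	//    instead of being short-circuited by a stale cached datapath flow.
	fmt.Println("post-check reachable:", connects(serverAddr))
}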

srikartati (Member) commented

Hi @tnqn, is this with a Geneve tunnel? From ./ci/kind/test-upgrade-antrea.sh it seems so.
If the tunnel is Geneve, this might be the same issue as #897.

antoninbas (Contributor) commented

@tnqn yes, that does seem like a duplicate of #897, which only affects Geneve tunnels. We already reached out to the OVS team a while ago with steps to reproduce it without K8s / Antrea, but they haven't gotten back to us with a fix. Once #1643 is closed, we can close this issue in favor of #897.

tnqn (Member) commented Dec 11, 2020

Thanks @srikartati and @antoninbas for the information! After looking at the symptoms of #897, I think it's the same issue.
