
test/K8sServices: send datagrams in one block for fragment support tests #11016

Merged
merged 1 commit into master from pr/qmonnet/flake_ipfrag on Apr 16, 2020



@qmonnet qmonnet commented Apr 16, 2020

Tests for IPv4 fragment support introduced a flake in the CI. The test consists of sending a fragmented datagram and checking (from the conntrack table) that all fragments were processed as expected. But sometimes an additional packet is observed, leading to a failure with a message like the following:

    Failed to account for IPv4 fragments (in)
         Expected
             <[]int | len:2, cap:2>: [21, 20]
         To satisfy at least one of these matchers: [%!s(*matchers.EqualMatcher=&{[16 24]}) %!s(*matchers.EqualMatcher=&{[20 20]})]

A normal datagram, as seen by the destination pod, looks like:

    09:02:22.178149 IP (tos 0x0, ttl 63, id 61115, offset 0, flags [+], proto UDP (17), length 1444)
        10.10.0.230.12345 > testds-smpbw.69:  1416 tftp-#0
    09:02:22.178151 IP (tos 0x0, ttl 63, id 61115, offset 1424, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:22.178233 IP (tos 0x0, ttl 63, id 61115, offset 2848, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:22.178265 IP (tos 0x0, ttl 63, id 61115, offset 4272, flags [none], proto UDP (17), length 876)
        10.10.0.230 > testds-smpbw: udp
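As a sanity check on this capture, the fragment lengths add up to the 5120 bytes of data we send plus the 8-byte UDP header, once the 20-byte IP header of each fragment is subtracted:

```shell
# Each fragment carries its IP length minus the 20-byte IP header:
# three full fragments of 1444 bytes and a last one of 876 bytes.
echo $((3 * (1444 - 20) + (876 - 20)))   # 5128 == 5120 bytes of data + 8-byte UDP header
```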

When reproducing the flake, we could observe the additional packet:

    09:02:26.535728 IP (tos 0x0, ttl 63, id 61232, offset 0, flags [DF], proto UDP (17), length 540)
        10.10.0.230.12345 > testds-smpbw.69: [udp sum ok]  512 tftp-#0
    09:02:26.536103 IP (tos 0x0, ttl 63, id 61233, offset 0, flags [+], proto UDP (17), length 1444)
        10.10.0.230.12345 > testds-smpbw.69:  1416 tftp-#0
    09:02:26.536162 IP (tos 0x0, ttl 63, id 61233, offset 1424, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:26.536274 IP (tos 0x0, ttl 63, id 61233, offset 2848, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:26.536422 IP (tos 0x0, ttl 63, id 61233, offset 4272, flags [none], proto UDP (17), length 364)
        10.10.0.230 > testds-smpbw: udp

We note that the data is split into two datagrams, the first packet being standalone and the rest being fragmented. The total length received is the same (3*1444 + 540 + 364 == 3*1444 + 876 + sizeof(IP, UDP headers)).
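That equality works out because splitting the data into two datagrams costs exactly one extra set of headers, 20 bytes of IP plus 8 bytes of UDP:

```shell
# Left-hand side: standalone first packet (540) plus shortened last fragment (364).
echo $((540 + 364))      # 904
# Right-hand side: original last fragment (876) plus one extra IP (20) + UDP (8) header.
echo $((876 + 20 + 8))   # 904
```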

The fact that the first packet is 512 bytes long hints at the probable cause of the flake. We send the packets with netcat, but source the data from /dev/zero with dd as follows:

dd if=/dev/zero bs=512 count=10 | nc ...

Most of the time, the blocks written by dd are passed quickly enough that netcat processes them in one go. But if the machine is under heavier load at that moment, a small latency may be introduced between the blocks, and netcat then sends the data in several chunks (datagrams).

Let's solve this by copying from /dev/zero in just one block.
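A minimal sketch of the idea (the exact sizes used in the patch are an assumption here; the point is a single write of the same total payload):

```shell
# Before (flaky): ten 512-byte writes; a delay between them can make
# netcat split the payload across several datagrams.
dd if=/dev/zero bs=512 count=10 2>/dev/null | wc -c    # 5120 bytes

# After (sketch of the fix): one 5120-byte write, so netcat can pick up
# the whole payload in a single read and send a single datagram.
dd if=/dev/zero bs=5120 count=1 2>/dev/null | wc -c    # 5120 bytes
```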

Fixes: #10929

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet qmonnet added pending-review kind/bug/CI This is a bug in the testing code. area/CI Continuous Integration testing issue or flake release-note/misc This PR makes changes that have no direct user impact. labels Apr 16, 2020
@qmonnet qmonnet requested a review from a team as a code owner April 16, 2020 14:05
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Apr 16, 2020

qmonnet commented Apr 16, 2020

test-me-please

@coveralls

Coverage Status

Coverage remained the same at 46.777% when pulling 776538e on pr/qmonnet/flake_ipfrag into e92fd24 on master.


qmonnet commented Apr 16, 2020

Hit #11013.


qmonnet commented Apr 16, 2020

restart-ginkgo


qmonnet commented Apr 16, 2020

And GKE just hit #9902.


qmonnet commented Apr 16, 2020

test-gke

1 similar comment

qmonnet commented Apr 16, 2020

test-gke

@joestringer joestringer left a comment


Wow, nice insight! SGTM.

@borkmann borkmann merged commit 8b74e24 into master Apr 16, 2020
1.8.0 automation moved this from In progress to Merged Apr 16, 2020
@borkmann borkmann deleted the pr/qmonnet/flake_ipfrag branch April 16, 2020 20:01
Successfully merging this pull request may close these issues.

CI: K8sServicesTest Checks service across nodes Supports IPv4 Fragments: Failed to account for IPv4 fragments