Tests for IPv4 fragments support introduced a flake in the CI. The test
consists of sending a fragmented datagram and checking (from the
conntrack table) that all fragments were processed as expected. But
sometimes, an additional packet is observed, leading to a failure with
a message like the following:
    Failed to account for IPv4 fragments (in)
    Expected
        <[]int | len:2, cap:2>: [21, 20]
    To satisfy at least one of these matchers: [%!s(*matchers.EqualMatcher=&{[16 24]}) %!s(*matchers.EqualMatcher=&{[20 20]})]
A normal datagram, as seen by the destination pod, looks like:
    09:02:22.178149 IP (tos 0x0, ttl 63, id 61115, offset 0, flags [+], proto UDP (17), length 1444)
        10.10.0.230.12345 > testds-smpbw.69: 1416 tftp-#0
    09:02:22.178151 IP (tos 0x0, ttl 63, id 61115, offset 1424, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:22.178233 IP (tos 0x0, ttl 63, id 61115, offset 2848, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:22.178265 IP (tos 0x0, ttl 63, id 61115, offset 4272, flags [none], proto UDP (17), length 876)
        10.10.0.230 > testds-smpbw: udp
When reproducing the flake, we could observe the additional packet:
    09:02:26.535728 IP (tos 0x0, ttl 63, id 61232, offset 0, flags [DF], proto UDP (17), length 540)
        10.10.0.230.12345 > testds-smpbw.69: [udp sum ok] 512 tftp-#0
    09:02:26.536103 IP (tos 0x0, ttl 63, id 61233, offset 0, flags [+], proto UDP (17), length 1444)
        10.10.0.230.12345 > testds-smpbw.69: 1416 tftp-#0
    09:02:26.536162 IP (tos 0x0, ttl 63, id 61233, offset 1424, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:26.536274 IP (tos 0x0, ttl 63, id 61233, offset 2848, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:26.536422 IP (tos 0x0, ttl 63, id 61233, offset 4272, flags [none], proto UDP (17), length 364)
        10.10.0.230 > testds-smpbw: udp
We note that the data is split into two datagrams, the first packet
being a standalone datagram and the rest being fragmented. The total
length received is the same (3*1444 + 540 + 364 == 3*1444 + 876 +
sizeof(IP and UDP headers)).
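This equality can be checked with shell arithmetic, assuming the usual
header sizes of 20 bytes (IPv4, no options) and 8 bytes (UDP):

```shell
# Flaky run: three full fragments, the last fragment, and the standalone
# datagram, each "length" covering IP header + payload
echo $(( 3*1444 + 540 + 364 ))     # 5236
# Normal run: three full fragments plus the last one, with one extra set
# of IP (20 bytes) and UDP (8 bytes) headers for the standalone datagram
echo $(( 3*1444 + 876 + 20 + 8 ))  # 5236
```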
The fact that the first packet is 512 bytes long is a hint to the
probable cause of the flake. We send the packets with netcat, but source
the data from /dev/zero with dd, as follows:

    dd if=/dev/zero bs=512 count=10 | nc ...
Most of the time, the blocks written by dd are passed quickly enough
that netcat processes them in one go. But if the machine is under a
heavier load at that moment, a small latency may be introduced between
the blocks, and netcat then sends the data in several chunks (and hence
several datagrams).
Let's solve this by copying from /dev/zero in a single block, so that
netcat always receives the whole payload in one write.
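As a sketch of the fix (the netcat options are elided as in the original
command; bs=5120 is an assumption that preserves the original 10*512-byte
payload size), the byte counts of both variants can be compared:

```shell
# Before: ten 512-byte writes; a delay between them can split the stream
dd if=/dev/zero bs=512 count=10 2>/dev/null | wc -c    # 5120 bytes
# After: one 5120-byte write, handed to netcat in a single chunk
dd if=/dev/zero bs=5120 count=1 2>/dev/null | wc -c    # 5120 bytes
```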
Fixes: #10929
Signed-off-by: Quentin Monnet <quentin@isovalent.com>