
test/K8sServices: send datagrams in one block for fragment support tests #11016

Merged
merged 1 commit into master from pr/qmonnet/flake_ipfrag on Apr 16, 2020



@qmonnet qmonnet commented Apr 16, 2020

Tests for IPv4 fragment support introduced a flake in the CI. The test consists of sending a fragmented datagram and checking (from the conntrack table) that all fragments were processed as expected. But sometimes an additional packet is observed, leading to a failure with a message like the following:

    Failed to account for IPv4 fragments (in)
         Expected
             <[]int | len:2, cap:2>: [21, 20]
         To satisfy at least one of these matchers: [%!s(*matchers.EqualMatcher=&{[16 24]}) %!s(*matchers.EqualMatcher=&{[20 20]})]

A normal datagram, as seen by the destination pod, looks like:

    09:02:22.178149 IP (tos 0x0, ttl 63, id 61115, offset 0, flags [+], proto UDP (17), length 1444)
        10.10.0.230.12345 > testds-smpbw.69:  1416 tftp-#0
    09:02:22.178151 IP (tos 0x0, ttl 63, id 61115, offset 1424, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:22.178233 IP (tos 0x0, ttl 63, id 61115, offset 2848, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:22.178265 IP (tos 0x0, ttl 63, id 61115, offset 4272, flags [none], proto UDP (17), length 876)
        10.10.0.230 > testds-smpbw: udp
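As a sanity check on this capture, the fragment lengths add up to the 5120 bytes of data we send plus the 8-byte UDP header, once the 20-byte IP header of each fragment is subtracted:

```shell
# Each fragment carries its IP length minus the 20-byte IP header:
# three full fragments of 1444 bytes and a last one of 876 bytes.
echo $((3 * (1444 - 20) + (876 - 20)))   # 5128 == 5120 bytes of data + 8-byte UDP header
```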

When reproducing the flake, we could observe the additional packet:

    09:02:26.535728 IP (tos 0x0, ttl 63, id 61232, offset 0, flags [DF], proto UDP (17), length 540)
        10.10.0.230.12345 > testds-smpbw.69: [udp sum ok]  512 tftp-#0
    09:02:26.536103 IP (tos 0x0, ttl 63, id 61233, offset 0, flags [+], proto UDP (17), length 1444)
        10.10.0.230.12345 > testds-smpbw.69:  1416 tftp-#0
    09:02:26.536162 IP (tos 0x0, ttl 63, id 61233, offset 1424, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:26.536274 IP (tos 0x0, ttl 63, id 61233, offset 2848, flags [+], proto UDP (17), length 1444)
        10.10.0.230 > testds-smpbw: udp
    09:02:26.536422 IP (tos 0x0, ttl 63, id 61233, offset 4272, flags [none], proto UDP (17), length 364)
        10.10.0.230 > testds-smpbw: udp

We note that the data is split into two datagrams, the first packet being standalone and the rest being fragmented. The total length received is the same (3*1444 + 540 + 364 == 3*1444 + 876 + sizeof(IP, UDP headers)).
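That equality works out because splitting the data into two datagrams costs exactly one extra set of headers, 20 bytes of IP plus 8 bytes of UDP:

```shell
# Left-hand side: standalone first packet (540) plus shortened last fragment (364).
echo $((540 + 364))      # 904
# Right-hand side: original last fragment (876) plus one extra IP (20) + UDP (8) header.
echo $((876 + 20 + 8))   # 904
```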

The fact that the first packet is 512 bytes long hints at the probable cause of the flake. We send the packets with netcat, but source the data from /dev/zero with dd as follows:

dd if=/dev/zero bs=512 count=10 | nc ...

Most of the time, the blocks written by dd are passed quickly enough that netcat processes them in one go. But if the machine is under heavier load at that moment, a small latency may be introduced between the blocks, and netcat then sends the data in several chunks (datagrams).

Let's solve this by copying from /dev/zero in just one block.
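A minimal sketch of the idea (the exact sizes used in the patch are an assumption here; the point is a single write of the same total payload):

```shell
# Before (flaky): ten 512-byte writes; a delay between them can make
# netcat split the payload across several datagrams.
dd if=/dev/zero bs=512 count=10 2>/dev/null | wc -c    # 5120 bytes

# After (sketch of the fix): one 5120-byte write, so netcat can pick up
# the whole payload in a single read and send a single datagram.
dd if=/dev/zero bs=5120 count=1 2>/dev/null | wc -c    # 5120 bytes
```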

Fixes: #10929

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet qmonnet added pending-review kind/bug/CI This is a bug in the testing code. area/CI Continuous Integration testing issue or flake release-note/misc This PR makes changes that have no direct user impact. labels Apr 16, 2020
@qmonnet qmonnet requested a review from a team as a code owner April 16, 2020 14:05
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Apr 16, 2020

qmonnet commented Apr 16, 2020

test-me-please

@coveralls

Coverage Status

Coverage remained the same at 46.777% when pulling 776538e on pr/qmonnet/flake_ipfrag into e92fd24 on master.


qmonnet commented Apr 16, 2020

Hit #11013.


qmonnet commented Apr 16, 2020

restart-ginkgo


qmonnet commented Apr 16, 2020

And GKE just hit #9902.


qmonnet commented Apr 16, 2020

test-gke

1 similar comment

qmonnet commented Apr 16, 2020

test-gke

@joestringer joestringer left a comment


Wow, nice insight! SGTM.

@borkmann borkmann merged commit 8b74e24 into master Apr 16, 2020
1.8.0 automation moved this from In progress to Merged Apr 16, 2020
@borkmann borkmann deleted the pr/qmonnet/flake_ipfrag branch April 16, 2020 20:01
Successfully merging this pull request may close these issues.

CI: K8sServicesTest Checks service across nodes Supports IPv4 Fragments: Failed to account for IPv4 fragments