test/K8sServices: send datagrams in one block for fragment support tests #11016
Merged
Conversation
Tests for IPv4 fragments support introduced a flake in the CI. The test consists of sending a fragmented datagram and checking (from the conntrack table) that all fragments were processed as expected. But sometimes an additional packet is observed, leading to a failure with a message like the following:

```
Failed to account for IPv4 fragments (in)
Expected
    <[]int | len:2, cap:2>: [21, 20]
To satisfy at least one of these matchers:
    [%!s(*matchers.EqualMatcher=&{[16 24]}) %!s(*matchers.EqualMatcher=&{[20 20]})]
```

A normal datagram, as seen by the destination pod, looks like:

```
09:02:22.178149 IP (tos 0x0, ttl 63, id 61115, offset 0, flags [+], proto UDP (17), length 1444) 10.10.0.230.12345 > testds-smpbw.69: 1416 tftp-#0
09:02:22.178151 IP (tos 0x0, ttl 63, id 61115, offset 1424, flags [+], proto UDP (17), length 1444) 10.10.0.230 > testds-smpbw: udp
09:02:22.178233 IP (tos 0x0, ttl 63, id 61115, offset 2848, flags [+], proto UDP (17), length 1444) 10.10.0.230 > testds-smpbw: udp
09:02:22.178265 IP (tos 0x0, ttl 63, id 61115, offset 4272, flags [none], proto UDP (17), length 876) 10.10.0.230 > testds-smpbw: udp
```

When reproducing the flake, we could observe the additional packet:

```
09:02:26.535728 IP (tos 0x0, ttl 63, id 61232, offset 0, flags [DF], proto UDP (17), length 540) 10.10.0.230.12345 > testds-smpbw.69: [udp sum ok] 512 tftp-#0
09:02:26.536103 IP (tos 0x0, ttl 63, id 61233, offset 0, flags [+], proto UDP (17), length 1444) 10.10.0.230.12345 > testds-smpbw.69: 1416 tftp-#0
09:02:26.536162 IP (tos 0x0, ttl 63, id 61233, offset 1424, flags [+], proto UDP (17), length 1444) 10.10.0.230 > testds-smpbw: udp
09:02:26.536274 IP (tos 0x0, ttl 63, id 61233, offset 2848, flags [+], proto UDP (17), length 1444) 10.10.0.230 > testds-smpbw: udp
09:02:26.536422 IP (tos 0x0, ttl 63, id 61233, offset 4272, flags [none], proto UDP (17), length 364) 10.10.0.230 > testds-smpbw: udp
```

We note that the data is split into two datagrams, the first packet being standalone and the rest being fragmented. The total length received is the same (`3*1444 + 540 + 364 == 3*1444 + 876 + sizeof(IP, UDP headers)`). The fact that the first packet is 512 bytes long is a hint to the probable cause of the flake. We send the packets with netcat, but source the data from /dev/zero with `dd`, as follows:

```
dd if=/dev/zero bs=512 count=10 | nc ...
```

Most of the time, the blocks written by `dd` are passed along quickly enough that netcat processes them in one go. But if the machine is under heavier load at that moment, a small latency may be introduced between the blocks, and netcat then sends the data as several chunks (datagrams).

Let's solve this by copying from /dev/zero in just one block.

Fixes: #10929

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
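A minimal sketch of the change described above (the destination address and netcat flags are placeholders, not the exact test code): instead of ten 512-byte writes, which netcat may flush as more than one UDP datagram under load, emit all 5120 bytes in a single `dd` write so netcat sees one block and sends one (fragmented) datagram.

```shell
#!/bin/sh
# Flaky version: ten 512-byte blocks; a delay between blocks can make
# netcat send more than one datagram:
#   dd if=/dev/zero bs=512 count=10 | nc -u "$POD_IP" 69
#
# Fixed version: the same 5120 bytes in a single block, so only one
# datagram is emitted. "$POD_IP" is a hypothetical placeholder for the
# destination pod address; here we pipe into wc -c to show that the
# amount of data written is unchanged.
dd if=/dev/zero bs=5120 count=1 2>/dev/null | wc -c
```

Either pipeline writes exactly 5120 bytes; only the number of `write()` calls reaching netcat's stdin differs, which is what determines how the data is packetized over UDP.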
qmonnet added labels on Apr 16, 2020:
- pending-review
- kind/bug/CI (This is a bug in the testing code.)
- area/CI (Continuous Integration testing issue or flake)
- release-note/misc (This PR makes changes that have no direct user impact.)
test-me-please

Hit #11013.

restart-ginkgo

And GKE just hit #9902.

test-gke

test-gke
joestringer approved these changes on Apr 16, 2020
Wow, nice insight! SGTM.
borkmann approved these changes on Apr 16, 2020