Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wrong csum for non-first ipv4 fragments #13476

Merged
merged 1 commit into from Oct 14, 2020
Merged

Conversation

liuyuan10
Copy link
Contributor

@liuyuan10 liuyuan10 commented Oct 8, 2020

Before this commit, natting are blindly changing l4 ports to non-first ipv4
fragments, causing wrong csum.

Also fixes the same issue during the rev natting for dsr.

Signed-off-by: Yuan Liu liuyuan@google.com

Fix natting of non-first ipv4 fragments.

@liuyuan10 liuyuan10 requested a review from a team October 8, 2020 21:41
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 8, 2020
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

Upon first review, I was confused why you would avoid NATing for second and subsequent fragments since the L3 addresses still need to be translated. However upon closer read, this PR is avoiding port translation (and corresponding checksum calculation for L4) on second/subsequent fragments. Please update the commit message and PR description to reflect this, it will help avoid confusion in future.

Do you have a reproducer to regression-test this issue? It seems like something we would want tests in the codebase for.

Misc additional cosmetic feedback below.

bpf/bpf_lxc.c Outdated Show resolved Hide resolved
bpf/lib/lb.h Outdated Show resolved Hide resolved
@joestringer joestringer added needs-backport/1.8 release-note/bug This PR fixes an issue in a previous release of Cilium. labels Oct 8, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 8, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 8, 2020
@liuyuan10
Copy link
Contributor Author

Thanks for the contribution!

Upon first review, I was confused why you would avoid NATing for second and subsequent fragments since the L3 addresses still need to be translated. However upon closer read, this PR is avoiding port translation (and corresponding checksum calculation for L4) on second/subsequent fragments. Please update the commit message and PR description to reflect this, it will help avoid confusion in future.

Done

Do you have a reproducer to regression-test this issue? It seems like something we would want tests in the codebase for.

I was doing manual testing with a cluster with a udp_echo application. I don't know about regression-tests. do you have more info?

@joestringer
Copy link
Member

I was doing manual testing with a cluster with a udp_echo application. I don't know about regression-tests. do you have more info?

@liuyuan10 We have fragmentation tests under test/k8sT/Services.go, so I was wondering if we could extend them to include a test that would validate the fix (and prevent us merging code that breaks it in future). The end-to-end testing framework we use is described in more detail here: https://docs.cilium.io/en/v1.8/contributing/testing/e2e/ .

@joestringer
Copy link
Member

test-me-please

@liuyuan10
Copy link
Contributor Author

I was doing manual testing with a cluster with a udp_echo application. I don't know about regression-tests. do you have more info?

@liuyuan10 We have fragmentation tests under test/k8sT/Services.go, so I was wondering if we could extend them to include a test that would validate the fix (and prevent us merging code that breaks it in future). The end-to-end testing framework we use is described in more detail here: https://docs.cilium.io/en/v1.8/contributing/testing/e2e/ .

I see. let me look into that. can we merge this PR first?

@joestringer
Copy link
Member

let me look into that. can we merge this PR first?

Yep, once it passes the full CI.

Feel free to ignore the Cilium-Ginkgo-GKE job failure for now, there is an infrastructure issue. When we get an update on that being fixed, we can re-trigger that job.

Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find! I'm wondering why it was not caught by the testIPv4FragmentSupport in test/k8sT/Services.go? /cc @qmonnet

@pchaigno pchaigno requested a review from qmonnet October 12, 2020 06:49
@pchaigno
Copy link
Member

The 4.9 build is failing because of a new complexity issue:

2020-10-09T18:52:24.130199902Z level=error msg="Command execution failed" cmd="[tc filter replace dev lxc_health egress prio 1 handle 1 bpf da obj 3350_next/bpf_lxc.o sec to-container]" error="exit status 1" subsys=datapath-loader
2020-10-09T18:52:24.130222938Z level=warning subsys=datapath-loader
2020-10-09T18:52:24.130225739Z level=warning msg="Prog section 'to-container' rejected: Argument list too long (7)!" subsys=datapath-loader
2020-10-09T18:52:24.130274563Z level=warning msg=" - Type:         3" subsys=datapath-loader
2020-10-09T18:52:24.130280055Z level=warning msg=" - Attach Type:  0" subsys=datapath-loader
2020-10-09T18:52:24.130282364Z level=warning msg=" - Instructions: 1716 (0 over limit)" subsys=datapath-loader
2020-10-09T18:52:24.130284500Z level=warning msg=" - License:      GPL" subsys=datapath-loader
2020-10-09T18:52:24.130286629Z level=warning subsys=datapath-loader
2020-10-09T18:52:24.130288654Z level=warning msg="Verifier analysis:" subsys=datapath-loader
2020-10-09T18:52:24.130290702Z level=warning subsys=datapath-loader
2020-10-09T18:52:24.130292748Z level=warning msg="Skipped 4880636 bytes, use 'verb' option for the full verbose log." subsys=datapath-loader
2020-10-09T18:52:24.130294873Z level=warning msg="[...]" subsys=datapath-loader
2020-10-09T18:52:24.130296926Z level=warning msg="6b) *(u16 *)(r10 -52) = r3" subsys=datapath-loader
2020-10-09T18:52:24.130299379Z level=warning msg="751: (67) r3 <<= 32" subsys=datapath-loader
[...]
2020-10-09T18:52:24.130522901Z level=warning msg="1681: (b7) r1 = 256" subsys=datapath-loader
2020-10-09T18:52:24.130525017Z level=warning msg="1682: (7b) *(u64 *)(r10 -120) = r1" subsys=datapath-loader
2020-10-09T18:52:24.130527142Z level=warning msg="BPF program is too large. Proccessed 98305 insn" subsys=datapath-loader

@qmonnet
Copy link
Member

qmonnet commented Oct 12, 2020

Great find! I'm wondering why it was not caught by the testIPv4FragmentSupport in test/k8sT/Services.go? /cc @qmonnet

In the tests, we check that fragments are effectively processed and sent to the backend by looking for updates on the counters in the CT map:

cmdOut := "cilium bpf ct list global | awk ..."

We don't check for valid checksums on the receiving end at the moment.

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the revalidate_data() are necessary, but the rest of the code looks good, thank you!

bpf/lib/lb.h Outdated Show resolved Hide resolved
bpf/lib/lb.h Outdated Show resolved Hide resolved
bpf/lib/nat.h Outdated Show resolved Hide resolved
Before this commit, natting are blindly changing l4 ports to non-first ipv4
fragments, causing wrong csum.

Also fixes the same issue during the rev natting for dsr.

Signed-off-by: Yuan Liu <liuyuan@google.com>
@pchaigno
Copy link
Member

pchaigno commented Oct 14, 2020

Let's see if the complexity is still a concern first:
retest-4.9
Looks fixed 🎉

@pchaigno
Copy link
Member

test-me-please

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes!

@tklauser tklauser merged commit 3453877 into cilium:master Oct 14, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 14, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 14, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.8.5
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

8 participants