
bpf: split off debug options and do not run it in ci #11977

Merged
merged 11 commits into master from pr/fix-4.19-verifier on Jun 12, 2020

Conversation

borkmann
Member

@borkmann borkmann commented Jun 9, 2020

Related: #11175

@borkmann borkmann requested review from pchaigno and a team June 9, 2020 09:30
@borkmann borkmann requested review from a team as code owners June 9, 2020 09:30
@borkmann borkmann requested a review from a team June 9, 2020 09:30
@maintainer-s-little-helper

Please set the appropriate release note label.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Jun 9, 2020
@borkmann borkmann added the release-note/misc (This PR makes changes that have no direct user impact.) label Jun 9, 2020
@borkmann borkmann requested a review from brb June 9, 2020 09:31
@borkmann borkmann requested a review from a team as a code owner June 9, 2020 09:36
@borkmann
Member Author

borkmann commented Jun 9, 2020

retest-4.19

@borkmann borkmann requested review from a team as code owners June 9, 2020 10:01
@borkmann borkmann requested a review from a team June 9, 2020 10:01
@coveralls

coveralls commented Jun 9, 2020

Coverage Status

Coverage decreased (-0.02%) to 37.024% when pulling 5e94d68 on pr/fix-4.19-verifier into e5bc626 on master.

@borkmann borkmann added the pending-review, area/CI (Continuous Integration testing issue or flake) and sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.) labels Jun 9, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 Jun 9, 2020
@borkmann
Member Author

borkmann commented Jun 9, 2020

retest-4.19

@borkmann borkmann force-pushed the pr/fix-4.19-verifier branch 2 times, most recently from 987addf to 2191734 on June 9, 2020 11:16
@borkmann
Member Author

borkmann commented Jun 9, 2020

retest-4.19

borkmann and others added 2 commits June 12, 2020 13:46
Seen this brokenness from a Cilium upgrade test run:

  kubectl delete daemonset --namespace=kube-system cilium
  kubectl delete deployment --namespace=kube-system cilium-operator
  kubectl delete podsecuritypolicy cilium-psp cilium-operator-psp
  kubectl delete daemonset --namespace=kube-system cilium-node-init
  kubectl get pods --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'

This caused the test suite to wait for an invalid namespace which never
came up. Fix it by moving ns into a tmp var (see the sketch after this
commit message).

Fixes: 82df172 ("test: Add single node conformance test")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
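
For illustration only, here is a minimal Go sketch of the kind of flag-accumulation bug described in the commit message above. The variable names are made up and this is not the actual test code; it just shows how folding the namespace flag back into a shared variable grows the command on every retry, while a temporary variable keeps it stable.

```go
// Hypothetical sketch of the flag-accumulation bug described above; the
// names are invented and this is not the actual Cilium test code.
package main

import "fmt"

func kubectlGetPods(nsFlag string) {
	fmt.Printf("kubectl get pods %s--all-namespaces -o jsonpath=...\n", nsFlag)
}

func main() {
	ns := "" // empty: the query spans all namespaces

	// Buggy pattern: the "-n" prefix is folded back into the shared
	// variable, so every retry grows the command by one more "-n".
	for i := 0; i < 3; i++ {
		ns = "-n " + ns
		kubectlGetPods(ns)
	}

	// Fixed pattern (what "move ns into a tmp var" amounts to): build the
	// flag in a temporary so each retry starts from the original value.
	ns = ""
	for i := 0; i < 3; i++ {
		tmp := ns
		if tmp != "" {
			tmp = "-n " + tmp + " "
		}
		kubectlGetPods(tmp)
	}
}
```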
The host endpoint introduces some interesting interactions between the
API, the host endpoint datapath configuration management and the
recently proposed autodisabling of debug mode for the host endpoint.
These issues are tracked in more detail here:

#12037

For now, skip the host endpoint when iterating through endpoints in
runtime tests to simplify working around this issue (see the sketch
after this commit message); coverage is otherwise the same as for
previous releases. We can figure out the right long-term handling of
the host endpoint in RuntimeMonitor tests via the above issue and
unblock PR #11977 with this commit.

Signed-off-by: Joe Stringer <joe@cilium.io>
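
As a rough illustration of that workaround, a minimal Go sketch follows; the Endpoint type and helper name are hypothetical and do not match Cilium's actual test code.

```go
// Hypothetical sketch of skipping the host endpoint when a runtime test
// iterates over endpoints; types and names are made up.
package main

import "fmt"

type Endpoint struct {
	ID     int
	IsHost bool // true for the special host endpoint
}

// endpointsForMonitorTest returns the endpoints a runtime monitor test
// should iterate over, leaving out the host endpoint whose datapath/debug
// configuration is handled separately (see the issue referenced above).
func endpointsForMonitorTest(all []Endpoint) []Endpoint {
	eps := make([]Endpoint, 0, len(all))
	for _, ep := range all {
		if ep.IsHost {
			continue // skip the host endpoint for now
		}
		eps = append(eps, ep)
	}
	return eps
}

func main() {
	all := []Endpoint{{ID: 1, IsHost: true}, {ID: 2}, {ID: 3}}
	fmt.Println(endpointsForMonitorTest(all)) // [{2 false} {3 false}]
}
```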
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off (The author needs to add signoff to their commits before merge.) label Jun 12, 2020
@borkmann
Member Author

test-me-please

@borkmann
Member Author

Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI. With 8adf6fe in place and the cherry-pick taken out of the series, the test-focus run went through just fine on net-next. I tested 4.19.57 with kube-proxy-replacement=strict and v4+v6 locally on my machine, and it loads just fine with the rest of the series. So I've taken out the mentioned commit and repushed everything, and also dropped the debug=false change from Joe since it is no longer necessary. There are no other known issues for now on the rest of the series, so once this hits green, we can move forward and merge them in.

To avoid long waiting times on this one, I opened a new PR #12045 where I re-enable the kube-proxy-replacement probing on 4.19, given that the combination the commit tried to fix (modulo debug) works fine locally. If that goes well, we get the full kube-proxy-free coverage on 4.19 and can fix up BPF datapath debug as needed later on. Given it is highly likely that no-one is running BPF debug in production/staging, we can just move on and fix it up later on w/o it being a release blocker.

@pchaigno
Member

> Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI.

That was my first impression as well (and the reason I tried reordering the tail call numbers). I'm wondering if we have some issue when swapping the programs in the prog array 🤔 If I remember correctly tc doesn't flush the array, it just updates entries...

Could you remove "bpf: Enforce host policies before BPF-based SNAT" as well if it's not needed? I'd rather not merge a partial fix for that if possible, as it makes things worse until we get a full fix in.

How is the program size issue on CILIUM_CALL_NODEPORT_REVNAT addressed if 26a39a8 is dropped? I thought that issue was happening even with debug=false (it's the original issue reported in #11175).

@borkmann
Member Author

borkmann commented Jun 12, 2020

> Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI.

> That was my first impression as well (and the reason I tried reordering the tail call numbers). I'm wondering if we have some issue when swapping the programs in the prog array 🤔 If I remember correctly tc doesn't flush the array, it just updates entries...

Currently debugging it; I was thinking about that as well. The path in the test is 1.7.4 -> latest (upgrade) -> 1.7.4 (downgrade) -> latest (general cleanup for next test). If we don't use some tail call slots, we would drop due to a missing tail call. Otoh, the current config should load what it has enabled. Hmm, I'll debug further here.

> Could you remove "bpf: Enforce host policies before BPF-based SNAT" as well if it's not needed? I'd rather not merge a partial fix for that if possible, as it makes things worse until we get a full fix in.

Hm, true. Otoh, the test suite currently passes, and the service tests on 4.19 with kube-proxy free (v4 only instead of v4+v6) also pass right now. Maybe let's get this merged as-is to unblock, and hopefully by tonight I'll have the other one fixed so we can merge it as well. Either way it's going to be fixed before the final 1.8.0.

@borkmann borkmann merged commit 7cfe6f2 into master Jun 12, 2020
1.8.0 automation moved this from In progress to Merged Jun 12, 2020
@borkmann borkmann deleted the pr/fix-4.19-verifier branch June 12, 2020 13:43
joestringer added a commit that referenced this pull request Jun 12, 2020
[ upstream commit 7cfe6f2 ]

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 12, 2020
@borkmann
Member Author

> Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI.

> That was my first impression as well (and the reason I tried reordering the tail call numbers). I'm wondering if we have some issue when swapping the programs in the prog array 🤔 If I remember correctly tc doesn't flush the array, it just updates entries...

> Currently debugging it; I was thinking about that as well. The path in the test is 1.7.4 -> latest (upgrade) -> 1.7.4 (downgrade) -> latest (general cleanup for next test). If we don't use some tail call slots, we would drop due to a missing tail call. Otoh, the current config should load what it has enabled. Hmm, I'll debug further here.

Found the issue: it's a drop due to a missed tail call, which then causes the connection disruption during up/downgrade. Will post the fix & the full rest of the series for #12045 on Mon.
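
For background on the prog array discussion above, a minimal sketch using github.com/cilium/ebpf follows; the stub program and slot indices are made up and this is not Cilium's loader logic. The point it illustrates: updating the array only overwrites the slots you write, so a slot the new datapath never populates either keeps a stale program or stays empty, and a tail call into an empty slot falls through — the "missed tail call" that shows up as a drop.

```go
// Toy illustration of prog array update semantics (requires root to run);
// slot indices and the stub program are invented, not Cilium's datapath.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	// Stub tc program ("return TC_ACT_OK") standing in for a tail call target.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type:         ebpf.SchedCLS,
		License:      "GPL",
		Instructions: asm.Instructions{asm.Mov.Imm(asm.R0, 0), asm.Return()},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer prog.Close()

	calls, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.ProgramArray,
		KeySize:    4,
		ValueSize:  4,
		MaxEntries: 8,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer calls.Close()

	// "Old" datapath populates slots 0 and 1.
	for _, slot := range []uint32{0, 1} {
		if err := calls.Put(slot, prog); err != nil {
			log.Fatal(err)
		}
	}

	// "New" datapath only writes slot 0. Slot 1 silently keeps the old
	// program, and a tail call into slot 2 (never populated) falls through:
	// a missed tail call, which the caller typically treats as a drop.
	if err := calls.Put(uint32(0), prog); err != nil {
		log.Fatal(err)
	}
}
```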

aanm pushed a commit that referenced this pull request Jun 15, 2020
[ upstream commit 7cfe6f2 ]

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 16, 2020
qmonnet added a commit that referenced this pull request Jun 17, 2020
The test for fragment tracking support got a fix with commit
0e772e7 ("test: Fix fragment tracking test under KUBEPROXY=1"),
where the pattern for searching entries in the Conntrack table accounts
for DNAT not happening if kube-proxy is present.

Following recent changes in the datapath and tests for the Linux 4.19
kernel, DNAT is now used even with kube-proxy, provided bpf_sock is in
use. This led to CI failures, and the test was disabled for 4.19 kernels
with commit 1120aed ("test/K8sServices: disable fragment tracking
test for kernel 4.19").

Now that complexity issues are fixed (see #11977 and #12045), let's
enable the test on 4.19 again. Ignore DNAT only if kube-proxy is present
and bpf_sock (host-reachable services) is not in use (see the sketch
after this commit message). This is also the case for net-next kernels
(this didn't fail in CI before because we do not test with kube-proxy on
net-next).

Note that (as far as I know) both 4.19 and net-next always use bpf_sock
in CI runs, so the check on hostReachableServices is currently
superfluous. Let's have it all the same, in case something changes in
the future, to avoid unexpected breakage.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
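
A tiny Go sketch of the condition described above; the function name is hypothetical and does not match the real test helper.

```go
// Hypothetical condition sketch, not the actual Cilium test helper.
package main

import "fmt"

// ignoreDNAT reports whether the conntrack check should not expect DNATed
// entries: only when kube-proxy performs the translation and bpf_sock
// (host-reachable services) is not in use.
func ignoreDNAT(kubeProxyPresent, hostReachableServices bool) bool {
	return kubeProxyPresent && !hostReachableServices
}

func main() {
	fmt.Println(ignoreDNAT(true, false))  // true: kube-proxy without bpf_sock
	fmt.Println(ignoreDNAT(true, true))   // false: bpf_sock DNATs even with kube-proxy
	fmt.Println(ignoreDNAT(false, false)) // false: Cilium handles the service
}
```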
qmonnet added a commit that referenced this pull request Jun 18, 2020
aanm pushed a commit that referenced this pull request Jun 21, 2020
aanm pushed a commit that referenced this pull request Jun 21, 2020
[ upstream commit 5b9503f ]

aanm pushed a commit that referenced this pull request Jun 22, 2020
[ upstream commit 5b9503f ]
