v1.9 backports 2021-11-09 #17835

pchaigno · 2021-11-09T18:02:47Z

#14129 -- test: Collect object file artifacts for K8sVerifier (@pchaigno)
#17470 -- test/K8sVerifier: Cover several datapath configurations (@pchaigno)
#17667 -- bug/pkg/health: Fix Nil Address Issue in Node Update Mechanism (@nathanjsweet)
#17756 -- docs: Use git+https in requirements.txt (@michi-covalent)
#17769 -- test: Do not require netpols in 'waitNextPolicyRevisions()' (@jrajahalme)
#17812 -- .github: Increase reporting threshold for new flakes (@pchaigno)
#17344 -- .github: Rename project/ci-force to ci/flake (@pchaigno)
#17546 -- Reduce bugtool memory usage (@tklauser)
#17573 -- bpf: Additional tail calls for IPvX-only setups (@pchaigno)
#17836 -- nodediscovery: Fix local host identity propagation (@joestringer)

Once this PR is merged, you can update the PR labels via:

$ for pr in 14129 17470 17667 17756 17769 17812 17344 17546 17573 17836; do contrib/backporting/set-labels.py $pr done 1.9; done

nathanjsweet

My changes look good.

jrajahalme

Backport of my PR looks right :-)

tklauser

LGTM for my changes

michi-covalent

✔️ 8c5a7bc

[ upstream commit 40eba60 ] Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 87195bf ] Netlify preview is currently failing with the following error:

```
Collecting git+git://github.com/cilium/sphinx_rtd_theme.git@v0.7 (from -r requirements.txt (line 23))
Cloning git://github.com/cilium/sphinx_rtd_theme.git (to revision v0.7) ...
Running command git clone -q git://github.com/cilium/sphinx_rtd_theme.git ...
fatal: remote error: The unauthenticated git protocol on port 9418 is no longer supported.
Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
```

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit c8d2fc7 ] 'waitNextPolicyRevisions()' currently returns 'true' when no k8s network policies are applied, bypassing the Cilium agent policy revision wait in this case. As our tests typically (never?) have no NPs applied, we have not actually waited for CNP or CCNP changes to take place in all Cilium PODs before proceeding with the tests. This may have caused CI flakes. Fix this by removing the code that checks for the presence of NPs. Reported-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 693163e ] MLH assumes a flake is a new one if the similarity to existing flakes is below 75%. This threshold is a bit low for flakes affecting the same test but failing with a different error message. We can adjust to 85% and see. Related: cilium#17270. Signed-off-by: Paul Chaignon <paul@cilium.io>
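
The effect of raising the threshold can be sketched in Go as follows. This is only an illustration: `isNewFlake` and the constant names are hypothetical and are not taken from MLH's code; only the 75% and 85% values come from the commit message above.

```go
package main

import "fmt"

// Thresholds in percent, as quoted in the commit message.
const (
	oldThreshold = 75.0
	newThreshold = 85.0
)

// isNewFlake is a hypothetical helper: a failure counts as a new flake when
// its best similarity to any known flake is below the threshold.
func isNewFlake(similarity, threshold float64) bool {
	return similarity < threshold
}

func main() {
	for _, s := range []float64{70, 80, 93.95} {
		fmt.Printf("similarity %.2f%%: new at 75%%=%v, new at 85%%=%v\n",
			s, isNewFlake(s, oldThreshold), isNewFlake(s, newThreshold))
	}
}
```

Under these assumptions, a failure with 80% similarity is matched to a known flake at the old threshold but reported as new at the raised one.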

[ upstream commit 988e26e ] Following discussion in the community meeting, we decided to rename the project/ci-force label to ci/flake. We need to rename it in MLH and the issue template. Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit e562904 ] s/ethool/ethtool/ Also fix the build for other non-Linux platforms besides macOS. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit f05a344 ] Avoid reallocations and GC pressure in the loop. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
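
As a rough illustration of the preallocation idea, the following Go sketch sizes a slice once before filling it in a loop. The function name `ethtoolCommands` and the flag set are assumptions made for this example, not the bugtool's actual code.

```go
package main

import "fmt"

// ethtoolCommands is a hypothetical helper that builds one command line per
// interface and flag. The slice is allocated once with its final capacity,
// so append never has to grow (and copy) the backing array inside the loop.
func ethtoolCommands(ifaces []string) []string {
	flags := []string{"-i", "-k", "-S"} // illustrative flag set
	cmds := make([]string, 0, len(ifaces)*len(flags))
	for _, iface := range ifaces {
		for _, flag := range flags {
			cmds = append(cmds, fmt.Sprintf("ethtool %s %s", flag, iface))
		}
	}
	return cmds
}

func main() {
	for _, c := range ethtoolCommands([]string{"eth0", "eth1"}) {
		fmt.Println(c)
	}
}
```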

[ upstream commit 6a5490f ] There is no need to recompile the constant regexp for each iteration of the loop. Also, hashEncryptionKeys is called in a loop, so move the regexp compilation to a global var to avoid recompiling it. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
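
A minimal Go sketch of the pattern, assuming a made-up pattern and a simplified helper: `keyRe` and `redactKeys` are hypothetical names, and the helper redacts rather than hashes, so it only stands in for `hashEncryptionKeys`.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyRe is compiled once at package initialization instead of on every call.
// The pattern itself is an assumption for this example.
var keyRe = regexp.MustCompile(`key=[0-9a-f]+`)

// redactKeys is a hypothetical stand-in for a function that is called in a
// loop and needs the same constant regexp each time.
func redactKeys(line string) string {
	return keyRe.ReplaceAllString(line, "key=[redacted]")
}

func main() {
	for _, line := range []string{"spi 3 key=deadbeef", "no secrets here"} {
		fmt.Println(redactKeys(line))
	}
}
```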

[ upstream commit 64940c4 ] Instead of open-coding a worker pool, use the existing github.com/cilium/workerpool package. Also limit the number of workers to the number of CPUs by default, which should help to reduce excessive memory usage by too many parallel goroutines at the price of potentially slightly slower bugtool report creation. If users want more parallel tasks, the number can be specified using the newly introduced `--parallel-workers` option. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 8bcc4e5 ] Converting []byte to string can cause an allocation because strings are immutable in Go. By letting execCommand return []byte instead of string, we can avoid some of these allocations, including the allocations caused by further processing of that return value. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
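
To show what processing command output as []byte can look like, here is a small Go sketch. `filterLines` is a hypothetical helper and the sample input is invented; it is not how the bugtool post-processes output.

```go
package main

import (
	"bytes"
	"fmt"
)

// filterLines keeps only the lines of out that contain needle. It works on
// []byte throughout, so no []byte-to-string conversion (and the copy that
// comes with it) is needed while filtering.
func filterLines(out, needle []byte) []byte {
	var kept bytes.Buffer
	for _, line := range bytes.Split(out, []byte("\n")) {
		if bytes.Contains(line, needle) {
			kept.Write(line)
			kept.WriteByte('\n')
		}
	}
	return kept.Bytes()
}

func main() {
	out := []byte("a eth0\nb lo\nc eth0\n")
	fmt.Printf("%s", filterLines(out, []byte("eth0")))
}
```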

[ upstream commit 8d4ec81 ] When the command output doesn't need to be postprocessed, the output can be written directly to the file without buffering. This should significantly reduce memory usage. On my test system this reduces total RSS by about one third:

Before:
$ /usr/bin/time -f 'RSS=%MKB' cilium-bugtool
RSS=63168KB

After:
$ /usr/bin/time -f 'RSS=%MKB' cilium-bugtool
RSS=42752KB

Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 1b6a98c ] When both IPv4 and IPv6 are enabled, we split the to/from-container BPF programs into two code paths, one for each IP family, to reduce program size and complexity. Because our existing K8sVerifier test only covers the IPv4+IPv6 configuration, new complexity and program size issues sneaked in for the IPvX-only setups. These new issues occur when to/from-container contain both the initial IP parsing code and the IPv4 (resp. IPv6---we have one issue per family) code path. Splitting these programs such that they only contain the initial IP parsing code is enough to fix these issues. Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 7bf60a5 ] The local NodeDiscovery implementation was previously informing the rest of the Cilium agent that the local node's identity is "Remote Node" because of the statically initialized "identity.GetLocalNodeID" value. However, that value should only ever be used for external workloads cases in order to prepare the source identity used for transmitting traffic to other Cilium nodes. It should not be used for locally determining the identity of traffic coming from the host itself. Fix this by hardcoding the identity to "Host" identity. Fixes: c864fd3 ("daemon: Split IPAM bootstrap, join cluster in between") Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno · 2021-11-16T18:10:56Z

Fixed an issue in backport commit 54664f2 which was causing complexity issues on 5.4. The datapath configurations were missing NEEDS_RELAX_VERIFIER. Added a note to that effect in the commit's backporting notes.

test-backport-1.9

maintainer-s-little-helper · 2021-11-16T21:36:10Z

Job 'Cilium-PR-K8s-1.17-kernel-5.4' hit: #17617 (93.95% similarity)

maintainer-s-little-helper · 2021-11-16T22:42:21Z

Job 'Cilium-PR-K8s-1.17-kernel-4.9' has 1 failure but they might be new flake since it also hit 1 known flake: #17617 (94.18)

ti-mo · 2021-11-17T10:00:12Z

GKE tests failing with:
21:30:52 Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ConfigMap, namespace: kube-system, name: cilium-config

Also not sure if the 4.9 and 5.4 test failures need to be considered flakes, since they both fail for the same reason: #17617 (comment)
networkPlugin cni failed to set up pod "coredns-7dcc479575-mftvg_kube-system" network: unable to allocate IP via local cilium agent: [POST /ipam][502] postIpamFailure range is full

gandro · 2021-11-23T09:05:12Z

GKE has been broken on v1.9 for a while: https://jenkins.cilium.io/view/Cilium-v1.9/job/cilium-v1.9-k8s-gke/

But the failure in the upgrade test worries me a bit, as it indicates some incompatibilities.

Edit: On the regular Jenkins builds, these tests don't seem to fail. I'll restart the failed test to see if it's a flake.

gandro · 2021-11-23T09:43:31Z

/test-1.17-4.9

Edit: Unclear failure https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/console

gandro · 2021-11-23T09:43:37Z

/test-1.17-5.4

Edit: Failed during provisioning, looks like an NFS or I/O problem? https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-5.4/129/console

maintainer-s-little-helper · 2021-11-23T12:24:16Z

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Test Name

[empty]

Failure Output

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9 so I can create a new GitHub issue to track it.

gandro · 2021-11-23T12:33:17Z

/test-1.17-5.4

gandro · 2021-11-23T17:40:55Z

/test-1.17-4.9

Edit: It keeps failing without any error https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/

pchaigno · 2021-11-23T18:00:24Z

> Edit: It keeps failing without any error https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/

It seems the VM startup hit the 15m timeout. The tests themselves then ended after exactly 1h30, which looks suspiciously like a timeout but timeout value should be 3h. Maybe worth comparing all the timeouts in the Jenkinsfile with v1.10 and master to see if we're not just missing a fix on timeouts?

gandro · 2021-11-24T09:46:21Z

> It seems the VM startup hit the 15m timeout. The tests themselves then ended after exactly 1h30, which looks suspiciously like a timeout but timeout value should be 3h. Maybe worth comparing all the timeouts in the Jenkinsfile with v1.10 and master to see if we're not just missing a fix on timeouts?

There is a 98 minute timeout here that I don't see on master:

cilium/jenkinsfiles/ginkgo-kubernetes-all.Jenkinsfile, line 43 in bb44128:

GINKGO_TIMEOUT="98m"

The Jenkinsfile setup on v1.9 is completely different than on master. I'm not familiar with the setup, would need to spend some time understanding it first.

I'm going ahead and merging this PR. As I said, GKE has been broken on v1.9 for a while and the flake on the upgrade test (ipam: range is full) needs to be investigated, but it seems unrelated, as I don't see any changes at all to the endpoint control plane in this PR.

For future reference: #17617

pchaigno requested a review from a team as a code owner November 9, 2021 18:02

pchaigno added backport/1.9 kind/backports This PR provides functionality previously merged into master. labels Nov 9, 2021

pchaigno requested review from jrajahalme, michi-covalent and nathanjsweet November 9, 2021 18:07

maintainer-s-little-helper bot assigned jrajahalme, michi-covalent and nathanjsweet Nov 9, 2021

nathanjsweet approved these changes Nov 9, 2021

maintainer-s-little-helper bot unassigned nathanjsweet Nov 9, 2021

pchaigno force-pushed the pr/v1.9-backport-2021-11-09 branch from 406409a to 3c0ab67 Compare November 12, 2021 11:57

pchaigno requested review from joestringer and tklauser November 12, 2021 13:12

maintainer-s-little-helper bot assigned joestringer and tklauser and unassigned joestringer Nov 12, 2021

pchaigno force-pushed the pr/v1.9-backport-2021-11-09 branch 3 times, most recently from cc4dab0 to 9bc0794 Compare November 12, 2021 16:32

jrajahalme approved these changes Nov 12, 2021

maintainer-s-little-helper bot unassigned jrajahalme Nov 12, 2021

maintainer-s-little-helper bot mentioned this pull request Nov 12, 2021

[v1.9] CI: K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master #17617

Closed

tklauser approved these changes Nov 15, 2021

maintainer-s-little-helper bot unassigned tklauser Nov 15, 2021

joestringer approved these changes Nov 15, 2021

maintainer-s-little-helper bot unassigned joestringer Nov 15, 2021

michi-covalent approved these changes Nov 15, 2021

maintainer-s-little-helper bot unassigned michi-covalent Nov 15, 2021

test: Define workdir for test-verifier pod

9e95cdf

[ upstream commit 40eba60 ] Signed-off-by: Paul Chaignon <paul@cilium.io>

michi-covalent and others added 12 commits November 16, 2021 19:06

.github: Rename project/ci-force to ci/flake

b6259f6

[ upstream commit 988e26e ] Following discussion in the community meeting, we decided to rename the project/ci-force label to ci/flake. We need to rename it in MLH and the issue template. Signed-off-by: Paul Chaignon <paul@cilium.io>

bugtool: fix typo in function name and file names

bb7fbba

[ upstream commit e562904 ] s/ethool/ethtool/ Also fix the build for other non-Linux platforms besides macOS. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

bugtool: preallocate ethtool command slices

ce67fe8

[ upstream commit f05a344 ] Avoid reallocations and GC pressure in the loop. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno force-pushed the pr/v1.9-backport-2021-11-09 branch from 9bc0794 to 1467a3d Compare November 16, 2021 18:09

gandro approved these changes Nov 24, 2021

gandro merged commit a4d175d into cilium:v1.9 Nov 24, 2021

pchaigno deleted the pr/v1.9-backport-2021-11-09 branch November 24, 2021 09:55

kkourt mentioned this pull request Nov 24, 2021

v1.9: lxc complexity issue fix #18003

Closed

joestringer mentioned this pull request Jan 18, 2022

Prepare for release v1.9.12 #18533

Merged
