v1.9 backports 2021-11-09 #17835
Conversation
My changes look good.
Force-pushed from 406409a to 3c0ab67.
Force-pushed from cc4dab0 to 9bc0794.
Backport of my PR looks right :-)
LGTM for my changes
✔️ 8c5a7bc
[ upstream commit 40eba60 ] Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 87195bf ]

Netlify preview is currently failing with the following error:

```
Collecting git+git://github.com/cilium/sphinx_rtd_theme.git@v0.7 (from -r requirements.txt (line 23))
Cloning git://github.com/cilium/sphinx_rtd_theme.git (to revision v0.7) ...
Running command git clone -q git://github.com/cilium/sphinx_rtd_theme.git ...
fatal: remote error: The unauthenticated git protocol on port 9418 is no longer supported.
Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
```

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c8d2fc7 ]

'waitNextPolicyRevisions()' currently returns 'true' when no k8s network policies are applied, bypassing the Cilium agent policy revision wait in that case. Since our tests typically (possibly always) have no network policies applied at that point, we have not actually been waiting for CNP or CCNP changes to take effect in all Cilium pods before proceeding with the tests. This may have caused CI flakes.

Fix this by removing the code that checks for the presence of network policies.

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 693163e ]

MLH assumes a flake is a new one if its similarity to existing flakes is below 75%. This threshold is a bit low for flakes affecting the same test but failing with a different error message. We can adjust it to 85% and see.

Related: cilium#17270.
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 988e26e ]

Following discussion in the community meeting, we decided to rename the project/ci-force label to ci/flake. We need to rename it in MLH and the issue template.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit e562904 ]

s/ethool/ethtool/

Also fix the build for other non-Linux platforms besides macOS.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f05a344 ]

Avoid reallocations and GC pressure in the loop.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 6a5490f ]

There is no need to recompile the constant regexp on every call: hashEncryptionKeys is called in a loop, so move the regexp compilation to a global var to avoid recompiling it each time.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 64940c4 ]

Instead of open-coding a worker pool, use the existing github.com/cilium/workerpool package. Also limit the number of workers to the number of CPUs by default, which should help reduce excessive memory usage from too many parallel goroutines, at the price of potentially slightly slower bugtool report creation. If users want more parallel tasks, the number can be specified using the newly introduced `--parallel-workers` option.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 8bcc4e5 ]

Converting []byte to string can cause an allocation, because strings are immutable in Go. By letting execCommand return []byte instead of string we can avoid some of these allocations, including the allocation caused by further processing of that return value.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 8d4ec81 ]

When the command output doesn't need to be postprocessed, it can be written directly to the file without buffering. This should significantly reduce memory usage. On my test system this reduces total RSS by about one third:

Before:
$ /usr/bin/time -f 'RSS=%MKB' cilium-bugtool
RSS=63168KB

After:
$ /usr/bin/time -f 'RSS=%MKB' cilium-bugtool
RSS=42752KB

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1b6a98c ]

When both IPv4 and IPv6 are enabled, we split the to/from-container BPF programs into two code paths, one for each IP family, to reduce program size and complexity. Because our existing K8sVerifier test only covers the IPv4+IPv6 configuration, new complexity and program size issues sneaked in for the IPvX-only setups. These new issues occur when to/from-container contain both the initial IP parsing code and the IPv4 (resp. IPv6; there is one issue per family) code path. Splitting these programs so that they only contain the initial IP parsing code is enough to fix them.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 7bf60a5 ]

The local NodeDiscovery implementation was previously informing the rest of the Cilium agent that the local node's identity is "Remote Node", because of the statically initialized "identity.GetLocalNodeID" value. However, that value should only ever be used for external-workloads cases, to prepare the source identity used when transmitting traffic to other Cilium nodes. It should not be used for locally determining the identity of traffic coming from the host itself. Fix this by hardcoding the identity to the "Host" identity.

Fixes: c864fd3 ("daemon: Split IPAM bootstrap, join cluster in between")
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
Force-pushed from 9bc0794 to 1467a3d.
Fixed an issue in backport commit 54664f2 which was causing complexity issues on 5.4. The datapath configurations were missing.

/test-backport-1.9
Job 'Cilium-PR-K8s-1.17-kernel-5.4' hit: #17617 (93.95% similarity)
Job 'Cilium-PR-K8s-1.17-kernel-4.9' has 1 failure, but it might be a new flake since it also hit 1 known flake: #17617 (94.18% similarity)
GKE tests failing with:

Also not sure if the 4.9 and 5.4 test failures need to be considered flakes, since they both fail for the same reason: #17617 (comment)
GKE has been broken on v1.9 for a while: https://jenkins.cilium.io/view/Cilium-v1.9/job/cilium-v1.9-k8s-gke/ But the failure in the upgrade test worries me a bit, as it indicates some incompatibilities
Edit: On the regular Jenkins builds, these tests don't seem to fail. I'll restart the failed test to see if it's a flake.
/test-1.17-4.9

Edit: Unclear failure https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/console
/test-1.17-5.4

Edit: Failed during provisioning, looks like an NFS or I/O problem? https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-5.4/129/console
Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so it may be related to your PR. If it is a flake, comment.
/test-1.17-5.4
/test-1.17-4.9

Edit: It keeps failing without any error https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/
It seems the VM startup hit the 15m timeout. The tests themselves then ended after exactly 1h30, which looks suspiciously like a timeout, but the timeout value should be 3h. Maybe worth comparing all the timeouts in the Jenkinsfile with v1.10 and master to see if we're not just missing a fix on timeouts?
There is a 98 minute timeout here that I don't see on master:
The Jenkinsfile setup on v1.9 is completely different from the one on master. I'm not familiar with the setup and would need to spend some time understanding it first. I'm going ahead and merging this PR. As I said, GKE has been broken on v1.9 for a while and the flake on the upgrade test (…). For future reference: #17617
#17344 -- .github: Rename project/ci-force to ci/flake (@pchaigno)

Once this PR is merged, you can update the PR labels via: