Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.9 backports 2021-11-09 #17835

Merged
merged 17 commits into from
Nov 24, 2021
Merged

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Nov 9, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 14129 17470 17667 17756 17769 17812 17344 17546 17573 17836; do contrib/backporting/set-labels.py $pr done 1.9; done

@pchaigno pchaigno requested a review from a team as a code owner November 9, 2021 18:02
@pchaigno pchaigno added backport/1.9 kind/backports This PR provides functionality previously merged into master. labels Nov 9, 2021
Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good.

Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backport of my PR look right :-)

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for my changes

Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️ 8c5a7bc

[ upstream commit 40eba60 ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
michi-covalent and others added 12 commits November 16, 2021 19:06
[ upstream commit 87195bf ]

Netlify preview is currently failing with the following error:

```
Collecting git+git://github.com/cilium/sphinx_rtd_theme.git@v0.7 (from -r requirements.txt (line 23))
  Cloning git://github.com/cilium/sphinx_rtd_theme.git (to revision v0.7) ...
  Running command git clone -q git://github.com/cilium/sphinx_rtd_theme.git ...
  fatal: remote error:
    The unauthenticated git protocol on port 9418 is no longer supported.
  Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
```

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c8d2fc7 ]

'waitNextPolicyRevisions()' currently returns 'true' when no k8s
network policies are applied, bypassing the Cilium agent policy
revision wait in this case. As our tests typically (never?) have no
NPs applied, we have not actually waited for CNP or CCNP changes to
take place in all Cilium PODs before proceeding with the tests. This
may have caused CI flakes.

Fix this by removing the code that checks for the presence of NPs.

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 693163e ]

MLH assumes a flake is a new one if the similarity to existing flakes is
below 75%. This threshold is a bit low for flakes affecting the same
test but failing with a different error message. We can adjust to 85%
and see.

Related: cilium#17270.
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 988e26e ]

Following discussion in the community meeting, we decided to rename the
project/ci-force label to ci/flake. We need to rename it in MLH and the
issue template.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit e562904 ]

s/ethool/ethtool/

Also fix the build for other non-Linux platforms besides macOS.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f05a344 ]

Avoid reallocations and GC pressure in the loop.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 6a5490f ]

There is no need to recompile the constant regexp for each iteration of
the loop. Also, hashEncryptionKeys is called in a loop, so move the
regexp compilation to a global var to avoid recompiling it.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 64940c4 ]

Instead of open-coding a worker pool use the existing
github.com/cilium/workerpool package. Also limit the number of workers
to the number of CPUs by default, which should help to reduce excessive
memory usage by too many parallel goroutines at the price of potentially
slightly slower bugtool report creation.

If users want more parallel tasks, the number can be specified using the
newly introduces `--parallel-workers` option.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 8bcc4e5 ]

Converting []byte to string can cause allocation due to the fact that
strings are immutable in Go. By letting execCommand return []byte
instead of string we can avoid some of these allocations, i.e also the
allcoation caused by further processing of that return value.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 8d4ec81 ]

When the command output doiesn't need to be postprocessed, the output
can be written directly to the file without buffering. This should
significantly reduce memory usage.

On my test system this reduces total RSS by about one third:

Before:

    $ /usr/bin/time -f 'RSS=%MKB' cilium-bugtool
    RSS=63168KB

After:

    $ /usr/bin/time -f 'RSS=%MKB' cilium-bugtool
    RSS=42752KB

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1b6a98c ]

When both IPv4 and IPv6 are enabled, we split the to/from-container BPF
programs into two code paths, one for each IP family, to reduce program
size and complexity. Because our existing K8sVerifier test only covers
the IPv4+IPv6 configuration, new complexity and program size issues
sneaked in for the IPvX-only setups.

These new issues occur when to/from-container contain both the initial IP
parsing code and the IPv4 (resp. IPv6---we have one issue per family)
code path. Splitting these programs such that they only contain the
initial IP parsing code is enough to fix these issues.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 7bf60a5 ]

The local NodeDiscovery implementation was previously informing the rest
of the Cilium agent that the local node's identity is "Remote Node"
because of the statically initialized "identity.GetLocalNodeID" value.
However, that value should only ever be used for external workloads
cases in order to prepare the source identity used for transmitting
traffic to other Cilium nodes. It should not be used for locally
determining the identity of traffic coming from the host itself.

Fix this by hardcoding the identity to "Host" identity.

Fixes: c864fd3 ("daemon: Split IPAM bootstrap, join cluster in between")

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno
Copy link
Member Author

pchaigno commented Nov 16, 2021

Fixed an issue in backport commit 54664f2 which was causing complexity issues on 5.4. The datapath configurations were missing NEEDS_RELAX_VERIFIER. Added a note to that regard in the commit's backporting notes.

test-backport-1.9

@maintainer-s-little-helper
Copy link

Job 'Cilium-PR-K8s-1.17-kernel-5.4' hit: #17617 (93.95% similarity)

@maintainer-s-little-helper
Copy link

Job 'Cilium-PR-K8s-1.17-kernel-4.9' has 1 failure but they might be new flake since it also hit 1 known flake: #17617 (94.18)

@ti-mo
Copy link
Contributor

ti-mo commented Nov 17, 2021

GKE tests failing with:
21:30:52 Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ConfigMap, namespace: kube-system, name: cilium-config

Also not sure if the 4.9 and 5.4 test failures need to be considered flakes, since they both fail for the same reason: #17617 (comment)
networkPlugin cni failed to set up pod "coredns-7dcc479575-mftvg_kube-system" network: unable to allocate IP via local cilium agent: [POST /ipam][502] postIpamFailure range is full

@gandro
Copy link
Member

gandro commented Nov 23, 2021

GKE has been broken on v1.9 for a while: https://jenkins.cilium.io/view/Cilium-v1.9/job/cilium-v1.9-k8s-gke/

But the failure in the upgrade test worries me a bit, as it indicates some incompatibilities

Edit: On the regular Jenkins builds, these tests don't seem to fail. I'll restart the failed test to see if it's a flake.

@gandro
Copy link
Member

gandro commented Nov 23, 2021

/test-1.17-4.9

Edit: Unclear failure https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/console

@gandro
Copy link
Member

gandro commented Nov 23, 2021

/test-1.17-5.4

Edit: Failed during provisioning, looks like an NFS or I/O problem? https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-5.4/129/console

@maintainer-s-little-helper
Copy link

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

[empty]

Failure Output


If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9 so I can create a new GitHub issue to track it.

@gandro
Copy link
Member

gandro commented Nov 23, 2021

/test-1.17-5.4

@gandro
Copy link
Member

gandro commented Nov 23, 2021

/test-1.17-4.9

Edit: It keeps failing without any error https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/

@pchaigno
Copy link
Member Author

Edit: It keeps failing without any error https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/400/

It seems the VM startup hit the 15m timeout. The tests themselves then ended after exactly 1h30, which looks suspiciously like a timeout but timeout value should be 3h. Maybe worth comparing all the timeouts in the Jenkinsfile with v1.10 and master to see if we're not just missing a fix on timeouts?

@gandro
Copy link
Member

gandro commented Nov 24, 2021

It seems the VM startup hit the 15m timeout. The tests themselves then ended after exactly 1h30, which looks suspiciously like a timeout but timeout value should be 3h. Maybe worth comparing all the timeouts in the Jenkinsfile with v1.10 and master to see if we're not just missing a fix on timeouts?

There is a 98 minute timeout here that I don't see on master:

The Jenkinsfile setup on v1.9 is completely different than on master. I'm not familiar with the setup, would need to spend some time understanding it first.


I'm going ahead and merging this PR. As I said, GKE has been broken on v1.9 for a while and the flake on the upgrade test (ipam: range is full) needs to be investigated, but it seems unrelated, as I don't see any changes at all to the endpoint control plane in this PR.

For future reference: #17617

@gandro gandro merged commit a4d175d into cilium:v1.9 Nov 24, 2021
@pchaigno pchaigno deleted the pr/v1.9-backport-2021-11-09 branch November 24, 2021 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants