v1.11 backports 2021-12-10 #18232

Merged
merged 21 commits into from
Dec 16, 2021

Conversation

nbusseneau
Member

@nbusseneau nbusseneau commented Dec 10, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 17982 17990 18133 18123 18148 18155 18170 18166 17939 18117 18167 18186 18200 18203 18209 18208 18137 18217; do contrib/backporting/set-labels.py $pr done 1.11; done

pchaigno and others added 19 commits December 10, 2021 18:00
[ upstream commit 3bd4ad6 ]

We use path filters in the CodeQL workflow to avoid running it for
unrelated changes. However, we were missing the workflow file itself in
the path filters. As a result, the CodeQL workflow isn't run when the
GitHub Actions it uses are updated by Dependabot. This commit fixes that.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit fdfd78d ]

The microk8s target previously relied on an env var declared at the
top-level of the Makefile, which will conflict with an upcoming commit.
Move it into a target-specific variable. This also means we can remove
the redundant tagging operation.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit b12ecd1 ]

Add a new kind-image target, mirroring the microk8s target for building
and tagging the image.

To use:

$ make kind
<install Cilium>
$ make kind-image
$ kubectl -n kube-system set image daemonset/cilium cilium-agent=localhost:5000/cilium/cilium-dev:local

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 444c58c ]

It seems that in some cases (e.g. for RCs) the minor version reported by
`kubectl version --client -o json` contains trailing non-numeric
characters, e.g.

```
{
  "clientVersion": {
    "major": "1",
    "minor": "23+",
    "gitVersion": "v1.23.0-rc.0",
    "gitCommit": "a117a51aa497a516106a3115963437218493c8d2",
    "gitTreeState": "clean",
    "buildDate": "2021-11-24T05:31:49Z",
    "goVersion": "go1.17.3",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
```

This breaks the check where the reported version is compared against the
current k8s env version, leading to `kubectl` being downloaded by each
test and creating flakes such as

```
cmd: "curl --output /tmp/kubectl/1.23/kubectl https://storage.googleapis.com/kubernetes-release/release/v1.23.0-rc.0/bin/linux/amd64/kubectl && chmod +x /tmp/kubectl/1.23/kubectl" exitCode: 23 duration: 90.516369ms stdout:

err:
exit status 23
stderr:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Warning: Failed to create the file /tmp/kubectl/1.23/kubectl: Text file busy
^M  0 44.4M    0  1177    0     0  14530      0  0:53:25 --:--:--  0:53:25 14530
curl: (23) Failed writing body (0 != 1177)

FAIL: failed to ensure kubectl version: failed to download kubectl
```

Here, the `kubectl` binary from another test is still in use while the
current test attempts to overwrite it.

Fix this by stripping any trailing non-numeric characters from the k8s
minor version for the purpose of version comparison.
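
As a rough illustration of the fix (not the actual test helper), here is a
minimal Go sketch that strips trailing non-numeric characters before
comparing minor versions; the function and variable names are hypothetical:

```
package main

import (
	"fmt"
	"strings"
)

// stripTrailingNonNumeric trims any trailing non-digit characters from a
// version component, so the "23+" reported for pre-releases becomes "23".
func stripTrailingNonNumeric(s string) string {
	return strings.TrimRightFunc(s, func(r rune) bool {
		return r < '0' || r > '9'
	})
}

func main() {
	reported := "23+" // minor version from `kubectl version --client -o json`
	expected := "23"  // minor version of the current k8s test environment
	fmt.Println(stripTrailingNonNumeric(reported) == expected) // prints: true
}
```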

Fixes: 75fbebb ("test/helpers: use rc.0 as the default version of kubectl")
Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 6c169f6 ]

Signed-off-by: Timo Reimann <ttr314@googlemail.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 4b16aee ]

Calculating MaximumAllocationIdentity during module initialization is not
enough, as the user-specified 'ClusterID' has not been parsed yet during
that phase.

This patch fixes the problem by recalculating the maximum identity during
runtime initialization when `ClusterID > 0`.
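
A minimal Go sketch (not Cilium's actual code) of why the recalculation
matters, assuming the common layout where the ClusterID occupies the bits
above the 16-bit cluster-local identity range; the constant and function
names here are assumptions:

```
package main

import "fmt"

// Assumption: cluster-local identities use the low 16 bits and the
// ClusterID occupies the bits above them.
const clusterIDShift = 16

// maximumAllocationIdentity returns the highest identity that may be
// allocated for the given cluster ID.
func maximumAllocationIdentity(clusterID uint32) uint32 {
	return (clusterID+1)<<clusterIDShift - 1
}

func main() {
	// At module initialization the config has not been parsed yet, so the
	// ClusterID is still 0 and the maximum is too small for clustermesh.
	fmt.Println(maximumAllocationIdentity(0)) // 65535
	// Recalculated during runtime initialization once ClusterID > 0 is known.
	fmt.Println(maximumAllocationIdentity(3)) // 262143
}
```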

Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit f4d59d1 ]

Minikube 1.12.0 or later is required to use the --cni flag [1].

1 - kubernetes/minikube@9e95435
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 11c17f0 ]

This reverts commit ca4ed8d ("test/Services: Quarantine 'Checks service on
same node'"). The original intention was to quarantine the test from cilium#17919.
However, ca4ed8d quarantined the /wrong/ test case, which was not the failing
one: the code in the mentioned commit only covers the IPv6 case, while in the
test the IPv4 case was failing. Meanwhile, the IPv4 test for 'Checks service on
same node' was still left active on master, and looking at the CI dashboard for
the last few days on master, there have been no failures related to any of the
'Checks service on same node' tests.

Closes: cilium#17919
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 018be31 ]

Alexander reports in GitHub issue 18023 that establishing a connection
via an FQDN policy, then modifying that FQDN policy, would cause
subsequent traffic to the FQDN to be dropped, even if the new policy
still allowed the same traffic via a toFQDN statement.

This was caused by overzealous release of CIDR identities while
generating a new policy. Although the policy calculation itself keeps
all selectorcache entries alive during the policy generation phase (see
cachedSelectorPolicy.setPolicy()), once the new policy is inserted into
the PolicyCache the distillery package cleans up the old policy. As part
of that cleanup, it calls the RemoveSelectors() function for the
individual selectors.

The previous implementation of this logic unintentionally released the
underlying identities any time a user of a selector was released, rather
than only releasing the underlying identities when the number of users
reached zero and the selector itself would be released. This meant that
rather than the selectorcache retaining references to the underlying
identities when a policy was updated, instead the references would be
released (and all corresponding BPF resources cleaned up) at the end of
the process. This then triggered subsequent connectivity outages.

Fix it by only releasing the identity references once the cached
selector itself is removed from the SelectorCache.
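
A minimal Go sketch of the intended reference-counting behaviour; this is
not the actual SelectorCache implementation, and the types and names below
are hypothetical stand-ins:

```
package main

import "fmt"

// cachedSelector is a hypothetical stand-in for a selector cached in the
// SelectorCache, with the policies using it and the identities it selects.
type cachedSelector struct {
	users      map[string]struct{}
	identities []uint32
}

// removeUser drops one user and reports whether the selector itself should
// now be removed from the cache; only then may its identity references
// (and the corresponding BPF resources) be released.
func (c *cachedSelector) removeUser(user string) (lastUser bool) {
	delete(c.users, user)
	return len(c.users) == 0
}

func main() {
	sel := &cachedSelector{
		users:      map[string]struct{}{"old-policy": {}, "new-policy": {}},
		identities: []uint32{16777217}, // e.g. a CIDR identity
	}
	// The old policy is cleaned up after the new one was inserted: the
	// identities must NOT be released here, the new policy still needs them.
	if sel.removeUser("old-policy") {
		fmt.Println("releasing identities:", sel.identities)
	} else {
		fmt.Println("selector still in use, keeping identities")
	}
}
```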

Fixes: f559cf1 ("selectorcache: Release identities on selector removal")
Reported-by: Alexander Block <ablock84@gmail.com>
Suggested-by: Jarno Rajahalme <jarno@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 80ff05a ]

Test that when FQDN policy is updated to a new policy which still
selects the same old FQDN destination, connectivity continues to work.

Validated to fail without the previous commit:

    K8sFQDNTest
    /home/joe/git/cilium/test/ginkgo-ext/scopes.go:473
      Validate that FQDN policy continues to work after being updated [It]
      /home/joe/git/cilium/test/ginkgo-ext/scopes.go:527

      Can't connect to to a valid target when it should work
      Expected command: kubectl exec -n default app2-58757b7dd5-rh7dd --
        curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5
            --max-time 20 --retry 5 http://vagrant-cache.ci.cilium.io
            -w "time-> DNS: '%{time_namelookup}(%{remote_ip})',
                Connect: '%{time_connect}',
                Transfer '%{time_starttransfer}',
                total '%{time_total}'"
      To succeed, but it failed:
      Exitcode: 28
      Err: exit status 28
      Stdout:
             time-> DNS: '0.000016()', Connect: '0.000000',Transfer '0.000000', total '5.000415'
      Stderr:
             command terminated with exit code 28

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 09f5b56 ]

Currently there is a potential 15-second delay in updating the CiliumNode CRD
after an IP is allocated to a pod. In the meantime, the operator can determine
that the node has excess IPs and release an IP, causing pod connectivity
issues.

A new operator flag `excess-ip-release-delay` is added to control how long the
operator should wait before considering an IP for release (defaults to 180
seconds). This is done to better handle IP reuse during rollouts. The operator
and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to
exchange information during the handshake.
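
A minimal Go sketch of the effect of the delay (not the operator's actual
implementation; the function and variable names are hypothetical):

```
package main

import (
	"fmt"
	"time"
)

// releasable reports whether an IP that became excess at markedExcessAt may
// be handed to the release handshake at time now.
func releasable(markedExcessAt, now time.Time, releaseDelay time.Duration) bool {
	return now.Sub(markedExcessAt) >= releaseDelay
}

func main() {
	delay := 180 * time.Second // default of the new excess-ip-release-delay flag
	marked := time.Now().Add(-30 * time.Second)

	// 30 seconds ago is too recent: the IP may just have been allocated to a
	// pod whose CiliumNode CRD update has not landed yet.
	fmt.Println(releasable(marked, time.Now(), delay)) // false

	// Well past the delay, the IP is a candidate for the release handshake.
	fmt.Println(releasable(marked, time.Now().Add(5*time.Minute), delay)) // true
}
```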

Fixes: cilium#13412

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 5a4dd80 ]

After the handshake is complete and the operator is done releasing an IP,
the CiliumNode status (release-ips) and spec (pool) are updated in two
consecutive requests. There is a tiny window between the two updates where
the entry has already been removed from .status.ipam.release-ips but the IP
is still present in .spec.ipam.pool, so the IP could be allocated between
these requests.

This commit introduces a new state called "released" to deal with this. The
agent now removes the entry from release-ips only once the IP has also been
removed from .spec.ipam.pool.
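
A minimal Go sketch of the agent-side condition (not the actual agent code;
the types and field names below are stand-ins for the CRD fields mentioned
above):

```
package main

import "fmt"

// ciliumNode is a hypothetical stand-in for the relevant CiliumNode fields.
type ciliumNode struct {
	pool       map[string]struct{} // stand-in for .spec.ipam.pool
	releaseIPs map[string]string   // stand-in for .status.ipam.release-ips (ip -> state)
}

// reconcileReleased drops "released" entries from release-ips, but only for
// IPs that are no longer present in the pool, so the IP cannot be handed out
// again during the window between the two CiliumNode updates.
func (n *ciliumNode) reconcileReleased() {
	for ip, state := range n.releaseIPs {
		if _, inPool := n.pool[ip]; state == "released" && !inPool {
			delete(n.releaseIPs, ip)
		}
	}
}

func main() {
	node := &ciliumNode{
		pool:       map[string]struct{}{"10.0.0.5": {}}, // spec not updated yet
		releaseIPs: map[string]string{"10.0.0.5": "released"},
	}
	node.reconcileReleased()
	fmt.Println(len(node.releaseIPs)) // 1: kept until the IP leaves the pool

	delete(node.pool, "10.0.0.5") // the operator's spec update arrives
	node.reconcileReleased()
	fmt.Println(len(node.releaseIPs)) // 0
}
```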

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 82d4422 ]

Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. cilium#16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 337b2c9 ]

If the developer set the imagePullPolicy to Always, then kind would
always end up pulling the image version from the localhost:5000 registry
instead of using the version we sideload in the final step here. Avoid
this by unconditionally pushing the image to the local registry.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 1305bab ]

Commit f665152 ("test: Use stable tags instead of :latest")
switched most use of image tags to stable tags, but omitted one
occurrence of the zookeeper image in the runtime kafka tests. Switch it
to the stable docker.io/cilium/zookeeper:1.0 as well.

Fixes: f665152 ("test: Use stable tags instead of :latest")
Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 09e14df ]

This commit adds a workflow parameter to allow for a developer-defined
image suffix. It's useful if developers don't want to use the default
"beta" suffix for the Docker image repository.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 7d74863 ]

Signed-off-by: Dmitry Kharitonov <dmitry@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 701967f ]

The variable was not fully enclosed in quotes, as observed via failures
in `make -C install/kubernetes` while backporting this to older branches.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit f584267 ]

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
@nbusseneau nbusseneau requested review from a team as code owners December 10, 2021 17:03
@nbusseneau nbusseneau requested review from a team December 10, 2021 17:03
@nbusseneau nbusseneau requested a review from a team as a code owner December 14, 2021 14:50
@nbusseneau
Member Author

nbusseneau commented Dec 14, 2021

/test-backport-1.11

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:

Test Name: K8sDatapathConfig Host firewall With VXLAN

Failure Output: FAIL: Failed to reach 192.168.56.12:80 from testclient-jtrsp

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

Member

@nathanjsweet nathanjsweet left a comment

LGTM

Member

@aditighag aditighag left a comment

Thanks!

@aditighag
Member

runtime failure is legitimate -

07:49:31      runtime: HINT: to fix this, run 'make -C Documentation update-helm-values'
07:49:31      runtime: Makefile:43: recipe for target 'check' failed
07:49:31      runtime: make[1]: Leaving directory '/home/vagrant/go/src/github.com/cilium/cilium/Documentation'
07:49:31      runtime: make[1]: *** [check] Error 1
07:49:31      runtime: Makefile:575: recipe for target 'postcheck' failed
07:49:31      runtime: make: *** [postcheck] Error 2

@nbusseneau
Member Author

Both Travis CI and the runtime tests hit issues because `make -C Documentation update-helm-values` needed to be run; adding this to the PR.

@nbusseneau
Member Author

nbusseneau commented Dec 15, 2021

1.22 / 4.19 hit #18138.

joestringer and others added 2 commits December 15, 2021 15:26
[ upstream commit 2bede03 ]

[ Backport notes: regenerated generated files on the release branch
  using `make -C install/kubernetes` and
  `make -C Documentation update-helm-values` ]

Generate the helm values docs during the release process.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit f542724 ]

Fixes: cilium#18204

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
@nbusseneau
Member Author

nbusseneau commented Dec 15, 2021

/test-runtime

Retriggering only runtime as all previous CI passed.

@nbusseneau
Member Author

nbusseneau commented Dec 15, 2021

The runtime test passed; Travis CI hit Docker Hub pull limits. I've retriggered Travis CI out of habit, but this can be merged afterwards.

@nbusseneau
Member Author

Travis CI passed, marking as ready to merge.

Labels
backport/1.11 This PR represents a backport for Cilium 1.11.x of a PR that was merged to main.
kind/backports This PR provides functionality previously merged into master.
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.