Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.9 backports 2020-10-23 #13720

Merged
merged 41 commits into from
Oct 23, 2020
Merged

v1.9 backports 2020-10-23 #13720

merged 41 commits into from
Oct 23, 2020

Conversation

gandro
Copy link
Member

@gandro gandro commented Oct 23, 2020

Once this PR is merged, you can update the PR labels via:

$ for pr in 13683 13685 13673 13668 13675 13667 13327 13698 13695 13693 13697 13703 13699 13704 13689 13691 13543 13694 13692 13696; do contrib/backporting/set-labels.py $pr done 1.9; done

christarazi and others added 30 commits October 23, 2020 11:04
[ upstream commit 6747a84 ]

The metadata resolver controller can be put in a scenario where it can
never succeed if the Cilium K8s pod cache is out-of-sync with the
current view of pods in the cluster. The cache can be out-of-sync if
there's heavy load on the cluster and especially on the apiserver. This
leaves the Cilium pod cache starved. When the pod cache is starved,
Cilium may never have an up-to-date view of the pods running in the
cluster, and therefore may never be able to resolve labels for pods
because the fetch will return "pod not found".

Eventually, kubelet (or maybe even the user) will give up on the pod and
remove it, thereby Cilium begins removing the endpoint. The controller
is stopped, but the goroutine waiting on the `done` channel will never
receive, hence the leak.

This commit fixes the goroutine leak when the controller never succeeds
and therefore, never closes the `done` channel. The fix is to add a
`select` statement to also watch the endpoint's `aliveCtx` which is
cancelled (and the context's `Done` channel is closed) when the endpoint
is deleted.

This commit was validated by forcing the metadata resolver to never find
a pod (manually hardcode the wrong pod name), and ensuring that the
`gops stack` output does not contain an entry like:

```
goroutine 1244259 [chan receive, 1434 minutes]:
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).RunMetadataResolver.func1(0xc0055d2a20, 0xc0049ff600, 0xc0055d2ae0, 0x5d)
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1531 +0x34
created by github.com/cilium/cilium/pkg/endpoint.(*Endpoint).RunMetadataResolver
	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1530 +0x11e
```

Fixes: #13680

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit cb9aa3e ]

This controller's run interval is actually unnecessary as this
controller is intended to only run once as long as it succeeds. This is
evidenced by the fact that we have a `done` channel that's closed when
the controller succeeds and a goroutine waiting `done` to be closed to
remove the controller, thereby stopping it.

In addition, in the very unlikely case that the controller is scheduled
again to run _and_ the goroutine waiting for the `done` channel to be
closed has not run yet, the controller could panic because it would
close the channel twice. This commit prevents that as the interval
defaults to 10m if not provided. If the goroutine waiting on the channel
doesn't run within 10m, then we very likely have much worse problems at
hand.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit e8bce62 ]

As of Azure Linux kernel 5.4.0-1022 the necessary patches to the
hv_netvsc have been backported, see [1]. This allows to run NodePort XDP
on Azure AKS when using the Ubuntu 18.04 node image which provides said
kernel version.

[1] https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1877654

Update the instructions accordingly. In addition the nodePort device
helm option needs to be set explicitly to eth0, as otherwise bpf_host.o
would erroneously be bound to the azure0 interface.

For #13627

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 251fd53 ]

Fixes:

- Fix Helm option: "--set devices=ethX,ethY" does not work (missing
  curly braces and quotes).
- Rename one occurrence of "access=ssh" into "node-access=ssh", for
  consistency.
- Add a missing closing parenthesis at the end of the command that sets
  the HOST_EP_ID variable.
- Use jsonpath and kubectl's options in one command to avoid piping the
  output to jq.

Additions:

- Explain what the interfaces passed to the "--devices" option are for.
  This is mostly copied from the description in the Host Policies
  description (Layer 3 Examples).
- Explain that per-endpoint routing is not compatible with the host
  firewall and must be disabled, in particular on managed environments
  like GKE.
- Add node label removal to the clean-up section.
- Format shell session code as shell session code.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 1cd20c8 ]

Refs: #13627

Signed-off-by: Tom Payne <tom@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 3c97c22 ]

The upstream types have a lot of information that Cilium does not need
so we can slim them down.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 71b4fb5 ]

In upstream Kubernetes the core, discovery and networking packages need
to be in the api package.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 1314600 ]

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 620b243 ]

When Cilium is waiting for the CRDs to be available in the cluster,
Cilium does not need process entire CRD nor Table objects as it is only
checking for the existence of the CRD, by checking its name. Thus, we
can use a slimmer version of the CRD as well as a slimmer version of the
Table structures to avoid resource consumption during initialization.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit ac34b34 ]

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 3a803bf ]

During initialization of CRDs, if Cilium can't watch for POMs it can
fall back to watch for v1beta1.Table, instead of processing the entire
CRDs.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 228a485 ]

During the endpoint restoration process, when we parse the endpoints, we
assign them a reserved init identity if they don't already have an
identity [0]. If we later remove the endpoint (because the corresponding
K8s pod or interface are missing), we attempt to remove the identity
from the identity manager. That last operation results in the following
error message because the init identity was never added to the manager.

  level=error msg="removing identity not added to the identity manager!" identity=5 subsys=identitymanager

This commit fixes it by skipping the removal attempt from the manager in
the case of identity init.

0 - https://github.com/cilium/cilium/blob/80a71791320df34df5b6252b9680553e38d88d20/pkg/endpoint/endpoint.go#L819
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 8cd003c ]

An example CLI output:

cuiwl@cuiwl:~$ kubectl exec -itn kube-system anetd-jgsvb -- cilium lrp list
cuiwl@cuiwl:/google/src/cloud/cuiwl/LRP/google3$ kubectl exec -itn kube-system anetd-v2j52 -- cilium lrp list
LRP namespace   LRP name   FrontendType   Matching Service
default         lrp-addr   IP + port
                |          169.254.169.254:8080/TCP ->
kube-system     lrp-svc    clusterIP + named ports   kube-system/kube-dns
                |          10.84.48.10:53/UDP -> 10.100.1.2:53(kube-system/node-local-dns-wgbjm),
                |          10.84.48.10:53/TCP -> 10.100.1.2:53(kube-system/node-local-dns-wgbjm),

Signed-off-by: Weilong Cui <cuiwl@google.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 64e681d ]

This commit is to add required argument for `vagrant port` command to
avoid below error

```shell script
$ vagrant port --guest 6443
The
provider reported there are no forwarded ports for this virtual
machine.
This can be caused if there are no ports specified in the
Vagrantfile or
if the virtual machine is not currently running. Please
check that the
virtual machine is running and try again.
```

Relates to b68af48

Signed-off-by: Tam Mach <sayboras@yahoo.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 4931f6b ]

For #13627

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 2c2b5de ]

For #13627

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit bfc0078 ]

This release fixes a bug which prevents certain environment variables
from being used for configuration.

See the release notes[0] for details.

[0]: https://github.com/cilium/hubble/releases/tag/v0.7.1

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 1672c81 ]

For #13627

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit f648be6 ]

This commit unifies the `cilium_version` and `cilium_tag` variables in
the Vagrantfile for the getting started tutorial into a single variable,
`cilium_version`.

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit a8e67f1 ]

When submitting the backport PR using submit-backport, the script
proposes to update the labels (i.e., remove needs-backport/X and add
backport-pending/X):

    Sending pull request...
    Everything up-to-date
    #13700

    Updating labels for PRs 13383 13608 12975

    Set labels for all PRs above? [y/N] y
    Setting labels for PR 13383... ✓
    Setting labels for PR 13608... ✓
    Setting labels for PR 12975... ✓

The choice defaults to not updating the labels. That may give the wrong
impression that it is an optional step---and if you're like me, when
you're unsure what an optional step does, you skip it. We should default
to setting the labels because we later rely on the labels being set
(e.g., when we update them after the PR is merged).

This commit changes it to the following:

    Sending pull request...
    Everything up-to-date
    #13700

    Updating labels for PRs 13383 13608 12975

    Set labels for all PRs above? [Y/n]
    Setting labels for PR 13383... ✓
    Setting labels for PR 13608... ✓
    Setting labels for PR 12975... ✓

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit ec16cab ]

If the controller that is used for label resolution fails, the
prometheus metrics will increase its cardinality since the uniquely
controller name was being used as a prometheus label. To avoid this, we
will reference these warnings with a common subsystem name,
'resolve-labels'.

Fixes: a31ab29 ("endpoint: Run labels controller under ep manager")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 770bad5 ]

".. code:: shell-session" does not work to provide syntax highlighting
for shell sessions, this needs to be ".. code-block:: shell-session".

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit e168e09 ]

Helm 2 support was deprecated as of commit 31f1306 ("feat(helm):
Move requirements.yaml to Chart.yaml"), and will reach end of support
from upstream on November 13, 2020 [0]. Remove references to Helm 2.

[0] https://helm.sh/blog/helm-v2-deprecation-timeline/

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 532bbd5 ]

While proofreading the upgrade guide, I found some phrasings awkward and
some duplicate text. Improve it by fixing the grammar / tense and
rewording sentences that flow poorly.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 76505f7 ]

This paragraph seemed to be vaguely referencing some section that might
exist in the future at some point, which doesn't make any sense. We can
only refer users to the documentation that exists below. Do so.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 6b4e3c0 ]

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit e62fd90 ]

Since Cilium 1.6, we have provided Helm charts and have instructed users
to upgrade with full YAML updates (either directly through helm or by
templating first). There's no point in having a column for handling
partial upgrades any more, remove it.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 7058593 ]

Try to make it more clear to users how they should specify the options
in the primary command for upgrading the deployment, to prevent them
from using 1.8.x helm options which are no longer recognized by Cilium
1.9 Helm charts.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 88cf4df ]

The Helm chart one isn't interpreted by sphinx so goes to a dead link,
the api ratelimiting one was using double graves rather than single
which meant it would highlight the text rather than linking it, and the
preflight section vaguely referred to earlier in the document even
though it's super easy to just directly link it.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit d12aacd ]

Commit c8360a6 ("test/k8s: keep configmap across upgrade test")
introduced the option KeepDeprecatedProbes which allowed smooth upgrade
from 1.7 or earlier to 1.8 or later by generating cilium-agent DaemonSet
health probes in the old style if the user specified the
keepDeprecatedProbes option during helm install.

However, it didn't take into account users upgrading from fresh 1.8.x
installs to 1.9.x or later, where deprecated probes should never be
used.

Teach this setting about "upgradeCompatibility" so that when the user
either directly specifies the option, or if they provide the appropriate
upgrade compatibility version, the deprecated probes will only be used
if the initial install was with Cilium 1.7 or earlier.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
joestringer and others added 11 commits October 23, 2020 11:04
[ upstream commit ba7c822 ]

The upgradeCompatibility option was not specified in the default
instructions, but there was a more specific option that provided only
one piece of upgrade compatibility. Convert the main instructions to use
the common variable for providing smooth upgrade which allows control
over multiple options.

While we're at it, fix the indentation for the values file instructions.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 7d356ff ]

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 9de5fd6 ]

The way this was written, you could never disable the l7 proxy because
the helm chart ignored the setting if it was false.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit ac29aa7 ]

These settings cannot be specified by default in the values.yaml file,
because otherwise it would override the upgradeCompatibility value for
the flag and hence break compatibility during upgrade from
v1.7 (no such option) -> v1.8 (should be disabled during upgrade) -> v1.9.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 26878ec ]

[Forward-cherry-pick]

With this option users will be able to deploy Cilium without envoy
support which is helpful for arm64 clusters.

Related: #13650

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
…se cases

[ upstream commit 4d07a34 ]

Update the getting started guide with the kiam use case.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 92f0a9b ]

Signed-off-by: Weilong Cui <cuiwl@google.com>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 767a1b8 ]

This function can suffer from connectivity loss between cluster and
external IP, which is not dependant on Cilium. These retries will help
reduce flakiness.

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit c1af153 ]

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 2eea58f ]

- Use title case where appropriate.
- Fix a couple syntax and punctuation errors.
- helm -> Helm

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 5733f6a ]

This change is only adding more unit tests to better understand the
behavior of the labelsfilter as well as improving the documentation for
the expectation of filtering labels.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro requested review from a team as code owners October 23, 2020 09:05
@gandro gandro added backport/1.9 kind/backports This PR provides functionality previously merged into master. labels Oct 23, 2020
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for my change

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good 👍

@gandro
Copy link
Member Author

gandro commented Oct 23, 2020

test-backport-1.9

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good for my one line change 😅

Copy link
Member

@jibi jibi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for my changes, thanks!

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you.

@gandro
Copy link
Member Author

gandro commented Oct 23, 2020

GKE failed in Suite-k8s-1.16.K8sIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/2922/

Seems like the pods did not pull the images in time? Retesting.

@gandro
Copy link
Member Author

gandro commented Oct 23, 2020

retest-gke

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 23, 2020
@aanm aanm merged commit 9731320 into v1.9 Oct 23, 2020
@aanm aanm deleted the pr/v1.9-backport-2020-10-23 branch October 23, 2020 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet