Fix cilium installation in GCloud beta "rapid" channel #9959

Merged
tgraf merged 1 commit into cilium:master from joestringer:submit/fix-gke-rapid on Jan 24, 2020

Conversation

joestringer
Member

@joestringer joestringer commented Jan 24, 2020

If the nodeinit script runs on a node where the BPFFS is already
mounted, then it is highly likely that existing infrastructure handles
the auto-mounting of the BPF filesystem to /sys/fs/bpf, for instance
because the node is running systemd 239 or later.

In this case, skip creating the BPFFS mount unit configuration for
systemd; otherwise systemd may reject it with:

  Failed to start sys-fs-bpf.mount:
  Unit sys-fs-bpf.mount has a bad unit file setting.

This is because systemd treats mounting the BPFFS as part of its own
domain of control, which others should not configure.
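
For illustration, the conditional described above can be sketched roughly as follows. This is a minimal sketch, not the actual nodeinit script from this PR: the helper names `bpffs_mounted` and `bpffs_action` are hypothetical, and it assumes the node exposes a mount table in the usual /proc/mounts format.

```shell
#!/bin/sh
# Hypothetical helper (not from the Cilium chart): succeeds if a bpf
# filesystem is mounted at the given mount point, according to a mount
# table in /proc/mounts format.
bpffs_mounted() {
    mount_point="${1:-/sys/fs/bpf}"
    mounts_file="${2:-/proc/mounts}"
    # Fields: device, mount point, filesystem type, options, dump, pass.
    awk -v mp="$mount_point" \
        '$2 == mp && $3 == "bpf" { found = 1 } END { exit !found }' \
        "$mounts_file"
}

# Hypothetical decision function: prints the action the node init step
# would take. Writing the systemd mount unit is intentionally not shown.
bpffs_action() {
    if bpffs_mounted "$@"; then
        # Something else (e.g. systemd itself) already owns the mount.
        echo "skip"
    else
        echo "create-mount-unit"
    fi
}
```

Calling `bpffs_action` with no arguments checks /sys/fs/bpf against /proc/mounts. Checking for the mount itself, rather than the systemd version, also covers nodes where something other than systemd mounts the BPFFS.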

This caused issues in beta gcloud environments, where nodeinit would
fail because the systemd mount unit would not mount. This prevented
configuration of the kubelet on each node, meaning that Cilium would not
be configured as the CNI in the environment. However, the Cilium
DaemonSet itself would still run, so pods would attempt to connect to
remote services (e.g. kube-dns to the API server) and fail, even though
the Cilium agents themselves otherwise appeared to be healthy. A cursory
look at `cilium endpoint list` in such environments would clearly show
that no pods are being managed by Cilium.

Fixes: #9556


@joestringer joestringer added pending-review sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. labels Jan 24, 2020
@joestringer joestringer requested a review from a team January 24, 2020 01:22
@maintainer-s-little-helper

Release note label not set, please set the appropriate release note.


@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.7.0 Jan 24, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.6 Jan 24, 2020
@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jan 24, 2020
@joestringer joestringer changed the title from `helm: Make nodeinit systemd mountpoint conditional` to `Fix cilium installation in GCloud beta "rapid" channel` Jan 24, 2020
Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer
Member Author

joestringer commented Jan 24, 2020

test-me-please

EDIT: Only these failed:

Suite-k8s-1.11.K8sHealthTest checks cilium-health status between nodes
Suite-k8s-1.17.K8sPolicyTest Basic Test Redirects traffic to proxy when no policy is applied with proxy-visibility annotation Tests DNS proxy visibility without policy

So the failures are not consistent (one per k8s version) and not related to BPFFS (restore etc.). I'm not even sure where nodeinit is used in the CI. The k8s tests with other versions all passed, though, so I think it's good from a CI perspective.

@coveralls

coveralls commented Jan 24, 2020

Coverage Status

Coverage increased (+0.002%) to 45.976% when pulling 79d2ef2 on joestringer:submit/fix-gke-rapid into b2c9b07 on cilium:master.

@tgraf tgraf merged commit 1ab474d into cilium:master Jan 24, 2020
1.7.0 automation moved this from In progress to Merged Jan 24, 2020
@joestringer joestringer deleted the submit/fix-gke-rapid branch January 24, 2020 17:59
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.6 Jan 31, 2020
@aanm aanm mentioned this pull request Jan 31, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.6 Feb 3, 2020
Labels
release-note/bug This PR fixes an issue in a previous release of Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
No open projects
1.6.6
Backport done to v1.6
1.7.0
  
Merged
Development

Successfully merging this pull request may close these issues.

Multiple issues installing Cilium on latest GKE with CoS 77 (rapid channel)
4 participants