
[v1.10] Backport #18486 #18737

Closed · wants to merge 2 commits into from

Conversation

bmcustodio
Contributor

@bmcustodio commented Feb 8, 2022

Backport of #18486 to v1.10 (cc @joamaki).

bmcustodio and others added 2 commits February 8, 2022 11:19
The changes that we have been making to /etc/defaults/kubelet are reset
on node reboots, as, apparently, is the whole /etc directory, which
also means that /etc/cni/net.d/05-cilium.conf is removed.

This would not be a problem if our assumption held that the node taint we
recommend placing on nodes comes back after a reboot, but in practice it
doesn't.

Besides this, it seems that containerd will reinstate its own CNI
configuration file, and will do so well before Cilium has had the
chance to re-run on the node and re-create its CNI configuration,
causing pods to be assigned IPs by the default CNI rather than by Cilium
in the meantime.

This commit attempts to prevent that from happening by observing that
/home/kubernetes/bin/kubelet (i.e. the actual kubelet binary) is kept between
reboots and is executed concurrently with containerd by systemd. We leverage
this empirical observation to replace the kubelet binary with a wrapper script
that, under the required conditions, disables containerd, patches its
configuration, removes undesired CNI configuration files, re-enables
containerd, and becomes the kubelet.

[ upstream commit 36585e4 ]

Signed-off-by: Bruno Miguel Custódio <brunomcustodio@gmail.com>
Co-authored-by: Alexandre Perrin <alex@kaworu.ch>
Co-authored-by: Chris Tarazi <chris@isovalent.com>
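The sequence the commit message describes (disable containerd, patch its configuration, remove stray CNI config files, re-enable containerd, exec the real kubelet) could be sketched roughly as below. This is a hypothetical illustration, not the actual upstream script from commit 36585e4; the sed edit and the renamed `kubelet-real` path are assumptions made for the sketch:

```shell
#!/bin/bash
# HYPOTHETICAL sketch of the kubelet wrapper described above.
# The exact paths, the config patch, and the "kubelet-real" name are
# illustrative assumptions, not the upstream implementation.
set -euo pipefail

# Stop containerd so no pod sandbox is created with the default CNI.
systemctl stop containerd

# Illustrative patch: prevent containerd's CRI plugin from writing its
# own CNI config template back on startup.
sed -i 's|^\(\s*conf_template\s*=\).*|\1 ""|' /etc/containerd/config.toml || true

# Remove any CNI configuration the default CNI may have (re-)created.
rm -f /etc/cni/net.d/*.conf /etc/cni/net.d/*.conflist

# Bring containerd back now that the undesired configuration is gone.
systemctl start containerd

# Become the real kubelet, preserving all arguments (assumes the original
# binary was moved aside to kubelet-real when the wrapper was installed).
exec /home/kubernetes/bin/kubelet-real "$@"
```

The `exec` at the end is what lets the wrapper "become the kubelet": systemd keeps supervising the same process, so the unit file needs no changes.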
To prevent unmanaged pods when a GKE node is forcibly stopped and
re-created, and building on the observation that the node comes back
with the same name with pods already scheduled on it, we change the
recommended taint effect from NoSchedule to NoExecute. This causes any
previously scheduled pods to be evicted, preventing them from getting
IPs assigned by the default CNI. Given the nature of 'NoExecute', this
should not impact other environments, so we recommend it everywhere.

[ upstream commit b049574 ]

Signed-off-by: Bruno Miguel Custódio <brunomcustodio@gmail.com>
Co-authored-by: Tam Mach <sayboras@yahoo.com>
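For reference, applying the recommended taint with the new effect would look like the following; the node name is a placeholder, and the taint key is assumed to be Cilium's documented agent-not-ready taint:

```shell
# Taint the node with NoExecute so that pods already scheduled on a
# re-created node are evicted until Cilium removes the taint.
# <node-name> is a placeholder; the key assumes Cilium's documented taint.
kubectl taint nodes <node-name> node.cilium.io/agent-not-ready=true:NoExecute
```

Unlike NoSchedule, NoExecute also evicts pods that are already running on the node, which is what closes the window after a forced node re-creation.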
@bmcustodio requested a review from a team as a code owner February 8, 2022 11:23
@maintainer-s-little-helper bot added the backport/1.10 and kind/backports (This PR provides functionality previously merged into master.) labels Feb 8, 2022
joamaki
Contributor

joamaki commented Feb 8, 2022

/test-backport-1.10

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:


Test Name

K8sLRPTests Checks local redirect policy LRP connectivity

Failure Output

FAIL: Cannot retrieve services on cilium pod

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name

K8sLRPTests Checks local redirect policy LRP connectivity

Failure Output

FAIL: Cannot retrieve services on cilium pod

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.19 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sLRPTests Checks local redirect policy LRP connectivity

Failure Output

FAIL: Cannot retrieve services on cilium pod

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

qmonnet
Member

qmonnet commented Mar 28, 2022

@bmcustodio What's the status of this, please? Was dropping the backport expected? Can we please update the backport-pending/1.10 label on the original PR accordingly?

qmonnet
Member

qmonnet commented Mar 28, 2022

Apologies, I see this was backported in #18835 instead, sorry for the noise.

This pull request was closed.
Labels
kind/backports This PR provides functionality previously merged into master.