Skip to content

Commit

Permalink
image/runtime: Fix kube proxy and Cilium iptables and nftables collision
Browse files Browse the repository at this point in the history
[ upstream commit 369f3f9 ]

Cilium currently, chooses to use iptables-legacy or iptables-nft using an
iptables-wrapper script. The script currently does a simple check to see
if there are more than 10 rules in iptables-legacy and if so picks legacy
mode. Otherwise it will pick whichever has more rules nft or legacy. See
[1] for the original wrapper this is taken from.

This however can be problematic in some cases. We've hit an environment
where arguably broken pods are inserting rules directly into iptables
without checking legacy or nft. This can happen in cases of pods that
are older for example and use an older package of iptables before 1.8.4
that was buggy or missing nft altogether. At any rate when this happens
it becomes a race to see what pods come online first and insert rules into
the table and if its greater than 10 cilium will flip into legacy mode.
This becomes painfully obvious if the agent is restarted after the system
has been running and these buggy pods already created their rules. At this
point Cilium may be using legacy while kube-proxy and kubelet are running
in nft space. (more on why this is bad below).

We can quickly check this from a sysdump with a few one liners,

$ find . -name iptables-nft-save* | xargs wc -l
  1495 ./cilium-bugtool-cilium-1234/cmd/iptables-nft-save--c.md
$ find . -name iptables-save* | xargs wc -l
  109  ./cilium-bugtool-cilium-1234/cmd/iptables-save--c.md

here we see that a single node has a significant amount of rules in both
nft and legacy tables. In the above example we dove into the legacy
table and found the normal CILIUM-* chains and rules. Then in the nft
tables we see the standard KUBE-PROXY-* chains and rules.

Another scenario where we can create a similar problem is with an old
kube-proxy. In this hypothetical scenario the user upgrades to a new
distribution/kernel with a base iptables image that points to iptables-nft.
This will cause kubelet to use nft tables, but because of the older
version of kube-proxy it may use iptables. Now kubelet and kube-proxy
are out of sync. Now how should Cilium pick nft or legacy?

Lets analyze the two scenarios. Assume Cilium and Kube-proxy pick
differently. First we might ask what runs first nft or iptables. From
the kernel side its unclear to me. The hooks are run walking an array but,
it appears those hooks are registered at runtime. So its up to which
hooks register first. And hooks register at init so now we are left
wondering which of nft or legacy registers first. This may very well
depend on if iptables-legacy or iptables-nft runs first because the
init of the module is done on demand with a request_module helper.
So bottom line ordering is fragile at best. For this discussion lets
assume we can't make any claims on if nft or iptables runs first.

Next, lets assume kube-proxy is in nft and Cilium is in legacy and
nft runs first. Now this will break Cilium's expectation that the
rules for Cilium are run before kube-proxy and any other iptables
rules. The result can be drops in the datapath. The example that
lead us on this adventure is IPSEC traffic hit a kube-proxy -j DROP rule
because it never ran the Cilium -j ACCEPT rule we expected to be
inserted into the front of the chain. So clearly this is no good.

Just to cover our cases, consider Cilium is run first and then
kube-proxy is run. Well we are still stuck from kernel code side
the hooks are executed in a for loop over the hooks and an ACCEPT
will run the next hook instead of the normal accept the skb and
do not run any further rules. The next hook in this case will have
the kube-proxy rules and we hit the same -j DROP rule again.

Finally because we can't depend on the order of nft vs legacy
running it doesn't matter if cilium and kube proxy flip to put
cilium on nft and kube-proxy on legacy. We get the same problem.

Because Cilium and kube-proxy are coupled in that they both
manage iptables for datapath flows they need to be on the same
hook. We could try to do this by doing [2] and following kubelet AND
assuming kube-proxy does the same everything should be OK. The
problem is if kube-proxy is not updated and doesn't follow
kubelet we again get stuck with Cilium and kube-proxy using different
hooks. To fix this case modify [2] so that Cilium follows kube-proxy
instead of following kubelet. This will force cilium and kube-proxy
to at least choose the same hook and avoid the faults outlined
above. There is a corner case if kube-proxy is not up before
cilium, but experimentally it seems kube-proxy is started close
to kubelet and init paths so is in fact up before cilium making
this ok. If we ever need to verify this in sysdump we can check
startAt times in the k8s-pod.yaml to confirm the start ordering
of pods.

For reference The original iptables-wrapper script the Cilium used
previous to this patch is coming from [1]. This patch is based
off of the new wrapper [2] in k8s upstream repo.

[1]: kubernetes/kubernetes#82966
[2]: https://github.com/kubernetes-sigs/iptables-wrappers/blob/master/iptables-wrapper-installer.sh

[ Backport notes: Conflict on file
    images/runtime/configure-iptables-wrapper.sh, due to the copyright
    year being removed in 1.12 in commit 17a78a2 ("images: remove
    copyright year from copyright notices in source files") ]

Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
  • Loading branch information
jrfastab authored and qmonnet committed Jun 30, 2022
1 parent 667ed7a commit 5831d3d
Show file tree
Hide file tree
Showing 4 changed files with 233 additions and 66 deletions.
3 changes: 1 addition & 2 deletions images/runtime/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -41,9 +41,8 @@ WORKDIR /go/src/github.com/cilium/cilium/images/runtime
RUN --mount=type=bind,readwrite,target=/go/src/github.com/cilium/cilium/images/runtime \
./install-runtime-deps.sh

COPY iptables-wrapper /usr/sbin/iptables-wrapper
RUN --mount=type=bind,readwrite,target=/go/src/github.com/cilium/cilium/images/runtime \
./configure-iptables-wrapper.sh
./iptables-wrapper-installer.sh --no-sanity-check

COPY --from=llvm-dist /usr/local/bin/clang /usr/local/bin/llc /bin/
COPY --from=bpftool-dist /usr/local /usr/local
Expand Down
19 changes: 0 additions & 19 deletions images/runtime/configure-iptables-wrapper.sh

This file was deleted.

45 changes: 0 additions & 45 deletions images/runtime/iptables-wrapper

This file was deleted.

232 changes: 232 additions & 0 deletions images/runtime/iptables-wrapper-installer.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,232 @@
#!/bin/sh
# shellcheck disable=SC2166

# This script is copied from upstream with below link
# https://github.com/kubernetes-sigs/iptables-wrappers/blob/e139a115350974aac8a82ec4b815d2845f86997e/iptables-wrapper-installer.sh
# Copyright 2020 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Usage:
#
# iptables-wrapper-installer.sh [--no-sanity-check]
#
# Installs a wrapper iptables script in a container that will figure out
# whether iptables-legacy or iptables-nft is in use on the host and then
# replaces itself with the correct underlying iptables version.
#
# Unless "--no-sanity-check" is passed, it will first verify that the
# container already contains a suitable version of iptables.

# NOTE: This can only use POSIX /bin/sh features; the build container
# might not contain bash.

set -eu

# Find iptables binary location
if [ -d /usr/sbin -a -e /usr/sbin/iptables ]; then
sbin="/usr/sbin"
elif [ -d /sbin -a -e /sbin/iptables ]; then
sbin="/sbin"
else
echo "ERROR: iptables is not present in either /usr/sbin or /sbin" 1>&2
exit 1
fi

# Determine how the system selects between iptables-legacy and iptables-nft
if [ -x /usr/sbin/alternatives ]; then
# Fedora/SUSE style alternatives
altstyle="fedora"
elif [ -x /usr/sbin/update-alternatives ]; then
# Debian style alternatives
altstyle="debian"
else
# No alternatives system
altstyle="none"
fi

if [ "${1:-}" != "--no-sanity-check" ]; then
# Ensure dependencies are installed
if ! version=$("${sbin}/iptables-nft" --version 2> /dev/null); then
echo "ERROR: iptables-nft is not installed" 1>&2
exit 1
fi
if ! "${sbin}/iptables-legacy" --version > /dev/null 2>&1; then
echo "ERROR: iptables-legacy is not installed" 1>&2
exit 1
fi

case "${version}" in
*v1.8.[0123]\ *)
echo "ERROR: iptables 1.8.0 - 1.8.3 have compatibility bugs." 1>&2
echo " Upgrade to 1.8.4 or newer." 1>&2
exit 1
;;
*)
# 1.8.4+ are OK
;;
esac
fi

# Start creating the wrapper...
rm -f "${sbin}/iptables-wrapper"
cat > "${sbin}/iptables-wrapper" <<EOF
#!/bin/sh
# Copyright 2020 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# NOTE: This can only use POSIX /bin/sh features; the container image
# might not contain bash.
# This is a Cilium variant of the original Kubernetes iptablejs-wrapper
# the change is to follow Kube-proxy instead of Kubelet. This fixes
# issues where kube-proxy and kubelet iptables versions are not in
# sync and we observed kube-proxy incorrectly using different ipt/nft
# configuration from kubelet.
set -eu
# In kubernetes 1.17 and later, kubelet will have created at least
# one chain in the "mangle" table (either "KUBE-IPTABLES-HINT" or
# "KUBE-KUBELET-CANARY"), we expect kubeproxy will follow similar
# pattern so we check that first, against iptables-nft, because we
# can check that more efficiently and it's more common these days.
nft_kubeproxy_rules=\$( (iptables-nft-save -t mangle || true; ip6tables-nft-save -t mangle || true) 2>/dev/null | grep -E '^:(KUBE-IPTABLES-HINT|KUBE-PROXY-CANARY)' | wc -l)
if [ "\${nft_kubeproxy_rules}" -ne 0 ]; then
mode=nft
else
# Next lets check for a kubeproxy canary in iptables indicating
# kube-proxy is using ipt.
legacy_kubeproxy_rules=\$( (iptables-legacy-save || true; ip6tables-legacy-save || true) 2>/dev/null | grep -E '^:(KUBE-IPTABLES-HINT|KUBE-PROXY-CANARY)' | wc -l)
if [ "\${legacy_kubeproxy_rules}" -ne 0 ]; then
mode=legacy
else
# If we did not find a kube proxy canary either we started before
# kube-proxy or it doesn't exist so lets use ipwrapper standard
# logic to follow kubelet.
nft_kubelet_rules=\$( (iptables-nft-save -t mangle || true; ip6tables-nft-save -t mangle || true) 2>/dev/null | grep -E '^:KUBE-KUBELET-CANARY' | wc -l)
if [ "\${nft_kubeproxy_rules}" -ne 0 ]; then
mode = nft
else
# Check for kubernetes 1.17-or-later with iptables-legacy. We
# can't pass "-t mangle" to iptables-legacy-save because it would
# cause the kernel to create that table if it didn't already
# exist, which we don't want. So we have to grab all the rules
legacy_kubelet_rules=\$( (iptables-legacy-save || true; ip6tables-legacy-save || true) 2>/dev/null | grep -E '^:KUBE-KUBELET-CANARY' | wc -l)
if [ "\${legacy_kubelet_rules}" -ne 0 ]; then
mode=legacy
else
# With older kubernetes releases there may not be any _specific_
# rules we can look for, but we assume that some non-containerized process
# (possibly kubelet) will have created _some_ iptables rules.
num_legacy_lines=\$( (iptables-legacy-save || true; ip6tables-legacy-save || true) 2>/dev/null | grep '^-' | wc -l)
num_nft_lines=\$( (iptables-nft-save || true; ip6tables-nft-save || true) 2>/dev/null | grep '^-' | wc -l)
if [ "\${num_legacy_lines}" -gt "\${num_nft_lines}" ]; then
mode=legacy
else
mode=nft
fi
fi
fi
fi
fi
EOF

# Write out the appropriate alternatives-selection commands
case "${altstyle}" in
fedora)
cat >> "${sbin}/iptables-wrapper" <<EOF
# Update links to point to the selected binaries
alternatives --set iptables "/usr/sbin/iptables-\${mode}" > /dev/null || failed=1
EOF
;;

debian)
cat >> "${sbin}/iptables-wrapper" <<EOF
# Update links to point to the selected binaries
update-alternatives --set iptables "/usr/sbin/iptables-\${mode}" > /dev/null || failed=1
update-alternatives --set ip6tables "/usr/sbin/ip6tables-\${mode}" > /dev/null || failed=1
EOF
;;

*)
cat >> "${sbin}/iptables-wrapper" <<EOF
# Update links to point to the selected binaries
for cmd in iptables iptables-save iptables-restore ip6tables ip6tables-save ip6tables-restore; do
rm -f "${sbin}/\${cmd}"
ln -s "${sbin}/xtables-\${mode}-multi" "${sbin}/\${cmd}"
done 2>/dev/null || failed=1
EOF
;;
esac

# Write out the post-alternatives-selection error checking and final wrap-up
cat >> "${sbin}/iptables-wrapper" <<EOF
if [ "\${failed:-0}" = 1 ]; then
echo "Unable to redirect iptables binaries. (Are you running in an unprivileged pod?)" 1>&2
# fake it, though this will probably also fail if they aren't root
exec "${sbin}/xtables-\${mode}-multi" "\$0" "\$@"
fi
# Now re-exec the original command with the newly-selected alternative
exec "\$0" "\$@"
EOF
chmod +x "${sbin}/iptables-wrapper"

# Now back in the installer script, point the iptables binaries at our
# wrapper
case "${altstyle}" in
fedora)
alternatives \
--install /usr/sbin/iptables iptables /usr/sbin/iptables-wrapper 100 \
--slave /usr/sbin/iptables-restore iptables-restore /usr/sbin/iptables-wrapper \
--slave /usr/sbin/iptables-save iptables-save /usr/sbin/iptables-wrapper \
--slave /usr/sbin/ip6tables iptables /usr/sbin/iptables-wrapper \
--slave /usr/sbin/ip6tables-restore iptables-restore /usr/sbin/iptables-wrapper \
--slave /usr/sbin/ip6tables-save iptables-save /usr/sbin/iptables-wrapper
;;

debian)
update-alternatives \
--install /usr/sbin/iptables iptables /usr/sbin/iptables-wrapper 100 \
--slave /usr/sbin/iptables-restore iptables-restore /usr/sbin/iptables-wrapper \
--slave /usr/sbin/iptables-save iptables-save /usr/sbin/iptables-wrapper
update-alternatives \
--install /usr/sbin/ip6tables ip6tables /usr/sbin/iptables-wrapper 100 \
--slave /usr/sbin/ip6tables-restore ip6tables-restore /usr/sbin/iptables-wrapper \
--slave /usr/sbin/ip6tables-save ip6tables-save /usr/sbin/iptables-wrapper
;;

*)
for cmd in iptables iptables-save iptables-restore ip6tables ip6tables-save ip6tables-restore; do
rm -f "${sbin}/${cmd}"
ln -s "${sbin}/iptables-wrapper" "${sbin}/${cmd}"
done
;;
esac

# Cleanup
rm -f "$0"

0 comments on commit 5831d3d

Please sign in to comment.