image/runtime: Fix kube proxy and Cilium iptables and nftables collision
[ upstream commit 369f3f9 ]

Cilium currently chooses between iptables-legacy and iptables-nft using
an iptables-wrapper script. The script does a simple check: if there
are more than 10 rules in iptables-legacy, it picks legacy mode;
otherwise it picks whichever of legacy or nft has more rules. See [1]
for the original wrapper this is taken from.
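For illustration, the heuristic just described boils down to something
like the following. This is a minimal sketch of the logic as described
above, not the exact script from [1], which differs in its details:

  # Count rule lines in each backend's save output.
  num_legacy=$(iptables-legacy-save 2>/dev/null | grep -c '^-')
  num_nft=$(iptables-nft-save 2>/dev/null | grep -c '^-')
  if [ "$num_legacy" -gt 10 ]; then
      mode=legacy          # clearly an iptables-legacy host
  elif [ "$num_nft" -gt "$num_legacy" ]; then
      mode=nft             # otherwise whichever backend has more rules
  else
      mode=legacy
  fi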
This heuristic, however, can be problematic. We've hit an environment
where arguably broken pods insert rules directly into iptables without
checking whether the host uses legacy or nft. This can happen, for
example, with older pods that ship an iptables package from before
1.8.4 that was buggy or missing nft support altogether. At any rate,
when this happens it becomes a race to see which pods come online first
and insert rules into the table; if the legacy rule count exceeds 10,
Cilium flips into legacy mode. This becomes painfully obvious if the
agent is restarted after the system has been running and these buggy
pods have already created their rules. At this point Cilium may be
using legacy while kube-proxy and kubelet are running in nft space
(more on why this is bad below).

We can quickly check this from a sysdump with a few one-liners:

  $ find . -name iptables-nft-save* | xargs wc -l
  1495 ./cilium-bugtool-cilium-1234/cmd/iptables-nft-save--c.md
  $ find . -name iptables-save* | xargs wc -l
  109 ./cilium-bugtool-cilium-1234/cmd/iptables-save--c.md

Here we see that a single node has a significant number of rules in
both the nft and legacy tables. In the above example we dove into the
legacy table and found the normal CILIUM-* chains and rules, while the
nft table held the standard KUBE-PROXY-* chains and rules.

Another scenario that creates a similar problem is an old kube-proxy.
In this hypothetical, the user upgrades to a new distribution/kernel
whose base iptables image points to iptables-nft. This causes kubelet
to use nft tables, but the older kube-proxy may still use legacy
iptables. Now kubelet and kube-proxy are out of sync.

So how should Cilium pick nft or legacy? Let's analyze the two
scenarios, assuming Cilium and kube-proxy pick differently. First we
might ask which runs first, nft or legacy iptables. From the kernel
side it's unclear to me: the hooks are run by walking an array, but
those hooks appear to be registered at runtime, so it comes down to
which hooks register first. Hooks register at init time, so we're left
wondering which of nft or legacy registers first. That may well depend
on whether iptables-legacy or iptables-nft runs first, because module
init is done on demand via the request_module helper. Bottom line: the
ordering is fragile at best. For this discussion, let's assume we can't
make any claims about whether nft or legacy runs first.

Next, assume kube-proxy is in nft, Cilium is in legacy, and nft runs
first. This breaks Cilium's expectation that its rules run before
kube-proxy's and any other iptables rules, and the result can be drops
in the datapath. The example that led us on this adventure was IPsec
traffic hitting a kube-proxy -j DROP rule, because the Cilium -j ACCEPT
rule we expected to be inserted at the front of the chain never ran. So
clearly this is no good.

Just to cover our cases, consider Cilium running first and then
kube-proxy. We are still stuck: on the kernel side the hooks are
executed in a for loop, and an ACCEPT verdict runs the next hook
instead of the usual "accept the skb and run no further rules". The
next hook in this case holds the kube-proxy rules, and we hit the same
-j DROP rule again. Finally, because we can't depend on the order in
which nft and legacy run, it doesn't matter if Cilium and kube-proxy
flip, putting Cilium on nft and kube-proxy on legacy; we get the same
problem.

Because Cilium and kube-proxy are coupled, in that they both manage
iptables rules for datapath flows, they need to be on the same hook. We
could try to achieve this by adopting [2] and following kubelet;
assuming kube-proxy does the same, everything should be OK. The problem
is that if kube-proxy is not updated and doesn't follow kubelet, we are
again stuck with Cilium and kube-proxy using different hooks.

To fix this case, modify [2] so that Cilium follows kube-proxy instead
of kubelet. This forces Cilium and kube-proxy to at least choose the
same hook and avoids the faults outlined above. There is a corner case
if kube-proxy is not up before Cilium, but experimentally kube-proxy is
started close to kubelet and init paths, so it is in fact up before
Cilium, making this OK. If we ever need to verify this from a sysdump,
we can check the startedAt times in k8s-pod.yaml to confirm the start
ordering of pods.
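The amended selection logic thus keys off the rules kube-proxy itself
has written. A minimal sketch, assuming kube-proxy's KUBE-PROXY-CANARY
health-check chain as the marker (the actual wrapper in [2], as
modified by this patch, may key off different chains or fall back
differently):

  # Follow kube-proxy: use whichever backend holds its chains.
  if iptables-nft-save 2>/dev/null | grep -q 'KUBE-PROXY-CANARY'; then
      mode=nft
  elif iptables-legacy-save 2>/dev/null | grep -q 'KUBE-PROXY-CANARY'; then
      mode=legacy
  else
      # kube-proxy has not written rules yet; fall back to the
      # rule-count heuristic sketched earlier.
      mode=legacy
  fi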
For reference, the original iptables-wrapper script that Cilium used
prior to this patch comes from [1]. This patch is based on the new
wrapper [2] in the k8s upstream repo.

[1]: kubernetes/kubernetes#82966
[2]: https://github.com/kubernetes-sigs/iptables-wrappers/blob/master/iptables-wrapper-installer.sh

[ Backport notes: Conflict on file images/runtime/configure-iptables-wrapper.sh,
  due to the copyright year being removed in 1.12 in commit 17a78a2
  ("images: remove copyright year from copyright notices in source
  files") ]

Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>