iptables: carry on and log on failure to set up transient rules #12006

qmonnet · 2020-06-10T14:16:40Z

As reported in #11276, we have occasionally observed unexpected failures when trying to set up iptables transient rules (used to avoid dropping packets while reinitialising the daemon).

This set brings two changes:

- Do not abort reinitialisation if setting up the transient rules fails (we risk dropping a few packets until Cilium's rules are up again, but this is better than failing to restart the agent).
- Log iptables messages on failures to flush or delete leftover transient rules, as we suspect such failures are the cause of the error messages observed, so we can understand what happens if the issue reoccurs.

Please refer to individual commit logs for details.

When Reinitialize()-ing the datapath, transient iptables rules are set up to avoid dropping packets while Cilium's rules are not in place. On rare occasions, a failure to add those rules has been observed (see issue #11276), leading to an early exit from Reinitialize() and a failure to set up the daemon.

But those transient rules are just used to lend a hand and keep packets going for a very small window of time: it does not actually matter much if we fail to install them, and it should not stop the reinitialisation of the daemon. Let's simply log a warning and carry on if we fail to add those rules.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
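The "log a warning and carry on" behaviour described in this commit can be sketched as follows. This is only an illustration, not the code from this PR: `reinitialize`, `installTransient`, `installFinal` and `errChainExists` are hypothetical names, and the standard `log` package stands in for whatever logger the agent actually uses.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errChainExists is a hypothetical error standing in for the failure
// reported in #11276.
var errChainExists = errors.New("iptables: Chain already exists.")

// reinitialize is a hypothetical, simplified reinitialisation routine.
// A failure to install the transient rules is only logged; a failure to
// install the final rules is still returned to the caller.
func reinitialize(installTransient, installFinal func() error) error {
	if err := installTransient(); err != nil {
		// Transient rules only cover a short window: warn and carry on.
		log.Printf("warning: failed to install transient iptables rules, proceeding anyway: %v", err)
	}
	if err := installFinal(); err != nil {
		return fmt.Errorf("installing final rules: %w", err)
	}
	return nil
}

func main() {
	err := reinitialize(
		func() error { return errChainExists }, // transient setup fails
		func() error { return nil },            // final setup succeeds
	)
	fmt.Println("reinitialize returned:", err)
}
```

The trade-off is the one stated in the commit message: a few packets may be dropped during the window, but the agent still restarts.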

pchaigno

Would it be possible to see this bug in CI? Is it worth adding the new warning messages to badLogMessages so that we don't miss it if it happens in CI?

pkg/datapath/iptables/iptables.go

coveralls · 2020-06-10T14:55:44Z

Coverage decreased (-0.04%) to 37.035% when pulling f106702 on pr/qmonnet/log_ipt_transient_rules into 47f8d32 on master.

qmonnet · 2020-06-10T14:59:52Z

Would it be possible to see this bug in CI?

If we filter on debugTransientRules, to avoid doing so for other rules when we pass quiet as false, this is probably doable. I'll look into it. Thanks!

We do the same thing for IPv4 and IPv6, with just the name of the program changing. Then we do nearly the same thing for flushing and deleting a chain. Let's refactor (no functional change).

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
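One way to picture this kind of refactoring is to build the commands in two nested loops, one over the programs and one over the operations. This is a sketch under assumptions, not the refactored code from the PR: the `command` type, `removeCommands` and the chain name `EXAMPLE_TRANSIENT` are hypothetical. Only the `-t`, `-F` (flush) and `-X` (delete chain) options are real iptables/ip6tables options.

```go
package main

import "fmt"

// command is a hypothetical description of one program invocation.
type command struct {
	prog string
	args []string
}

// removeCommands returns the flush and delete invocations for a chain,
// for each enabled IP family. Only the program name and the operation
// flag differ between the four possible commands.
func removeCommands(table, chain string, ipv4, ipv6 bool) []command {
	var progs []string
	if ipv4 {
		progs = append(progs, "iptables")
	}
	if ipv6 {
		progs = append(progs, "ip6tables")
	}

	var cmds []command
	for _, prog := range progs {
		// Flush the chain first (-F), then delete it (-X).
		for _, op := range []string{"-F", "-X"} {
			cmds = append(cmds, command{
				prog: prog,
				args: []string{"-t", table, op, chain},
			})
		}
	}
	return cmds
}

func main() {
	for _, c := range removeCommands("filter", "EXAMPLE_TRANSIENT", true, true) {
		fmt.Println(c.prog, c.args)
	}
}
```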

When Cilium reinitialises its daemon, transient iptables rules are set up to avoid dropping packets while the regular rules are not in place. On rare occasions, setting up those transient rules has been found to fail, for an unknown reason (see issue #11276).

The error message states that the "Chain already exists", even though we try to flush and remove any leftover from previous transient rules before adding the new ones. It sounds likely that removing the leftovers is failing, but we were not able to understand why, because we quieten the function to avoid spurious warnings the first time we try to remove them (since none exists yet).

It would be helpful to get more information to understand what happens on those rare occasions where setting up transient rules fails. Let's find a way to get more logs, without making too much noise.

We cannot warn unconditionally in remove() since we want removal in the normal case to remain quiet. What we can do is log when the "quiet" flag is not passed, _or_ when the error is different from the chain being not found, i.e. when the error is different from the one we want to silence on start up. This means matching on the error message returned by ip(6)tables. It looks fragile, but at least this message has not changed since 2009, so it should be relatively stable and pretty much the same on all supported systems.

Since remove() is used for chains other than those for transient rules too, we also match on the chain name to make sure we are dealing with transient rules when ignoring the "quiet" flag.

This additional logging could be removed once we reproduce and fix the issue.

Alternative approaches could be:

- Uncoupling the remove() function for transient rules and regular rules, to avoid matching on chain name (but it sounds worse).
- Logging on failure for all rules even when the "quiet" flag is passed, but on "info" level instead of "warning". This would still require a modified version of runProg(), with also a modified version of CombinedOutput() in package "exec". Here I chose to limit the number of logs and keep the changes local.
- Listing the chain first before trying to remove it, so we only try to remove it if it exists, but this would likely add unnecessary complexity and latency.

Should help with (but does not solve): #11276

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
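The logging condition described above can be written as a small predicate. The sketch below is illustrative only: `shouldLog`, `transientChain` and `chainNotFoundMsg` are hypothetical names, the chain name is made up, and the message text is an assumption about what ip(6)tables prints when a chain does not exist.

```go
package main

import (
	"fmt"
	"strings"
)

// transientChain is a hypothetical name for the transient rules chain.
const transientChain = "EXAMPLE_TRANSIENT"

// chainNotFoundMsg is the error text we want to keep silent on start up.
// The exact wording is an assumption made for this sketch.
const chainNotFoundMsg = "No chain/target/match by that name"

// shouldLog reports whether a failed flush or delete should be logged.
// Without the quiet flag we always log. With it, we log only for the
// transient chain, and only if the error is not "chain not found".
func shouldLog(quiet bool, chain, output string) bool {
	if !quiet {
		return true
	}
	return chain == transientChain && !strings.Contains(output, chainNotFoundMsg)
}

func main() {
	fmt.Println(shouldLog(true, transientChain, "iptables: Chain already exists."))
	fmt.Println(shouldLog(true, transientChain, "iptables: "+chainNotFoundMsg+"."))
}
```

Matching on a substring rather than on the full line keeps the check independent of the program name prefix, at the cost of depending on the message wording, which is the fragility acknowledged in the commit message.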

In an attempt to catch #11276 in CI, let's add any message related to a failure to flush or delete the chain related to iptables transient rules to the list of badLogMessages we want to catch. We need to filter on the name of the chain for transient rules to avoid false positives, which requires exporting that name. We also need to modify the error log message, to avoid adding four distinct logs to the list (combinations for iptables/ip6tables, flush/delete).

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
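To illustrate why a single message shape helps, here is a minimal sketch in which all four combinations share one prefix containing the chain name, so a single substring is enough to match them in CI. The message wording, `TransientChain`, `removalFailureMessage`, `badLogMessage` and `findBadLogs` are all hypothetical; they are not the strings or helpers used by Cilium's test suite.

```go
package main

import (
	"fmt"
	"strings"
)

// TransientChain is a hypothetical exported chain name.
const TransientChain = "EXAMPLE_TRANSIENT"

// badLogMessage is the single substring a CI check would look for.
const badLogMessage = "Unable to process chain " + TransientChain

// removalFailureMessage builds the log line for any program/operation
// pair. The variable parts come after the common prefix.
func removalFailureMessage(prog, op string) string {
	return fmt.Sprintf("%s with %s (%s)", badLogMessage, prog, op)
}

// findBadLogs returns the log lines that contain the bad message.
func findBadLogs(lines []string) []string {
	var hits []string
	for _, line := range lines {
		if strings.Contains(line, badLogMessage) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	logs := []string{
		removalFailureMessage("iptables", "flush"),
		removalFailureMessage("ip6tables", "delete"),
		"Unable to process chain OTHER_CHAIN with iptables (flush)",
	}
	for _, hit := range findBadLogs(logs) {
		fmt.Println(hit)
	}
}
```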

qmonnet · 2020-06-10T16:58:54Z

test-me-please

qmonnet added the area/daemon (Impacts operation of the Cilium daemon.), release-note/misc (This PR makes changes that have no direct user impact.) and needs-backport/1.8 labels Jun 10, 2020

qmonnet requested review from borkmann, joestringer and a team June 10, 2020 14:16

maintainer-s-little-helper bot added this to In progress in 1.8.0 Jun 10, 2020

maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 Jun 10, 2020

pchaigno approved these changes Jun 10, 2020

qmonnet added 3 commits June 10, 2020 16:51

qmonnet force-pushed the pr/qmonnet/log_ipt_transient_rules branch from 26e16d9 to f106702 Compare June 10, 2020 16:13

qmonnet requested a review from a team as a code owner June 10, 2020 16:13

qmonnet requested a review from pchaigno June 10, 2020 16:14

pchaigno approved these changes Jun 10, 2020

nebril approved these changes Jun 10, 2020

aanm approved these changes Jun 11, 2020

borkmann approved these changes Jun 11, 2020

aanm added the needs-backport/1.7 label Jun 11, 2020

maintainer-s-little-helper bot added this to Needs backport from master in 1.7.5 Jun 11, 2020

aanm merged commit e78405a into master Jun 11, 2020

1.8.0 automation moved this from In progress to Merged Jun 11, 2020

aanm deleted the pr/qmonnet/log_ipt_transient_rules branch June 11, 2020 14:50

aanm mentioned this pull request Jun 11, 2020

v1.8 backports 2020-06-11 #12027

Merged

aanm removed the needs-backport/1.8 label Jun 11, 2020

aanm added the backport-pending/1.8 label Jun 11, 2020

joestringer mentioned this pull request Jun 11, 2020

v1.7 backports 2020-06-11 #12031

Merged

joestringer added backport-pending/1.7 and removed needs-backport/1.7 labels Jun 11, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 11, 2020

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.5 Jun 11, 2020

aanm added backport-done/1.8 and removed backport-pending/1.8 labels Jun 12, 2020

maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 12, 2020

jrajahalme mentioned this pull request Jun 15, 2020

test: Remove ginkgo linux dependency #12074

Merged

qmonnet added backport-done/1.7 and removed backport-pending/1.7 labels Aug 3, 2020
