Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iptables: carry on and log on failure to set up transient rules #12006

Merged
merged 4 commits into from Jun 11, 2020

Commits on Jun 10, 2020

  1. datapath/loader: carry on on failure to set up transient iptables rules

    When Reinitialize()-ing the datapath, transient iptables rules are set
    up to avoid dropping packets while Cilium's rules are not in place.
    
    In rare occasions, a failure to add those rules has been observed (see
    issue #11276), leading to an early exit from Reinitialize() and a
    failure to set up the daemon.
    
    But those transient rules are just used to lend a hand and keep packets
    going for a very small window of time: it does not actually matter much
    if we fail to install them, and it should not stop the reinitializing of
    the daemon. Let's simply log a warning and carry on if we fail to add
    those rules.
    
    Signed-off-by: Quentin Monnet <quentin@isovalent.com>
    qmonnet committed Jun 10, 2020
    Copy the full SHA
    1c2bb1e View commit details
    Browse the repository at this point in the history
  2. datapath/loader: refactor function to remove iptables chain

    We do the same thing for IPv4 and IPv6, with just the name of the
    program changing. Then we do nearly the same thing for flushing and
    deleting a chain. Let's refactor (no functional change).
    
    Signed-off-by: Quentin Monnet <quentin@isovalent.com>
    qmonnet committed Jun 10, 2020
    Copy the full SHA
    a20f68a View commit details
    Browse the repository at this point in the history
  3. iptables: log unexpected failures to delete transient rules

    When Cilium reinitialises its daemon, transient iptables rules are set
    up to avoid dropping packets while the regular rules are not in place.
    
    On rare occasions, setting up those transient rules has been found to
    fail, for an unknown reason (see issue #11276). The error message states
    that the "Chain already exists", even though we try to flush and remove
    any leftover from previous transient rules before adding the new ones.
    It sounds likely that removing the leftovers is failing, but we were not
    able to understand why, because we quieten the function to avoid
    spurious warnings the first time we try to remove them (since none is
    existing).
    
    It would be helpful to get more information to understand what happens
    in those rare occasions where setting up transient rules fails. Let's
    find a way to get more logs, without making too much noise.
    
    We cannot warn unconditionally in remove() since we want removal in the
    normal case to remain quiet. What we can do is logging when the "quiet"
    flag is not passed, _or_ when the error is different from the chain
    being not found, i.e. when the error is different from the one we want
    to silence on start up. This means matching on the error message
    returned by ip(6)tables. It looks fragile, but at least this message has
    not changed since 2009, so it should be relatively stable and pretty
    much the same on all supported systems. Since remove() is used for
    chains other than for transient rules too, we also match on chain name
    to make sure we are dealing with transient rules if ignoring the "quiet"
    flag.
    
    This additional logging could be removed once we reproduce and fix the
    issue.
    
    Alternative approaches could be:
    
    - Uncoupling the remove() function for transient rules and regular
      rules, to avoid matching on chain name (but it sounds worse).
    - Logging on failure for all rules even when the "quiet" flag is passed,
      but on "info" level instead of "warning". This would still require a
      modified version of runProg(), with also a modified version of
      CombinedOutput() in package "exec". Here I chose to limit the number
      of logs and keep the changes local.
    - Listing the chain first before trying to remove it, so we only try to
      remove if it exists, but this would likely add unnecessary complexity
      and latency.
    
    Should help with (but does not solve): #11276
    Signed-off-by: Quentin Monnet <quentin@isovalent.com>
    qmonnet committed Jun 10, 2020
    Copy the full SHA
    6dae0d7 View commit details
    Browse the repository at this point in the history
  4. iptables: add transient rules removal failure to badLogMessages

    In an attempt to catch #11276 in CI, let's add any message related to a
    failure to flush or delete the chain related to iptables transient rules
    to the list of badLogMessages we want to catch.
    
    We need to filter on the name of the chain for transient rules to avoid
    false positives, which requires exporting that name.
    
    We also need to modify the log message error, to avoid adding four
    disting logs to the list (combinations for iptables/ip6tables,
    flush/delete).
    
    Signed-off-by: Quentin Monnet <quentin@isovalent.com>
    qmonnet committed Jun 10, 2020
    Copy the full SHA
    f106702 View commit details
    Browse the repository at this point in the history