Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Ensure that the iptables masquerade rule is always installed #808
We ran into an issue where sometimes the iptables masquerade rules that flannel creates get deleted. Obviously ideally this wouldn't happen, but if it does happen it's really bad -- containers can't send packets to any hosts outside the cluster.
This PR makes flannel automatically recover from this situation by recreating the iptables rules, so that the cluster will remain healthy:
We're running this in our production cluster now
Hi @julia-stripe, thanks for another well put together PR!
I see one key problem with this PR and then a couple of minor concerns (below) - the main concern is that if a single rule (of the four that flannel installs) is removed, then flannel just restores the one missing rule, which means it might end up being out of order! Since the ordering of the rules matters, I think this defeats the purpose of this PR a bit.
e.g. if I remove the first flannel rules
I'd be interested to hear if you have any thoughts on automated testing in this area? I know there aren't any existing tests but it would be great to have some.
Otherwise, the code looks good to me, though I'd like to leave it up for at least a few days before merging to give others a chance to comment.
The slight areas of concern I'd have about this are
@fasaxc I'd love to hear your thoughts on this.
On (1), looks like the code checks for the existence of the rule first, so it's 4 reads per second. That could be moderately expensive if there are a lot of rules (since most iptables operations do a full table load from the kernel to userspace, even to check one rule).
On (2), as long as the other process isn't deleting these rules actively then we should be OK (and if it was actively trying to delete those rules then there's not much we can do.
Longer term, you may want to migrate to having only a single rule inserted into the root chain and then jumping fro there to a flannel-owned chain. That's the approach that we have in Calico; it reduces cloberring since there's only one rule and it avoids reordering issues. (And, since we own the calico chain, we can just flush and rewrite the whole chain at will if anyone ever tampers with it.)
this is a great point. Changed the PR so that it instead:
I also added a mock IPTables struct & some tests to give us a little more confidence that this code actually works as intended.
This should be reviewed commit-by-commit -- the first commit just adds tests, and the second actually changes functionality.
@fasaxc thanks for the comments. I agree that longer term, a single jump to a flannel owned chain would be better.
@julia-stripe That's looking much better thanks. This code doesn't actually delete the iptables rules when flannel stops, but after a bit of testing I see that the old code didn't either (ooops). I also think it's desirable that the rules aren't cleaned up (so the dataplane keeps working during an upgrade). It's up to you if you want to clean up the code or just leave it as is. I think in future it would be good to have a "--cleanup" command which which removes all traces of flannel from a system (removes the subnet.env file, any iptables rules, any flannel routes and flannel created devices).
Could you squash the two commits then I can merge it?