[21.05] Improve router behaviour during failovers #1003
Merged
We've had some problems with temporary (ca. 20s) loss of connectivity when performing failovers with NixOS routers in VXLAN (both between Gentoo and NixOS and with NixOS pairs). For comparison, in the pre-VXLAN world failover between Gentoo routers was near-instantaneous.
One half of this problem is that by default we enable ARP and ND suppression on VXLAN interfaces. This is recommended by most of the tutorials I've seen for setting up VXLAN, as it reduces the amount of flooding traffic which needs to be replicated to all VTEPs in the fabric. However, flood suppression means that the gratuitous ARPs and unsolicited neighbour advertisements sent by keepalived when it's promoted from backup to primary router are dropped, which prevents the new primary router from promptly steering outgoing traffic away from the demoted router.
The other half of this problem is that the firewall rule for allowing incoming TFTP packets was only enabled for the primary specialisation, and not the base specialisation. This means that switching between specialisations would cause a firewall reload, and reloading the NixOS firewall causes all incoming packets to be dropped for a few seconds. This is enough to interrupt BFD and BGP sessions, which (depending on the local configuration) can cause routers to temporarily lose their default route. If the router has just been promoted to primary, then the entire site loses connectivity until the BGP session comes back up and the default routes are re-learned.
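To illustrate the firewall half, here is a minimal sketch of a configuration in which the base system and the primary specialisation generate the same firewall. It assumes that the specialisation is named `primary` and that TFTP is opened with `networking.firewall.allowedUDPPorts` on the standard UDP port 69; the actual rule in this PR may be expressed differently (for example, restricted to particular interfaces).

```nix
{ ... }:
{
  # Assumption: TFTP on its standard UDP port 69, opened in the base
  # configuration so that every specialisation inherits the same rule.
  networking.firewall.allowedUDPPorts = [ 69 ];

  # Assumption: the specialisation is called "primary".
  specialisation.primary.configuration = {
    # Primary-only settings go here. If the TFTP port were declared only
    # in this attribute set, the generated firewall would differ from the
    # base system, and switching between the two would reload the firewall.
  };
}
```

Because the firewall definition is identical in both variants, switching between them should not need to reload the firewall on account of this rule.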
This change splits the VXLAN bridge port configuration into separate units for flood suppression and MAC learning suppression, and then disables the former in the router role only. This means that flood suppression will stay enabled on all other physical hosts. Additionally, the firewall rules for TFTP are enabled unconditionally, irrespective of whether the router is the primary or not, in order to prevent a firewall reload when performing a failover.
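A rough sketch of what such a split could look like, not the code from this PR: two oneshot units, one per bridge port flag, so that a role can turn off one without touching the other. The unit names, the interface name `vxlan0`, and the use of `bridge link set` from iproute2 are assumptions made for illustration.

```nix
{ pkgs, ... }:
let
  # Hypothetical VXLAN interface name; not taken from the PR.
  dev = "vxlan0";
  bridge = "${pkgs.iproute2}/bin/bridge";

  # Helper that builds a oneshot unit which sets a single bridge port flag.
  mkPortFlagUnit = description: flag: value: {
    inherit description;
    wantedBy = [ "multi-user.target" ];
    after = [ "network.target" ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      ExecStart = "${bridge} link set dev ${dev} ${flag} ${value}";
    };
  };
in
{
  # Flood (ARP/ND) suppression: kept on ordinary hosts.
  systemd.services."vxlan-flood-suppression" =
    mkPortFlagUnit "Enable ARP/ND suppression on ${dev}" "neigh_suppress" "on";

  # MAC learning suppression: kept on every host, including routers.
  systemd.services."vxlan-learning-suppression" =
    mkPortFlagUnit "Disable MAC learning on ${dev}" "learning" "off";

  # A router role could then opt out of flood suppression only, e.g.:
  #   systemd.services."vxlan-flood-suppression".enable = false;
}
```

Keeping the two flags in separate units means that the router role only has to disable one unit, while all other hosts keep both behaviours.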
PL-132482
@flyingcircusio/release-managers
Release process
Impact: internal
Changelog: none
PR release workflow (internal)
Design notes
on or off. Example: rate limiting.
Security implications