[21.05] Improve router behaviour during failovers #1003

Merged 3 commits into fc-21.05-dev from PL-132482-router-improve-failover-behaviour on May 7, 2024

@sysvinit (Member) commented May 7, 2024

We've had some problems with temporary (ca. 20s) loss of connectivity when performing failovers with NixOS routers in the VXLAN setup, both between a Gentoo and a NixOS router and between pairs of NixOS routers. For comparison, in the pre-VXLAN world failover between Gentoo routers was near-instantaneous.

One half of this problem is that we enable ARP and ND suppression on VXLAN interfaces by default. Most of the tutorials I've seen for setting up VXLAN recommend this, as it reduces the amount of flooding traffic which needs to be replicated to all VTEPs in the fabric. However, flood suppression also drops the gratuitous ARPs and NDs which keepalived sends when a router is promoted from backup to primary, which prevents the new primary router from promptly steering outgoing traffic away from the demoted router.

The other half of this problem is that the firewall rule for allowing incoming TFTP packets was only enabled for the primary specialisation, and not the base specialisation. This means that switching between specialisations would cause a firewall reload, and reloading the NixOS firewall causes all incoming packets to be dropped for a few seconds. This is enough to interrupt BFD and BGP sessions, which (depending on the local configuration) can cause routers to temporarily lose their default route. If the router has just been promoted to primary, then the entire site loses connectivity until the BGP session comes back up and the default routes are re-learned.
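
The reload condition described above can be illustrated with a small model. This is a simplified sketch, not the platform's implementation: it assumes the firewall is reloaded whenever the rule set of the target specialisation differs from the current one. The names `firewall_rules` and `failover_reloads_firewall` and the BGP base rule are hypothetical; TFTP is taken to be UDP port 69.

```python
# Simplified model: a specialisation switch reloads the firewall
# whenever the two specialisations' rule sets differ (an assumption).
TFTP = ("udp", 69)   # standard TFTP port
BGP = ("tcp", 179)   # illustrative rule present in both modes


def firewall_rules(primary: bool, tftp_only_on_primary: bool) -> frozenset:
    """Return the set of open (protocol, port) pairs for one specialisation."""
    rules = {BGP}
    if primary or not tftp_only_on_primary:
        rules.add(TFTP)
    return frozenset(rules)


def failover_reloads_firewall(tftp_only_on_primary: bool) -> bool:
    """True if switching between base and primary changes the rule set."""
    base = firewall_rules(primary=False, tftp_only_on_primary=tftp_only_on_primary)
    promoted = firewall_rules(primary=True, tftp_only_on_primary=tftp_only_on_primary)
    return base != promoted


if __name__ == "__main__":
    print(failover_reloads_firewall(tftp_only_on_primary=True))
    print(failover_reloads_firewall(tftp_only_on_primary=False))
```

When the TFTP rule exists only in the primary specialisation, the two rule sets differ and the model reports a reload. When the rule is present in both, the sets are identical and a failover leaves the firewall untouched.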

This change splits the VXLAN bridge port configuration into separate units for flood suppression and MAC learning suppression, and then disables the former in the router role only. This means that flood suppression will stay enabled on all other physical hosts. Additionally, the firewall rules for TFTP are enabled unconditionally, irrespective of whether the router is the primary or not, in order to prevent a firewall reload when performing a failover.
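
As a rough sketch of the split described above, the following generates the per-port commands for each kind of host. It assumes that MAC learning suppression corresponds to the iproute2 `bridge link set ... learning off` option and flood suppression to `neigh_suppress on`; the actual unit contents in this change may differ. The function name `bridge_port_commands` and the interface name `vxlan0` are hypothetical.

```python
def bridge_port_commands(dev: str, is_router: bool) -> list[str]:
    """Return iproute2 commands configuring one VXLAN bridge port."""
    # MAC learning suppression: applied on every host, routers included.
    commands = [f"bridge link set dev {dev} learning off"]
    # ARP/ND (flood) suppression: left off on routers so that gratuitous
    # ARPs and NDs from keepalived reach the rest of the fabric.
    neigh_suppress = "off" if is_router else "on"
    commands.append(f"bridge link set dev {dev} neigh_suppress {neigh_suppress}")
    return commands


if __name__ == "__main__":
    for command in bridge_port_commands("vxlan0", is_router=True):
        print(command)
```

Keeping the two settings as separate steps is what allows one of them to be toggled per role while the other stays the same everywhere.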

PL-132482

@flyingcircusio/release-managers

Release process

Impact: internal

Changelog: none

PR release workflow (internal)

  • PR has internal ticket
  • internal issue ID (PL-…) part of branch name
  • internal issue ID mentioned in PR description text
  • ticket is on Platform agile board
  • ticket state set to Pull request ready
  • if ticket is more urgent than within the next few days, directly contact a member of the Platform team

Design notes

  • Provide a feature toggle if the change might need to be adjusted/reverted quickly depending on context. Consider whether the default should be on or off. Example: rate limiting.
  • All customer-facing features and (NixOS) options need to be discoverable from documentation. Add or update relevant documentation such that hosted and guided customers can understand it as well.

Security implications

  • Security requirements defined? (WHERE)
    • Flood suppression is generally desirable to prevent packets being flooded unnecessarily, especially on multi-tenant hosts such as KVM servers. Suppression is hence only disabled on routers.
    • No other requirements; this change improves the integration of the router role in the platform.
  • Security requirements tested? (EVIDENCE)
    • Manually verified in DEV. Failing over between the NixOS and Gentoo routers causes no visible disruption in an mtr trace left running on a VM in DEV.

We need finer grained control over ARP and ND flooding suppression on
VXLAN interfaces, so split the bridge port configuration into separate
units for flooding suppression and for MAC learning.

PL-132482

Flood suppression on VXLAN interfaces will block the gratuitous ARPs
and NDs sent by keepalived when performing a promotion from backup to
primary router. Selectively disable the flood suppression on router
VXLAN interfaces which have floating addresses managed by keepalived.

PL-132482

The TFTP port must be unconditionally open in the firewall when this
role is enabled. Differences between the firewall in primary and
backup modes will cause a firewall reload, which will cause traffic to
be dropped in the process of a router switching to the primary role.

PL-132482

@ctheune ctheune merged commit 47e1cf6 into fc-21.05-dev May 7, 2024
2 checks passed
@ctheune ctheune deleted the PL-132482-router-improve-failover-behaviour branch May 7, 2024 17:09