Implement BGP add-path #1

danderson · 2017-11-27T21:06:49Z

Currently MetalLB suffers from the same load imbalance problems as GCP when using externalTrafficPolicy: Local, because BGP ends up load-balancing by node, regardless of the pod count on each node.

One possible solution is to implement BGP add-path (RFC 7911) in MetalLB's BGP speaker, and make speakers advertise one distinct path for each pod running on a given node. This would effectively push weight information upstream, and compatible routers should therefore weight their traffic distribution.

Annoyingly, the spec doesn't make it clear that routers are expected to translate multiple distinct paths with the same next-hop as a weighted assignment in their FIB. It would make sense naively, but it may not be what people implementing the spec had in mind in terms of use cases. So, we'll have to implement a prototype and see how BIRD and others behave, and if they behave sensibly, we can productionize the change.

The text was updated successfully, but these errors were encountered:

halfa · 2017-12-07T09:36:07Z

Open question: why are you re-implementing a bpg library specifically for MetalLB instead if using something like GoBGP ? I've done a PoC running it against JunOS routers and encountered issue on the capabilities advertisement.

danderson · 2017-12-07T17:05:46Z

Good question! The initial prototype used gobgp, but I kept having issues and headaches with it. The API surface is completely undocumented (both the Go API and the gRPC API), and is completely un-idiomatic Go (frankly it feels like python that was ported word for word into Go). So when I (frequently) encountered strange behaviors, it was a chore to debug.

Aside from that, GoBGP also implements way more stuff than MetalLB needs. It wants to be a real router, implementing the full BGP convergence algorithm and all that. All that behavior is actively harmful to what MetalLB wants to do (which is just push a couple of routes and ignore everything the peer sends). It's possible to make it work, by segmenting things into VRFs, adding policies to drop all inbound routes, ... But at that point I'm writing significant amounts of code, against an undocumented library that implements 10x what we need, to trick it into not doing 99% of what it tries to do by default. That's a lot of additional complexity just to serialize and transmit an update message.

With that said, there's obviously downsides as well, and the big one is that OSRG has money to buy vendor gear and do interop testing on their quirky BGP stacks, and I don't.

What version of MetalLB did you use for your PoC? Yesterday I updated the manifests to point to a version that should resolve the main source of unhappy peers (not advertising MP-BGP extensions), so I'm curious if this makes JunOS happy as well. If it doesn't, any chance you could file a bug with a pcap of the BGP exchange?

pdf · 2018-04-20T05:22:20Z

Correct me if I'm wrong, but without add-path support, aren't we in a situation where there is actually no load-balancing done when externalTrafficPolicy: Local is specified? Because MetalLB only announces a single next-hop, and expects k8s to perform the actual load-balancing, whereas k8s expects load-balancers to distribute traffic accross service group members, especially with externalTrafficPolicy: Local?

danderson · 2018-04-20T05:40:40Z

That's not accurate. Without add-path, the router receiving multiple advertisements can still choose to do multipath routing between all peers. This is because each path is distinct, in terms of BGP's decision algorithm, because each one comes from a different router ID.

You still need a router capable of doing multipath (~all of the professional ones, and most of the entry level ones), and you have to explicitly enable that behavior (otherwise it will just pick one of the paths and only use that one - this is true even with add-path enabled).

The issue we have without add-path is that the traffic distribution is on a per-node basis, not a per-pod basis. So if you have multiple pods for a service on a single node, it can lead to traffic imbalances, as explained in more detail at https://metallb.universe.tf/usage/#local-traffic-policy.

I should add, and maybe this is the source of confusion: in BGP mode, each speaker is only advertising itself as a nexthop, it's not attempting to advertise the routes on behalf of other machines. So, from the external router's perspective, it sees N routers, each advertising routes that point to themselves - not N routers all advertising the exact same thing.

pdf · 2018-04-20T05:47:44Z

I should add, and maybe this is the source of confusion: in BGP mode, each speaker is only advertising itself as a nexthop, it's not attempting to advertise the routes on behalf of other machines.

Indeed, this contributes to my confusion - what happens in the case where you have externalTrafficPolicy: Local services to route that exist on nodes that aren't running MetalLB speakers?

danderson · 2018-04-20T05:52:08Z

Those pods/nodes won't get any traffic. When using the Local traffic policy, each node is responsible for attracting traffic to itself, if it contains pods for the relevant service.

In fact, this is almost exactly the same as the Cluster traffic policy behavior: each node is responsible for attracting its own traffic, the only difference is that in Cluster mode the attraction is unconditional, and in Local mode there must be at least 1 service pod on the machine.

Once we have add-path, I guess we could revisit that topology. However, there is one big benefit to having each node the only one responsible for attracting its own traffic: if the node fails, the BGP session breaks immediately, and traffic stops flowing to the broken node without having to wait for k8s to converge and notice that the pods are unhealthy. This greatly speeds up draining bad nodes (~0s vs. tens of seconds for a gray failure like power loss).

pdf · 2018-04-20T06:18:06Z

Those pods/nodes won't get any traffic. When using the Local traffic policy, each node is responsible for attracting traffic to itself, if it contains pods for the relevant service.

Right, makes sense now that I understand how this is designed to work (and indeed, I hadn't enabled multipath on the upstream routers in my lab so I was only seeing a single route, derp).

In fact, this is almost exactly the same as the Cluster traffic policy behavior: each node is responsible for attracting its own traffic, the only difference is that in Cluster mode the attraction is unconditional, and in Local mode there must be at least 1 service pod on the machine.

The other difference, I believe, between Local and Cluster is that Cluster traffic will still arrive at a node that can handle the traffic even if that node is not a speaker, since k8s will NAT it to get there.

there is one big benefit to having each node the only one responsible for attracting its own traffic: if the node fails, the BGP session breaks immediately

I can see logic in this approach for fault-detection, vs waiting for k8s to detect problems, though I do wonder if this coupling is always ideal.

Thanks kindly for the explanations, everything makes perfect sense now.

danderson · 2018-04-20T06:52:51Z

You're correct about the additional behavior of the Cluster policy. Courtesy of kube-proxy, it will do "perfect" even distribution across all service pods, at the cost of rewriting the source IP when it NATs.

As for the coupling of speaker and node... I agree, it's a tradeoff. I picked the one I picked mostly arbitrarily, based on past experiences at Google where this behavior was beneficial. But other options are just as valid, as illustrated in #148 for example - in that bug, the person effectively wants the BGP state to reflect what k8s thinks at all times, regardless of the health of individual speaker pods. It's a different set of tradeoffs, in that some failure modes become better and others become worse.

The good news is, when we get add-path, we suddenly have more choices as to which behavior we want... Although I worry about offering too many configuration options, I already have a hard time testing all combinations sufficiently :)

bjornryden · 2018-05-16T07:20:05Z

If I understand this correctly, I see a problem with add-path in regards to how many routes are going into the receiving routers. At a very quick glance the limitations in ECMP routes per destination ranges from 16 (older custom silicon from C/J) to 1024 (Trident 2+).
C/J both implements a weighted ECMP solution based on BGP communities, which would reduce the amount of routes needed to nodes (with active pods) and not pods.
Juniper implementation:
https://www.juniper.net/documentation/en_US/junos/topics/example/bgp-multipath-unequal.html
Cisco implementation:
https://supportforums.cisco.com/t5/service-providers-documents/asr9000-xr-understanding-unequal-cost-multipath-ucmp-dmz-link/ta-p/3138853
IETF Draft:
https://tools.ietf.org/html/draft-ietf-idr-link-bandwidth-07

danderson · 2018-05-16T16:28:12Z

Yup, ECMP group size is a concern. The 100% fix for that is WCMP (weighted cost multipath), which the silicon typically supports but the upper layer protocols don't really.

Thanks for the reference on Cisco and Juniper's approach to this! The RFC raises some questions for me, so I'll have to consult with a couple of networking friends... But it seems plausible to implement.

Regarding ECMP group size: IIRC, Broadcom Scorpion, the predecessor of Trident, already supported 128-way ECMP, and later generations of silicon go up to 512 or even 1024-way ECMP. So, for enterprises that are using even semi-modern hardware, the group size shouldn't be a huge issue. Homelabbers and people using lower-tier hardware may indeed have issues, however, I agree. There's not much I can do about that, except offer the option of using community+bandwidth balancing for people who have compatible hardware.

CJP10 · 2018-12-20T07:32:53Z

I am curious if removing additional paths that were added by add-path have any significant downside.

Assuming they don't, even capping the paths at 16 by default to support older hardware would greatly improve load distribution.

My initial thought is have the node with the most pods allocate 16 paths and the node with the least allocate a single path. All the other nodes, fall in between based off their distribution between the two.

Repeat this process when the distribution of pods changes and this should add some sort of "weighted" routing

rchenzheng · 2020-07-21T20:42:40Z

Any news on fixing this?

Setup Metallb repo for ART request

Jean-Daniel · 2022-08-09T07:16:59Z

Now that metallb switched to FRR, it may be implemented using weighted ECMP. This is an use case FRR is explicitly supporting:

http://docs.frrouting.org/en/stable-7.5/bgp.html#weighted-ecmp-using-bgp-link-bandwidth

One of the primary use cases of this capability is in the data center when a service (represented by its anycast IP) has an unequal set of resources across the regions (e.g., PODs) of the data center

danderson self-assigned this Nov 27, 2017

danderson added the enhancement label Nov 27, 2017

gayathri-baskaran mentioned this issue Dec 15, 2017

MetalLB unable to connect to quagga 0.99.22.4-4 on centos linux #65

Closed

danderson removed their assignment Dec 17, 2017

pawankkamboj mentioned this issue Jan 3, 2018

K8S external-ip(expose using MetalLB) not accessible from BGP router instance #126

Closed

danderson added this to To Do in BGP mode via automation Jan 15, 2018

danderson mentioned this issue Jan 25, 2018

Another BGP speaker should take over announcements in case of failure when using externalTrafficPolicy: Local #148

Open

michaelfig mentioned this issue Apr 26, 2018

layer2 mode doesn't receive broadcast packets on VM unless promiscuous mode is enabled #253

Closed

j-devops mentioned this issue May 11, 2018

have to keep running arping to keep the VIP alive #259

Closed

carlpett mentioned this issue Jul 21, 2018

Following layer 2 tutorial, don't get IP #281

Closed

asvillamar mentioned this issue Jul 30, 2018

Nginx pod not receiving traffic with Local traffic Policy, Layer 2 mode #287

Closed

ghaering mentioned this issue Aug 9, 2018

Failover time very high in layer2 mode #298

Closed

SergeyJyravlev mentioned this issue Aug 14, 2018

Cannot get response from pod #300

Closed

adidenko mentioned this issue Aug 15, 2018

Layer 2: impossible to use MetalLB on a subset of Kubernetes nodes. #302

Closed

muffin87 mentioned this issue Aug 23, 2018

Add liveness and readiness probes for MetalLB components #307

Closed

michaelfig mentioned this issue Oct 4, 2018

layer2 only announces when speaker nodes overlap with service pod nodes #322

Closed

This was referenced Oct 15, 2018

metallb not able to reach k8s-apiserver when Istio is enabled #327

Closed

Metallb doesn't send/receive ARP when Istio is enabled #330

Closed

jeroenjacobs79 mentioned this issue Nov 25, 2018

Works on one worker, but not on the other #345

Closed

dmytroivanovv mentioned this issue Dec 1, 2018

Service is unreachable from outside #355

Closed

dirtbag mentioned this issue Dec 13, 2018

cant connect to LoadBalancer ip addr from remote host #363

Closed

davidschrooten mentioned this issue Jan 14, 2019

Weave's default configuration is broken with externalTrafficPolicy: Local #380

Closed

moikot mentioned this issue Feb 15, 2019

Speaker stops responding on ARP requests #402

Closed

k2mahajan mentioned this issue Sep 26, 2019

Log previous error from parseCIDR in parseAddressPool #481

Closed

tiandong19860806 mentioned this issue Dec 25, 2019

After installed, the kubentes machine can't access the internet. #512

Closed

russellb mentioned this issue Aug 6, 2020

Move website to CNCF-provided Netlify account under metallb.io #691

Open

markdgray added the protocol/BGP label May 6, 2021

sabinaaledort pushed a commit to sabinaaledort/metallb that referenced this issue Jul 4, 2021

Merge pull request metallb#1 from loadbalancer-api/metallb-setup

1f7af9b

Setup Metallb repo for ART request

russellb mentioned this issue Oct 8, 2021

L2: use externalips for speaker allocation and remove the need for a local backend in case of traffic policy cluster #976

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement BGP add-path #1

Implement BGP add-path #1

danderson commented Nov 27, 2017

halfa commented Dec 7, 2017 •

edited

danderson commented Dec 7, 2017

pdf commented Apr 20, 2018

danderson commented Apr 20, 2018

pdf commented Apr 20, 2018

danderson commented Apr 20, 2018

pdf commented Apr 20, 2018

danderson commented Apr 20, 2018

bjornryden commented May 16, 2018

danderson commented May 16, 2018

CJP10 commented Dec 20, 2018 •

edited

rchenzheng commented Jul 21, 2020

Jean-Daniel commented Aug 9, 2022 •

edited

Implement BGP add-path #1

Implement BGP add-path #1

Comments

danderson commented Nov 27, 2017

halfa commented Dec 7, 2017 • edited

danderson commented Dec 7, 2017

pdf commented Apr 20, 2018

danderson commented Apr 20, 2018

pdf commented Apr 20, 2018

danderson commented Apr 20, 2018

pdf commented Apr 20, 2018

danderson commented Apr 20, 2018

bjornryden commented May 16, 2018

danderson commented May 16, 2018

CJP10 commented Dec 20, 2018 • edited

rchenzheng commented Jul 21, 2020

Jean-Daniel commented Aug 9, 2022 • edited

halfa commented Dec 7, 2017 •

edited

CJP10 commented Dec 20, 2018 •

edited

Jean-Daniel commented Aug 9, 2022 •

edited