
Improve the ipAddrAdd performance #965

Closed
wants to merge 1 commit into from

Conversation


@iamakulov iamakulov commented Aug 6, 2020

This PR replaces the Exec() call inside ipAddrAdd() with netlink.RouteReplace().

When calling ipAddrAdd in a loop (e.g., inside setupClusterIPServices), this gives a significant performance boost.

Ref: #962

Comment on lines 147 to 150
Table: unix.RT_TABLE_LOCAL,
Protocol: unix.RTPROT_KERNEL,
Scope: unix.RT_SCOPE_HOST,
Type: unix.RTN_LOCAL,
Contributor Author


These parameters come from running the original ip route replace ... command through strace:

strace ip route replace local 10.97.216.74 dev kube-dummy-if table local proto kernel scope host src 172.17.0.2

which prints the system call with needed constants:

sendmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=52, type=RTM_NEWROUTE, flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_REPLACE|NLM_F_CREATE, seq=1596701111, pid=0}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_HOST, rtm_type=RTN_LOCAL, rtm_flags=0}, [{{nla_len=8, nla_type=RTA_DST}, inet_addr("10.97.216.74")}, {{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("172.17.0.2")}, {{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("kube-dummy-if")}]}, iov_len=52}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 52
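The mapping from that strace output onto a netlink route can be sketched as follows. This is a stdlib-only illustration: the real change calls `netlink.RouteReplace` from `github.com/vishvananda/netlink` with constants from `golang.org/x/sys/unix`, while `localVIPRoute` and `buildLocalVIPRoute` here are hypothetical names used only to show which strace values map to which fields:

```go
package main

import (
	"fmt"
	"net"
)

// Linux rtnetlink constants, as named in the PR via the unix package.
const (
	rtTableLocal  = 255 // unix.RT_TABLE_LOCAL
	rtProtoKernel = 2   // unix.RTPROT_KERNEL
	rtScopeHost   = 254 // unix.RT_SCOPE_HOST
	rtnLocal      = 2   // unix.RTN_LOCAL
)

// localVIPRoute mirrors the netlink.Route fields that the PR fills in
// before calling netlink.RouteReplace.
type localVIPRoute struct {
	Dst      *net.IPNet // RTA_DST: the service VIP as a /32
	Src      net.IP     // RTA_PREFSRC: the node IP
	Table    int
	Protocol int
	Scope    int
	Type     int
}

// buildLocalVIPRoute (hypothetical helper) assembles the same route that
// `ip route replace local <vip> dev kube-dummy-if table local proto kernel
// scope host src <nodeIP>` sends over netlink.
func buildLocalVIPRoute(vip, nodeIP string) localVIPRoute {
	return localVIPRoute{
		Dst:      &net.IPNet{IP: net.ParseIP(vip), Mask: net.CIDRMask(32, 32)},
		Src:      net.ParseIP(nodeIP),
		Table:    rtTableLocal,
		Protocol: rtProtoKernel,
		Scope:    rtScopeHost,
		Type:     rtnLocal,
	}
}

func main() {
	r := buildLocalVIPRoute("10.97.216.74", "172.17.0.2")
	fmt.Printf("dst=%s src=%s table=%d\n", r.Dst, r.Src, r.Table)
	// prints: dst=10.97.216.74/32 src=172.17.0.2 table=255
}
```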


murali-reddy commented Aug 7, 2020

@iamakulov thanks for your PR

We ran into some issues with netlink.RouteReplace on some distros and specific kernel versions. Please see #370 (comment).

The route that netlink.RouteReplace sets up was added to address a specific issue with the service proxy on nodes with multiple network interfaces. The problem is described in kubernetes/kubeadm#102 (comment), and the recommended solution is https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#check-network-adapters. That piece of code tries to automate the above requirement.

In hindsight, adding a route for each VIP, whether through an exec command or the netlink library, just does not scale. Now that kube-router has --service-cluster-ip-range, we should remove the per-cluster-IP route setup altogether and set a single rule to use the node IP as the source when the destination falls within --service-cluster-ip-range.

Not sure whether you have the motivation to work on the fix, but here is what I think would be a better approach:

  • remove the whole ip route logic from ipAddrAdd
  • if --service-cluster-ip-range is specified, set up an ip rule to use the node IP as the source when the destination is in the given range

Unfortunately, the CIDR used for cluster IPs cannot be inferred from the Kubernetes APIs, so we will have to rely on user input through this parameter.
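That single-route idea can be sketched as follows. `buildClusterCIDRRouteArgs` is a hypothetical helper, and the exact set of route attributes is an assumption for illustration, not code from the PR; the point is that the flag is validated once and exactly one route covers the whole range:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// buildClusterCIDRRouteArgs (hypothetical) validates the
// --service-cluster-ip-range flag and builds the arguments for a single
// `ip route replace` covering the whole range, with the node IP as the
// preferred source address, instead of one route per cluster IP.
func buildClusterCIDRRouteArgs(serviceClusterIPRange, nodeIP string) ([]string, error) {
	_, cidr, err := net.ParseCIDR(serviceClusterIPRange)
	if err != nil {
		return nil, fmt.Errorf("invalid --service-cluster-ip-range %q: %v", serviceClusterIPRange, err)
	}
	return []string{"route", "replace", cidr.String(),
		"dev", "kube-dummy-if", "proto", "kernel",
		"scope", "host", "src", nodeIP, "table", "local"}, nil
}

func main() {
	args, err := buildClusterCIDRRouteArgs("10.96.0.0/16", "172.17.0.2")
	if err != nil {
		panic(err)
	}
	fmt.Println("ip " + strings.Join(args, " "))
	// prints: ip route replace 10.96.0.0/16 dev kube-dummy-if proto kernel scope host src 172.17.0.2 table local
}
```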

iamakulov added a commit to iamakulov/kube-router that referenced this pull request Aug 14, 2020
This commit removes the `Exec("ip", "route", "replace")` call from `ipAddrAdd`. When `setupClusterIPServices` calls `ipAddrAdd` in a loop, `Exec()` gets executed multiple times and becomes a hot path.

Instead of these looped `Exec()` calls, this method adds a single `Exec("ip", "route", "replace")` call that sets up routing based on the cluster CIDR. This was suggested in cloudnativelabs#965 (comment)
@iamakulov

@murali-reddy Just took a round on your suggestion! Did I get your idea right?

@murali-reddy

Thanks @iamakulov, the PR looks good. I'll give it a quick test and update the PR.

@murali-reddy murali-reddy self-assigned this Aug 18, 2020
@iamakulov

(Rebased on master.)

@murali-reddy murali-reddy added this to In Progress in 1.2 via automation Sep 16, 2020
@murali-reddy

@iamakulov Apologies for the delay in review. We are trying to get the 1.1 release out in the next couple of days. Marking this for the 1.2 release, which is focused on performance enhancements.

@iamakulov

Tested this patch in production, and it doesn’t work as expected.

1) For new nodes, nsc.enforceRightServiceInterface gets called when kube-dummy-if doesn’t exist yet – and fails.

2) The change resurrects #376.

netlink.AddrAdd not only associates an address with an interface, but also adds a route to the routing table. And the route uses the same IP address in the src field:

$ ip addr add 10.10.10.10 dev kube-dummy-if
$ ip addr list kube-dummy-if
8: kube-dummy-if: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default 
    link/ether 56:51:fe:11:e3:02 brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.10/32 brd 10.10.10.10 scope link kube-dummy-if
       valid_lft forever preferred_lft forever
$ ip route show table all | grep 10.10.10.10
local 10.10.10.10 dev kube-dummy-if table local proto kernel scope host src 10.10.10.10 # ← here, `src 10.10.10.10` is the problem
broadcast 10.10.10.10 dev kube-dummy-if table local proto kernel scope link src 10.10.10.10

Before, we were fixing this by calling ip route replace for each service IP. That was costly, so this PR removed that call and started calling ip route replace <service-ip-cidr> .... But that resurrected #376 – because the CIDR route is less specific than netlink.AddrAdd’s routes and doesn’t have any effect.


To fix both issues, I’m going to change the approach:

  • I’ll bring ip route replace back to nsc.ln.ipAddrAdd()
  • and I’ll make setupClusterIPServices call nsc.ln.ipAddrAdd() only once (with the full services CIDR) instead of X times (for each service’s IP)

I’ll test this change in our system and then cherry-pick it here.

@iamakulov

Also, re: UX/DX.

This PR makes the service-cluster-ip-range param much more important. Before, if you specified it incorrectly (or if the default 10.96.0.0/16 value was wrong), it only affected network policies (which aren’t that common, I guess). Now, an incorrect value can break the service connectivity completely. (I learned this the hard way.)

Kubernetes already knows the service-cluster-ip-range value. Could kube-router retrieve it automatically and use the actual value (instead of a hard-coded default) in its defaults? That'd make life quite a bit easier for a lot of people who aren't that experienced with networking/Kubernetes administration (like me).

@aauren

aauren commented Sep 22, 2020

@iamakulov While it's true that Kubernetes does know the service-cluster-ip-range value, as far as I know, it doesn't expose that information anywhere that is obtainable through the API. Although, it may just be that it's not in a common place and I haven't run across it yet?

@iamakulov

While it's true that Kubernetes does know the service-cluster-ip-range value, as far as I know, it doesn't expose that information anywhere that is obtainable through the API. Although, it may just be that it's not in a common place and I haven't run across it yet?

Right, it appears that it’s not implemented: kubernetes/kubernetes#25533 kubernetes/kubernetes#46508

Would it be a viable idea to make --service-cluster-ip-range a required argument then? (To ensure that people have to configure it properly, and don't accidentally skip it and break their cluster.) I'm not sure how common the 10.96.0.0/16 default actually is, but at least kops uses 100.64.0.0/13 instead.

(If yes, it might be a good idea to also document how to retrieve this value – from kubectl cluster-info dump | grep service-cluster-ip-range.)

@aauren aauren left a comment

Finally coming back around to this one. The only thing that I can see is that we should do as you suggest and make --service-cluster-ip-range a required argument, as well as update the documentation in this area.

Otherwise this looks good to me.

aauren pushed a commit to aauren/kube-router that referenced this pull request Mar 19, 2021
@@ -1006,7 +995,7 @@ func (ln *linuxNetworking) prepareEndpointForDsr(containerID string, endpointIP
}

// assign VIP to the KUBE_TUNNEL_IF interface
- err = ln.ipAddrAdd(tunIf, vip, false)
+ err = ln.ipAddrAdd(tunIf, vip)

naddr := &netlink.Addr{IPNet: &net.IPNet{IP: net.ParseIP(ip), Mask: net.IPv4Mask(255, 255, 255, 255)}, Scope: syscall.RT_SCOPE_LINK}
err := netlink.AddrAdd(iface, naddr)
if err != nil && err.Error() != IfaceHasAddr {
glog.Errorf("Failed to assign cluster ip %s to dummy interface: %s",
naddr.IPNet.IP.String(), err.Error())
return err
}

Collaborator

@iamakulov I found an additional problem with this logic. Down below, you only account for the ClusterIP range in the enforceRightServiceInterface() function. However, we also need routes inserted for ExternalIP and LoadBalancer VIPs if we remove this section here.

ExternalIPs and LoadBalancer VIPs are less likely to be in a cohesive subnet, and I don't think that it would be good to use the same process for those (even though there is a parameter for them) as it would likely exclude certain valid use-cases for users.

I think that we'll likely have to go back to the drawing board here.

@aauren aauren removed this from In Progress in 1.2 Mar 19, 2021
@iamakulov

Hey!

I’m afraid I left the project I was working for when I implemented this patch, so I won’t be able to contribute fixes for this PR. Sorry!

However, as I mentioned in #965 (comment), this patch actually doesn’t work as expected, so this PR can’t be merged as-is. In the project I was working for, we ended up taking the alternative approach:

To fix both issues, I’m going to change the approach:

  • I’ll bring ip route replace back to nsc.ln.ipAddrAdd()
  • and I’ll make setupClusterIPServices call nsc.ln.ipAddrAdd() only once (with the full services CIDR) instead of X times (for each service’s IP)

I’ll test this change in our system and then cherry-pick it here.

– and this one worked well. It’s been running in production for a few months with no observable issues.

Let me know if there’s anything else I can help with!

@aauren

aauren commented Feb 19, 2022

Closing as this PR is incomplete and the author unfortunately had to move on. In NPC we've started taking the approach of using ipset save and ipset restore, and we'll likely work that same functionality into the NSC in the future as well if we want to approach this problem again.

@aauren aauren closed this Feb 19, 2022