
Improve the ipAddrAdd performance #965

Closed
wants to merge 1 commit into from

Conversation


@iamakulov iamakulov commented Aug 6, 2020

This PR replaces the Exec() call inside ipAddrAdd() with netlink.RouteReplace().

When calling ipAddrAdd in a loop (e.g., inside setupClusterIPServices), this gives a significant performance boost.

Ref: #962

Comment on lines 147 to 150
Table: unix.RT_TABLE_LOCAL,
Protocol: unix.RTPROT_KERNEL,
Scope: unix.RT_SCOPE_HOST,
Type: unix.RTN_LOCAL,
Contributor Author


These parameters come from running the original ip route replace ... command through strace:

strace ip route replace local 10.97.216.74 dev kube-dummy-if table local proto kernel scope host src 172.17.0.2

which prints the system call with needed constants:

sendmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=52, type=RTM_NEWROUTE, flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_REPLACE|NLM_F_CREATE, seq=1596701111, pid=0}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_HOST, rtm_type=RTN_LOCAL, rtm_flags=0}, [{{nla_len=8, nla_type=RTA_DST}, inet_addr("10.97.216.74")}, {{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("172.17.0.2")}, {{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("kube-dummy-if")}]}, iov_len=52}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 52
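The mapping from that strace output onto a netlink route can be sketched as follows. This is a stdlib-only illustration: the real change calls `netlink.RouteReplace` from `github.com/vishvananda/netlink` with constants from `golang.org/x/sys/unix`, while `localVIPRoute` and `buildLocalVIPRoute` here are hypothetical names used only to show which strace values map to which fields:

```go
package main

import (
	"fmt"
	"net"
)

// Linux rtnetlink constants, as named in the PR via the unix package.
const (
	rtTableLocal  = 255 // unix.RT_TABLE_LOCAL
	rtProtoKernel = 2   // unix.RTPROT_KERNEL
	rtScopeHost   = 254 // unix.RT_SCOPE_HOST
	rtnLocal      = 2   // unix.RTN_LOCAL
)

// localVIPRoute mirrors the netlink.Route fields that the PR fills in
// before calling netlink.RouteReplace.
type localVIPRoute struct {
	Dst      *net.IPNet // RTA_DST: the service VIP as a /32
	Src      net.IP     // RTA_PREFSRC: the node IP
	Table    int
	Protocol int
	Scope    int
	Type     int
}

// buildLocalVIPRoute (hypothetical helper) assembles the same route that
// `ip route replace local <vip> dev kube-dummy-if table local proto kernel
// scope host src <nodeIP>` sends over netlink.
func buildLocalVIPRoute(vip, nodeIP string) localVIPRoute {
	return localVIPRoute{
		Dst:      &net.IPNet{IP: net.ParseIP(vip), Mask: net.CIDRMask(32, 32)},
		Src:      net.ParseIP(nodeIP),
		Table:    rtTableLocal,
		Protocol: rtProtoKernel,
		Scope:    rtScopeHost,
		Type:     rtnLocal,
	}
}

func main() {
	r := buildLocalVIPRoute("10.97.216.74", "172.17.0.2")
	fmt.Printf("dst=%s src=%s table=%d\n", r.Dst, r.Src, r.Table)
	// prints: dst=10.97.216.74/32 src=172.17.0.2 table=255
}
```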


murali-reddy commented Aug 7, 2020

@iamakulov thanks for your PR

We ran into some issues with netlink.RouteReplace on some distros and specific kernel versions. Please see #370 (comment).

The route that netlink.RouteReplace sets up was added to address a specific issue with the service proxy on nodes with multiple network interfaces. The problem is described in kubernetes/kubeadm#102 (comment), and the recommended solution is https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#check-network-adapters. That piece of code tries to automate the above requirement.

In hindsight, adding a route for each VIP, whether through an exec command or the netlink library, just does not scale. Now that kube-router has --service-cluster-ip-range, we should remove the per-cluster-IP route setup altogether and set a single rule to use the node IP as the source when the destination falls within --service-cluster-ip-range.

Not sure whether you have the motivation to work on the fix, but here is what I think would be a better approach:

  • remove the whole ip route logic from ipAddrAdd
  • if --service-cluster-ip-range is specified, set up an ip rule to use the node IP as the source when the destination is in the given range

Unfortunately, the CIDR used for cluster IPs cannot be inferred from the Kubernetes APIs, so we will have to rely on user input through this parameter.
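That single-route idea can be sketched as follows. `buildClusterCIDRRouteArgs` is a hypothetical helper, and the exact set of route attributes is an assumption for illustration, not code from the PR; the point is that the flag is validated once and exactly one route covers the whole range:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// buildClusterCIDRRouteArgs (hypothetical) validates the
// --service-cluster-ip-range flag and builds the arguments for a single
// `ip route replace` covering the whole range, with the node IP as the
// preferred source address, instead of one route per cluster IP.
func buildClusterCIDRRouteArgs(serviceClusterIPRange, nodeIP string) ([]string, error) {
	_, cidr, err := net.ParseCIDR(serviceClusterIPRange)
	if err != nil {
		return nil, fmt.Errorf("invalid --service-cluster-ip-range %q: %v", serviceClusterIPRange, err)
	}
	return []string{"route", "replace", cidr.String(),
		"dev", "kube-dummy-if", "proto", "kernel",
		"scope", "host", "src", nodeIP, "table", "local"}, nil
}

func main() {
	args, err := buildClusterCIDRRouteArgs("10.96.0.0/16", "172.17.0.2")
	if err != nil {
		panic(err)
	}
	fmt.Println("ip " + strings.Join(args, " "))
	// prints: ip route replace 10.96.0.0/16 dev kube-dummy-if proto kernel scope host src 172.17.0.2 table local
}
```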

iamakulov added a commit to iamakulov/kube-router that referenced this pull request Aug 14, 2020
This commit removes the `Exec("ip", "route", "replace")` call from `ipAddrAdd`. When `setupClusterIPServices` calls `ipAddrAdd` in a loop, `Exec()` gets executed multiple times and becomes a hot path.

Instead of these looped `Exec()` calls, this method adds a single `Exec("ip", "route", "replace")` call that sets up routing based on the cluster CIDR. This was suggested in cloudnativelabs#965 (comment)
@iamakulov

@murali-reddy Just took a round on your suggestion! Did I get your idea right?

@murali-reddy

Thanks @iamakulov, the PR looks good. I'll give it a quick test and update the PR.

@murali-reddy murali-reddy self-assigned this Aug 18, 2020
@iamakulov

(Rebased on master.)

@murali-reddy murali-reddy added this to In Progress in 1.2 via automation Sep 16, 2020
@murali-reddy

@iamakulov Apologies for the delay in review. We are trying to get the 1.1 release out in the next couple of days. Marking this for the 1.2 release, which is focused on performance enhancements.

@iamakulov

Tested this patch in production, and it doesn’t work as expected.

1) For new nodes, nsc.enforceRightServiceInterface gets called when kube-dummy-if doesn’t exist yet – and fails.

2) The change resurrects #376.

netlink.AddrAdd not only associates an address with an interface, but also adds a route to the routing table. And the route uses the same IP address in the src field:

$ ip addr add 10.10.10.10 dev kube-dummy-if
$ ip addr list kube-dummy-if
8: kube-dummy-if: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default 
    link/ether 56:51:fe:11:e3:02 brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.10/32 brd 10.10.10.10 scope link kube-dummy-if
       valid_lft forever preferred_lft forever
$ ip route show table all | grep 10.10.10.10
local 10.10.10.10 dev kube-dummy-if table local proto kernel scope host src 10.10.10.10 # ← here, `src 10.10.10.10` is the problem
broadcast 10.10.10.10 dev kube-dummy-if table local proto kernel scope link src 10.10.10.10

Before, we were fixing this by calling ip route replace for each service IP. That was costly, so this PR removed that call and started calling ip route replace <service-ip-cidr> .... But that resurrected #376 – because the CIDR route is less specific than netlink.AddrAdd’s routes and doesn’t have any effect.


To fix both issues, I’m going to change the approach:

  • I’ll bring ip route replace back to nsc.ln.ipAddrAdd()
  • and I’ll make setupClusterIPServices call nsc.ln.ipAddrAdd() only once (with the full services CIDR) instead of X times (for each service’s IP)

I’ll test this change in our system and then cherry-pick it here.

@iamakulov

Also, re: UX/DX.

This PR makes the service-cluster-ip-range param much more important. Before, if you specified it incorrectly (or if the default 10.96.0.0/16 value was wrong), it only affected network policies (which aren’t that common, I guess). Now, an incorrect value can break the service connectivity completely. (I learned this the hard way.)

Kubernetes already knows the service-cluster-ip-range value. Could kube-router retrieve it automatically and use the actual value (instead of a hard-coded default) in its defaults? That'd make life quite a bit easier for a lot of people who aren't that experienced with networking/Kubernetes administration (like me).

@aauren

aauren commented Sep 22, 2020

@iamakulov While it's true that Kubernetes does know the service-cluster-ip-range value, as far as I know, it doesn't expose that information anywhere that is obtainable through the API. Although, it may just be that it's not in a common place and I haven't run across it yet?

@iamakulov

While it's true that Kubernetes does know the service-cluster-ip-range value, as far as I know, it doesn't expose that information anywhere that is obtainable through the API. Although, it may just be that it's not in a common place and I haven't run across it yet?

Right, it appears that it’s not implemented: kubernetes/kubernetes#25533 kubernetes/kubernetes#46508

Would it be a viable idea to make --service-cluster-ip-range a required argument then? (To ensure that people have to configure it properly, and don't accidentally skip it and break their cluster.) I'm not sure how common the 10.96.0.0/16 default actually is, but at least kops uses 100.64.0.0/13 instead.

(If yes, it might be a good idea to also document how to retrieve this value – from kubectl cluster-info dump | grep service-cluster-ip-range.)

@aauren aauren left a comment

Finally coming back around to this one. The only thing that I can see is that we should do as you suggest and make --service-cluster-ip-range a required argument, as well as update the documentation in this area.

Otherwise this looks good to me.

aauren pushed a commit to aauren/kube-router that referenced this pull request Mar 19, 2021
@@ -1006,7 +995,7 @@ func (ln *linuxNetworking) prepareEndpointForDsr(containerID string, endpointIP
}

// assign VIP to the KUBE_TUNNEL_IF interface
- err = ln.ipAddrAdd(tunIf, vip, false)
+ err = ln.ipAddrAdd(tunIf, vip)

naddr := &netlink.Addr{IPNet: &net.IPNet{IP: net.ParseIP(ip), Mask: net.IPv4Mask(255, 255, 255, 255)}, Scope: syscall.RT_SCOPE_LINK}
err := netlink.AddrAdd(iface, naddr)
if err != nil && err.Error() != IfaceHasAddr {
glog.Errorf("Failed to assign cluster ip %s to dummy interface: %s",
naddr.IPNet.IP.String(), err.Error())
return err
}

Collaborator

@iamakulov I found an additional problem with this logic. Down below, you only account for the ClusterIP range in the enforceRightServiceInterface() function. However, we also need routes inserted for ExternalIP and LoadBalancer VIPs if we remove this section here.

ExternalIPs and LoadBalancer VIPs are less likely to be in a cohesive subnet, and I don't think that it would be good to use the same process for those (even though there is a parameter for them) as it would likely exclude certain valid use-cases for users.

I think that we'll likely have to go back to the drawing board here.

@aauren aauren removed this from In Progress in 1.2 Mar 19, 2021
@iamakulov

Hey!

I’m afraid I left the project I was working for when I implemented this patch, so I won’t be able to contribute fixes for this PR. Sorry!

However, as I mentioned in #965 (comment), this patch actually doesn’t work as expected, so this PR can’t be merged as-is. In the project I was working for, we ended up taking the alternative approach:

To fix both issues, I’m going to change the approach:

  • I’ll bring ip route replace back to nsc.ln.ipAddrAdd()
  • and I’ll make setupClusterIPServices call nsc.ln.ipAddrAdd() only once (with the full services CIDR) instead of X times (for each service’s IP)

I’ll test this change in our system and then cherry-pick it here.

– and this one worked well. It’s been running in production for a few months with no observable issues.

Let me know if there’s anything else I can help with!

@aauren

aauren commented Feb 19, 2022

Closing as this PR is incomplete and the author unfortunately had to move on. In NPC we've started taking the approach of using ipset save and ipset restore, and we'll likely work that same functionality into the NSC in the future as well if we want to approach this problem again.

@aauren aauren closed this Feb 19, 2022