
Crippling xtables lock contention on kubernetes #933

Closed

SleepyBrett opened this issue Jan 26, 2018 · 4 comments

Comments

@SleepyBrett
Contributor

Expected Behavior

Kube-proxy should be able to get a word in edgewise: acquire the xtables lock in a reasonable time and update our iptables rules.

Current Behavior

We currently have three Kubernetes 1.7.6 clusters in prod: one fairly large one (about 50 m4.10xl nodes in AWS, 818 services) and two smaller clusters that run about 200 services apiece but are otherwise identical.

As part of our move to Kubernetes 1.9 by way of 1.8, we performed an upgrade from flannel 0.8.0 to 0.9.1. Things went smoothly on the two smaller clusters; however, on the large cluster, as soon as a flannel pod was upgraded on a node, the kube-proxy on that node started reporting:

kube-proxy-w8hxt:kube-proxy E0126 20:16:49.022132 1 proxier.go:1601] Failed to execute iptables-restore: failed to acquire old iptables lock: timed out waiting for the condition

As part of our troubleshooting we also moved to flannel 0.10.0; same issue.

I don't have any in-depth knowledge of how the iptables xtables lock works, but on an upgraded box we were seeing upwards of 4 iptables processes pending at all times (/usr/local/bin/iptables commands with --wait flags). They don't seem to have a FIFO-type queuing arrangement; I imagine it's just whichever process checks and finds the lock free that runs.
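As far as I can tell, the lock is just an exclusive flock on /run/xtables.lock and --wait simply retries until it gets it, which would explain the lack of fairness. Here is a minimal Go sketch of that mechanism as I understand it (an illustration only, not iptables source; the lock path and retry interval are assumptions):

```go
// Rough sketch of how I understand the xtables lock to work; this is an
// illustration, not iptables source. The lock file path and retry interval
// are assumptions.
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

const xtablesLockPath = "/run/xtables.lock"

// acquireXtablesLock mimics `iptables --wait`: poll for an exclusive flock
// until it succeeds or the timeout expires. Because every waiter just polls,
// whichever process happens to retry first after a release wins -- no FIFO.
func acquireXtablesLock(timeout time.Duration) (*os.File, error) {
	f, err := os.OpenFile(xtablesLockPath, os.O_CREATE, 0600)
	if err != nil {
		return nil, err
	}
	deadline := time.Now().Add(timeout)
	for {
		if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err == nil {
			return f, nil // lock is held until the file is closed
		}
		if time.Now().After(deadline) {
			f.Close()
			return nil, fmt.Errorf("timed out waiting for the xtables lock")
		}
		time.Sleep(200 * time.Millisecond)
	}
}

func main() {
	lock, err := acquireXtablesLock(5 * time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lock.Close()
	fmt.Println("holding the xtables lock; any concurrent iptables --wait will block")
}
```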

Digging through the changelog we found this pull request: #808

We suspected this was the root cause and that a 5s check for these rules was causing excessive contention on our nodes.

Possible Solution

I've built a replacement container against 0.10.0 in which I quickly changed the 5s check to a 5m check and deployed it on one of the nodes that was previously affected; so far, so good.

I think hardcoding this value is a mistake; a flag or other configuration could be provided to adjust this sync timer. I'm perfectly happy to work up the pull request for this feature, but would like some guidance on how you would like this new parameter provided (I suspect a command-line flag that defaults to 5 seconds).
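Something along these lines is what I have in mind. This is a rough sketch only; the flag name, default, and ensureRules helper are placeholders, not flannel's actual code:

```go
// Rough sketch of the kind of change I have in mind -- not flannel's actual
// code. The flag name, default, and ensureRules helper are placeholders.
package main

import (
	"flag"
	"log"
	"time"
)

var ipTablesResyncPeriod = flag.Duration("iptables-resync", 5*time.Second,
	"how often to re-check that flannel's iptables rules are in place")

func main() {
	flag.Parse()
	ticker := time.NewTicker(*ipTablesResyncPeriod)
	defer ticker.Stop()
	for range ticker.C {
		// A longer period means far fewer iptables invocations fighting
		// kube-proxy for the xtables lock on busy nodes.
		if err := ensureRules(); err != nil {
			log.Printf("failed to ensure iptables rules: %v", err)
		}
	}
}

// ensureRules stands in for the rule check/apply logic that currently runs
// every 5 seconds.
func ensureRules() error {
	return nil
}
```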

Steps to Reproduce (for bugs)

  1. Have a largish cluster running kube-proxy in iptables mode with a large number of services and endpoints.
  2. Install flannel and monitor the kube-proxy logs (a log-scanning sketch follows this list).
  3. Potentially delete some pods to trigger an iptables rebuild.
  4. Observe the lock-timeout errors.
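For step 2, a rough sketch of one way to watch for the errors; the "k8s-app=kube-proxy" label is an assumption about how kube-proxy is deployed in your cluster, and the error string is the one quoted above:

```go
// Rough sketch only: shells out to kubectl and counts the lock-timeout error
// across kube-proxy pods. The "k8s-app=kube-proxy" label is an assumption
// about how kube-proxy is deployed in your cluster.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	cmd := exec.Command("kubectl", "-n", "kube-system", "logs",
		"-l", "k8s-app=kube-proxy", "--tail=1000")
	out, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	hits := 0
	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "failed to acquire old iptables lock") {
			hits++
			fmt.Println(line)
		}
	}
	if err := cmd.Wait(); err != nil {
		log.Printf("kubectl exited with error: %v", err)
	}
	fmt.Printf("saw %d lock-timeout errors in the sampled logs\n", hits)
}
```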

Context

I think I covered the context above.

Your Environment

  • Flannel version: 0.9.1, 0.10.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: not sure at the moment; I suspect it's not relevant
  • Kubernetes version (if used): 1.7.6 w/ kube-proxy in iptables mode
  • Operating System and version: CoreOS, current beta channel
  • Link to your project (optional):
@tomdee
Contributor

tomdee commented Jan 26, 2018

Thanks for the great issue report. A command-line arg would be the right way to configure this. Allowing very large values would also provide a mechanism for disabling it. It would be great to add a note to troubleshooting.md about this too. I think it's worth getting this fix into the next release, but longer term we'll probably want to do something better.

@tomdee
Contributor

tomdee commented Jan 29, 2018

Going to leave this open for now to help track a longer term solution.

@squeed
Contributor

squeed commented Jan 29, 2018

Switch to nftables :-) ?

@stale

stale bot commented Jan 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 26, 2023
@stale stale bot closed this as completed Feb 16, 2023