Enable fully random masquerading ports #1001

Conversation

discordianfish

Description

There is a race condition in Linux which can lead to packets being
dropped when many connections are established to the same address.

See this for more detail: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

This commit introduces the proposed workaround of enabling fully
randomized masquerade ports (iptables --random-fully). Since this flag
requires iptables 1.6.2, the image has to use alpine:edge until the next
Alpine release.

See this for similar discussion in weave: weaveworks/weave#3287
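
For illustration, a masquerade rule with this flag might look like the sketch below; the pod subnet is a placeholder, not the exact rule flannel installs.

# Illustrative only: MASQUERADE with fully randomized source port mapping.
# Requires iptables >= 1.6.2; 10.244.0.0/16 stands in for the cluster's pod network.
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE --random-fully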

@discordianfish
Author

Honestly, I'm not sure this helps. I've just tested this on a cluster here and I don't see the conntrack insert failures going down at all.

@discordianfish
Author

[graphs: insert-failed, lookup-time]

In my case it got worse, so probably don't merge this. There is an underlying issue somewhere, though.

@maxlaverse

Hi @discordianfish
Could it be that you've exhausted the triplet local_ip:nameserver_ip:nameserver_port? How full is your conntrack table?
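
As a side note, one quick way to answer that (standard conntrack sysctls, not quoted from this thread):

# Compare the current number of tracked connections against the table's limit.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max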

@discordianfish
Author

@maxlaverse Oh Hi! :) No, the conntrack table isn't full. There are at least ~100k entries available on all my nodes:

bottomk(3, node_nf_conntrack_entries_limit - node_nf_conntrack_entries)
{instance="172.20.163.110:9100",job="kubernetes-service-endpoints",kubernetes_name="node-exporter",kubernetes_namespace="prometheus"} | 99950
{instance="172.20.160.36:9100",job="kubernetes-service-endpoints",kubernetes_name="node-exporter",kubernetes_namespace="prometheus"} | 100297
{instance="172.20.164.169:9100",job="kubernetes-service-endpoints",kubernetes_name="node-exporter",kubernetes_namespace="prometheus"} | 127662

@maxlaverse

As discussed together, I don't know what the issue could be here, nor why --random-fully makes the situation worse :/
But if this were going to be merged, I think I would make it configurable. There is also a very interesting hack in the link you posted, using tc to randomly slow some packets down (sketched below). Maybe worth a try.

Our plan is still to get rid of NAT completely.
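
A rough sketch of that tc idea, for illustration only: add a small random delay to DNS packets so that parallel lookups are less likely to race on conntrack insertion. The interface name (flannel.1) and the delay values below are assumptions, not taken from the article or from this thread.

# Attach a prio qdisc and delay band 1 by 2ms +/- 1ms of random jitter.
tc qdisc add dev flannel.1 root handle 1: prio
tc qdisc add dev flannel.1 parent 1:1 handle 10: netem delay 2ms 1ms
# Steer packets destined for port 53 into the delayed band.
tc filter add dev flannel.1 parent 1:0 protocol ip prio 1 u32 \
    match ip dport 53 0xffff flowid 1:1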

@maxlaverse

maxlaverse commented Jun 2, 2018

I read a few issues about those DNS timeouts. As pointed out in weaveworks/weave#3287 (comment), the problem in that case might be with the DNAT rules that translate the Service IP into a Pod IP for nameserver resolution requests. It looks like there is the same delay as with SNAT, between the allocation of a conntrack record and its insertion into the table. This can lead to collisions and ultimately to packet drops.

Apparently, IPVS uses its own connection tracking module (http://kb.linuxvirtualserver.org/wiki/Performance_and_Tuning). If you have a network that supports routable PodIPs and switch kube-proxy to ipvs, it might help with your issue (if it's really about DNAT in conntrack).
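
As a generic example (not something tested in this thread), the proxier is selected with kube-proxy's proxy-mode flag:

# Use the IPVS proxier instead of the iptables one for ClusterIP load-balancing.
kube-proxy --proxy-mode=ipvs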

I just read that there are different issues with ipvs too: kubernetes/kubernetes#57841

@szuecs

szuecs commented Jul 9, 2018

@discordianfish did you deploy a successful fix?

@maxlaverse we also have problems with that and run CoreOS + flannel in 85 clusters. Monitoring shows regular spikes of DNS request timeouts.
We do not see many failures a day, but it's measurable.
Some conntrack inserts fail, which might explain the problems we measure:

 ~# conntrack -S
cpu=0           found=63 invalid=129 ignore=1314771 insert=0 insert_failed=11 drop=11 early_drop=0 error=8 search_restart=27531 
cpu=1           found=57 invalid=102 ignore=1303273 insert=0 insert_failed=11 drop=11 early_drop=0 error=2 search_restart=37546 
cpu=2           found=48 invalid=109 ignore=1276723 insert=0 insert_failed=6 drop=6 early_drop=0 error=5 search_restart=36513 
cpu=3           found=77 invalid=138 ignore=1272906 insert=0 insert_failed=16 drop=16 early_drop=0 error=5 search_restart=35515 
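
(For trending this over time, the per-CPU insert_failed counters can be summed with a one-liner like the following; this is just a convenience snippet, not part of the original report.)

# Sum insert_failed across all CPUs.
conntrack -S | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) { split($i, a, "="); sum += a[2] } } END { print sum }'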

We also increased the conntrack limit (node_nf_conntrack_entries_limit) and alert on node_nf_conntrack_entries; we do not see any issues there.

CoreOS release

cat /etc/lsb-release 
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1688.5.3
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1688.5.3 (Rhyolite)"

@maxlaverse

Hi @szuecs
Most of the DNS request timeout issues I read about occur between the Pods and the cluster DNS. This is likely caused by a race condition in the DNAT path, not the SNAT path (which this PR is about). In that case it's not the network plugin's fault but kube-proxy's, with its implementation of ClusterIP load-balancing.

@Quentin-M wrote an article about it that might interest you, with a workaround: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/

I hope it will help

@szuecs

szuecs commented Jul 10, 2018

@maxlaverse thanks for the response. I also think DNAT is the problem. I basically ported the code by @Quentin-M to flannel (we use 53 as the destination port, so 5353 was changed to 53, but everything else is pretty much the same).

I added this as a sidecar container to the flannel DaemonSet:

  - args:
    - -c
    - /tc-flannel.sh
    command:
    - /bin/bash
    image: registry.opensource.zalan.do/teapot/flannel-tc:7d93e04
    imagePullPolicy: IfNotPresent
    name: flannel-tc
    resources:
      requests:
        cpu: 25m
        memory: 25Mi
    securityContext:
      privileged: true
    stdin: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run
      name: run
    - mountPath: /lib/tc
      name: lib-tc

I published the code repository at https://github.com/szuecs/flannel-tc

I started the test at 6 PM on 7/9 and it looks better than before (so the race condition is less likely to happen). We have to wait before claiming success:
[monitoring graph]

@tomdee
Contributor

tomdee commented Oct 5, 2018

Thanks @discordianfish for the PR and @maxlaverse and @szuecs for the additional diags. Now that iptables v1.6.2 is in the latest Alpine release I'm going to take the fix from #1040, so closing this one.

@tomdee tomdee closed this Oct 5, 2018