Enable fully random masquerading ports #1001

Conversation

discordianfish

Description

There is a race condition in Linux which can lead to packets being
dropped when many connections are established to the same address.

See this for more detail: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

This commit introduces the proposed workaround of enabling fully
randomized masquerade ports (iptables --random-fully). Since this flag
requires iptables 1.6.2, the image has to use alpine:edge until the next
Alpine release.

See this for similar discussion in weave: weaveworks/weave#3287
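
For illustration, a masquerade rule with this flag might look like the sketch below; the pod subnet is a placeholder, not the exact rule flannel installs.

# Illustrative only: MASQUERADE with fully randomized source port mapping.
# Requires iptables >= 1.6.2; 10.244.0.0/16 stands in for the cluster's pod network.
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE --random-fully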

@discordianfish
Author

Honestly, I'm not sure this helps. I've just tested this on a cluster here and I don't see the conntrack insert failures going down at all.

@discordianfish
Author

[graphs: insert-failed, lookup-time]

In my case it got worse, so probably don't merge this. There is an underlying issue somewhere, though.

@maxlaverse

Hi @discordianfish
Could it be that you've exhausted the triplet local_ip:nameserver_ip:nameserver_port? How full is your conntrack table?
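
As a side note, one quick way to answer that (standard conntrack sysctls, not quoted from this thread):

# Compare the current number of tracked connections against the table's limit.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max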

@discordianfish
Author

@maxlaverse Oh Hi! :) No, the conntrack table isn't full. There are at least ~100k entries available on all my nodes:

bottomk(3, node_nf_conntrack_entries_limit - node_nf_conntrack_entries)
{instance="172.20.163.110:9100",job="kubernetes-service-endpoints",kubernetes_name="node-exporter",kubernetes_namespace="prometheus"} | 99950
{instance="172.20.160.36:9100",job="kubernetes-service-endpoints",kubernetes_name="node-exporter",kubernetes_namespace="prometheus"} | 100297
{instance="172.20.164.169:9100",job="kubernetes-service-endpoints",kubernetes_name="node-exporter",kubernetes_namespace="prometheus"} | 127662

@maxlaverse

As discussed together, I don't know what the issue could be here, nor why --random-fully makes the situation worse :/
But if this were going to be merged, I think I would make it configurable. There is also a very interesting hack in the link you posted, using tc to randomly slow some packets down (sketched below). Maybe worth a try.

Our plan is still to get rid of NAT completely.
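
A rough sketch of that tc idea, for illustration only: add a small random delay to DNS packets so that parallel lookups are less likely to race on conntrack insertion. The interface name (flannel.1) and the delay values below are assumptions, not taken from the article or from this thread.

# Attach a prio qdisc and delay band 1 by 2ms +/- 1ms of random jitter.
tc qdisc add dev flannel.1 root handle 1: prio
tc qdisc add dev flannel.1 parent 1:1 handle 10: netem delay 2ms 1ms
# Steer packets destined for port 53 into the delayed band.
tc filter add dev flannel.1 parent 1:0 protocol ip prio 1 u32 \
    match ip dport 53 0xffff flowid 1:1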

@maxlaverse

maxlaverse commented Jun 2, 2018

I read a few issues about those DNS timeouts. As pointed out in weaveworks/weave#3287 (comment), the problem in that case might be with the DNAT rules that translate the Service IP into a Pod IP for nameserver resolution requests. It looks like there is the same delay as with SNAT, between the allocation of a conntrack record and its insertion into the table. This can lead to collisions and ultimately to packet drops.

Apparently, IPVS uses its own connection tracking module (http://kb.linuxvirtualserver.org/wiki/Performance_and_Tuning). If you have a network that supports routable PodIPs and switch kube-proxy to ipvs, it might help with your issue (if it's really about DNAT in conntrack).
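
As a generic example (not something tested in this thread), the proxier is selected with kube-proxy's proxy-mode flag:

# Use the IPVS proxier instead of the iptables one for ClusterIP load-balancing.
kube-proxy --proxy-mode=ipvs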

I just read that there are different issues with ipvs too: kubernetes/kubernetes#57841

@szuecs

szuecs commented Jul 9, 2018

@discordianfish did you deploy a successful fix?

@maxlaverse we also have problems with that and run CoreOS + flannel in 85 clusters. Monitoring shows regular spikes of DNS request timeouts.
We do not see many failures a day, but it's measurable.
Some conntrack inserts fail, which might explain the problems we measure:

 ~# conntrack -S
cpu=0           found=63 invalid=129 ignore=1314771 insert=0 insert_failed=11 drop=11 early_drop=0 error=8 search_restart=27531 
cpu=1           found=57 invalid=102 ignore=1303273 insert=0 insert_failed=11 drop=11 early_drop=0 error=2 search_restart=37546 
cpu=2           found=48 invalid=109 ignore=1276723 insert=0 insert_failed=6 drop=6 early_drop=0 error=5 search_restart=36513 
cpu=3           found=77 invalid=138 ignore=1272906 insert=0 insert_failed=16 drop=16 early_drop=0 error=5 search_restart=35515 
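
(For trending this over time, the per-CPU insert_failed counters can be summed with a one-liner like the following; this is just a convenience snippet, not part of the original report.)

# Sum insert_failed across all CPUs.
conntrack -S | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) { split($i, a, "="); sum += a[2] } } END { print sum }'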

We also increased the conntrack limit (node_nf_conntrack_entries_limit) and alert on node_nf_conntrack_entries; we do not see any issues there.

CoreOS release

cat /etc/lsb-release 
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1688.5.3
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1688.5.3 (Rhyolite)"

@maxlaverse

Hi @szuecs
Most of the DNS request timeout issues I read about occur between the Pods and the cluster DNS. This is likely caused by a race condition in the DNAT path, not the SNAT path (which this PR is about). In that case it's not the network plugin's fault but kube-proxy's, with its implementation of ClusterIP load-balancing.

@Quentin-M wrote an article about it that might interest you, with a workaround: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/

I hope it will help

@szuecs

szuecs commented Jul 10, 2018

@maxlaverse thanks for the response. I also think DNAT is the problem. I basically ported the code by @Quentin-M to flannel (we use 53 as the destination port, so 5353 was changed to 53, but everything else is pretty much the same).

I added this as a sidecar container to the flannel DaemonSet:

  - args:
    - -c
    - /tc-flannel.sh
    command:
    - /bin/bash
    image: registry.opensource.zalan.do/teapot/flannel-tc:7d93e04
    imagePullPolicy: IfNotPresent
    name: flannel-tc
    resources:
      requests:
        cpu: 25m
        memory: 25Mi
    securityContext:
      privileged: true
    stdin: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run
      name: run
    - mountPath: /lib/tc
      name: lib-tc

I published the code repository at https://github.com/szuecs/flannel-tc

I started the test at 6 PM on 7/9 and it looks better than before (so the race condition is less likely to happen). We have to wait before claiming success:
[monitoring graph]

@tomdee
Contributor

tomdee commented Oct 5, 2018

Thanks @discordianfish for the PR and @maxlaverse and @szuecs for the additional diags. Now that iptables v1.6.2 is in the latest Alpine release I'm going to take the fix from #1040, so closing this one.

@tomdee tomdee closed this Oct 5, 2018