
dnsmasq: Maximum number of concurrent DNS queries reached (max: 150) #166

Closed
cookandy opened this issue Jan 7, 2016 · 30 comments


cookandy commented Jan 7, 2016

Hi,

For some reason, recently, I've started experiencing the following error:

dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

As a result, my lookups are failing for my marathon services. Do you have any idea what could be causing this, or if there's a way to look at the dnsmasq logs to determine where the queries are coming from?

Thanks!


sielaq commented Jan 7, 2016

You can raise the default (150) limit on concurrent queries:
-0 300 or --dns-forward-max=300
You can add it via:

echo 'DNSMASQ_PARAMS="--dns-forward-max=300"' >> restricted/host
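A minimal sketch of that step, using a throwaway temp file instead of the real restricted/host so nothing is modified by accident (the variable name and path come from the comment above; verify against your checkout):

```shell
# Append the dnsmasq override to the PanteraS env file. Using a scratch
# copy here; point envfile= at restricted/host to apply it for real.
envfile=$(mktemp)
echo 'DNSMASQ_PARAMS="--dns-forward-max=300"' >> "$envfile"

# Sanity-check that the parameter actually landed before regenerating.
grep -o 'dns-forward-max=[0-9]*' "$envfile"
```

After editing the real file, the compose file has to be regenerated and the container restarted for the new limit to take effect.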


sielaq commented Jan 7, 2016

You can also use Consul as your default DNS instead of dnsmasq:

export START_DNSMASQ=false
export CONSUL_PARAMS="-config-dir=/etc/consul.d/"
./generate_yml.sh


cookandy commented Jan 7, 2016

Ok, thanks @sielaq! I will try that change. The weird thing is, I'm not sure why I'm suddenly getting this error. I haven't added any new servers to the cluster, so I wouldn't expect any more DNS traffic than usual. Is there anything inside of PanteraS that could be flooding dnsmasq? Lastly, do you know if there's a way to look closer at the dnsmasq logs to see which host is causing the problem?


sielaq commented Jan 7, 2016

If this is caused by one of your services, then yes, it looks like your service is flooding dnsmasq.
The PanteraS infrastructure itself does not use much DNS.


sielaq commented Jan 7, 2016

Can you trace what exactly is flooding?


cookandy commented Jan 7, 2016

Strange because I don't actually have any services running and I still get the error.


cookandy commented Jan 7, 2016

Can I look at the dnsmasq logs to determine which server is the culprit? I've only got 2 slaves and 1 master and they're all configured to use the master as the DNS server... so I wouldn't expect it to be flooded...


sielaq commented Jan 7, 2016

apt-get install ngrep tcpdump
Check what's going on:
ngrep -l -q udp and port 53
or
tcpdump -vvv -s 0 -l -n port 53
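Once you have capture output, a small awk helper can summarize which source IPs are generating the queries. This is a sketch, not part of PanteraS; the sample lines below are fabricated to mimic `tcpdump -nn` output (pipe real output in instead):

```shell
# Count DNS packets per source IP from tcpdump-style lines.
# In practice:  tcpdump -nn -l port 53 | awk '...'
sample='12:00:01 IP 10.7.112.99.56479 > 10.7.9.11.53: A? foo.corp.
12:00:01 IP 10.7.112.99.56479 > 10.7.9.12.53: A? foo.corp.
12:00:02 IP 10.7.112.99.56480 > 10.7.9.11.53: A? bar.corp.'

printf '%s\n' "$sample" |
awk '/ > .*\.53:/ {
       # Field 3 is src "a.b.c.d.port"; keep just the IP part.
       split($3, a, ".");
       ip = a[1] "." a[2] "." a[3] "." a[4];
       n[ip]++
     }
     END { for (i in n) print n[i], i }' | sort -rn
```

The heaviest talker tops the list, which narrows down which host (or container) to inspect next.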


sielaq commented Jan 7, 2016

Yeah, you can log queries too with an additional dnsmasq option:
-q — you should see the output in the docker-compose logs.


cookandy commented Jan 7, 2016

Thanks! I ran the commands and originally saw a bunch of traffic (probably outbound) going to my corporate DNS servers. So I removed the corp DNS servers from /etc/resolv.conf (so my master node is the only nameserver listed) and re-ran the commands.

I never saw any traffic (using either of the commands), yet, I'm still experiencing the error. I have stopped and removed the PanteraS container, but I'm still seeing the issue.


cookandy commented Jan 7, 2016

I'm actually seeing the error immediately after startup:

panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | started, version 2.68 cachesize 150 dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | using nameserver 10.7.112.99#8600 for domain consul dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  mesos-master stderr | I0107 20:41:21.575283    66 contender.cpp:149] Joining the ZK group
panteras_1 | dnsmasq stderr | reading /etc/resolv.conf.orig dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | using nameserver 10.7.9.12#53 dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | using nameserver 10.7.9.11#53 dnsmasq stderr |
panteras_1 | dnsmasq: using nameserver 10.7.112.99#53
panteras_1 | dnsmasq: using nameserver 10.7.112.99#8600 for domain consul
panteras_1 | dnsmasq: read /etc/hosts - 9 addresses
panteras_1 | chronos stdout | [2016-01-07 20:41:22,283] INFO --------------------- (org.apache.mesos.chronos.scheduler.Main$:26)
panteras_1 | chronos stdout | [2016-01-07 20:41:22,287] INFO Initializing chronos. (org.apache.mesos.chronos.scheduler.Main$:27)
panteras_1 | chronos stdout | [2016-01-07 20:41:22,290] INFO --------------------- (org.apache.mesos.chronos.scheduler.Main$:28)
panteras_1 | marathon stdout | [2016-01-07 20:41:22,901] INFO Starting Marathon 0.13.0 with --master zk://C01NHVD652:2181/mesos --zk zk://C01NHVD652:2181/marathon --hostname c01nhvd652.nh.corp --http_address 0.0.0.0 --https_address 0.0.0.0 (mesosphere.marathon.Main$:main)
panteras_1 | consul stdout | ==> WARNING: BootstrapExpect Mode is specified as 1; this is the same as Bootstrap mode.
panteras_1 | ==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
panteras_1 | ==> Starting Consul agent...
panteras_1 | consul stdout | ==> Starting Consul agent RPC...
panteras_1 | ==> Joining cluster...
panteras_1 | dnsmasq stderr | dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

How can dnsmasq be saturated after it has only been running for less than 5 seconds?


sielaq commented Jan 7, 2016

"I have stopped and removed the PanteraS container, but I'm still seeing the issue."
I'm confused now: is it related to PanteraS or not?


cookandy commented Jan 7, 2016

Yes, I believe so. Right now I've only got a single server (master) running. When I start fresh with a new docker-compose up, I see the error during startup (as shown above). I don't understand how dnsmasq can reach its max connections immediately during startup.


sielaq commented Jan 7, 2016

Log into the PaaS container, e.g. docker exec -ti panteras_panteras_1 bash
(the name of the container might be different).
Run supervisorctl status multiple times and check whether the services stay running
or are respawning all the time (the PID would change); one of the services might have problems.

If all is stable, stop everything except dnsmasq (supervisorctl stop <service>) and check if you still have the problem.
I suspect that this is some external application.
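That stop-everything-but-dnsmasq bisection can be scripted. A sketch, with service names taken from the supervisorctl status listing later in this thread (adjust to your setup); the echo makes it a dry run so nothing is stopped as written:

```shell
# Services to stop, leaving dnsmasq running (names from the
# supervisorctl status output in this thread; adjust as needed).
services="chronos consul marathon mesos-master mesos-slave registrator zookeeper"

for s in $services; do
  # Dry run: remove 'echo' to actually stop each service.
  echo supervisorctl stop "$s"
done
```

After each stop, re-check the capture (ngrep/tcpdump above): the service whose shutdown makes the flood disappear is the culprit.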


cookandy commented Jan 7, 2016

Can you confirm the following requests are showing inbound to 10.7.112.99?

U 10.7.9.12:53 -> 10.7.112.99:19762
  v............MASTER-NODE.....
U 10.7.9.11:53 -> 10.7.112.99:19762
  .............MASTER-NODE.....
U 10.7.9.12:53 -> 10.7.112.99:19762
  .............MASTER-NODE.....
U 10.7.9.11:53 -> 10.7.112.99:19762

My master node is 10.7.112.99, and my corp DNS servers are 10.7.9.11 and 10.7.9.12. I only see these requests coming in when PanteraS is running, but it appears they're coming into port 19762.


sielaq commented Jan 7, 2016

This is rather the response: port 53 answering back to port 19762.
Something running on 10.7.112.99 is asking 10.7.9.11:53.


sielaq commented Jan 7, 2016

Ahh, 10.7.112.99 is your PanteraS. Did you try to stop mesos/marathon/zookeeper with supervisorctl as I asked?


cookandy commented Jan 7, 2016

Yes, 10.7.112.99 is my only PanteraS. I tried what you suggested with supervisord and I think the problem is with consul. First, the services don't appear to be restarting on their own. You can tell by the uptime:

chronos                          RUNNING   pid 23, uptime 0:25:02
consul                           RUNNING   pid 122, uptime 0:25:00
consul-template_haproxy          STOPPED   Not started
dnsmasq                          RUNNING   pid 15, uptime 0:25:02
marathon                         RUNNING   pid 19, uptime 0:25:02
mesos-master                     RUNNING   pid 18, uptime 0:25:02
mesos-slave                      STOPPED   Not started
registrator                      STOPPED   Not started
stdout                           RUNNING   pid 14, uptime 0:25:02
zookeeper                        RUNNING   pid 17, uptime 0:25:02

I stopped each of the services, one by one, until only dnsmasq was left running:

chronos                          STOPPED   Jan 07 09:25 PM
consul                           STOPPED   Jan 07 09:25 PM
consul-template_haproxy          STOPPED   Not started
dnsmasq                          RUNNING   pid 15, uptime 0:29:25
marathon                         STOPPED   Jan 07 09:25 PM
mesos-master                     STOPPED   Jan 07 09:26 PM
mesos-slave                      STOPPED   Not started
registrator                      STOPPED   Not started
stdout                           STOPPED   Jan 07 09:26 PM
zookeeper                        STOPPED   Jan 07 09:26 PM

At this point, I was still receiving requests on port 53 as shown in my previous comment. So I stopped dnsmasq, and then the requests stopped.

When I started dnsmasq again, I did not see the requests coming in. So I started each of the services one by one, and started noticing the flood immediately after starting consul. Could this have to do with the consul LISTEN_IP changes recently?


cookandy commented Jan 7, 2016

I stopped just consul and dnsmasq and the flooding stopped. I then started dnsmasq and didn't see the flooding. Only after starting consul do I see the flooding...


cookandy commented Jan 7, 2016

I went back to a build before the consul upgrade to 0.6 and the problem has gone away. So I'm thinking it's either something to do with the new version of consul, or the new LISTEN arguments in the docker-compose.


cookandy commented Jan 7, 2016

I reverted to commit eec2c7673ebd49aa6a240234725d07aa79328e3d (before consul was updated to 0.6), and everything works as expected. Are you experiencing this issue?


cookandy commented Jan 8, 2016

Ok, after hours of switching between commits, I finally got it working on the latest build. I haven't changed anything, so I'm not sure what has happened. I opened up another issue asking about proper DNS configuration. I'll close this one for now. Thanks for the help @sielaq!

@cookandy cookandy closed this as completed Jan 8, 2016

cookandy commented Jan 8, 2016

Ok, actually, I'm reopening this issue because I've gotten to the point where I can get my masters and slaves up and running without the DNS flood. However, the first docker container I spawn with Marathon causes the issue.

I start a docker image with marathon and it spawns a container ID of fe7c4453eb31. After that, I immediately start seeing the following DNS messages being flooded:

U 10.7.9.11:53 -> 10.7.112.99:56479
  .q...........fe7c4453eb31.....
U 10.7.112.99:56479 -> 10.7.9.12:53
  .q...........fe7c4453eb31.....
U 10.7.112.99:56479 -> 10.7.9.11:53
  .q...........fe7c4453eb31.....
U 10.7.9.12:53 -> 10.7.112.99:56479
  .q...........fe7c4453eb31.....
U 10.7.9.11:53 -> 10.7.112.99:56479
  .q...........fe7c4453eb31.....

As you can see, the query in the above snippet is referencing fe7c4453eb31. So it appears the docker container is trying to use DNS for some reason.

Even after I stop and remove the container, I still continue to see the DNS floods with the same reference to fe7c4453eb31. The only way I can get the flooding to stop is to stop all masters and slaves and restart.

Do you have any ideas of why this would be happening?

@cookandy cookandy reopened this Jan 8, 2016

sielaq commented Jan 8, 2016

Yeah, I also don't think it is a consul issue.

My 2 hypotheses:

  1. It looks like some loop to me.
    PanteraS asks your DNS:
    U 10.7.112.99:56479 -> 10.7.9.12:53
    which means it asks your DNS, instead of itself, where to look;
    then your DNS answers (probably via forward zones) that PanteraS should resolve it itself.
    The circle repeats and the flood starts.
    Can you show me env | grep DNS from the PanteraS container?
  2. It could be connected with the last change, but I cannot reproduce it.
    Can you check if the problem still exists:
    after ./generate_yml.sh, remove the stuff we added (--bind-interfaces --listen-address=x.x.x.x)
    from docker-compose.yml where DNSMASQ_APP_PARAMS is configured, and run fresh:
docker-compose stop
docker-compose rm --force
docker-compose up -d
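The flag-removal step in hypothesis 2 can be done with sed. A sketch working on a scratch copy (the DNSMASQ_APP_PARAMS line below is a shortened stand-in for the real one; GNU sed's in-place -e flags are assumed, and you would point compose= at the real docker-compose.yml):

```shell
# Strip the two suspect dnsmasq flags from a generated compose file.
# Scratch copy here; set compose= to your real docker-compose.yml to apply.
compose=$(mktemp)
cat > "$compose" <<'EOF'
    DNSMASQ_APP_PARAMS=-d -u dnsmasq -r /etc/resolv.conf.orig --bind-interfaces --listen-address=10.0.0.10
EOF

# Remove --bind-interfaces and any --listen-address=x.x.x.x argument.
sed -i -e 's/ --bind-interfaces//' -e 's/ --listen-address=[0-9.]*//' "$compose"
cat "$compose"
```

Then the stop / rm --force / up -d cycle above recreates the container with the trimmed parameters.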


cookandy commented Jan 8, 2016

I tend to agree with your theory about the DNS loop. Is there any way to tell my PanteraS hosts to only send requests to the corporate DNS servers if they are not service.consul lookups?

Here is the output of my env | grep DNS from each container:

master (IP address 10.0.0.10)

DNSMASQ_APP_PARAMS=-d  -u dnsmasq  -r /etc/resolv.conf.orig  -7 /etc/dnsmasq.d  --server=/consul/10.0.0.10#8600  --host-record=master-1.mydomain.corp,10.0.0.10  --bind-interfaces  --listen-address=0.0.0.0

slave 1 (IP address 10.0.0.11)

DNSMASQ_APP_PARAMS=-d  -u dnsmasq  -r /etc/resolv.conf.orig  -7 /etc/dnsmasq.d  --server=/consul/10.0.0.11#8600  --host-record=slave-1.mydomain.corp,10.0.0.11  --bind-interfaces  --listen-address=0.0.0.0  --address=/consul/10.0.0.11

slave 2 (IP address 10.0.0.12)

DNSMASQ_APP_PARAMS=-d  -u dnsmasq  -r /etc/resolv.conf.orig  -7 /etc/dnsmasq.d  --server=/consul/10.0.0.12#8600  --host-record=slave-2.mydomain.corp,10.0.0.12  --bind-interfaces  --listen-address=0.0.0.0  --address=/consul/10.0.0.12

You are correct. Removing the --bind-interfaces --listen-address=x.x.x.x from the generate_yml.sh hasn't seemed to make a difference. I verified that the parameters were gone by running env | grep DNS from within the running containers again.
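As for the split-DNS question above, dnsmasq's server=/domain/addr syntax already supports exactly that routing; a hedged config sketch of the intent, reusing the corporate resolver and master addresses quoted earlier in the thread as placeholders:

```conf
# Send *.consul lookups to the local Consul agent (its DNS port, 8600) ...
server=/consul/10.7.112.99#8600
# ... and everything else only to the corporate resolvers.
server=10.7.9.11
server=10.7.9.12
# Don't also pull upstream servers from /etc/resolv.conf.
no-resolv
```

This mirrors what the generated --server=/consul/...#8600 parameter does; the corporate servers would otherwise come from -r /etc/resolv.conf.orig.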


cookandy commented Jan 8, 2016

Actually, I retract my last comment. Removing the --bind-interfaces --listen-address=x.x.x.x from the generate_yml.sh has fixed the issue! I thought I saw flooding, but it was just our service starting up checking for updates, etc. With the --bind-interfaces --listen-address=x.x.x.x options our service never actually starts, but rather just gets stuck in the lookup loop (and continues to do so, even after the container has been removed/destroyed).


cookandy commented Jan 8, 2016

More specifically, it seems to be the --bind-interfaces argument causing the problem. I'm not doing anything fancy with regards to networking. I've only got a single NIC (eth0) on each of the masters and slaves, and only have the one additional interface for docker0.
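For reference, the dnsmasq man page documents --bind-dynamic as a middle ground between the default wildcard bind and --bind-interfaces; whether it avoids this particular loop is untested here, but it is the documented option for hosts where interfaces (like docker0 bridges) appear and disappear:

```conf
# Alternative to --bind-interfaces: bind addresses per interface, but
# track interfaces dynamically as they come and go (Linux only).
bind-dynamic
```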


sielaq commented Jan 8, 2016

I was able to reproduce it.
That was very unexpected behavior from dnsmasq. I have made a correction.

@sielaq sielaq added the bug label Jan 8, 2016
@sielaq sielaq closed this as completed in 9c73512 Jan 8, 2016
sielaq added a commit that referenced this issue Jan 8, 2016

cookandy commented Jan 9, 2016

Actually, it looks like --listen-address is also causing issues for my running container. When I start a container which calls out to lots of URLs (maybe > 100), I notice a considerable delay in starting the service. My service normally takes around 2 minutes to start, and with --listen-address active, it takes approximately 16 minutes.

The only change I have made is the --listen-address under DNSMASQ_APP_PARAMS in docker-compose.yml, so I know for sure it is the culprit.

Have you noticed the same? Should I open a new issue?

sielaq added a commit to sielaq/PanteraS that referenced this issue Jan 9, 2016

sielaq commented Jan 9, 2016

I will fix it with this issue.
