Short description
We noticed that when testing TCP with DNSPerf (and ResPerf), almost all queries fail with timeouts. We see results that we cannot explain, and we have not found any config change that remediates the issue.
I am hoping you will be able to shed some light on this issue and point me in the right direction.
Use case
Client located in US
DNSdist (but the same behavior is seen when using CoreDNS instead of dist) "near" the client in US.
PDNS Authoritative is the DNSDist backend and is located far away, with high latency (e.g. Singapore, Australia).
Increasing the client timeout reduces the failure percentage.
Results
With a 60-second, 100 QPS test against the DNSDist server, more than 90% of queries are lost and the test takes 100 seconds (40 seconds more than the 60 requested).
Running the exact same test from the same client directly against the distant PDNS server gets 100% success (no queries lost) and completes within the 60 seconds requested.
The percentage of lost queries decreases as the "act as n clients" argument is increased. When using resperf with -C 100 (probably 100 different source ports/5-tuples), zero queries are lost.
Environment
Software version: DNSPerf tools: 2.10.0, PDNS: 4.5.4, DNSDist: 1.6.1
Software source: PowerDNS repository
Steps to reproduce
1. Create a VPC in X region (in our case we used the us-east-1 region).
2. In it, start up a DNSDist server and a second server for testing (same hardware/VM, e.g. c5n.large in our case).
3. In a remote VPC/location, start up a PDNS Authoritative server (in our case we used the ap-southeast-2 region) with BIND files as its backend.
4. Configure DNSDist to forward all requests to the PDNS server IP.
5. From the test server, create a data file (input.txt) with records that are present in the PDNS BIND backend, install the DNSPerf tools, and run the following:
5.1. Single 5-tuple (most queries failed, and the test duration is around 100 seconds):
time resperf -d input.txt -s ${DNSDIST_IP_ADRESS} -r 0 -c 60 -R -m 100 -M tcp -C 1
Warning: received a response with an unexpected id: 493
Warning: received a response with an unexpected id: 494
Warning: received a response with an unexpected id: 495
Warning: received a response with an unexpected id: 496
Warning: received a response with an unexpected id: 497
Warning: received a response with an unexpected id: 498
Warning: received a response with an unexpected id: 499
Warning: received a response with an unexpected id: 500
Warning: received a response with an unexpected id: 501
Warning: received a response with an unexpected id: 502
Warning: received a response with an unexpected id: 503
...
Statistics:
Queries sent: 5999
Queries completed: 238
Queries lost: 5761
Response codes: NOERROR 230 (96.64%), NXDOMAIN 8 (3.36%)
Reconnection(s): 0
Run time (s): 100.000000
Maximum throughput: 100.000000 qps
Lost at that point: 0.00%
real 1m40.005s
user 0m59.683s
sys 0m40.319s
5.2. Multiple clients (no failures, test done in 60 seconds):
time resperf -d input.txt -s ${DNSDIST_IP_ADRESS} -r 0 -c 60 -R -m 100 -M tcp -C 100
Statistics:
Queries sent: 5999
Queries completed: 5999
Queries lost: 0
Response codes: NOERROR 5760 (96.02%), NXDOMAIN 239 (3.98%)
Reconnection(s): 0
Run time (s): 60.188028
Maximum throughput: 100.000000 qps
Lost at that point: 0.00%
real 1m0.193s
user 0m36.248s
sys 0m23.944s
5.3. Directly to the remote backend (no failures, test done in 60 seconds):
time resperf -d input.txt -s ${PDNS_IP_ADRESS} -r 0 -c 60 -R -m 100 -M tcp -C 1
Statistics:
Queries sent: 5999
Queries completed: 5999
Queries lost: 0
Response codes: NOERROR 5760 (96.02%), NXDOMAIN 239 (3.98%)
Reconnection(s): 0
Run time (s): 60.461496
Maximum throughput: 100.000000 qps
Lost at that point: 0.00%
real 1m0.466s
user 0m36.201s
sys 0m24.068s
6. Same data file using the dnsperf tool:
6.1. Single 5-tuple (most queries failed, 1222 sent and only 24 completed, 1.96%):
dnsperf -s ${DNSDIST_IP_ADRESS} -m tcp -d input.txt -c 1 -T 1 -l 60 -t 5 -q 100
...
Warning: received a response with an unexpected (maybe timed out) id: 213
[Timeout] Query timed out: msg id 805
Warning: received a response with an unexpected (maybe timed out) id: 214
[Timeout] Query timed out: msg id 806
Warning: received a response with an unexpected (maybe timed out) id: 215
[Timeout] Query timed out: msg id 807
Warning: received a response with an unexpected (maybe timed out) id: 216
[Timeout] Query timed out: msg id 808
Warning: received a response with an unexpected (maybe timed out) id: 217
[Timeout] Query timed out: msg id 809
Warning: received a response with an unexpected (maybe timed out) id: 218
[Timeout] Query timed out: msg id 810
Warning: received a response with an unexpected (maybe timed out) id: 219
[Timeout] Query timed out: msg id 811
Warning: received a response with an unexpected (maybe timed out) id: 220
[Timeout] Query timed out: msg id 812
Warning: received a response with an unexpected (maybe timed out) id: 221
[Timeout] Query timed out: msg id 813
Warning: received a response with an unexpected (maybe timed out) id: 222
[Timeout] Query timed out: msg id 814
...
Statistics:
Queries sent: 1222
Queries completed: 24 (1.96%)
Queries lost: 1198 (98.04%)
Response codes: NOERROR 24 (100.00%)
Average packet size: request 73, response 87
Run time (s): 64.751393
Queries per second: 0.370648
Average Latency (s): 2.659450 (min 0.395701, max 4.923507)
Latency StdDev (s): 1.392166
Connection Statistics:
Reconnections: 0
Average Latency (s): 0.000429 (min 0.000429, max 0.000429)
real 1m4.922s
user 0m2.258s
sys 0m2.704s
6.2. Multiple clients (no failures, test done in 60 seconds):
dnsperf -s ${DNSDIST_IP_ADRESS} -m tcp -d input.txt -c 100 -T 10 -l 60 -t 5 -q 100
Statistics:
Queries sent: 2582120
Queries completed: 2582120 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 2517567 (97.50%), NXDOMAIN 64553 (2.50%)
Average packet size: request 73, response 90
Run time (s): 60.093897
Queries per second: 42968.090420
Average Latency (s): 0.002280 (min 0.000335, max 0.595486)
Latency StdDev (s): 0.015051
Connection Statistics:
Reconnections: 0
Average Latency (s): 0.012769 (min 0.000390, max 0.025119)
Latency StdDev (s): 0.006465
real 1m0.121s
user 0m26.466s
sys 1m16.860s
6.3. Directly to the remote backend (no failures, test done in 60 seconds):
dnsperf -s ${PDNS_IP_ADRESS} -m tcp -d input.txt -c 1 -T 1 -l 60 -t 5 -q 100
Statistics:
Queries sent: 25665
Queries completed: 25665 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 25024 (97.50%), NXDOMAIN 641 (2.50%)
Average packet size: request 73, response 90
Run time (s): 60.460280
Queries per second: 424.493568
Average Latency (s): 0.233273 (min 0.196206, max 0.623432)
Latency StdDev (s): 0.073038
Connection Statistics:
Reconnections: 0
Average Latency (s): 0.196184 (min 0.196184, max 0.196184)
real 1m0.466s
user 0m0.264s
sys 0m0.296s
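6.4. Increasing the client timeout (no queries lost, but average latency is around 17 seconds):
dnsperf -s ${DNSDIST_IP_ADRESS} -m tcp -d input.txt -c 1 -T 1 -l 60 -t 30 -q 100
Statistics: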
Queries sent: 411
Queries completed: 411 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 401 (97.57%), NXDOMAIN 10 (2.43%)
Average packet size: request 73, response 89
Run time (s): 79.455371
Queries per second: 5.172715
Average Latency (s): 16.993428 (min 0.394257, max 19.614739)
Latency StdDev (s): 4.818326
Connection Statistics:
Reconnections: 0
Average Latency (s): 0.000405 (min 0.000405, max 0.000405)
real 1m19.461s
user 0m9.069s
sys 0m10.413s
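To reproduce the single 5-tuple case without dnsperf or resperf, a small client that pipelines several queries over one TCP connection may help. The following is only a sketch: `build_query`, `frame`, `read_exact` and `pipeline` are hypothetical helper names, the queries are plain A/IN questions, and the server address in the usage note is a placeholder. DNS over TCP prefixes each message with a 2-byte length, which is what `frame` does.

```python
import socket
import struct
import time


def build_query(name: str, qid: int, qtype: int = 1) -> bytes:
    """Build a minimal DNS query: one question, class IN, no flags set.

    Assumes a non-root ASCII name such as "example.com".
    """
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def frame(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message


def read_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes the connection."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed by peer")
        buf += chunk
    return buf


def pipeline(server_ip: str, names, port: int = 53, timeout_s: float = 5.0):
    """Send all queries back-to-back on ONE TCP connection.

    Returns a list of (response id, seconds since the queries were sent).
    A socket timeout on any read raises an exception instead of returning.
    """
    with socket.create_connection((server_ip, port), timeout=timeout_s) as sock:
        payload = b"".join(frame(build_query(n, i)) for i, n in enumerate(names))
        start = time.monotonic()
        sock.sendall(payload)
        results = []
        for _ in names:
            (length,) = struct.unpack("!H", read_exact(sock, 2))
            response = read_exact(sock, length)
            (response_id,) = struct.unpack("!H", response[:2])
            results.append((response_id, time.monotonic() - start))
        return results
```

For example, `pipeline("192.0.2.1", ["example.com"] * 20)` (with the placeholder address replaced by the DNSDist or PDNS IP) returns the arrival time of each response. If the arrival times through DNSDist are spaced roughly one backend round trip apart, while the same call directly against PDNS returns them almost together, that would point at per-connection serialization rather than packet loss.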
Expected behaviour
Since there are no failures when sending the queries directly to the remote PDNS or when using multiple 5-tuples, I expect this to work the same way when passing through an intermediary such as DNSDist.
Actual behaviour
It looks like some congestion delays responses, and the client times out when trying to reuse the TCP socket.
During the test there are no logs (system or application) on the DNSDist, client, or PDNS side.
No packet errors were observed on the three servers (checking netstat, ifconfig).
No load was observed on the three servers (memory, IO, CPU).
The only DNSDist metrics that are active during the test are: dnsdist_frontend_tcpdiedreadingquery, dnsdist_server_tcpreusedconnections, dnsdist_frontend_tcpavgqueriesperconnection and dnsdist_frontend_queries|dnsdist_frontend_responses. There are no errors or timeouts (e.g. tcpConnectTimeouts, tcpReadTimeouts, tcpWriteTimeouts, tcpGaveUp, tcpTooManyConcurrentConnections and dnsdist_server_drops are all unchanged during the test).
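As a rough sanity check on the numbers above, here is a small Python sketch of one possible explanation. It assumes, without any confirmation from logs or metrics, that queries arriving on a single client TCP connection are forwarded one at a time, so that each one costs about one backend round trip. The function name `serial_model` and its parameters are made up for illustration; the inputs come from the statistics above (0.395701 s minimum latency in 6.1, 0.196184 s connection latency to the backend in 6.3, and `-t 5` and `-q 100` from the dnsperf commands).

```python
def serial_model(first_latency_s, per_query_s, timeout_s, max_outstanding):
    """Model one TCP connection whose queries are served strictly in sequence.

    This is an assumed model, not confirmed DNSDist behaviour.
    """
    answered = 0
    if first_latency_s <= timeout_s:
        # The first answer arrives after first_latency_s, every later one
        # per_query_s after the previous answer.
        answered = 1 + int((timeout_s - first_latency_s) // per_query_s)
    return {
        # Upper bound on throughput when only one query is in flight.
        "max_qps": 1.0 / per_query_s,
        # Queries answered before the first client timeout expires.
        "answered_before_first_timeout": answered,
        # Waiting time of a query queued behind a full window of queries.
        "latency_with_full_queue_s": max_outstanding * per_query_s,
    }


# Inputs taken from the dnsperf runs in this report.
model = serial_model(
    first_latency_s=0.395701,   # min latency through DNSDist (6.1)
    per_query_s=0.196184,       # connection latency to the backend (6.3)
    timeout_s=5.0,              # dnsperf -t 5
    max_outstanding=100,        # dnsperf -q 100
)
for key, value in model.items():
    print(key, value)
```

Under that assumption the model gives about 5.1 qps, 24 queries answered before the 5-second timeout, and about 19.6 s of latency with 100 queries outstanding. That is close to the 24 completed queries in 6.1 and to the 5.17 qps and 19.61 s maximum latency in the run with the increased client timeout, but it is only a model and may not be the real cause.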
Other information
Below are some of the options we tried changing to help narrow this down. None of them had any effect on the results mentioned above (and during the test, there are no OS logs or application logs at all).
PDNS settings that may be related
max-tcp-connections=1024
max-tcp-connections-per-client=0 # also tried with higher values
max-tcp-transactions-per-conn=0 # also tried with higher values
tcp-idle-timeout=10 # also tried with higher values
tcp-fast-open=1
reuseport=yes
# also tested multiple values for:
distributor-threads=n
receiver-threads=n
retrieval-threads=n
max-queue-length=n
queue-limit=n
DNSDist settings that may be related
newServer({
...
maxInFlight=10240, -- also tried with higher values
maxConcurrentTCPConnections=10000, -- also tried with higher values
tcpFastOpen=true,
tcpConnectTimeout=100,
tcpSendTimeout=100,
tcpRecvTimeout=100
})
-- Opening multiple sockets on startup
addLocal("3.104.42.175:53", {reusePort=true,tcpFastOpenQueueSize=10000, tcpListenQueueSize=10000, maxInFlight=10000,maxConcurrentTCPConnections=10000})
addLocal("3.104.42.175:53", {reusePort=true,tcpFastOpenQueueSize=10000, tcpListenQueueSize=10000, maxInFlight=10000,maxConcurrentTCPConnections=10000})
addLocal("3.104.42.175:53", {reusePort=true,tcpFastOpenQueueSize=10000, tcpListenQueueSize=10000, maxInFlight=10000,maxConcurrentTCPConnections=10000})
OS level settings that may be related
DNSDist config
PDNS config
/var/powerdns/bind/named.conf
/var/powerdns/bind/example.com
input.txt (data file for dnsperf/resperf)