Conversation

@bradh352 bradh352 commented Jun 21, 2024

With very little effort we should be able to derive reasonably accurate timeouts based on prior query history. We track this history so we can automatically adapt when network conditions change (e.g. a provider failover alters round-trip timings). Apple appears to do something similar in the macOS system resolver. We should still have a minimum, maximum, and initial value to make sure the algorithm can't go off the rails.

Values:

  • Minimum Timeout: 250ms (approximate RTT half-way around the globe)
  • Maximum Timeout: 5000ms (recommended timeout in RFC 1123). This can be reduced via ARES_OPT_MAXTIMEOUTMS, in which case the bound specified by that option caps the retry timeout instead.
  • Initial Timeout: User-specified via configuration or ARES_OPT_TIMEOUTMS
  • Average latency multiplier: 5x (a local DNS server returning a cached value will be quicker than if it needs to recurse so we need to account for this)
  • Minimum Count for Average: 3. This is the minimum number of queries we need to form an average for the bucket.

Latency is tracked over time in per-server buckets (these are ephemeral, meaning they don't persist once a channel is destroyed). For each bucket we record both the current timespan and the immediately preceding timespan, so that after a roll-over we still have recent metrics available for calculations:

  • 1 minute
  • 15 minutes
  • 1 hour
  • 1 day
  • since inception

Each bucket contains:

  • timestamp (divided by interval)
  • minimum latency
  • maximum latency
  • total time
  • count

NOTE: average latency is (total time / count), we will calculate this dynamically when needed

Basic algorithm for calculating timeout to use would be:

  • Scan from most recent bucket to least recent
  • Check the timestamp of the bucket; if it doesn't match the current time, continue to the next bucket
  • Check the count of the bucket; if it is not at least the "Minimum Count for Average", check the bucket's previous timespan, and if that also lacks enough samples, continue to the next bucket
  • If we reached the end with no bucket match, use "Initial Timeout"
  • If bucket is selected, take ("total time" / count) as Average latency, multiply by "Average Latency Multiplier", bound by "Minimum Timeout" and "Maximum Timeout"

NOTE: The timeout calculated may not be the timeout used. If we are retrying
the query on the same server another time, then it will use a larger value

On each query reply where the response is legitimate (proper response or NXDOMAIN) and not something like a server error:

  • Cycle through each bucket in order
  • Check the timestamp of the bucket against the current timestamp; if out of date, copy the current values over the previous entry, then clear the current values
  • Compare current minimum and maximum recorded latency against query time and adjust if necessary
  • Increment "count" by 1 and "total time" by the query time

Other Notes:

  • This is always-on; the only user-configurable value is the initial timeout, which simply re-uses the existing option.
  • Minimum and Maximum latencies for a bucket are currently unused but are there in case we find a need for them in the future.

Fixes Issue: #736
Fix By: Brad House (@bradh352)

@bradh352 bradh352 merged commit a488525 into c-ares:main Jun 22, 2024
@bradh352 bradh352 deleted the autotimeout branch June 23, 2024 13:49
ClifHouck added a commit to ClifHouck/envoy that referenced this pull request Mar 3, 2025
In version 1.31.0 c-ares enabled its query cache feature by default.
See c-ares/c-ares#786 .
This subverted test expectations that each query would be run in the
same way. This commit turns it off again in the c-ares dns impl by
default.  Adds a test to check that query cache is indeed off.

Adds a bit of additional safety around handling TTLs in c-ares dns impl and
test code.

In PendingTimerEnable test a resolver address is now passed, otherwise
c-ares returns an error immediately. The number of expected timeouts is
also different because starting in version 1.32.0 c-ares now manages its
own timeout expectations past the first timeout.
See c-ares/c-ares#794

Signed-off-by: Clif Houck <me@clifhouck.com>