Automatic query timeout adjustment based on server history #794
Merged
ClifHouck added commits to ClifHouck/envoy that referenced this pull request on Mar 3, Mar 10, and Mar 14, 2025, each with the following commit message:

In version 1.31.0 c-ares enabled its query cache feature by default (see c-ares/c-ares#786). This subverted test expectations that each query would be run in the same way. This commit turns the cache off again by default in the c-ares DNS impl, adds a test to check that the query cache is indeed off, and adds a bit of additional safety around handling TTLs in the c-ares DNS impl and test code. In the PendingTimerEnable test a resolver address is now passed, otherwise c-ares returns an error immediately. The number of expected timeouts is also different because, starting in version 1.32.0, c-ares manages its own timeout expectations past the first timeout. See c-ares/c-ares#794

Signed-off-by: Clif Houck <me@clifhouck.com>
With very little effort we should be able to determine reasonably appropriate timeouts based on prior query history. We track this history over time so that the timeout can auto-scale when network conditions change (e.g. a provider failover occurs and timings change as a result). Apple appears to do this within their system resolver in macOS. Obviously we should have a minimum, maximum, and initial value to make sure the algorithm doesn't somehow go off the rails.
Values:
Per-server buckets for tracking latency over time (these are ephemeral, meaning they don't persist once a channel is destroyed). We record both the current timespan for the bucket and the immediately preceding timespan, so that in the case of a roll-over we can still maintain recent metrics for calculations:
Each bucket contains:
NOTE: average latency is (total time / count); we will calculate this dynamically when needed.
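As a rough sketch of this layout, assuming hypothetical names and an illustrative number of tracked intervals (nothing below is the actual c-ares data structure):

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* One latency bucket; the average is derived from total/count when needed. */
typedef struct {
  time_t   ts;       /* query time divided by this bucket's interval length */
  uint64_t total_ms; /* sum of observed latencies */
  size_t   count;    /* number of samples */
} latency_bucket_t;

/* Each tracked interval keeps the current bucket plus the immediately
 * preceding one, so a roll-over doesn't discard all recent data. */
typedef struct {
  latency_bucket_t current;
  latency_bucket_t previous;
  time_t           interval_secs; /* length of this bucket's timespan */
} latency_window_t;

#define NUM_WINDOWS 4 /* illustrative; the exact set of intervals is not specified here */

typedef struct {
  latency_window_t windows[NUM_WINDOWS]; /* per-server, destroyed with the channel */
} server_latency_metrics_t;

/* Average latency is computed on demand rather than stored. */
static unsigned int bucket_avg_ms(const latency_bucket_t *b)
{
  return b->count ? (unsigned int)(b->total_ms / b->count) : 0;
}
```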
Basic algorithm for calculating timeout to use would be:
NOTE: The timeout calculated may not be the timeout used. If we are retrying the query on the same server, a larger value will be used.
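A minimal sketch of that calculation, reusing the hypothetical bucket layout above; the sample threshold, latency multiplier, and doubling per retry are illustrative assumptions rather than the values the implementation uses:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bucket layout matching the earlier sketch. */
typedef struct {
  uint64_t total_ms; /* sum of observed latencies */
  size_t   count;    /* number of samples */
} latency_bucket_t;

static unsigned int clamp_ms(unsigned int v, unsigned int lo, unsigned int hi)
{
  if (v < lo)
    return lo;
  if (v > hi)
    return hi;
  return v;
}

/* buckets[0] is the most recent interval, buckets[nbuckets-1] the least recent. */
static unsigned int calc_query_timeout_ms(const latency_bucket_t *buckets,
                                          size_t nbuckets, size_t try_count,
                                          unsigned int initial_ms,
                                          unsigned int min_ms,
                                          unsigned int max_ms)
{
  unsigned int timeout = initial_ms; /* no usable history: fall back to the initial value */
  size_t       i;

  /* Scan from the most recent bucket to the least recent and use the first
   * one with enough samples to be meaningful (threshold is assumed). */
  for (i = 0; i < nbuckets; i++) {
    if (buckets[i].count >= 3) {
      unsigned int avg_ms = (unsigned int)(buckets[i].total_ms / buckets[i].count);
      timeout = avg_ms * 4; /* illustrative multiplier over the average latency */
      break;
    }
  }

  /* Bound by the configured minimum and maximum. */
  timeout = clamp_ms(timeout, min_ms, max_ms);

  /* Retries against the same server use a larger value than the first attempt. */
  for (; try_count > 0; try_count--)
    timeout = clamp_ms(timeout * 2, min_ms, max_ms);

  return timeout;
}
```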
On each query reply where the response is legitimate (proper response or NXDOMAIN) and not something like a server error:
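A sketch of that bookkeeping for a single tracked interval, again with the hypothetical layout from above; a real caller would cycle through every interval for the responding server:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical bucket layout matching the earlier sketch. */
typedef struct {
  time_t   ts;       /* query time divided by the interval length */
  uint64_t total_ms; /* sum of observed latencies (average derived on demand) */
  size_t   count;    /* number of samples */
} latency_bucket_t;

typedef struct {
  latency_bucket_t current;
  latency_bucket_t previous;
  time_t           interval_secs; /* length of this bucket's timespan */
} latency_window_t;

/* Record one legitimate reply into a single tracked interval for the server
 * that answered; server errors never reach this point. */
static void record_reply_latency(latency_window_t *win, time_t now,
                                 unsigned int latency_ms)
{
  time_t ts = now / win->interval_secs;

  /* The timespan rolled over: keep the old bucket as "previous" so recent
   * metrics survive the boundary, then start a fresh current bucket. */
  if (win->current.ts != ts) {
    win->previous = win->current;
    memset(&win->current, 0, sizeof(win->current));
    win->current.ts = ts;
  }

  win->current.count++;
  win->current.total_ms += latency_ms;
}
```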
Other Notes:
Fixes Issue: #736
Fix By: Brad House (@bradh352)