Add observability into DNS server health via a server state callback, invoked whenever a query finishes #744
bradh352 merged 6 commits into c-ares:main from oliverwelsh:oliverwelsh/add-server-observability
…server finishes. The callback is invoked with the server details (as a string), a boolean indicating whether the query succeeded or failed, and custom userdata.
On a different topic, it looks like the ServerFailoverOpts-based tests are occasionally failing on Windows, likely because of the timeout chosen alongside the dead sleep in there and the system simply behaving differently. Can you evaluate that a little? I'm guessing it's likely to hit other systems too if they're overloaded.
I think I'm fine with this in general. We should probably also provide some method to do a deeper dive into server statistics and whatnot, but if this solves your needs then it's ok. The only change I'd like to see is a flags argument that can provide some details. Right now the only flag would indicate whether the failure (or success) was via TCP or via UDP. In the future we may have more flags.
Ah yes they are failing on this PR too. I will prioritise a fix for those tests today and raise as a separate PR. I suspect it will be a timing issue since we only have 50ms leeway on the final testcase, but will check the logs to convince myself.
Thank you for being accepting. I'll work on implementing the extra flags argument. It looks like every call to
Looking at the pipeline logs, it actually looks like the second testcase is the one failing: server 0 is not retried, and server 1 is used again instead. Given the intermittent nature, I suspect this is due to inaccurate timing, possibly something like NTP slew causing the 100ms sleep to not advance system time by the full 100ms. I will update the timings to sleep for a little more than the retry delay (like 110ms), and see if that improves reliability.
Apologies for the delay here. I have made the markups to add the I have also added regression tests to cover success and failure scenarios for the server state callback. These check that:
Please let me know if you have any feedback on the implementation of this.
I think this looks good to me, other than the missing manpages of course.
Thank you for being accepting of this change. I have added the manpage to this PR now (I also updated the description to be ready for merging). Please let me know if I can improve the documentation or if you'd like the manpage to be formatted differently; I'm not very used to writing them!
Summary
This PR adds a server state callback that is invoked whenever a query to a DNS server finishes.
The callback is invoked with the server details (as a string), a boolean indicating whether the query succeeded or failed, flags describing the query (currently just indicating whether TCP or UDP was used), and custom userdata.
This can be used by user applications to gain observability into DNS server health and usage. For example, alerts when a DNS server fails/recovers or metrics to track how often a DNS server is used and responds successfully.
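As a rough illustration of the alerting and metrics use case, here is a minimal, self-contained sketch of what an application-side callback might look like. It only mirrors the argument shape described above (server string, success boolean, flags, userdata). Every name in it (`demo_on_server_state`, `demo_tracker`, `DEMO_STATE_UDP`, `DEMO_STATE_TCP`) is a hypothetical stand-in and not the library's API; the real typedef, flag names, and registration function are documented in the manpage added by this PR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the transport flags; NOT the library's names. */
#define DEMO_STATE_UDP (1 << 0)
#define DEMO_STATE_TCP (1 << 1)

#define DEMO_MAX_SERVERS 8

/* Per-server counters kept by the application. */
struct demo_server_stats {
  char     name[64];    /* server details string as passed to the callback */
  unsigned queries;     /* total finished queries seen for this server */
  unsigned failures;    /* finished queries reported as failed */
  unsigned tcp_queries; /* finished queries that used TCP */
  unsigned transitions; /* number of ok<->failed state changes */
  int      last_ok;     /* -1 = no result yet, 0 = failed, 1 = succeeded */
};

/* Application state handed to the callback as userdata. */
struct demo_tracker {
  struct demo_server_stats servers[DEMO_MAX_SERVERS];
  size_t                   count;
};

/* Look up a server by its string, adding a zeroed entry on first sight.
 * Returns NULL once the fixed-size table is full. */
static struct demo_server_stats *demo_find_or_add(struct demo_tracker *t,
                                                  const char          *server)
{
  struct demo_server_stats *s;
  size_t                    i;

  for (i = 0; i < t->count; i++) {
    if (strcmp(t->servers[i].name, server) == 0) {
      return &t->servers[i];
    }
  }
  if (t->count == DEMO_MAX_SERVERS) {
    return NULL;
  }
  s = &t->servers[t->count++];
  memset(s, 0, sizeof(*s));
  snprintf(s->name, sizeof(s->name), "%s", server);
  s->last_ok = -1;
  return s;
}

/* Callback with the argument shape described in this PR (assumed signature). */
static void demo_on_server_state(const char *server, int success, int flags,
                                 void *userdata)
{
  struct demo_tracker      *t  = userdata;
  struct demo_server_stats *s;
  int                       ok = success ? 1 : 0;

  assert(t != NULL && server != NULL);
  s = demo_find_or_add(t, server);
  if (s == NULL) {
    return; /* table full: drop the sample rather than overflow */
  }

  /* Metrics: usage, failures, and transport mix. */
  s->queries++;
  if (!ok) {
    s->failures++;
  }
  if (flags & DEMO_STATE_TCP) {
    s->tcp_queries++;
  }

  /* Alerting: report only when the state changes, not on every query. */
  if (s->last_ok != -1 && s->last_ok != ok) {
    s->transitions++;
    fprintf(stderr, "DNS server %s %s\n", s->name,
            ok ? "recovered" : "failed");
  }
  s->last_ok = ok;
}
```

An application would zero a `demo_tracker`, pass its address as the userdata when registering the callback, and read the counters from its own metrics code. Reporting only on state changes keeps the callback cheap, which matters because it runs once per finished query.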
Testing
Three new regression tests (MockChannelTest.ServStateCallback*) have been added to test the new callback in different success/failure scenarios.