
Add observability into DNS server health via a server state callback, invoked whenever a query finishes#744

Merged
bradh352 merged 6 commits into c-ares:main from oliverwelsh:oliverwelsh/add-server-observability
Apr 28, 2024

Conversation

@oliverwelsh
Contributor

@oliverwelsh oliverwelsh commented Apr 15, 2024

Summary

This PR adds a server state callback that is invoked whenever a query to a DNS server finishes.

The callback is invoked with the server details (as a string), a boolean indicating whether the query succeeded or failed, flags describing the query (currently just indicating whether TCP or UDP was used), and custom userdata.

This can be used by user applications to gain observability into DNS server health and usage. For example, alerts when a DNS server fails/recovers or metrics to track how often a DNS server is used and responds successfully.

Testing

Three new regression tests, MockChannelTest.ServStateCallback*, have been added to exercise the new callback in different success/failure scenarios.

…server

finishes. The callback is invoked with the server details (as a string),
a boolean indicating whether the query succeeded or failed, and custom userdata.
@bradh352
Member

On a different topic, it looks like the ServerFailoverOpts-based tests are failing on Windows occasionally. Likely due to the timeout chosen with a dead sleep in there and the system just behaving differently. Can you evaluate that a little? It's likely to hit other systems if they're overloaded, I'm guessing.

@bradh352
Member

I think I'm fine with this in general. We should probably also provide some method to do a deeper dive into server statistics and whatnot, but if this solves your needs then it's OK. The only change I'd like to see is a flags argument that can provide some details. Right now the only flag would indicate whether the failure (or success) was via TCP or via UDP. In the future we may have more flags.

@oliverwelsh
Contributor Author

> On a different topic, it looks like the ServerFailoverOpts-based tests are failing on Windows occasionally. Likely due to the timeout chosen with a dead sleep in there and the system just behaving differently. Can you evaluate that a little? It's likely to hit other systems if they're overloaded, I'm guessing.

Ah yes, they are failing on this PR too. I will prioritise a fix for those tests today and raise it as a separate PR. I suspect it will be a timing issue, since we only have 50ms of leeway on the final testcase, but I will check the logs to convince myself.

> I think I'm fine with this in general. We should probably also provide some method to do a deeper dive into server statistics and whatnot, but if this solves your needs then it's OK. The only change I'd like to see is a flags argument that can provide some details. Right now the only flag would indicate whether the failure (or success) was via TCP or via UDP. In the future we may have more flags.

Thank you for being receptive to this. I'll work on implementing the extra flags argument. It looks like every call to server_increment_failures or server_set_good already has context of whether the query used TCP or UDP, so I will plumb that through too.

@oliverwelsh
Contributor Author

oliverwelsh commented Apr 17, 2024

> On a different topic, it looks like the ServerFailoverOpts-based tests are failing on Windows occasionally. Likely due to the timeout chosen with a dead sleep in there and the system just behaving differently. Can you evaluate that a little? It's likely to hit other systems if they're overloaded, I'm guessing.

Looking at the pipeline logs, it actually looks like the second testcase is failing:

  // 2. Failed servers should be retried after the retry delay.
  //
  // Fail server #0 but leave server #1 as healthy.

  // Sleep for the retry delay and send in another query. Server #0 should be
  // retried.

Server 0 is not retried, instead server 1 is used again.

Given the intermittent nature I suspect this will be due to inaccurate timing. Possibly something like NTP slew, causing the 100ms sleep to not increment system time by the full 100ms.

I will update the timings to sleep for a little more than the retry delay (like 110ms), and see if that improves reliability.

@oliverwelsh
Contributor Author

> I think I'm fine with this in general. We should probably also provide some method to do a deeper dive into server statistics and whatnot, but if this solves your needs then it's OK. The only change I'd like to see is a flags argument that can provide some details. Right now the only flag would indicate whether the failure (or success) was via TCP or via UDP. In the future we may have more flags.

Apologies for the delay here. I have made the changes to add the flags argument and to indicate whether completed queries used TCP or UDP.

I have also added regression tests to cover success and failure scenarios for the server state callback. These check that:

  • the number of success/failure calls to the callback is correct
  • the server string argument matches the output from ares_get_servers_csv()
  • the flags argument has either the UDP flag or the TCP flag enabled, but not both (note I am not checking which one is enabled)

Please let me know if you have any feedback on the implementation of this.

@bradh352
Member

I think this looks good to me, other than the missing manpages of course.

@oliverwelsh
Contributor Author

> I think this looks good to me, other than the missing manpages of course.

Thank you for accepting this change. I have added the manpage to this PR now (I also updated the description to be ready for merging). Please let me know if I can improve the documentation or if you'd like the manpage formatted differently; I'm not very used to writing them!

@bradh352 bradh352 merged commit 89a8856 into c-ares:main Apr 28, 2024