Support jitter and maxtimeout for retry penalties #606

Merged
merged 5 commits into c-ares:main on Nov 13, 2023

Conversation

Kontakter
Contributor

We use c-ares for DNS resolution in the YTsaurus project: https://github.com/YTsaurus/YTsaurus

In our installations we face a problem with synchronized requests to the DNS server from hundreds of thousands of processes, and I want to add jitter to the retry backoffs to smooth these requests out over time.

@bradh352
Member

A few things here:

  1. on FreeBSD you'll need to include stdint.h to get SIZE_MAX (just make sure you have it in #ifdef HAVE_STDINT_H guards)
  2. The OptionsChannelInit test is failing; it looks like you used the wrong variable name in your if() statement in ares_save_options(). It should use channel->optmask.
  3. Need to update docs/ares_init_options.3 for the new params
  4. Can you provide more insight into the algorithm used here and how it helps? I'm not at all familiar with jitter-type algorithms, and the code doesn't really discuss anything.

@Kontakter
Contributor Author

Thank you for the fast reply!

I have fixed 1, 2 and 3.

Regarding 4 – the new code just adds some noise to the retry backoff, and the value of the jitter controls the amplitude of that noise. The code may look tricky because I am trying to avoid overflowing timeplus after adding the noise.
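A minimal sketch of the overflow guard being described, assuming a helper along these lines; the function name, the jitter_ms parameter, and the use of rand() as the entropy source are illustrative, not the actual patch:

```c
#include <stdint.h>   /* SIZE_MAX (the real tree guards this with HAVE_STDINT_H) */
#include <stdlib.h>   /* rand(): stand-in for whatever RNG the library really uses */

/* Hypothetical helper: add bounded random noise ("jitter") to a retry
 * timeout while making sure the size_t accumulator cannot overflow. */
static size_t add_jitter(size_t timeplus, size_t jitter_ms)
{
  size_t noise;

  if (jitter_ms == 0)
    return timeplus;

  noise = (size_t)rand() % jitter_ms;

  /* clamp instead of wrapping around if timeplus + noise would overflow */
  if (timeplus > SIZE_MAX - noise)
    return SIZE_MAX;

  return timeplus + noise;
}
```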

@bradh352
Member

bradh352 commented Nov 11, 2023

What's the purpose of the backoff "noise"? I'm just trying to understand the issue it's resolving.

The current code just adjusts the timeouts for how long it waits for a response once the first round across all servers fails with timeouts. Since c-ares didn't get any responses from any server within the configured timeout interval, we have to assume the configured timeout values may be too low for the system, so we should try a round with a longer wait time.

The current code appears to try to double the timeout on each full pass across all servers ... yours appears to limit it to less than that in a randomized way.
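Read literally, the existing behaviour being described would look roughly like this; variable names are approximations, not quotes from the source:

```c
/* Approximate shape of the behaviour described above: each completed
 * pass across all configured servers doubles the base timeout.  A real
 * implementation would also cap the shift to avoid overflowing size_t. */
static size_t current_retry_penalty(size_t base_timeout_ms,
                                    size_t ntries, size_t servercnt)
{
  size_t shift = ntries / servercnt;  /* full passes across all servers so far */

  return base_timeout_ms << shift;
}
```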

@Kontakter
Contributor Author

The issue is the following: I start a large cluster (or a large distributed computation) with 10,000+ processes, and at startup all of these processes want to resolve the addresses of some control hosts (or some source hosts), so they all go to the DNS server at once. In that case the server can start throttling resolve requests, and we want to avoid any synchronization of the retries (which really does happen in this kind of situation).

@bradh352
Member

bradh352 commented Nov 11, 2023

Interesting. Is there a true need for this to be configurable? Or any reason it shouldn't be always-on? If we could come up with parameters that should be acceptable for all use cases, my preference would be to not add more config options (one because it's more to maintain, but two because other people could benefit from this but may not realize there's a config option to fix their use case). I'm really not tied in any way to the current algorithm in use.

I'd think as long as ares__retry_penalty() returns something in the range of channel->timeout to channel->timeout * (ntries / servercnt), I'd be fine with it. Even better if it is more likely to return a larger number as (ntries / servercnt) increases.

@Kontakter
Contributor Author

Kontakter commented Nov 12, 2023

"Is there a true need for this to be configurable?" – it is a good question!

In our project we have actually used a fixed value for this jitter for years. But my initial intention was to leave the default behaviour of such a low-level library as c-ares unchanged, because my experience is that engineers often rely on very specific aspects of behaviour.

If we are OK with changing the default behaviour, then I can suggest the following:

  • I keep the current exponential backoff behaviour that the library has right now
  • I add a maxtimeout parameter (it is a separate problem, but we actually want to have some upper bound)
  • I drop the jitter parameter and instead pick the value randomly from the range [channel->timeout * (ntries / servercnt) * 0.5, channel->timeout * (ntries / servercnt)] (sketched below)

Are you OK with such a plan?
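A rough sketch of how the three bullets above could fit together; the struct fields (timeout, maxtimeout), the helper name, and the use of rand() are assumptions for illustration, not the merged code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the relevant channel options; not the real ares_channel. */
struct chan_opts {
  size_t timeout;     /* base per-query timeout in ms */
  size_t maxtimeout;  /* optional upper bound in ms, 0 = unset */
};

/* Hypothetical combination of the plan: keep the exponential backoff,
 * cap it with maxtimeout, then randomize within [x/2, x] to spread out
 * synchronized retries. */
static size_t proposed_query_timeout(const struct chan_opts *opts,
                                     size_t ntries, size_t servercnt)
{
  size_t shift    = ntries / servercnt;
  size_t timeplus = opts->timeout;

  while (shift-- > 0 && timeplus < SIZE_MAX / 2)
    timeplus *= 2;                              /* exponential backoff */

  if (opts->maxtimeout && timeplus > opts->maxtimeout)
    timeplus = opts->maxtimeout;                /* new upper bound */

  if (timeplus > 1)
    timeplus = timeplus / 2 +
               (size_t)rand() % (timeplus / 2 + 1);  /* pick from [x/2, x] */

  return timeplus;
}
```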

@bradh352
Member

I think that sounds reasonable to me :)

@bradh352
Member

Looking at this, I think ares__retry_penalty() is misnamed, since the value it returns is used directly to set the overall query timeout (not just to add a penalty on retries). It should actually be ares__calc_query_timeout() I think, so it needs to return a minimum of channel->timeout and a maximum of channel->timeoutmax (if set), since this function is also called on even the first attempt. In fact, the jitter should only be applied if shift != 0, as 0 is the first pass.

@Kontakter
Contributor Author

Looking at this, I think ares__retry_penalty() is misnamed, since the value it returns is used directly to set the overall query timeout (not just to add a penalty on retries). It should actually be ares__calc_query_timeout() I think, so it needs to return a minimum of channel->timeout and a maximum of channel->timeoutmax (if set), since this function is also called on even the first attempt. In fact, the jitter should only be applied if shift != 0, as 0 is the first pass.

I have fixed it, thank you for the remark
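Putting the remark above together with the earlier plan, the renamed function might look roughly like this; a hedged sketch under assumed parameter names (timeout, maxtimeout), not the code that was actually merged (commit 7a140cb is the authoritative version):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hedged sketch of an ares__calc_query_timeout()-style function as
 * described above: no jitter on the first pass (shift == 0), and the
 * result is clamped to [timeout, maxtimeout]. */
static size_t calc_query_timeout_sketch(size_t timeout, size_t maxtimeout,
                                        size_t ntries, size_t servercnt)
{
  size_t shift    = ntries / servercnt;
  size_t timeplus = timeout;

  if (shift == 0)
    return timeout;                 /* first pass: plain configured timeout */

  while (shift-- > 0 && timeplus < SIZE_MAX / 2)
    timeplus *= 2;                  /* double once per completed pass */

  /* randomize within [timeplus/2, timeplus] to de-synchronize retries */
  timeplus = timeplus / 2 + (size_t)rand() % (timeplus / 2 + 1);

  if (timeplus < timeout)
    timeplus = timeout;             /* never go below the base timeout */
  if (maxtimeout && timeplus > maxtimeout)
    timeplus = maxtimeout;          /* respect the optional upper bound */

  return timeplus;
}
```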

@bradh352
Member

Not seeing another commit after my last comment; did you forget to push it?

@Kontakter
Contributor Author

Not seeing another commit after my last comment; did you forget to push it?

Oops, fixed

@bradh352 bradh352 merged commit 7a140cb into c-ares:main Nov 13, 2023
19 of 23 checks passed
@bradh352
Member

Can you look at 4acd575 to make sure I didn't negatively impact your logic? Just some fixes I noticed.
