Support jitter and maxtimeout for retry penalties #606

Merged
merged 5 commits into c-ares:main on Nov 13, 2023

Conversation

Kontakter
Contributor

We use c-ares for DNS resolution in the YTsaurus project: https://github.com/YTsaurus/YTsaurus

In our installations we face a problem with synchronized requests to the DNS server from hundreds of thousands of processes, and I want to add jitter to the retry backoffs to smooth these requests out over time.

@bradh352
Member

A few things here:

  1. on FreeBSD you'll need to include stdint.h to get SIZE_MAX (just make sure you have it in #ifdef HAVE_STDINT_H guards)
  2. The OptionsChannelInit test is failing; it looks like you used the wrong variable name in your if() statement in ares_save_options(). It should use channel->optmask.
  3. Need to update docs/ares_init_options.3 for the new params
  4. Can you provide more insight into the algorithm used here and how it helps? I'm not at all familiar with jitter-type algorithms, and the code doesn't really discuss anything.

@Kontakter
Contributor Author

Thank you for the fast reply!

I have fixed 1, 2 and 3.

Regarding 4 – the new code just adds some noise to the retry backoff, and the value of the jitter controls the amplitude of that noise. The code may look tricky because I am trying to avoid overflowing timeplus after adding the noise.
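A minimal sketch of the overflow guard being described, assuming a helper along these lines; the function name, the jitter_ms parameter, and the use of rand() as the entropy source are illustrative, not the actual patch:

```c
#include <stdint.h>   /* SIZE_MAX (the real tree guards this with HAVE_STDINT_H) */
#include <stdlib.h>   /* rand(): stand-in for whatever RNG the library really uses */

/* Hypothetical helper: add bounded random noise ("jitter") to a retry
 * timeout while making sure the size_t accumulator cannot overflow. */
static size_t add_jitter(size_t timeplus, size_t jitter_ms)
{
  size_t noise;

  if (jitter_ms == 0)
    return timeplus;

  noise = (size_t)rand() % jitter_ms;

  /* clamp instead of wrapping around if timeplus + noise would overflow */
  if (timeplus > SIZE_MAX - noise)
    return SIZE_MAX;

  return timeplus + noise;
}
```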

@bradh352
Member

bradh352 commented Nov 11, 2023

What's the purpose of the backoff "noise"? I'm just trying to understand the issue it's resolving.

The current code just adjusts the timeouts for how long it waits for a response once the first round across all servers fails with timeouts. Since c-ares didn't get any responses from any server within the configured timeout interval, we have to assume the configured timeout values may be too low for the system, so we should try a round with a longer wait time.

The current code appears to try to double the timeout on each full pass across all servers ... yours appears to limit it to less than that in a randomized way.
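Read literally, the existing behaviour being described would look roughly like this; variable names are approximations, not quotes from the source:

```c
/* Approximate shape of the behaviour described above: each completed
 * pass across all configured servers doubles the base timeout.  A real
 * implementation would also cap the shift to avoid overflowing size_t. */
static size_t current_retry_penalty(size_t base_timeout_ms,
                                    size_t ntries, size_t servercnt)
{
  size_t shift = ntries / servercnt;  /* full passes across all servers so far */

  return base_timeout_ms << shift;
}
```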

@Kontakter
Contributor Author

The issue is the following: I start a large cluster (or a large distributed computation) with 10,000+ processes, and at startup all of these processes want to resolve the addresses of some control hosts (or some source hosts), so they all go to the DNS server at once. In that case the server can start throttling resolve requests, and we want to avoid any synchronization of the retries (which really does happen in this kind of situation).

@bradh352
Member

bradh352 commented Nov 11, 2023

Interesting. Is there a true need for this to be configurable? Or any reason it shouldn't be always-on? If we could come up with parameters that should be acceptable for all use cases, my preference would be to not add more config options (one because it's more to maintain, but two because other people could benefit from this but may not realize there's a config option to fix their use case). I'm really not tied in any way to the current algorithm in use.

I'd think as long as ares__retry_penalty() returns something in the range of channel->timeout to channel->timeout * (ntries / servercnt), I'd be fine with it. Even better if it is more likely to return a larger number as (ntries / servercnt) increases.

@Kontakter
Contributor Author

Kontakter commented Nov 12, 2023

"Is there a true need for this to be configurable?" – it is a good question!

In our project we have actually used a fixed value for this jitter for years. But my initial intention was to leave the default behaviour of such a low-level library as c-ares unchanged, because my experience is that engineers often rely on very specific aspects of behaviour.

If we are OK with changing the default behaviour, then I can suggest the following:

  • I keep the current exponential backoff behaviour that the library has right now
  • I add a maxtimeout parameter (it is a separate problem, but we actually want to have some upper bound)
  • I drop the jitter parameter and instead pick the value randomly from the range [channel->timeout * (ntries / servercnt) * 0.5, channel->timeout * (ntries / servercnt)] (sketched below)

Are you OK with such a plan?
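A rough sketch of how the three bullets above could fit together; the struct fields (timeout, maxtimeout), the helper name, and the use of rand() are assumptions for illustration, not the merged code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the relevant channel options; not the real ares_channel. */
struct chan_opts {
  size_t timeout;     /* base per-query timeout in ms */
  size_t maxtimeout;  /* optional upper bound in ms, 0 = unset */
};

/* Hypothetical combination of the plan: keep the exponential backoff,
 * cap it with maxtimeout, then randomize within [x/2, x] to spread out
 * synchronized retries. */
static size_t proposed_query_timeout(const struct chan_opts *opts,
                                     size_t ntries, size_t servercnt)
{
  size_t shift    = ntries / servercnt;
  size_t timeplus = opts->timeout;

  while (shift-- > 0 && timeplus < SIZE_MAX / 2)
    timeplus *= 2;                              /* exponential backoff */

  if (opts->maxtimeout && timeplus > opts->maxtimeout)
    timeplus = opts->maxtimeout;                /* new upper bound */

  if (timeplus > 1)
    timeplus = timeplus / 2 +
               (size_t)rand() % (timeplus / 2 + 1);  /* pick from [x/2, x] */

  return timeplus;
}
```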

@bradh352
Member

I think that sounds reasonable to me :)

@bradh352
Member

Looking at this, I think ares__retry_penalty() is misnamed, since the value it returns is used directly to set the overall query timeout (not just to add a penalty on retries). It should actually be ares__calc_query_timeout() I think, so it needs to return a minimum of channel->timeout and a maximum of channel->timeoutmax (if set), since this function is also called on even the first attempt. In fact, the jitter should only be applied if shift != 0, as 0 is the first pass.

@Kontakter
Contributor Author

Looking at this, I think ares__retry_penalty() is misnamed, since the value it returns is used directly to set the overall query timeout (not just to add a penalty on retries). It should actually be ares__calc_query_timeout() I think, so it needs to return a minimum of channel->timeout and a maximum of channel->timeoutmax (if set), since this function is also called on even the first attempt. In fact, the jitter should only be applied if shift != 0, as 0 is the first pass.

I have fixed it, thank you for the remark
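Putting the remark above together with the earlier plan, the renamed function might look roughly like this; a hedged sketch under assumed parameter names (timeout, maxtimeout), not the code that was actually merged (commit 7a140cb is the authoritative version):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hedged sketch of an ares__calc_query_timeout()-style function as
 * described above: no jitter on the first pass (shift == 0), and the
 * result is clamped to [timeout, maxtimeout]. */
static size_t calc_query_timeout_sketch(size_t timeout, size_t maxtimeout,
                                        size_t ntries, size_t servercnt)
{
  size_t shift    = ntries / servercnt;
  size_t timeplus = timeout;

  if (shift == 0)
    return timeout;                 /* first pass: plain configured timeout */

  while (shift-- > 0 && timeplus < SIZE_MAX / 2)
    timeplus *= 2;                  /* double once per completed pass */

  /* randomize within [timeplus/2, timeplus] to de-synchronize retries */
  timeplus = timeplus / 2 + (size_t)rand() % (timeplus / 2 + 1);

  if (timeplus < timeout)
    timeplus = timeout;             /* never go below the base timeout */
  if (maxtimeout && timeplus > maxtimeout)
    timeplus = maxtimeout;          /* respect the optional upper bound */

  return timeplus;
}
```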

@bradh352
Member

Not seeing another commit after my last comment; did you forget to push it?

@Kontakter
Contributor Author

Not seeing another commit after my last comment; did you forget to push it?

Oops, fixed

@bradh352 bradh352 merged commit 7a140cb into c-ares:main Nov 13, 2023
19 of 23 checks passed
@bradh352
Member

Can you look at 4acd575 to make sure I didn't negatively impact your logic? Just some fixes I noticed.
