
plugins/forward: Add max_concurrent option #3640

Merged: 14 commits merged into coredns:master on Feb 4, 2020

Conversation

@chrisohaver (Member):

1. Why is this pull request needed and what does it do?

Adds a concurrent query limit option to the forward plugin, as described in #3635

2. Which issues (if any) are related?

#3635

3. Which documentation changes (if any) need to be made?

included

4. Does this introduce a backward incompatible change or deprecation?

no

@codecov-io commented Jan 29, 2020:

Codecov Report

Merging #3640 into master will decrease coverage by <.01%.
The diff coverage is 50%.


@@            Coverage Diff             @@
##           master    #3640      +/-   ##
==========================================
- Coverage    56.6%   56.59%   -0.01%     
==========================================
  Files         220      220              
  Lines       11039    11055      +16     
==========================================
+ Hits         6249     6257       +8     
- Misses       4311     4317       +6     
- Partials      479      481       +2
Impacted Files Coverage Δ
plugin/forward/forward.go 51.61% <0%> (-3.56%) ⬇️
plugin/forward/setup.go 58.38% <80%> (+1.55%) ⬆️
plugin/kubernetes/setup.go 64.24% <0%> (ø) ⬆️
test/server.go 82.75% <0%> (ø) ⬆️
plugin/test/scrape.go 0% <0%> (ø) ⬆️
plugin/trace/trace.go 79.06% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

if f.maxQueryCount > 0 {
count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
Member:

In theory there is a race condition where we may drop some extra queries when we're hovering near the edge. I don't think we care though, not enough to mutex this whole thing.

Member Author (@chrisohaver):

Yes, I lingered on this issue for a bit after seeing it happen during testing, but decided to stick with the simple, performant atomic despite this. The count can go over the limit if overwhelmed, since we don't check the limit in the same atomic action as the increment. It would be nice if there were an atomic "increment-if-less-than" action, but I don't think this is a big deal, since increments above the limit are immediately decremented.

Adding an atomic check of the value before the increment lessens the issue but does not completely eliminate it, and I don't think it's enough of a problem to warrant the extra check.
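
For illustration only (not what the PR does; it deliberately keeps the single atomic add for performance), an "increment-if-less-than" can be approximated in Go with a compare-and-swap loop. The function and variable names below are made up for the sketch:

package main

import "sync/atomic"

// tryAcquire is a hypothetical increment-if-less-than: it increments *count
// only while the current value is below max, avoiding the brief overshoot
// that a plain AddInt64 followed by a comparison allows.
func tryAcquire(count *int64, max int64) bool {
	for {
		cur := atomic.LoadInt64(count)
		if cur >= max {
			return false // at the limit; reject without incrementing
		}
		if atomic.CompareAndSwapInt64(count, cur, cur+1) {
			return true // slot acquired; caller decrements when done
		}
		// another goroutine changed the counter between the load and the CAS; retry
	}
}

func main() {
	var inflight int64
	if tryAcquire(&inflight, 2) {
		defer atomic.AddInt64(&inflight, -1)
		// ... forward the query ...
	}
}

The CAS loop costs at least two atomic operations per query (more under contention), which is why a single AddInt64 with a small possible overshoot is a reasonable trade-off here.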

@johnbelamaric (Member):

/lgtm

@corbot (bot) left a comment:

Approved by johnbelamaric

@@ -37,6 +38,9 @@ type Forward struct {
maxfails uint32
expire time.Duration

maxQueryCount int64
queryCount int64
Member:

For sync/atomic these need to be put first in the struct; otherwise they don't align on ARM.

Member Author (@chrisohaver):

Thanks, fixed.
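
For context, the constraint being referred to: on 32-bit platforms such as ARM, 64-bit values passed to sync/atomic functions must be 64-bit aligned, and Go only guarantees that for the first word in an allocated struct, so the atomically updated counter goes at the top. A minimal sketch with an abbreviated field set (the real struct has more fields):

package forward

import "time"

// Forward (abbreviated). The int64 counter accessed with sync/atomic is kept
// first so it is guaranteed to be 64-bit aligned on 32-bit platforms.
type Forward struct {
	concurrent int64 // updated with atomic.AddInt64 on every forwarded query

	maxConcurrent int64
	maxfails      uint32
	expire        time.Duration
	// ... remaining fields omitted ...
}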

count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
Member:

Would REFUSED be better here?

Member Author (@chrisohaver):

I'm not sure; I think either could work. I initially chose REFUSED, because this is a known/expected failure (a soft failure), but then changed it to SERVFAIL in a later commit after reading some discussions on the difference between the two.

There isn't a strict difference; it's fuzzy:
REFUSED is supposed to indicate refusal due to policy. I read that as a function of the client or queried name, but that's vague, and you could describe a query-rate limit like this as a "policy".
SERVFAIL is supposed to indicate something broken.

@@ -211,6 +212,18 @@ func parseBlock(c *caddy.Controller, f *Forward) error {
default:
return c.Errf("unknown policy '%s'", x)
}
case "max_queries":
Member:

Is there a better name, maybe with 'concurrent' in it? (Not sure there is, but max_queries is not what's implemented here.)

Member Author (@chrisohaver):

changed to max_concurrent
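
For reference, the renamed option is configured inside a forward block in the Corefile; the upstream address and limit below are placeholder values:

. {
    forward . 8.8.8.8 {
        max_concurrent 1000
    }
}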

@miekg (Member) commented Jan 30, 2020:

Also see #2593

count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
Contributor:

No need to construct the error object here on each query; it can be constructed once as a global var. See ErrNoHealthy for example.

Member Author (@chrisohaver):

Thanks, fixed.
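
The suggested pattern, which a later hunk shows being adopted as ErrLimitExceeded, is a package-level error value so nothing is allocated on the per-query rejection path. A minimal sketch of the before/after:

package forward

import "errors"

// Before: a fresh error value is allocated on every rejected query.
//
//	return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
//
// After (the ErrNoHealthy pattern): the error is built once at package level
// and the handler returns the shared value.
var ErrLimitExceeded = errors.New("concurrent queries exceeded maximum")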

@chrisohaver (Member Author):

I should add a metric here for number of refused packets.
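
The metric that was added appears in a later hunk as MaxConcurrentRejectCount. As a rough sketch of how such a counter is typically declared with the Prometheus client library (the namespace, subsystem, metric name, and help text here are illustrative, not necessarily what was merged):

package forward

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// MaxConcurrentRejectCount counts queries rejected because the configured
// concurrent-query limit was reached; it is bumped with Add(1) at the
// rejection point in the handler.
var MaxConcurrentRejectCount = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "coredns",
	Subsystem: "forward",
	Name:      "max_concurrent_reject_count",
	Help:      "Counter of queries rejected because the concurrent query limit was reached.",
})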

@miekg (Member) commented Jan 30, 2020 via email


opts options // also here for testing

Next plugin.Handler
}

// ErrLimitExceeded indicates that a query was rejected because the number of concurrent queries has exceeded
// the maximum allowed (maxConcurrent)
var ErrLimitExceeded = errors.New("concurrent queries exceeded maximum")
Member:

I think it is valuable to include the configured max here.

defer atomic.AddInt64(&(f.concurrent), -1)
if count > f.maxConcurrent {
MaxConcurrentRejectCount.Add(1)
return dns.RcodeServerFailure, ErrLimitExceeded
@miekg (Member), Jan 30, 2020:

The difference between SERVFAIL and REFUSED is fuzzy. REFUSED is better here, because the server is fine (so not SERVFAIL); it actually refuses to do work for you.

Contributor:

From RFC 1035, section 4.1.1:

Server failure - The name server was unable to process this query due to a problem with the name server.

Refused - The name server refuses to perform the specified operation for policy reasons. For example, a name server may not wish to provide the information to the particular requester, or a name server may not wish to perform a particular operation (e.g., zone transfer) for particular data.

I think SERVFAIL suits better in this case. This is a temporary error. If we return REFUSED, the client will probably decide that the query is not permitted and will not retry. With SERVFAIL it could try again.

Member Author (@chrisohaver):

In a sense, this kind of failure is like a timeout: some transient condition (an action has exceeded a configured limit) means we are not going to answer. The difference is that with a timeout we cannot get an answer, while with the concurrent limit we choose not to.

I think clients would treat either of these equally, as a negative response. Since the definition of REFUSED/SERVFAIL is fuzzy, it would be unwise for a client to draw special conclusions in a general context for one answer vs the other.

If the desired behavior is to have the client retry, then I think we would need to drop the response (not respond at all).

@miekg (Member) commented Jan 30, 2020 via email

if n < 0 {
return fmt.Errorf("max_concurrent can't be negative: %d", n)
}
ErrLimitExceeded = errors.New("concurrent queries exceeded maximum " + c.Val())
Contributor:

There may be a race condition here. When CoreDNS is reloaded (e.g. with SIGUSR1), the old instance of the forward plugin is still handling traffic and may read ErrLimitExceeded while the new instance of the forward plugin updates it at the same time.

@miekg (Member) commented Jan 30, 2020 via email

@rdrozhdzh (Contributor):

Well, since the error text is not a constant anymore, that would be a reasonable choice.
Alternatively, the error can be saved in a Forward object.
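
A sketch of that second alternative, keeping the limit error on the plugin instance built during setup so an old instance serving traffic never shares the variable with a newly configured one. The field and constructor names here are illustrative, not taken from the merged code:

package forward

import (
	"errors"
	"strconv"
)

// Forward (abbreviated): the error carrying the configured maximum lives on
// the instance rather than in a shared package-level variable.
type Forward struct {
	maxConcurrent    int64
	errLimitExceeded error
	// ... other fields omitted ...
}

// newWithLimit builds an instance during setup; each reload constructs its
// own error value, so old and new instances never touch the same variable.
func newWithLimit(max int64) *Forward {
	return &Forward{
		maxConcurrent:    max,
		errLimitExceeded: errors.New("concurrent queries exceeded maximum " + strconv.FormatInt(max, 10)),
	}
}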

@miekg (Member) commented Jan 30, 2020 via email

@chrisohaver changed the title from "plugins/forward: Add max_queries option" to "plugins/forward: Add max_concurrent option" on Jan 31, 2020.
@miekg merged commit 22cd28a into coredns:master on Feb 4, 2020.
@chrisohaver mentioned this pull request on Feb 18, 2020.
nyodas pushed a commit to DataDog/coredns that referenced this pull request on Oct 26, 2020. Commit messages (each signed off by Chris O'Haver <cohaver@infoblox.com>):

* count and limit concurrent queries
* add option
* return servfail when limit exceeded
* docs
* docs
* docs
* review feedback
* move atomic counter to beginning of struct
* add comment for ErrLimitExceeded
* rename option to max_concurrent
* add metric
* response REFUSED; incl max in error; add more docs
* avoid err setup race
* respond SERVFAIL; doc memory usage
@chrisohaver deleted the forward-limit branch on January 9, 2021.