Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plugins/forward: Add max_concurrent option #3640

Merged
merged 14 commits into from Feb 4, 2020
Merged

Conversation

@chrisohaver
Copy link
Member

chrisohaver commented Jan 29, 2020

1. Why is this pull request needed and what does it do?

Adds a concurrent query limit option to the forward plugin, as described in #3635

2. Which issues (if any) are related?

#3635

3. Which documentation changes (if any) need to be made?

included

4. Does this introduce a backward incompatible change or deprecation?

no

chrisohaver added 6 commits Jan 29, 2020
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
docs
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jan 29, 2020

Codecov Report

Merging #3640 into master will decrease coverage by <.01%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3640      +/-   ##
==========================================
- Coverage    56.6%   56.59%   -0.01%     
==========================================
  Files         220      220              
  Lines       11039    11055      +16     
==========================================
+ Hits         6249     6257       +8     
- Misses       4311     4317       +6     
- Partials      479      481       +2
Impacted Files Coverage Δ
plugin/forward/forward.go 51.61% <0%> (-3.56%) ⬇️
plugin/forward/setup.go 58.38% <80%> (+1.55%) ⬆️
plugin/kubernetes/setup.go 64.24% <0%> (ø) ⬆️
test/server.go 82.75% <0%> (ø) ⬆️
plugin/test/scrape.go 0% <0%> (ø) ⬆️
plugin/trace/trace.go 79.06% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff981b1...ed7bcdb. Read the comment docs.

if f.maxQueryCount > 0 {
count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {

This comment has been minimized.

Copy link
@johnbelamaric

johnbelamaric Jan 29, 2020

Member

In theory there is a race condition where we may drop some extra queries when we're hovering near the edge. I don't think we care though, not enough to mutex this whole thing.

This comment has been minimized.

Copy link
@chrisohaver

chrisohaver Jan 30, 2020

Author Member

Yes, I lingered on this issue for a bit after seeing it happen during testing, but decided to stick with simple/performant atomic despite this. The count can go over the limit if overwhelmed, since we don't check for the limit in the same atomic action as the increment. Would be nice if there was an atomic action "increment-if-less-than", but I don't think this is a big deal, since these increments above the limit are immediately decremented.

Adding an atomic check of the value before the increment lessens, but does not completely eliminate the issue. But I don't think it's a problem enough to warrant the extra check.

@johnbelamaric

This comment has been minimized.

Copy link
Member

johnbelamaric commented Jan 29, 2020

/lgtm

@corbot
corbot bot approved these changes Jan 29, 2020
Copy link

corbot bot left a comment

Approved by johnbelamaric

@@ -37,6 +38,9 @@ type Forward struct {
maxfails uint32
expire time.Duration

maxQueryCount int64
queryCount int64

This comment has been minimized.

Copy link
@miekg

miekg Jan 30, 2020

Member

for sync atomic these need to be put first in the struct otherwise they don't align on ARM

This comment has been minimized.

Copy link
@chrisohaver

chrisohaver Jan 30, 2020

Author Member

Thanks, fixed.

count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")

This comment has been minimized.

Copy link
@miekg

miekg Jan 30, 2020

Member

Would refuse be better here?

This comment has been minimized.

Copy link
@chrisohaver

chrisohaver Jan 30, 2020

Author Member

I'm not sure. I think either could work. I initially thought REFUSED, because this is a known/expected failure (soft failure). ... but then changed it to SERVFAIL in a later commit after reading some discussions on the difference between the two.

There isn't a strict difference: It's fuzzy...
REFUSE is supposed to indicate refusal due to policy. I read that as a function of the client or queried name. But thats vague, and you could describe a query rate type limit like this as a "policy".
SERVFAIL is supposed to indicate something broken.

@@ -211,6 +212,18 @@ func parseBlock(c *caddy.Controller, f *Forward) error {
default:
return c.Errf("unknown policy '%s'", x)
}
case "max_queries":

This comment has been minimized.

Copy link
@miekg

miekg Jan 30, 2020

Member

is there a better name with maybe 'concurrent' in it? (not sure there is, but max_queries is not what's implemented here)

This comment has been minimized.

Copy link
@chrisohaver

chrisohaver Jan 30, 2020

Author Member

changed to max_concurrent

@miekg

This comment has been minimized.

Copy link
Member

miekg commented Jan 30, 2020

Also see #2593

count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")

This comment has been minimized.

Copy link
@rdrozhdzh

rdrozhdzh Jan 30, 2020

Contributor

No needs to construct the error object here on each query, it can be constructed once as a global var, see ErrNoHealthy for example

This comment has been minimized.

Copy link
@chrisohaver

chrisohaver Jan 30, 2020

Author Member

Thanks, fixed.

@chrisohaver

This comment has been minimized.

Copy link
Member Author

chrisohaver commented Jan 30, 2020

I should add a metric here for number of refused packets.

@miekg

This comment has been minimized.

Copy link
Member

miekg commented Jan 30, 2020

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
plugin/forward/forward.go Outdated Show resolved Hide resolved
chrisohaver added 4 commits Jan 30, 2020
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>

opts options // also here for testing

Next plugin.Handler
}

// ErrLimitExceeded indicates that a query was rejected because the number of concurrent queries has exceeded
// the maximum allowed (maxConcurrent)
var ErrLimitExceeded = errors.New("concurrent queries exceeded maximum")

This comment has been minimized.

Copy link
@miekg

miekg Jan 30, 2020

Member

I think is valuable to include the configured max here.

defer atomic.AddInt64(&(f.concurrent), -1)
if count > f.maxConcurrent {
MaxConcurrentRejectCount.Add(1)
return dns.RcodeServerFailure, ErrLimitExceeded

This comment has been minimized.

Copy link
@miekg

miekg Jan 30, 2020

Member

Difference between SERVFAIL or REFUSED is fuzzy. REFUSED is better here, because the server is fine (no SERVFAIL), but actually refuses to do work for you

This comment has been minimized.

Copy link
@rdrozhdzh

rdrozhdzh Jan 30, 2020

Contributor

from rfc1035, section 4.1.1.

Server failure - The name server was unable to process this query due to a problem with the name server.

Refused - The name server refuses to perform the specified operation for policy reasons. For example, a name server may not wish to provide the information to the particular requester, or a name server may not wish to perform a particular operation (e.g., zone transfer) for particular data.

I think SERVFAIL suits better in this case. This is a temporary error. If we return REFUSED, client will probably decide that this query is not permitted and will not retry. In case of SERVFAIL it could try again.

This comment has been minimized.

Copy link
@chrisohaver

chrisohaver Jan 30, 2020

Author Member

In a sense, this kind of failure is like a timeout, in that there is some transient condition (some action has exceeded a configured limit) that makes it so we are not going to answer. Although in a timeout, we cannot get an answer, and in the concurrent limit case, we choose not to.

I think clients would treat either of these equally, as a negative response. Since the definition of REFUSED/SERVFAIL is fuzzy, it would be unwise for a client to draw special conclusions in a general context for one answer vs the other.

If the desired behavior is to have the client retry, then I think we would need to drop the response (not respond at all).

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@miekg

This comment has been minimized.

Copy link
Member

miekg commented Jan 30, 2020

if n < 0 {
return fmt.Errorf("max_concurrent can't be negative: %d", n)
}
ErrLimitExceeded = errors.New("concurrent queries exceeded maximum " + c.Val())

This comment has been minimized.

Copy link
@rdrozhdzh

rdrozhdzh Jan 30, 2020

Contributor

Here may be a race condition. When coredns is reloaded (e.g. with SIGUSR1) the old instance of forward plugin is still handling traffic and may access ErrLimitExceeded, and the new instance of forward plugin can update ErrLimitExceeded at the same time

@miekg

This comment has been minimized.

Copy link
Member

miekg commented Jan 30, 2020

@rdrozhdzh

This comment has been minimized.

Copy link
Contributor

rdrozhdzh commented Jan 30, 2020

Well, since the error text is not a constant anymore, that would be a reasonable choice.
Alternatively, the error can be saved in a Forward object.

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@miekg

This comment has been minimized.

Copy link
Member

miekg commented Jan 30, 2020

Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
@chrisohaver chrisohaver changed the title plugins/forward: Add max_queries option plugins/forward: Add max_concurrent option Jan 31, 2020
@miekg
miekg approved these changes Feb 4, 2020
@miekg miekg merged commit 22cd28a into coredns:master Feb 4, 2020
5 checks passed
5 checks passed
DCO DCO
Details
ci/circleci: kubernetes-tests Your tests passed on CircleCI!
Details
codecov/project 56.59% (target 50%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
stickler-ci No lint errors found.
@chrisohaver chrisohaver mentioned this pull request Feb 18, 2020
6 of 7 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.