plugins/forward: Add max_concurrent option #3640
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
Codecov Report
@@            Coverage Diff             @@
##           master    #3640      +/-   ##
==========================================
- Coverage    56.6%   56.59%   -0.01%
==========================================
  Files         220      220
  Lines       11039    11055     +16
==========================================
+ Hits         6249     6257      +8
- Misses       4311     4317      +6
- Partials      479      481      +2
plugin/forward/forward.go
	if f.maxQueryCount > 0 {
		count := atomic.AddInt64(&(f.queryCount), 1)
		defer atomic.AddInt64(&(f.queryCount), -1)
		if count > f.maxQueryCount {
In theory there is a race condition where we may drop some extra queries when we're hovering near the edge. I don't think we care though, not enough to mutex this whole thing.
Yes, I lingered on this issue for a bit after seeing it happen during testing, but decided to stick with simple/performant atomic despite this. The count can go over the limit if overwhelmed, since we don't check for the limit in the same atomic action as the increment. Would be nice if there was an atomic action "increment-if-less-than", but I don't think this is a big deal, since these increments above the limit are immediately decremented.
Adding an atomic check of the value before the increment lessens, but does not completely eliminate the issue. But I don't think it's a problem enough to warrant the extra check.
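The "increment-if-less-than" operation wished for above does not exist in sync/atomic, but it can be approximated with a compare-and-swap loop. The following is only a sketch of that idea, not what the PR does: the `limiter` type and its `tryAcquire` and `release` methods are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// limiter is a hypothetical helper; it is not part of the forward plugin.
type limiter struct {
	inflight int64 // accessed atomically; kept first for 64-bit alignment
	max      int64
}

// tryAcquire increments inflight only while it is below max, so the
// counter never overshoots the limit.
func (l *limiter) tryAcquire() bool {
	for {
		cur := atomic.LoadInt64(&l.inflight)
		if cur >= l.max {
			return false
		}
		if atomic.CompareAndSwapInt64(&l.inflight, cur, cur+1) {
			return true
		}
		// Another goroutine changed the counter in between; retry.
	}
}

// release gives back a slot obtained from a successful tryAcquire.
func (l *limiter) release() {
	atomic.AddInt64(&l.inflight, -1)
}

func main() {
	l := &limiter{max: 2}
	fmt.Println(l.tryAcquire(), l.tryAcquire(), l.tryAcquire()) // true true false
	l.release()
	fmt.Println(l.tryAcquire()) // true
}
```

The trade-off is a retry loop under contention, compared with the single `atomic.AddInt64` the PR uses and accepts a brief overshoot for.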
/lgtm
Approved by johnbelamaric
plugin/forward/forward.go
@@ -37,6 +38,9 @@ type Forward struct {
	maxfails uint32
	expire   time.Duration

	maxQueryCount int64
	queryCount    int64
For sync/atomic, these fields need to come first in the struct; otherwise they are not guaranteed to be 64-bit aligned on 32-bit platforms such as ARM.
Thanks, fixed.
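As a rough illustration of the alignment fix, the sketch below places the atomically accessed 64-bit field at the start of a struct, where Go guarantees 64-bit alignment for an allocated struct. The `forwardLike` type and `counterOffset` function are hypothetical stand-ins, not the plugin's actual definitions.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// forwardLike is a hypothetical stand-in for the plugin's struct.
type forwardLike struct {
	concurrent    int64 // atomic counter: first field, so 64-bit aligned
	maxConcurrent int64
	maxfails      uint32
}

// counterOffset reports where the atomic counter sits inside the struct.
func counterOffset() uintptr {
	var f forwardLike
	return unsafe.Offsetof(f.concurrent)
}

func main() {
	f := &forwardLike{maxConcurrent: 10}
	atomic.AddInt64(&f.concurrent, 1)
	fmt.Println(counterOffset()) // 0
}
```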
plugin/forward/forward.go
		count := atomic.AddInt64(&(f.queryCount), 1)
		defer atomic.AddInt64(&(f.queryCount), -1)
		if count > f.maxQueryCount {
			return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
Would REFUSED be better here?
I'm not sure. I think either could work. I initially thought REFUSED, because this is a known/expected failure (a soft failure), but then changed it to SERVFAIL in a later commit after reading some discussions on the difference between the two.
There isn't a strict difference; it's fuzzy.
REFUSED is supposed to indicate refusal due to policy. I read that as a function of the client or the queried name. But that's vague, and you could describe a query-rate-type limit like this as a "policy".
SERVFAIL is supposed to indicate something broken.
plugin/forward/setup.go
@@ -211,6 +212,18 @@ func parseBlock(c *caddy.Controller, f *Forward) error {
		default:
			return c.Errf("unknown policy '%s'", x)
		}
	case "max_queries":
is there a better name with maybe 'concurrent' in it? (not sure there is, but max_queries is not what's implemented here)
changed to max_concurrent
Also see #2593
plugin/forward/forward.go
		count := atomic.AddInt64(&(f.queryCount), 1)
		defer atomic.AddInt64(&(f.queryCount), -1)
		if count > f.maxQueryCount {
			return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
No need to construct the error object here on each query; it can be constructed once as a global var. See ErrNoHealthy for an example.
Thanks, fixed.
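The suggestion above is the usual Go sentinel-error pattern: allocate the error once at package level and return the same value on every rejection. A minimal sketch follows, assuming a hypothetical `admit` helper and a lowercase `errLimitExceeded` variable that are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errLimitExceeded is allocated once at package initialization.
// The name is illustrative only.
var errLimitExceeded = errors.New("concurrent queries exceeded maximum")

// admit is a hypothetical check: it returns the shared sentinel
// instead of building a new error value for each rejected query.
func admit(inflight, max int64) error {
	if inflight > max {
		return errLimitExceeded
	}
	return nil
}

func main() {
	err := admit(11, 10)
	fmt.Println(errors.Is(err, errLimitExceeded)) // true
	fmt.Println(admit(5, 10) == nil)              // true
}
```

Because the value is shared, callers can compare against it with `errors.Is`. This only works cleanly while the message is constant.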
I should add a metric here for number of refused packets.
yes, that would be a nice enhancement
plugin/forward/forward.go
	opts options // also here for testing

	Next plugin.Handler
}

// ErrLimitExceeded indicates that a query was rejected because the number of concurrent queries has exceeded
// the maximum allowed (maxConcurrent)
var ErrLimitExceeded = errors.New("concurrent queries exceeded maximum")
I think it is valuable to include the configured max here.
plugin/forward/forward.go
		defer atomic.AddInt64(&(f.concurrent), -1)
		if count > f.maxConcurrent {
			MaxConcurrentRejectCount.Add(1)
			return dns.RcodeServerFailure, ErrLimitExceeded
Difference between SERVFAIL or REFUSED is fuzzy. REFUSED is better here, because the server is fine (no SERVFAIL), but actually refuses to do work for you
From RFC 1035, section 4.1.1:
Server failure - The name server was unable to process this query due to a problem with the name server.
Refused - The name server refuses to perform the specified operation for policy reasons. For example, a name server may not wish to provide the information to the particular requester, or a name server may not wish to perform a particular operation (e.g., zone transfer) for particular data.
I think SERVFAIL suits this case better. This is a temporary error. If we return REFUSED, the client will probably decide that this query is not permitted and will not retry. With SERVFAIL, it could try again.
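For reference, the two response codes under discussion are RCODE values 2 (Server failure) and 5 (Refused) in the same section of RFC 1035, carried in the low four bits of the 16-bit header flags word. The sketch below only illustrates that encoding; the constant and function names are made up here, and a real server would use a DNS library's own constants.

```go
package main

import "fmt"

// RCODE values from RFC 1035, section 4.1.1 (names are illustrative).
const (
	rcodeServerFailure = 2
	rcodeRefused       = 5
)

// withRcode returns flags with the low four bits replaced by rcode.
func withRcode(flags, rcode uint16) uint16 {
	return (flags &^ 0x000F) | (rcode & 0x000F)
}

// rcodeOf extracts the RCODE from a header flags word.
func rcodeOf(flags uint16) uint16 {
	return flags & 0x000F
}

func main() {
	flags := uint16(0x8180) // QR=1, RD=1, RA=1, RCODE=0
	fmt.Println(rcodeOf(withRcode(flags, rcodeServerFailure))) // 2
	fmt.Println(rcodeOf(withRcode(flags, rcodeRefused)))       // 5
}
```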
In a sense, this kind of failure is like a timeout, in that there is some transient condition (some action has exceeded a configured limit) that makes it so we are not going to answer. Although in a timeout, we cannot get an answer, and in the concurrent limit case, we choose not to.
I think clients would treat either of these equally, as a negative response. Since the definition of REFUSED/SERVFAIL is fuzzy, it would be unwise for a client to draw special conclusions in a general context for one answer vs the other.
If the desired behavior is to have the client retry, then I think we would need to drop the response (not respond at all).
Fair enough, servfail works for me.
We should never not reply.
On Thu, 30 Jan 2020, 17:20, chrisohaver commented in plugin/forward/forward.go:
@@ -68,6 +76,15 @@ func (f *Forward) ServeDNS(ctx context.Context, w dns.ResponseWriter, r *dns.Msg
 		return plugin.NextOrFailure(f.Name(), f.Next, ctx, w, r)
 	}
+	if f.maxConcurrent > 0 {
+		count := atomic.AddInt64(&(f.concurrent), 1)
+		defer atomic.AddInt64(&(f.concurrent), -1)
+		if count > f.maxConcurrent {
+			MaxConcurrentRejectCount.Add(1)
+			return dns.RcodeServerFailure, ErrLimitExceeded
plugin/forward/setup.go
			if n < 0 {
				return fmt.Errorf("max_concurrent can't be negative: %d", n)
			}
			ErrLimitExceeded = errors.New("concurrent queries exceeded maximum " + c.Val())
There may be a race condition here. When coredns is reloaded (e.g. with SIGUSR1), the old instance of the forward plugin is still handling traffic and may read ErrLimitExceeded while the new instance of the forward plugin updates ErrLimitExceeded at the same time.
Just create the err in forward when you return; this is getting silly. If you worry about allocations, there is plenty of low-hanging fruit with more bang for the buck.
On Thu, 30 Jan 2020, 17:34, Ruslan Drozhdzh commented in plugin/forward/setup.go:
@@ -211,6 +213,19 @@ func parseBlock(c *caddy.Controller, f *Forward) error {
 		default:
 			return c.Errf("unknown policy '%s'", x)
 		}
+	case "max_concurrent":
+		if !c.NextArg() {
+			return c.ArgErr()
+		}
+		n, err := strconv.Atoi(c.Val())
+		if err != nil {
+			return err
+		}
+		if n < 0 {
+			return fmt.Errorf("max_concurrent can't be negative: %d", n)
+		}
+		ErrLimitExceeded = errors.New("concurrent queries exceeded maximum " + c.Val())
Well, since the error text is not a constant anymore, that would be a reasonable choice.
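One way to read the resolution above is that each plugin instance builds its own error from its own configured maximum at return time, so nothing package-global is written during a reload. The sketch below assumes a hypothetical `forwardLike` type and `limitErr` method; wrapping a constant sentinel with `%w` is an addition for illustration, not something the thread specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// errLimitExceeded stays constant, so concurrent readers never race on it.
var errLimitExceeded = errors.New("concurrent queries exceeded maximum")

// forwardLike is a hypothetical stand-in for one plugin instance.
type forwardLike struct {
	maxConcurrent int64
}

// limitErr builds the error on demand from this instance's own limit.
func (f *forwardLike) limitErr() error {
	return fmt.Errorf("%w %d", errLimitExceeded, f.maxConcurrent)
}

func main() {
	oldInst := &forwardLike{maxConcurrent: 100} // still serving during reload
	newInst := &forwardLike{maxConcurrent: 200} // freshly configured
	fmt.Println(oldInst.limitErr()) // concurrent queries exceeded maximum 100
	fmt.Println(newInst.limitErr()) // concurrent queries exceeded maximum 200
	fmt.Println(errors.Is(newInst.limitErr(), errLimitExceeded)) // true
}
```

The cost is one small allocation per rejected query, which the thread argues is not worth optimizing.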
I think mem usage is a good one, with the assumption that a goroutine is 2 KB. I was pondering a similar text for discarding packets directly in miekg/dns, but this PR is probably better, because here you can actually have lingering goroutines.
Also note we time out after 5s (I think); we can probably skip writing an answer if that happens, because the client is long gone (but that's a different PR).
On Thu, 30 Jan 2020, 17:46, chrisohaver commented in plugin/forward/README.md:
@@ -83,6 +84,8 @@ forward FROM TO... {
 * `round_robin` is a policy that selects hosts based on round robin ordering.
 * `sequential` is a policy that selects hosts based on sequential ordering.
 * `health_check`, use a different **DURATION** for health checking, the default duration is 0.5s.
+* `max_concurrent` **MAX** will limit the number of concurrent queries to **MAX**. Any new query that would
I put in notes for these. For picking a value, I describe a lower bound. I didn't tackle describing an upper bound yet. For an upper bound (to bind memory usage), it's a matter of how much memory an average query takes up inflight in CoreDNS, which is variable (e.g. encryption). I can try to estimate this experimentally. I don't know how globally applicable or accurate that would be.
* count and limit concurrent queries
* add option
* return servfail when limit exceeded
* docs
* docs
* docs
* review feedback
* move atomic counter to beginning of struct
* add comment for ErrLimitExceeded
* rename option to max_concurrent
* add metric
* response REFUSED; incl max in error; add more docs
* avoid err setup race
* respond SERVFAIL; doc memory usage
Signed-off-by: Chris O'Haver <cohaver@infoblox.com>
1. Why is this pull request needed and what does it do?
Adds a concurrent query limit option to the forward plugin, as described in #3635
2. Which issues (if any) are related?
#3635
3. Which documentation changes (if any) need to be made?
included
4. Does this introduce a backward incompatible change or deprecation?
no