
plugins/forward: Add max_concurrent option #3640

Merged: 14 commits merged into coredns:master on Feb 4, 2020

Conversation

@chrisohaver (Member):

1. Why is this pull request needed and what does it do?

Adds a concurrent query limit option to the forward plugin, as described in #3635

2. Which issues (if any) are related?

#3635

3. Which documentation changes (if any) need to be made?

included

4. Does this introduce a backward incompatible change or deprecation?

no

@codecov-io commented Jan 29, 2020:

Codecov Report

Merging #3640 into master will decrease coverage by <.01%.
The diff coverage is 50%.


@@            Coverage Diff             @@
##           master    #3640      +/-   ##
==========================================
- Coverage    56.6%   56.59%   -0.01%     
==========================================
  Files         220      220              
  Lines       11039    11055      +16     
==========================================
+ Hits         6249     6257       +8     
- Misses       4311     4317       +6     
- Partials      479      481       +2
Impacted Files Coverage Δ
plugin/forward/forward.go 51.61% <0%> (-3.56%) ⬇️
plugin/forward/setup.go 58.38% <80%> (+1.55%) ⬆️
plugin/kubernetes/setup.go 64.24% <0%> (ø) ⬆️
test/server.go 82.75% <0%> (ø) ⬆️
plugin/test/scrape.go 0% <0%> (ø) ⬆️
plugin/trace/trace.go 79.06% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

if f.maxQueryCount > 0 {
count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
Member:

In theory there is a race condition where we may drop some extra queries when we're hovering near the edge. I don't think we care though, not enough to mutex this whole thing.

Member Author (@chrisohaver):

Yes, I lingered on this issue for a bit after seeing it happen during testing, but decided to stick with the simple, performant atomic despite this. The count can go over the limit if overwhelmed, since we don't check the limit in the same atomic action as the increment. It would be nice if there were an atomic "increment-if-less-than" action, but I don't think this is a big deal, since increments above the limit are immediately decremented.

Adding an atomic check of the value before the increment lessens the issue but does not completely eliminate it, and I don't think it's enough of a problem to warrant the extra check.
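
For illustration only (not what the PR does; it deliberately keeps the single atomic add for performance), an "increment-if-less-than" can be approximated in Go with a compare-and-swap loop. The function and variable names below are made up for the sketch:

package main

import "sync/atomic"

// tryAcquire is a hypothetical increment-if-less-than: it increments *count
// only while the current value is below max, avoiding the brief overshoot
// that a plain AddInt64 followed by a comparison allows.
func tryAcquire(count *int64, max int64) bool {
	for {
		cur := atomic.LoadInt64(count)
		if cur >= max {
			return false // at the limit; reject without incrementing
		}
		if atomic.CompareAndSwapInt64(count, cur, cur+1) {
			return true // slot acquired; caller decrements when done
		}
		// another goroutine changed the counter between the load and the CAS; retry
	}
}

func main() {
	var inflight int64
	if tryAcquire(&inflight, 2) {
		defer atomic.AddInt64(&inflight, -1)
		// ... forward the query ...
	}
}

The CAS loop costs at least two atomic operations per query (more under contention), which is why a single AddInt64 with a small possible overshoot is a reasonable trade-off here.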

@johnbelamaric (Member):

/lgtm

@corbot (bot) left a comment:

Approved by johnbelamaric

@@ -37,6 +38,9 @@ type Forward struct {
maxfails uint32
expire time.Duration

maxQueryCount int64
queryCount int64
Member:

For sync/atomic these need to be put first in the struct; otherwise they don't align on ARM.

Member Author (@chrisohaver):

Thanks, fixed.
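
For context, the constraint being referred to: on 32-bit platforms such as ARM, 64-bit values passed to sync/atomic functions must be 64-bit aligned, and Go only guarantees that for the first word in an allocated struct, so the atomically updated counter goes at the top. A minimal sketch with an abbreviated field set (the real struct has more fields):

package forward

import "time"

// Forward (abbreviated). The int64 counter accessed with sync/atomic is kept
// first so it is guaranteed to be 64-bit aligned on 32-bit platforms.
type Forward struct {
	concurrent int64 // updated with atomic.AddInt64 on every forwarded query

	maxConcurrent int64
	maxfails      uint32
	expire        time.Duration
	// ... remaining fields omitted ...
}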

count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
Member:

Would REFUSED be better here?

Member Author (@chrisohaver):

I'm not sure; I think either could work. I initially chose REFUSED, because this is a known/expected failure (a soft failure), but then changed it to SERVFAIL in a later commit after reading some discussions on the difference between the two.

There isn't a strict difference; it's fuzzy:
REFUSED is supposed to indicate refusal due to policy. I read that as a function of the client or queried name, but that's vague, and you could describe a query-rate limit like this as a "policy".
SERVFAIL is supposed to indicate something broken.

@@ -211,6 +212,18 @@ func parseBlock(c *caddy.Controller, f *Forward) error {
default:
return c.Errf("unknown policy '%s'", x)
}
case "max_queries":
Member:

Is there a better name, maybe with 'concurrent' in it? (Not sure there is, but max_queries is not what's implemented here.)

Member Author (@chrisohaver):

changed to max_concurrent
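
For reference, the renamed option is configured inside a forward block in the Corefile; the upstream address and limit below are placeholder values:

. {
    forward . 8.8.8.8 {
        max_concurrent 1000
    }
}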

@miekg (Member) commented Jan 30, 2020:

Also see #2593

count := atomic.AddInt64(&(f.queryCount), 1)
defer atomic.AddInt64(&(f.queryCount), -1)
if count > f.maxQueryCount {
return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
Contributor:

No need to construct the error object here on each query; it can be constructed once as a global var. See ErrNoHealthy for example.

Member Author (@chrisohaver):

Thanks, fixed.
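
The suggested pattern, which a later hunk shows being adopted as ErrLimitExceeded, is a package-level error value so nothing is allocated on the per-query rejection path. A minimal sketch of the before/after:

package forward

import "errors"

// Before: a fresh error value is allocated on every rejected query.
//
//	return dns.RcodeServerFailure, errors.New("inflight forward queries exceeded maximum")
//
// After (the ErrNoHealthy pattern): the error is built once at package level
// and the handler returns the shared value.
var ErrLimitExceeded = errors.New("concurrent queries exceeded maximum")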

@chrisohaver (Member Author):

I should add a metric here for number of refused packets.
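
The metric that was added appears in a later hunk as MaxConcurrentRejectCount. As a rough sketch of how such a counter is typically declared with the Prometheus client library (the namespace, subsystem, metric name, and help text here are illustrative, not necessarily what was merged):

package forward

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// MaxConcurrentRejectCount counts queries rejected because the configured
// concurrent-query limit was reached; it is bumped with Add(1) at the
// rejection point in the handler.
var MaxConcurrentRejectCount = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "coredns",
	Subsystem: "forward",
	Name:      "max_concurrent_reject_count",
	Help:      "Counter of queries rejected because the concurrent query limit was reached.",
})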

@miekg (Member) commented Jan 30, 2020 via email


opts options // also here for testing

Next plugin.Handler
}

// ErrLimitExceeded indicates that a query was rejected because the number of concurrent queries has exceeded
// the maximum allowed (maxConcurrent)
var ErrLimitExceeded = errors.New("concurrent queries exceeded maximum")
Member:

I think it is valuable to include the configured max here.

defer atomic.AddInt64(&(f.concurrent), -1)
if count > f.maxConcurrent {
MaxConcurrentRejectCount.Add(1)
return dns.RcodeServerFailure, ErrLimitExceeded
@miekg (Member), Jan 30, 2020:

The difference between SERVFAIL and REFUSED is fuzzy. REFUSED is better here, because the server is fine (so not SERVFAIL); it actually refuses to do work for you.

Contributor:

From RFC 1035, section 4.1.1:

Server failure - The name server was unable to process this query due to a problem with the name server.

Refused - The name server refuses to perform the specified operation for policy reasons. For example, a name server may not wish to provide the information to the particular requester, or a name server may not wish to perform a particular operation (e.g., zone transfer) for particular data.

I think SERVFAIL suits better in this case. This is a temporary error. If we return REFUSED, the client will probably decide that the query is not permitted and will not retry. With SERVFAIL it could try again.

Member Author (@chrisohaver):

In a sense, this kind of failure is like a timeout: some transient condition (an action has exceeded a configured limit) means we are not going to answer. The difference is that with a timeout we cannot get an answer, while with the concurrent limit we choose not to.

I think clients would treat either of these equally, as a negative response. Since the definition of REFUSED/SERVFAIL is fuzzy, it would be unwise for a client to draw special conclusions in a general context for one answer vs the other.

If the desired behavior is to have the client retry, then I think we would need to drop the response (not respond at all).

@miekg (Member) commented Jan 30, 2020 via email

if n < 0 {
return fmt.Errorf("max_concurrent can't be negative: %d", n)
}
ErrLimitExceeded = errors.New("concurrent queries exceeded maximum " + c.Val())
Contributor:

There may be a race condition here. When CoreDNS is reloaded (e.g. with SIGUSR1), the old instance of the forward plugin is still handling traffic and may read ErrLimitExceeded while the new instance of the forward plugin updates it at the same time.

@miekg (Member) commented Jan 30, 2020 via email

@rdrozhdzh (Contributor):

Well, since the error text is not a constant anymore, that would be a reasonable choice.
Alternatively, the error can be saved in a Forward object.
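
A sketch of that second alternative, keeping the limit error on the plugin instance built during setup so an old instance serving traffic never shares the variable with a newly configured one. The field and constructor names here are illustrative, not taken from the merged code:

package forward

import (
	"errors"
	"strconv"
)

// Forward (abbreviated): the error carrying the configured maximum lives on
// the instance rather than in a shared package-level variable.
type Forward struct {
	maxConcurrent    int64
	errLimitExceeded error
	// ... other fields omitted ...
}

// newWithLimit builds an instance during setup; each reload constructs its
// own error value, so old and new instances never touch the same variable.
func newWithLimit(max int64) *Forward {
	return &Forward{
		maxConcurrent:    max,
		errLimitExceeded: errors.New("concurrent queries exceeded maximum " + strconv.FormatInt(max, 10)),
	}
}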

@miekg (Member) commented Jan 30, 2020 via email

@chrisohaver changed the title from "plugins/forward: Add max_queries option" to "plugins/forward: Add max_concurrent option" on Jan 31, 2020.
@miekg merged commit 22cd28a into coredns:master on Feb 4, 2020.
@chrisohaver mentioned this pull request on Feb 18, 2020.
nyodas pushed a commit to DataDog/coredns that referenced this pull request on Oct 26, 2020. Commit messages (each signed off by Chris O'Haver <cohaver@infoblox.com>):

* count and limit concurrent queries
* add option
* return servfail when limit exceeded
* docs
* docs
* docs
* review feedback
* move atomic counter to beginning of struct
* add comment for ErrLimitExceeded
* rename option to max_concurrent
* add metric
* response REFUSED; incl max in error; add more docs
* avoid err setup race
* respond SERVFAIL; doc memory usage
@chrisohaver deleted the forward-limit branch on January 9, 2021.