Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for explicit no-cache policy for positive/negative dns… #2588

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
7 participants
@uruddarraju
Copy link

commented Feb 20, 2019

1. Why is this pull request needed and what does it do?

Please see #2586 for more context.
TLDR; This PR provides a way to disable one of the caches (positive/negative) and not the other. This enables usders to use coredns for positive caching with a ttl, while opting out of negative caching similar to the no-neg-cache support in dnsmasq

2. Which documentation changes (if any) need to be made?

cache.md should be updated, but will update this PR with the doc changes once I get some feedback.

3. Does this introduce a backward incompatible change or deprecation?

No

Also this enables us to have a cache policy like the following:

        cache {
          success 5000 5
          denial 0
        }

that would enable success cache but disable denial cache.

fixes #2586

@corbot

This comment has been minimized.

Copy link

commented Feb 20, 2019

Thank you for your contribution. I've just checked the OWNERS files to find a suitable reviewer. This search was successful and I've asked miekg (via plugin/cache/OWNERS) for a review.

If you have questions or suggestions for this bot, please file an issue against the miekg/dreck repository.

The bot understands the commands that are listed here.

@corbot corbot bot requested a review from miekg Feb 20, 2019

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 20, 2019

@codecov-io

This comment has been minimized.

Copy link

commented Feb 20, 2019

Codecov Report

Merging #2588 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2588      +/-   ##
==========================================
- Coverage   55.58%   55.57%   -0.02%     
==========================================
  Files         203      203              
  Lines       10305    10309       +4     
==========================================
+ Hits         5728     5729       +1     
- Misses       4147     4148       +1     
- Partials      430      432       +2
Impacted Files Coverage Δ
plugin/cache/cache.go 74.03% <0%> (-2.97%) ⬇️
plugin/file/reload.go 73.68% <0%> (+2.63%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ce9308...7148039. Read the comment docs.

@stp-ip

This comment has been minimized.

Copy link
Member

commented Feb 20, 2019

This seems to change the behavior even so it might be the correct behavior from a config perspective.
denial 0 did reduce the size to 1024 before and now would not cache at all. This would be a breaking change, but it might be ok, if documented correctly and announced in the release notes as it seems to be more in line with the actual intention.

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 20, 2019

@stp-ip Thanks for the comments. So would a doc change suffice for me to get this merged ? :)

Or, are there other things you'd need me to address before this gets merged ?

@stp-ip

This comment has been minimized.

Copy link
Member

commented Feb 20, 2019

From my side a docs change and a prepared release note to announce the breaking change would be enough as this seems to be a simple change, but I want another approval to get this merged as I might have missed some context. Just wanted to get feedback so that the second approval might already merge.
/lgtm

@corbot

corbot bot approved these changes Feb 20, 2019

Copy link

left a comment

LGTM by stp-ip

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 20, 2019

Thanks @stp-ip

@miekg

This comment has been minimized.

Copy link
Member

commented Feb 21, 2019

Setting size to zero seems natural to me. I don't see a backwards incompatible change.

But this is checking in the wrong place, nothing should be put in the cache in the first place.

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 21, 2019

@miekg yeah I initially thought of implementing this at Cache struct that is used commonly across. Or do you think it would be better for us to have the condition to check for the pcap/ncap here: https://github.com/coredns/coredns/blob/master/plugin/cache/cache.go#L201

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 21, 2019

Or would you think it is more appropriate to change the behavior of cache.Cache https://github.com/coredns/coredns/blob/master/plugin/pkg/cache/cache.go#L19 instead ? This would probably be a bigger change.

@uruddarraju uruddarraju force-pushed the uruddarraju:ncap branch from a91d0ce to 77378e2 Feb 21, 2019

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 21, 2019

@miekg can you please take an other look ? Moved the logic to not put stuff into cache when the capacity is set to 0.

Show resolved Hide resolved plugin/cache/cache.go Outdated

@uruddarraju uruddarraju force-pushed the uruddarraju:ncap branch from 77378e2 to 451cac2 Feb 21, 2019

@uruddarraju

This comment has been minimized.

Copy link
Author

commented Feb 21, 2019

Thanks @chrisohaver for the feedback. Addressed the indentation comment.

@uruddarraju uruddarraju force-pushed the uruddarraju:ncap branch from 451cac2 to 7148039 Feb 21, 2019

@chrisohaver
Copy link
Member

left a comment

Looks OK as is codewise IMO: To be super tidy, we could also avoid initializing the cache if it's not going to be used. i.e. In cache.New(size), returning nil if the size is zero, and possible other tweaks elsewhere to avoid nil reference.

Will approve when docs are updated and unit test is added.

@miekg

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

lgtm, but please fix what Chris mentioned here.

@chrisohaver

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

@uruddarraju

This comment has been minimized.

Copy link
Author

commented May 1, 2019

on it, sorry for the miss.

@miekg

This comment has been minimized.

Copy link
Member

commented May 23, 2019

@uruddarraju can you fix the conflict? Thanks!

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

@chrisohaver Thank you for picking this one up! We are also facing the same issue like mentioned here on stackoverflow. In our case it seems like that with quite old Linux versions (Debian, CentOS,...) there might be a bug somewhere either in Python or in some case in C libraries?

Output from a failing rsync operation:

rsync: getaddrinfo: rsync.app.strabag.com 873: No address associated with hostname
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]

If the cache is disabled (removed from config), everything works neat. I think it would be helpful to be able to disable the denied cache to avoid this problems.

If I get some time today, we will try this PR in our environment, as we can test it easily - we have an reproducible case for this.

@miekg

This comment has been minimized.

Copy link
Member

commented Jun 12, 2019

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

What does a tcpdump return here? I.e. would be good to get to the bottom. Disable the cache seems looks like a (good) mitigation, but doesn't sit well with me.

We are currently locked down in a meeting 😃, but I think we can provide a tcp dump later. I will send the dump via email to you for privacy reasons.

But are you sure that we should dig into it, as it seems like that only pretty old systems are affected?

@miekg

This comment has been minimized.

Copy link
Member

commented Jun 12, 2019

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

OK, then we will try to get the dumps, write some description and send it over to you. This might take some time (maybe until tomorrow) ... 😃 I will give feedback here, to let the others know about the progress!

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

@miekg First tcpdump files should be in your inbox 😄

@miekg

This comment has been minimized.

Copy link
Member

commented Jun 12, 2019

Thanks. We had a private discussion and I looked at the dump. It seems Ruby's "DNS" client disregards responses that don't the the AA (authoritative) bit set. I may be mistaken by I believe if an answer comes from a cache you shouldn't set this bit. OTOH I don't care that much, it's 1 bit and DNS is badly defined anyway; IOW: we can just set the bit even if we respond with an answer from the cache.

This would be a 1-line change - modulo some tests prolly.

@stefanmb

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

So for what it's worth I ran into a nasty production issue with Ubuntu 14.0.4 LTS (eglibc 2.19-0ubuntu6.14) where CoreDNS's lack of authoritative bit on cached responses caused failures for all apps using getaddrinfo, including Ruby and a MySQL client.

I couldn't reproduce the behaviour after Ubuntu 14.10 (glibc 2.19-10ubuntu2.3). I assume it's a glibc bug, but I couldn't track down the exact lines/patch involved.

I ended up working around it by writing a CoreDNS "authoritative" plugin that simply sets the authoritative bit on all responses from a domain whitelist (e.g. *.cluster.local since I have autopath enabled), and then moving the legacy workloads away from Ubuntu 14.04.

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

@miekg it seams like that what @stefanmb wrote confirms that there is some strange behavior of some older or buggy implementations which are caused by this "non/authoritative" flag switching in the cache plugin case. Can you point me the line, where the code can be changed to enforce the always authoritative response? I would hack it in tomorrow to prove this in our environment.

Afterwards it might be an option to make it configurable within the cache plugin maybe or if this flag doesn't really border anything of the DNS protocol you can pin it like you mentioned before.

Question: I don't want to hijack this issue for this - should I open a separate to not mix up the things here? Especially because kubernetes/kubeadm#1512 seems like to have the same root cause?

@stefanmb

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

@m4r10k All you need to do is clobber Authoritative = true: https://github.com/coredns/coredns/blob/master/plugin/cache/item.go#L59

Which OS and glibc version are you running?

@miekg

This comment has been minimized.

Copy link
Member

commented Jun 12, 2019

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

@stefanmb Thanks! Some legacies are always ghosting around like in your example and as always...
@miekg Thanks too!
😄 - I will open an issue for this tomorrow.

@miekg

This comment has been minimized.

Copy link
Member

commented Jun 12, 2019

@m4r10k

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

@miekg Wow, thanks!

@miekg

This comment has been minimized.

Copy link
Member

commented Jul 1, 2019

Fixed with 1.5.1

@miekg miekg closed this Jul 1, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.