Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Measure and expose DNS programming latency from Kubernetes plugin. #3171

Merged
merged 1 commit into from Oct 4, 2019

Conversation

@oxddr
Copy link
Contributor

oxddr commented Aug 22, 2019

For now metric is measure only for headless services. Informer has been slighlty
refactored, so the code can measure latency without storing extra fields on
Endpoint struct.

1. Why is this pull request needed and what does it do?

Measure and expose DNS programming latency from Kubernetes plugin.

2. Which issues (if any) are related?

ref kubernetes/perf-tests#617

3. Which documentation changes (if any) need to be made?

none

4. Does this introduce a backward incompatible change or deprecation?

no

@corbot corbot bot requested a review from pmoroney Aug 22, 2019
@corbot

This comment has been minimized.

Copy link

corbot bot commented Aug 22, 2019

Thank you for your contribution. I've just checked the OWNERS files to find a suitable reviewer. This search was successful and I've asked pmoroney (via /OWNERS) for a review.
Note this is not an exclusive request. Anyone is free to provide a review of this pull request.

If you have questions or suggestions for this bot, please file an issue against the miekg/dreck repository.

The bot understands the commands that are listed here.

@oxddr oxddr force-pushed the oxddr:dns-prog-lat branch 2 times, most recently from fe12bd6 to 0a71216 Aug 22, 2019
@oxddr oxddr force-pushed the oxddr:dns-prog-lat branch from 0a71216 to ac309a8 Sep 2, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Sep 2, 2019

Codecov Report

Merging #3171 into master will increase coverage by 1.03%.
The diff coverage is 64.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3171      +/-   ##
==========================================
+ Coverage   55.25%   56.29%   +1.03%     
==========================================
  Files         217      218       +1     
  Lines       10777    10818      +41     
==========================================
+ Hits         5955     6090     +135     
+ Misses       4365     4252     -113     
- Partials      457      476      +19
Impacted Files Coverage Δ
plugin/kubernetes/setup.go 64.24% <0%> (-1.36%) ⬇️
plugin/kubernetes/metrics.go 100% <100%> (ø)
plugin/kubernetes/controller.go 47.03% <57.14%> (+36.43%) ⬆️
plugin/file/reload.go 75% <0%> (+5.55%) ⬆️
plugin/kubernetes/watch.go 44.44% <0%> (+44.44%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03ea2ae...b9ec733. Read the comment docs.

@oxddr

This comment has been minimized.

Copy link
Contributor Author

oxddr commented Sep 3, 2019

@johnbelamaric any chance you can have a look at this PR? It follows your suggestions on how measuring should be done without increasing memory footprint.

@johnbelamaric

This comment has been minimized.

Copy link
Member

johnbelamaric commented Sep 10, 2019

/assign

@johnbelamaric

This comment has been minimized.

Copy link
Member

johnbelamaric commented Sep 10, 2019

/assign: johnbelamaric

plugin/kubernetes/object/informer.go Show resolved Hide resolved
now := time.Now()
controller := newdnsController(client, dnsControlOpts{
initEndpointsCache: true,
// This is needed as otherwise the fake k8s client doesn't work properly.

This comment has been minimized.

Copy link
@johnbelamaric

johnbelamaric Sep 11, 2019

Member

Can you explain this more? If we can eliminate skipAPIObjectsCleanup this all becomes a lot simpler. I don't remember why we need that as an option or see anywhere it gets set?

This comment has been minimized.

Copy link
@oxddr

oxddr Sep 11, 2019

Author Contributor

This is needed purely for testing purposes. If we do cleanup resources in test, cache doesn't work properly and tests fail. I am not at my workstation at the moment, so I can't check the exact effect.

One option is to remove tests. However this non-trivial change to the code and I'd rather have it tested.

plugin/kubernetes/metrics.go Outdated Show resolved Hide resolved
plugin/kubernetes/metrics.go Outdated Show resolved Hide resolved
lastChangeTriggerTime := getLastChangeTriggerTime(endpoints)

if endpoints == nil || !isEndpointForHeadlessService(svcs, endpoints) || lastChangeTriggerTime.IsZero() {
return

This comment has been minimized.

Copy link
@johnbelamaric

johnbelamaric Sep 12, 2019

Member

so we're not recording the latency for programming clusterIP or headless_without_selector? For clusterIP, I would think it's the service creation time vs. now - since clusterIP is immutable I think that should work.

For headless_without_selector, maybe the endpoints creation time...or update time if we have those in metadata. I recall you saying that we need the annotation though I can recall the exact reason offhand.

This comment has been minimized.

Copy link
@oxddr

oxddr Sep 12, 2019

Author Contributor

Yes, this one does work only for headless services as for now. I was planning to follow up with the rest once this PR is hopefully merged. I assume adding those two missing types should be relatively easy.

}
val, err := time.Parse(time.RFC3339Nano, stringVal)
if err != nil {
log.Warningf("Error while parsing EndpointsLastChangeTriggerTimeAnnotation: '%s'. Error is %v",

This comment has been minimized.

Copy link
@johnbelamaric

johnbelamaric Sep 12, 2019

Member

Could this flood the logs?

If we're going to be wordy we should include enough info to debug and explain the actual effect we are warning about. The error value is not going to be useful I don't think, but knowing what it's supposed to look like will help.

log.Warningf("DnsProgrammingLatency cannot be calculated for Endpoints '%s/%s'; invalid %q annotation RFC3339 value of %q",
  endpoints.GetNamespace(), endpoints.GetName(), api.EndpointsLastChangeTriggerTime, stringVal)

This comment has been minimized.

Copy link
@oxddr

oxddr Sep 12, 2019

Author Contributor

Thanks for the snippet!

I doubt. This annotations is machine-generated and it's not coming from user. I'd assume it should be formatted correctly unless we have a bug (should be easy to catch in tests) or a pathological case (user tampering with it).

@oxddr oxddr force-pushed the oxddr:dns-prog-lat branch 2 times, most recently from af8eadc to 71ffed4 Sep 12, 2019
@oxddr oxddr force-pushed the oxddr:dns-prog-lat branch 2 times, most recently from 7191299 to 9215418 Oct 2, 2019
@oxddr

This comment has been minimized.

Copy link
Contributor Author

oxddr commented Oct 2, 2019

@johnbelamaric PTAL :)

@johnbelamaric

This comment has been minimized.

Copy link
Member

johnbelamaric commented Oct 3, 2019

I think we're OK here but I'd like to have @chrisohaver take a look as well.

/assign: @chrisohaver
/lgtm

@corbot
corbot bot approved these changes Oct 3, 2019
Copy link

corbot bot left a comment

LGTM by johnbelamaric

Copy link
Member

miekg left a comment

this misses an explanation of the new metrics in the README.md

@oxddr oxddr force-pushed the oxddr:dns-prog-lat branch from 2622eac to f941fe7 Oct 4, 2019
@oxddr

This comment has been minimized.

Copy link
Contributor Author

oxddr commented Oct 4, 2019

this misses an explanation of the new metrics in the README.md

Explanation added.

plugin/kubernetes/README.md Outdated Show resolved Hide resolved
For now metric is measure only for headless services. Informer has been slighlty
refactored, so the code can measure latency without storing extra fields on
Endpoint struct.

Signed-off-by: Janek Łukaszewicz <janluk@google.com>

Suggestions from code review

Co-Authored-By: Chris O'Haver <cohaver@infoblox.com>
@oxddr oxddr force-pushed the oxddr:dns-prog-lat branch from f941fe7 to b9ec733 Oct 4, 2019
Copy link
Member

miekg left a comment

no worries. I'll also propose some small changes to have things more inline with other READMEs. But let's merge this first.

@miekg

This comment has been minimized.

Copy link
Member

miekg commented Oct 4, 2019

/lgtm

@corbot
corbot bot approved these changes Oct 4, 2019
Copy link

corbot bot left a comment

LGTM by miekg

@miekg miekg merged commit d7cdb99 into coredns:master Oct 4, 2019
5 checks passed
5 checks passed
DCO DCO
Details
ci/circleci: kubernetes-tests Your tests passed on CircleCI!
Details
codecov/project 56.29% (target 50%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
stickler-ci No lint errors found.
@oxddr

This comment has been minimized.

Copy link
Contributor Author

oxddr commented Oct 4, 2019

Thank you all for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.