
Disable full call stacks from reconcile errors in logs #529

Closed
jbw976 opened this issue Jun 10, 2019 · 8 comments


jbw976 commented Jun 10, 2019

Is this a bug report or feature request?

  • Feature Request

What should the feature do: Currently, a big scary call stack is logged every time a Reconcile loop returns an error. The error is interesting, but the huge call stack adds no value and makes the logs harder to read.

Discussion from Slack:

jbw976
is there an easy way to run with debug logging on and not print out the stack trace for every error? it’s making it hard to read the worthwhile errors.

illya
we can add our own profile similar to development https://github.com/kubernetes-sigs/controller-runtime/blob/release-0.1/pkg/runtime/log/log.go#L54
where we can parameterize the zap.AddStacktrace call based on the flag

jbw976
that could make sense, hopefully it could be done in a maintainable way where if something changes about the upstream way that controller-runtime initializes loggers, we’d get that update as well without having to manually copy it over to our implementation
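A minimal sketch of what that parameterized profile could look like, using raw go.uber.org/zap wrapped with zapr for logr. The helper name and flag are hypothetical; this illustrates the idea from the Slack thread, not controller-runtime's actual helper:

```go
package main

import (
	"errors"

	"github.com/go-logr/zapr"
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newLogger builds a development-style zap logger whose stacktrace
// threshold is controlled by a flag. zap's development config captures
// stacks at Warn and above; raising the threshold to DPanic keeps the
// error lines but drops the call stacks.
func newLogger(stacktraces bool) (*zap.Logger, error) {
	stackLevel := zapcore.WarnLevel
	if !stacktraces {
		stackLevel = zapcore.DPanicLevel
	}
	// A trailing AddStacktrace option overrides the config's default.
	return zap.NewDevelopment(zap.AddStacktrace(stackLevel))
}

func main() {
	zl, err := newLogger(false)
	if err != nil {
		panic(err)
	}
	log := zapr.NewLogger(zl) // a logr.Logger, usable with controller-runtime's log.SetLogger
	log.Error(errors.New("mock error"), "Reconciler error") // logged without a stack trace
}
```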

What is use case behind this feature: As part of #331 to make the troubleshooting story better, removing these scary but unhelpful call stacks will also make the experience better.

@jbw976 jbw976 added this to the v0.3 milestone Jun 10, 2019

jbw976 commented Jun 10, 2019

Example call stack that doesn't provide any additional value:

2019-06-03T08:01:52.343Z	ERROR	kubebuilder.controller	Reconciler error	{"controller": "mysql.storage.crossplane.io", "request": "default/demo", "error": "Operation cannot be fulfilled on mysqlinstances.storage.crossplane.io \"demo\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/crossplaneio/crossplane/vendor/github.com/go-logr/zapr.(*zapLogger).Error
	/home/upbound/go/src/github.com/crossplaneio/crossplane/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/crossplaneio/crossplane/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/upbound/go/src/github.com/crossplaneio/crossplane/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/crossplaneio/crossplane/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/home/upbound/go/src/github.com/crossplaneio/crossplane/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/crossplaneio/crossplane/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/home/upbound/go/src/github.com/crossplaneio/crossplane/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/crossplaneio/crossplane/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/home/upbound/go/src/github.com/crossplaneio/crossplane/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/crossplaneio/crossplane/vendor/k8s.io/apimachinery/pkg/util/wait.Until
	/home/upbound/go/src/github.com/crossplaneio/crossplane/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88

@jbw976 jbw976 self-assigned this Jun 11, 2019
@infinitecompute infinitecompute modified the milestones: v0.3, Sprint 30 Jul 1, 2019
@jbw976 jbw976 removed this from the Sprint 30 milestone Jul 1, 2019
@jbw976 jbw976 added this to the v0.3 milestone Aug 14, 2019
@jbw976 jbw976 removed this from the v0.3 milestone Sep 9, 2019
@jbw976 jbw976 added this to the v0.4 milestone Sep 30, 2019

turkenh commented Oct 7, 2019

I am having difficulty reproducing this issue.

AFAIU, for controllers on the stack side, a fail function is implemented that does not return the error to controller-runtime; only the resulting error (if any) of Status().Update() is returned.
I also tried providing some malformed CRs of different types (a KubernetesApplicationResource with a bad template, a Network without spec.name, etc.), but I always found the error in either the status or the logs, without a stack trace. It looks like the controllers do not return errors to controller-runtime directly.

I will now try to disable/break connectivity to the api-server.


jbw976 commented Oct 7, 2019

@turkenh the failure to reproduce this call stack issue with a cloud provider stack is likely related to #849. I believe the logging framework used in Crossplane emits these call stacks only in debug logging mode.

The main crossplane Helm chart passes the --debug arg by default, so Crossplane uses debug logging. The cloud provider stacks are not using debug logging, which may be why you're not seeing any call stacks.

Ideally:

  • main crossplane would not emit any of these useless call stacks
  • cloud provider stacks would use debug logging by default and not emit any of these useless call stacks

@jbw976 jbw976 removed their assignment Oct 7, 2019

turkenh commented Oct 10, 2019

Thanks @jbw976.
I had already passed --debug, but I could not find a way to make the controller-runtime client calls fail.

Now I can reproduce it by putting a breakpoint in the sync function of the resource and modifying the object via kubectl (i.e. adding a label) while it is waiting at the breakpoint. That causes the resourceVersion to change and results in the error the object has been modified; please apply your changes to the latest version and try again, complete with a stack trace :)

I am now trying to figure out a way to disable them.


jbw976 commented Oct 10, 2019

I think an easy way to reproduce this is to simply return an error from a Reconcile() function. The handling of that should result in a call stack being logged (as long as --debug was passed in as an arg). This can be any error, it doesn't have to be a k8s api specific error, so you could just modify a Reconcile() func of your choice to simply return a new fmt.Errorf("mock error").
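For illustration, here is a minimal reconciler along those lines. It is a sketch assuming the Reconcile signature of that era (newer controller-runtime versions also take a context.Context), and the type name is made up:

```go
package controller

import (
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// mockFailReconciler always returns an error, so controller-runtime logs
// "Reconciler error" on every pass, with a call stack when the logger
// runs in development/debug mode.
type mockFailReconciler struct{}

func (r *mockFailReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	return reconcile.Result{}, fmt.Errorf("mock error")
}
```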

This is the code within the controller-runtime that sets up the logger with "development" settings, which means to print the call stack:
https://github.com/kubernetes-sigs/controller-runtime/blob/e1159d6655b260c4812fd0792cd1344ecc96a57e/pkg/log/zap/zap.go#L59

Looking at the v0.2 controller-runtime branch, it looks like this code has changed a bit since the revision we are using (at least in stack-gcp) that is linked above: https://github.com/kubernetes-sigs/controller-runtime/blob/release-0.2/pkg/log/zap/zap.go

It looks like with that latest version, you can now specify the log level at which stack traces are captured: https://github.com/kubernetes-sigs/controller-runtime/blob/release-0.2/pkg/log/zap/zap.go#L104

Hopefully we can just use that now instead of having to fork and make changes ourselves. This is the context around these new logging changes, which may be exactly what we were looking for:
kubernetes-sigs/controller-runtime#560
kubernetes-sigs/controller-runtime#442
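
Configuring it might look something like the following sketch, assuming the release-0.2 functional options where Level and StacktraceLevel take a *zap.AtomicLevel (later releases changed them to accept a zapcore.LevelEnabler):

```go
package main

import (
	uzap "go.uber.org/zap"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Log everything down to debug level, but only capture stack traces
	// for panic-level entries, so Reconciler errors stay on one line.
	lvl := uzap.NewAtomicLevelAt(uzap.DebugLevel)
	stacktraceLvl := uzap.NewAtomicLevelAt(uzap.PanicLevel)
	logf.SetLogger(zap.New(
		zap.UseDevMode(true),
		zap.Level(&lvl),
		zap.StacktraceLevel(&stacktraceLvl),
	))
}
```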


turkenh commented Oct 10, 2019

I think an easy way to reproduce this is to simply return an error from a Reconcile() function. The handling of that should result in a call stack being logged (as long as --debug was passed in as an arg). This can be any error, it doesn't have to be a k8s api specific error, so you could just modify a Reconcile() func of your choice to simply return a new fmt.Errorf("mock error").

This is so true; for some reason I was thinking I needed to reproduce this without changing the code 🤦‍♂️

Thanks for the links. I think I already found a workaround that doesn't change the current controller-runtime revision, but it makes more sense to update it and use the built-in options as you pointed out.


negz commented Jan 27, 2020

Having been working on adding logging and events to our controllers for the past week, I really feel this is working as intended. Put otherwise, I feel this feature request is over-indexing on making our debug logs (which are intended for just that: debugging Crossplane) a super friendly, non-threatening user experience. Quoting myself from a related PR, I think events are the place for that:

Are you sure debug level logging is the place we should be focusing on simplifying and making (arguably) less scary at the expense of becoming (also arguably) less informative? Perhaps I spent too much time in the Java world, but when I am debugging a thing I prefer more context, and definitely feel that stack traces are helpful. They let me easily find the code and context of the call I may be concerned about.

I agree with the sentiment of making it easier for folks to understand what is going on under the covers in Crossplane, but I don't think debug level logging is the place for that. I like the way @suskin framed this problem in https://github.com/crossplaneio/crossplane/pull/858/files#r330818106. If we're optimising for case 1 (users need to figure out why stuff isn't working as expected) I suggest we invest our time in adding support for emitting informative Kubernetes events. If we're optimising for scenarios 3 or 4 (infrastructure operators or developers deep debugging Crossplane) then I suggest more context and information (i.e. traces) is better.

It's worth noting that crossplane/crossplane-runtime#108 increases the number of places we log errors, and thus the number of places we log stack traces. In the following example I think the stack trace is actually quite useful, in that it lets me find the file and line from which the error originated:

2020-01-26T21:56:55.362-0800    DEBUG   stack-gcp       Cannot connect to provider      {"controller": "managed/cloudsqlinstance", "request": "/default-app-postgresql-sz22q", "uid": "f287809b-3295-4cd8-ade6-ec6a6331033d", "version": "17603", "external-name": "", "error": "provider could not be retrieved: Provider.gcp.crossplane.io \"example\" not found", "errorVerbose": "Provider.gcp.crossplane.io \"example\" not found\nprovider could not be retrieved\ngithub.com/crossplaneio/stack-gcp/pkg/controller/database.(*cloudsqlConnector).Connect\n\t/Users/negz/control/crossplaneio/stack-gcp/pkg/controller/database/cloudsql.go:90\ngithub.com/crossplaneio/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile\n\t/Users/negz/control/go/pkg/mod/github.com/crossplaneio/crossplane-runtime@v0.4.1-0.20200127052939-8661974d35bd/pkg/reconciler/managed/reconciler.go:496\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.17.0/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.17.0/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.17.0/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/Cellar/go/1.13.6/libexec/src/runtime/asm_amd64.s:1357", "requeue-after": "2020-01-26T21:57:25.362-0800"}
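
As an aside on where that errorVerbose field comes from: it's the error, not the logger. Errors wrapped with github.com/pkg/errors implement fmt.Formatter, and zap's error field encoder emits an extra errorVerbose key holding the %+v rendering (the message plus the stack captured at the Wrap site). A minimal sketch of that behaviour:

```go
package main

import (
	"github.com/pkg/errors"
	"go.uber.org/zap"
)

func connect() error {
	// errors.Wrap records a stack trace at the wrap site, and the wrapped
	// error implements fmt.Formatter: "%+v" prints the message plus stack.
	inner := errors.New(`Provider.gcp.crossplane.io "example" not found`)
	return errors.Wrap(inner, "provider could not be retrieved")
}

func main() {
	log, _ := zap.NewDevelopment()
	// zap notices the error implements fmt.Formatter and adds an extra
	// "errorVerbose" field containing its %+v rendering.
	log.Debug("Cannot connect to provider", zap.Error(connect()))
}
```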


jbw976 commented Feb 3, 2020

Thank you very much @negz for the recent work on standardizing our logging approaches and adding events to the controller-runtime reconciler patterns. I think the work you've done goes a long way towards surfacing troubleshooting information in a standard way, and I agree with your sentiments on the stack traces in the approach you have been working on. Great work @negz! 💪

@jbw976 jbw976 closed this as completed Feb 3, 2020
luebken pushed a commit to luebken/crossplane that referenced this issue Aug 3, 2021