Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make it easier to reason about health check failures #374

Merged
merged 1 commit into from
Jul 5, 2021

Conversation

makkes
Copy link
Member

@makkes makkes commented Jun 24, 2021

Whenever a health check times out now, the most recently collected error for each resource will be printed as part of the error message. This excludes errors for those resources for which no error was reported in the last update. This is because whenever a timeout occurs, an error is reported on ALL resources, even those that have been seen as healthy before.

Also, this commit causes all successfully checked resources to be omitted in the error event.

This is how it will look in the API:

Before:

$ k -n flux-system get events --field-selector=involvedObject.kind=Kustomization -w
LAST SEEN   TYPE     REASON   OBJECT                MESSAGE
34m         Normal   info     kustomization/test1   configmap/test1 created
32m         Normal   info     kustomization/test1   Update completed
0s          Normal   error    kustomization/test1   Health check timed out for [Deployment 'default/fails', Deployment 'default/fails2', Job 'default/wrong', Deployment 'default/git']

After:

$ k -n flux-system get events --field-selector=involvedObject.kind=Kustomization -w
LAST SEEN   TYPE     REASON   OBJECT                MESSAGE
34m         Normal   info     kustomization/test1   configmap/test1 created
32m         Normal   info     kustomization/test1   Update completed
0s          Normal   error    kustomization/test1   Health check failed for [Deployment 'default/fails' (status 'InProgress'): context deadline exceeded, Deployment 'default/fails2' (status 'NotFound'): context deadline exceeded, Job 'default/wrong' (status 'Unknown'): no matches for kind "Job" in group ""]

Signed-off-by: Max Jonas Werner mail@makk.es

@makkes makkes force-pushed the better-health-check-messaging branch from 5bb31e2 to 63d83ff Compare June 24, 2021 16:07
@hiddeco hiddeco added area/kstatus Health checking related issues and pull requests enhancement New feature or request labels Jun 24, 2021
@stefanprodan
Copy link
Member

@makkes this looks great! Can you please test this with more objects, let's say one with a wrong api version, one that passes and another one that fails due to image not found or some scheduling timeout. Thanks!

@makkes
Copy link
Member Author

makkes commented Jun 24, 2021

Another spec I tested:

[...]
  healthChecks:
  - apiVersion: v1
    kind: Job
    name: somejob
    namespace: default
  - apiVersion: apps/v1
    kind: Deployment
    name: fails
    namespace: default
  - apiVersion: apps/v1
    kind: Deployment
    name: succeeds
    namespace: default
[...]

Events:

0s          Normal   error    kustomization/test1   Health check failed for [Deployment 'default/fails', Job 'default/somejob']: no matches for kind "Job" in group ""
0s          Normal   error    kustomization/test1   Health check failed for [Job 'default/somejob', Deployment 'default/fails', Deployment 'default/succeeds']: no matches for kind "Job" in group ""

The "succeeds" Deployment is there because its status in "Unknown". I believe the reason for that is that I changed the loop to exit early as soon as an error ocurrs. I'll see whether it would make sense to store all errors in a slice instead.

@stefanprodan
Copy link
Member

So looks like we can capture only one error. Can you please try with 2 failed deployments and one ok.

@makkes
Copy link
Member Author

makkes commented Jun 25, 2021

Yep, I'm on it and also trying to find a neat way to catch all errors. But I need to dig into cli-utils a little more.

@hiddeco
Copy link
Member

hiddeco commented Jun 25, 2021

https://pkg.go.dev/k8s.io/apimachinery/pkg/util/errors may be of help here to collect an aggregation of errors.

@makkes makkes force-pushed the better-health-check-messaging branch from 4301524 to 14da055 Compare July 2, 2021 14:40
@makkes
Copy link
Member Author

makkes commented Jul 2, 2021

So looks like we can capture only one error. Can you please try with 2 failed deployments and one ok.

@stefanprodan this is how it looks like with two failing Deployments and one succeeding:

0s          Normal   error    kustomization/k   Health check failed for [Deployment 'default/fails', Deployment 'default/dep', Deployment 'default/fails2']: Deployment 'default/dep': context deadline exceeded, Deployment 'default/fails2': context deadline exceeded, Deployment 'default/fails': context deadline exceeded

As you can see, it shows all 3 of them as timed out, even though the 'default/dep' one is always healthy. It looks like the timing out of one resources check leads to errors being reported for all of them.

I will test out one more idea on a more precise reporting.

@makkes makkes force-pushed the better-health-check-messaging branch from d494266 to 6304f61 Compare July 2, 2021 20:49
@makkes
Copy link
Member Author

makkes commented Jul 2, 2021

Ok, this version works with 1 successful and 2 failing ones:

0s          Normal   error    kustomization/k   Health check failed for [Deployment 'default/fails2', Deployment 'default/fails', Deployment 'default/dep']: Deployment 'default/fails2': context deadline exceeded, Deployment 'default/fails': context deadline exceeded

@makkes makkes force-pushed the better-health-check-messaging branch from 6304f61 to 5bb78ad Compare July 2, 2021 21:02
@makkes
Copy link
Member Author

makkes commented Jul 2, 2021

@stefanprodan @hiddeco ready for another round of reviews, please. 🙂

@makkes makkes force-pushed the better-health-check-messaging branch from 5bb78ad to f73621b Compare July 4, 2021 19:38
Whenever a health check times out now, the most recently collected
error for each resource will be printed as part of the error message.
This excludes errors for those resources for which no error was
reported in the last update. This is because whenever a timeout
occurs, an error is reported on ALL resources, even those that have
been seen as healthy before.

Also, this commit causes all successfully checked resources to be
omitted in the error event.

Signed-off-by: Max Jonas Werner <mail@makk.es>
@makkes makkes force-pushed the better-health-check-messaging branch from f73621b to bbc4208 Compare July 4, 2021 19:54
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @makkes 🏅 I really like that we also show the status condition reason 👌

@stefanprodan stefanprodan merged commit bb71e74 into fluxcd:main Jul 5, 2021
@makkes makkes deleted the better-health-check-messaging branch July 5, 2021 07:54
@makkes
Copy link
Member Author

makkes commented Jul 22, 2021

@makkes Do you see a scenario where lastStatus[rs.Identifier] potentially could be nil?

[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1bc3985]

goroutine 309 [running]:
github.com/fluxcd/kustomize-controller/controllers.(*KustomizeHealthCheck).Assess(0xc0009c3b78, 0x3b9aca00, 0x0, 0x0)
	/workspace/controllers/kustomization_healthcheck.go:90 +0x425
github.com/fluxcd/kustomize-controller/controllers.(*KustomizationReconciler).checkHealth(0xc00083e5a0, 0x2381328, 0xc0009af710, 0xc00000fd98, 0x1c15899, 0xd, 0xc0009a2ba0, 0x23, 0xc0000587d0, 0xd, ...)
	/workspace/controllers/kustomization_controller.go:744 +0xfd
github.com/fluxcd/kustomize-controller/controllers.(*KustomizationReconciler).reconcile(0xc00083e5a0, 0x2381328, 0xc0009af710, 0x1c15899, 0xd, 0xc0009a2ba0, 0x23, 0xc0000587d0, 0xd, 0x0, ...)
	/workspace/controllers/kustomization_controller.go:385 +0xf7b
github.com/fluxcd/kustomize-controller/controllers.(*KustomizationReconciler).Reconcile(0xc00083e5a0, 0x2381328, 0xc0009af710, 0xc0000587f0, 0x8, 0xc0000587d0, 0xd, 0xc0009af700, 0x0, 0x0, ...)
	/workspace/controllers/kustomization_controller.go:233 +0xe58
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00089dea0, 0x2381280, 0xc000776000, 0x1e16e60, 0xc0002651a0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00089dea0, 0x2381280, 0xc000776000, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc0002eb480, 0xc00089dea0, 0x2381280, 0xc000776000)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:210 +0x425

Hu, I'll need to take a look to find a scenario where that might happen.

@stefanprodan
Copy link
Member

@makkes why is the map made out of pointers? we should change it to values or check for nil before each operation .

@makkes
Copy link
Member Author

makkes commented Jul 26, 2021

@stefanprodan the main reason is that collector.ResourceStatusCollector's ResourceStatuses is also a map of pointers:

type ResourceStatusCollector struct {
[...]
	ResourceStatuses map[object.ObjMetadata]*event.ResourceStatus
[...]
}

But I agree we should just skip nils. I'll put up a PR.

@stefanprodan
Copy link
Member

thanks @makkes

makkes pushed a commit to makkes/kustomize-controller that referenced this pull request Jul 26, 2021
The `lastStatus` map now stores `event.ResourceStatus` values instead
of pointers to them, preventing nil pointers being dereferenced by
accident.

refs fluxcd#374

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit to makkes/kustomize-controller that referenced this pull request Jul 26, 2021
When checking the health status of each declared resource, kstatus
might return nil for certain resources (for whatever reason). In that
case, this information is now conveyed in the health status event.

fluxcd#374

Signed-off-by: Max Jonas Werner <mail@makk.es>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kstatus Health checking related issues and pull requests enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants