Log error message on unhealthy /healthz check #24683
Conversation
Thanks for the PR, mostly LGTM. Just one comment below about potentially rate limiting.
daemon/cmd/agenthealth.go
@@ -42,6 +42,7 @@ func (d *Daemon) startAgentHealthHTTPService() {
 	statusCode := http.StatusOK
 	sr := d.getStatus(true)
 	if isUnhealthy(&sr) {
+		log.WithField("state", sr.Cilium.State).Warnf("/healthz returning unhealthy: %s", sr.Cilium.Msg)
Do you think it might be useful to rate limit logs here? I'm concerned that it would spam the logs if the Agent is having flapping issues.
@christarazi this should log at the rate that the /healthz endpoint is hit by an HTTP client. In the default case I believe that's every 10 seconds, via the kubelet liveness check. It's probably useful to have that log line at that sort of rate so a health check issue is obvious when looking at the logs. Happy to add the rate limiting, however, if you would prefer.
Ah, in that case, let's avoid complicating it and see how it goes. We can always fix it up later if it's a problem.
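For anyone revisiting this later: rate limiting was discussed here and intentionally left out. If log spam from a flapping agent ever becomes a problem, a minimal sketch of what it could look like follows. The helper name, the 30-second interval, and the use of golang.org/x/time/rate are illustrative assumptions, not part of this PR.

// Hypothetical sketch only; this PR does not rate limit the warning.
package healthlog

import (
	"time"

	"github.com/sirupsen/logrus"
	"golang.org/x/time/rate"
)

var log = logrus.New()

// Allow at most one unhealthy warning every 30 seconds (interval chosen
// arbitrarily for illustration).
var unhealthyLogLimiter = rate.NewLimiter(rate.Every(30*time.Second), 1)

// logUnhealthy emits the /healthz warning unless the limiter suppresses it.
func logUnhealthy(state, msg string) {
	if unhealthyLogLimiter.Allow() {
		log.WithField("state", state).Warnf("/healthz returning unhealthy: %s", msg)
	}
}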
Force-pushed from 1c35e55 to 72c7ab5
Just one final nit that I forgot last time: this allows us to remove the dynamic log msg in favor of logfields.
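To illustrate the suggestion, the idea is to keep the log message constant and carry the variable parts as structured fields instead of interpolating them into the message. The sketch below is a rough, self-contained version of that pattern: the key names are placeholder strings, whereas the actual change would use constants from Cilium's pkg/logging/logfields, and state/msg would be sr.Cilium.State and sr.Cilium.Msg in the handler.

// Illustration only: static message plus structured fields; key names are
// placeholders rather than the real logfields constants.
package healthlog

import "github.com/sirupsen/logrus"

var log = logrus.New()

// logUnhealthyStructured logs the unhealthy state with a constant message
// and puts the variable details into fields.
func logUnhealthyStructured(state, msg string) {
	log.WithFields(logrus.Fields{
		"state": state,
		"error": msg,
	}).Warn("/healthz returning unhealthy")
}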
Force-pushed from 0f9ea5f to d4625fb
Could you add the PR description context into the commit msg? This should be the last thing from my side, pending CI results.
Force-pushed from 997e1a6 to 3566565
/test
Force-pushed from deeb0e9 to 627bdd7
/test
@sjdot Just a heads up, each time there's a push, the CI results are thrown away. Were you pushing to resolve some sort of flake that needed a rebase?
Hi @christarazi. Yeah, I think the CI had a flake yesterday. I've been merging to get the latest changes from "main" into the branch, which I assume would block a merge if not done?
Sometimes it's not necessary. We merge with the "rebase & merge" strategy, so in the end the PR is applied on top of main. Right now it looks like CI has passed, so we can wait for approving reviews and then merge once we have them.
@christarazi ok great, good to know!
@christarazi is the failing "Chart CI Push" a blocker here?
Yep, looks like #25524
/test-1.26-net-next
@christarazi looks like your Jenkins instance may be having some issues?
/test-1.26-net-next
We had an outage over the weekend and it should now be resolved.
@christarazi looks like a 404 for the Jenkins job that's pending.
/test-1.26-net-next
Force-pushed from 1048027 to 11e6745
Force-pushed from 11e6745 to f7da082
@christarazi looks like tests were hanging again. There was also a merge conflict, so I've rebased and pushed.
Force-pushed from f7da082 to 77aca1f
/test
Edit: runtime hit #25939
@christarazi looks like more flakes
/test-runtime
@christarazi more flakes, mind kicking it again?
/ci-ginkgo
I think this may need another rebase after the ginkgo changes that were made over the past few weeks. I've tagged this as a release blocker since it would be a nice quality-of-life improvement for 1.14.
I was looking into kubelet liveness checks returning HTTP 503 response codes recently and noticed that there was not any logging in the agent that indicated the issue. The logging I'm adding in this PR allowed me to track down why the liveness checks were failing by giving the error message associated with the current subsystem's unhealthy state.
Signed-off-by: Steven Johnson <sjdot@protonmail.com>
Force-pushed from 77aca1f to f342580
Thanks @ti-mo, I've rebased as suggested.
/test
Re-running a few test flakes, this should be good to go.
I was looking into kubelet liveness checks returning HTTP 503 response codes recently and noticed that there was not any logging in the agent that indicated the issue.
The logging I'm adding in this PR allowed me to track down why the liveness checks were failing by giving the error message associated with the current subsystem's unhealthy state. For example: