Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log error message on unhealthy /healthz check #24683

Merged
merged 1 commit into from Jun 21, 2023

Conversation

sjdot
Copy link
Contributor

@sjdot sjdot commented Apr 1, 2023

I was looking into kubelet liveness checks returning HTTP 503 response codes recently and noticed that there was not any logging in the agent that indicated the issue.

The logging I'm adding in this PR allowed me to track down why the liveness checks were failing by giving the error message associated with the current subsystem's unhealthy state. For example:

WARN  [<timestamp>] cilium.io/agent: 1.11.13 (v1.11.13-e4e51408f9) /healthz returning unhealthy: Kvstore service is not ready: Err: not able to connect to any etcd endpoints (state: Failure, subsys: daemon)

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 1, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Apr 1, 2023
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, mostly LGTM. Just one comment below about rate limiting potentially.

@@ -42,6 +42,7 @@ func (d *Daemon) startAgentHealthHTTPService() {
statusCode := http.StatusOK
sr := d.getStatus(true)
if isUnhealthy(&sr) {
log.WithField("state", sr.Cilium.State).Warnf("/healthz returning unhealthy: %s", sr.Cilium.Msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it might be useful to rate limit logs here? I'm concerned that it would spam the logs if the Agent is having flapping issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christarazi this should log at the rate that the /healthz endpoint is hit by an HTTP client. In the default case I believe that's every 10 seconds by the kubelet liveness check. Probably useful to have that log line at that sort of a rate to make sure that it's obvious there is a health check issue when looking at logs. Happy to add the rate limiting however if you would prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, in that case, let's avoid complicating it and see how it goes. We can always fix it up later if it's a problem.

daemon/cmd/agenthealth.go Outdated Show resolved Hide resolved
@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch from 1c35e55 to 72c7ab5 Compare May 10, 2023 21:32
@sjdot sjdot marked this pull request as ready for review May 10, 2023 21:52
@sjdot sjdot requested review from a team as code owners May 10, 2023 21:52
@sjdot sjdot requested a review from nathanjsweet May 10, 2023 21:52
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/health Relates to the cilium-health component release-note/misc This PR makes changes that have no direct user impact. labels May 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 10, 2023
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one final nit that I forgot last time, allows us to remove the dynamic log msg in favor of logfields.

daemon/cmd/agenthealth.go Outdated Show resolved Hide resolved
@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch 3 times, most recently from 0f9ea5f to d4625fb Compare May 11, 2023 20:23
@christarazi
Copy link
Member

Could you add the PR description context into the commit msg? This should be the last thing from my side, pending CI results.

@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch 2 times, most recently from 997e1a6 to 3566565 Compare May 11, 2023 22:43
@christarazi
Copy link
Member

/test

@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch 4 times, most recently from deeb0e9 to 627bdd7 Compare May 12, 2023 16:03
@christarazi
Copy link
Member

/test

@christarazi
Copy link
Member

@sjdot Just a heads up, each time there's a push, the CI results are thrown away. Were you pushing to resolve some sort of flake that needed a rebase?

@sjdot
Copy link
Contributor Author

sjdot commented May 12, 2023

Hi @christarazi. Yeah think the CI had a flake yesterday.

I've been merging to get the latest changes from "main" into the branch which I assume will block a merge if not done?

@christarazi
Copy link
Member

christarazi commented May 12, 2023

Sometimes it's not necessary. We merge with "rebase & merge" strategy so in the end, the PR is applied on top of main.

Right now it looks like CI has passed, so we can await for approving reviews, and then merge when we have them.

@sjdot
Copy link
Contributor Author

sjdot commented May 12, 2023

@christarazi ok great, good to know!

@sjdot
Copy link
Contributor Author

sjdot commented May 15, 2023

@christarazi is the failing "Chart CI Push" a blocker here?

@christarazi
Copy link
Member

Yep looks like #25524

@christarazi
Copy link
Member

/test-1.26-net-next

@sjdot
Copy link
Contributor Author

sjdot commented May 22, 2023

@christarazi looks like your jenkins instance may be having some issues?

@christarazi
Copy link
Member

/test-1.26-net-next

@christarazi
Copy link
Member

We had an outage over the weekend and it should now be resolved.

@sjdot
Copy link
Contributor Author

sjdot commented May 24, 2023

@christarazi looks like a 404 for the jenkins job that's pending

@christarazi
Copy link
Member

/test-1.26-net-next

@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch from 1048027 to 11e6745 Compare June 2, 2023 13:19
@sjdot sjdot requested a review from a team as a code owner June 2, 2023 13:19
@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch from 11e6745 to f7da082 Compare June 2, 2023 13:24
@sjdot
Copy link
Contributor Author

sjdot commented Jun 2, 2023

@christarazi looks like tests were hanging again. There was also a merge conflict so I've rebased and pushed.

@christarazi christarazi force-pushed the sjdot/health-endpoint-logging branch from f7da082 to 77aca1f Compare June 6, 2023 22:08
@christarazi christarazi removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jun 6, 2023
@christarazi
Copy link
Member

christarazi commented Jun 6, 2023

/test

Edit: runtime hit #25939

@sjdot
Copy link
Contributor Author

sjdot commented Jun 9, 2023

@christarazi looks like more flakes

@christarazi
Copy link
Member

/test-runtime

@sjdot
Copy link
Contributor Author

sjdot commented Jun 18, 2023

@christarazi more flakes, mind kicking it again?

@ti-mo
Copy link
Contributor

ti-mo commented Jun 20, 2023

/ci-ginkgo

@ti-mo ti-mo added the kind/enhancement This would improve or streamline existing functionality. label Jun 20, 2023
@ti-mo
Copy link
Contributor

ti-mo commented Jun 20, 2023

I think this may need another rebase after the ginkgo changes that were made to over the past few weeks.

I've tagged this as a release blocker since it would be a nice quality-of-life improvement for 1.14.

@ti-mo ti-mo added dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. labels Jun 20, 2023
I was looking into kubelet liveness checks returning HTTP 503 response codes recently and noticed that there was not any logging in the agent that indicated the issue.

The logging I'm adding in this PR allowed me to track down why the liveness checks were failing by giving the error message associated with the current subsystem's unhealthy state.

Signed-off-by: Steven Johnson <sjdot@protonmail.com>
@sjdot sjdot force-pushed the sjdot/health-endpoint-logging branch from 77aca1f to f342580 Compare June 20, 2023 13:52
@sjdot
Copy link
Contributor Author

sjdot commented Jun 20, 2023

Thanks @ti-mo, I've rebased as suggested

@ti-mo
Copy link
Contributor

ti-mo commented Jun 20, 2023

/test

Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-running a few test flakes, this should be good to go.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 21, 2023
@ti-mo ti-mo merged commit 4ec2c36 into cilium:main Jun 21, 2023
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. area/health Relates to the cilium-health component dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants