Log error message on unhealthy /healthz check #24683
Conversation
Thanks for the PR, mostly LGTM. Just one comment below about potentially rate limiting.
daemon/cmd/agenthealth.go
@@ -42,6 +42,7 @@ func (d *Daemon) startAgentHealthHTTPService() {
 	statusCode := http.StatusOK
 	sr := d.getStatus(true)
 	if isUnhealthy(&sr) {
+		log.WithField("state", sr.Cilium.State).Warnf("/healthz returning unhealthy: %s", sr.Cilium.Msg)
Do you think it might be useful to rate limit logs here? I'm concerned that it would spam the logs if the Agent is having flapping issues.
@christarazi this should log at the rate that the /healthz endpoint is hit by an HTTP client. In the default case I believe that's every 10 seconds, via the kubelet liveness check. It's probably useful to have that log line at that sort of rate so a health check issue is obvious when looking at the logs. Happy to add the rate limiting, however, if you would prefer.
Ah, in that case, let's avoid complicating it and see how it goes. We can always fix it up later if it's a problem.
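For anyone revisiting this later: rate limiting was discussed here and intentionally left out. If log spam from a flapping agent ever becomes a problem, a minimal sketch of what it could look like follows. The helper name, the 30-second interval, and the use of golang.org/x/time/rate are illustrative assumptions, not part of this PR.

// Hypothetical sketch only; this PR does not rate limit the warning.
package healthlog

import (
	"time"

	"github.com/sirupsen/logrus"
	"golang.org/x/time/rate"
)

var log = logrus.New()

// Allow at most one unhealthy warning every 30 seconds (interval chosen
// arbitrarily for illustration).
var unhealthyLogLimiter = rate.NewLimiter(rate.Every(30*time.Second), 1)

// logUnhealthy emits the /healthz warning unless the limiter suppresses it.
func logUnhealthy(state, msg string) {
	if unhealthyLogLimiter.Allow() {
		log.WithField("state", state).Warnf("/healthz returning unhealthy: %s", msg)
	}
}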
Force-pushed from 1c35e55 to 72c7ab5
Just one final nit that I forgot last time: this allows us to remove the dynamic log msg in favor of logfields.
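To illustrate the suggestion, the idea is to keep the log message constant and carry the variable parts as structured fields instead of interpolating them into the message. The sketch below is a rough, self-contained version of that pattern: the key names are placeholder strings, whereas the actual change would use constants from Cilium's pkg/logging/logfields, and state/msg would be sr.Cilium.State and sr.Cilium.Msg in the handler.

// Illustration only: static message plus structured fields; key names are
// placeholders rather than the real logfields constants.
package healthlog

import "github.com/sirupsen/logrus"

var log = logrus.New()

// logUnhealthyStructured logs the unhealthy state with a constant message
// and puts the variable details into fields.
func logUnhealthyStructured(state, msg string) {
	log.WithFields(logrus.Fields{
		"state": state,
		"error": msg,
	}).Warn("/healthz returning unhealthy")
}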
Force-pushed from 0f9ea5f to d4625fb
Could you add the PR description context into the commit msg? This should be the last thing from my side, pending CI results.
Force-pushed from 997e1a6 to 3566565
/test
Force-pushed from deeb0e9 to 627bdd7
/test
@sjdot Just a heads up, each time there's a push, the CI results are thrown away. Were you pushing to resolve some sort of flake that needed a rebase?
Hi @christarazi. Yeah, I think the CI had a flake yesterday. I've been merging to get the latest changes from "main" into the branch, which I assume would block a merge if not done?
Sometimes it's not necessary. We merge with the "rebase & merge" strategy, so in the end the PR is applied on top of main. Right now it looks like CI has passed, so we can wait for approving reviews and then merge once we have them.
@christarazi ok great, good to know!
@christarazi is the failing "Chart CI Push" a blocker here?
Yep, looks like #25524
/test-1.26-net-next
@christarazi looks like your Jenkins instance may be having some issues?
/test-1.26-net-next
We had an outage over the weekend and it should now be resolved.
@christarazi looks like a 404 for the Jenkins job that's pending.
/test-1.26-net-next
Force-pushed from 1048027 to 11e6745
Force-pushed from 11e6745 to f7da082
@christarazi looks like tests were hanging again. There was also a merge conflict, so I've rebased and pushed.
Force-pushed from f7da082 to 77aca1f
/test
Edit: runtime hit #25939
@christarazi looks like more flakes
/test-runtime
@christarazi more flakes, mind kicking it again?
/ci-ginkgo
I think this may need another rebase after the ginkgo changes that were made over the past few weeks. I've tagged this as a release blocker since it would be a nice quality-of-life improvement for 1.14.
I was looking into kubelet liveness checks returning HTTP 503 response codes recently and noticed that there was not any logging in the agent that indicated the issue. The logging I'm adding in this PR allowed me to track down why the liveness checks were failing by giving the error message associated with the current subsystem's unhealthy state.
Signed-off-by: Steven Johnson <sjdot@protonmail.com>
Force-pushed from 77aca1f to f342580
Thanks @ti-mo, I've rebased as suggested.
/test
Re-running a few test flakes, this should be good to go.
I was looking into kubelet liveness checks returning HTTP 503 response codes recently and noticed that there was not any logging in the agent that indicated the issue.
The logging I'm adding in this PR allowed me to track down why the liveness checks were failing by giving the error message associated with the current subsystem's unhealthy state. For example: