health: Fix cluster-health-port for health endpoint #18061
Conversation
Nice find!
Non-blocking nit in case another revision is needed: there is a tiny typo in the commit message of the first commit, 2nd paragraph (and in the PR description as well): `clsuter-health-port` instead of `cluster-health-port`.
To determine cluster health, Cilium exposes an HTTP server both on each node and on the artificial health endpoint running on each node. The port used for this HTTP server is the same in both cases and can be configured via `cluster-health-port` (introduced in cilium#16926); it defaults to 4240.

This commit fixes a bug where the port specified by `cluster-health-port` was not passed to the Cilium health endpoint responder. This meant that `cilium-health-responder` was always listening on the default port instead of the one configured by the user, while the probe tried to connect via `cluster-health-port`. As a result, the cluster was reported as unhealthy whenever `cluster-health-port` was set to a non-default value (which is the case for our OpenShift OLM for v1.11):

```
Nodes:
  gandro-7bmc2-worker-2-blgxf.c.cilium-dev.internal (localhost):
    Host connectivity to 10.0.128.2:
      ICMP to stack:   OK, RTT=634.746µs
      HTTP to agent:   OK, RTT=228.066µs
    Endpoint connectivity to 10.128.11.73:
      ICMP to stack:   OK, RTT=666.83µs
      HTTP to agent:   Get "http://10.128.11.73:9940/hello": dial tcp 10.128.11.73:9940: connect: connection refused
```

Fixes: e624868 ("health: Add a flag to set HTTP port")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This sets a custom value for `cluster-health-port` in the K8sHealth test suite, to ensure we support setting a custom health port (e.g. as used in OpenShift, which we do not test in our CI at the moment).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Force-pushed from 765e7d0 to 95f54c9
/test
```diff
@@ -304,7 +304,7 @@ func LaunchAsEndpoint(baseCtx context.Context,
 	pidfile := filepath.Join(option.Config.StateDir, PidfilePath)
 	prog := "ip"
-	args := []string{"netns", "exec", netNSName, binaryName, "--pidfile", pidfile}
+	args := []string{"netns", "exec", netNSName, binaryName, "--listen", strconv.Itoa(option.Config.ClusterHealthPort), "--pidfile", pidfile}
```
For cross-reference: `cilium/cilium-health/responder/main.go`, line 40 in aef5002:

```go
flag.IntVar(&listen, "listen", healthDefaults.HTTPPathPort, "Port on which the responder listens")
```
This is a cleanup commit with no functional change.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Decided to add a drive-by cleanup of the responder while at it.
/test

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so it may be related to your PR. If it is a flake, comment
Thanks!
@gandro the individual test case failures look similar to #16938. However, in all of the other cases I've been tracking recently, that issue only occurs in an EKS environment when a cilium-cli test run succeeded without encryption, then encryption was enabled, then the cilium-cli tests were re-run. That suggests the problem is either test pollution or some sort of resource leak triggered by the previous CI run. The case in https://github.com/cilium/cilium/runs/4370353553?check_suite_focus=true is failing on the first try in a kind environment, so I think we should treat it as a different failure. I agree it doesn't seem related to the changes in this PR though, so I'm OK with going ahead and merging this PR regardless of that failure.
Looking at the test-1.16-netnext job failure, I see four types of failures:

The test-1.22-4.19 job failure looks like #18014. LGTM to merge.
Thanks a ton for triaging the failures @joestringer 🙏 And yes, agreed that the Kind failure looks new :/