
health: Fix cluster-health-port for health endpoint #18061

Merged
merged 3 commits into from Nov 30, 2021

Conversation

Member

@gandro gandro commented Nov 30, 2021

To determine cluster health, Cilium exposes an HTTP server both on each
node and on the artificial health endpoint running on each node. The
port used for this HTTP server is the same in both cases and can be
configured via cluster-health-port (introduced in #16926); it defaults
to 4240.

This commit fixes a bug where the port specified by
cluster-health-port was not passed to the Cilium health endpoint
responder, which meant that cilium-health-responder was always
listening on the default port instead of the one configured by the user,
while the probe tried to connect via cluster-health-port. This
resulted in the cluster being reported as unhealthy whenever
cluster-health-port was set to a non-default value (which is the case
for our v1.11 OpenShift OLM templates):

```
Nodes:
  gandro-7bmc2-worker-2-blgxf.c.cilium-dev.internal (localhost):
    Host connectivity to 10.0.128.2:
      ICMP to stack:   OK, RTT=634.746µs
      HTTP to agent:   OK, RTT=228.066µs
    Endpoint connectivity to 10.128.11.73:
      ICMP to stack:   OK, RTT=666.83µs
      HTTP to agent:   Get "http://10.128.11.73:9940/hello": dial tcp 10.128.11.73:9940: connect: connection refused
```

Fixes: e624868 ("health: Add a flag to set HTTP port")

⚠️ Marking as release blocker since this is required for our v1.11 OpenShift OLM templates.

@gandro gandro added area/health Relates to the cilium-health component release-note/misc This PR makes changes that have no direct user impact. release-blocker/1.11 This issue will prevent the release of the next version of Cilium. needs-backport/1.11 labels Nov 30, 2021
@gandro gandro requested a review from a team as a code owner November 30, 2021 13:41
@gandro gandro requested review from a team, tklauser and joestringer November 30, 2021 13:41
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.0 Nov 30, 2021
@gandro gandro mentioned this pull request Nov 30, 2021
17 tasks
Member

@tklauser tklauser left a comment


Nice find!

Non-blocking nit in case another revision is needed: there is a tiny typo in the commit message of the first commit, 2nd paragraph (and the PR description as well): clsuter-health-port instead of cluster-health-port.

To determine cluster health, Cilium exposes an HTTP server both on each
node and on the artificial health endpoint running on each node. The
port used for this HTTP server is the same in both cases and can be
configured via `cluster-health-port` (introduced in cilium#16926); it
defaults to 4240.

This commit fixes a bug where the port specified by
`cluster-health-port` was not passed to the Cilium health endpoint
responder, which meant that `cilium-health-responder` was always
listening on the default port instead of the one configured by the user,
while the probe tried to connect via `cluster-health-port`. This
resulted in the cluster being reported as unhealthy whenever
`cluster-health-port` was set to a non-default value (which is the case
for our v1.11 OpenShift OLM templates):

```
Nodes:
  gandro-7bmc2-worker-2-blgxf.c.cilium-dev.internal (localhost):
    Host connectivity to 10.0.128.2:
      ICMP to stack:   OK, RTT=634.746µs
      HTTP to agent:   OK, RTT=228.066µs
    Endpoint connectivity to 10.128.11.73:
      ICMP to stack:   OK, RTT=666.83µs
      HTTP to agent:   Get "http://10.128.11.73:9940/hello": dial tcp 10.128.11.73:9940: connect: connection refused
```

Fixes: e624868 ("health: Add a flag to set HTTP port")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This sets a custom value for `cluster-health-port` in the K8sHealth test
suite, to ensure we support setting a custom health port (e.g. used in
OpenShift, which we do not test in our CI at the moment).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/fix-clusterhealth-port branch from 765e7d0 to 95f54c9 Compare November 30, 2021 15:07
@gandro
Member Author

gandro commented Nov 30, 2021

/test

```diff
@@ -304,7 +304,7 @@ func LaunchAsEndpoint(baseCtx context.Context,
 	pidfile := filepath.Join(option.Config.StateDir, PidfilePath)
 	prog := "ip"
-	args := []string{"netns", "exec", netNSName, binaryName, "--pidfile", pidfile}
+	args := []string{"netns", "exec", netNSName, binaryName, "--listen", strconv.Itoa(option.Config.ClusterHealthPort), "--pidfile", pidfile}
```
Member Author

For cross-reference:

```go
flag.IntVar(&listen, "listen", healthDefaults.HTTPPathPort, "Port on which the responder listens")
```

This is a cleanup commit with no functional change.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro
Member Author

gandro commented Nov 30, 2021

Decided to add a drive-by cleanup of the responder while at it.

@gandro
Member Author

gandro commented Nov 30, 2021

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Request from k8s1 to service http://[fd04::11]:31022 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

Member

@joestringer joestringer left a comment


Thanks!

@joestringer
Member

@gandro the individual test case failures look similar to #16938. However, in all of the other cases I've been tracking recently, that issue only occurs in an EKS environment when a cilium-cli test run succeeds without encryption, encryption is then enabled, and the cilium-cli tests are re-run. That suggests the problem there is either test pollution or some sort of resource leak triggered by the previous CI run.

In the case of https://github.com/cilium/cilium/runs/4370353553?check_suite_focus=true, the test is failing on the first try in a kind environment, so I think we should treat it as a different failure.

I agree it doesn't seem related to the changes in this PR though, so I'm OK with going ahead to merge this PR regardless of that failure.

@joestringer
Member

joestringer commented Nov 30, 2021

Looking at the test-1.16-netnext job failure, I see four types of failures:

  1. Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:, appears to be CI: K8sBandwidthTest Checks Bandwidth Rate-Limiting Checks Pod to Pod bandwidth, vxlan tunneling: Operator Failed to update lock: use of closed network connection #18053
  2. FAIL: failed to ensure kubectl version: failed to download kubectl, CI: failed to ensure kubectl version: failed to download kubectl #18060
  3. BeforeAll failures which are typically triggered by some bad state in the cluster left behind by a previous test, eg because kubectl couldn't be set up
  4. Test suite timed out after 2h50m0s, because too many things failed, taking too long.

The test-1.22-4.19 job failure looks like #18014 .

LGTM to merge.

@joestringer joestringer added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 30, 2021
@qmonnet qmonnet merged commit 6334f98 into cilium:master Nov 30, 2021
@qmonnet qmonnet mentioned this pull request Nov 30, 2021
18 tasks
@gandro
Member Author

gandro commented Dec 1, 2021

Thanks a ton for triaging the failures @joestringer 🙏 And yes, agreed that the Kind failure looks new :/

@nathanjsweet nathanjsweet added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport done to v1.11 in 1.11.0 Dec 2, 2021