
health: Fix cluster-health-port for health endpoint #18061

Merged
merged 3 commits into from Nov 30, 2021

Conversation

Member

@gandro gandro commented Nov 30, 2021

To determine cluster health, Cilium exposes an HTTP server both on each
node and on the artificial health endpoint running on each node. The
port used for this HTTP server is the same in both cases and can be
configured via cluster-health-port (introduced in #16926); it defaults
to 4240.

This commit fixes a bug where the port specified by
cluster-health-port was not passed to the Cilium health endpoint
responder, which meant that cilium-health-responder was always
listening on the default port instead of the one configured by the user,
while the probe tried to connect via cluster-health-port. This
resulted in the cluster being reported as unhealthy whenever
cluster-health-port was set to a non-default value (which is the case
for our v1.11 OpenShift OLM templates):

```
Nodes:
  gandro-7bmc2-worker-2-blgxf.c.cilium-dev.internal (localhost):
    Host connectivity to 10.0.128.2:
      ICMP to stack:   OK, RTT=634.746µs
      HTTP to agent:   OK, RTT=228.066µs
    Endpoint connectivity to 10.128.11.73:
      ICMP to stack:   OK, RTT=666.83µs
      HTTP to agent:   Get "http://10.128.11.73:9940/hello": dial tcp 10.128.11.73:9940: connect: connection refused
```

Fixes: e624868 ("health: Add a flag to set HTTP port")

⚠️ Marking as release blocker since this is required for our v1.11 OpenShift OLM templates.

@gandro gandro added area/health Relates to the cilium-health component release-note/misc This PR makes changes that have no direct user impact. release-blocker/1.11 This issue will prevent the release of the next version of Cilium. needs-backport/1.11 labels Nov 30, 2021
@gandro gandro requested a review from a team as a code owner November 30, 2021 13:41
@gandro gandro requested review from a team, tklauser and joestringer November 30, 2021 13:41
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.0 Nov 30, 2021
@gandro gandro mentioned this pull request Nov 30, 2021
17 tasks
Member

@tklauser tklauser left a comment


Nice find!

Non-blocking nit in case another revision is needed: there is a tiny typo in the commit message of the first commit, 2nd paragraph (and the PR description as well): clsuter-health-port instead of cluster-health-port.

To determine cluster health, Cilium exposes an HTTP server both on each
node and on the artificial health endpoint running on each node. The
port used for this HTTP server is the same in both cases and can be
configured via `cluster-health-port` (introduced in cilium#16926); it
defaults to 4240.

This commit fixes a bug where the port specified by
`cluster-health-port` was not passed to the Cilium health endpoint
responder, which meant that `cilium-health-responder` was always
listening on the default port instead of the one configured by the user,
while the probe tried to connect via `cluster-health-port`. This
resulted in the cluster being reported as unhealthy whenever
`cluster-health-port` was set to a non-default value (which is the case
for our v1.11 OpenShift OLM templates):

```
Nodes:
  gandro-7bmc2-worker-2-blgxf.c.cilium-dev.internal (localhost):
    Host connectivity to 10.0.128.2:
      ICMP to stack:   OK, RTT=634.746µs
      HTTP to agent:   OK, RTT=228.066µs
    Endpoint connectivity to 10.128.11.73:
      ICMP to stack:   OK, RTT=666.83µs
      HTTP to agent:   Get "http://10.128.11.73:9940/hello": dial tcp 10.128.11.73:9940: connect: connection refused
```

Fixes: e624868 ("health: Add a flag to set HTTP port")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This sets a custom value for `cluster-health-port` in the K8sHealth test
suite, to ensure we support setting a custom health port (e.g. used in
OpenShift, which we do not test in our CI at the moment).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/fix-clusterhealth-port branch from 765e7d0 to 95f54c9 Compare November 30, 2021 15:07
@gandro
Member Author

gandro commented Nov 30, 2021

/test

```diff
@@ -304,7 +304,7 @@ func LaunchAsEndpoint(baseCtx context.Context,
 	pidfile := filepath.Join(option.Config.StateDir, PidfilePath)
 	prog := "ip"
-	args := []string{"netns", "exec", netNSName, binaryName, "--pidfile", pidfile}
+	args := []string{"netns", "exec", netNSName, binaryName, "--listen", strconv.Itoa(option.Config.ClusterHealthPort), "--pidfile", pidfile}
```
Member Author

For cross-reference:

```go
flag.IntVar(&listen, "listen", healthDefaults.HTTPPathPort, "Port on which the responder listens")
```

This is a cleanup commit with no functional change.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro
Member Author

gandro commented Nov 30, 2021

Decided to add a drive-by cleanup of the responder while at it.

@gandro
Member Author

gandro commented Nov 30, 2021

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Request from k8s1 to service http://[fd04::11]:31022 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

Member

@joestringer joestringer left a comment


Thanks!

@joestringer
Member

@gandro the individual test case failures look similar to #16938. However, in all of the other cases I've been tracking recently, that issue only occurs in an EKS environment when a cilium-cli test run succeeds without encryption, encryption is then enabled, and the cilium-cli tests are re-run. That suggests the problem there is either test pollution or some sort of resource leak triggered by the previous CI run.

In the case of https://github.com/cilium/cilium/runs/4370353553?check_suite_focus=true, the test is failing on the first try in a kind environment, so I think we should treat it as a different failure.

I agree it doesn't seem related to the changes in this PR though, so I'm OK with going ahead to merge this PR regardless of that failure.

@joestringer
Member

joestringer commented Nov 30, 2021

Looking at the test-1.16-netnext job failure, I see four types of failures:

  1. Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:, appears to be CI: K8sBandwidthTest Checks Bandwidth Rate-Limiting Checks Pod to Pod bandwidth, vxlan tunneling: Operator Failed to update lock: use of closed network connection #18053
  2. FAIL: failed to ensure kubectl version: failed to download kubectl, CI: failed to ensure kubectl version: failed to download kubectl #18060
  3. BeforeAll failures which are typically triggered by some bad state in the cluster left behind by a previous test, eg because kubectl couldn't be set up
  4. Test suite timed out after 2h50m0s, because too many things failed, taking too long.

The test-1.22-4.19 job failure looks like #18014 .

LGTM to merge.

@joestringer joestringer added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 30, 2021
@qmonnet qmonnet merged commit 6334f98 into cilium:master Nov 30, 2021
@qmonnet qmonnet mentioned this pull request Nov 30, 2021
18 tasks
@gandro
Member Author

gandro commented Dec 1, 2021

Thanks a ton for triaging the failures @joestringer 🙏 And yes, agreed that the Kind failure looks new :/

@nathanjsweet nathanjsweet added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport done to v1.11 in 1.11.0 Dec 2, 2021