Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Require cluster-wide connectivity before running tests #18153

Merged
merged 9 commits into from
Dec 15, 2021

Conversation

gandro
Copy link
Member

@gandro gandro commented Dec 7, 2021

This PR introduces more in-depth checks on cilium-health in CI.

The first set of commits ensures that cilium-health properly reports connectivity failures in dual-stack setups. Before this PR, cilium-health would probe the connectivity of secondary endpoint addresses, but not report any failures in the health API and thus also not display the failures in cilium status or cilium-health status. To fix this, the API is extended to expose the status of the secondary endpoint addresses probes, and the formatting and output logic is improved to better display and report these kind of failures.

The second set of commits modifies CI to make use of this new functionality, i.e. ciliumHealthPreFlightCheck is extended such that we wait for all paths to be healthy before we start a test suite. This likely fixes a a few flakes in e.g. direct routing mode, where we started the test suite even though the direct routes for IPv6 endpoint addresses have not yet been installed.

Ref: #17895 (should fix that flake, but the test will not yet be unquarantined)

@gandro gandro added area/health Relates to the cilium-health component area/CI-improvement Topic or proposal to improve the Continuous Integration workflow release-note/ci This PR makes changes to the CI. labels Dec 7, 2021
@gandro
Copy link
Member Author

gandro commented Dec 7, 2021

/test

@gandro gandro force-pushed the pr/gandro/cilium-health-dualstack branch from 94df4d3 to 3f12fc5 Compare December 7, 2021 12:56
@maintainer-s-little-helper

This comment has been minimized.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Dec 7, 2021
@gandro gandro force-pushed the pr/gandro/cilium-health-dualstack branch from 3f12fc5 to 169aae3 Compare December 7, 2021 12:57
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Dec 7, 2021
@gandro
Copy link
Member Author

gandro commented Dec 7, 2021

/test

@gandro gandro force-pushed the pr/gandro/cilium-health-dualstack branch from 169aae3 to ff97bfe Compare December 7, 2021 15:35
@gandro
Copy link
Member Author

gandro commented Dec 7, 2021

/test

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.19/75/
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/2147/

Edit: Seems like the readiness check is not sufficient, as it does not consider node count. Will try to modify ciliumHealthPreFlightCheck instead

@gandro gandro force-pushed the pr/gandro/cilium-health-dualstack branch from ff97bfe to 9fd19f0 Compare December 8, 2021 11:16
@gandro
Copy link
Member Author

gandro commented Dec 8, 2021

/test

Edit: Uncovered a bug in K8sDatapathConfig Host firewall * with endpoint routes

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.23-kernel-4.9/77/
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.19/83/
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-5.4/74/

Reproducible on Vagrant via ping -s 8 <healthep>, the reply packet is dropped as invalid due its size. Will push a work around for now. Created #18177

@gandro
Copy link
Member Author

gandro commented Dec 8, 2021

/test

Runtime failed with a provisioning failure: https://jenkins.cilium.io/job/Cilium-PR-Runtime-net-next/810/

Edit: All Jenkins pipeline affected by these changes are green ✔️ Will re-run after pushing a few minor changes and mark ready for review.

@gandro
Copy link
Member Author

gandro commented Dec 8, 2021

/test

@gandro gandro marked this pull request as ready for review December 8, 2021 17:52
@gandro gandro requested review from a team as code owners December 8, 2021 17:52
@gandro gandro requested review from a team December 8, 2021 17:52
@gandro gandro requested a review from a team as a code owner December 8, 2021 17:52
This commit extends the cilium-health API to expose the status of
secondary addresses of the cilium-health endpoint. To ensure
backwards-compatibility of the cilium-health API and JSON output, the
old field is deprecated and a new one is introduced.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit modifies the cilium-health server to populate the newly
added `health-endpoint` field in node health status model. It now also
includes the address and probe status of secondary endpoint paths.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit changes the way cilium health status is displayed in `cilium
status` and `cilium-health status` with regards to secondary addresses
(i.e. IPv6 host or health endpoint IP addresses in dual-stack setups).

Firstly, in the brief output format (i.e. the output of `cilium
status`), we do not consider a host or endpoint as "reachable" if a
secondary path is unreachable. This is similar to how we already treat
partial reachability for a health probe if ICMP succeeded but HTTP did
not.

Secondly, in `cilium-health status --verbose`, we now also print the
status of secondary endpoint addresses (which we usually probe, but
before this commit never displayed anywhere).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit changes the output of `cilium-health status` to always
display the status of unhealthy secondary paths. This makes it easier
for users to spot the reason why cilium health reports connectivity
issues, as this information is usually hidden behind the `--verbose`
flag.  This approach is similar to how we already show unhealth nodes in
the brief output.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit changes the semantics of the `unreachable_nodes` and
`unreachable_health_endpoints` metrics to include nodes or endpoints
where the secondary path (i.e. IPv6 in dual-stack) has no connectitivty.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This modifies the bugtool to collect the verbose `cilium-health status`
output which includes information about secondary paths (i.e. IPv6 paths
in dual-stack setups).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This changes the K8sHealthTest suite to use the newly introduced
`health-endpoint` field insteaed of the deprecated `endpoint` field.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
To work around a bug where host firewall and endpoint routes drops very
small packages on the return path.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
In our CI setup, we have a helper function called
`ciliumHealthPreFlightCheck` which ensures that all nodes participating
in CI have connectivity to each other. However, that check did not
consider connectivity to remote endpoints or alternative paths.

This commit extends that function to ensure that not only connectivity
between the node has been established, but also that connectivity to the
health endpoint both via ICMP and HTTP is healthy.

This commit also introduces checks for secondary addresses (i.e. IPv6
addresses in dual-stack setups). If no secondary addresses are present,
the JSONPath filter used here returns an empty string which will be
ignored.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/cilium-health-dualstack branch from 6dc645e to 8ccdf11 Compare December 14, 2021 09:31
@gandro
Copy link
Member Author

gandro commented Dec 14, 2021

Addressed two nits, rebased and removed the commit which un-qurantines the test (and thus this PR needs no sign-off from team loadbalancer anymore). Main diff:

diff --git a/pkg/health/client/client.go b/pkg/health/client/client.go
index ab2b2f088466..fe93c9834d55 100644
--- a/pkg/health/client/client.go
+++ b/pkg/health/client/client.go
@@ -274,7 +274,7 @@ func GetAllHostAddresses(node *models.NodeStatus) []*models.PathStatus {
 }
 
 // GetEndpointPrimaryAddress returns the PrimaryAddress for the health endpoint
-// within node. If node.Host is nil, returns nil.
+// within node. If node.HealthEndpoint is nil, returns nil.
 func GetEndpointPrimaryAddress(node *models.NodeStatus) *models.PathStatus {
        if node.HealthEndpoint == nil {
                return nil
@@ -283,7 +283,8 @@ func GetEndpointPrimaryAddress(node *models.NodeStatus) *models.PathStatus {
        return node.HealthEndpoint.PrimaryAddress
 }
 
-// GetEndpointSecondaryAddresses returns the secondary host addresses (if any)
+// GetEndpointSecondaryAddresses returns the secondary health endpoint addresses
+// (if any)
 func GetEndpointSecondaryAddresses(node *models.NodeStatus) []*models.PathStatus {
        if node.HealthEndpoint == nil {
                return nil

@gandro
Copy link
Member Author

gandro commented Dec 14, 2021

/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sHealthTest cilium-health Checks status between nodes

Failure Output

FAIL: Kubernetes DNS did not become ready in time

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@gandro gandro requested review from joestringer and removed request for a team and jibi December 14, 2021 10:21
@gandro
Copy link
Member Author

gandro commented Dec 14, 2021

@gandro
Copy link
Member Author

gandro commented Dec 14, 2021

/test-gke

@gandro
Copy link
Member Author

gandro commented Dec 14, 2021

/ci-multicluster

@gandro gandro unassigned jibi Dec 14, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 15, 2021
@nbusseneau nbusseneau merged commit 2bd5abb into cilium:master Dec 15, 2021
@tklauser tklauser added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI-improvement Topic or proposal to improve the Continuous Integration workflow area/health Relates to the cilium-health component backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants