StateDB based Health #30925

Merged
merged 12 commits into from
Apr 9, 2024

tommyp1ckles
Contributor

@tommyp1ckles tommyp1ckles commented Feb 23, 2024

healthv2: add new module health implementation based on statedb.

Similar to existing implementation in hive/cell/<health/structured>.go, this provides system health data in a tree structure.
However, this seeks to reimplement the health provider such that it is no longer coupled to pkg/hive/cell.
Because this will eventually be run just as any other Agent module, we can use StateDB for the implementation.

For more information about the original tree structured health reporter take a look at code documentation here. This seeks to provide a similar underlying structure of data while using a much simpler statedb schema.

Note: The only intended user impact of these changes is that health data will now be available with cilium-dbg statedb health and cilium-dbg statedb dump.

This greatly reduces the complexity of the health implementation and provides a more convenient data model for storing health update data.

Furthermore, this fixes the awkward distinction between "module" and "subcomponent" reporting by unifying how these health reports are stored in one place.

pkg/healthv2/provider.go implements a new health provider that uses statedb to store a table of health updates by their fully qualified identifier. This is composed of:

[module-id].[component-id]

Here the module ID is the fully qualified ID of a module, including its parent modules, e.g.:

i. agent.controlplane.bgpv1
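
As a rough illustration of the identifier format, here is a minimal sketch in Go. The `Identifier` type and its fields are hypothetical names chosen for this example and are not taken from pkg/healthv2; the module and component values come from the example output in this description.

```go
package main

import (
	"fmt"
	"strings"
)

// Identifier is a hypothetical type used only for illustration.
// It is NOT the type defined in pkg/healthv2.
type Identifier struct {
	Module    []string // module path, e.g. ["agent", "controlplane", "node-manager"]
	Component []string // component path within the module, e.g. ["nodes-add"]
}

// String joins the module path and the component path with dots,
// producing the "[module-id].[component-id]" form.
func (i Identifier) String() string {
	parts := append(append([]string{}, i.Module...), i.Component...)
	return strings.Join(parts, ".")
}

func main() {
	id := Identifier{
		Module:    []string{"agent", "controlplane", "node-manager"},
		Component: []string{"nodes-add"},
	}
	fmt.Println(id) // prints "agent.controlplane.node-manager.nodes-add"
}
```

Keeping the module and component paths as separate fields means the original "two-component" form is still available after the full path has been built.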

Example cilium statedb health output:

Module                                           Component                                                                                              Level   Message                                                        LastOK                 UpdatedAt              Count
agent.controlplane.endpoint-manager              endpoint-gc                                                                                            OK      endpoint-gc                                                    2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   0
agent.controlplane.l2-announcer                  job-l2-announcer lease-gc                                                                              OK      Running                                                        2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   0
agent.controlplane.auth                          observer-job-auth gc-identity-events                                                                   OK      Primed                                                         2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   0
agent.controlplane.auth                          timer-job-auth gc-cleanup                                                                              OK      Primed                                                         2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   0
agent.controlplane.auth                          observer-job-auth request-authentication                                                               OK      Primed                                                         2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   0
agent.controlplane.node-manager                  nodes-add                                                                                              OK      Node adds successful                                           2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   3
agent.controlplane.node-manager                  nodes-update                                                                                           OK      Node updates successful                                        2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   1
agent.datapath.node-address                      job-node-address-update                                                                                OK      10.244.1.33 (cilium_host), fd00:10:244:1::3c26 (cilium_host)   2024-03-06T23:27:21Z   2024-03-06T23:27:21Z   0
agent.controlplane.bgp-cp                        job-diffstore-events                                                                                   OK      Running                                                        2024-03-06T23:27:22Z   2024-03-06T23:27:22Z   1
agent.controlplane.stale-endpoint-cleanup        job-endpoint-cleanup                                                                                   OK      Running                                                        2024-03-06T23:27:22Z   2024-03-06T23:27:22Z   0
agent.controlplane.envoy-proxy                   timer-job-version-check                                                                                OK      Primed                                                         2024-03-06T23:27:22Z   2024-03-06T23:27:22Z   0
agent.controlplane.service-manager               job-ServiceReconciler                                                                                  OK      2 NodePort frontend addresses                                  2024-03-06T23:27:22Z   2024-03-06T23:27:22Z   0
agent.datapath.l2-responder                      job-l2-responder-reconciler                                                                            OK      Running                                                        2024-03-06T23:27:22Z   2024-03-06T23:27:22Z   0
agent.controlplane.endpoint-manager              cilium-endpoint-1865 (/).policymap-sync                                                                OK      sync-policymap-1865                                            2024-03-06T23:27:23Z   2024-03-06T23:27:23Z   0
agent.controlplane.endpoint-manager              cilium-endpoint-1865 (/).datapath-regenerate                                                           OK      Endpoint regeneration successful                               2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   2
agent.controlplane.endpoint-manager              cilium-endpoint-162 (local-path-storage/local-path-provisioner-7577fdbbfb-pcffg).policymap-sync        OK      sync-policymap-162                                             2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   0
agent.controlplane.endpoint-manager              cilium-endpoint-162 (local-path-storage/local-path-provisioner-7577fdbbfb-pcffg).datapath-regenerate   OK      Endpoint regeneration successful                               2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   1
agent.controlplane.endpoint-manager              cilium-endpoint-2319 (kube-system/coredns-76f75df574-8m4d8).policymap-sync                             OK      sync-policymap-2319                                            2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   0
agent.controlplane.endpoint-manager              cilium-endpoint-2319 (kube-system/coredns-76f75df574-8m4d8).datapath-regenerate                        OK      Endpoint regeneration successful                               2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   1
agent.controlplane.endpoint-manager              cilium-endpoint-605 (kube-system/coredns-76f75df574-ql8kj).policymap-sync                              OK      sync-policymap-605                                             2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   0
agent.controlplane.endpoint-manager              cilium-endpoint-605 (kube-system/coredns-76f75df574-ql8kj).datapath-regenerate                         OK      Endpoint regeneration successful                               2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   1
agent.controlplane.endpoint-manager              cilium-endpoint-3399 (/).policymap-sync                                                                OK      sync-policymap-3399                                            2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   0
agent.controlplane.endpoint-manager              cilium-endpoint-3399 (/).datapath-regenerate                                                           OK      Endpoint regeneration successful                               2024-03-06T23:27:24Z   2024-03-06T23:27:24Z   1
agent.controlplane.node-manager                  background-sync                                                                                        OK      Node validation successful                                     2024-03-06T23:29:57Z   2024-03-06T23:29:57Z   2
agent.controlplane.daemon                        job-sync-hostips                                                                                       OK      Synchronized                                                   2024-03-06T23:30:22Z   2024-03-06T23:30:22Z   3
agent.controlplane.daemon.ep-bpf-prog-watchdog   ep-bpf-prog-watchdog                                                                                   OK      ep-bpf-prog-watchdog                                           2024-03-06T23:30:22Z   2024-03-06T23:30:22Z   6
agent.datapath.sysctl                            job-reconciler-loop                                                                                    OK      OK, 25 objects                                                 2024-03-06T23:30:41Z   2024-03-06T23:30:41Z   62
agent.controlplane.endpoint-manager              cilium-endpoint-2319 (kube-system/coredns-76f75df574-8m4d8).cep-k8s-sync                               OK      sync-to-k8s-ciliumendpoint (2319)                              2024-03-06T23:30:42Z   2024-03-06T23:30:42Z   20
agent.controlplane.endpoint-manager              cilium-endpoint-162 (local-path-storage/local-path-provisioner-7577fdbbfb-pcffg).cep-k8s-sync          OK      sync-to-k8s-ciliumendpoint (162)                               2024-03-06T23:30:42Z   2024-03-06T23:30:42Z   20
agent.controlplane.endpoint-manager              cilium-endpoint-605 (kube-system/coredns-76f75df574-ql8kj).cep-k8s-sync                                OK      sync-to-k8s-ciliumendpoint (605)                               2024-03-06T23:30:42Z   2024-03-06T23:30:42Z   20
agent.datapath.agent-liveness-updater            timer-job-agent-liveness-updater                                                                       OK      OK (107.75µs)                                                  2024-03-06T23:30:43Z   2024-03-06T23:30:43Z   0

Similarly, the component ID represents a tree of subcomponents of a module. Together, these form a tree where each path stores information about the module and component being reported on. Status updates also store the original "two-component" identifier.

This schema is meant to be less opinionated about how to view health data: it is simply a set of health report rows indexed by an identifier path.
Because of this, there is no longer any distinction between a "reporter" (i.e. leaf) and a "scope" (i.e. parent node). This means that a reporter can both have a status of its own and have child reports.

Initially this will be shimmed into the existing health infrastructure in hive/cell. Ultimately we will remove all that code and refactor health reporters using the external github.com/cilium/hive library.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 23, 2024
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/healthv2 branch 5 times, most recently from 0bd731f to 6ab23b5 Compare February 24, 2024 01:06
@tommyp1ckles tommyp1ckles marked this pull request as ready for review February 24, 2024 01:12
@tommyp1ckles tommyp1ckles requested review from a team as code owners February 24, 2024 01:12
@tommyp1ckles tommyp1ckles marked this pull request as draft February 24, 2024 15:15
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/healthv2 branch 2 times, most recently from c9b14dc to a512295 Compare February 24, 2024 16:49
@tommyp1ckles tommyp1ckles marked this pull request as ready for review February 24, 2024 17:04
@tommyp1ckles tommyp1ckles requested a review from a team as a code owner February 24, 2024 17:04
Contributor

@learnitall learnitall left a comment

For API I just have a small nit, otherwise LGTM.

api/v1/openapi.yaml (outdated, resolved)
@tommyp1ckles tommyp1ckles marked this pull request as draft February 27, 2024 04:00
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/healthv2 branch 6 times, most recently from 6c1b0ae to f7ea5e0 Compare March 4, 2024 17:49
@tommyp1ckles
Contributor Author

/test

The 'cilium status' CLI output now uses the statedb remote table endpoint to fetch health data.
Thus we can remove the old module-health-specific code and OpenAPI schema.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Some code will remain as it is used for testing in other places.
We will remove this once we have switched completely to healthv2.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Adds codeowners entry for healthv2.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
operator/pkg/bgpv2 relies on having a health provider.
This adds healthv2, as well as its dependency statedb, to
fix the operator hive.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
This fixes test failures in statedb/reconciler. It
also replaces the to-JSON hack used in the previous
health implementation.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Sets up a hive fixture, then queries the statedb table directly to
check that the expected updates occurred.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles
Contributor Author

/test

Member

@qmonnet qmonnet left a comment

Looks good on docs, thanks! Up to you, but I think it would be a good idea to add a quick note in the upgrade notes about the breaking changes in cilium-dbg usage.

+1, please do add a note. Doc changes look good otherwise.

@qmonnet qmonnet added area/cli Impacts the command line interface of any command in the repository. area/health Relates to the cilium-health component labels Apr 8, 2024
Due to the change of health provider backend, there are some
changes to how data is displayed that may be relevant to users
upgrading to v1.16.

In addition, health status data is now strictly sorted by the fully
qualified health status identifier (i.e. [fq-module-id].[fq-component-id]).

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles
Contributor Author

/test

Contributor

@derailed derailed left a comment

@tommyp1ckles Thanks for the updates!

Member

@qmonnet qmonnet left a comment

Thanks!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 8, 2024
@tommyp1ckles tommyp1ckles added this pull request to the merge queue Apr 9, 2024
Merged via the queue into cilium:main with commit 7ed375a Apr 9, 2024
62 checks passed
@tommyp1ckles tommyp1ckles deleted the pr/tp/healthv2 branch April 9, 2024 09:20
Labels
area/cli Impacts the command line interface of any command in the repository. area/health Relates to the cilium-health component ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.