diag: check-engine metric #8996
Pull request overview
Adds a new “check-engine” style diagnostic metric to the diag tile, exposing per-component health status via generated metrics metadata and wiring the diag tile into the global metrics registry.
Changes:
- Extend topology tile configuration to include `diag.is_voting` so health reporting can be vote-aware.
- Introduce a `diag_healthy{health_component=...}` gauge (Bundle/Vote/Replay/Turbine) and generate the corresponding enum + metric metadata.
- Implement health computation and reporting in `fd_diag_tile.c`, and hook `diag` metrics into the “all metrics” table and docs.
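To make the vote-aware, per-component gauge concrete, here is a minimal sketch in C. It is not the code from `fd_diag_tile.c`: the enum names, the `diag_inputs` struct, its fields, and both functions are hypothetical, and the lowercase label values are an assumption. It also assumes that a non-voting validator reports the Vote component as healthy, which the PR may encode differently.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical component indices; the generated enum in the PR may
   use different names and values. */
enum health_component {
  HEALTH_COMPONENT_BUNDLE  = 0,
  HEALTH_COMPONENT_VOTE    = 1,
  HEALTH_COMPONENT_REPLAY  = 2,
  HEALTH_COMPONENT_TURBINE = 3,
  HEALTH_COMPONENT_CNT     = 4
};

/* Hypothetical raw signals a diag tile might sample. */
struct diag_inputs {
  int is_voting;            /* mirrors the diag.is_voting tile config */
  int bundle_connected;
  int vote_landed_recently;
  int replay_caught_up;
  int turbine_receiving;
};

/* Fill out[] with 1 (healthy) or 0 (unhealthy) per component. */
void
diag_compute_health( struct diag_inputs const * in,
                     int                        out[ HEALTH_COMPONENT_CNT ] ) {
  assert( in && out );
  out[ HEALTH_COMPONENT_BUNDLE  ] = !!in->bundle_connected;
  /* Vote-aware (assumption): a non-voting node is never flagged for
     missing votes, so the vote check only applies when is_voting. */
  out[ HEALTH_COMPONENT_VOTE    ] = in->is_voting ? !!in->vote_landed_recently : 1;
  out[ HEALTH_COMPONENT_REPLAY  ] = !!in->replay_caught_up;
  out[ HEALTH_COMPONENT_TURBINE ] = !!in->turbine_receiving;
}

/* Render one labelled sample, e.g. diag_healthy{health_component="vote"} 1.
   Returns what snprintf returns. */
int
diag_format_gauge( char * buf, size_t sz, int comp, int value ) {
  static char const * const names[ HEALTH_COMPONENT_CNT ] = {
    "bundle", "vote", "replay", "turbine"
  };
  assert( comp>=0 && comp<HEALTH_COMPONENT_CNT );
  return snprintf( buf, sz, "diag_healthy{health_component=\"%s\"} %d",
                   names[ comp ], value );
}
```

A caller would sample the inputs once per reporting interval, call `diag_compute_health`, and publish one labelled sample per component.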
Reviewed changes
Copilot reviewed 5 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/disco/topo/fd_topo.h | Adds diag per-tile configuration (is_voting). |
| src/disco/metrics/metrics.xml | Defines HealthComponent enum and diag.Healthy gauge metric. |
| src/disco/metrics/generated/fd_metrics_enums.h | Generated enum constants for health_component. |
| src/disco/metrics/generated/fd_metrics_diag.h | Generated offsets/meta declarations for diag_healthy. |
| src/disco/metrics/generated/fd_metrics_diag.c | Generated FD_METRICS_DIAG metadata table. |
| src/disco/metrics/generated/fd_metrics_all.c | Registers FD_METRICS_DIAG for the diag tile kind. |
| src/disco/diag/fd_diag_tile.c | Implements component health evaluation and emits the new metrics. |
| src/app/firedancer/topology.c | Populates tile->diag.is_voting from config. |
| book/api/metrics-generated.md | Updates generated metrics documentation for diag_healthy. |
84ba77d to 2d06814
2d06814 to 7356736
7356736 to 8350cd6
8350cd6 to 61a846e
    </enum>

    <tile name="diag">
        <gauge name="SystemHealth" enum="SystemHealthIndicator" summary="Per-component health indicator. 0 is unhealthy, 1 is healthy, 2 is unknown/not applicable" />
The cardinality here doesn't make sense. This implies SystemHealth is either in a "bundle", "vote", "replay", or "turbine" state.
You want 4 gauges, and the system health enum to be "Healthy", "Unhealthy". I don't know what "Unknown" represents but that should also be made very precise / clear, or removed.
Hmm ... actually I might be wrong here about the cardinality let me think about this more, but definitely Unknown is too vague
changing unknown to disabled
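As a rough illustration of the value encoding under discussion (0 unhealthy, 1 healthy, 2 renamed from unknown to disabled), the sketch below shows how a consumer might interpret the gauge. All identifiers are hypothetical and are not taken from the generated headers. The aggregate "check engine" rule, under which disabled components are ignored, is an assumption rather than something the PR states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three gauge values discussed above. */
enum system_health {
  SYSTEM_HEALTH_UNHEALTHY = 0,
  SYSTEM_HEALTH_HEALTHY   = 1,
  SYSTEM_HEALTH_DISABLED  = 2  /* component not in use, e.g. not voting */
};

/* Human-readable name for a gauge value; "invalid" for anything else. */
char const *
system_health_str( unsigned long v ) {
  switch( v ) {
    case SYSTEM_HEALTH_UNHEALTHY: return "unhealthy";
    case SYSTEM_HEALTH_HEALTHY:   return "healthy";
    case SYSTEM_HEALTH_DISABLED:  return "disabled";
    default:                      return "invalid";
  }
}

/* Assumed aggregate rule: the "check engine" light is on (1) if any
   component reports unhealthy; disabled components are skipped. */
int
check_engine_on( unsigned long const * gauges, size_t cnt ) {
  assert( gauges || !cnt );
  for( size_t i=0; i<cnt; i++ ) {
    if( gauges[ i ]==SYSTEM_HEALTH_UNHEALTHY ) return 1;
  }
  return 0;
}
```

Giving "disabled" its own value keeps a component that is intentionally off from being confused with one that is failing, which is the ambiguity the review raised about "unknown".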
61a846e to d2b4154
d2b4154 to 915cf51
mmcgee-jump left a comment
copilot comments seem mostly valid?
915cf51 to 37d34f6
37d34f6 to 3c4fb8d
Changed to a scheme with 4 separate metrics. Will update
3c4fb8d to 8e96271
closes #1010