Skip to content

diag: check-engine metric#8996

Merged
jherrera-jump merged 1 commit intofiredancer-io:mainfrom
jherrera-jump:jherrera/diag-check-engine
Mar 27, 2026
Merged

diag: check-engine metric#8996
jherrera-jump merged 1 commit intofiredancer-io:mainfrom
jherrera-jump:jherrera/diag-check-engine

Conversation

@jherrera-jump
Copy link
Copy Markdown
Contributor

@jherrera-jump jherrera-jump commented Mar 20, 2026

closes #1010

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “check-engine” style diagnostic metric to the diag tile, exposing per-component health status via generated metrics metadata and wiring the diag tile into the global metrics registry.

Changes:

  • Extend topology tile configuration to include diag.is_voting so health reporting can be vote-aware.
  • Introduce diag_healthy{health_component=...} gauge (Bundle/Vote/Replay/Turbine) and generate corresponding enum + metric metadata.
  • Implement health computation and reporting in fd_diag_tile.c, and hook diag metrics into the “all metrics” table and docs.

Reviewed changes

Copilot reviewed 5 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/disco/topo/fd_topo.h Adds diag per-tile configuration (is_voting).
src/disco/metrics/metrics.xml Defines HealthComponent enum and diag.Healthy gauge metric.
src/disco/metrics/generated/fd_metrics_enums.h Generated enum constants for health_component.
src/disco/metrics/generated/fd_metrics_diag.h Generated offsets/meta declarations for diag_healthy.
src/disco/metrics/generated/fd_metrics_diag.c Generated FD_METRICS_DIAG metadata table.
src/disco/metrics/generated/fd_metrics_all.c Registers FD_METRICS_DIAG for the diag tile kind.
src/disco/diag/fd_diag_tile.c Implements component health evaluation and emits the new metrics.
src/app/firedancer/topology.c Populates tile->diag.is_voting from config.
book/api/metrics-generated.md Updates generated metrics documentation for diag_healthy.

Comment thread src/disco/diag/fd_diag_tile.c
Comment thread src/disco/diag/fd_diag_tile.c Outdated
Comment thread src/disco/diag/fd_diag_tile.c
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 12 changed files in this pull request and generated 3 comments.

Comment thread src/disco/metrics/metrics.xml Outdated
Comment thread src/discof/tower/fd_tower_tile.c Outdated
Comment thread src/disco/metrics/metrics.xml Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 12 changed files in this pull request and generated 1 comment.

Comment thread src/disco/diag/fd_diag_tile.c Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 11 changed files in this pull request and generated 3 comments.

Comment thread src/disco/diag/fd_diag_tile.c Outdated
Comment thread src/disco/diag/fd_diag_tile.c Outdated
Comment thread src/disco/diag/fd_diag_tile.c
@jherrera-jump jherrera-jump marked this pull request as ready for review March 23, 2026 20:54
@jherrera-jump jherrera-jump force-pushed the jherrera/diag-check-engine branch from 8350cd6 to 61a846e Compare March 23, 2026 21:26
Comment thread src/disco/metrics/metrics.xml Outdated
</enum>

<tile name="diag">
<gauge name="SystemHealth" enum="SystemHealthIndicator" summary="Per-component health indicator. 0 is unhealthy, 1 is healthy, 2 is unknown/not applicable" />
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cardinality here doesn't make sense. This implies SystemHealth is either in a "bundle", "vote", "replay", or "turbine" state.

You want 4 gauges, and the system health enum to be "Healthy", "Unhealthy". I don't know what "Unknown" represents but that should also be made very precise / clear, or removed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ... actually I might be wrong here about the cardinality let me think about this more, but definitely Unknown is too vague

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing unknown to disabled

Copy link
Copy Markdown
Contributor

@mmcgee-jump mmcgee-jump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copilot AI review requested due to automatic review settings March 24, 2026 16:59
@jherrera-jump jherrera-jump force-pushed the jherrera/diag-check-engine branch from 61a846e to d2b4154 Compare March 24, 2026 16:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 11 changed files in this pull request and generated 7 comments.

Comment thread src/disco/diag/fd_diag_tile.c Outdated
Comment thread book/api/metrics-generated.md Outdated
Comment thread src/disco/diag/fd_diag_tile.c Outdated
Comment thread src/disco/diag/fd_diag_tile.c Outdated
Comment thread src/discof/tower/fd_tower_tile.c Outdated
Comment thread src/disco/metrics/metrics.xml Outdated
Comment thread src/disco/metrics/metrics.xml Outdated
@jherrera-jump jherrera-jump force-pushed the jherrera/diag-check-engine branch from d2b4154 to 915cf51 Compare March 25, 2026 16:37
Copy link
Copy Markdown
Contributor

@mmcgee-jump mmcgee-jump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copilot comments seem mostly valid?

Copilot AI review requested due to automatic review settings March 25, 2026 20:41
@jherrera-jump jherrera-jump force-pushed the jherrera/diag-check-engine branch from 915cf51 to 37d34f6 Compare March 25, 2026 20:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated no new comments.

mmcgee-jump
mmcgee-jump previously approved these changes Mar 25, 2026
@jherrera-jump
Copy link
Copy Markdown
Contributor Author

jherrera-jump commented Mar 25, 2026

Changed to scheme with 4 separate metrics. Will update book/guide in a future PR.

Comment thread src/disco/metrics/metrics.xml Outdated
Copilot AI review requested due to automatic review settings March 25, 2026 23:16
@jherrera-jump jherrera-jump force-pushed the jherrera/diag-check-engine branch from 3c4fb8d to 8e96271 Compare March 25, 2026 23:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 11 changed files in this pull request and generated 1 comment.

Comment thread src/disco/diag/fd_diag_tile.c
@jherrera-jump jherrera-jump merged commit 267b4a4 into firedancer-io:main Mar 27, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add check engine light for monitoring

4 participants