
feat: Implement runtime observability metrics & dashboard specs#30861

Closed
mnkiefer wants to merge 4 commits into main from add-sashboard-specs

Conversation

@mnkiefer
Collaborator

@mnkiefer mnkiefer commented May 7, 2026

  • Adds runtime_observability.cjs to collect and compute runtime metrics including token usage, error counts, and cache efficiency.
  • Enhances send_otlp_span.cjs to include observability data in OTLP spans for better monitoring.
  • Creates sentry-otel-dashboard-spec.md to define the Sentry OTEL dashboard and alert model for gh-aw runtime telemetry.

@mnkiefer mnkiefer marked this pull request as ready for review May 7, 2026 16:12
Copilot AI review requested due to automatic review settings May 7, 2026 16:12
@mnkiefer mnkiefer changed the title from "feat: Implement runtime observability metrics and dashboard specs" to "feat: Implement runtime observability metrics & dashboard specs" May 7, 2026

@github-actions github-actions Bot mentioned this pull request May 7, 2026
Contributor

Copilot AI left a comment


Pull request overview

Implements a runtime observability baseline for gh-aw by collecting runtime metrics from local run artifacts and attaching them to OTLP conclusion spans, alongside a Sentry dashboard/alert spec to query those attributes.

Changes:

  • Added runtime_observability.cjs to derive posture/runtime status, token/cost metrics, cache efficiency, blocked request counts, and summary markdown.
  • Enriched send_otlp_span.cjs conclusion spans with new gh-aw.observability.* and gh-aw.optimization.* attributes (with test updates).
  • Added scratchpad/sentry-otel-dashboard-spec.md and linked it from scratchpad/dev.md.
Summary per file:

  • scratchpad/sentry-otel-dashboard-spec.md: Defines dashboard panels, saved searches, and alert thresholds for querying conclusion-span telemetry in Sentry.
  • scratchpad/dev.md: Adds the new dashboard spec to the Related Documentation index and changelog.
  • actions/setup/js/send_otlp_span.cjs: Emits new runtime observability + optimization attributes on job conclusion spans.
  • actions/setup/js/send_otlp_span.test.cjs: Extends tests to validate the newly emitted runtime observability/optimization attributes.
  • actions/setup/js/runtime_observability.cjs: New shared collector for runtime metrics and step-summary markdown generation.
  • actions/setup/js/generate_observability_summary.cjs: Refactors summary generation to use the new shared runtime observability collector.
  • actions/setup/js/generate_observability_summary.test.cjs: Updates fixtures/assertions for the expanded observability summary output.

Copilot's findings


Comments suppressed due to low confidence (1)

actions/setup/js/runtime_observability.cjs:33

  • countBlockedRequests() JSON.parses every non-empty line in the JSONL files. For large gateway logs this is unnecessarily expensive; follow the pattern used in gateway_difc_filtered.cjs (skip lines that don’t contain "DIFC_FILTERED" before parsing) or reuse that parser to avoid parsing unrelated REQUEST/RESPONSE entries.
      const lines = fs.readFileSync(path, "utf8").split("\n");
      for (const raw of lines) {
        const line = raw.trim();
        if (!line) continue;
        try {
          const entry = JSON.parse(line);
          if (entry && entry.type === "DIFC_FILTERED") {
            total += 1;
          }
  • Files reviewed: 7/7 changed files
  • Comments generated: 4
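The pre-filter that finding suggests can be sketched as follows. This is a minimal illustration, not the actual gateway_difc_filtered.cjs implementation; `countBlockedRequestsFromText` is a hypothetical helper that operates on already-read file contents:

```javascript
// Hypothetical helper illustrating the suggested optimization: skip lines
// that cannot contain a DIFC_FILTERED event before paying for JSON.parse.
function countBlockedRequestsFromText(text) {
  let total = 0;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    // Cheap substring check first: most REQUEST/RESPONSE lines are
    // rejected here without ever invoking the JSON parser.
    if (!line || !line.includes("DIFC_FILTERED")) continue;
    try {
      const entry = JSON.parse(line);
      if (entry && entry.type === "DIFC_FILTERED") total += 1;
    } catch {
      // Skip malformed lines.
    }
  }
  return total;
}
```

The substring check is conservative: any line that would parse to a `DIFC_FILTERED` entry necessarily contains that substring, so no events are missed.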

Comment on lines +23 to +41
for (const path of GATEWAY_EVENT_PATHS) {
  try {
    const lines = fs.readFileSync(path, "utf8").split("\n");
    for (const raw of lines) {
      const line = raw.trim();
      if (!line) continue;
      try {
        const entry = JSON.parse(line);
        if (entry && entry.type === "DIFC_FILTERED") {
          total += 1;
        }
      } catch {
        // Skip malformed lines.
      }
    }
  } catch {
    // Missing gateway logs are normal for many runs.
  }
}
  createdItemTypes: uniqueCreatedItemTypes(items),
  outputErrorCount: errors.length,
  warningCount,
  turnCount: typeof runtimeMetrics.turns === "number" ? runtimeMetrics.turns : 0,

attributes.push(buildAttr("gh-aw.observability.created_item_count", runtimeObservability.raw.createdItemCount));
attributes.push(buildAttr("gh-aw.observability.output_error_count", runtimeObservability.raw.outputErrorCount));
attributes.push(buildAttr("gh-aw.observability.warning_count", runtimeObservability.raw.warningCount));
attributes.push(buildAttr("gh-aw.observability.turn_count", runtimeObservability.raw.turnCount));
Comment thread scratchpad/dev.md Outdated
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🧪 Test Quality Sentinel Report

Test Quality Score: 80/100

Excellent test quality

  • New/modified tests analyzed: 2 test scenarios (no new test functions; both files expanded existing tests)
  • ✅ Design tests (behavioral contracts): 2 (100%)
  • ⚠️ Implementation tests (low value): 0 (0%)
  • Tests with error/edge cases: 2 (100%)
  • Duplicate test clusters: 0
  • Test inflation detected: ⚠️ Yes — generate_observability_summary.test.cjs (+10 lines) vs generate_observability_summary.cjs (+3 lines) = 3.3:1 ratio
  • 🚨 Coding-guideline violations: None

⚠️ Note: No new test functions were added — both test files expanded assertions within existing it(...) blocks. The 2 modified test scenarios were used as the analysis scope.


Test Classification Details

  • generate_observability_summary main scenario (actions/setup/js/generate_observability_summary.test.cjs): ✅ Design, no issues detected
  • sendJobConclusionSpan with observability data (actions/setup/js/send_otlp_span.test.cjs): ✅ Design, no issues detected

Analysis

generate_observability_summary.test.cjs

7 new expect(summary).toContain(...) assertions were added to the existing integration-style scenario. They verify that when agent_usage.json and agent-stdio.log are present, the generated markdown summary includes the correct computed values: runtime status, token counts, estimated cost, turn count, cache efficiency, runtime risk score, and optimization score. This is a solid behavioral contract test — it checks observable output from real data, including an error state (errors: ["validation failed"]).

send_otlp_span.test.cjs

16 new expect(attrs["gh-aw.*"]) assertions were added to verify that the new observability attributes are correctly mapped into OTLP span attributes. This includes schema version, posture, runtime status, blocked requests, output errors, all token metrics, cache efficiency, estimated cost, action minutes, intensity level, runtime risk score, and optimization score. The mock setup was also extended to include agent_usage.json and gateway.jsonl data — testing realistic data flow through the attribute-mapping logic.


⚠️ Observations (Non-Blocking)

Test Inflation: generate_observability_summary.test.cjs

The test file added 10 lines vs. only 3 lines added to generate_observability_summary.cjs. This gives a 3.3:1 ratio (threshold: 2:1). However, context matters here: the production file had 117 deletions (a large refactor), and the new test lines are meaningful assertions, not padding. The inflation flag is technically triggered but reflects normal test expansion during a refactor.

Missing Test Coverage: runtime_observability.cjs

A new file actions/setup/js/runtime_observability.cjs was added with 335 lines and no corresponding test file (runtime_observability.test.cjs). This module computes the core observability metrics (risk scores, optimization scores, token metrics). While it is exercised indirectly through the generate_observability_summary and send_otlp_span tests, direct unit tests would give higher confidence in edge cases (e.g., missing/null fields, boundary score values, zero-token scenarios).


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests
  • 🟨 JavaScript (*.test.cjs): 2 test scenarios (vitest)

Verdict

Check passed. 0% of new/modified tests are implementation tests (threshold: 30%). Both test scenarios verify observable behavioral contracts with comprehensive assertions including error states. The main recommendation is to add a unit test file for the new runtime_observability.cjs module.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25507834502

🧪 Test quality analysis by Test Quality Sentinel · ● 7.2M

Contributor

@github-actions github-actions Bot left a comment


✅ Test Quality Sentinel: 80/100. Test quality is excellent — 0% of new/modified tests are implementation tests (threshold: 30%). Both test scenarios verify behavioral contracts with comprehensive assertions. Non-blocking recommendation: consider adding a runtime_observability.test.cjs unit test file for the new 335-line runtime_observability.cjs module.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Contributor

@github-actions github-actions Bot left a comment


Skills-Based Review 🧠

Applied /tdd and /zoom-out — this is a new-feature PR that also refactors observability data collection into a dedicated module, making both skills highly relevant.

Key Themes

Test coverage gaps (/tdd)

  • runtime_observability.cjs is a 335-line module with meaningful branching logic (risk scoring, optimization scoring, token intensity tiers, insights generation) but has no dedicated test file. The existing integration test in generate_observability_summary.test.cjs exercises only one happy path using effective_tokens: 500, leaving the inputTokens + outputTokens fallback, all score threshold branches, and readAgentRuntimeMetrics edge cases completely untested.
  • readJSONIfExists silently collapses both missing-file and malformed-JSON into null, removing the observability that the original existsSync-first approach provided.

Architecture clarity (/zoom-out)

  • The effectiveTokens three-level fallback is compressed onto one long ternary chain — extracting it to resolveEffectiveTokens() would make both the logic and tests much clearer.
  • tokenIntensity is emitted to OTLP spans but absent from the step-summary markdown — the two surfaces diverge silently.
  • countBlockedRequests() re-reads gateway JSONL files inside collectRuntimeObservabilityData even when called from sendJobConclusionSpan which has already resolved other data sources. Adding a blockedRequests option would complete the data-injection pattern already established.

Positive Highlights

  • Clean module extraction: pulling collectObservabilityData + buildObservabilitySummary into runtime_observability.cjs with a well-defined { metadata, raw, derived, insights } return shape is a solid improvement over the monolithic function.
  • Dependency injection via options: allowing callers to inject pre-resolved data (awInfo, agentOutput, agentUsage, runtimeMetrics) makes collectRuntimeObservabilityData genuinely testable.
  • Backward-compat alias: collectObservabilityData: collectRuntimeObservabilityData in the exports preserves existing consumers without a breaking change.
  • OTLP enrichment: adding structured observability attributes to job-conclusion spans will unlock meaningful dashboards without any schema migration.

Verdict

Requesting changes primarily around the missing unit tests for the scoring/threshold logic. The risk score formula and optimization score are referenced in the Sentry dashboard spec — if those coefficients drift without test coverage, the alert thresholds will silently mis-fire.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 13.5M

  } catch {
    return null;
  }
}
Contributor


[/tdd] readJSONIfExists swallows both ENOENT (file not found) and SyntaxError (malformed JSON) identically — callers can't distinguish between a missing file and a corrupt one. The original code used fs.existsSync first to separate those two cases. Consider rethrowing unexpected errors or at minimum logging them so silent data corruption is surfaced:

function readJSONIfExists(path) {
  try {
    return JSON.parse(fs.readFileSync(path, 'utf8'));
  } catch (err) {
    if (err.code !== 'ENOENT') {
      // Corrupt file — surface it rather than silently returning null
      console.error(`[runtime_observability] failed to parse ${path}: ${err.message}`);
    }
    return null;
  }
}

There are also no tests for the malformed-JSON path.

  score += 15;
} else if (raw.totalTokens >= 100000) {
  score += 10;
}
Contributor


[/tdd] computeRuntimeRiskScore has multiple weighted threshold branches (errors ×25 capped at 50, warnings ×5 capped at 20, blocked ×0.5, turns ≥20/≥10) but there are zero unit tests for these combinations. A dedicated runtime_observability.test.cjs should cover at least: one error (score=25), two errors (score=50, capped), one warning + one blocked request, turn-count thresholds, and the 100-cap. Without these, a coefficient change silently breaks all downstream alerting thresholds.

Example test structure:

it('caps error contribution at 50', () => {
  const score = computeRuntimeRiskScore({ outputErrorCount: 4, warningCount: 0, blockedRequests: 0, turnCount: 0 });
  expect(score).toBe(50); // 4*25=100 but capped at 50
});
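To make the listed cases concrete, here is a runnable sketch of the described weighting. The error/warning/blocked weights come from the comment above; the turn contributions (+15 / +10) are assumed values for illustration only, since the actual coefficients live in runtime_observability.cjs:

```javascript
// Reference sketch of the weighting described above; not the real implementation.
function computeRuntimeRiskScoreSketch({ outputErrorCount = 0, warningCount = 0, blockedRequests = 0, turnCount = 0 } = {}) {
  let score = 0;
  score += Math.min(outputErrorCount * 25, 50); // errors ×25, capped at 50
  score += Math.min(warningCount * 5, 20);      // warnings ×5, capped at 20
  score += blockedRequests * 0.5;               // blocked requests ×0.5
  if (turnCount >= 20) score += 15;             // assumed turn weights for illustration
  else if (turnCount >= 10) score += 10;
  return Math.min(score, 100);                  // overall cap at 100
}

// The cases the comment asks a unit test to cover:
const oneError = computeRuntimeRiskScoreSketch({ outputErrorCount: 1 });                 // 25
const cappedErrors = computeRuntimeRiskScoreSketch({ outputErrorCount: 4 });             // 50, capped
const warnPlusBlocked = computeRuntimeRiskScoreSketch({ warningCount: 1, blockedRequests: 1 }); // 5.5
const overallCap = computeRuntimeRiskScoreSketch({
  outputErrorCount: 4, warningCount: 10, blockedRequests: 80, turnCount: 25,
}); // 50 + 20 + 40 + 15 = 125, capped at 100
```

Pinning each weight behind an assertion like this is what keeps the Sentry alert thresholds honest if the coefficients change.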

  estimatedCostUsd: typeof runtimeMetrics.estimatedCostUsd === "number" ? runtimeMetrics.estimatedCostUsd : undefined,
  actionMinutes: typeof options.durationMs === "number" ? options.durationMs / 60000 : undefined,
  totalTokens,
  inputTokens,
Contributor


[/zoom-out] The effectiveTokens resolution chain (options → agentUsage → env var) is written as a single deeply-nested ternary that's very hard to read and reason about. This is load-bearing logic — it directly affects cost and token metrics shown in dashboards. Consider extracting it to a named function:

function resolveEffectiveTokens(options, agentUsage) {
  if (typeof options.effectiveTokens === 'number') return options.effectiveTokens;
  if (typeof agentUsage.effective_tokens === 'number' && agentUsage.effective_tokens > 0) return agentUsage.effective_tokens;
  const fromEnv = parseInt(process.env.GH_AW_EFFECTIVE_TOKENS ?? '', 10);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : undefined;
}

A dedicated test for each fallback path would also prevent regressions when the env-var fallback is added/removed.
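Those fallback tests can be expressed compactly, one check per path, assuming the `resolveEffectiveTokens` shape sketched above:

```javascript
// resolveEffectiveTokens as sketched in the comment above (assumption: the
// extraction is adopted as-is).
function resolveEffectiveTokens(options, agentUsage) {
  if (typeof options.effectiveTokens === "number") return options.effectiveTokens;
  if (typeof agentUsage.effective_tokens === "number" && agentUsage.effective_tokens > 0) return agentUsage.effective_tokens;
  const fromEnv = parseInt(process.env.GH_AW_EFFECTIVE_TOKENS ?? "", 10);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : undefined;
}

// Path 1: explicit option wins over everything else.
const fromOptions = resolveEffectiveTokens({ effectiveTokens: 100 }, { effective_tokens: 200 });
// Path 2: agentUsage value used when no option is given.
const fromUsage = resolveEffectiveTokens({}, { effective_tokens: 200 });
// Path 3a: nothing resolvable yields undefined.
delete process.env.GH_AW_EFFECTIVE_TOKENS;
const fromNothing = resolveEffectiveTokens({}, {});
// Path 3b: env var is the last resort.
process.env.GH_AW_EFFECTIVE_TOKENS = "300";
const fromEnv = resolveEffectiveTokens({}, {});
```

Four assertions, one per branch, would make any future reordering of the fallback chain an immediate test failure rather than a silent dashboard drift.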

  severity: derived.runtimeRiskScore >= 50 ? "high" : "medium",
  title: "Runtime risk detected",
  summary: `The run scored ${derived.runtimeRiskScore} on the runtime risk scale.`,
});
Contributor


[/tdd] raw.turnCount >= 12 has no typeof guard here, unlike computeRuntimeRiskScore which uses typeof raw.turnCount === 'number'. Since turnCount is initialized to 0 in the raw object, this is safe in practice — but the inconsistency is confusing and will mislead future readers. Align this with the guarded form used elsewhere, or add a comment explaining why the guard is unnecessary.

}

lines.push(`- **posture**: ${data.derived.posture}`);
lines.push(`- **runtime status**: ${data.derived.runtimeStatus}`);
Contributor


[/zoom-out] tokenIntensity (one of the four derived fields) is emitted into OTLP spans via send_otlp_span.cjs but is absent from buildObservabilitySummary's markdown output. Every other derived field (posture, runtimeStatus, runtimeRiskScore, optimizationScore) appears in both surfaces. This asymmetry means the step-summary and the Sentry dashboard will show different signals — operators reading the step summary won't see token intensity without opening the OTLP trace. Consider adding - **token intensity**: ${data.derived.tokenIntensity} to the markdown block, or document the intentional omission.
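A minimal sketch of that addition; the `data` object mirrors the `{ derived: ... }` shape described in this review, and the field values are illustrative:

```javascript
// Illustrative derived data, matching the { metadata, raw, derived, insights }
// shape described for collectRuntimeObservabilityData.
const data = { derived: { posture: "standard", runtimeStatus: "success", tokenIntensity: "high" } };

const lines = [];
lines.push(`- **posture**: ${data.derived.posture}`);
lines.push(`- **runtime status**: ${data.derived.runtimeStatus}`);
lines.push(`- **token intensity**: ${data.derived.tokenIntensity}`); // the suggested addition
const summary = lines.join("\n");
```

With this one line, the step summary and the OTLP span surface the same set of derived signals.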

@@ -46,6 +46,8 @@ describe("generate_observability_summary.cjs", () => {
errors: ["validation failed"],
})
);
Contributor


[/tdd] The happy-path integration test now covers the new fields, which is great. However there's no dedicated runtime_observability.test.cjs for the new module's pure functions (computeRuntimeRiskScore, computeOptimizationScore, deriveRuntimeStatus, deriveTokenIntensity, buildRuntimeObservabilityInsights, readAgentRuntimeMetrics). Integration tests through generate_observability_summary only exercise one code path — the test data here uses effective_tokens: 500 which bypasses the inputTokens + outputTokens fallback path entirely. A runtime_observability.test.cjs with unit tests would let you exercise each branch independently without standing up the full summary pipeline.

}

const runtimeObservability = collectRuntimeObservabilityData({
awInfo,
Contributor


[/zoom-out] collectRuntimeObservabilityData is called with already-resolved data (awInfo, agentOutput, agentUsage, runtimeMetrics) passed in via options, which is good. However countBlockedRequests() is called inside collectRuntimeObservabilityData unconditionally — it reads the gateway JSONL files from disk again, even though sendJobConclusionSpan may have already processed those files for other purposes. Consider passing blockedRequests as an option alongside the other pre-resolved fields to avoid the redundant I/O.
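The suggested injection pattern might look like this. It is a sketch only; `resolveBlockedRequests` and the stubbed `countBlockedRequests` are hypothetical names used to isolate the idea:

```javascript
// Stub standing in for the real countBlockedRequests(), which re-reads the
// gateway JSONL files from disk. The counter lets us observe the fallback.
let diskReads = 0;
function countBlockedRequests() {
  diskReads += 1;
  return 0;
}

// The proposed option: accept a pre-resolved count, fall back to disk only
// when it is absent — the same pattern as awInfo/agentUsage injection.
function resolveBlockedRequests(options = {}) {
  return typeof options.blockedRequests === "number"
    ? options.blockedRequests // injected by callers that already parsed the logs
    : countBlockedRequests(); // fallback for callers without gateway data
}

const injected = resolveBlockedRequests({ blockedRequests: 3 }); // no disk read
const fallback = resolveBlockedRequests({});                     // exactly one disk read
```

Callers like sendJobConclusionSpan would pass the value they already have; the summary generator keeps today's behavior unchanged.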

@github-actions
Contributor

github-actions Bot commented May 7, 2026

✅ smoke-ci: safeoutputs CLI comment + comment-memory run (25508172125)

Generated by Smoke CI for issue #30861

Copilot AI requested a review from pelikhan May 7, 2026 18:21
} else {
core.info("Automatic guard policy determination complete for public repository");
core.info("GitHub MCP guard policy automatically applied for public repository. " + "min-integrity='approved' and repos='all' ensure only approved-integrity content is accessible.");
core.warning("GitHub MCP guard policy automatically applied for public repository. " + "min-integrity='approved' and repos='all' ensure only approved-integrity content is accessible.");
Collaborator


This is overriding a change in main. @mnkiefer maybe this PR needs to be rebuilt

@mnkiefer mnkiefer closed this May 8, 2026
@github-actions github-actions Bot added the closed:ci-failure PR was closed without merging: ci-failure label May 8, 2026

Labels

closed:ci-failure PR was closed without merging: ci-failure


4 participants