
feat: Implement runtime observability metrics & dashboard specs#30861

Closed
mnkiefer wants to merge 4 commits into main from add-sashboard-specs

Conversation

@mnkiefer
Collaborator

@mnkiefer mnkiefer commented May 7, 2026

  • Adds runtime_observability.cjs to collect and compute runtime metrics including token usage, error counts, and cache efficiency.
  • Enhances send_otlp_span.cjs to include observability data in OTLP spans for better monitoring.
  • Creates sentry-otel-dashboard-spec.md to define the Sentry OTEL dashboard and alert model for gh-aw runtime telemetry.

@mnkiefer mnkiefer marked this pull request as ready for review May 7, 2026 16:12
Copilot AI review requested due to automatic review settings May 7, 2026 16:12
@mnkiefer mnkiefer changed the title from "feat: Implement runtime observability metrics and dashboard specs" to "feat: Implement runtime observability metrics & dashboard specs" May 7, 2026

@github-actions github-actions Bot mentioned this pull request May 7, 2026
Contributor

Copilot AI left a comment


Pull request overview

Implements a runtime observability baseline for gh-aw by collecting runtime metrics from local run artifacts and attaching them to OTLP conclusion spans, alongside a Sentry dashboard/alert spec to query those attributes.

Changes:

  • Added runtime_observability.cjs to derive posture/runtime status, token/cost metrics, cache efficiency, blocked request counts, and summary markdown.
  • Enriched send_otlp_span.cjs conclusion spans with new gh-aw.observability.* and gh-aw.optimization.* attributes (with test updates).
  • Added scratchpad/sentry-otel-dashboard-spec.md and linked it from scratchpad/dev.md.
Summary per file:

  • scratchpad/sentry-otel-dashboard-spec.md: Defines dashboard panels, saved searches, and alert thresholds for querying conclusion-span telemetry in Sentry.
  • scratchpad/dev.md: Adds the new dashboard spec to the Related Documentation index and changelog.
  • actions/setup/js/send_otlp_span.cjs: Emits new runtime observability + optimization attributes on job conclusion spans.
  • actions/setup/js/send_otlp_span.test.cjs: Extends tests to validate the newly emitted runtime observability/optimization attributes.
  • actions/setup/js/runtime_observability.cjs: New shared collector for runtime metrics and step-summary markdown generation.
  • actions/setup/js/generate_observability_summary.cjs: Refactors summary generation to use the new shared runtime observability collector.
  • actions/setup/js/generate_observability_summary.test.cjs: Updates fixtures/assertions for the expanded observability summary output.

Copilot's findings


Comments suppressed due to low confidence (1)

actions/setup/js/runtime_observability.cjs:33

  • countBlockedRequests() JSON.parses every non-empty line in the JSONL files. For large gateway logs this is unnecessarily expensive; follow the pattern used in gateway_difc_filtered.cjs (skip lines that don’t contain "DIFC_FILTERED" before parsing) or reuse that parser to avoid parsing unrelated REQUEST/RESPONSE entries.
      const lines = fs.readFileSync(path, "utf8").split("\n");
      for (const raw of lines) {
        const line = raw.trim();
        if (!line) continue;
        try {
          const entry = JSON.parse(line);
          if (entry && entry.type === "DIFC_FILTERED") {
            total += 1;
          }
  • Files reviewed: 7/7 changed files
  • Comments generated: 4
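The pre-filter that finding suggests can be sketched as follows. This is a minimal illustration, not the actual gateway_difc_filtered.cjs implementation; `countBlockedRequestsFromText` is a hypothetical helper that operates on already-read file contents:

```javascript
// Hypothetical helper illustrating the suggested optimization: skip lines
// that cannot contain a DIFC_FILTERED event before paying for JSON.parse.
function countBlockedRequestsFromText(text) {
  let total = 0;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    // Cheap substring check first: most REQUEST/RESPONSE lines are
    // rejected here without ever invoking the JSON parser.
    if (!line || !line.includes("DIFC_FILTERED")) continue;
    try {
      const entry = JSON.parse(line);
      if (entry && entry.type === "DIFC_FILTERED") total += 1;
    } catch {
      // Skip malformed lines.
    }
  }
  return total;
}
```

The substring check is conservative: any line that would parse to a `DIFC_FILTERED` entry necessarily contains that substring, so no events are missed.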

Comment on lines +23 to +41
for (const path of GATEWAY_EVENT_PATHS) {
  try {
    const lines = fs.readFileSync(path, "utf8").split("\n");
    for (const raw of lines) {
      const line = raw.trim();
      if (!line) continue;
      try {
        const entry = JSON.parse(line);
        if (entry && entry.type === "DIFC_FILTERED") {
          total += 1;
        }
      } catch {
        // Skip malformed lines.
      }
    }
  } catch {
    // Missing gateway logs are normal for many runs.
  }
}
  createdItemTypes: uniqueCreatedItemTypes(items),
  outputErrorCount: errors.length,
  warningCount,
  turnCount: typeof runtimeMetrics.turns === "number" ? runtimeMetrics.turns : 0,

attributes.push(buildAttr("gh-aw.observability.created_item_count", runtimeObservability.raw.createdItemCount));
attributes.push(buildAttr("gh-aw.observability.output_error_count", runtimeObservability.raw.outputErrorCount));
attributes.push(buildAttr("gh-aw.observability.warning_count", runtimeObservability.raw.warningCount));
attributes.push(buildAttr("gh-aw.observability.turn_count", runtimeObservability.raw.turnCount));
Comment thread scratchpad/dev.md Outdated
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🧪 Test Quality Sentinel Report

Test Quality Score: 80/100

Excellent test quality

  • New/modified tests analyzed: 2 test scenarios (no new test functions; both files expanded existing tests)
  • ✅ Design tests (behavioral contracts): 2 (100%)
  • ⚠️ Implementation tests (low value): 0 (0%)
  • Tests with error/edge cases: 2 (100%)
  • Duplicate test clusters: 0
  • Test inflation detected: ⚠️ Yes — generate_observability_summary.test.cjs (+10 lines) vs generate_observability_summary.cjs (+3 lines) = 3.3:1 ratio
  • 🚨 Coding-guideline violations: None

⚠️ Note: No new test functions were added — both test files expanded assertions within existing it(...) blocks. The 2 modified test scenarios were used as the analysis scope.


Test Classification Details

  • generate_observability_summary main scenario (actions/setup/js/generate_observability_summary.test.cjs): ✅ Design, no issues detected
  • sendJobConclusionSpan with observability data (actions/setup/js/send_otlp_span.test.cjs): ✅ Design, no issues detected

Analysis

generate_observability_summary.test.cjs

7 new expect(summary).toContain(...) assertions were added to the existing integration-style scenario. They verify that when agent_usage.json and agent-stdio.log are present, the generated markdown summary includes the correct computed values: runtime status, token counts, estimated cost, turn count, cache efficiency, runtime risk score, and optimization score. This is a solid behavioral contract test — it checks observable output from real data, including an error state (errors: ["validation failed"]).

send_otlp_span.test.cjs

16 new expect(attrs["gh-aw.*"]) assertions were added to verify that the new observability attributes are correctly mapped into OTLP span attributes. This includes schema version, posture, runtime status, blocked requests, output errors, all token metrics, cache efficiency, estimated cost, action minutes, intensity level, runtime risk score, and optimization score. The mock setup was also extended to include agent_usage.json and gateway.jsonl data — testing realistic data flow through the attribute-mapping logic.


⚠️ Observations (Non-Blocking)

Test Inflation: generate_observability_summary.test.cjs

The test file added 10 lines vs. only 3 lines added to generate_observability_summary.cjs. This gives a 3.3:1 ratio (threshold: 2:1). However, context matters here: the production file had 117 deletions (a large refactor), and the new test lines are meaningful assertions, not padding. The inflation flag is technically triggered but reflects normal test expansion during a refactor.

Missing Test Coverage: runtime_observability.cjs

A new file actions/setup/js/runtime_observability.cjs was added with 335 lines and no corresponding test file (runtime_observability.test.cjs). This module computes the core observability metrics (risk scores, optimization scores, token metrics). While it is exercised indirectly through the generate_observability_summary and send_otlp_span tests, direct unit tests would give higher confidence in edge cases (e.g., missing/null fields, boundary score values, zero-token scenarios).


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests
  • 🟨 JavaScript (*.test.cjs): 2 test scenarios (vitest)

Verdict

Check passed. 0% of new/modified tests are implementation tests (threshold: 30%). Both test scenarios verify observable behavioral contracts with comprehensive assertions including error states. The main recommendation is to add a unit test file for the new runtime_observability.cjs module.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25507834502

🧪 Test quality analysis by Test Quality Sentinel · ● 7.2M

Contributor

@github-actions github-actions Bot left a comment


✅ Test Quality Sentinel: 80/100. Test quality is excellent — 0% of new/modified tests are implementation tests (threshold: 30%). Both test scenarios verify behavioral contracts with comprehensive assertions. Non-blocking recommendation: consider adding a runtime_observability.test.cjs unit test file for the new 335-line runtime_observability.cjs module.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Contributor

@github-actions github-actions Bot left a comment


Skills-Based Review 🧠

Applied /tdd and /zoom-out — this is a new-feature PR that also refactors observability data collection into a dedicated module, making both skills highly relevant.

Key Themes

Test coverage gaps (/tdd)

  • runtime_observability.cjs is a 335-line module with meaningful branching logic (risk scoring, optimization scoring, token intensity tiers, insights generation) but has no dedicated test file. The existing integration test in generate_observability_summary.test.cjs exercises only one happy path using effective_tokens: 500, leaving the inputTokens + outputTokens fallback, all score threshold branches, and readAgentRuntimeMetrics edge cases completely untested.
  • readJSONIfExists silently collapses both missing-file and malformed-JSON into null, removing the observability that the original existsSync-first approach provided.

Architecture clarity (/zoom-out)

  • The effectiveTokens three-level fallback is compressed onto one long ternary chain — extracting it to resolveEffectiveTokens() would make both the logic and tests much clearer.
  • tokenIntensity is emitted to OTLP spans but absent from the step-summary markdown — the two surfaces diverge silently.
  • countBlockedRequests() re-reads gateway JSONL files inside collectRuntimeObservabilityData even when called from sendJobConclusionSpan which has already resolved other data sources. Adding a blockedRequests option would complete the data-injection pattern already established.

Positive Highlights

  • Clean module extraction: pulling collectObservabilityData + buildObservabilitySummary into runtime_observability.cjs with a well-defined { metadata, raw, derived, insights } return shape is a solid improvement over the monolithic function.
  • Dependency injection via options: allowing callers to inject pre-resolved data (awInfo, agentOutput, agentUsage, runtimeMetrics) makes collectRuntimeObservabilityData genuinely testable.
  • Backward-compat alias: collectObservabilityData: collectRuntimeObservabilityData in the exports preserves existing consumers without a breaking change.
  • OTLP enrichment: adding structured observability attributes to job-conclusion spans will unlock meaningful dashboards without any schema migration.

Verdict

Requesting changes primarily around the missing unit tests for the scoring/threshold logic. The risk score formula and optimization score are referenced in the Sentry dashboard spec — if those coefficients drift without test coverage, the alert thresholds will silently mis-fire.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 13.5M

  } catch {
    return null;
  }
}
Contributor


[/tdd] readJSONIfExists swallows both ENOENT (file not found) and SyntaxError (malformed JSON) identically — callers can't distinguish between a missing file and a corrupt one. The original code used fs.existsSync first to separate those two cases. Consider rethrowing unexpected errors or at minimum logging them so silent data corruption is surfaced:

function readJSONIfExists(path) {
  try {
    return JSON.parse(fs.readFileSync(path, 'utf8'));
  } catch (err) {
    if (err.code !== 'ENOENT') {
      // Corrupt file — surface it rather than silently returning null
      console.error(`[runtime_observability] failed to parse ${path}: ${err.message}`);
    }
    return null;
  }
}

There are also no tests for the malformed-JSON path.

  score += 15;
} else if (raw.totalTokens >= 100000) {
  score += 10;
}
Contributor


[/tdd] computeRuntimeRiskScore has multiple weighted threshold branches (errors ×25 capped at 50, warnings ×5 capped at 20, blocked ×0.5, turns ≥20/≥10) but there are zero unit tests for these combinations. A dedicated runtime_observability.test.cjs should cover at least: one error (score=25), two errors (score=50, capped), one warning + one blocked request, turn-count thresholds, and the 100-cap. Without these, a coefficient change silently breaks all downstream alerting thresholds.

Example test structure:

it('caps error contribution at 50', () => {
  const score = computeRuntimeRiskScore({ outputErrorCount: 4, warningCount: 0, blockedRequests: 0, turnCount: 0 });
  expect(score).toBe(50); // 4*25=100 but capped at 50
});
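To make the listed cases concrete, here is a runnable sketch of the described weighting. The error/warning/blocked weights come from the comment above; the turn contributions (+15 / +10) are assumed values for illustration only, since the actual coefficients live in runtime_observability.cjs:

```javascript
// Reference sketch of the weighting described above; not the real implementation.
function computeRuntimeRiskScoreSketch({ outputErrorCount = 0, warningCount = 0, blockedRequests = 0, turnCount = 0 } = {}) {
  let score = 0;
  score += Math.min(outputErrorCount * 25, 50); // errors ×25, capped at 50
  score += Math.min(warningCount * 5, 20);      // warnings ×5, capped at 20
  score += blockedRequests * 0.5;               // blocked requests ×0.5
  if (turnCount >= 20) score += 15;             // assumed turn weights for illustration
  else if (turnCount >= 10) score += 10;
  return Math.min(score, 100);                  // overall cap at 100
}

// The cases the comment asks a unit test to cover:
const oneError = computeRuntimeRiskScoreSketch({ outputErrorCount: 1 });                 // 25
const cappedErrors = computeRuntimeRiskScoreSketch({ outputErrorCount: 4 });             // 50, capped
const warnPlusBlocked = computeRuntimeRiskScoreSketch({ warningCount: 1, blockedRequests: 1 }); // 5.5
const overallCap = computeRuntimeRiskScoreSketch({
  outputErrorCount: 4, warningCount: 10, blockedRequests: 80, turnCount: 25,
}); // 50 + 20 + 40 + 15 = 125, capped at 100
```

Pinning each weight behind an assertion like this is what keeps the Sentry alert thresholds honest if the coefficients change.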

  estimatedCostUsd: typeof runtimeMetrics.estimatedCostUsd === "number" ? runtimeMetrics.estimatedCostUsd : undefined,
  actionMinutes: typeof options.durationMs === "number" ? options.durationMs / 60000 : undefined,
  totalTokens,
  inputTokens,
Contributor


[/zoom-out] The effectiveTokens resolution chain (options → agentUsage → env var) is written as a single deeply-nested ternary that's very hard to read and reason about. This is load-bearing logic — it directly affects cost and token metrics shown in dashboards. Consider extracting it to a named function:

function resolveEffectiveTokens(options, agentUsage) {
  if (typeof options.effectiveTokens === 'number') return options.effectiveTokens;
  if (typeof agentUsage.effective_tokens === 'number' && agentUsage.effective_tokens > 0) return agentUsage.effective_tokens;
  const fromEnv = parseInt(process.env.GH_AW_EFFECTIVE_TOKENS ?? '', 10);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : undefined;
}

A dedicated test for each fallback path would also prevent regressions when the env-var fallback is added/removed.
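Those fallback tests can be expressed compactly, one check per path, assuming the `resolveEffectiveTokens` shape sketched above:

```javascript
// resolveEffectiveTokens as sketched in the comment above (assumption: the
// extraction is adopted as-is).
function resolveEffectiveTokens(options, agentUsage) {
  if (typeof options.effectiveTokens === "number") return options.effectiveTokens;
  if (typeof agentUsage.effective_tokens === "number" && agentUsage.effective_tokens > 0) return agentUsage.effective_tokens;
  const fromEnv = parseInt(process.env.GH_AW_EFFECTIVE_TOKENS ?? "", 10);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : undefined;
}

// Path 1: explicit option wins over everything else.
const fromOptions = resolveEffectiveTokens({ effectiveTokens: 100 }, { effective_tokens: 200 });
// Path 2: agentUsage value used when no option is given.
const fromUsage = resolveEffectiveTokens({}, { effective_tokens: 200 });
// Path 3a: nothing resolvable yields undefined.
delete process.env.GH_AW_EFFECTIVE_TOKENS;
const fromNothing = resolveEffectiveTokens({}, {});
// Path 3b: env var is the last resort.
process.env.GH_AW_EFFECTIVE_TOKENS = "300";
const fromEnv = resolveEffectiveTokens({}, {});
```

Four assertions, one per branch, would make any future reordering of the fallback chain an immediate test failure rather than a silent dashboard drift.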

  severity: derived.runtimeRiskScore >= 50 ? "high" : "medium",
  title: "Runtime risk detected",
  summary: `The run scored ${derived.runtimeRiskScore} on the runtime risk scale.`,
});
Contributor


[/tdd] raw.turnCount >= 12 has no typeof guard here, unlike computeRuntimeRiskScore which uses typeof raw.turnCount === 'number'. Since turnCount is initialized to 0 in the raw object, this is safe in practice — but the inconsistency is confusing and will mislead future readers. Align this with the guarded form used elsewhere, or add a comment explaining why the guard is unnecessary.

}

lines.push(`- **posture**: ${data.derived.posture}`);
lines.push(`- **runtime status**: ${data.derived.runtimeStatus}`);
Contributor


[/zoom-out] tokenIntensity (one of the four derived fields) is emitted into OTLP spans via send_otlp_span.cjs but is absent from buildObservabilitySummary's markdown output. Every other derived field (posture, runtimeStatus, runtimeRiskScore, optimizationScore) appears in both surfaces. This asymmetry means the step-summary and the Sentry dashboard will show different signals — operators reading the step summary won't see token intensity without opening the OTLP trace. Consider adding - **token intensity**: ${data.derived.tokenIntensity} to the markdown block, or document the intentional omission.
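A minimal sketch of that addition; the `data` object mirrors the `{ derived: ... }` shape described in this review, and the field values are illustrative:

```javascript
// Illustrative derived data, matching the { metadata, raw, derived, insights }
// shape described for collectRuntimeObservabilityData.
const data = { derived: { posture: "standard", runtimeStatus: "success", tokenIntensity: "high" } };

const lines = [];
lines.push(`- **posture**: ${data.derived.posture}`);
lines.push(`- **runtime status**: ${data.derived.runtimeStatus}`);
lines.push(`- **token intensity**: ${data.derived.tokenIntensity}`); // the suggested addition
const summary = lines.join("\n");
```

With this one line, the step summary and the OTLP span surface the same set of derived signals.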

@@ -46,6 +46,8 @@ describe("generate_observability_summary.cjs", () => {
errors: ["validation failed"],
})
);
Contributor


[/tdd] The happy-path integration test now covers the new fields, which is great. However there's no dedicated runtime_observability.test.cjs for the new module's pure functions (computeRuntimeRiskScore, computeOptimizationScore, deriveRuntimeStatus, deriveTokenIntensity, buildRuntimeObservabilityInsights, readAgentRuntimeMetrics). Integration tests through generate_observability_summary only exercise one code path — the test data here uses effective_tokens: 500 which bypasses the inputTokens + outputTokens fallback path entirely. A runtime_observability.test.cjs with unit tests would let you exercise each branch independently without standing up the full summary pipeline.

}

const runtimeObservability = collectRuntimeObservabilityData({
awInfo,
Contributor


[/zoom-out] collectRuntimeObservabilityData is called with already-resolved data (awInfo, agentOutput, agentUsage, runtimeMetrics) passed in via options, which is good. However countBlockedRequests() is called inside collectRuntimeObservabilityData unconditionally — it reads the gateway JSONL files from disk again, even though sendJobConclusionSpan may have already processed those files for other purposes. Consider passing blockedRequests as an option alongside the other pre-resolved fields to avoid the redundant I/O.
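The suggested injection pattern might look like this. It is a sketch only; `resolveBlockedRequests` and the stubbed `countBlockedRequests` are hypothetical names used to isolate the idea:

```javascript
// Stub standing in for the real countBlockedRequests(), which re-reads the
// gateway JSONL files from disk. The counter lets us observe the fallback.
let diskReads = 0;
function countBlockedRequests() {
  diskReads += 1;
  return 0;
}

// The proposed option: accept a pre-resolved count, fall back to disk only
// when it is absent — the same pattern as awInfo/agentUsage injection.
function resolveBlockedRequests(options = {}) {
  return typeof options.blockedRequests === "number"
    ? options.blockedRequests // injected by callers that already parsed the logs
    : countBlockedRequests(); // fallback for callers without gateway data
}

const injected = resolveBlockedRequests({ blockedRequests: 3 }); // no disk read
const fallback = resolveBlockedRequests({});                     // exactly one disk read
```

Callers like sendJobConclusionSpan would pass the value they already have; the summary generator keeps today's behavior unchanged.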

@github-actions
Contributor

github-actions Bot commented May 7, 2026

✅ smoke-ci: safeoutputs CLI comment + comment-memory run (25508172125)

Generated by Smoke CI for issue #30861

Copilot AI requested a review from pelikhan May 7, 2026 18:21
} else {
core.info("Automatic guard policy determination complete for public repository");
core.info("GitHub MCP guard policy automatically applied for public repository. " + "min-integrity='approved' and repos='all' ensure only approved-integrity content is accessible.");
core.warning("GitHub MCP guard policy automatically applied for public repository. " + "min-integrity='approved' and repos='all' ensure only approved-integrity content is accessible.");
Collaborator


This is overriding a change in main. @mnkiefer maybe this PR needs to be rebuilt

@mnkiefer mnkiefer closed this May 8, 2026
@github-actions github-actions Bot added the closed:ci-failure PR was closed without merging: ci-failure label May 8, 2026

Labels

closed:ci-failure PR was closed without merging: ci-failure


4 participants