feat: Implement runtime observability metrics & dashboard specs #30861
Conversation
Pull request overview
Implements a runtime observability baseline for gh-aw by collecting runtime metrics from local run artifacts and attaching them to OTLP conclusion spans, alongside a Sentry dashboard/alert spec to query those attributes.
Changes:
- Added `runtime_observability.cjs` to derive posture/runtime status, token/cost metrics, cache efficiency, blocked request counts, and summary markdown.
- Enriched `send_otlp_span.cjs` conclusion spans with new `gh-aw.observability.*` and `gh-aw.optimization.*` attributes (with test updates).
- Added `scratchpad/sentry-otel-dashboard-spec.md` and linked it from `scratchpad/dev.md`.
Summary per file:
| File | Description |
|---|---|
| scratchpad/sentry-otel-dashboard-spec.md | Defines dashboard panels, saved searches, and alert thresholds for querying conclusion-span telemetry in Sentry. |
| scratchpad/dev.md | Adds the new dashboard spec to the Related Documentation index and changelog. |
| actions/setup/js/send_otlp_span.cjs | Emits new runtime observability + optimization attributes on job conclusion spans. |
| actions/setup/js/send_otlp_span.test.cjs | Extends tests to validate the newly emitted runtime observability/optimization attributes. |
| actions/setup/js/runtime_observability.cjs | New shared collector for runtime metrics and step-summary markdown generation. |
| actions/setup/js/generate_observability_summary.cjs | Refactors summary generation to use the new shared runtime observability collector. |
| actions/setup/js/generate_observability_summary.test.cjs | Updates fixtures/assertions for the expanded observability summary output. |
Copilot's findings
Comments suppressed due to low confidence (1)
actions/setup/js/runtime_observability.cjs:33
`countBlockedRequests()` JSON-parses every non-empty line in the JSONL files. For large gateway logs this is unnecessarily expensive; follow the pattern used in `gateway_difc_filtered.cjs` (skip lines that don't contain "DIFC_FILTERED" before parsing) or reuse that parser to avoid parsing unrelated REQUEST/RESPONSE entries.
```javascript
const lines = fs.readFileSync(path, "utf8").split("\n");
for (const raw of lines) {
  const line = raw.trim();
  if (!line) continue;
  try {
    const entry = JSON.parse(line);
    if (entry && entry.type === "DIFC_FILTERED") {
      total += 1;
    }
  } catch {
    // Skip malformed lines.
  }
}
```
- Files reviewed: 7/7 changed files
- Comments generated: 4
```javascript
for (const path of GATEWAY_EVENT_PATHS) {
  try {
    const lines = fs.readFileSync(path, "utf8").split("\n");
    for (const raw of lines) {
      const line = raw.trim();
      if (!line) continue;
      try {
        const entry = JSON.parse(line);
        if (entry && entry.type === "DIFC_FILTERED") {
          total += 1;
        }
      } catch {
        // Skip malformed lines.
      }
    }
  } catch {
    // Missing gateway logs are normal for many runs.
  }
}
```
```javascript
createdItemTypes: uniqueCreatedItemTypes(items),
outputErrorCount: errors.length,
warningCount,
turnCount: typeof runtimeMetrics.turns === "number" ? runtimeMetrics.turns : 0,
```
```javascript
attributes.push(buildAttr("gh-aw.observability.created_item_count", runtimeObservability.raw.createdItemCount));
attributes.push(buildAttr("gh-aw.observability.output_error_count", runtimeObservability.raw.outputErrorCount));
attributes.push(buildAttr("gh-aw.observability.warning_count", runtimeObservability.raw.warningCount));
attributes.push(buildAttr("gh-aw.observability.turn_count", runtimeObservability.raw.turnCount));
```
🧪 Test Quality Sentinel Report: Test Quality Score 80/100 ✅ Excellent test quality
✅ Test Quality Sentinel: 80/100. Test quality is excellent — 0% of new/modified tests are implementation tests (threshold: 30%). Both test scenarios verify behavioral contracts with comprehensive assertions. Non-blocking recommendation: consider adding a runtime_observability.test.cjs unit test file for the new 335-line runtime_observability.cjs module.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Skills-Based Review 🧠
Applied /tdd and /zoom-out — this is a new-feature PR that also refactors observability data collection into a dedicated module, making both skills highly relevant.
Key Themes
Test coverage gaps (/tdd)
- `runtime_observability.cjs` is a 335-line module with meaningful branching logic (risk scoring, optimization scoring, token intensity tiers, insights generation) but has no dedicated test file. The existing integration test in `generate_observability_summary.test.cjs` exercises only one happy path using `effective_tokens: 500`, leaving the `inputTokens + outputTokens` fallback, all score threshold branches, and `readAgentRuntimeMetrics` edge cases completely untested.
- `readJSONIfExists` silently collapses both missing-file and malformed-JSON into `null`, removing the observability that the original `existsSync`-first approach provided.
Architecture clarity (/zoom-out)
- The `effectiveTokens` three-level fallback is compressed onto one long ternary chain; extracting it to `resolveEffectiveTokens()` would make both the logic and tests much clearer.
- `tokenIntensity` is emitted to OTLP spans but absent from the step-summary markdown; the two surfaces diverge silently.
- `countBlockedRequests()` re-reads gateway JSONL files inside `collectRuntimeObservabilityData` even when called from `sendJobConclusionSpan`, which has already resolved other data sources. Adding a `blockedRequests` option would complete the data-injection pattern already established.
Positive Highlights
- ✅ Clean module extraction: pulling `collectObservabilityData` + `buildObservabilitySummary` into `runtime_observability.cjs` with a well-defined `{ metadata, raw, derived, insights }` return shape is a solid improvement over the monolithic function.
- ✅ Dependency injection via `options`: allowing callers to inject pre-resolved data (`awInfo`, `agentOutput`, `agentUsage`, `runtimeMetrics`) makes `collectRuntimeObservabilityData` genuinely testable.
- ✅ Backward-compat alias: `collectObservabilityData: collectRuntimeObservabilityData` in the exports preserves existing consumers without a breaking change.
- ✅ OTLP enrichment: adding structured observability attributes to job-conclusion spans will unlock meaningful dashboards without any schema migration.
Verdict
Requesting changes primarily around the missing unit tests for the scoring/threshold logic. The risk score formula and optimization score are referenced in the Sentry dashboard spec — if those coefficients drift without test coverage, the alert thresholds will silently mis-fire.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 13.5M
```javascript
  } catch {
    return null;
  }
}
```
[/tdd] readJSONIfExists swallows both ENOENT (file not found) and SyntaxError (malformed JSON) identically — callers can't distinguish between a missing file and a corrupt one. The original code used fs.existsSync first to separate those two cases. Consider rethrowing unexpected errors or at minimum logging them so silent data corruption is surfaced:
```javascript
function readJSONIfExists(path) {
  try {
    return JSON.parse(fs.readFileSync(path, 'utf8'));
  } catch (err) {
    if (err.code !== 'ENOENT') {
      // Corrupt file — surface it rather than silently returning null
      console.error(`[runtime_observability] failed to parse ${path}: ${err.message}`);
    }
    return null;
  }
}
```

There are also no tests for the malformed-JSON path.
```javascript
    score += 15;
  } else if (raw.totalTokens >= 100000) {
    score += 10;
  }
```
[/tdd] computeRuntimeRiskScore has multiple weighted threshold branches (errors ×25 capped at 50, warnings ×5 capped at 20, blocked ×0.5, turns ≥20/≥10) but there are zero unit tests for these combinations. A dedicated runtime_observability.test.cjs should cover at least: one error (score=25), two errors (score=50, capped), one warning + one blocked request, turn-count thresholds, and the 100-cap. Without these, a coefficient change silently breaks all downstream alerting thresholds.
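For reference, the weighting described above can be reconstructed as a rough sketch. This is a hypothetical reconstruction, not the code under review: the per-threshold turn-count increments (+15/+10) are assumed, and the real coefficients live in `runtime_observability.cjs`:

```javascript
// Hypothetical reconstruction of computeRuntimeRiskScore from the
// weights quoted in this review: errors ×25 capped at 50, warnings ×5
// capped at 20, blocked requests ×0.5, turn-count thresholds at 20/10.
// The turn-count increments (+15/+10) are assumptions, not real values.
function computeRuntimeRiskScore({ outputErrorCount = 0, warningCount = 0, blockedRequests = 0, turnCount = 0 }) {
  let score = 0;
  score += Math.min(outputErrorCount * 25, 50);
  score += Math.min(warningCount * 5, 20);
  score += blockedRequests * 0.5;
  if (turnCount >= 20) score += 15;
  else if (turnCount >= 10) score += 10;
  return Math.min(score, 100); // overall cap at 100
}
```

Each branch in a sketch like this maps one-to-one to a unit-test case, which is exactly the coverage the review is asking for.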
Example test structure:
```javascript
it('caps error contribution at 50', () => {
  const score = computeRuntimeRiskScore({ outputErrorCount: 4, warningCount: 0, blockedRequests: 0, turnCount: 0 });
  expect(score).toBe(50); // 4*25=100 but capped at 50
});
```

```javascript
estimatedCostUsd: typeof runtimeMetrics.estimatedCostUsd === "number" ? runtimeMetrics.estimatedCostUsd : undefined,
actionMinutes: typeof options.durationMs === "number" ? options.durationMs / 60000 : undefined,
totalTokens,
inputTokens,
```
[/zoom-out] The effectiveTokens resolution chain (options → agentUsage → env var) is written as a single deeply-nested ternary that's very hard to read and reason about. This is load-bearing logic — it directly affects cost and token metrics shown in dashboards. Consider extracting it to a named function:
```javascript
function resolveEffectiveTokens(options, agentUsage) {
  if (typeof options.effectiveTokens === 'number') return options.effectiveTokens;
  if (typeof agentUsage.effective_tokens === 'number' && agentUsage.effective_tokens > 0) return agentUsage.effective_tokens;
  const fromEnv = parseInt(process.env.GH_AW_EFFECTIVE_TOKENS ?? '', 10);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : undefined;
}
```

A dedicated test for each fallback path would also prevent regressions when the env-var fallback is added/removed.
```javascript
severity: derived.runtimeRiskScore >= 50 ? "high" : "medium",
title: "Runtime risk detected",
summary: `The run scored ${derived.runtimeRiskScore} on the runtime risk scale.`,
});
```
[/tdd] raw.turnCount >= 12 has no typeof guard here, unlike computeRuntimeRiskScore which uses typeof raw.turnCount === 'number'. Since turnCount is initialized to 0 in the raw object, this is safe in practice — but the inconsistency is confusing and will mislead future readers. Align this with the guarded form used elsewhere, or add a comment explaining why the guard is unnecessary.
```javascript
}

lines.push(`- **posture**: ${data.derived.posture}`);
lines.push(`- **runtime status**: ${data.derived.runtimeStatus}`);
```
[/zoom-out] `tokenIntensity` (one of the four derived fields) is emitted into OTLP spans via `send_otlp_span.cjs` but is absent from `buildObservabilitySummary`'s markdown output. Every other derived field (`posture`, `runtimeStatus`, `runtimeRiskScore`, `optimizationScore`) appears in both surfaces. This asymmetry means the step summary and the Sentry dashboard will show different signals; operators reading the step summary won't see token intensity without opening the OTLP trace. Consider adding `- **token intensity**: ${data.derived.tokenIntensity}` to the markdown block, or document the intentional omission.
```diff
@@ -46,6 +46,8 @@ describe("generate_observability_summary.cjs", () => {
   errors: ["validation failed"],
 })
);
```
[/tdd] The happy-path integration test now covers the new fields, which is great. However there's no dedicated runtime_observability.test.cjs for the new module's pure functions (computeRuntimeRiskScore, computeOptimizationScore, deriveRuntimeStatus, deriveTokenIntensity, buildRuntimeObservabilityInsights, readAgentRuntimeMetrics). Integration tests through generate_observability_summary only exercise one code path — the test data here uses effective_tokens: 500 which bypasses the inputTokens + outputTokens fallback path entirely. A runtime_observability.test.cjs with unit tests would let you exercise each branch independently without standing up the full summary pipeline.
```javascript
}

const runtimeObservability = collectRuntimeObservabilityData({
  awInfo,
```
[/zoom-out] collectRuntimeObservabilityData is called with already-resolved data (awInfo, agentOutput, agentUsage, runtimeMetrics) passed in via options, which is good. However countBlockedRequests() is called inside collectRuntimeObservabilityData unconditionally — it reads the gateway JSONL files from disk again, even though sendJobConclusionSpan may have already processed those files for other purposes. Consider passing blockedRequests as an option alongside the other pre-resolved fields to avoid the redundant I/O.
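A minimal sketch of that option follows. The names mirror the review's description, the stub stands in for the real disk-scanning helper, and the return shape is elided down to the one relevant field:

```javascript
// Stub standing in for the real JSONL-scanning helper.
function countBlockedRequests() {
  return 0;
}

// Hypothetical sketch: honor a pre-resolved count so callers such as
// sendJobConclusionSpan can skip the redundant gateway-log I/O.
function collectRuntimeObservabilityData(options = {}) {
  const blockedRequests =
    typeof options.blockedRequests === "number"
      ? options.blockedRequests // injected by the caller, no disk read
      : countBlockedRequests(); // fall back to scanning the files
  return { raw: { blockedRequests } }; // (other fields elided)
}
```

This mirrors the injection pattern already used for `awInfo`, `agentOutput`, `agentUsage`, and `runtimeMetrics`, so existing callers keep working unchanged.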
…ic repos Agent-Logs-Url: https://github.com/github/gh-aw/sessions/4d033fe9-48f7-4b0d-aed2-ff30e096bb7a Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
```javascript
} else {
  core.info("Automatic guard policy determination complete for public repository");
  core.info("GitHub MCP guard policy automatically applied for public repository. " + "min-integrity='approved' and repos='all' ensure only approved-integrity content is accessible.");
  core.warning("GitHub MCP guard policy automatically applied for public repository. " + "min-integrity='approved' and repos='all' ensure only approved-integrity content is accessible.");
```
This is overriding a change in main. @mnkiefer maybe this PR needs to be rebuilt
- `runtime_observability.cjs` to collect and compute runtime metrics including token usage, error counts, and cache efficiency.
- `send_otlp_span.cjs` to include observability data in OTLP spans for better monitoring.
- `sentry-otel-dashboard-spec.md` to define the Sentry OTEL dashboard and alert model for `gh-aw` runtime telemetry.