[observability] Agentic Observability Report — Mar 16–30, 2026 #23527

2026-03-30T08:54:45Z

github-actions[bot]
bot Mar 30, 2026

Executive Summary

The 14-day window (Mar 16–30, 2026) shows a healthy, active repository with 105 runs across 46 workflows. No episodes are marked escalation_eligible by the deterministic model, and no runs carry a risky classification label. The dominant concern is that two smoke-test workflows, Smoke Claude and Smoke Copilot, are assessed as resource-heavy in every run in the period and as poorly controlled in three of their four runs, crossing the escalation thresholds for those categories. Seven runs across five workflows failed, most tied to known fragile workflows (Changeset Generator, Smoke Codex). One missing-tool event was recorded for Smoke Copilot (Serena MCP server). No MCP server failures were recorded.

Key Metrics

Metric	Value
Date range	Mar 16–30, 2026 (14 days)
Workflows analyzed	46
Runs analyzed	105
Episodes analyzed	104
High-confidence episodes	103 (99%)
Runs with `risky` classification	0
Medium/high severity assessments	9 workflows
Escalation-eligible episodes	0
Total token usage	9.4 M
Total estimated cost	$2.72
Total Actions minutes	223 min
Total errors	7
MCP failures	0
Missing-tool events	1
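
As a rough cross-check, the window totals above can be turned into per-run averages. The sketch below only uses figures from the table; the function and constant names are illustrative and are not part of the Agentic Observability Kit.

```python
# Totals copied from the Key Metrics table above.
RUNS = 105
EPISODES = 104
HIGH_CONFIDENCE_EPISODES = 103
TOTAL_TOKENS = 9_400_000  # reported as 9.4 M, so already rounded
TOTAL_COST_USD = 2.72
TOTAL_MINUTES = 223


def per_run_averages() -> dict[str, float]:
    """Return simple per-run averages derived from the window totals."""
    return {
        "tokens_per_run": TOTAL_TOKENS / RUNS,
        "cost_per_run_usd": TOTAL_COST_USD / RUNS,
        "minutes_per_run": TOTAL_MINUTES / RUNS,
        "high_confidence_share": HIGH_CONFIDENCE_EPISODES / EPISODES,
    }


if __name__ == "__main__":
    for name, value in per_run_averages().items():
        print(f"{name}: {value:.4f}")
```

This works out to roughly 90 K tokens, $0.026, and 2.1 Actions minutes per run. Because several runs were skipped or report 0 tokens (for example /cloclo and the Copilot engine runs), these averages understate the footprint of a typical agentic run.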

Highest Risk Episodes

No episodes carry escalation_eligible = true in the deterministic model. The two highest-concern standalone episodes are flagged by repeated threshold crossings at the workflow level:

Workflow	Runs	Assessment (max severity)	Classification
Smoke Claude	2	`resource_heavy_for_domain:high`, `poor_agentic_control:medium`	`changed` (turns_increase)
Smoke Copilot	2	`resource_heavy_for_domain:medium`, `poor_agentic_control:high`	`stable`

Both workflows appear in the triage domain. Both have a write_heavy actuation style and a heavy resource profile; their execution styles are exploratory (Smoke Claude) and directed (Smoke Copilot). For smoke tests, these signals are structurally suspicious: smoke tests should typically be lean, directed, and read-capable.

Smoke Copilot additionally had a missing-tool event (Serena MCP server: activate_project, find_symbol) in run §23730326985. The tool was unavailable in the environment.

Episode Regressions

Issue Monster shows an unstable turn count across 3 runs (4 → 11 → 5 turns): it rose sharply in the second run and stayed above the first-run level in the third. The two later runs are classified as changed with a turns_increase reason code. The most recent run (§23735859039) also shows a posture_changed code alongside turns_increase, indicating the write posture shifted. Only one run carries poor_agentic_control:medium, so the workflow does not yet cross the escalation threshold, but it is trending in the wrong direction.
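
The report does not state the escalation rule itself. A minimal sketch, assuming a category crosses the threshold when it appears at medium severity or higher in at least two runs of the same workflow, could look like the following. MIN_RUNS, MIN_SEVERITY, ESCALATION_CATEGORIES, and crossing_categories are hypothetical names inferred from the wording of this report, not the kit's actual configuration.

```python
from collections import Counter

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}

# Hypothetical rule parameters, inferred from the report's wording only.
MIN_RUNS = 2
MIN_SEVERITY = "medium"
ESCALATION_CATEGORIES = {"resource_heavy_for_domain", "poor_agentic_control"}


def crossing_categories(runs: list[list[str]]) -> set[str]:
    """Return escalation categories seen at MIN_SEVERITY or above in at least MIN_RUNS runs.

    Each run is a list of "category:severity" labels, as in the tables of this report.
    """
    counts = Counter()
    for labels in runs:
        seen = set()  # count each category at most once per run
        for label in labels:
            category, severity = label.rsplit(":", 1)
            if (
                category in ESCALATION_CATEGORIES
                and SEVERITY_RANK[severity] >= SEVERITY_RANK[MIN_SEVERITY]
            ):
                seen.add(category)
        counts.update(seen)
    return {category for category, n in counts.items() if n >= MIN_RUNS}


# Labels copied from the Smoke Claude and Issue Monster tables.
smoke_claude = [
    ["resource_heavy_for_domain:high", "poor_agentic_control:medium", "partially_reducible:medium"],
    ["resource_heavy_for_domain:high", "poor_agentic_control:medium", "partially_reducible:medium"],
]
issue_monster = [
    [],
    ["poor_agentic_control:medium", "partially_reducible:medium"],
    [],
]

print(crossing_categories(smoke_claude))
print(crossing_categories(issue_monster))
```

Under this assumed rule, Smoke Claude crosses on both categories while Issue Monster crosses on none, which matches the report. Smoke Copilot would cross on resource_heavy_for_domain only, since poor_agentic_control:high appears in one of its two runs, so the real rule may weigh severity differently.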

One low-confidence workflow_run episode was detected for Dev Hawk (2 runs, confidence=low, all skipped/non-agentic). No agent activity; the episode reflects the workflow_run trigger chain between Dev and Dev Hawk, both of which skipped. No risk.

Recommended Actions

Smoke Claude & Smoke Copilot (escalated): Review why these smoke tests are classified in the triage domain with write_heavy / heavy resource profiles. Smoke tests should complete in lean, directed mode. Check whether the prompt or task scope is broader than intended. See linked escalation issue.
Issue Monster (watch): The posture_changed signal on the latest in-progress run is new. If the next run also shows posture_changed, reconsider the prompt or activation criteria before it crosses the escalation threshold.
Serena MCP / Smoke Copilot: The missing activate_project/find_symbol tools signal that the Serena MCP server is not always available in the Actions environment. Consider adding an availability guard or skipping the capability gracefully when the server is absent.
Failed workflow cleanup (informational): Changeset Generator and Smoke Codex each failed in 2 of 2 runs during the period, a 100% failure rate. The Great Escapi failed in 1 of 3 runs, and Schema Feature Coverage Checker and Auto-Triage Issues each failed in their only run. These may need maintenance attention independent of agentic control concerns.
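
The availability guard suggested for the Serena MCP tools could be sketched as follows. This is an illustration under stated assumptions: how the set of available tool names is obtained is not described in this report, and missing_tools and run_symbol_check are hypothetical helpers, not part of Serena or the workflow.

```python
# Tool names taken from the missing-tool event in run 23730326985.
REQUIRED_SERENA_TOOLS = ("activate_project", "find_symbol")


def missing_tools(
    available: set[str],
    required: tuple[str, ...] = REQUIRED_SERENA_TOOLS,
) -> list[str]:
    """Return the required tool names that the environment does not expose."""
    return [name for name in required if name not in available]


def run_symbol_check(available: set[str]) -> str:
    """Run the Serena-dependent step only when every required tool is present."""
    missing = missing_tools(available)
    if missing:
        # Skip gracefully and say why, rather than failing mid-run.
        return f"skipped: Serena MCP tools unavailable ({', '.join(missing)})"
    return "ran: symbol lookup step"


print(run_symbol_check({"activate_project", "find_symbol"}))
print(run_symbol_check({"activate_project"}))
```

Checking once up front keeps the smoke test directed: the agent does not spend turns discovering that the server is absent.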

Per-workflow detail: Smoke Claude

Both runs completed successfully in ~9 minutes with 28–33 turns and ~1.09 M tokens each.

Run	Duration	Turns	Tokens	Cost	Assessment
§23730204448	8.9 min	28	1,097,050	$0.90	`resource_heavy:high`, `poor_control:medium`, `partially_reducible:medium`
§23730326948	9.2 min	33	1,091,220	$0.77	`resource_heavy:high`, `poor_control:medium`, `partially_reducible:medium`

The second run matched a cohort baseline on all 7 dimensions (event, task_domain, execution_style, resource_profile, actuation_style, dispatch_mode, tool_breadth) and was still classified as changed with a turns_increase reason. The resource profile is heavy and the actuation style is write_heavy. The partially_reducible:medium assessment suggests that parts of this workflow could be offloaded to deterministic automation.
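
To compare the two runs on a common footing, the per-run figures in the table can be normalized to cost per million tokens. The sketch below uses only the table values; the helper name is illustrative.

```python
# (run id, tokens, cost in USD), copied from the Smoke Claude table above.
SMOKE_CLAUDE_RUNS = [
    ("23730204448", 1_097_050, 0.90),
    ("23730326948", 1_091_220, 0.77),
]


def cost_per_million_tokens(tokens: int, cost_usd: float) -> float:
    """Normalize a run's cost to USD per one million tokens."""
    return cost_usd / tokens * 1_000_000


for run_id, tokens, cost in SMOKE_CLAUDE_RUNS:
    rate = cost_per_million_tokens(tokens, cost)
    print(f"{run_id}: ${rate:.2f} per 1M tokens")
```

The rates come out at about $0.82 and $0.71 per million tokens. The second run used more turns (33 versus 28) but cost less, so turn count alone does not explain the cost of this workflow.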

Per-workflow detail: Smoke Copilot

Both runs completed successfully, showing 0 tokens (Copilot engine token usage is not tracked via this mechanism). Duration was 6.8–7.0 minutes.

Run	Duration	Assessment
§23730204451	6.8 min	`resource_heavy_for_domain:medium`
§23730326985	7.0 min	`resource_heavy_for_domain:medium`, `poor_agentic_control:high`, missing Serena MCP

Both runs are classified in the triage domain with directed execution style, narrow tool breadth, and write_heavy actuation. agentic_fraction=0 indicates that the AI engine contributed 0 tracked turns, yet the workflow still carries a heavy resource and write-heavy profile. The poor_agentic_control:high on the second run is the highest-severity control assessment in the dataset.

Per-workflow detail: Issue Monster

Three runs over the period, all in the issue_response domain.

Run	Duration	Turns	Status	Classification	Assessments
§23731745737	3.7 min	4	✅	—	—
§23733499223	4.2 min	11	✅	`changed` (turns_increase)	`poor_agentic_control:medium`, `partially_reducible:medium`
§23735859039	2.4 min	5	🔄 in_progress	`changed` (turns_increase, posture_changed)	—

The posture_changed code on the latest run indicates the write posture differs from the baseline. If this persists, it may indicate prompt drift.

Per-workflow detail: Single-run high-resource workflows

These workflows each ran once and carried resource_heavy_for_domain:high assessments. They do not cross escalation thresholds (single run), but are listed here for portfolio awareness.

Workflow	Domain	Duration	Turns	Cost	Assessments
Go Fan	issue_response	8.0 min	44	$1.06	`resource_heavy:high`, `poor_control:medium`, `partially_reducible:medium`
Layout Specification Maintainer	repo_maintenance	8.4 min	43	$0.00	`resource_heavy:high`, `poor_control:medium`, `partially_reducible:medium`
Release	release_ops	18.5 min	15	$0.00	`resource_heavy:high`, `partially_reducible:medium`
Code Simplifier	code_fix	7.1 min	16	$0.00	`resource_heavy:high`, `partially_reducible:medium`
Auto-Triage Issues	triage	5.9 min	20	$0.00	`resource_heavy:high`, `poor_control:medium` — failed

Go Fan and Layout Specification Maintainer are the heaviest single-run consumers by turn count (44 and 43 turns respectively). Go Fan is also the second-costliest workflow at $1.06, behind Smoke Claude ($1.67 across its two runs). The partially_reducible:medium assessment on four of the five workflows (all except Auto-Triage Issues) suggests that the orchestration overhead could be trimmed.

Failed runs summary

Seven runs ended in failure, each with 1 error. No escalation thresholds are crossed by failures alone.

Workflow	Runs	Failure count	Notes
Changeset Generator	2	2 (100%)	Failed in both runs this period
Smoke Codex	2	2 (100%)	Failed in both runs this period
The Great Escapi	3	1	One run failed, two succeeded
Auto-Triage Issues	1	1	Only run this period
Schema Feature Coverage Checker	1	1	Only run this period

Changeset Generator and Smoke Codex have a 100% failure rate for the window. This warrants investigation separately from agentic control concerns.
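
The failure rates above follow directly from the table. A small sketch that reproduces them is shown below; the min_runs cutoff is an assumption used here to separate repeated failures from single-run failures, and is not a parameter of the kit.

```python
# workflow -> (runs, failures), copied from the failed-runs table above.
FAILED_RUNS = {
    "Changeset Generator": (2, 2),
    "Smoke Codex": (2, 2),
    "The Great Escapi": (3, 1),
    "Auto-Triage Issues": (1, 1),
    "Schema Feature Coverage Checker": (1, 1),
}


def failure_rates(table: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return the fraction of failed runs per workflow."""
    return {name: failures / runs for name, (runs, failures) in table.items()}


def always_failing(table: dict[str, tuple[int, int]], min_runs: int = 2) -> list[str]:
    """Return workflows that ran at least min_runs times and failed every time."""
    return [
        name
        for name, (runs, failures) in table.items()
        if runs >= min_runs and failures == runs
    ]


total_failures = sum(failures for _, failures in FAILED_RUNS.values())
print(total_failures)
print(always_failing(FAILED_RUNS))
```

With min_runs set to 2, only Changeset Generator and Smoke Codex are reported, and the failure counts sum to the seven failed runs in the window.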

Optimization candidates (portfolio cleanup)

These workflows appear consistently lean, directed, and narrow, suggesting they may be candidates for deterministic automation or model downgrade. They do not require immediate action.

/cloclo — 5 runs, all skipped (0 tokens), always cohort-matched as stable
Dev / Dev Hawk — all skipped; low-confidence workflow_run chain, no agent activity
Multiple smoke tests (Smoke Gemini, Smoke Agent variants, Smoke Copilot ARM64) — 2 runs each, cohort-stable, minimal or no token usage

References:

§23730204448 — Smoke Claude run 1
§23730326985 — Smoke Copilot run 2 (missing tool)
§23733074302 — Go Fan ($1.06 cost)

AI generated by Agentic Observability Kit

expires on Apr 6, 2026, 8:54 AM UTC
