
Revise blog post draft on GitHub Agentic Workflows#2176

Open
idan wants to merge 1 commit into token-efficiency-paper from idan/optimization-post-copyedits

Conversation

@idan (Contributor) commented Apr 23, 2026

Revised the blog post draft to improve clarity and consistency, including formatting changes for workflow names and refining explanations of token efficiency and optimization processes.

Copilot AI review requested due to automatic review settings April 23, 2026 23:20
@idan idan requested a review from Mossaka as a code owner April 23, 2026 23:20
@idan idan requested review from lpcox and removed request for Mossaka and Copilot April 23, 2026 23:20

**The workload is a live repository.** The workflows we optimize do not operate on fixed benchmark data. A workflow that processes a 200-line PR diff one day genuinely uses more tokens than one processing a 5-line fix a few hours later. The difference is correct behavior, not inefficiency. Raw token counts can therefore conflate workload variation with efficiency changes. We try to normalize for this by tracking LLM API call counts alongside token counts: if the number of LLM turns per run stays constant while tokens per call falls, that's a genuine efficiency improvement; if both fall together, it could mean less work is being done.
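To make the normalization concrete, here is a minimal sketch of the comparison (the record fields and token counts are hypothetical illustrations, not our published analysis scripts):

```python
from statistics import mean

# Hypothetical per-run records: total tokens and LLM API calls per run.
runs_before = [{"tokens": 118_000, "llm_calls": 5}, {"tokens": 97_000, "llm_calls": 5}]
runs_after = [{"tokens": 64_000, "llm_calls": 5}, {"tokens": 58_000, "llm_calls": 5}]

def calls_per_run(runs):
    return mean(r["llm_calls"] for r in runs)

def tokens_per_call(runs):
    return mean(r["tokens"] / r["llm_calls"] for r in runs)

# Stable calls/run with falling tokens/call indicates a genuine efficiency
# gain; if both fall together, the workflow may simply be doing less work.
print(f"calls/run:   {calls_per_run(runs_before):.1f} -> {calls_per_run(runs_after):.1f}")
print(f"tokens/call: {tokens_per_call(runs_before):,.0f} -> {tokens_per_call(runs_after):,.0f}")
```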

**Does quality change?** This is the hardest question. A lighter model running a more constrained workflow might produce lower-quality output. To approximate quality, we looked at process-level signals: output tokens per LLM call, turn counts per run, and tool-call completion rates. For our optimized Smoke Copilot workflow, all three remained stable across the optimization period even as token consumption fell; the workflow completes in exactly 5 LLM turns every run, before and after the optimizations. Of course, these are process signals, not outcome signals. We cannot directly observe whether the quality of agent output improved, degraded, or stayed flat, because we have no ground-truth labels for what "correct" output looks like. Measuring goodput (tokens per unit of correct work) requires additional instrumentation and thought.
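As a rough illustration, the three signals reduce to something like the following sketch (hypothetical record fields; real run logs will differ):

```python
from statistics import mean

# Hypothetical per-run records extracted from workflow logs.
runs = [
    {"llm_calls": 5, "output_tokens": 2100, "tool_calls": 12, "tool_calls_completed": 12},
    {"llm_calls": 5, "output_tokens": 1950, "tool_calls": 11, "tool_calls_completed": 11},
]

def process_signals(runs):
    """Process-level proxies for quality; none of these are outcome labels."""
    return {
        "turns_per_run": mean(r["llm_calls"] for r in runs),
        "output_tokens_per_call": mean(r["output_tokens"] / r["llm_calls"] for r in runs),
        "tool_call_completion_rate": sum(r["tool_calls_completed"] for r in runs)
        / sum(r["tool_calls"] for r in runs),
    }

print(process_signals(runs))
```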
@idan (Contributor, Author):
This seemed like a search/replace typo


The tools we use to optimize our workflows (API-level observability, automated auditing workflows, MCP tool pruning, and CLI substitution) are all available today in the GitHub Agentic Workflows framework. The measurement methodology (workload normalization, effective tokens) is documented in the [Effective Tokens specification](https://github.com/github/gh-aw/blob/main/docs/src/content/docs/reference/effective-tokens-specification.md), and the data and analysis scripts for this study are published on the [`token-efficiency-paper`](https://github.com/github/gh-aw-firewall/tree/token-efficiency-paper) branch.

The open questions are genuinely hard: measuring goodput requires outcome instrumentation that doesn't yet exist at scale for agentic CI workflows. We're building toward it. In the meantime, the proxy-level observability and the optimizer workflows have already changed how we develop and deploy new agentic automations—we add token monitoring from day one rather than retrofitting it later.
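If and when that outcome instrumentation exists, the metric itself is simple. A sketch, where the `correct` label is the hypothetical piece we cannot yet produce at scale:

```python
def goodput(runs):
    """Tokens per unit of correct work. The 'correct' flag must come from
    outcome instrumentation (human review or a downstream check), which is
    exactly what doesn't yet exist at scale for agentic CI workflows."""
    correct_runs = [r for r in runs if r["correct"]]
    if not correct_runs:
        return float("inf")  # every token was spent without producing correct work
    return sum(r["tokens"] for r in runs) / len(correct_runs)
```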
@idan (Contributor, Author):

This also seemed like a typo

@github-actions

Smoke Test Results

  • ❌ GitHub MCP: gh CLI failed (API connectivity limitation)
  • ✅ Playwright: github.com page title verified (contains "GitHub")
  • ✅ File Writing: Test file created and verified
  • ✅ Bash Tool: File read and verified successfully

Overall Status: PARTIAL — 3/4 tests passed; gh CLI unavailable

💥 [THE END] — Illustrated by Smoke Claude


@github-actions

Smoke test report
PR titles:

  • feat: add Gemini engine smoke test workflow
  • chore: upgrade gh-aw to v0.69.3 and recompile workflows

T1 ✅ T2 ❌ T3 ✅ T4 ❌
T5 ✅ T6 ✅ T7 ✅ T8 ✅

Overall status: FAIL

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the `network.allowed` list in your workflow frontmatter:

```yaml
network:
  allowed:
    - defaults
    - "registry.npmjs.org"
```

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex
