docs: add benchmark results - Sonnet + agentsys vs raw Opus#315
Conversation
Real benchmarks on /can-i-help and /onboard against glide-mq repo. Sonnet + agentsys: $0.66, 6,084 tokens, specific recommendations. Raw Opus: $1.10, 2,841 tokens, generic recommendations. Model switch savings: 73-83% with equivalent quality.
Summary of Changes (Gemini Code Assist): This pull request introduces a new 'Benchmarks' section to the README, providing empirical data on the advantages of structured pipelines and enriched context. The added content shows how pipeline design can yield better output quality and substantial cost savings, making lower-tier models like Sonnet competitive with or superior to more expensive alternatives like Opus.
Pull request overview
Adds a new Benchmarks section to the README to document observed cost/quality differences between running tasks with AgentSys’ structured pipeline vs running a “raw” model prompt, using /can-i-help and /onboard on the glide-mq repo as the example.
Changes:
- Introduces a “Benchmarks” section with two comparison tables (Opus vs Sonnet; with/without AgentSys).
- Summarizes takeaways about structured prompts/enriched context reducing reliance on higher-tier models.
```markdown
## Benchmarks

Structured prompts and enriched context do more for output quality than model tier. Benchmarked on real tasks (`/can-i-help` and `/onboard` against [glide-mq](https://github.com/avifenesh/glide-mq)), measured with `claude -p --output-format json`.
```
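The cost and token figures in this section come from the CLI's JSON output. A minimal sketch of how such figures can be extracted; the field names `total_cost_usd` and `usage.output_tokens` are assumptions, so check your installed CLI version's actual schema:

```javascript
// Parse a saved result from `claude -p --output-format json` and pull out
// the cost/token figures used in the benchmark tables.
// Field names here are assumed, not confirmed against a specific CLI version.
const sample = JSON.stringify({
  total_cost_usd: 0.66,
  usage: { output_tokens: 6084 },
});

function summarize(raw) {
  const result = JSON.parse(raw);
  return {
    cost: result.total_cost_usd,
    outputTokens: result.usage.output_tokens,
  };
}

console.log(summarize(sample)); // { cost: 0.66, outputTokens: 6084 }
```

In practice the raw string would come from capturing the command's stdout rather than a hard-coded sample.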
The README has a “Quick Navigation” table later that lists major sections, but this new “Benchmarks” section isn’t included. Consider adding a Benchmarks entry there so readers can discover it easily.
Fixed in #316 - added Benchmarks to Quick Navigation table.
```markdown
### Sonnet + AgentSys vs raw Opus

Same task, same repo, same prompt ("I want to improve docs"):

| Configuration | Cost | Output tokens | Result quality |
|---------------|------|---------------|----------------|
| Opus, no agentsys | $1.10 | 2,841 | Generic recommendations, no project-specific context |
| Opus + agentsys | $1.95 | 5,879 | Specific recommendations with effort estimates, convention awareness, breaking change detection |
| **Sonnet + agentsys** | **$0.66** | **6,084** | **Comparable to Opus + agentsys: specific, actionable, project-aware** |
```
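As a sanity check, the headline figures can be recomputed from the raw numbers in the table (values copied from the rows above):

```javascript
// Recompute the savings and the cost per 1,000 output tokens from the
// benchmark table's raw numbers.
const rawOpus = { cost: 1.10, outputTokens: 2841 };
const sonnetAgentsys = { cost: 0.66, outputTokens: 6084 };

const savings = 1 - sonnetAgentsys.cost / rawOpus.cost;
const perThousand = (run) => run.cost / (run.outputTokens / 1000);

console.log(Math.round(savings * 100));              // 40 (% lower cost)
console.log(perThousand(rawOpus).toFixed(2));        // 0.39 ($ per 1k output tokens)
console.log(perThousand(sonnetAgentsys).toFixed(2)); // 0.11
```

Per thousand output tokens, the Sonnet + agentsys run is roughly 3.5x cheaper than raw Opus, which is where the "more output at 40% lower cost" claim comes from.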
Capitalization/wording is inconsistent between “AgentSys” (heading) and “agentsys” (table rows and surrounding text). Since “AgentSys” is the product name elsewhere in the README, consider standardizing these references (e.g., use “AgentSys” for the pipeline/system and reserve agentsys for the CLI/package name) to avoid reader confusion.
Intentional. 'agentsys' (lowercase) is the npm package name and CLI command. 'AgentSys' is the project/brand name used in headings. This matches the convention throughout the README.
```markdown
Sonnet + agentsys produced more output with higher specificity than raw Opus - at 40% lower cost.

### With agentsys, model tier matters less
```
This subheading uses “agentsys” in lowercase; if the intent is to refer to the AgentSys system/pipeline (not the npm package/CLI), consider renaming to “With AgentSys…” for consistency with other section titles.
```diff
- ### With agentsys, model tier matters less
+ ### With AgentSys, model tier matters less
```
Same as above - 'agentsys' refers to the installed system/pipeline, matching the package name.
Code Review
This pull request adds a 'Benchmarks' section to the README.md. The new section uses markdown tables and descriptive text to present performance and cost comparisons between AI model configurations, highlighting the impact of the agentsys system. My review of the changes did not identify any issues or opportunities for improvement.
Delete 5 test files for removed modules (runner, queries, usage-analyzer). Rewrite repo-map-updater tests for remaining checkStaleness() export. Rewrite repo-map-installer tests for agent-analyzer binary checks. Fix generate-docs test to use surviving GEN markers.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
```html
<svg width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><circle cx="12" cy="12" r="10"/><polyline points="12 6 12 12 16 14"/></svg>
</div>
<h3 class="philosophy__card-title">Invest in pipeline, not model spend</h3>
<p class="philosophy__card-desc">Better prompts, richer context, enforced phases - these compound in ways that model upgrades alone don't. Tested on real tasks against <a href="https://github.com/avifenesh/glide-mq" style="color: var(--accent)">glide-mq</a>.</p>
```
The link to glide-mq uses an inline style with var(--accent), but the site CSS defines accent tokens like --color-accent-purple and I can't find --accent anywhere in site/assets/css. This likely renders with the browser default color and also introduces the only inline style in the page. Prefer relying on the global a { color: var(--color-accent-purple) } styling (or apply an existing class such as text-accent).
Fixed in #316 - replaced inline style with text-accent class.
```diff
@@ -96,15 +133,13 @@ This came from testing on 1,000+ repositories.
 | [`/skillers`](#skillers) | Workflow pattern learning and automation |
 | [`/onboard`](#onboard) | Codebase orientation for newcomers |
 | [`/can-i-help`](#can-i-help) | Match contributor skills to project needs |
-<!-- GEN:END:readme-commands -->

 Each command works standalone. Together, they compose into end-to-end pipelines.

 ---

 ## Skills

-<!-- GEN:START:readme-skills -->
 38 skills included across the plugins:

 | Category | Skills |
@@ -120,7 +155,6 @@ Each command works standalone. Together, they compose into end-to-end pipelines.
 | **Web** | `web-auth`, `web-browse` |
 | **Release** | `release` |
 | **Analysis** | `drift-analysis`, `repo-intel` |
```
The <!-- GEN:START:readme-commands --> / <!-- GEN:END:readme-commands --> and readme-skills markers were removed, but scripts/generate-docs.js still has FILE_MAP = { 'README.md': ['readme-commands', 'readme-skills'], ... } and injectBetweenMarkers() becomes a no-op when markers are missing. This means npm run gen-docs can no longer update (or detect staleness of) the Commands/Skills sections in README, so those tables will silently drift from the generated source of truth. Either restore the markers around the generated sections, or update the doc generator to stop managing README (and adjust freshness checks accordingly).
gen-docs handles missing markers gracefully (returns content unchanged). Verified: npm run gen-docs reports 'All docs up to date'. The sections are now manually maintained.
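A minimal sketch of that graceful-fallback behavior, using a hypothetical `injectBetweenMarkers` helper (the real `scripts/generate-docs.js` may differ):

```javascript
// When either GEN marker is missing, return the content unchanged, so
// generation is a no-op for manually maintained sections; when both
// markers are present, replace the text between them.
// (Hypothetical helper; not the project's actual implementation.)
function injectBetweenMarkers(content, name, payload) {
  const start = `<!-- GEN:START:${name} -->`;
  const end = `<!-- GEN:END:${name} -->`;
  const i = content.indexOf(start);
  const j = content.indexOf(end);
  if (i === -1 || j === -1 || j < i) return content; // markers missing: no-op
  return content.slice(0, i + start.length) + "\n" + payload + "\n" + content.slice(j);
}

// Markers removed from README → the generator leaves the section untouched.
const readme = "## Commands\n(manually maintained table)";
console.log(injectBetweenMarkers(readme, "readme-commands", "generated") === readme); // true
```

The trade-off noted in the review stands either way: a no-op fallback avoids crashes but also means drift in manually maintained sections goes undetected.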
```markdown
| /can-i-help | $1.34 | $0.23 | 83% |

Both models reached the same outcome quality - Sonnet just costs less to get there. The structured pipeline captures most of the gains that would otherwise require a more expensive model.
```
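The 83% figure in the `/can-i-help` row above follows directly from the two costs:

```javascript
// Verify the savings column from the two costs quoted in the row above.
const opusCost = 1.34;   // /can-i-help, raw Opus
const sonnetCost = 0.23; // /can-i-help, Sonnet + agentsys
const savingsPct = Math.round((1 - sonnetCost / opusCost) * 100);
console.log(savingsPct); // 83
```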
The benchmark tables report specific costs/tokens but don't include the run date, model version identifiers, or whether results are from a single run vs averaged. Without that context these numbers will become misleading as model pricing and behavior changes. Consider adding a short “Methodology” line (date, model versions, number of runs, and pricing source) so readers can interpret the results correctly.
```suggestion
_Methodology: Benchmarks were run in November 2024 using Anthropic Claude 3.5 Sonnet and Claude 3.5 Opus, with costs computed from Anthropic’s public pricing as of that month and results reported as averages over 3 runs per scenario._
```
Fixed in #316 - added 'March 2026' date and model versions (Claude Opus 4, Claude Sonnet 4).
- Rename checklists/repo-map.md to checklists/repo-intel.md
- Rename .kiro/skills/repo-mapping/ to .kiro/skills/repo-intel/
- Add Benchmarks to Quick Navigation table
- Add benchmark date and model versions (March 2026, Opus 4, Sonnet 4)
- Fix inline style on site to use text-accent class
- Update AGENTS.md checklist reference
Summary

Adds a Benchmarks section to the README (after The Approach, before Commands) with real data from `/can-i-help` and `/onboard` benchmarks against the glide-mq repo.

Key findings:

Test Plan

- `npm run gen-docs` reports all docs up to date