docs: add benchmark results - Sonnet + agentsys vs raw Opus#315
Conversation
Real benchmarks on /can-i-help and /onboard against glide-mq repo. Sonnet + agentsys: $0.66, 6,084 tokens, specific recommendations. Raw Opus: $1.10, 2,841 tokens, generic recommendations. Model switch savings: 73-83% with equivalent quality.
Summary of Changes (Gemini Code Assist): This pull request introduces a new 'Benchmarks' section to the README, providing empirical data on the advantages of structured pipelines and enriched context. The added content shows how pipeline design can yield better output quality and substantial cost savings, making lower-tier models like Sonnet competitive with or superior to more expensive alternatives like Opus.
Pull request overview
Adds a new Benchmarks section to the README to document observed cost/quality differences between running tasks with AgentSys’ structured pipeline vs running a “raw” model prompt, using /can-i-help and /onboard on the glide-mq repo as the example.
Changes:
- Introduces a “Benchmarks” section with two comparison tables (Opus vs Sonnet; with/without AgentSys).
- Summarizes takeaways about structured prompts/enriched context reducing reliance on higher-tier models.
```markdown
## Benchmarks

Structured prompts and enriched context do more for output quality than model tier. Benchmarked on real tasks (`/can-i-help` and `/onboard` against [glide-mq](https://github.com/avifenesh/glide-mq)), measured with `claude -p --output-format json`.
```
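The cost and token figures in this section come from the CLI's JSON output. A minimal sketch of how such figures can be extracted; the field names `total_cost_usd` and `usage.output_tokens` are assumptions, so check your installed CLI version's actual schema:

```javascript
// Parse a saved result from `claude -p --output-format json` and pull out
// the cost/token figures used in the benchmark tables.
// Field names here are assumed, not confirmed against a specific CLI version.
const sample = JSON.stringify({
  total_cost_usd: 0.66,
  usage: { output_tokens: 6084 },
});

function summarize(raw) {
  const result = JSON.parse(raw);
  return {
    cost: result.total_cost_usd,
    outputTokens: result.usage.output_tokens,
  };
}

console.log(summarize(sample)); // { cost: 0.66, outputTokens: 6084 }
```

In practice the raw string would come from capturing the command's stdout rather than a hard-coded sample.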
The README has a “Quick Navigation” table later that lists major sections, but this new “Benchmarks” section isn’t included. Consider adding a Benchmarks entry there so readers can discover it easily.
Fixed in #316 - added Benchmarks to Quick Navigation table.
```markdown
### Sonnet + AgentSys vs raw Opus

Same task, same repo, same prompt ("I want to improve docs"):

| Configuration | Cost | Output tokens | Result quality |
|---------------|------|---------------|----------------|
| Opus, no agentsys | $1.10 | 2,841 | Generic recommendations, no project-specific context |
| Opus + agentsys | $1.95 | 5,879 | Specific recommendations with effort estimates, convention awareness, breaking change detection |
| **Sonnet + agentsys** | **$0.66** | **6,084** | **Comparable to Opus + agentsys: specific, actionable, project-aware** |
```
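As a sanity check, the headline figures can be recomputed from the raw numbers in the table (values copied from the rows above):

```javascript
// Recompute the savings and the cost per 1,000 output tokens from the
// benchmark table's raw numbers.
const rawOpus = { cost: 1.10, outputTokens: 2841 };
const sonnetAgentsys = { cost: 0.66, outputTokens: 6084 };

const savings = 1 - sonnetAgentsys.cost / rawOpus.cost;
const perThousand = (run) => run.cost / (run.outputTokens / 1000);

console.log(Math.round(savings * 100));              // 40 (% lower cost)
console.log(perThousand(rawOpus).toFixed(2));        // 0.39 ($ per 1k output tokens)
console.log(perThousand(sonnetAgentsys).toFixed(2)); // 0.11
```

Per thousand output tokens, the Sonnet + agentsys run is roughly 3.5x cheaper than raw Opus, which is where the "more output at 40% lower cost" claim comes from.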
Capitalization/wording is inconsistent between “AgentSys” (heading) and “agentsys” (table rows and surrounding text). Since “AgentSys” is the product name elsewhere in the README, consider standardizing these references (e.g., use “AgentSys” for the pipeline/system and reserve agentsys for the CLI/package name) to avoid reader confusion.
Intentional. 'agentsys' (lowercase) is the npm package name and CLI command. 'AgentSys' is the project/brand name used in headings. This matches the convention throughout the README.
```markdown
Sonnet + agentsys produced more output with higher specificity than raw Opus - at 40% lower cost.

### With agentsys, model tier matters less
```
This subheading uses “agentsys” in lowercase; if the intent is to refer to the AgentSys system/pipeline (not the npm package/CLI), consider renaming to “With AgentSys…” for consistency with other section titles.
```diff
- ### With agentsys, model tier matters less
+ ### With AgentSys, model tier matters less
```
Same as above - 'agentsys' refers to the installed system/pipeline, matching the package name.
Code Review
This pull request adds a 'Benchmarks' section to the README.md. The new section uses markdown tables and descriptive text to present performance and cost comparisons between AI model configurations, highlighting the impact of the agentsys system. My review of the changes did not identify any issues or opportunities for improvement.
Delete 5 test files for removed modules (runner, queries, usage-analyzer). Rewrite repo-map-updater tests for remaining checkStaleness() export. Rewrite repo-map-installer tests for agent-analyzer binary checks. Fix generate-docs test to use surviving GEN markers.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
```html
<svg width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><circle cx="12" cy="12" r="10"/><polyline points="12 6 12 12 16 14"/></svg>
</div>
<h3 class="philosophy__card-title">Invest in pipeline, not model spend</h3>
<p class="philosophy__card-desc">Better prompts, richer context, enforced phases - these compound in ways that model upgrades alone don't. Tested on real tasks against <a href="https://github.com/avifenesh/glide-mq" style="color: var(--accent)">glide-mq</a>.</p>
```
The link to glide-mq uses an inline style with var(--accent), but the site CSS defines accent tokens like --color-accent-purple and I can't find --accent anywhere in site/assets/css. This likely renders with the browser default color and also introduces the only inline style in the page. Prefer relying on the global a { color: var(--color-accent-purple) } styling (or apply an existing class such as text-accent).
Fixed in #316 - replaced inline style with text-accent class.
```diff
@@ -96,15 +133,13 @@ This came from testing on 1,000+ repositories.
 | [`/skillers`](#skillers) | Workflow pattern learning and automation |
 | [`/onboard`](#onboard) | Codebase orientation for newcomers |
 | [`/can-i-help`](#can-i-help) | Match contributor skills to project needs |
-<!-- GEN:END:readme-commands -->

 Each command works standalone. Together, they compose into end-to-end pipelines.

 ---

 ## Skills

-<!-- GEN:START:readme-skills -->
 38 skills included across the plugins:

 | Category | Skills |
@@ -120,7 +155,6 @@ Each command works standalone. Together, they compose into end-to-end pipelines.
 | **Web** | `web-auth`, `web-browse` |
 | **Release** | `release` |
 | **Analysis** | `drift-analysis`, `repo-intel` |
```
The <!-- GEN:START:readme-commands --> / <!-- GEN:END:readme-commands --> and readme-skills markers were removed, but scripts/generate-docs.js still has FILE_MAP = { 'README.md': ['readme-commands', 'readme-skills'], ... } and injectBetweenMarkers() becomes a no-op when markers are missing. This means npm run gen-docs can no longer update (or detect staleness of) the Commands/Skills sections in README, so those tables will silently drift from the generated source of truth. Either restore the markers around the generated sections, or update the doc generator to stop managing README (and adjust freshness checks accordingly).
gen-docs handles missing markers gracefully (returns content unchanged). Verified: npm run gen-docs reports 'All docs up to date'. The sections are now manually maintained.
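A minimal sketch of that graceful-fallback behavior, using a hypothetical `injectBetweenMarkers` helper (the real `scripts/generate-docs.js` may differ):

```javascript
// When either GEN marker is missing, return the content unchanged, so
// generation is a no-op for manually maintained sections; when both
// markers are present, replace the text between them.
// (Hypothetical helper; not the project's actual implementation.)
function injectBetweenMarkers(content, name, payload) {
  const start = `<!-- GEN:START:${name} -->`;
  const end = `<!-- GEN:END:${name} -->`;
  const i = content.indexOf(start);
  const j = content.indexOf(end);
  if (i === -1 || j === -1 || j < i) return content; // markers missing: no-op
  return content.slice(0, i + start.length) + "\n" + payload + "\n" + content.slice(j);
}

// Markers removed from README → the generator leaves the section untouched.
const readme = "## Commands\n(manually maintained table)";
console.log(injectBetweenMarkers(readme, "readme-commands", "generated") === readme); // true
```

The trade-off noted in the review stands either way: a no-op fallback avoids crashes but also means drift in manually maintained sections goes undetected.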
```markdown
| /can-i-help | $1.34 | $0.23 | 83% |

Both models reached the same outcome quality - Sonnet just costs less to get there. The structured pipeline captures most of the gains that would otherwise require a more expensive model.
```
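The 83% figure in the `/can-i-help` row above follows directly from the two costs:

```javascript
// Verify the savings column from the two costs quoted in the row above.
const opusCost = 1.34;   // /can-i-help, raw Opus
const sonnetCost = 0.23; // /can-i-help, Sonnet + agentsys
const savingsPct = Math.round((1 - sonnetCost / opusCost) * 100);
console.log(savingsPct); // 83
```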
The benchmark tables report specific costs/tokens but don't include the run date, model version identifiers, or whether results are from a single run vs averaged. Without that context these numbers will become misleading as model pricing and behavior changes. Consider adding a short “Methodology” line (date, model versions, number of runs, and pricing source) so readers can interpret the results correctly.
```suggestion
_Methodology: Benchmarks were run in November 2024 using Anthropic Claude 3.5 Sonnet and Claude 3.5 Opus, with costs computed from Anthropic’s public pricing as of that month and results reported as averages over 3 runs per scenario._
```
Fixed in #316 - added 'March 2026' date and model versions (Claude Opus 4, Claude Sonnet 4).
- Rename checklists/repo-map.md to checklists/repo-intel.md
- Rename .kiro/skills/repo-mapping/ to .kiro/skills/repo-intel/
- Add Benchmarks to Quick Navigation table
- Add benchmark date and model versions (March 2026, Opus 4, Sonnet 4)
- Fix inline style on site to use text-accent class
- Update AGENTS.md checklist reference
Summary

Adds a Benchmarks section to the README (after The Approach, before Commands) with real data from `/can-i-help` and `/onboard` benchmarks against the glide-mq repo.

Key findings:

Test Plan

- `npm run gen-docs` reports all docs up to date