Skip to content

docs: add benchmark results - Sonnet + agentsys vs raw Opus#315

Merged
avifenesh merged 3 commits intomainfrom
feat/readme-benchmarks
Mar 23, 2026
Merged

docs: add benchmark results - Sonnet + agentsys vs raw Opus#315
avifenesh merged 3 commits intomainfrom
feat/readme-benchmarks

Conversation

@avifenesh
Copy link
Collaborator

Summary

Adds a Benchmarks section to the README (after The Approach, before Commands) with real data from /can-i-help and /onboard benchmarks against glide-mq repo.

Key findings:

  • Sonnet + agentsys ($0.66) outperforms raw Opus ($1.10) - more output, more specific, 40% cheaper
  • With agentsys, Sonnet matches Opus quality at 73-83% lower cost
  • Structured pipelines + enriched context matter more than model tier

Test Plan

  • README renders correctly on GitHub
  • npm run gen-docs reports all up to date

Real benchmarks on /can-i-help and /onboard against glide-mq repo.
Sonnet + agentsys: $0.66, 6,084 tokens, specific recommendations.
Raw Opus: $1.10, 2,841 tokens, generic recommendations.
Model switch savings: 73-83% with equivalent quality.
Copilot AI review requested due to automatic review settings March 23, 2026 11:40
@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new 'Benchmarks' section to the README, providing empirical data that highlights the significant advantages of using structured pipelines and enriched context with AI models. The added content demonstrates how strategic pipeline design can lead to better quality outputs and substantial cost savings, making lower-tier models like Sonnet competitive with or superior to more expensive alternatives like Opus.

Highlights

  • Cost-Effectiveness: Sonnet with AgentSys integration demonstrated superior performance compared to raw Opus, delivering more specific output at a 40% lower cost.
  • Model Tier Impact: When utilizing AgentSys, the quality gap between Sonnet and Opus significantly narrowed, with Sonnet achieving comparable results at 73-83% lower cost.
  • Pipeline Importance: The benchmarks indicate that structured pipelines and enriched context are more critical for output quality than simply relying on higher-tier language models.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Benchmarks section to the README to document observed cost/quality differences between running tasks with AgentSys’ structured pipeline vs running a “raw” model prompt, using /can-i-help and /onboard on the glide-mq repo as the example.

Changes:

  • Introduces a “Benchmarks” section with two comparison tables (Opus vs Sonnet; with/without AgentSys).
  • Summarizes takeaways about structured prompts/enriched context reducing reliance on higher-tier models.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +76 to +78
## Benchmarks

Structured prompts and enriched context do more for output quality than model tier. Benchmarked on real tasks (`/can-i-help` and `/onboard` against [glide-mq](https://github.com/avifenesh/glide-mq)), measured with `claude -p --output-format json`.
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README has a “Quick Navigation” table later that lists major sections, but this new “Benchmarks” section isn’t included. Consider adding a Benchmarks entry there so readers can discover it easily.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #316 - added Benchmarks to Quick Navigation table.

Comment on lines +80 to +88
### Sonnet + AgentSys vs raw Opus

Same task, same repo, same prompt ("I want to improve docs"):

| Configuration | Cost | Output tokens | Result quality |
|---------------|------|---------------|----------------|
| Opus, no agentsys | $1.10 | 2,841 | Generic recommendations, no project-specific context |
| Opus + agentsys | $1.95 | 5,879 | Specific recommendations with effort estimates, convention awareness, breaking change detection |
| **Sonnet + agentsys** | **$0.66** | **6,084** | **Comparable to Opus + agentsys: specific, actionable, project-aware** |
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalization/wording is inconsistent between “AgentSys” (heading) and “agentsys” (table rows and surrounding text). Since “AgentSys” is the product name elsewhere in the README, consider standardizing these references (e.g., use “AgentSys” for the pipeline/system and reserve agentsys for the CLI/package name) to avoid reader confusion.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. 'agentsys' (lowercase) is the npm package name and CLI command. 'AgentSys' is the project/brand name used in headings. This matches the convention throughout the README.


Sonnet + agentsys produced more output with higher specificity than raw Opus - at 40% lower cost.

### With agentsys, model tier matters less
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This subheading uses “agentsys” in lowercase; if the intent is to refer to the AgentSys system/pipeline (not the npm package/CLI), consider renaming to “With AgentSys…” for consistency with other section titles.

Suggested change
### With agentsys, model tier matters less
### With AgentSys, model tier matters less

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - 'agentsys' refers to the installed system/pipeline, matching the package name.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a 'Benchmarks' section into the README.md. This new section uses markdown tables and descriptive text to present performance and cost comparisons between different AI model configurations, highlighting the impact of the agentsys system. My review of the changes did not identify any issues or opportunities for improvement.

Delete 5 test files for removed modules (runner, queries, usage-analyzer).
Rewrite repo-map-updater tests for remaining checkStaleness() export.
Rewrite repo-map-installer tests for agent-analyzer binary checks.
Fix generate-docs test to use surviving GEN markers.
Copilot AI review requested due to automatic review settings March 23, 2026 12:06
@avifenesh avifenesh merged commit 5c8495a into main Mar 23, 2026
10 of 11 checks passed
@avifenesh avifenesh deleted the feat/readme-benchmarks branch March 23, 2026 12:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

<svg width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true"><circle cx="12" cy="12" r="10"/><polyline points="12 6 12 12 16 14"/></svg>
</div>
<h3 class="philosophy__card-title">Invest in pipeline, not model spend</h3>
<p class="philosophy__card-desc">Better prompts, richer context, enforced phases - these compound in ways that model upgrades alone don't. Tested on real tasks against <a href="https://github.com/avifenesh/glide-mq" style="color: var(--accent)">glide-mq</a>.</p>
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link to glide-mq uses an inline style with var(--accent), but the site CSS defines accent tokens like --color-accent-purple and I can't find --accent anywhere in site/assets/css. This likely renders with the browser default color and also introduces the only inline style in the page. Prefer relying on the global a { color: var(--color-accent-purple) } styling (or apply an existing class such as text-accent).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #316 - replaced inline style with text-accent class.

Comment on lines 114 to 157
@@ -96,15 +133,13 @@ This came from testing on 1,000+ repositories.
| [`/skillers`](#skillers) | Workflow pattern learning and automation |
| [`/onboard`](#onboard) | Codebase orientation for newcomers |
| [`/can-i-help`](#can-i-help) | Match contributor skills to project needs |
<!-- GEN:END:readme-commands -->

Each command works standalone. Together, they compose into end-to-end pipelines.

---

## Skills

<!-- GEN:START:readme-skills -->
38 skills included across the plugins:

| Category | Skills |
@@ -120,7 +155,6 @@ Each command works standalone. Together, they compose into end-to-end pipelines.
| **Web** | `web-auth`, `web-browse` |
| **Release** | `release` |
| **Analysis** | `drift-analysis`, `repo-intel` |
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The <!-- GEN:START:readme-commands --> / <!-- GEN:END:readme-commands --> and readme-skills markers were removed, but scripts/generate-docs.js still has FILE_MAP = { 'README.md': ['readme-commands', 'readme-skills'], ... } and injectBetweenMarkers() becomes a no-op when markers are missing. This means npm run gen-docs can no longer update (or detect staleness of) the Commands/Skills sections in README, so those tables will silently drift from the generated source of truth. Either restore the markers around the generated sections, or update the doc generator to stop managing README (and adjust freshness checks accordingly).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gen-docs handles missing markers gracefully (returns content unchanged). Verified: npm run gen-docs reports 'All docs up to date'. The sections are now manually maintained.

| /can-i-help | $1.34 | $0.23 | 83% |

Both models reached the same outcome quality - Sonnet just costs less to get there. The structured pipeline captures most of the gains that would otherwise require a more expensive model.

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark tables report specific costs/tokens but don't include the run date, model version identifiers, or whether results are from a single run vs averaged. Without that context these numbers will become misleading as model pricing and behavior changes. Consider adding a short “Methodology” line (date, model versions, number of runs, and pricing source) so readers can interpret the results correctly.

Suggested change
_Methodology: Benchmarks were run in November 2024 using Anthropic Claude 3.5 Sonnet and Claude 3.5 Opus, with costs computed from Anthropic’s public pricing as of that month and results reported as averages over 3 runs per scenario._

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #316 - added 'March 2026' date and model versions (Claude Opus 4, Claude Sonnet 4).

avifenesh added a commit that referenced this pull request Mar 23, 2026
- Rename checklists/repo-map.md to checklists/repo-intel.md
- Rename .kiro/skills/repo-mapping/ to .kiro/skills/repo-intel/
- Add Benchmarks to Quick Navigation table
- Add benchmark date and model versions (March 2026, Opus 4, Sonnet 4)
- Fix inline style on site to use text-accent class
- Update AGENTS.md checklist reference
avifenesh added a commit that referenced this pull request Mar 23, 2026
- Rename checklists/repo-map.md to checklists/repo-intel.md
- Rename .kiro/skills/repo-mapping/ to .kiro/skills/repo-intel/
- Add Benchmarks to Quick Navigation table
- Add benchmark date and model versions (March 2026, Opus 4, Sonnet 4)
- Fix inline style on site to use text-accent class
- Update AGENTS.md checklist reference
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants