A benchmark suite for coding agents. Think pytest, but for evaluating AI assistants.
When you change your AI coding setup—switching models, adjusting prompts, adding MCP servers, or trying new tools—you're flying blind. Did it actually get better? Worse? Hard to say without data.
Sniffbench gives you that data. It runs your coding agent through evaluation tasks, captures your configuration, and measures what matters.
```bash
# Install globally
npm install -g sniffbench

# Or clone and build
git clone https://github.com/answerlayer/sniffbench.git
cd sniffbench && npm install && npm run build && npm link

# Check it's working
sniff --help
sniff doctor
```

```bash
sniff interview
```

This runs your agent through comprehension questions about your codebase. You grade each answer (1-10) to establish baselines.
Every interview automatically:
- Creates a run with a unique ID
- Captures your agent configuration (version, model, MCP servers, tools)
- Auto-links to matching variants (if registered)
```bash
# With an optional label for easy reference
sniff interview --run "baseline"
```

Before making configuration changes, snapshot your current setup:

```bash
sniff variant register "control" --description "Stock Claude Code config"
```

Make your changes (add an MCP server, update CLAUDE.md, etc.), then register the new config:

```bash
sniff variant register "with-linear" --description "Added Linear MCP server"
```

```bash
# Compare two runs
sniff compare baseline after-changes

# Or by run ID
sniff compare run-1734567890-abc123 run-1734567891-def456
```

Shows both the config diff (what changed) and the metrics diff (did it help):
```
Configuration Changes:
  MCP: Linear: none → stdio
  Allowed Tools: none → 1 tools

Case Comparison:
  comp-001: Tokens 10,959 → 8,234 (-25%) ✓
  comp-002: Grade 7/10 → 9/10 ↑

Aggregate Summary:
  Total tokens: 45,000 → 38,000 ↓ -15.6%
  Total cost: $0.52 → $0.44 ↓ -15.4%
```
```bash
sniff interview               # Run comprehension interview
sniff variant register <name> # Snapshot current config
sniff compare <run1> <run2>   # Compare two runs
sniff closed-issues scan      # Find repo issues to use as cases
sniff closed-issues run       # Evaluate agent on real issues
```

See COMMANDS.md for the full reference.
Every run captures:
| Field | Source | Example |
|---|---|---|
| Agent name | CLI detection | `claude-code` |
| Version | `claude --version` | `2.0.55` |
| Model | API response | `claude-sonnet-4-20250514` |
| CLAUDE.md hash | File hash | `8b28a4e5...` |
| MCP servers | `~/.claude.json` | `Linear(stdio)` |
| Allowed tools | `~/.claude.json` | `Bash(osgrep:*)` |
| Permission mode | Settings | `default` |
| Thinking mode | Settings | `enabled` |
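To make the table concrete, here is a rough sketch of a captured-config record as a TypeScript shape. The field names and nesting are assumptions for illustration only, not sniffbench's actual schema.

```ts
// Illustrative only: a captured-config record with the fields from the table above.
// Field names and structure are assumptions, not sniffbench's real schema.
interface CapturedConfig {
  agentName: string;      // e.g. "claude-code", detected from the CLI
  agentVersion: string;   // e.g. "2.0.55", from `claude --version`
  model: string;          // e.g. "claude-sonnet-4-20250514", from the API response
  claudeMdHash: string;   // hash of CLAUDE.md, so prompt changes are detectable
  mcpServers: string[];   // e.g. ["Linear(stdio)"], read from ~/.claude.json
  allowedTools: string[]; // e.g. ["Bash(osgrep:*)"]
  permissionMode: string; // e.g. "default"
  thinkingMode: string;   // e.g. "enabled"
}
```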
| Metric | What it measures |
|---|---|
| `totalTokens` | Total tokens used |
| `inputTokens` | Input/prompt tokens |
| `cacheReadTokens` | Tokens read from cache |
| `cacheWriteTokens` | Tokens written to cache |
| `toolCount` | Number of tool calls |
| `readCount` | Number of file reads |
| `costUsd` | Estimated API cost |
| `explorationRatio` | Read vs. write tool ratio |
| `cacheHitRatio` | Cache efficiency |
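The two ratio metrics are derived from the counters above. As a sketch, one plausible reading is that `explorationRatio` relates read-style tool calls to total tool calls and `cacheHitRatio` relates cache reads to total input tokens; these formulas are assumptions for illustration, not sniffbench's actual definitions.

```ts
// Assumed definitions of the derived metrics, for illustration only.
// sniffbench may compute these differently.
function explorationRatio(readCount: number, toolCount: number): number {
  // Share of tool calls that were reads (exploration) rather than edits/writes.
  return toolCount === 0 ? 0 : readCount / toolCount;
}

function cacheHitRatio(cacheReadTokens: number, inputTokens: number): number {
  // Share of input-side tokens served from the prompt cache.
  const total = cacheReadTokens + inputTokens;
  return total === 0 ? 0 : cacheReadTokens / total;
}
```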
Variants enable scientific A/B testing of agent configurations.
Without variants, you can compare runs, but you can't tell why one performed differently. Variants let you:
- Document what changed: "Added Linear MCP", "Updated CLAUDE.md prompts"
- Auto-link runs: Runs automatically link to matching variants
- Compare configs: See exactly what's different between setups
For true isolation, variants can be packaged as Docker containers with your configuration baked in:
```bash
# Register and build a container image
sniff variant register "control" -d "Stock config" --build

# Make changes to CLAUDE.md, add MCP servers, etc...
# Register the treatment variant
sniff variant register "with-osgrep" -d "Added semantic search" --build

# Run interview in sandboxed container
sniff interview --use-variant control
sniff interview --use-variant with-osgrep

# Compare the runs
sniff compare <control-run> <osgrep-run>
```

Each container includes:
- Claude Code (same version as your host)
- Your CLAUDE.md baked in
- Tool permissions configured
- Complete isolation from host config
Requirements: Docker must be installed for sandboxed execution.
```bash
# 1. Register your baseline config
sniff variant register "control" -d "Stock Claude Code"

# 2. Run some interviews
sniff interview --run "control-test-1"
sniff interview --run "control-test-2"

# 3. Make changes to your setup
# ... add MCP server, update CLAUDE.md, etc ...

# 4. Register the new config
sniff variant register "treatment" -d "Added semantic search"

# 5. Run more interviews (auto-links to "treatment")
sniff interview --run "treatment-test-1"

# 6. Compare!
sniff compare control-test-1 treatment-test-1
```

All data is stored in `.sniffbench/` in your project root:
```
.sniffbench/
├── runs.json       # All runs with results and config
├── variants.json   # Registered variants
└── baselines.json  # Legacy format (auto-migrated)
```
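For orientation, a single entry in runs.json might look roughly like the sketch below. The values echo the examples earlier in this README where possible, but the field names and nesting are illustrative assumptions rather than sniffbench's actual format.

```ts
// Illustrative shape of one entry in .sniffbench/runs.json.
// Field names are assumptions; values echo examples from earlier in this README.
const exampleRun = {
  id: "run-1734567890-abc123", // unique run ID
  label: "baseline",           // optional label passed via --run
  variant: "control",          // auto-linked variant, if one matches
  config: {                    // captured agent configuration (see table above)
    agentName: "claude-code",
    agentVersion: "2.0.55",
    model: "claude-sonnet-4-20250514",
  },
  metrics: {                   // aggregate metrics (see metrics table above)
    totalTokens: 45000,
    costUsd: 0.52,
  },
  cases: [                     // per-case results, graded 1-10
    { id: "comp-001", grade: 7, totalTokens: 10959 },
  ],
};
```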
| Type | Description | Status |
|---|---|---|
| Comprehension | Questions about codebase architecture | ✅ Ready |
| Bootstrap | Common tasks (fix linting, rename symbols) | 🚧 In Progress |
| Closed Issues | Real issues from your repo's history | ✅ Ready |
Sniffbench evaluates agents on behaviors that matter for real-world development:
- Comprehension - Does the agent understand the codebase?
- Efficiency - Does it explore without wasting tokens?
- Accuracy - Are its answers correct and complete?
- Consistency - Does it perform reliably across runs?
See VALUES.md for our full evaluation philosophy.
We welcome contributions! Areas that need work:
- Agent wrappers - Integrate with Cursor, Aider, or other coding agents
- Bootstrap cases - Detection and validation for common tasks
- LLM-judge - Automated answer quality evaluation
- Documentation - Examples, tutorials, case studies
See CONTRIBUTING.md to get started.
MIT - see LICENSE
