Sniffbench

A benchmark suite for coding agents. Think pytest, but for evaluating AI assistants.

What is this?

When you change your AI coding setup—switching models, adjusting prompts, adding MCP servers, or trying new tools—you're flying blind. Did it actually get better? Worse? Hard to say without data.

Sniffbench gives you that data. It runs your coding agent through evaluation tasks, captures your configuration, and measures what matters.

Quick Start

# Install globally
npm install -g sniffbench

# Or clone and build
git clone https://github.com/answerlayer/sniffbench.git
cd sniffbench && npm install && npm run build && npm link

# Check it's working
sniff --help
sniff doctor

Core Workflow

1. Run a Comprehension Interview

sniff interview

This runs your agent through comprehension questions about your codebase. You grade each answer (1-10) to establish baselines.

Every interview automatically:

  • Creates a run with a unique ID
  • Captures your agent configuration (version, model, MCP servers, tools)
  • Auto-links to matching variants (if registered)

# With an optional label for easy reference
sniff interview --run "baseline"

2. Register Variants for A/B Testing

Before making configuration changes, snapshot your current setup:

sniff variant register "control" --description "Stock Claude Code config"

Make your changes (add an MCP server, update CLAUDE.md, etc.), then register the new config:

sniff variant register "with-linear" --description "Added Linear MCP server"

3. Compare Results

# Compare two runs
sniff compare baseline after-changes

# Or by run ID
sniff compare run-1734567890-abc123 run-1734567891-def456

The output shows both a config diff (what changed) and a metrics diff (did it help):

Configuration Changes:
  MCP: Linear: none → stdio
  Allowed Tools: none → 1 tools

Case Comparison:
  comp-001: Tokens 10,959 → 8,234 (-25%) ✓
  comp-002: Grade 7/10 → 9/10 ↑

Aggregate Summary:
  Total tokens: 45,000 → 38,000 ↓ -15.6%
  Total cost: $0.52 → $0.44 ↓ -15.4%
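
The deltas in the aggregate summary are plain relative change. A minimal TypeScript sketch of that arithmetic — percentDelta is a hypothetical helper for illustration, not part of the sniffbench API:

// Illustration only: the percentage deltas above are relative change.
// `percentDelta` is a hypothetical helper, not a sniffbench export.
function percentDelta(before: number, after: number): string {
  const pct = ((after - before) / before) * 100;
  return `${pct < 0 ? "↓" : "↑"} ${pct.toFixed(1)}%`;
}

console.log(percentDelta(45_000, 38_000)); // ↓ -15.6%
console.log(percentDelta(0.52, 0.44));     // ↓ -15.4%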

Commands

sniff interview                 # Run comprehension interview
sniff variant register <name>   # Snapshot current config
sniff compare <run1> <run2>     # Compare two runs
sniff closed-issues scan        # Find repo issues to use as cases
sniff closed-issues run         # Evaluate agent on real issues

See COMMANDS.md for the full reference.

What Gets Captured

Agent Configuration (Automatic)

Every run captures:

| Field           | Source           | Example                  |
| --------------- | ---------------- | ------------------------ |
| Agent name      | CLI detection    | claude-code              |
| Version         | claude --version | 2.0.55                   |
| Model           | API response     | claude-sonnet-4-20250514 |
| CLAUDE.md hash  | File hash        | 8b28a4e5...              |
| MCP servers     | ~/.claude.json   | Linear(stdio)            |
| Allowed tools   | ~/.claude.json   | Bash(osgrep:*)           |
| Permission mode | Settings         | default                  |
| Thinking mode   | Settings         | enabled                  |

Behavior Metrics (Per Case)

| Metric           | What it measures             |
| ---------------- | ---------------------------- |
| totalTokens      | Total tokens used            |
| inputTokens      | Input/prompt tokens          |
| cacheReadTokens  | Tokens read from cache       |
| cacheWriteTokens | Tokens written to cache      |
| toolCount        | Number of tool calls         |
| readCount        | Number of file reads         |
| costUsd          | Estimated API cost           |
| explorationRatio | Read vs. write tool ratio    |
| cacheHitRatio    | Cache efficiency             |
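
To make this concrete, here is a rough TypeScript sketch of what a single run record could contain, based on the two tables above. The type names, derived-ratio formulas, and overall shape are illustrative assumptions, not sniffbench's actual schema:

// Illustrative sketch only — field names follow the tables above;
// the real runs.json schema may differ.
interface AgentConfig {
  agentName: string;        // e.g. "claude-code"
  version: string;          // e.g. "2.0.55"
  model: string;            // e.g. "claude-sonnet-4-20250514"
  claudeMdHash: string;     // hash of CLAUDE.md
  mcpServers: string[];     // e.g. ["Linear(stdio)"]
  allowedTools: string[];   // e.g. ["Bash(osgrep:*)"]
  permissionMode: string;   // e.g. "default"
  thinkingMode: string;     // e.g. "enabled"
}

interface CaseMetrics {
  totalTokens: number;
  inputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  toolCount: number;
  readCount: number;
  costUsd: number;
  explorationRatio: number; // assumed: readCount relative to total tool calls
  cacheHitRatio: number;    // assumed: cacheReadTokens / (cacheReadTokens + inputTokens)
}

interface RunRecord {
  id: string;                         // e.g. "run-1734567890-abc123"
  label?: string;                     // e.g. "baseline" (from --run)
  variant?: string;                   // auto-linked variant name, if any
  config: AgentConfig;
  cases: Record<string, CaseMetrics>; // keyed by case ID, e.g. "comp-001"
}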

Variant System

Variants enable controlled A/B testing of agent configurations.

Why Variants?

Without variants, you're comparing runs but don't know why one performed differently. Variants let you:

  1. Document what changed: "Added Linear MCP", "Updated CLAUDE.md prompts"
  2. Auto-link runs: Runs automatically link to matching variants (see the sketch after this list)
  3. Compare configs: See exactly what's different between setups
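
A rough TypeScript sketch of the auto-linking idea: a run is linked to whichever registered variant has a matching configuration fingerprint. The hashing approach and function names are assumptions for illustration, not sniffbench's actual implementation:

// Illustration of auto-linking: match a run's captured config against
// registered variants. Hashing JSON like this is a simplification (it
// assumes stable key order) and is not sniffbench's real matching logic.
import { createHash } from "node:crypto";

function fingerprint(config: object): string {
  return createHash("sha256").update(JSON.stringify(config)).digest("hex");
}

function linkRunToVariant(
  runConfig: object,
  variants: { name: string; config: object }[]
): string | undefined {
  const runFp = fingerprint(runConfig);
  return variants.find((v) => fingerprint(v.config) === runFp)?.name;
}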

Sandboxed Variant Execution

For true isolation, variants can be packaged as Docker containers with your configuration baked in:

# Register and build a container image
sniff variant register "control" -d "Stock config" --build

# Make changes to CLAUDE.md, add MCP servers, etc...

# Register the treatment variant
sniff variant register "with-osgrep" -d "Added semantic search" --build

# Run interview in sandboxed container
sniff interview --use-variant control
sniff interview --use-variant with-osgrep

# Compare the runs
sniff compare <control-run> <osgrep-run>

Each container includes:

  • Claude Code (same version as your host)
  • Your CLAUDE.md baked in
  • Tool permissions configured
  • Complete isolation from host config

Requirements: Docker must be installed for sandboxed execution.

Workflow Example (Without Containers)

# 1. Register your baseline config
sniff variant register "control" -d "Stock Claude Code"

# 2. Run some interviews
sniff interview --run "control-test-1"
sniff interview --run "control-test-2"

# 3. Make changes to your setup
# ... add MCP server, update CLAUDE.md, etc ...

# 4. Register the new config
sniff variant register "treatment" -d "Added semantic search"

# 5. Run more interviews (auto-links to "treatment")
sniff interview --run "treatment-test-1"

# 6. Compare!
sniff compare control-test-1 treatment-test-1

Storage

All data is stored in .sniffbench/ in your project root:

.sniffbench/
├── runs.json       # All runs with results and config
├── variants.json   # Registered variants
└── baselines.json  # Legacy format (auto-migrated)
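
Because everything is plain JSON, you can inspect it with any tooling. For example, a quick TypeScript script to list stored runs — the shape assumed here (an array of records with id and label) is an illustration; adjust it to match the actual file:

// List stored runs. The exact structure of runs.json is an assumption;
// adapt the field names to whatever the file actually contains.
import { readFileSync } from "node:fs";
import { join } from "node:path";

const runsPath = join(process.cwd(), ".sniffbench", "runs.json");
const runs: Array<{ id: string; label?: string }> = JSON.parse(
  readFileSync(runsPath, "utf8")
);

for (const run of runs) {
  console.log(`${run.id}${run.label ? ` (${run.label})` : ""}`);
}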

Case Types

| Type          | Description                                | Status         |
| ------------- | ------------------------------------------ | -------------- |
| Comprehension | Questions about codebase architecture      | ✅ Ready       |
| Bootstrap     | Common tasks (fix linting, rename symbols) | 🚧 In Progress |
| Closed Issues | Real issues from your repo's history       | ✅ Ready       |

What We Measure

Sniffbench evaluates agents on behaviors that matter for real-world development:

  1. Comprehension - Does the agent understand the codebase?
  2. Efficiency - Does it explore without wasting tokens?
  3. Accuracy - Are its answers correct and complete?
  4. Consistency - Does it perform reliably across runs?

See VALUES.md for our full evaluation philosophy.

Contributing

We welcome contributions! Areas that need work:

  • Agent wrappers - Integrate with Cursor, Aider, or other coding agents
  • Bootstrap cases - Detection and validation for common tasks
  • LLM-judge - Automated answer quality evaluation
  • Documentation - Examples, tutorials, case studies

See CONTRIBUTING.md to get started.

License

MIT - see LICENSE
