Adversarial benchmarking for AI agent skills. We measure whether a skill makes an agent perform better than the same agent without it.
No vibes. No star counts. Deterministic, reproducible measurement.
For each skill, two agents solve the same tasks:
- With-skill — agent has the skill's SKILL.md loaded
- Without-skill — baseline agent with no skill guidance
Both sides are graded by the same deterministic script — never by an LLM. If the skill doesn't consistently beat the baseline, it's not adding value.
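To illustrate the idea, here is a minimal sketch of grading both sides with the same deterministic assertions. It is not SkillBench's actual grader: the `Assertion` type, the `grade` and `compare` helpers, and the sample assertions are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical assertion type: a name plus a pure predicate over the output text.
@dataclass(frozen=True)
class Assertion:
    name: str
    check: Callable[[str], bool]


def grade(output: str, assertions: list[Assertion]) -> float:
    """Return the fraction of assertions that pass (0.0 if there are none)."""
    if not assertions:
        return 0.0
    passed = sum(1 for a in assertions if a.check(output))
    return passed / len(assertions)


def compare(with_skill: str, without_skill: str, assertions: list[Assertion]) -> dict:
    """Grade both sides with the same assertions and report who wins."""
    with_score = grade(with_skill, assertions)
    without_score = grade(without_skill, assertions)
    return {
        "with_skill": with_score,
        "without_skill": without_score,
        "skill_wins": with_score > without_score,
    }


# Illustrative assertions only; real eval cases would define their own.
ASSERTIONS = [
    Assertion("mentions_e_value", lambda out: "e-value" in out.lower()),
    Assertion("non_empty", lambda out: bool(out.strip())),
]

result = compare("Top hit E-value: 1e-50", "Top hit found", ASSERTIONS)
print(result)
```

Because every check is a pure function of the output text, running the grader twice on the same outputs always gives the same result, regardless of which side is being evaluated.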
| Dimension | Points | What it measures |
|---|---|---|
| Correctness | /40 | % of deterministic assertions passing (with-skill side) |
| Security | /20 | Static analysis — hardcoded keys, injection, exfiltration |
| Completeness | /20 | % of eval cases where with-skill produced output |
| Robustness | /20 | % of eval cases where with-skill beats without-skill |
75+ = Recommended | 50-74 = Acceptable | <50 = Needs Improvement
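The arithmetic behind the table can be sketched as follows. This assumes each percentage scales linearly to its point budget and that the security scan yields a value between 0 and 20; the source does not specify either detail, and the function names `total_score` and `tier` are hypothetical.

```python
def total_score(
    correct_frac: float,
    security_points: float,
    complete_frac: float,
    robust_frac: float,
) -> float:
    """Combine the four dimensions into a score out of 100.

    Assumption: fractions are in [0, 1] and scale linearly to 40/20/20 points;
    security_points is already on a 0-20 scale.
    """
    for name, frac in (
        ("correct_frac", correct_frac),
        ("complete_frac", complete_frac),
        ("robust_frac", robust_frac),
    ):
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {frac}")
    if not 0.0 <= security_points <= 20.0:
        raise ValueError(f"security_points must be between 0 and 20, got {security_points}")
    return 40 * correct_frac + security_points + 20 * complete_frac + 20 * robust_frac


def tier(score: float) -> str:
    """Map a score to the rating bands listed above."""
    if score >= 75:
        return "Recommended"
    if score >= 50:
        return "Acceptable"
    return "Needs Improvement"


# 75% of assertions pass, 18/20 security, output in every case, wins half the cases.
score = total_score(0.75, 18, 1.0, 0.5)
print(score, tier(score))  # 78.0 Recommended
```

Note that a skill with perfect correctness, security, and completeness still loses Robustness points unless it beats the baseline, which is what makes the comparison adversarial.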
pip install -e .
skillbench scan <skill-folder> # Security scan only
skillbench run <skill-folder> # Set up benchmark round
skillbench grade <round-dir> # Grade completed round
skillbench report <round-dir> # Generate HTML + JSON report
skillbench run <skill-folder> --evals evals/biomedical/blast-alignment.yaml
Pre-built eval libraries for biomedical/bioinformatics skills:
- `evals/biomedical/blast-alignment.yaml`: BLAST search and E-value interpretation
- `evals/biomedical/variant-calling.yaml`: VCF interpretation and variant filtering
- `evals/biomedical/drug-interaction.yaml`: Drug-drug interactions and pharmacokinetics
- `evals/biomedical/protein-folding.yaml`: PDB structure and AlphaFold queries
- `evals/biomedical/differential-expression.yaml`: DESeq2 workflow and result interpretation
LLM self-grading hallucinates. When we let LLM agents grade their own outputs, they fabricated quality assessments — inventing metrics they never computed and grading identical outputs differently depending on which "side" they were evaluating. This is why SkillBench uses deterministic grading only.
| Repo | Skills | Domain |
|---|---|---|
| OpenClaw-Medical-Skills | 869 | Clinical, genomics, drug discovery |
| ClawBio | 39 | Bioinformatics, local-first |
| claude-scientific-skills | 170 | 250+ databases, PubMed/UniProt/PDB |
| LabClaw | 144 | Genomics, proteomics, clinical |
MIT