SkillBench

Adversarial benchmarking for AI agent skills. We measure whether an agent performs measurably better with a skill than without it.

No vibes. No star counts. Deterministic, reproducible measurement.

How it works

For each skill, two agents solve the same tasks:

With-skill — agent has the skill's SKILL.md loaded
Without-skill — baseline agent with no skill guidance

Both sides are graded by the same deterministic script — never by an LLM. If the skill doesn't consistently beat the baseline, it's not adding value.
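The paired comparison above can be sketched in a few lines of Python. This is a minimal illustration, not SkillBench's actual grader: the `CaseResult` type, the `grade_case` function, and the example assertions are all hypothetical names chosen for this sketch. The only point it demonstrates is that both outputs go through the identical list of deterministic checks.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a deterministic check that takes an agent's output
# and returns True or False. No LLM is involved at any point.
Assertion = Callable[[str], bool]


@dataclass
class CaseResult:
    with_passed: int      # assertions passed by the with-skill output
    without_passed: int   # assertions passed by the baseline output
    total: int            # number of assertions in the eval case


def grade_case(with_output: str, without_output: str,
               assertions: list[Assertion]) -> CaseResult:
    """Apply the same assertions, in the same order, to both sides."""
    with_passed = sum(1 for check in assertions if check(with_output))
    without_passed = sum(1 for check in assertions if check(without_output))
    return CaseResult(with_passed, without_passed, len(assertions))


# Illustrative assertions for an imagined BLAST-style eval case.
assertions = [
    lambda out: "E-value" in out,
    lambda out: "1e-5" in out,
]

result = grade_case(
    with_output="Hits filtered at E-value < 1e-5",
    without_output="Here are some hits",
    assertions=assertions,
)
print(result)
```

Because the checks are plain functions of the output text, re-running the grader on the same outputs always gives the same counts.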

Scoring (0-100)

Dimension	Points	What it measures
Correctness	/40	% of deterministic assertions passing (with-skill side)
Security	/20	Static analysis — hardcoded keys, injection, exfiltration
Completeness	/20	% of eval cases where with-skill produced output
Robustness	/20	% of eval cases where with-skill beats without-skill

75+ = Recommended | 50-74 = Acceptable | <50 = Needs Improvement
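As a rough sketch of how the four dimensions could combine into the 0-100 score, the function below weights each rate by the points in the table. Two things are assumptions here rather than documented behavior: the security scan is taken to yield a point value between 0 and 20 directly (the README does not say how findings map to points), and any score below 75 but at least 50 is treated as Acceptable. The names `composite_score` and `verdict` are hypothetical.

```python
def composite_score(correctness: float, security_points: float,
                    completeness: float, robustness: float) -> float:
    """Combine the four dimensions into a 0-100 score.

    correctness, completeness, robustness: rates in [0, 1].
    security_points: assumed to be already on the 0-20 scale.
    """
    for name, rate in (("correctness", correctness),
                       ("completeness", completeness),
                       ("robustness", robustness)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be a rate between 0 and 1")
    if not 0 <= security_points <= 20:
        raise ValueError("security_points must be between 0 and 20")
    return 40 * correctness + security_points + 20 * completeness + 20 * robustness


def verdict(score: float) -> str:
    """Map a score to the bands listed above."""
    if score >= 75:
        return "Recommended"
    if score >= 50:
        return "Acceptable"
    return "Needs Improvement"


# 75% of assertions pass, clean scan, every case produced output,
# and the skill beat the baseline on half the cases.
score = composite_score(0.75, 20, 1.0, 0.5)
print(score, verdict(score))  # 80.0 Recommended
```

Note that Correctness and Completeness look only at the with-skill side; Robustness is the one dimension that depends on the baseline.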

Quick start

pip install -e .
skillbench scan <skill-folder>          # Security scan only
skillbench run <skill-folder>           # Set up benchmark round
skillbench grade <round-dir>            # Grade completed round
skillbench report <round-dir>           # Generate HTML + JSON report
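To give a sense of what a static security scan for hardcoded keys, injection, and exfiltration might look like, here is a small regex-based sketch. It is not the implementation behind `skillbench scan`: the patterns, the `PATTERNS` table, and the `scan_text` function are illustrative assumptions, and a real scanner would need far broader rules.

```python
import re

# Illustrative patterns only; the rules used by the real scanner
# are not documented in this README.
PATTERNS = {
    "hardcoded_key": re.compile(
        r"\b(api[_-]?key|secret|token)\b\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]",
        re.IGNORECASE,
    ),
    "shell_injection": re.compile(r"\bos\.system\(|shell\s*=\s*True"),
    "exfiltration": re.compile(r"\bcurl\b[^\n]*\s(-d|--data)\b"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line number, finding name) pairs for every pattern hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


sample = (
    'API_KEY = "sk_live_0123456789abcdef"\n'
    "subprocess.run(cmd, shell=True)\n"
)
print(scan_text(sample))  # [(1, 'hardcoded_key'), (2, 'shell_injection')]
```

Like the grader, a scan of this kind is deterministic: the same skill folder always produces the same findings.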

With eval cases

skillbench run <skill-folder> --evals evals/biomedical/blast-alignment.yaml

Biomedical eval cases

Pre-built eval libraries for biomedical/bioinformatics skills:

evals/biomedical/blast-alignment.yaml — BLAST search and E-value interpretation
evals/biomedical/variant-calling.yaml — VCF interpretation and variant filtering
evals/biomedical/drug-interaction.yaml — Drug-drug interactions and pharmacokinetics
evals/biomedical/protein-folding.yaml — PDB structure and AlphaFold queries
evals/biomedical/differential-expression.yaml — DESeq2 workflow and result interpretation

Key discovery

LLM self-grading hallucinates. When we let LLM agents grade their own outputs, they fabricated quality assessments — inventing metrics they never computed and grading identical outputs differently depending on which "side" they were evaluating. This is why SkillBench uses deterministic grading only.

Target skill repos

Repo	Skills	Domain
OpenClaw-Medical-Skills	869	Clinical, genomics, drug discovery
ClawBio	39	Bioinformatics, local-first
claude-scientific-skills	170	250+ databases, PubMed/UniProt/PDB
LabClaw	144	Genomics, proteomics, clinical

License

MIT
