boheling/skillbench

SkillBench

Adversarial benchmarking for AI agent skills. We measure whether an agent with a skill loaded performs measurably better than the same agent without it.

No vibes. No star counts. Deterministic, reproducible measurement.

How it works

For each skill, two agents solve the same tasks:

  1. With-skill — agent has the skill's SKILL.md loaded
  2. Without-skill — baseline agent with no skill guidance

Both sides are graded by the same deterministic script — never by an LLM. If the skill doesn't consistently beat the baseline, it's not adding value.
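As a rough illustration of the paired comparison described above, the sketch below grades a with-skill output and a without-skill output using the same set of deterministic checks. The names `grade`, `pass_rate`, `skill_wins`, and the example assertions are hypothetical and are not taken from the SkillBench codebase.

```python
from typing import Callable, Dict

# Hypothetical assertion type: a named, pure check on an agent's output text.
Assertion = Callable[[str], bool]


def grade(output: str, assertions: Dict[str, Assertion]) -> Dict[str, bool]:
    """Run every assertion on one output. No LLM is involved, so the
    same output always produces the same result."""
    return {name: check(output) for name, check in assertions.items()}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of assertions that passed (0.0 when there are none)."""
    return sum(results.values()) / len(results) if results else 0.0


def skill_wins(with_skill: str, without_skill: str,
               assertions: Dict[str, Assertion]) -> bool:
    """True only if the with-skill output strictly beats the baseline."""
    return (pass_rate(grade(with_skill, assertions))
            > pass_rate(grade(without_skill, assertions)))


# Illustrative assertions for a BLAST-style task (assumed, not from the repo).
example_assertions: Dict[str, Assertion] = {
    "non_empty": lambda out: bool(out.strip()),
    "mentions_e_value": lambda out: "e-value" in out.lower(),
}
```

Because both sides pass through the same `grade` function, identical outputs can never receive different grades, which is the property the deterministic design is meant to guarantee.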

Scoring (0-100)

Dimension    | Points | What it measures
Correctness  | /40    | % of deterministic assertions passing (with-skill side)
Security     | /20    | Static analysis: hardcoded keys, injection, exfiltration
Completeness | /20    | % of eval cases where with-skill produced output
Robustness   | /20    | % of eval cases where with-skill beats without-skill

75+ = Recommended | 50-74 = Acceptable | <50 = Needs Improvement
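The weighting above can be sketched as a small function. This is a minimal sketch under two assumptions: each dimension is supplied as a fraction between 0 and 1 (the README does not say how the Security points are derived from scan findings), and non-integer totals below 75 fall into the lower band. The function names are hypothetical.

```python
def total_score(correctness: float, security: float,
                completeness: float, robustness: float) -> float:
    """Combine four fractions in [0, 1] into a 0-100 score using the
    40/20/20/20 weights from the scoring table."""
    for value in (correctness, security, completeness, robustness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each dimension must be a fraction in [0, 1]")
    return 40 * correctness + 20 * security + 20 * completeness + 20 * robustness


def rating(score: float) -> str:
    """Map a 0-100 score to the three rating bands."""
    if score >= 75:
        return "Recommended"
    if score >= 50:
        return "Acceptable"
    return "Needs Improvement"


# 75% of assertions pass, clean scan, output on every case, wins half the cases:
# 30 + 20 + 20 + 10 = 80
example = total_score(0.75, 1.0, 1.0, 0.5)
```

Note that a skill with perfect correctness, security, and completeness still tops out at 80 if it never beats the baseline, so Robustness is what separates a skill that helps from one that merely does no harm.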

Quick start

pip install -e .
skillbench scan <skill-folder>          # Security scan only
skillbench run <skill-folder>           # Set up benchmark round
skillbench grade <round-dir>            # Grade completed round
skillbench report <round-dir>           # Generate HTML + JSON report
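To give a feel for what a static security scan of this kind might look for, here is a minimal sketch that flags the three categories named in the scoring table (hardcoded keys, injection, exfiltration) with regular expressions. The patterns and the `scan_text` function are illustrative assumptions only; they are not the rules that `skillbench scan` actually applies.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; a real scanner would need a much broader rule set.
PATTERNS = {
    # e.g. api_key = "abcd1234abcd1234abcd"
    "hardcoded_key": re.compile(
        r"""(?i)(api[_-]?key|secret|token)\s*[:=]\s*['"][A-Za-z0-9_\-]{16,}['"]"""
    ),
    # e.g. os.system(cmd) or subprocess.run(cmd, shell=True)
    "shell_injection": re.compile(
        r"os\.system\(|subprocess\.[a-z_]+\([^)]*shell\s*=\s*True"
    ),
    # e.g. curl https://example.com -d @secrets.txt
    "exfiltration": re.compile(r"curl\s+[^\n]*(-d|--data)\s+@"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, finding_name) pairs for every pattern match."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Pattern matching like this is deterministic, so scanning the same skill folder twice yields the same findings, in keeping with the rest of the grading pipeline.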

With eval cases

skillbench run <skill-folder> --evals evals/biomedical/blast-alignment.yaml

Biomedical eval cases

Pre-built eval libraries for biomedical/bioinformatics skills:

  • evals/biomedical/blast-alignment.yaml — BLAST search and E-value interpretation
  • evals/biomedical/variant-calling.yaml — VCF interpretation and variant filtering
  • evals/biomedical/drug-interaction.yaml — Drug-drug interactions and pharmacokinetics
  • evals/biomedical/protein-folding.yaml — PDB structure and AlphaFold queries
  • evals/biomedical/differential-expression.yaml — DESeq2 workflow and result interpretation

Key discovery

LLM self-grading hallucinates. When we let LLM agents grade their own outputs, they fabricated quality assessments — inventing metrics they never computed and grading identical outputs differently depending on which "side" they were evaluating. This is why SkillBench uses deterministic grading only.

Target skill repos

Repo                     | Skills | Domain
OpenClaw-Medical-Skills  | 869    | Clinical, genomics, drug discovery
ClawBio                  | 39     | Bioinformatics, local-first
claude-scientific-skills | 170    | 250+ databases, PubMed/UniProt/PDB
LabClaw                  | 144    | Genomics, proteomics, clinical

License

MIT
