ESLint + Jest + npm publish β for AI Agent Tools.
Build, test, scan, and ship tools across OpenClaw/ClawHub, Claude Code, Cursor, and Windsurf β from a single CLI.
13,000+ tools are published on ClawHub. 13% contain critical security flaws (Snyk ToxicTools Report, Feb 2026). Tools break silently on platforms other than the one they were tested on. There is no pytest for agent tools β until now.
toolmark init my-tool --template github-api
toolmark test # LLM-as-judge evaluation
toolmark scan # prompt injection, dynamic fetch, credential leaks
toolmark compat # check all 4 platforms at once
toolmark publish # sign with Ed25519, push to ClawHub + Claude Code
pip install toolmarkRequires Python 3.12+.
# 1. Scaffold
toolmark init my-github-tool --template github-api
# 2. Edit tool.md and tests/
cd my-github-tool
# 3. Test
ANTHROPIC_API_KEY=sk-ant-... toolmark test
# 4. Scan
toolmark scan
# 5. Check platform compatibility
toolmark compat
# 6. Publish
toolmark publish --platforms clawhub,claude-code| Command | What it does |
|---|---|
toolmark init |
Scaffold a new tool from a template |
toolmark test |
LLM-as-judge evaluation against YAML test cases |
toolmark scan |
Security scanner (prompt injection, dynamic fetch, creds) |
toolmark compat |
Cross-platform compatibility check (4 platforms) |
toolmark bench |
Benchmark latency, tokens, compute quality score (0β100) |
toolmark publish |
Sign with Ed25519, publish to configured registries |
toolmark init my-tool --template github-api # GitHub REST API wrapper
toolmark init my-tool --template file-ops # Local filesystem tool
toolmark init my-tool --template mcp-integration # Wraps an MCP server tool
toolmark init my-tool --template web-search # Search API tool
toolmark init my-tool --template loom-query # Loom knowledge graph tool
toolmark init my-tool --template blank # Minimal scaffold# tests/test_search.yaml
- id: search_open_prs
input: "find my open pull requests"
expect_invoked: true
expect_tool: search_pull_requests
expect_params:
state: open
assignee: "@me"
tolerance: fuzzy # strict | fuzzy | invoked
tags: [smoke]Run: toolmark test --tags smoke
toolmark catches:
- SF001 β Dynamic fetch (
curl | bash,eval(fetch(...))) - SF002 β Hardcoded credentials (API keys, passwords)
- SF003 β Prompt injection phrases in tool descriptions
- SF004 β Undeclared network endpoints
- SNYK-* β 138 rules via Snyk agent-scan (if installed)
Every published tool is signed with Ed25519:
toolmark keygen # creates ~/.toolmark/signing.key
toolmark publish --sign # signs + publishes
toolmark verify my-tool # verify any published toolEvery toolmark init project includes a ready-to-use workflow:
# .github/workflows/toolmark.yml β already in your project
- toolmark compat # platform check
- toolmark scan # security gate
- toolmark test # LLM evaluation (needs ANTHROPIC_API_KEY secret)See how your tool ranks: toolmark.dev/leaderboard
Quality Score = test pass rate (50%) + security score (30%) + compat score (20%).
-
initβ scaffold with 6 templates -
testβ LLM-as-judge evaluation -
scanβ built-in security rules + Snyk integration -
compatβ 4-platform compatibility matrix -
benchβ composite quality score -
publishβ Ed25519 signing + ClawHub -
watchβ re-run tests on save - VS Code extension
- Rust benchmark runner
- Claude Code + Cursor + Windsurf publish
See CONTRIBUTING.md. We always have good first issues.
MIT β see LICENSE.