OpenTest is a set of AI Agent Skills that give AI coding assistants the ability to automatically test software — APIs, web frontends, accessibility, UX, performance, and security — with zero manual test writing. Skills are loaded by the agent and execute entirely within a conversation.
Supported agents: Claude Code · GitHub Copilot · Antigravity
User: "Run E2E tests on https://staging.myapp.com"
Agent: [opens browser → discovers pages & flows → executes scenarios → saves report]
Found 14 pages. Ran 18 scenarios: ✅ 15 passed, ❌ 3 failed.
Report saved to .opentest-workspace/e2e/report-2026-03-18.md
User: "Test my API at https://api.myapp.com"
Agent: [parses spec → generates plan → runs Schemathesis → Score: 78/100 (C) + findings]
- Zero test writing — Claude discovers flows and executes scenarios automatically
- Actionable findings — every issue linked to a specific rule ID with root cause and fix recommendation
- Scored results — not just pass/fail; you get a grade, severity breakdown, and top priorities
- Minimal setup — point Claude at a URL or spec and get results in one conversation
- Full coverage report — gaps and untested areas are surfaced, not hidden
| Skill | Trigger | What It Does |
|---|---|---|
| api-test | "test my API", "analyze this spec" | Parse OpenAPI/Postman spec → generate test plan → run Schemathesis → scored report |
| e2e-test | "test my web app", "run E2E tests" | Open browser → discover pages & flows → execute scenarios → save findings report |
| a11y-test | "check accessibility", "WCAG audit" | Run axe-core via Playwright → map violations to WCAG → prioritized fix list |
| ux-test | "review my UX", "audit my UI" | Browse app → evaluate against Nielsen's 10 heuristics → actionable UX report |
| perf-test | "load test", "stress test", "benchmark my API" | Generate k6 script → run load scenario → p50/p95/p99 metrics + threshold report |
| security-scan | "scan for vulnerabilities", "OWASP scan" | Start ZAP via Docker → spider + passive/active scan → severity-ranked findings |
- Python 3.12+
- uv package manager
- One or more supported AI agents: Claude Code, GitHub Copilot, or Antigravity
- Node.js 18+ (for Playwright MCP — required by `e2e-test` and `ux-test`)
```bash
# Clone the repository
git clone https://github.com/coderphonui/opentest
cd opentest

# Install Python dependencies
uv sync
```

```bash
# Run from the opentest repo — copies skills to your target project
bin/opentest-sync /path/to/your-project
```

Skills are stored once in `.opentest-workspace/skills/` and symlinked into each agent's directory:
| Agent | Reads skills from |
|---|---|
| Claude Code | .claude/skills/ |
| GitHub Copilot | .github/skills/ |
| Antigravity | .agents/skills/ |
All three symlinks point to the same files in .opentest-workspace/skills/, so updates only need to be synced once.
```bash
# Sync only specific skills
bin/opentest-sync --skill api-test --skill e2e-test /path/to/your-project

# Preview without writing any files
bin/opentest-sync --dry-run /path/to/your-project

# List available skills
bin/opentest-sync --list
```

Claude Code — per project (recommended — config is committed and shared with the team):

```bash
cd /path/to/your-app
claude mcp add --transport stdio --scope project playwright -- npx -y @playwright/mcp@latest
```

Claude Code — global (available in every project on your machine):

```bash
claude mcp add --transport stdio --scope user playwright -- npx -y @playwright/mcp@latest
```

Verify it's active by running `/mcp` in Claude Code — `playwright` should show as connected.
For GitHub Copilot and Antigravity, configure Playwright MCP according to each agent's MCP setup documentation.
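For GitHub Copilot in VS Code, MCP servers are typically declared in a `.vscode/mcp.json` file. A minimal sketch (the exact schema varies by agent and version, so treat this as illustrative and verify against your agent's MCP documentation):

```json
{
  "servers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@playwright/mcp@latest"]
    }
  }
}
```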
Analyzes API specs, generates a test plan, executes tests via Schemathesis, and produces a scored report.
- Analyze — parse the spec, extract endpoints, detect patterns (CRUD, pagination, soft-delete), identify auth type
- Plan — apply best-practice rules + derive spec-driven test cases
- Execute — run Schemathesis against the live API (a rough manual equivalent is sketched after this list)
- Report — score results, group findings by severity, recommend fixes
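For orientation, the Execute step is roughly what a manual Schemathesis run does. A sketch, assuming a local `openapi.yaml` (CLI flags vary between Schemathesis versions; check `schemathesis run --help`):

```bash
schemathesis run openapi.yaml --base-url https://api.myapp.com
```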
```text
You: I have an OpenAPI spec for my users API. Can you test it?
     [paste spec or provide path]
     Base URL: https://api.myapp.com

Claude: Analyzed spec — 12 endpoints, Bearer auth, CRUD pattern detected.

        Score: 72/100 (C)
        Critical (2): SQL injection on POST /users, auth bypass on DELETE /users/{id}
        High (3):     Missing 422 on invalid email, response schema mismatch on GET /users
        Medium (1):   PUT /users is not idempotent

        Top 3 things to fix:
        1. Sanitize inputs with parameterized queries (POST /users)
        2. Add auth check to DELETE /users/{id}
        3. Return 422 instead of 200 for invalid email format
```
```jsonc
// Bearer token
{"type": "bearer", "token": "eyJhbGciOiJIUzI1NiJ9..."}

// API Key header
{"type": "api_key", "header": "X-API-Key", "value": "key123"}

// Basic auth
{"type": "basic", "username": "user", "password": "pass"}

// OAuth2 Client Credentials
{"type": "oauth2_cc", "token_url": "https://auth.example.com/token",
 "client_id": "id", "client_secret": "secret"}
```

All files go to `.opentest-workspace/api/` in your project:
| File | Contents |
|---|---|
| `spec_summary.json` | Parsed spec: endpoints, patterns, auth type |
| `test_plan.json` | Generated test cases with rule IDs |
| `results.json` | Raw Schemathesis output |
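The auth configs above are consumed by the shared `AuthManager` (`shared/opentest_core/auth/`). As a rough illustration of what the two most common types reduce to, here is a hypothetical helper (not the actual `opentest_core` API):

```python
import httpx

# Hypothetical sketch: how bearer and API-key configs map onto request headers.
# The real implementation lives in shared/opentest_core/auth/.
def auth_headers(config: dict) -> dict:
    if config["type"] == "bearer":
        return {"Authorization": f"Bearer {config['token']}"}
    if config["type"] == "api_key":
        return {config["header"]: config["value"]}
    raise ValueError(f"unsupported auth type: {config['type']}")

config = {"type": "bearer", "token": "eyJhbGciOiJIUzI1NiJ9..."}
response = httpx.get("https://api.myapp.com/users", headers=auth_headers(config))
print(response.status_code)
```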
Explores a web app live in a browser via playwright-mcp, executes test scenarios directly, and saves a findings report. No code generation — everything happens in the browser in real time.
- Intake — collect URL, auth credentials, and any focus areas
- Auth — log in (form, cookie injection, or token injection)
- Discovery — BFS-crawl up to 25 pages; build a site map of pages, forms, and flows (the crawl logic is sketched after this list)
- Analysis — identify critical user journeys (P0) and confirm with you before testing
- Execution — run scenarios tier by tier (P0 first), screenshot each step, record pass/fail
- Report — save findings to `.opentest-workspace/e2e/report-<YYYY-MM-DD>.md`
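The skill drives a real browser through playwright-mcp, but the Discovery step is essentially a bounded same-origin breadth-first crawl. A standalone sketch of that logic, using httpx and the stdlib HTML parser instead of a browser:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import httpx

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def discover(start_url: str, max_pages: int = 25) -> list[str]:
    """BFS-crawl same-origin pages, capped at max_pages."""
    origin = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = httpx.get(url, follow_redirects=True, timeout=10)
        except httpx.HTTPError:
            continue  # unreachable page: skip, keep crawling
        pages.append(url)
        parser = LinkParser()
        parser.feed(resp.text)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == origin and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

print(discover("https://staging.myapp.com"))
```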
| Priority | Type | Example |
|---|---|---|
| P0 | Core happy path | User creates and completes the main action |
| P0 | Auth boundary | Unauthenticated user redirected to login |
| P1 | Form validation | Required field blank → error shown |
| P1 | Navigation | All major sections reachable, no 404/500 |
| P2 | Empty state | Helpful empty state, not broken layout |
| P2 | Error handling | Duplicate name shows error, doesn't crash |
```text
You: Run E2E tests on https://staging.myapp.com
     Auth: email=test@example.com, password=Pass123!

Claude: [crawls app — 14 pages found]
        Detected 3 critical flows: Login, Checkout, Dashboard navigation.
        Running 18 scenarios...

        Score: ✅ 15 passed  ❌ 3 failed

        🔴 [CRITICAL] Checkout fails on mobile viewport (375px)
        🟠 [HIGH]     Cart total not updated after removing item
        🟡 [MEDIUM]   No confirmation shown after form submit

        Report: .opentest-workspace/e2e/report-2026-03-18.md
```
| File | Contents |
|---|---|
| `screenshots/<page>.png` | Screenshot of each discovered page |
| `report-<YYYY-MM-DD>.md` | Site map, issues by severity, full scenario results |
Scans web pages with axe-core via Playwright, maps violations to WCAG criteria, and produces concrete, actionable fixes.
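Under the hood this is the standard "inject axe-core into the page" pattern. A minimal sketch of what `scan_a11y.py` automates (assuming `npm install axe-core` has placed `axe.min.js` at the path below):

```python
from playwright.sync_api import sync_playwright

# Sketch: inject axe-core into a live page and read back the violations.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://myapp.com")
    page.add_script_tag(path="node_modules/axe-core/axe.min.js")
    results = page.evaluate("axe.run()")  # Playwright awaits the returned Promise
    for violation in results["violations"]:
        print(violation["impact"], violation["id"], violation["help"])
    browser.close()
```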
```text
You: Check accessibility on https://myapp.com — run WCAG 2.1 AA audit

Claude: Scanning 3 pages...

        Critical (3): Missing alt text on product images, no form labels on checkout
        Serious (5):  Color contrast failures, missing focus styles
        Moderate (4): Heading order inconsistent, missing landmark roles

        Top fix: Add alt="" to 12 <img> elements in /products
        Report: .opentest-workspace/a11y/results.json
```
| Level | Label | Meaning |
|---|---|---|
| A violations | Critical | Major barriers — some users cannot access content |
| AA violations | Serious | Required for legal compliance (ADA, EAA) |
| Best practice | Moderate | Improves assistive technology experience |
| Minor | Low | Small improvements, low user impact |
Explores a web app live in a browser, captures screenshots, and evaluates UI/UX quality against Nielsen's 10 Usability Heuristics.
```text
You: Review UX on https://myapp.com — users are non-technical customers

Claude: [browses 10 pages, captures screenshots and UI states]

        🔴 [CRITICAL] Users have no way to recover from form errors — no inline validation
        🟠 [HIGH]     Primary CTA buried below the fold on mobile
        🟡 [MEDIUM]   Inconsistent button labels across 3 pages
        💡 Add empty state illustrations to /dashboard

        Report: .opentest-workspace/ux/report-2026-03-18.md
```
| File | Contents |
|---|---|
| `screenshots/<page>.png` | Screenshots of each page and UI state |
| `report-<YYYY-MM-DD>.md` | Findings by severity, page-by-page summary, top actions |
Generates a k6 load test script from a URL or OpenAPI spec, executes it under configurable load, and reports p50/p95/p99 latency, throughput, and error rates against SLA thresholds.
| Scenario | Profile | Purpose |
|---|---|---|
| `smoke` | 1 VU / 30s | Verify setup, baseline latency |
| `load` | ramp to 10 VUs / 2m | Typical production load |
| `stress` | ramp to 100 VUs / 5m | Find the breaking point |
| `soak` | 5 VUs / 10m | Detect memory leaks |
| `spike` | 0→100→0 VUs / 1m | Sudden traffic spikes |
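To make these profiles concrete, here is a sketch of how a generator could translate the table into the k6 `options` block it embeds in the script. The stage breakdowns and thresholds are illustrative, not the actual `run_perf.py` output:

```python
import json

# Illustrative stage plans per scenario (k6 ramps VUs linearly to each
# stage's target over its duration, then moves to the next stage).
STAGES = {
    "smoke":  [{"duration": "30s", "target": 1}],
    "load":   [{"duration": "30s", "target": 10}, {"duration": "90s", "target": 10}],
    "stress": [{"duration": "4m", "target": 100}, {"duration": "1m", "target": 100}],
    "soak":   [{"duration": "1m", "target": 5}, {"duration": "9m", "target": 5}],
    "spike":  [{"duration": "15s", "target": 100},
               {"duration": "30s", "target": 100},
               {"duration": "15s", "target": 0}],
}

def k6_options(scenario: str) -> str:
    """Render the `export const options` block of a k6 script."""
    options = {
        "stages": STAGES[scenario],
        # k6 threshold syntax: fail the run if p95 latency or error rate breach the SLA
        "thresholds": {
            "http_req_duration": ["p(95)<500"],
            "http_req_failed": ["rate<0.01"],
        },
    }
    return f"export const options = {json.dumps(options, indent=2)};"

print(k6_options("stress"))
```

So `stress`, for example, spends four minutes ramping to 100 VUs and holds there for the final minute.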
```text
You: Load test https://api.myapp.com — scenario: stress

Claude: Generating k6 script for stress scenario (100 VUs)...
        Running: .opentest-workspace/perf/k6_script.js

        ❌ FAIL — p95 threshold violated

        Response Times: p50: 145ms   p95: 3,420ms   p99: 8,200ms
        Throughput:     42 req/s at peak (100 VUs)
        Error rate:     2.3%  ← threshold: <1% ❌

        Top 3 findings:
        1. p95 latency 3.4s at 100 VUs — likely DB query without index on POST /orders
        2. Error rate 2.3% — connection pool exhausted; increase pool size
        3. p99 spike to 8.2s — consider adding Redis caching for GET /products

        Report: .opentest-workspace/perf/report-2026-03-18.md
        Script saved: .opentest-workspace/perf/k6_script.js (reusable in CI)
```
```bash
# macOS
brew install k6

# Ubuntu/Debian
sudo apt-get install k6   # see https://k6.io/docs/get-started/installation/
```

| File | Contents |
|---|---|
| `k6_script.js` | Generated k6 script (reusable in CI) |
| `results.json` | Raw k6 metrics (p50/p95/p99, req/s, error rate) |
| `report-<YYYY-MM-DD>.md` | Verdict, findings, top recommendations |
Runs OWASP ZAP in Docker against a web app or API: spiders the target, runs a passive (and optionally an active) scan, and produces a severity-ranked vulnerability report.
```text
You: Security scan https://staging.myapp.com — passive scan only

Claude: Starting ZAP Docker container...
        Spidering https://staging.myapp.com — 24 URLs found.
        Running passive scan...

        🟠 High (2):   Missing Content-Security-Policy, Cookies without Secure flag
        🟡 Medium (5): Missing HSTS, CORS misconfiguration, exposed server header...
        🔵 Low (8):    Informational headers, fingerprinting risks

        Top 3 fixes:
        1. Add 4 security headers via middleware (CSP, HSTS, X-Frame-Options, X-Content-Type-Options) — ~30 min
        2. Set Secure + HttpOnly on session cookie — single config change
        3. Restrict CORS from * to your frontend domain

        Report: .opentest-workspace/security/report-2026-03-18.md
```
| Mode | Duration | Best for |
|---|---|---|
| `passive` | ~2 min | Safe for production; inspects traffic only |
| `active` | 10–30 min | Staging/dev; sends attack probes to find more issues |
```bash
docker pull ghcr.io/zaproxy/zaproxy:stable
```

| File | Contents |
|---|---|
| `results.json` | Structured findings: severity, OWASP mapping, evidence, fix |
| `report-<YYYY-MM-DD>.md` | Severity-ranked report with remediation actions |
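A passive run maps closely onto ZAP's stock baseline scan, which you can also invoke directly against the same image (a sketch; see the ZAP docs for the full flag set):

```bash
docker run --rm -t ghcr.io/zaproxy/zaproxy:stable zap-baseline.py \
  -t https://staging.myapp.com
```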
| Score | Grade | Meaning |
|---|---|---|
| 90–100 | A | Excellent — production-ready |
| 80–89 | B | Good — minor issues to address |
| 70–79 | C | Acceptable — several improvements needed |
| 60–69 | D | Poor — significant gaps |
| 0–59 | F | Critical issues — do not ship |
Score deductions: Critical −15pts each (max −45), High −8pts (max −24), Medium −3pts, Low −1pt. Bonuses: >80% coverage +5pts, all P0 passed +5pts.
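In code, those rules amount to the following (a simplified sketch; the actual scorer lives in `shared/opentest_core/rules_engine/`):

```python
GRADES = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]

def score(critical: int, high: int, medium: int, low: int,
          coverage: float = 0.0, all_p0_passed: bool = False) -> tuple[int, str]:
    """Apply the deduction and bonus rules described above (sketch)."""
    points = 100
    points -= min(critical * 15, 45)   # Critical: −15 each, capped at −45
    points -= min(high * 8, 24)        # High: −8 each, capped at −24
    points -= medium * 3               # Medium: −3 each
    points -= low * 1                  # Low: −1 each
    if coverage > 0.80:
        points += 5                    # >80% coverage bonus
    if all_p0_passed:
        points += 5                    # all-P0-passed bonus
    points = max(0, min(100, points))
    grade = next(g for threshold, g in GRADES if points >= threshold)
    return points, grade

print(score(critical=0, high=2, medium=3, low=4))  # (71, 'C')
```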
Rules are defined in YAML files under .claude/skills/api-test/rules/.
| File | Rules | Focus |
|---|---|---|
| `auth.rules.yaml` | AUTH-001 to AUTH-003 | Authentication, authorization, BOLA |
| `validation.rules.yaml` | VAL-001 to VAL-005 | Input validation, SQL injection, XSS |
| `response.rules.yaml` | RESP-001 to RESP-005 | Response format, schema, error codes |
| `idempotency.rules.yaml` | IDEM-001 to IDEM-003 | PUT/DELETE idempotency |
| `pagination.rules.yaml` | PAG-001 to PAG-003 | Pagination params, metadata |
```yaml
# my_rules.yaml
custom_rules:
  - id: CUSTOM-001
    name: "All responses must include X-Request-ID header"
    severity: high
    category: api_design
    applies_to: {}
    additional_assertions:
      - type: header
        target: "X-Request-ID"
        operator: exists
```

Pass it via `--custom-rules my_rules.yaml` in `generate_plan.py`.
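At run time, CUSTOM-001 boils down to a header-existence assertion on each response, roughly (a sketch, not the rules engine's actual evaluation code):

```python
import httpx

# Sketch of what the CUSTOM-001 assertion checks for each response.
response = httpx.get("https://api.myapp.com/users")
assert "X-Request-ID" in response.headers, "CUSTOM-001: X-Request-ID header missing"
```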
```text
opentest/                        # This repository (skill source)
├── .claude/
│   └── skills/                  # Skill definitions (source of truth)
│       ├── api-test/
│       │   ├── SKILL.md         # Skill prompt loaded by the agent
│       │   ├── scripts/         # analyze_spec.py, generate_plan.py, run_tests.py, ...
│       │   └── rules/           # API best-practice rules (YAML)
│       ├── e2e-test/
│       │   ├── SKILL.md
│       │   └── rules/
│       ├── a11y-test/
│       │   ├── SKILL.md
│       │   └── scripts/         # scan_a11y.py (axe-core via Playwright)
│       ├── ux-test/
│       │   ├── SKILL.md
│       │   └── references/      # heuristics.md (Nielsen's 10 heuristics)
│       ├── perf-test/
│       │   ├── SKILL.md
│       │   └── scripts/         # run_perf.py (k6 script generator + runner)
│       └── security-scan/
│           ├── SKILL.md
│           └── scripts/         # run_security_scan.py (OWASP ZAP via Docker)
├── shared/
│   └── opentest_core/
│       ├── models/              # Pydantic data models
│       ├── auth/                # AuthManager (bearer, API key, basic, OAuth2, cookie)
│       ├── rules_engine/        # YAML rules loader, engine, scorer
│       ├── test_data/           # Boundary values, fuzz payloads, Faker wrapper
│       ├── result/              # ResultAggregator, ReportGenerator
│       └── schema/              # jsonschema validator
├── bin/
│   └── opentest-sync            # Sync skills from this repo into a target project
├── docs/                        # Contributing guide
└── .github/workflows/           # CI/CD pipeline
```

```text
your-project/                    # After running opentest-sync
├── .opentest-workspace/
│   └── skills/                  # Skill files live here (single copy)
│       ├── api-test/
│       ├── e2e-test/
│       └── ...
├── .claude/skills/              # Symlinks → .opentest-workspace/skills/* (Claude Code)
├── .github/skills/              # Symlinks → .opentest-workspace/skills/* (GitHub Copilot)
└── .agents/skills/              # Symlinks → .opentest-workspace/skills/* (Antigravity)
```
```bash
# All unit tests
uv run pytest shared/tests -v --tb=short

# With coverage
uv run pytest shared/tests --cov=opentest_core --cov-report=term-missing
```

```bash
uv run ruff check .
uv run ruff format --check .
uv run mypy shared/opentest_core --ignore-missing-imports
```

Auto-fix formatting:

```bash
uv run ruff format .
```

```text
┌─────────────────────────────────────────────────────────────────────────┐
│ USER │
│ "Test my API" / "Run E2E on staging.myapp.com" / "Review UX" │
└──────────────────────────────────┬──────────────────────────────────────┘
│ Natural Language
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ AI AGENT (Claude Code · GitHub Copilot · Antigravity) │
│ Loads skill prompt → follows workflow → calls tools → saves report │
└──────┬────────────────┬──────────────────┬──────────────┬───────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌──────────┐
│ api-test │ │ e2e-test │ │ a11y-test │ │ ux-test │
│ │ │ │ │ │ │ │
│ analyze_spec│ │ playwright │ │ scan_a11y.py │ │playwright│
│ gen_plan │ │ mcp tools │ │ (axe-core) │ │ mcp tools│
│ run_tests │ │ navigate │ │ │ │ navigate │
│ gen_code │ │ click │ │ │ │ snapshot │
│ analyze_res │ │ screenshot │ │ │ │screenshot│
└──────┬──────┘ └─────┬──────┘ └──────┬───────┘ └────┬─────┘
│ │ │ │
▼ ▼ └───────────────┘
┌─────────────┐ ┌─────────────┐
│ Schemathesis│ │ Playwright │
│ (Python) │ │ Browser │
└─────────────┘ └─────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SHARED PYTHON INFRASTRUCTURE │
│ rules_engine/ auth_manager/ result_aggregator/ report_generator/ │
│ test_data/ schema_validator/ models/ │
└─────────────────────────────────────────────────────────────────────────┘
```
| Layer | Technology | License |
|---|---|---|
| Skill Runtime | Claude Code · GitHub Copilot · Antigravity | — |
| API Testing Engine | Schemathesis | MIT |
| Browser Automation | playwright-mcp (Microsoft) | Apache 2.0 |
| Accessibility Scanning | axe-core (via Playwright) | MPL-2.0 |
| UX Evaluation Framework | Nielsen's 10 Heuristics | — |
| Load Testing | k6 | AGPL-3.0 |
| Security Scanning | OWASP ZAP (Docker) | Apache 2.0 |
| Data Models | Pydantic v2 | MIT |
| HTTP Client | httpx | BSD-3 |
| OpenAPI Parsing | openapi-core | BSD-3 |
| Test Data | Faker | MIT |
| Package Manager | uv | MIT |
| Linter/Formatter | Ruff | MIT |
- Reuse First — battle-tested tools (Schemathesis, playwright-mcp, axe-core) over reinventing the wheel
- Build Only Unique Value — Rules Engine, AI Analysis, Orchestration are what we build
- Auditable — every finding is traceable to a specific rule ID or heuristic
- Self-Hostable — no vendor lock-in, runs 100% locally
- Composable — each skill is independent, use only what you need
MIT License — see LICENSE for details.
Contributions welcome! Please read the contributing guide before opening a PR.
Areas where help is most appreciated:
- Additional best-practice rules (add a `.rules.yaml` file)
- New spec parsers (GraphQL mutations, AsyncAPI, gRPC)
- E2E rule coverage improvements
- Integration test fixtures for more public apps (SauceDemo, TodoMVC, Conduit)