
OpenTest

OpenTest is a set of AI Agent Skills that give AI coding assistants the ability to automatically test software — APIs, web frontends, accessibility, and UX — with zero manual test writing. Skills are loaded by the agent and execute entirely within a conversation.

Supported agents: Claude Code · GitHub Copilot · Antigravity

User: "Run E2E tests on https://staging.myapp.com"
Agent: [opens browser → discovers pages & flows → executes scenarios → saves report]
       Found 14 pages. Ran 18 scenarios: ✅ 15 passed, ❌ 3 failed.
       Report saved to .opentest-workspace/e2e/report-2026-03-18.md

User: "Test my API at https://api.myapp.com"
Agent: [parses spec → generates plan → runs Schemathesis → Score: 78/100 (C) + findings]

Why OpenTest?

  • Zero test writing — the agent discovers flows and executes scenarios automatically
  • Actionable findings — every issue linked to a specific rule ID with root cause and fix recommendation
  • Scored results — not just pass/fail; you get a grade, severity breakdown, and top priorities
  • Minimal setup — point your agent at a URL or spec and get results in one conversation
  • Full coverage report — gaps and untested areas are surfaced, not hidden

Skills

| Skill | Trigger | What It Does |
| --- | --- | --- |
| api-test | "test my API", "analyze this spec" | Parse OpenAPI/Postman spec → generate test plan → run Schemathesis → scored report |
| e2e-test | "test my web app", "run E2E tests" | Open browser → discover pages & flows → execute scenarios → save findings report |
| a11y-test | "check accessibility", "WCAG audit" | Run axe-core via Playwright → map violations to WCAG → prioritized fix list |
| ux-test | "review my UX", "audit my UI" | Browse app → evaluate against Nielsen's 10 heuristics → actionable UX report |
| perf-test | "load test", "stress test", "benchmark my API" | Generate k6 script → run load scenario → p50/p95/p99 metrics + threshold report |
| security-scan | "scan for vulnerabilities", "OWASP scan" | Start ZAP via Docker → spider + passive/active scan → severity-ranked findings |

Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • One or more supported AI agents: Claude Code, GitHub Copilot, or Antigravity
  • Node.js 18+ (for Playwright MCP — required by e2e-test, ux-test, and a11y-test)

Installation

# Clone the repository
git clone https://github.com/coderphonui/opentest
cd opentest

# Install Python dependencies
uv sync

Sync skills to your project

# Run from the opentest repo — copies skills to your target project
bin/opentest-sync /path/to/your-project

Skills are stored once in .opentest-workspace/skills/ and symlinked into each agent's directory:

| Agent | Reads skills from |
| --- | --- |
| Claude Code | .claude/skills/ |
| GitHub Copilot | .github/skills/ |
| Antigravity | .agents/skills/ |

All three symlinks point to the same files in .opentest-workspace/skills/, so updates only need to be synced once.

# Sync only specific skills
bin/opentest-sync --skill api-test --skill e2e-test /path/to/your-project

# Preview without writing any files
bin/opentest-sync --dry-run /path/to/your-project

# List available skills
bin/opentest-sync --list
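
Under the hood, the sync keeps a single copy of each skill and repoints the agent directories at it. A simplified sketch of the idea (illustrative only, not the actual bin/opentest-sync implementation):

# Illustrative sketch of the sync layout (not the real bin/opentest-sync code)
import shutil
from pathlib import Path

AGENT_DIRS = [".claude/skills", ".github/skills", ".agents/skills"]

def sync_skill(repo: Path, project: Path, skill: str) -> None:
    source = repo / ".claude" / "skills" / skill
    target = project / ".opentest-workspace" / "skills" / skill
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, target, dirs_exist_ok=True)  # the single real copy
    for agent_dir in AGENT_DIRS:
        link = project / agent_dir / skill
        link.parent.mkdir(parents=True, exist_ok=True)
        if not link.is_symlink():
            link.symlink_to(target)  # every agent shares that one copy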

Playwright MCP (required for e2e-test, ux-test, a11y-test)

Claude Code — per project (recommended — config is committed and shared with the team):

cd /path/to/your-app
claude mcp add --transport stdio --scope project playwright -- npx -y @playwright/mcp@latest

Claude Code — global (available in every project on your machine):

claude mcp add --transport stdio --scope user playwright -- npx -y @playwright/mcp@latest

Verify it's active by running /mcp in Claude Code — playwright should show as connected.

For GitHub Copilot and Antigravity, configure Playwright MCP according to each agent's MCP setup documentation.


Skill: api-test

Analyzes API specs, generates a test plan, executes tests via Schemathesis, and produces a scored report.

Workflow

  1. Analyze — parse the spec, extract endpoints, detect patterns (CRUD, pagination, soft-delete), identify auth type
  2. Plan — apply best-practice rules + derive spec-driven test cases
  3. Execute — run Schemathesis against the live API (pipeline sketched below)
  4. Report — score results, group findings by severity, recommend fixes
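
In script form, the workflow maps onto the skill's bundled scripts. A rough sketch: the script names come from this repo, but the arguments (apart from --custom-rules, covered under Custom Rules below) are assumptions, so check each script's --help:

# Rough pipeline sketch; arguments other than --custom-rules are assumptions
import subprocess

SCRIPTS = ".claude/skills/api-test/scripts"

subprocess.run(["python", f"{SCRIPTS}/analyze_spec.py", "openapi.yaml"], check=True)
subprocess.run(["python", f"{SCRIPTS}/generate_plan.py",
                "--custom-rules", "my_rules.yaml"], check=True)
subprocess.run(["python", f"{SCRIPTS}/run_tests.py",
                "https://api.myapp.com"], check=True)  # runs against the live API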

Usage

You: I have an OpenAPI spec for my users API. Can you test it?
     [paste spec or provide path]
     Base URL: https://api.myapp.com

Claude: Analyzed spec — 12 endpoints, Bearer auth, CRUD pattern detected.

        Score: 72/100 (C)

        Critical (2): SQL injection on POST /users, auth bypass on DELETE /users/{id}
        High (3): Missing 422 on invalid email, response schema mismatch on GET /users
        Medium (1): PUT /users is not idempotent

        Top 3 things to fix:
        1. Sanitize inputs with parameterized queries (POST /users)
        2. Add auth check to DELETE /users/{id}
        3. Return 422 instead of 200 for invalid email format

Auth Config

// Bearer token
{"type": "bearer", "token": "eyJhbGciOiJIUzI1NiJ9..."}

// API Key header
{"type": "api_key", "header": "X-API-Key", "value": "key123"}

// Basic auth
{"type": "basic", "username": "user", "password": "pass"}

// OAuth2 Client Credentials
{"type": "oauth2_cc", "token_url": "https://auth.example.com/token",
 "client_id": "id", "client_secret": "secret"}

Output

All files go to .opentest-workspace/api/ in your project:

| File | Contents |
| --- | --- |
| spec_summary.json | Parsed spec: endpoints, patterns, auth type |
| test_plan.json | Generated test cases with rule IDs |
| results.json | Raw Schemathesis output |

Skill: e2e-test

Explores a web app live in a browser via playwright-mcp, executes test scenarios directly, and saves a findings report. No code generation — everything happens in the browser in real time.

Workflow

  1. Intake — collect URL, auth credentials, and any focus areas
  2. Auth — log in (form, cookie injection, or token injection)
  3. Discovery — BFS-crawl up to 25 pages; build a site map of pages, forms, and flows (a minimal crawl sketch follows this list)
  4. Analysis — identify critical user journeys (P0) and confirm with you before testing
  5. Execution — run scenarios tier by tier (P0 first), screenshot each step, record pass/fail
  6. Report — save findings to .opentest-workspace/e2e/report-<YYYY-MM-DD>.md
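
The discovery step is an ordinary breadth-first crawl capped at 25 pages. A minimal sketch of the idea (illustrative: the skill actually drives a real browser through playwright-mcp rather than fetching raw HTML):

# Minimal BFS crawl sketch (the skill uses playwright-mcp, not raw HTTP)
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import httpx

def crawl(start_url: str, max_pages: int = 25) -> list[str]:
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages.append(url)
        html = httpx.get(url, follow_redirects=True).text
        for href in re.findall(r'href="([^"#]+)"', html):
            nxt = urljoin(url, href)
            if urlparse(nxt).netloc == host and nxt not in seen:  # stay on-site
                seen.add(nxt)
                queue.append(nxt)
    return pages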

Scenario Tiers

| Priority | Type | Example |
| --- | --- | --- |
| P0 | Core happy path | User creates and completes the main action |
| P0 | Auth boundary | Unauthenticated user redirected to login |
| P1 | Form validation | Required field blank → error shown |
| P1 | Navigation | All major sections reachable, no 404/500 |
| P2 | Empty state | Helpful empty state, not broken layout |
| P2 | Error handling | Duplicate name shows error, doesn't crash |

Usage

You: Run E2E tests on https://staging.myapp.com
     Auth: email=test@example.com, password=Pass123!

Claude: [crawls app — 14 pages found]
        Detected 3 critical flows: Login, Checkout, Dashboard navigation.

        Running 18 scenarios...

        Score: ✅ 15 passed  ❌ 3 failed

        🔴 [CRITICAL] Checkout fails on mobile viewport (375px)
        🟠 [HIGH] Cart total not updated after removing item
        🟡 [MEDIUM] No confirmation shown after form submit

        Report: .opentest-workspace/e2e/report-2026-03-18.md

Output

| File | Contents |
| --- | --- |
| screenshots/<page>.png | Screenshot of each discovered page |
| report-<YYYY-MM-DD>.md | Site map, issues by severity, full scenario results |

Skill: a11y-test

Scans web pages with axe-core via Playwright, maps violations to WCAG criteria, and produces concrete, actionable fixes.

Usage

You: Check accessibility on https://myapp.com — run WCAG 2.1 AA audit

Claude: Scanning 3 pages...

        Critical (3): Missing alt text on product images, no form labels on checkout
        Serious (5): Color contrast failures, missing focus styles
        Moderate (4): Heading order inconsistent, missing landmark roles

        Top fix: Add descriptive alt text to 12 <img> elements in /products
        Report: .opentest-workspace/a11y/results.json

WCAG Severity Levels

| Level | Label | Meaning |
| --- | --- | --- |
| A violations | Critical | Major barriers — some users cannot access content |
| AA violations | Serious | Required for legal compliance (ADA, EAA) |
| Best practice | Moderate | Improves assistive technology experience |
| Minor | Low | Small improvements, low user impact |
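
Under the hood, scan_a11y.py follows the standard axe-core injection pattern. A minimal sketch (the CDN URL and version pin are assumptions):

# Standard axe-core injection pattern (CDN URL/version are assumptions)
from playwright.sync_api import sync_playwright

AXE_CDN = "https://cdn.jsdelivr.net/npm/axe-core@4/axe.min.js"

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://myapp.com")
    page.add_script_tag(url=AXE_CDN)      # load axe-core into the page
    results = page.evaluate("axe.run()")  # Playwright awaits the returned promise
    for v in results["violations"]:
        print(v["impact"], v["id"], v["help"])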

Skill: ux-test

Explores a web app live in a browser, captures screenshots, and evaluates UI/UX quality against Nielsen's 10 Usability Heuristics.

Usage

You: Review UX on https://myapp.com — users are non-technical customers

Claude: [browses 10 pages, captures screenshots and UI states]

        🔴 [CRITICAL] Users have no way to recover from form errors — no inline validation
        🟠 [HIGH] Primary CTA buried below the fold on mobile
        🟡 [MEDIUM] Inconsistent button labels across 3 pages
        💡 Add empty state illustrations to /dashboard

        Report: .opentest-workspace/ux/report-2026-03-18.md

Output

| File | Contents |
| --- | --- |
| screenshots/<page>.png | Screenshots of each page and UI state |
| report-<YYYY-MM-DD>.md | Findings by severity, page-by-page summary, top actions |

Skill: perf-test

Generates a k6 load test script from a URL or OpenAPI spec, executes it under configurable load, and reports p50/p95/p99 latency, throughput, and error rates against SLA thresholds.

Scenarios

| Scenario | Profile | Purpose |
| --- | --- | --- |
| smoke | 1 VU / 30s | Verify setup, baseline latency |
| load | ramp to 10 VUs / 2m | Typical production load |
| stress | ramp to 100 VUs / 5m | Find the breaking point |
| soak | 5 VUs / 10m | Detect memory leaks |
| spike | 0→100→0 VUs / 1m | Sudden traffic spikes |
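
run_perf.py generates and runs k6 scripts along these lines. A minimal sketch of a generated stress script and its invocation (the stages and thresholds are illustrative defaults):

# Sketch: write a minimal k6 stress script and run it (values are illustrative)
import subprocess
from pathlib import Path

K6_SCRIPT = """
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 100 },  // ramp to 100 VUs
    { duration: '3m', target: 100 },  // hold
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // fail the run if p95 exceeds 500ms
    http_req_failed: ['rate<0.01'],    // fail if error rate exceeds 1%
  },
};

export default function () {
  const res = http.get(__ENV.TARGET);
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
"""

out = Path(".opentest-workspace/perf")
out.mkdir(parents=True, exist_ok=True)
(out / "k6_script.js").write_text(K6_SCRIPT)
subprocess.run(["k6", "run", "-e", "TARGET=https://api.myapp.com",
                str(out / "k6_script.js")], check=True)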

Usage

You: Load test https://api.myapp.com — scenario: stress

Claude: Generating k6 script for stress scenario (100 VUs)...
        Running: .opentest-workspace/perf/k6_script.js

        ❌ FAIL — p95 threshold violated

        Response Times:   p50: 145ms  p95: 3,420ms  p99: 8,200ms
        Throughput:       42 req/s at peak (100 VUs)
        Error rate:       2.3% ← threshold: <1% ❌

        Top 3 findings:
        1. p95 latency 3.4s at 100 VUs — likely DB query without index on POST /orders
        2. Error rate 2.3% — connection pool exhausted; increase pool size
        3. p99 spike to 8.2s — consider adding Redis caching for GET /products

        Report: .opentest-workspace/perf/report-2026-03-18.md
        Script saved: .opentest-workspace/perf/k6_script.js (reusable in CI)

Prerequisites

# macOS
brew install k6

# Ubuntu/Debian
sudo apt-get install k6   # requires adding the k6 apt repo first; see https://k6.io/docs/get-started/installation/

Output

| File | Contents |
| --- | --- |
| k6_script.js | Generated k6 script (reusable in CI) |
| results.json | Raw k6 metrics (p50/p95/p99, req/s, error rate) |
| report-<YYYY-MM-DD>.md | Verdict, findings, top recommendations |

Skill: security-scan

Runs OWASP ZAP (in Docker) against a web app or API: spiders the target, runs a passive scan (optionally an active one), and produces a severity-ranked vulnerability report.

⚠️ Only scan systems you own or have explicit written permission to test.

Usage

You: Security scan https://staging.myapp.com — passive scan only

Claude: Starting ZAP Docker container...
        Spidering https://staging.myapp.com — 24 URLs found.
        Running passive scan...

        🟠 High (2): Missing Content-Security-Policy, Cookies without Secure flag
        🟡 Medium (5): Missing HSTS, CORS misconfiguration, exposed server header...
        🔵 Low (8): Informational headers, fingerprinting risks

        Top 3 fixes:
        1. Add 4 security headers via middleware (CSP, HSTS, X-Frame-Options, X-Content-Type-Options) — ~30 min
        2. Set Secure + HttpOnly on session cookie — single config change
        3. Restrict CORS from * to your frontend domain

        Report: .opentest-workspace/security/report-2026-03-18.md

Scan Modes

| Mode | Duration | Best for |
| --- | --- | --- |
| passive | ~2 min | Safe for production; inspects traffic only |
| active | 10–30 min | Staging/dev; sends attack probes to find more issues |
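
Passive mode maps onto ZAP's baseline scan. A sketch of the underlying Docker invocation (the report filename is illustrative; the flags belong to the official zap-baseline.py helper):

# Sketch: ZAP passive baseline scan via Docker (report filename is illustrative)
import subprocess
from pathlib import Path

out = Path(".opentest-workspace/security").resolve()
out.mkdir(parents=True, exist_ok=True)

subprocess.run([
    "docker", "run", "--rm",
    "-v", f"{out}:/zap/wrk/:rw",        # mount output dir into the container
    "ghcr.io/zaproxy/zaproxy:stable",
    "zap-baseline.py",
    "-t", "https://staging.myapp.com",  # target: spider + passive inspection only
    "-J", "results.json",               # JSON report, written under /zap/wrk/
], check=False)                         # non-zero exit just means findings were raised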

Prerequisites

docker pull ghcr.io/zaproxy/zaproxy:stable

Output

| File | Contents |
| --- | --- |
| results.json | Structured findings: severity, OWASP mapping, evidence, fix |
| report-<YYYY-MM-DD>.md | Severity-ranked report with remediation actions |

Scoring (API Testing)

| Score | Grade | Meaning |
| --- | --- | --- |
| 90–100 | A | Excellent — production-ready |
| 80–89 | B | Good — minor issues to address |
| 70–79 | C | Acceptable — several improvements needed |
| 60–69 | D | Poor — significant gaps |
| 0–59 | F | Critical issues — do not ship |

Score deductions: Critical −15 pts each (capped at −45), High −8 pts each (capped at −24), Medium −3 pts each, Low −1 pt each. Bonuses: >80% coverage +5 pts, all P0 passed +5 pts.
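
Worked example: one critical, two high, and two medium findings with coverage above 80% gives 100 − 15 − 16 − 6 + 5 = 68, a D.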


Best Practice Rules

Rules are defined in YAML files under .claude/skills/api-test/rules/.

API Rules

| File | Rules | Focus |
| --- | --- | --- |
| auth.rules.yaml | AUTH-001 to AUTH-003 | Authentication, authorization, BOLA |
| validation.rules.yaml | VAL-001 to VAL-005 | Input validation, SQL injection, XSS |
| response.rules.yaml | RESP-001 to RESP-005 | Response format, schema, error codes |
| idempotency.rules.yaml | IDEM-001 to IDEM-003 | PUT/DELETE idempotency |
| pagination.rules.yaml | PAG-001 to PAG-003 | Pagination params, metadata |

Custom Rules

# my_rules.yaml
custom_rules:
  - id: CUSTOM-001
    name: "All responses must include X-Request-ID header"
    severity: high
    category: api_design
    applies_to: {}
    additional_assertions:
      - type: header
        target: "X-Request-ID"
        operator: exists

Pass it to generate_plan.py via --custom-rules my_rules.yaml.
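
For illustration, the header assertion above boils down to a check like this (the function name and shape are hypothetical, not the actual rules_engine API):

# Hypothetical assertion check; not the real opentest_core.rules_engine API
import httpx

def header_exists(response: httpx.Response, target: str) -> bool:
    """Implements {type: header, operator: exists} from the YAML above."""
    return target in response.headers

resp = httpx.get("https://api.myapp.com/users")
if not header_exists(resp, "X-Request-ID"):
    print("CUSTOM-001 failed: missing X-Request-ID (severity: high)")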


Repository Structure

opentest/                          # This repository (skill source)
├── .claude/
│   └── skills/                    # Skill definitions (source of truth)
│       ├── api-test/
│       │   ├── SKILL.md           # Skill prompt loaded by the agent
│       │   ├── scripts/           # analyze_spec.py, generate_plan.py, run_tests.py, ...
│       │   └── rules/             # API best-practice rules (YAML)
│       ├── e2e-test/
│       │   ├── SKILL.md
│       │   └── rules/
│       ├── a11y-test/
│       │   ├── SKILL.md
│       │   └── scripts/           # scan_a11y.py (axe-core via Playwright)
│       ├── ux-test/
│       │   ├── SKILL.md
│       │   └── references/        # heuristics.md (Nielsen's 10 heuristics)
│       ├── perf-test/
│       │   ├── SKILL.md
│       │   └── scripts/           # run_perf.py (k6 script generator + runner)
│       └── security-scan/
│           ├── SKILL.md
│           └── scripts/           # run_security_scan.py (OWASP ZAP via Docker)
├── shared/
│   └── opentest_core/
│       ├── models/                # Pydantic data models
│       ├── auth/                  # AuthManager (bearer, API key, basic, OAuth2, cookie)
│       ├── rules_engine/          # YAML rules loader, engine, scorer
│       ├── test_data/             # Boundary values, fuzz payloads, Faker wrapper
│       ├── result/                # ResultAggregator, ReportGenerator
│       └── schema/                # jsonschema validator
├── bin/
│   └── opentest-sync              # Sync skills from this repo into a target project
├── docs/                          # Contributing guide
└── .github/workflows/             # CI/CD pipeline

your-project/                      # After running opentest-sync
├── .opentest-workspace/
│   └── skills/                    # Skill files live here (single copy)
│       ├── api-test/
│       ├── e2e-test/
│       └── ...
├── .claude/skills/                # Symlinks → .opentest-workspace/skills/* (Claude Code)
├── .github/skills/                # Symlinks → .opentest-workspace/skills/* (GitHub Copilot)
└── .agents/skills/                # Symlinks → .opentest-workspace/skills/* (Antigravity)

Development

Run tests

# All unit tests
uv run pytest shared/tests -v --tb=short

# With coverage
uv run pytest shared/tests --cov=opentest_core --cov-report=term-missing

Lint, format, type check

uv run ruff check .
uv run ruff format --check .
uv run mypy shared/opentest_core --ignore-missing-imports

Auto-fix formatting:

uv run ruff format .

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                              USER                                        │
│  "Test my API" / "Run E2E on staging.myapp.com" / "Review UX"          │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │ Natural Language
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│            AI AGENT  (Claude Code · GitHub Copilot · Antigravity)       │
│  Loads skill prompt → follows workflow → calls tools → saves report     │
└──────┬────────────────┬──────────────────┬──────────────┬───────────────┘
       │                │                  │              │
       ▼                ▼                  ▼              ▼
┌─────────────┐  ┌────────────┐  ┌──────────────┐  ┌──────────┐
│  api-test   │  │  e2e-test  │  │  a11y-test   │  │ ux-test  │
│             │  │            │  │              │  │          │
│ analyze_spec│  │ playwright │  │ scan_a11y.py │  │playwright│
│ gen_plan    │  │ mcp tools  │  │ (axe-core)   │  │ mcp tools│
│ run_tests   │  │ navigate   │  │              │  │ navigate │
│ gen_code    │  │ click      │  │              │  │ snapshot │
│ analyze_res │  │ screenshot │  │              │  │screenshot│
└──────┬──────┘  └─────┬──────┘  └──────┬───────┘  └────┬─────┘
       │               │                │               │
       ▼               ▼                └───────────────┘
┌─────────────┐  ┌─────────────┐
│ Schemathesis│  │  Playwright │
│  (Python)   │  │   Browser   │
└─────────────┘  └─────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    SHARED PYTHON INFRASTRUCTURE                          │
│  rules_engine/  auth_manager/  result_aggregator/  report_generator/   │
│  test_data/     schema_validator/  models/                               │
└─────────────────────────────────────────────────────────────────────────┘

Technology Stack

| Layer | Technology | License |
| --- | --- | --- |
| Skill Runtime | Claude Code · GitHub Copilot · Antigravity | |
| API Testing Engine | Schemathesis | MIT |
| Browser Automation | playwright-mcp (Microsoft) | Apache 2.0 |
| Accessibility Scanning | axe-core (via Playwright) | MPL-2.0 |
| UX Evaluation Framework | Nielsen's 10 Heuristics | |
| Load Testing | k6 | AGPL-3.0 |
| Security Scanning | OWASP ZAP (Docker) | Apache 2.0 |
| Data Models | Pydantic v2 | MIT |
| HTTP Client | httpx | BSD-3 |
| OpenAPI Parsing | openapi-core | BSD-3 |
| Test Data | Faker | MIT |
| Package Manager | uv | MIT |
| Linter/Formatter | Ruff | MIT |

Philosophy

  • Reuse First — battle-tested tools (Schemathesis, playwright-mcp, axe-core) over reinventing the wheel
  • Build Only Unique Value — we build only what is unique here: the rules engine, AI analysis, and orchestration
  • Auditable — every finding is traceable to a specific rule ID or heuristic
  • Self-Hostable — no vendor lock-in, runs 100% locally
  • Composable — each skill is independent, use only what you need

License

MIT License — see LICENSE for details.


Contributing

Contributions welcome! Please read the contributing guide before opening a PR.

Areas where help is most appreciated:

  • Additional best practice rules (add a .rules.yaml file)
  • New spec parsers (GraphQL mutations, AsyncAPI, gRPC)
  • E2E rule coverage improvements
  • Integration test fixtures for more public apps (SauceDemo, TodoMVC, Conduit)
