
OpenTest

OpenTest is a set of AI Agent Skills that give AI coding assistants the ability to automatically test software — APIs, web frontends, accessibility, and UX — with zero manual test writing. Skills are loaded by the agent and execute entirely within a conversation.

Supported agents: Claude Code · GitHub Copilot · Antigravity

User: "Run E2E tests on https://staging.myapp.com"
Agent: [opens browser → discovers pages & flows → executes scenarios → saves report]
       Found 14 pages. Ran 18 scenarios: ✅ 15 passed, ❌ 3 failed.
       Report saved to .opentest-workspace/e2e/report-2026-03-18.md

User: "Test my API at https://api.myapp.com"
Agent: [parses spec → generates plan → runs Schemathesis → Score: 78/100 (C) + findings]

Why OpenTest?

  • Zero test writing — the agent discovers flows and executes scenarios automatically
  • Actionable findings — every issue linked to a specific rule ID with root cause and fix recommendation
  • Scored results — not just pass/fail; you get a grade, severity breakdown, and top priorities
  • Minimal setup — point your agent at a URL or spec and get results in one conversation
  • Full coverage report — gaps and untested areas are surfaced, not hidden

Skills

| Skill | Trigger | What It Does |
| --- | --- | --- |
| api-test | "test my API", "analyze this spec" | Parse OpenAPI/Postman spec → generate test plan → run Schemathesis → scored report |
| e2e-test | "test my web app", "run E2E tests" | Open browser → discover pages & flows → execute scenarios → save findings report |
| a11y-test | "check accessibility", "WCAG audit" | Run axe-core via Playwright → map violations to WCAG → prioritized fix list |
| ux-test | "review my UX", "audit my UI" | Browse app → evaluate against Nielsen's 10 heuristics → actionable UX report |
| perf-test | "load test", "stress test", "benchmark my API" | Generate k6 script → run load scenario → p50/p95/p99 metrics + threshold report |
| security-scan | "scan for vulnerabilities", "OWASP scan" | Start ZAP via Docker → spider + passive/active scan → severity-ranked findings |

Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • One or more supported AI agents: Claude Code, GitHub Copilot, or Antigravity
  • Node.js 18+ (for Playwright MCP — required by e2e-test, ux-test, and a11y-test)

Installation

# Clone the repository
git clone https://github.com/coderphonui/opentest
cd opentest

# Install Python dependencies
uv sync

Sync skills to your project

# Run from the opentest repo — copies skills to your target project
bin/opentest-sync /path/to/your-project

Skills are stored once in .opentest-workspace/skills/ and symlinked into each agent's directory:

| Agent | Reads skills from |
| --- | --- |
| Claude Code | .claude/skills/ |
| GitHub Copilot | .github/skills/ |
| Antigravity | .agents/skills/ |

All three symlinks point to the same files in .opentest-workspace/skills/, so updates only need to be synced once.

# Sync only specific skills
bin/opentest-sync --skill api-test --skill e2e-test /path/to/your-project

# Preview without writing any files
bin/opentest-sync --dry-run /path/to/your-project

# List available skills
bin/opentest-sync --list
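
Under the hood, the sync keeps a single copy of each skill and repoints the agent directories at it. A simplified sketch of the idea (illustrative only, not the actual bin/opentest-sync implementation):

# Illustrative sketch of the sync layout (not the real bin/opentest-sync code)
import shutil
from pathlib import Path

AGENT_DIRS = [".claude/skills", ".github/skills", ".agents/skills"]

def sync_skill(repo: Path, project: Path, skill: str) -> None:
    source = repo / ".claude" / "skills" / skill
    target = project / ".opentest-workspace" / "skills" / skill
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, target, dirs_exist_ok=True)  # the single real copy
    for agent_dir in AGENT_DIRS:
        link = project / agent_dir / skill
        link.parent.mkdir(parents=True, exist_ok=True)
        if not link.is_symlink():
            link.symlink_to(target)  # every agent shares that one copy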

Playwright MCP (required for e2e-test, ux-test, a11y-test)

Claude Code — per project (recommended — config is committed and shared with the team):

cd /path/to/your-app
claude mcp add --transport stdio --scope project playwright -- npx -y @playwright/mcp@latest

Claude Code — global (available in every project on your machine):

claude mcp add --transport stdio --scope user playwright -- npx -y @playwright/mcp@latest

Verify it's active by running /mcp in Claude Code — playwright should show as connected.

For GitHub Copilot and Antigravity, configure Playwright MCP according to each agent's MCP setup documentation.


Skill: api-test

Analyzes API specs, generates a test plan, executes tests via Schemathesis, and produces a scored report.

Workflow

  1. Analyze — parse the spec, extract endpoints, detect patterns (CRUD, pagination, soft-delete), identify auth type
  2. Plan — apply best-practice rules + derive spec-driven test cases
  3. Execute — run Schemathesis against the live API (pipeline sketched below)
  4. Report — score results, group findings by severity, recommend fixes
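
In script form, the workflow maps onto the skill's bundled scripts. A rough sketch: the script names come from this repo, but the arguments (apart from --custom-rules, covered under Custom Rules below) are assumptions, so check each script's --help:

# Rough pipeline sketch; arguments other than --custom-rules are assumptions
import subprocess

SCRIPTS = ".claude/skills/api-test/scripts"

subprocess.run(["python", f"{SCRIPTS}/analyze_spec.py", "openapi.yaml"], check=True)
subprocess.run(["python", f"{SCRIPTS}/generate_plan.py",
                "--custom-rules", "my_rules.yaml"], check=True)
subprocess.run(["python", f"{SCRIPTS}/run_tests.py",
                "https://api.myapp.com"], check=True)  # runs against the live API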

Usage

You: I have an OpenAPI spec for my users API. Can you test it?
     [paste spec or provide path]
     Base URL: https://api.myapp.com

Claude: Analyzed spec — 12 endpoints, Bearer auth, CRUD pattern detected.

        Score: 72/100 (C)

        Critical (2): SQL injection on POST /users, auth bypass on DELETE /users/{id}
        High (3): Missing 422 on invalid email, response schema mismatch on GET /users
        Medium (1): PUT /users is not idempotent

        Top 3 things to fix:
        1. Sanitize inputs with parameterized queries (POST /users)
        2. Add auth check to DELETE /users/{id}
        3. Return 422 instead of 200 for invalid email format

Auth Config

// Bearer token
{"type": "bearer", "token": "eyJhbGciOiJIUzI1NiJ9..."}

// API Key header
{"type": "api_key", "header": "X-API-Key", "value": "key123"}

// Basic auth
{"type": "basic", "username": "user", "password": "pass"}

// OAuth2 Client Credentials
{"type": "oauth2_cc", "token_url": "https://auth.example.com/token",
 "client_id": "id", "client_secret": "secret"}

Output

All files go to .opentest-workspace/api/ in your project:

| File | Contents |
| --- | --- |
| spec_summary.json | Parsed spec: endpoints, patterns, auth type |
| test_plan.json | Generated test cases with rule IDs |
| results.json | Raw Schemathesis output |

Skill: e2e-test

Explores a web app live in a browser via playwright-mcp, executes test scenarios directly, and saves a findings report. No code generation — everything happens in the browser in real time.

Workflow

  1. Intake — collect URL, auth credentials, and any focus areas
  2. Auth — log in (form, cookie injection, or token injection)
  3. Discovery — BFS-crawl up to 25 pages; build a site map of pages, forms, and flows (a minimal crawl sketch follows this list)
  4. Analysis — identify critical user journeys (P0) and confirm with you before testing
  5. Execution — run scenarios tier by tier (P0 first), screenshot each step, record pass/fail
  6. Report — save findings to .opentest-workspace/e2e/report-<YYYY-MM-DD>.md
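
The discovery step is an ordinary breadth-first crawl capped at 25 pages. A minimal sketch of the idea (illustrative: the skill actually drives a real browser through playwright-mcp rather than fetching raw HTML):

# Minimal BFS crawl sketch (the skill uses playwright-mcp, not raw HTTP)
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import httpx

def crawl(start_url: str, max_pages: int = 25) -> list[str]:
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages.append(url)
        html = httpx.get(url, follow_redirects=True).text
        for href in re.findall(r'href="([^"#]+)"', html):
            nxt = urljoin(url, href)
            if urlparse(nxt).netloc == host and nxt not in seen:  # stay on-site
                seen.add(nxt)
                queue.append(nxt)
    return pages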

Scenario Tiers

| Priority | Type | Example |
| --- | --- | --- |
| P0 | Core happy path | User creates and completes the main action |
| P0 | Auth boundary | Unauthenticated user redirected to login |
| P1 | Form validation | Required field blank → error shown |
| P1 | Navigation | All major sections reachable, no 404/500 |
| P2 | Empty state | Helpful empty state, not broken layout |
| P2 | Error handling | Duplicate name shows error, doesn't crash |

Usage

You: Run E2E tests on https://staging.myapp.com
     Auth: email=test@example.com, password=Pass123!

Claude: [crawls app — 14 pages found]
        Detected 3 critical flows: Login, Checkout, Dashboard navigation.

        Running 18 scenarios...

        Score: ✅ 15 passed  ❌ 3 failed

        🔴 [CRITICAL] Checkout fails on mobile viewport (375px)
        🟠 [HIGH] Cart total not updated after removing item
        🟡 [MEDIUM] No confirmation shown after form submit

        Report: .opentest-workspace/e2e/report-2026-03-18.md

Output

| File | Contents |
| --- | --- |
| screenshots/<page>.png | Screenshot of each discovered page |
| report-<YYYY-MM-DD>.md | Site map, issues by severity, full scenario results |

Skill: a11y-test

Scans web pages with axe-core via Playwright, maps violations to WCAG criteria, and produces concrete, actionable fixes.

Usage

You: Check accessibility on https://myapp.com — run WCAG 2.1 AA audit

Claude: Scanning 3 pages...

        Critical (3): Missing alt text on product images, no form labels on checkout
        Serious (5): Color contrast failures, missing focus styles
        Moderate (4): Heading order inconsistent, missing landmark roles

        Top fix: Add descriptive alt text to 12 <img> elements in /products
        Report: .opentest-workspace/a11y/results.json

WCAG Severity Levels

| Level | Label | Meaning |
| --- | --- | --- |
| A violations | Critical | Major barriers — some users cannot access content |
| AA violations | Serious | Required for legal compliance (ADA, EAA) |
| Best practice | Moderate | Improves assistive technology experience |
| Minor | Low | Small improvements, low user impact |
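
Under the hood, scan_a11y.py follows the standard axe-core injection pattern. A minimal sketch (the CDN URL and version pin are assumptions):

# Standard axe-core injection pattern (CDN URL/version are assumptions)
from playwright.sync_api import sync_playwright

AXE_CDN = "https://cdn.jsdelivr.net/npm/axe-core@4/axe.min.js"

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://myapp.com")
    page.add_script_tag(url=AXE_CDN)      # load axe-core into the page
    results = page.evaluate("axe.run()")  # Playwright awaits the returned promise
    for v in results["violations"]:
        print(v["impact"], v["id"], v["help"])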

Skill: ux-test

Explores a web app live in a browser, captures screenshots, and evaluates UI/UX quality against Nielsen's 10 Usability Heuristics.

Usage

You: Review UX on https://myapp.com — users are non-technical customers

Claude: [browses 10 pages, captures screenshots and UI states]

        🔴 [CRITICAL] Users have no way to recover from form errors — no inline validation
        🟠 [HIGH] Primary CTA buried below the fold on mobile
        🟡 [MEDIUM] Inconsistent button labels across 3 pages
        💡 Add empty state illustrations to /dashboard

        Report: .opentest-workspace/ux/report-2026-03-18.md

Output

| File | Contents |
| --- | --- |
| screenshots/<page>.png | Screenshots of each page and UI state |
| report-<YYYY-MM-DD>.md | Findings by severity, page-by-page summary, top actions |

Skill: perf-test

Generates a k6 load test script from a URL or OpenAPI spec, executes it under configurable load, and reports p50/p95/p99 latency, throughput, and error rates against SLA thresholds.

Scenarios

| Scenario | Profile | Purpose |
| --- | --- | --- |
| smoke | 1 VU / 30s | Verify setup, baseline latency |
| load | ramp to 10 VUs / 2m | Typical production load |
| stress | ramp to 100 VUs / 5m | Find the breaking point |
| soak | 5 VUs / 10m | Detect memory leaks |
| spike | 0→100→0 VUs / 1m | Sudden traffic spikes |
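
run_perf.py generates and runs k6 scripts along these lines. A minimal sketch of a generated stress script and its invocation (the stages and thresholds are illustrative defaults):

# Sketch: write a minimal k6 stress script and run it (values are illustrative)
import subprocess
from pathlib import Path

K6_SCRIPT = """
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 100 },  // ramp to 100 VUs
    { duration: '3m', target: 100 },  // hold
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // fail the run if p95 exceeds 500ms
    http_req_failed: ['rate<0.01'],    // fail if error rate exceeds 1%
  },
};

export default function () {
  const res = http.get(__ENV.TARGET);
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
"""

out = Path(".opentest-workspace/perf")
out.mkdir(parents=True, exist_ok=True)
(out / "k6_script.js").write_text(K6_SCRIPT)
subprocess.run(["k6", "run", "-e", "TARGET=https://api.myapp.com",
                str(out / "k6_script.js")], check=True)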

Usage

You: Load test https://api.myapp.com — scenario: stress

Claude: Generating k6 script for stress scenario (100 VUs)...
        Running: .opentest-workspace/perf/k6_script.js

        ❌ FAIL — p95 threshold violated

        Response Times:   p50: 145ms  p95: 3,420ms  p99: 8,200ms
        Throughput:       42 req/s at peak (100 VUs)
        Error rate:       2.3% ← threshold: <1% ❌

        Top 3 findings:
        1. p95 latency 3.4s at 100 VUs — likely DB query without index on POST /orders
        2. Error rate 2.3% — connection pool exhausted; increase pool size
        3. p99 spike to 8.2s — consider adding Redis caching for GET /products

        Report: .opentest-workspace/perf/report-2026-03-18.md
        Script saved: .opentest-workspace/perf/k6_script.js (reusable in CI)

Prerequisites

# macOS
brew install k6

# Ubuntu/Debian
sudo apt-get install k6   # requires adding the k6 apt repo first; see https://k6.io/docs/get-started/installation/

Output

| File | Contents |
| --- | --- |
| k6_script.js | Generated k6 script (reusable in CI) |
| results.json | Raw k6 metrics (p50/p95/p99, req/s, error rate) |
| report-<YYYY-MM-DD>.md | Verdict, findings, top recommendations |

Skill: security-scan

Runs OWASP ZAP (in Docker) against a web app or API: spiders the target, runs a passive scan (optionally an active one), and produces a severity-ranked vulnerability report.

⚠️ Only scan systems you own or have explicit written permission to test.

Usage

You: Security scan https://staging.myapp.com — passive scan only

Claude: Starting ZAP Docker container...
        Spidering https://staging.myapp.com — 24 URLs found.
        Running passive scan...

        🟠 High (2): Missing Content-Security-Policy, Cookies without Secure flag
        🟡 Medium (5): Missing HSTS, CORS misconfiguration, exposed server header...
        🔵 Low (8): Informational headers, fingerprinting risks

        Top 3 fixes:
        1. Add 4 security headers via middleware (CSP, HSTS, X-Frame-Options, X-Content-Type-Options) — ~30 min
        2. Set Secure + HttpOnly on session cookie — single config change
        3. Restrict CORS from * to your frontend domain

        Report: .opentest-workspace/security/report-2026-03-18.md

Scan Modes

| Mode | Duration | Best for |
| --- | --- | --- |
| passive | ~2 min | Safe for production; inspects traffic only |
| active | 10–30 min | Staging/dev; sends attack probes to find more issues |
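
Passive mode maps onto ZAP's baseline scan. A sketch of the underlying Docker invocation (the report filename is illustrative; the flags belong to the official zap-baseline.py helper):

# Sketch: ZAP passive baseline scan via Docker (report filename is illustrative)
import subprocess
from pathlib import Path

out = Path(".opentest-workspace/security").resolve()
out.mkdir(parents=True, exist_ok=True)

subprocess.run([
    "docker", "run", "--rm",
    "-v", f"{out}:/zap/wrk/:rw",        # mount output dir into the container
    "ghcr.io/zaproxy/zaproxy:stable",
    "zap-baseline.py",
    "-t", "https://staging.myapp.com",  # target: spider + passive inspection only
    "-J", "results.json",               # JSON report, written under /zap/wrk/
], check=False)                         # non-zero exit just means findings were raised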

Prerequisites

docker pull ghcr.io/zaproxy/zaproxy:stable

Output

| File | Contents |
| --- | --- |
| results.json | Structured findings: severity, OWASP mapping, evidence, fix |
| report-<YYYY-MM-DD>.md | Severity-ranked report with remediation actions |

Scoring (API Testing)

| Score | Grade | Meaning |
| --- | --- | --- |
| 90–100 | A | Excellent — production-ready |
| 80–89 | B | Good — minor issues to address |
| 70–79 | C | Acceptable — several improvements needed |
| 60–69 | D | Poor — significant gaps |
| 0–59 | F | Critical issues — do not ship |

Score deductions: Critical −15 pts each (capped at −45), High −8 pts each (capped at −24), Medium −3 pts each, Low −1 pt each. Bonuses: >80% coverage +5 pts, all P0 passed +5 pts.
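
Worked example: one critical, two high, and two medium findings with coverage above 80% gives 100 − 15 − 16 − 6 + 5 = 68, a D.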


Best Practice Rules

Rules are defined in YAML files under .claude/skills/api-test/rules/.

API Rules

| File | Rules | Focus |
| --- | --- | --- |
| auth.rules.yaml | AUTH-001 to AUTH-003 | Authentication, authorization, BOLA |
| validation.rules.yaml | VAL-001 to VAL-005 | Input validation, SQL injection, XSS |
| response.rules.yaml | RESP-001 to RESP-005 | Response format, schema, error codes |
| idempotency.rules.yaml | IDEM-001 to IDEM-003 | PUT/DELETE idempotency |
| pagination.rules.yaml | PAG-001 to PAG-003 | Pagination params, metadata |

Custom Rules

# my_rules.yaml
custom_rules:
  - id: CUSTOM-001
    name: "All responses must include X-Request-ID header"
    severity: high
    category: api_design
    applies_to: {}
    additional_assertions:
      - type: header
        target: "X-Request-ID"
        operator: exists

Pass it to generate_plan.py via --custom-rules my_rules.yaml.
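
For illustration, the header assertion above boils down to a check like this (the function name and shape are hypothetical, not the actual rules_engine API):

# Hypothetical assertion check; not the real opentest_core.rules_engine API
import httpx

def header_exists(response: httpx.Response, target: str) -> bool:
    """Implements {type: header, operator: exists} from the YAML above."""
    return target in response.headers

resp = httpx.get("https://api.myapp.com/users")
if not header_exists(resp, "X-Request-ID"):
    print("CUSTOM-001 failed: missing X-Request-ID (severity: high)")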


Repository Structure

opentest/                          # This repository (skill source)
├── .claude/
│   └── skills/                    # Skill definitions (source of truth)
│       ├── api-test/
│       │   ├── SKILL.md           # Skill prompt loaded by the agent
│       │   ├── scripts/           # analyze_spec.py, generate_plan.py, run_tests.py, ...
│       │   └── rules/             # API best-practice rules (YAML)
│       ├── e2e-test/
│       │   ├── SKILL.md
│       │   └── rules/
│       ├── a11y-test/
│       │   ├── SKILL.md
│       │   └── scripts/           # scan_a11y.py (axe-core via Playwright)
│       ├── ux-test/
│       │   ├── SKILL.md
│       │   └── references/        # heuristics.md (Nielsen's 10 heuristics)
│       ├── perf-test/
│       │   ├── SKILL.md
│       │   └── scripts/           # run_perf.py (k6 script generator + runner)
│       └── security-scan/
│           ├── SKILL.md
│           └── scripts/           # run_security_scan.py (OWASP ZAP via Docker)
├── shared/
│   └── opentest_core/
│       ├── models/                # Pydantic data models
│       ├── auth/                  # AuthManager (bearer, API key, basic, OAuth2, cookie)
│       ├── rules_engine/          # YAML rules loader, engine, scorer
│       ├── test_data/             # Boundary values, fuzz payloads, Faker wrapper
│       ├── result/                # ResultAggregator, ReportGenerator
│       └── schema/                # jsonschema validator
├── bin/
│   └── opentest-sync              # Sync skills from this repo into a target project
├── docs/                          # Contributing guide
└── .github/workflows/             # CI/CD pipeline

your-project/                      # After running opentest-sync
├── .opentest-workspace/
│   └── skills/                    # Skill files live here (single copy)
│       ├── api-test/
│       ├── e2e-test/
│       └── ...
├── .claude/skills/                # Symlinks → .opentest-workspace/skills/* (Claude Code)
├── .github/skills/                # Symlinks → .opentest-workspace/skills/* (GitHub Copilot)
└── .agents/skills/                # Symlinks → .opentest-workspace/skills/* (Antigravity)

Development

Run tests

# All unit tests
uv run pytest shared/tests -v --tb=short

# With coverage
uv run pytest shared/tests --cov=opentest_core --cov-report=term-missing

Lint, format, type check

uv run ruff check .
uv run ruff format --check .
uv run mypy shared/opentest_core --ignore-missing-imports

Auto-fix formatting:

uv run ruff format .

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                              USER                                        │
│  "Test my API" / "Run E2E on staging.myapp.com" / "Review UX"          │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │ Natural Language
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│            AI AGENT  (Claude Code · GitHub Copilot · Antigravity)       │
│  Loads skill prompt → follows workflow → calls tools → saves report     │
└──────┬────────────────┬──────────────────┬──────────────┬───────────────┘
       │                │                  │              │
       ▼                ▼                  ▼              ▼
┌─────────────┐  ┌────────────┐  ┌──────────────┐  ┌──────────┐
│  api-test   │  │  e2e-test  │  │  a11y-test   │  │ ux-test  │
│             │  │            │  │              │  │          │
│ analyze_spec│  │ playwright │  │ scan_a11y.py │  │playwright│
│ gen_plan    │  │ mcp tools  │  │ (axe-core)   │  │ mcp tools│
│ run_tests   │  │ navigate   │  │              │  │ navigate │
│ gen_code    │  │ click      │  │              │  │ snapshot │
│ analyze_res │  │ screenshot │  │              │  │screenshot│
└──────┬──────┘  └─────┬──────┘  └──────┬───────┘  └────┬─────┘
       │               │                │               │
       ▼               ▼                └───────────────┘
┌─────────────┐  ┌─────────────┐
│ Schemathesis│  │  Playwright │
│  (Python)   │  │   Browser   │
└─────────────┘  └─────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    SHARED PYTHON INFRASTRUCTURE                          │
│  rules_engine/  auth_manager/  result_aggregator/  report_generator/   │
│  test_data/     schema_validator/  models/                               │
└─────────────────────────────────────────────────────────────────────────┘

Technology Stack

| Layer | Technology | License |
| --- | --- | --- |
| Skill Runtime | Claude Code · GitHub Copilot · Antigravity | |
| API Testing Engine | Schemathesis | MIT |
| Browser Automation | playwright-mcp (Microsoft) | Apache 2.0 |
| Accessibility Scanning | axe-core (via Playwright) | MPL-2.0 |
| UX Evaluation Framework | Nielsen's 10 Heuristics | |
| Load Testing | k6 | AGPL-3.0 |
| Security Scanning | OWASP ZAP (Docker) | Apache 2.0 |
| Data Models | Pydantic v2 | MIT |
| HTTP Client | httpx | BSD-3 |
| OpenAPI Parsing | openapi-core | BSD-3 |
| Test Data | Faker | MIT |
| Package Manager | uv | MIT |
| Linter/Formatter | Ruff | MIT |

Philosophy

  • Reuse First — battle-tested tools (Schemathesis, playwright-mcp, axe-core) over reinventing the wheel
  • Build Only Unique Value — we build only what is unique here: the rules engine, AI analysis, and orchestration
  • Auditable — every finding is traceable to a specific rule ID or heuristic
  • Self-Hostable — no vendor lock-in, runs 100% locally
  • Composable — each skill is independent, use only what you need

License

MIT License — see LICENSE for details.


Contributing

Contributions welcome! Please read the contributing guide before opening a PR.

Areas where help is most appreciated:

  • Additional best practice rules (add a .rules.yaml file)
  • New spec parsers (GraphQL mutations, AsyncAPI, gRPC)
  • E2E rule coverage improvements
  • Integration test fixtures for more public apps (SauceDemo, TodoMVC, Conduit)
