chore: add coverage analysis for generated tests (#275) by esraagamal6 · Pull Request #278 · camunda/api-test-generator

esraagamal6 · 2026-05-18T11:45:51Z

Status

This PR is intentionally kept open as living scaffolding — not meant to be merged into main. It will be closed (and the directory deleted) once the api-test-generator is delivered. The PR diff itself serves as the durable artifact for the assessment; the branch can be checked out to re-run coverage-analysis/build_coverage.py against the latest generator output.

Summary

Answers the questions in #275.

Adds coverage-analysis/ — a Python script and the artifacts it produces, categorising every generated test in the same shape as upstream's c8-orchestration-cluster-e2e-test-suite/coverage-analysis/, so the two suites can be diffed directly.

Test sources scanned:

playwright/<operationId>.feature.spec.ts — feature emitter (happy path + base shape)
playwright/<operationId>.variant.spec.ts — variant emitter (schema/input variations: bpmn, oneOf …, etc.)
playwright/edges/<EdgeName>.lifecycle.spec.ts — edge lifecycle template (establish → observe present → revoke → observe absent)
playwright/entities/<EntityName>.lifecycle.spec.ts — entity lifecycle template (create → present → update → present → delete → absent)
request-validation/<entity>-validation-api-tests.spec.ts — request-validation emitter (negative schema cases, all bad-request)
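The five locations above differ only by path, so a row's emitter can be derived from the file path alone. A minimal sketch, assuming paths are relative to the generated output root; `source_of` and the label strings are hypothetical and not taken from build_coverage.py:

```python
import re
from typing import Optional

# Ordered (pattern, label) pairs. The labels are illustrative; the real
# values written to the `source` column may differ.
SOURCE_PATTERNS = [
    (re.compile(r"^playwright/edges/[^/]+\.lifecycle\.spec\.ts$"), "edge-lifecycle"),
    (re.compile(r"^playwright/entities/[^/]+\.lifecycle\.spec\.ts$"), "entity-lifecycle"),
    (re.compile(r"^playwright/[^/]+\.feature\.spec\.ts$"), "feature"),
    (re.compile(r"^playwright/[^/]+\.variant\.spec\.ts$"), "variant"),
    (re.compile(r"^request-validation/[^/]+-validation-api-tests\.spec\.ts$"),
     "request-validation"),
]


def source_of(rel_path: str) -> Optional[str]:
    """Return the emitter label for a spec path, or None if it is not scanned."""
    for pattern, label in SOURCE_PATTERNS:
        if pattern.match(rel_path):
            return label
    return None
```

The file names used to exercise this (for example `playwright/createUser.feature.spec.ts`) are made up for illustration.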

Outputs (regenerate per the README; this requires running npm run pipeline first, which chains fetch-spec, testsuite:generate, and generate:request-validation, because spec/ and generated/ are gitignored):

tests.csv — one row per test() declaration (1617 rows) with source, entity, category, operation, form_step, prerequisite, variants, test_name, etc.
coverage_matrix.csv / .md — entity × operation grid; total = unique-test count per cell, variant columns = label-occurrences (matches upstream semantics, so multi-label tests count once toward total).
gaps.md — heuristic gap report.
category_breakdown.md — per-category (A–O upstream buckets + P for agent-instance) with Form, prerequisites, observation channel split, form-step counts, variants, and per-test rows with file:line.
lifecycle_disjoint.md — manually maintained disjoint comparison of the 10 EntityLifecycle tests against upstream's matching tests (answers Josh's question on #279, "Close coverage gap vs upstream e2e suite (negative-path + search-refinement emitters)").
README.md — explains the files, the classification rules, and how to regenerate.
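For orientation, "one row per test() declaration" with file:line can be sketched roughly as below. This is not the actual scanner: the regex, the `scan_tests` name, and the assumption that every test title is a single-line string literal are all illustrative.

```python
import re

# Matches the start of a Playwright test declaration and captures its title.
# Assumption: titles are single-line literals quoted with ', " or `.
# test.describe(...) is deliberately not matched.
TEST_DECL = re.compile(
    r"""^\s*test(?:\.(?:only|skip|fixme))?\(\s*(['"`])(?P<name>.*?)\1"""
)


def scan_tests(path: str, text: str) -> list:
    """Return one dict per test() declaration with its 1-based line number."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = TEST_DECL.match(line)
        if match:
            rows.append(
                {"file": path, "line": lineno, "test_name": match.group("name")}
            )
    return rows
```

The remaining columns (entity, category, operation, form_step, prerequisite, variants) would be derived per row by the classification rules described in the README.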

Findings

Upstream snapshot: camunda/camunda#53387 (head 7cf8bc1).

	upstream	generator	diff
Unique tests	1001	1617	+616
bad-request (400)	195	1071	+876
happy-path (occurrences)	173	211	+38
pagination-sort (occurrences)	53	85	+32
filter (occurrences)	85	196	+111
observe-absence	2	48	+46
data-driven / oneOf variants	5	302	+297
unauthorized (401)	165	0	-165
not-found (404)	127	0	-127
conflict (409)	31	0	-31
forbidden (403)	29	0	-29

The generator emits more tests than upstream, dominated by the request-validation emitter (1071 bad-request tests across 17 violation kinds). The variant emitter exercises pagination (page.after cursor) and filter (filter: { ... }) request shapes on many search and batch-operation specs, so those columns are non-zero; but these tests only assert status 200 + response schema, not pagination/filter correctness. Upstream's pagination/filter tests are behaviour assertions; the generator's are request-shape assertions. The buckets where the generator emits zero — 401, 403, 404, 409 — total ~352 missing tests.
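The diff column and the ~352 figure can be recomputed from the table. A small sketch using only the numbers listed above (the dictionary keys are shortened labels, nothing else is assumed):

```python
# (upstream, generator) counts copied from the Findings table.
FINDINGS = {
    "unique tests": (1001, 1617),
    "bad-request (400)": (195, 1071),
    "happy-path": (173, 211),
    "pagination-sort": (53, 85),
    "filter": (85, 196),
    "observe-absence": (2, 48),
    "data-driven / oneOf variants": (5, 302),
    "unauthorized (401)": (165, 0),
    "not-found (404)": (127, 0),
    "conflict (409)": (31, 0),
    "forbidden (403)": (29, 0),
}

# generator minus upstream, per row
diffs = {label: gen - up for label, (up, gen) in FINDINGS.items()}

# Buckets where the generator emits nothing at all.
zero_buckets = {label: up for label, (up, gen) in FINDINGS.items() if gen == 0}
missing_total = sum(zero_buckets.values())  # 165 + 127 + 31 + 29
```

Because the variant columns count label occurrences, the number of distinct upstream tests behind `missing_total` may be somewhat lower than the sum.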

Note on the not-found count. The matrix not-found: 0 reflects upstream's
semantic taxonomy, not "generator never asserts 404". Upstream splits 404 into
observe-absence (GET after DELETE — entity was created, now gone) and
not-found (GET against a fake/never-existing ID — entity was never created).
The generator's 10 entity-lifecycle tests + 12 edge-lifecycle tests + 26 feature/variant
negative empty tests each end with expect(status).toBe(404) — these are real 404
assertions but bucketed as observe-absence (48 occurrences). The capability gap is
specifically the fake-ID pattern; that's what upstream's 127 not-found tests cover.
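The split described in this note, and where the 48 comes from, as a short sketch. `bucket_404` is a hypothetical helper that restates upstream's rule; the counts are the ones quoted above:

```python
def bucket_404(entity_was_created: bool) -> str:
    """Upstream's split of 404 assertions, as described in the note."""
    return "observe-absence" if entity_was_created else "not-found"


# Generator tests ending in expect(status).toBe(404), all bucketed as
# observe-absence per the note.
generator_404_tests = {
    "entity-lifecycle": 10,
    "edge-lifecycle": 12,
    "feature/variant negative empty": 26,
}
observe_absence_total = sum(generator_404_tests.values())
```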

Follow-up emitter plan tracked in #279; methodology / coverage-analyzer discussion in #277. Verified independently against upstream's source files (not just their published tests.csv); discovered one classification bug in upstream's build_coverage.py and filed it as a comment on camunda/camunda#53387.

Test plan

Re-run python3 coverage-analysis/build_coverage.py and confirm it writes the 5 generated artifacts without errors.
Spot-check a row in tests.csv against the corresponding spec file.
Compare the TOC of category_breakdown.md against the request in #275 ("Categorise existing OCA test coverage") for "categorisation + form + variants + counts + which tests".
Confirm coverage_matrix.csv total column equals unique rows per (entity, operation) in tests.csv.
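The last check can be scripted. A rough sketch, assuming a long-format matrix with `entity`, `operation` and `total` columns; the real coverage_matrix.csv may be laid out as a grid, in which case the second loop would need adapting:

```python
import csv
import io
from collections import Counter


def check_totals(tests_csv: str, matrix_csv: str) -> list:
    """Return (entity, operation, expected, actual) tuples that disagree."""
    # tests.csv has one row per test, so counting rows per cell gives the
    # unique-test count for that (entity, operation) pair.
    expected = Counter(
        (row["entity"], row["operation"])
        for row in csv.DictReader(io.StringIO(tests_csv))
    )
    mismatches = []
    for row in csv.DictReader(io.StringIO(matrix_csv)):
        key = (row["entity"], row["operation"])
        actual = int(row["total"])
        if expected.get(key, 0) != actual:
            mismatches.append((*key, expected.get(key, 0), actual))
    return mismatches
```

In practice both arguments would be the file contents, e.g. `open("coverage-analysis/tests.csv").read()`. The sketch only walks cells present in the matrix, so a cell missing from the matrix entirely would need a separate check.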

🤖 Generated with Claude Code

Adds coverage-analysis/ which categorises the tests emitted under generated/camunda-oca/playwright/ and produces a matrix in the same shape as upstream's c8-orchestration-cluster-e2e-test-suite/coverage-analysis, so the two suites can be diffed directly.

Outputs (regenerate with `python3 coverage-analysis/build_coverage.py`):

- tests.csv: per-test labels (file, line, entity, category, operation, form_step, prerequisite, variants, test_name) across 518 declarations.
- coverage_matrix.csv / .md: entity x operation grid with variant counts.
- gaps.md: heuristic gap report (missing 401/403/400/404/409 coverage, missing observe-after-delete, search ops without pagination/filter).
- category_breakdown.md: per-category (A-O upstream buckets + P for agent-instance) with Form, prerequisites, observation channel split, form-step counts, variants, and per-test rows with file:line.

Answers the questions in #275 for the generator. The findings: the ~483-test gap vs upstream is concentrated in negative-path tests (575 missing across 400/401/403/404/409) and search refinement (138 missing across pagination-sort/filter); the generator already exceeds upstream on input-shape variants (data-driven +290) and observe-absence (+24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Makes it explicit in the README that this directory is not part of the product surface — it exists to assess what the generator emits during implementation, and can be deleted once the generator is delivered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upstream regenerated coverage_matrix.csv so the total column now equals unique-test count (previously double-counted multi-labeled tests). The overall 483-test gap is unchanged, but the per-bucket numbers tighten:

- bad-request: 232 -> 195
- unauthorized: 163 -> 165
- not-found: 123 -> 127
- forbidden: 28 -> 29
- conflict: 29 -> 31
- pagination-sort + filter: 138 (unchanged)

Also distinguish "label occurrences" (a test with two negative labels counts twice) from "unique tests with any negative label" (543), which is the more useful number for emitter planning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hot note Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous build_coverage.py only scanned playwright/*.feature.spec.ts and *.variant.spec.ts (518 tests). It missed two other generator outputs:

- request-validation/*.spec.ts (1037 bad-request tests across 17 violation kinds: additional-prop, constraint-violation, enum-violation, missing-body, missing-required, oneof-ambiguous/cross-bleed/none-match, param-missing, type-mismatch, union, unique-items-violation, etc.)
- playwright/edges/*.lifecycle.spec.ts (12 edge lifecycle tests, each exercising establish -> observe present -> revoke -> observe absent)

Corrected totals: generator emits 1567 unique tests (was misreported as 518) across 4 sources (feature 227, variant 291, lifecycle 12, request-validation 1037). The generator emits 566 more tests than upstream's 1001, not 483 fewer. The real gap is in 401/403/404/409 + pagination/filter (~490 tests upstream has that the generator does not).

Adds 'source' column to tests.csv so each test row identifies which emitter produced it. Adds 'lifecycle' operation kind and 'negative-*' form steps to the form-step ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-ran the full generator pipeline (npm run pipeline + generate:request-validation) against latest main, then re-ran build_coverage.py. Updated totals: 1607 unique tests (1567 -> 1607, delta +40 from running against current generator code vs. the May 18 snapshot). Source breakdown: feature 231, variant 293, lifecycle 12, request-validation 1071. Also mapped /forms/{formKey} (new endpoint) into the user-task entity so the two getFormByKey tests land in F. User-Task Lifecycle instead of Z. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a coverage-analysis/ scaffold to summarize and compare generated Camunda OCA test coverage against the upstream suite for issue #275.

Changes:

Adds a Python coverage builder that scans generated Playwright and request-validation specs.
Commits generated CSV/Markdown coverage artifacts, including matrix, gaps, and category breakdown.
Documents classification rules and regeneration workflow.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 5 comments.


File	Description
`coverage-analysis/build_coverage.py`	Generates coverage rows, matrix, gaps, and category breakdown from generated specs.
`coverage-analysis/README.md`	Explains the coverage analysis artifacts, comparison, and regeneration command.
`coverage-analysis/tests.csv`	Per-test declaration coverage inventory.
`coverage-analysis/coverage_matrix.csv`	Machine-readable entity × operation × variant matrix.
`coverage-analysis/coverage_matrix.md`	Markdown rendering of the coverage matrix.
`coverage-analysis/gaps.md`	Heuristic gap report by coverage category.
`coverage-analysis/category_breakdown.md`	Per-category and per-entity detailed test listing.


- matrix total now equals unique-test count per (entity, operation), not the sum of variant labels (matches upstream camunda/camunda#53387 semantics where multi-label tests count once toward total)
- category_breakdown variant counts now split multi-label variants ('happy-path|observe-absence') so the counts reconcile with matrix columns and per-test rows
- README regenerate section now lists the full setup (npm install, npm run pipeline, npm run generate:request-validation) since spec/ and generated/ are gitignored
- README limitations no longer claim 'zero error-path tests': the generator emits 1000+ bad-request tests via request-validation; only 401/403/404/409 + pagination/filter are zero

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
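The counting rule in the first two bullets, sketched under the assumption that each row is one unique test with a pipe-joined `variants` string. `aggregate` is a hypothetical name, not the function in build_coverage.py:

```python
from collections import Counter


def aggregate(rows):
    """rows: dicts with 'entity', 'operation' and a pipe-joined 'variants'."""
    totals = Counter()          # unique tests per (entity, operation)
    variant_counts = Counter()  # label occurrences per (entity, operation, label)
    for row in rows:
        cell = (row["entity"], row["operation"])
        totals[cell] += 1  # a multi-label test still counts once here
        for label in row["variants"].split("|"):
            if label:
                variant_counts[(*cell, label)] += 1
    return totals, variant_counts
```

With this shape the variant columns of a cell can add up to more than its total, which is the expected outcome rather than a double count.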

esraagamal6 · 2026-05-19T11:32:24Z

@copilot please re-review — addressed all 5 comments in commit 9d4e305 (see individual replies).

Copilot · 2026-05-19T11:34:25Z

> @copilot please re-review — addressed all 5 comments in commit 9d4e305 (see individual replies).

Re-reviewed commit 9d4e305 and confirmed the 5 addressed items are in place: matrix total uses unique-test counts, multi-label variants are split for aggregation, README regenerate steps include spec/generated setup, and the 400-coverage wording and PR summary were updated accordingly. No additional code changes needed from my side.

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.

- prerequisite_of now converts CamelCase entity names to kebab-case via a small helper, so MappingRule -> mapping-rule (was emitting 'mappingrule') in membership prerequisite strings; this keeps slugs consistent across tests.csv and category_breakdown.md
- category_breakdown.md table cells now escape multi-label variants: 'happy-path|observe-absence' is written as 'happy-path, observe-absence' so the literal | doesn't get parsed as a markdown column separator (lifecycle rows were rendering with an extra column)
- README comparison table refreshed to match the regenerated artifacts: 1607 -> 1617 unique tests, +606 -> +616 vs upstream (was stale relative to coverage_matrix.csv after the entities/ scanner landed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
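Both fixes are small string transforms. A sketch of what such helpers could look like; `to_kebab` and `variants_cell` are illustrative names and the actual helper in build_coverage.py may be written differently:

```python
import re


def to_kebab(name: str) -> str:
    """MappingRule -> mapping-rule; already-lowercase names pass through."""
    # Insert a hyphen at each lowercase/digit -> uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name).lower()


def variants_cell(variants: str) -> str:
    """Render pipe-joined labels without a literal | inside a table cell."""
    return ", ".join(variants.split("|"))
```

This boundary rule does not split runs of capitals, so acronym-style names would need extra handling if any entity uses them.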

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 4 comments.

The classifier previously only inspected test names. Variant emitter tests exercise pagination (page.after cursor) and filter (filter: { ... } body) on many search and batch-operation specs, but their names are generic (variant-N - X - path #1) so they were bucketed as data-driven/unlabeled instead of pagination-sort / filter.

Added body-shape detection: if a test() block contains 'page: {' or 'sort: [' in the request body, add 'pagination-sort' to its variants; if it contains 'filter: {', add 'filter'. Matches the field-assignment form so response-access expressions (json?.page?.startCursor) don't false-positive.

Effect on the matrix:

- pagination-sort: 0 -> 85 (upstream 53)
- filter: 0 -> 196 (upstream 85)
- unlabeled: 12 -> 1

README updated to flag the semantic distinction: these counts are request-shape coverage, not behaviour coverage. Upstream's hand-written pagination/filter tests assert results; generator's only assert status code + response schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
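A sketch of the body-shape check described in this commit. The regexes are an approximation: the commit names the literal forms 'page: {', 'sort: [' and 'filter: {', while the flexible whitespace and the lookbehind guard here are added assumptions.

```python
import re

# Field-assignment forms only: a key, a colon, then an opening bracket.
# Property reads such as json?.page?.startCursor have no "page: {" and so
# cannot match. The lookbehind (an extra guard, not from the commit) rejects
# longer identifiers such as homepage: { ... }.
PAGINATION = re.compile(r"(?<![\w.?])(?:page:\s*\{|sort:\s*\[)")
FILTER = re.compile(r"(?<![\w.?])filter:\s*\{")


def body_labels(test_block: str) -> list:
    """Return the variant labels implied by the request body of one test()."""
    labels = []
    if PAGINATION.search(test_block):
        labels.append("pagination-sort")
    if FILTER.search(test_block):
        labels.append("filter")
    return labels
```

As the commit notes, a match only shows that the request carried that shape. It says nothing about whether the response was checked for correct paging or filtering.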

- build_coverage.py: move `reset` from delete-regex to update-regex so resetClock (POST /clock/reset, admin state reset) classifies as update, not delete. Eliminates the false positive "clock — has create+delete but no observe-absence" in gaps.md.
- build_coverage.py: docstring "Test sources scanned" list now includes generated/camunda-oca/playwright/entities/*.lifecycle.spec.ts (the 10 EntityLifecycle rows the script already scans).
- README.md: "three locations" → "five locations" to match the five-row source table below it.
- lifecycle_disjoint.md: use the entity-lifecycle terminology (create → present → update → present → delete → absent) for the generator's EntityLifecycle tests, not the edge-lifecycle (establish/revoke) terms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
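To illustrate the first bullet: with `reset` in the update pattern, resetClock no longer looks like a delete. The verb lists below are deliberately minimal and hypothetical; only the placement of `reset` comes from the commit.

```python
import re

# Illustrative verb lists, matched against the start of an operationId.
DELETE_RE = re.compile(r"^(delete|remove)", re.IGNORECASE)
UPDATE_RE = re.compile(r"^(update|reset)", re.IGNORECASE)


def operation_kind(operation_id: str) -> str:
    """Classify an operationId as delete, update, or other."""
    if DELETE_RE.match(operation_id):
        return "delete"
    if UPDATE_RE.match(operation_id):
        return "update"
    return "other"
```

Since the delete-then-observe-absence heuristic in gaps.md keys off the delete classification, moving one verb is enough to drop the clock entry.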

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

…able It is manually maintained (not regenerated by build_coverage.py), so the row is flagged accordingly so readers don't expect the script to keep it in sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.

`npm run pipeline` already chains `fetch-spec`, `testsuite:generate`, and `generate:request-validation`. Calling `generate:request-validation` a second time runs the request-validation emitter twice unnecessarily. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

- coverage_matrix.md no longer claims variants are 'first-match labels from test names'. New blurb names the three label sources (name suffix, body shape, fixed emitter labels) and notes that matrix columns are not mutually exclusive: a multi-label test counts in both columns but only once in 'total'.
- gaps.md 'Search ops with no pagination/sort or filter coverage' section now emits '- _(none)_' when no entries match, matching the 'delete-then-observe-absence' section. Previously the empty list was ambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
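The empty-section behaviour in the second bullet amounts to something like the following. `render_section` and the `##` heading level are assumptions; only the `- _(none)_` placeholder comes from the commit.

```python
def render_section(title: str, entries: list) -> str:
    """Render one gaps.md section, with an explicit marker when it is empty."""
    lines = [f"## {title}", ""]
    if entries:
        lines.extend(f"- {entry}" for entry in entries)
    else:
        # An explicit placeholder distinguishes "checked, nothing found"
        # from a section that failed to render.
        lines.append("- _(none)_")
    return "\n".join(lines) + "\n"
```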

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.

`unlabeled` describes the NAME-classification ('no info derivable from the test name'). Body-shape detection (pagination/filter) is a separate axis, so a dynamic 'variant-N - scenario' test with a filter body is both name-unlabeled AND body-filter — it should carry both labels. Previously the augment logic dropped 'unlabeled' when extras were added, which under-reported dynamic scenarios in the inventory. After: 7 rows carry `unlabeled|filter` (was 1 alone) — these are searchAuditLogs / searchAuditLogs.variant scenarios that were being mis-counted only under `filter`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
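The corrected augment step, sketched with a hypothetical `merge_labels` helper: name-derived labels (including `unlabeled`) are kept and body-derived labels are appended, rather than replacing them.

```python
def merge_labels(name_labels: list, body_labels: list) -> str:
    """Join name-derived and body-derived labels into a pipe-joined string."""
    merged = list(name_labels)  # 'unlabeled' stays if the name gave no info
    for label in body_labels:
        if label not in merged:  # avoid duplicates such as filter|filter
            merged.append(label)
    return "|".join(merged)
```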

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated no new comments.

…antic split

The matrix shows not-found=0 for the generator, but this is upstream's taxonomy split, not "generator never asserts 404". Spelled out:

- observe-absence = GET after DELETE (entity was created, now gone). Generator has 48 of these.
- not-found = GET against a fake/never-existing ID (never created). Generator has 0 of these; this is the actual capability gap.

Every entity-lifecycle test ends with expect(status).toBe(404), so the generator IS asserting 404 in 10 places; they're bucketed as observe-absence because they exercise the post-delete path, not the fake-ID path. Upstream's 127 not-found tests are mostly fake-ID tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

esraagamal6 marked this pull request as draft May 18, 2026 11:59

This was referenced May 18, 2026

Close coverage gap vs upstream e2e suite (negative-path + search-refinement emitters) #279

Open

Categorise existing OCA test coverage #275

Open

esraagamal6 self-assigned this May 19, 2026

esraagamal6 and others added 6 commits May 19, 2026 12:48

docs(coverage-analysis): drop historical phrasing from upstream snaps…

287f85b

…hot note Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

esraagamal6 force-pushed the chore/coverage-analysis-275 branch from 0ef221d to 381aca8 May 19, 2026 10:55

esraagamal6 requested a review from Copilot May 19, 2026 11:22

Copilot started reviewing on behalf of esraagamal6 May 19, 2026 11:23