chore: add coverage analysis for generated tests (#275) #278
esraagamal6 wants to merge 15 commits into
Adds coverage-analysis/, which categorises the tests emitted under generated/camunda-oca/playwright/ and produces a matrix in the same shape as upstream's c8-orchestration-cluster-e2e-test-suite/coverage-analysis, so the two suites can be diffed directly.
Outputs (regenerate with `python3 coverage-analysis/build_coverage.py`):
- tests.csv: per-test labels (file, line, entity, category, operation, form_step, prerequisite, variants, test_name) across 518 declarations.
- coverage_matrix.csv / .md: entity x operation grid with variant counts.
- gaps.md: heuristic gap report (missing 401/403/400/404/409 coverage, missing observe-after-delete, search ops without pagination/filter).
- category_breakdown.md: per-category (A-O upstream buckets + P for agent-instance) with Form, prerequisites, observation channel split, form-step counts, variants, and per-test rows with file:line.
Answers the questions in #275 for the generator. The findings: the ~483-test gap vs upstream is concentrated in negative-path tests (575 missing across 400/401/403/404/409) and search refinement (138 missing across pagination-sort/filter); the generator already exceeds upstream on input-shape variants (data-driven +290) and observe-absence (+24).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makes it explicit in the README that this directory is not part of the product surface: it exists to assess what the generator emits during implementation, and can be deleted once the generator is delivered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream regenerated coverage_matrix.csv so the total column now equals unique-test count (previously double-counted multi-labeled tests). The overall 483-test gap is unchanged, but the per-bucket numbers shift:
- bad-request: 232 -> 195
- unauthorized: 163 -> 165
- not-found: 123 -> 127
- forbidden: 28 -> 29
- conflict: 29 -> 31
- pagination-sort + filter: 138 (unchanged)
Also distinguish "label occurrences" (a test with two negative labels counts twice) from "unique tests with any negative label" (543), which is the more useful number for emitter planning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
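The difference between "label occurrences" and "unique tests with any negative label" can be illustrated with a minimal sketch. The rows, test IDs, and the `negative_counts` helper are hypothetical and are not taken from build_coverage.py; only the pipe-separated label format and the five negative bucket names come from this PR.

```python
from collections import Counter

# Hypothetical rows: (test_id, pipe-separated variant labels), mimicking tests.csv.
ROWS = [
    ("t1", "bad-request"),
    ("t2", "bad-request|not-found"),
    ("t3", "happy-path"),
    ("t4", "unauthorized|forbidden"),
]

NEGATIVE = {"bad-request", "unauthorized", "forbidden", "not-found", "conflict"}


def negative_counts(rows):
    """Return (per-label occurrence counts, number of unique negative tests)."""
    occurrences = Counter()
    unique_tests = set()
    for test_id, variants in rows:
        labels = set(variants.split("|")) & NEGATIVE
        occurrences.update(labels)      # a two-label test counts once per label
        if labels:
            unique_tests.add(test_id)   # but only once as a unique test
    return occurrences, len(unique_tests)


occurrences, unique_count = negative_counts(ROWS)
print(sum(occurrences.values()), unique_count)
```

With these sample rows the label-occurrence sum is larger than the unique-test count, which is the same effect that separates the per-bucket sum from the 543 figure above.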
…hot note
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous build_coverage.py only scanned playwright/*.feature.spec.ts and *.variant.spec.ts (518 tests). It missed two other generator outputs:
- request-validation/*.spec.ts (1037 bad-request tests across 17 violation kinds: additional-prop, constraint-violation, enum-violation, missing-body, missing-required, oneof-ambiguous/cross-bleed/none-match, param-missing, type-mismatch, union, unique-items-violation, etc.)
- playwright/edges/*.lifecycle.spec.ts (12 edge lifecycle tests, each exercising establish -> observe present -> revoke -> observe absent)
Corrected totals: generator emits 1567 unique tests (was misreported as 518) across 4 sources (feature 227, variant 291, lifecycle 12, request-validation 1037). The generator emits 566 more tests than upstream's 1001, not 483 fewer. The real gap is in 401/403/404/409 + pagination/filter (~490 tests upstream has that the generator does not).
Adds 'source' column to tests.csv so each test row identifies which emitter produced it. Adds 'lifecycle' operation kind and 'negative-*' form steps to the form-step ordering.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
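One way the new 'source' column could be derived from a spec path is sketched below. This is an assumption about the approach, not the code in build_coverage.py: the `SOURCE_PATTERNS` table, the `source_of` helper, and the example file names are hypothetical; only the four path shapes and source names come from the commit message above.

```python
import re

# Ordered (pattern, source) pairs built from the four path shapes named above.
SOURCE_PATTERNS = [
    (re.compile(r"playwright/edges/[^/]+\.lifecycle\.spec\.ts$"), "lifecycle"),
    (re.compile(r"playwright/[^/]+\.feature\.spec\.ts$"), "feature"),
    (re.compile(r"playwright/[^/]+\.variant\.spec\.ts$"), "variant"),
    (re.compile(r"request-validation/[^/]+\.spec\.ts$"), "request-validation"),
]


def source_of(path):
    """Return the emitter name for a spec path, or None if no pattern matches."""
    for pattern, source in SOURCE_PATTERNS:
        if pattern.search(path):
            return source
    return None


# Hypothetical file name, used only to exercise the helper.
print(source_of("generated/camunda-oca/playwright/createUser.feature.spec.ts"))
```

Returning None for unmatched paths makes a missed emitter visible in the inventory instead of silently dropping its tests, which is the failure mode this commit corrects.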
Re-ran the full generator pipeline (npm run pipeline + generate:request-validation)
against latest main, then re-ran build_coverage.py.
Updated totals: 1607 unique tests (1567 -> 1607, delta +40 from running against
current generator code vs. the May 18 snapshot). Source breakdown: feature 231,
variant 293, lifecycle 12, request-validation 1071.
Also mapped /forms/{formKey} (new endpoint) into the user-task entity so the
two getFormByKey tests land in F. User-Task Lifecycle instead of Z.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0ef221d to 381aca8
Pull request overview
Adds a coverage-analysis/ scaffold to summarize and compare generated Camunda OCA test coverage against the upstream suite for issue #275.
Changes:
- Adds a Python coverage builder that scans generated Playwright and request-validation specs.
- Commits generated CSV/Markdown coverage artifacts, including matrix, gaps, and category breakdown.
- Documents classification rules and regeneration workflow.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| coverage-analysis/build_coverage.py | Generates coverage rows, matrix, gaps, and category breakdown from generated specs. |
| coverage-analysis/README.md | Explains the coverage analysis artifacts, comparison, and regeneration command. |
| coverage-analysis/tests.csv | Per-test declaration coverage inventory. |
| coverage-analysis/coverage_matrix.csv | Machine-readable entity × operation × variant matrix. |
| coverage-analysis/coverage_matrix.md | Markdown rendering of the coverage matrix. |
| coverage-analysis/gaps.md | Heuristic gap report by coverage category. |
| coverage-analysis/category_breakdown.md | Per-category and per-entity detailed test listing. |
- matrix total now equals unique-test count per (entity, operation), not the sum of variant labels (matches upstream camunda/camunda#53387 semantics where multi-label tests count once toward total)
- category_breakdown variant counts now split multi-label variants ('happy-path|observe-absence') so the counts reconcile with matrix columns and per-test rows
- README regenerate section now lists the full setup (npm install, npm run pipeline, npm run generate:request-validation) since spec/ and generated/ are gitignored
- README limitations no longer claim 'zero error-path tests': the generator emits 1000+ bad-request tests via request-validation; only 401/403/404/409 + pagination/filter are zero
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- prerequisite_of now converts CamelCase entity names to kebab-case via a small helper, so MappingRule -> mapping-rule (was emitting 'mappingrule') in membership prerequisite strings; this keeps slugs consistent across tests.csv and category_breakdown.md
- category_breakdown.md table cells now escape multi-label variants: 'happy-path|observe-absence' is written as 'happy-path, observe-absence' so the literal | doesn't get parsed as a markdown column separator (lifecycle rows were rendering with an extra column)
- README comparison table refreshed to match the regenerated artifacts: 1607 -> 1617 unique tests, +606 -> +616 vs upstream (was stale relative to coverage_matrix.csv after the entities/ scanner landed)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
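The two string fixes in this commit can be sketched as small helpers. The names `to_kebab` and `md_cell` are hypothetical and the regex is one possible way to do the conversion; the actual helper in build_coverage.py may differ.

```python
import re


def to_kebab(name):
    """Convert a CamelCase entity name to kebab-case, e.g. MappingRule -> mapping-rule."""
    # Insert a hyphen at each lowercase/digit -> uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name).lower()


def md_cell(variants):
    """Render pipe-separated labels so '|' is not read as a markdown column separator."""
    return variants.replace("|", ", ")


print(to_kebab("MappingRule"))
print(md_cell("happy-path|observe-absence"))
```

Lowercasing alone would give 'mappingrule', the slug the commit describes as the previous behaviour; the boundary substitution is what restores the hyphen.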
133e1fb to 04f2140
The classifier previously only inspected test names. Variant emitter tests
exercise pagination (page.after cursor) and filter (filter: { ... } body)
on many search and batch-operation specs, but their names are generic
(variant-N - X - path #1) so they were bucketed as data-driven/unlabeled
instead of pagination-sort / filter.
Added body-shape detection: if a test() block contains 'page: {' or
'sort: [' in the request body, add 'pagination-sort' to its variants; if
it contains 'filter: {', add 'filter'. Matches the field-assignment
form so response-access expressions (json?.page?.startCursor) don't
false-positive.
Effect on the matrix:
- pagination-sort: 0 -> 85 (upstream 53)
- filter: 0 -> 196 (upstream 85)
- unlabeled: 12 -> 1
README updated to flag the semantic distinction: these counts are
request-shape coverage, not behaviour coverage. Upstream's hand-written
pagination/filter tests assert results; generator's only assert status
code + response schema.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
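The body-shape detection described in the commit above can be sketched as plain substring checks. The marker strings and label names come from the commit message; the `BODY_SHAPE_MARKERS` table, the `body_shape_labels` helper, and the sample TypeScript snippets are hypothetical.

```python
# Label -> substrings whose presence in a test() block implies that label.
BODY_SHAPE_MARKERS = (
    ("pagination-sort", ("page: {", "sort: [")),
    ("filter", ("filter: {",)),
)


def body_shape_labels(test_block):
    """Return the body-shape labels implied by the request body in a test() block."""
    return [
        label
        for label, markers in BODY_SHAPE_MARKERS
        if any(marker in test_block for marker in markers)
    ]


# Hypothetical request: field assignments, so both labels are detected.
request = "await request.post(url, { data: { page: { after: cursor }, filter: { state: 'ACTIVE' } } });"
# Hypothetical response access: no field assignment, so nothing is detected.
response_access = "const cursor = json?.page?.startCursor;"

print(body_shape_labels(request))
print(body_shape_labels(response_access))
```

Matching on the assignment form (`page: {`) rather than the bare word `page` is what keeps response-access expressions from being counted.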
- build_coverage.py: move `reset` from delete-regex to update-regex so resetClock (POST /clock/reset, admin state reset) classifies as update, not delete. Eliminates the false positive "clock — has create+delete but no observe-absence" in gaps.md.
- build_coverage.py: docstring "Test sources scanned" list now includes generated/camunda-oca/playwright/entities/*.lifecycle.spec.ts (the 10 EntityLifecycle rows the script already scans).
- README.md: "three locations" → "five locations" to match the five-row source table below it.
- lifecycle_disjoint.md: use the entity-lifecycle terminology (create → present → update → present → delete → absent) for the generator's EntityLifecycle tests, not the edge-lifecycle (establish/revoke) terms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
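The effect of moving `reset` between the two regexes can be seen in a minimal sketch. The verb lists below are hypothetical, as are `operation_kind` and the `deleteUser` example; only the placement of `reset` under update and the `resetClock` operation are taken from the commit message.

```python
import re

# Hypothetical verb lists; `reset` sits with the update verbs, as the commit describes.
UPDATE_RE = re.compile(r"^(update|modify|assign|reset)")
DELETE_RE = re.compile(r"^(delete|remove|unassign)")


def operation_kind(operation_id):
    """Classify an operationId by its leading verb."""
    if DELETE_RE.match(operation_id):
        return "delete"
    if UPDATE_RE.match(operation_id):
        return "update"
    return "other"


print(operation_kind("resetClock"))
print(operation_kind("deleteUser"))
```

Because the gap report looks for entities with create and delete but no observe-absence, a single misclassified verb is enough to produce a spurious entry such as the clock one.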
…able
It is manually maintained (not regenerated by build_coverage.py), so the row is flagged accordingly and readers don't expect the script to keep it in sync.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`npm run pipeline` already chains `fetch-spec`, `testsuite:generate`, and `generate:request-validation`. Calling `generate:request-validation` a second time runs the request-validation emitter twice unnecessarily.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- coverage_matrix.md no longer claims variants are 'first-match labels from test names'. New blurb names the three label sources (name suffix, body shape, fixed emitter labels) and notes that matrix columns are not mutually exclusive: a multi-label test counts in both columns but only once in 'total'.
- gaps.md 'Search ops with no pagination/sort or filter coverage' section now emits '- _(none)_' when no entries match, matching the 'delete-then-observe-absence' section. Previously the empty list was ambiguous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
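The explicit empty-section marker can be sketched as follows. `render_section` and its title argument are hypothetical; only the '- _(none)_' placeholder comes from the commit message.

```python
def render_section(title, entries):
    """Render a gap-report section, emitting an explicit marker when it is empty."""
    lines = [title, ""]
    if entries:
        lines.extend(f"- {entry}" for entry in entries)
    else:
        # An explicit placeholder distinguishes "checked, nothing found" from "not rendered".
        lines.append("- _(none)_")
    return "\n".join(lines)


print(render_section("Search ops with no pagination/sort or filter coverage", []))
```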
`unlabeled` describes the NAME-classification ('no info derivable from
the test name'). Body-shape detection (pagination/filter) is a separate
axis, so a dynamic 'variant-N - scenario' test with a filter body is
both name-unlabeled AND body-filter — it should carry both labels.
Previously the augment logic dropped 'unlabeled' when extras were added,
which under-reported dynamic scenarios in the inventory.
After: 7 rows carry `unlabeled|filter` (was 1 alone) — these are
searchAuditLogs / searchAuditLogs.variant scenarios that were being
mis-counted only under `filter`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
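The corrected augment behaviour described in the commit above amounts to a merge that never drops the name-derived label. A minimal sketch, assuming labels are kept as ordered lists and joined with `|` as in tests.csv; the function name `augment` is hypothetical.

```python
def augment(name_labels, body_labels):
    """Merge name-derived and body-derived labels, keeping 'unlabeled' if present."""
    merged = list(name_labels)
    for label in body_labels:
        if label not in merged:
            merged.append(label)
    return "|".join(merged)


# A dynamic scenario with a filter body carries both labels.
print(augment(["unlabeled"], ["filter"]))
```

Treating the two label sources as independent axes means the name axis still records that nothing could be derived from the test name, even when the body shape is informative.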
…antic split
The matrix shows not-found=0 for the generator, but this is upstream's taxonomy split, not "generator never asserts 404". Spelled out:
- observe-absence = GET after DELETE (entity was created, now gone). Generator has 48 of these.
- not-found = GET against a fake/never-existing ID (never created). Generator has 0 of these; this is the actual capability gap.
Every entity-lifecycle test ends with expect(status).toBe(404), so the generator IS asserting 404 in 10 places; they're bucketed as observe-absence because they exercise the post-delete path, not the fake-ID path. Upstream's 127 not-found tests are mostly fake-ID tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status
This PR is intentionally kept open as living scaffolding and is not meant to be merged into `main`. It will be closed (and the directory deleted) once the api-test-generator is delivered. The PR diff itself serves as the durable artifact for the assessment; the branch can be checked out to re-run `coverage-analysis/build_coverage.py` against the latest generator output.
Summary
Answers the questions in #275.
Adds `coverage-analysis/`: a Python script and the artifacts it produces, categorising every generated test in the same shape as upstream's `c8-orchestration-cluster-e2e-test-suite/coverage-analysis/`, so the two suites can be diffed directly.
Test sources scanned:
- `playwright/<operationId>.feature.spec.ts`: feature emitter (happy path + base shape)
- `playwright/<operationId>.variant.spec.ts`: variant emitter (schema/input variations: `bpmn`, `oneOf …`, etc.)
- `playwright/edges/<EdgeName>.lifecycle.spec.ts`: edge lifecycle template (establish → observe present → revoke → observe absent)
- `playwright/entities/<EntityName>.lifecycle.spec.ts`: entity lifecycle template (create → present → update → present → delete → absent)
- `request-validation/<entity>-validation-api-tests.spec.ts`: request-validation emitter (negative schema cases, all bad-request)
Outputs (regenerate per the README; this requires `npm run pipeline` first, which chains fetch-spec, testsuite:generate, and generate:request-validation, because `spec/` and `generated/` are gitignored):
- `tests.csv`: one row per `test()` declaration (1617 rows) with `source, entity, category, operation, form_step, prerequisite, variants, test_name`, etc.
- `coverage_matrix.csv` / `.md`: `entity × operation` grid; `total` = unique-test count per cell, variant columns = label-occurrences (matches upstream semantics, so multi-label tests count once toward total).
- `gaps.md`: heuristic gap report.
- `category_breakdown.md`: per-category (A–O upstream buckets + P for `agent-instance`) with Form, prerequisites, observation channel split, form-step counts, variants, and per-test rows with `file:line`.
- `lifecycle_disjoint.md`: manually-maintained disjoint of the 10 EntityLifecycle tests vs upstream's matching tests (answers Josh's question on #279, "Close coverage gap vs upstream e2e suite (negative-path + search-refinement emitters)").
- `README.md`: explains the files, the classification rules, and how to regenerate.
Findings
Upstream snapshot: camunda/camunda#53387 (head `7cf8bc1`).
The generator emits more tests than upstream, dominated by the `request-validation` emitter (1071 bad-request tests across 17 violation kinds). The variant emitter exercises pagination (`page.after` cursor) and filter (`filter: { ... }`) request shapes on many search and batch-operation specs, so those columns are non-zero; but these tests only assert status 200 + response schema, not pagination/filter correctness. Upstream's pagination/filter tests are behaviour assertions; the generator's are request-shape assertions. The buckets where the generator emits zero (401, 403, 404, 409) total ~352 missing tests.
Follow-up emitter plan tracked in #279; methodology / coverage-analyzer discussion in #277. Verified independently against upstream's source files (not just their published `tests.csv`); discovered one classification bug in upstream's `build_coverage.py` and filed it as a comment on camunda/camunda#53387.
Test plan
- Run `python3 coverage-analysis/build_coverage.py` and confirm it writes the 5 generated artifacts without errors.
- Check `tests.csv` against the corresponding spec file.
- Check `category_breakdown.md` against #275's request ("Categorise existing OCA test coverage") for "categorisation + form + variants + counts + which tests".
- Confirm the `coverage_matrix.csv` `total` column equals unique rows per (entity, operation) in `tests.csv`.
🤖 Generated with Claude Code
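The last test-plan check (matrix `total` equals unique rows per (entity, operation)) could be automated along these lines. This is a sketch under assumptions: the inline CSV samples, the `mismatched_cells` helper, and the use of (file, line) as the uniqueness key are hypothetical, and the real coverage_matrix.csv has more columns than shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical, trimmed-down samples of the two artifacts.
TESTS_CSV = """file,line,entity,operation,variants
a.spec.ts,10,user,create,happy-path
a.spec.ts,20,user,create,bad-request|data-driven
b.spec.ts,5,user,search,filter
"""
MATRIX_CSV = """entity,operation,total
user,create,2
user,search,1
"""


def mismatched_cells(tests_csv, matrix_csv):
    """Return (entity, operation) cells whose matrix total differs from the unique-test count."""
    unique = Counter()
    seen = set()
    for row in csv.DictReader(io.StringIO(tests_csv)):
        key = (row["file"], row["line"])  # assumed identity of one test() declaration
        if key not in seen:
            seen.add(key)
            unique[(row["entity"], row["operation"])] += 1
    bad = []
    for row in csv.DictReader(io.StringIO(matrix_csv)):
        cell = (row["entity"], row["operation"])
        if int(row["total"]) != unique[cell]:
            bad.append(cell)
    return bad


print(mismatched_cells(TESTS_CSV, MATRIX_CSV))
```

An empty result means every cell reconciles; the multi-label row in the sample counts once toward its cell, in line with the upstream semantics described in the Summary.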