chore: add coverage analysis for generated tests (#275) #278

Draft
esraagamal6 wants to merge 15 commits into main from chore/coverage-analysis-275

Contributor

esraagamal6 commented May 18, 2026

Status

This PR is intentionally kept open as living scaffolding — not meant to be merged into main. It will be closed (and the directory deleted) once the api-test-generator is delivered. The PR diff itself serves as the durable artifact for the assessment; the branch can be checked out to re-run coverage-analysis/build_coverage.py against the latest generator output.

Summary

Answers the questions in #275.

Adds coverage-analysis/ — a Python script and the artifacts it produces, categorising every generated test in the same shape as upstream's c8-orchestration-cluster-e2e-test-suite/coverage-analysis/, so the two suites can be diffed directly.

Test sources scanned:

  • playwright/<operationId>.feature.spec.ts — feature emitter (happy path + base shape)
  • playwright/<operationId>.variant.spec.ts — variant emitter (schema/input variations: bpmn, oneOf …, etc.)
  • playwright/edges/<EdgeName>.lifecycle.spec.ts — edge lifecycle template (establish → observe present → revoke → observe absent)
  • playwright/entities/<EntityName>.lifecycle.spec.ts — entity lifecycle template (create → present → update → present → delete → absent)
  • request-validation/<entity>-validation-api-tests.spec.ts — request-validation emitter (negative schema cases, all bad-request)
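The five source locations above could be mapped to emitter labels with plain glob matching. This is a minimal sketch, not the logic in build_coverage.py: the name `source_of`, the label strings, and the ordering are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, label) pairs. fnmatch's "*" also matches "/", so the
# subdirectory patterns are listed first to win over the top-level ones.
SOURCE_PATTERNS = [
    ("playwright/edges/*.lifecycle.spec.ts", "edge-lifecycle"),
    ("playwright/entities/*.lifecycle.spec.ts", "entity-lifecycle"),
    ("playwright/*.feature.spec.ts", "feature"),
    ("playwright/*.variant.spec.ts", "variant"),
    ("request-validation/*-validation-api-tests.spec.ts", "request-validation"),
]


def source_of(relative_path: str) -> str:
    """Return the (hypothetical) emitter label for a spec path, or 'unknown'."""
    for pattern, label in SOURCE_PATTERNS:
        if fnmatchcase(relative_path, pattern):
            return label
    return "unknown"


print(source_of("playwright/edges/GroupUser.lifecycle.spec.ts"))
```

Paths are assumed to be relative to the generated suite root and to use forward slashes.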

Outputs (regenerate per the README; this requires running npm run pipeline first, which chains fetch-spec, testsuite:generate, and generate:request-validation, because spec/ and generated/ are gitignored):

  • tests.csv — one row per test() declaration (1617 rows) with source, entity, category, operation, form_step, prerequisite, variants, test_name, etc.
  • coverage_matrix.csv / .md: entity × operation grid; total = unique-test count per cell, variant columns = label-occurrences (matches upstream semantics, so multi-label tests count once toward total).
  • gaps.md — heuristic gap report.
  • category_breakdown.md — per-category (A–O upstream buckets + P for agent-instance) with Form, prerequisites, observation channel split, form-step counts, variants, and per-test rows with file:line.
  • lifecycle_disjoint.md — manually-maintained disjoint of the 10 EntityLifecycle tests vs upstream's matching tests (answers Josh's question on Close coverage gap vs upstream e2e suite (negative-path + search-refinement emitters) #279).
  • README.md — explains the files, the classification rules, and how to regenerate.
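The difference between the matrix `total` and the variant columns can be shown on a few made-up rows. This is a sketch under assumptions: the rows and the `build_matrix` helper are hypothetical, and only the pipe-separated variant format is taken from this PR.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (entity, operation, pipe-separated variants).
rows = [
    ("user", "delete", "happy-path|observe-absence"),
    ("user", "delete", "bad-request"),
    ("user", "search", "filter"),
]


def build_matrix(rows):
    """Return {(entity, operation): Counter} with 'total' plus one key per label."""
    matrix = defaultdict(Counter)
    for entity, operation, variants in rows:
        cell = matrix[(entity, operation)]
        cell["total"] += 1  # a test counts once toward total
        for label in variants.split("|"):
            cell[label] += 1  # and once per label it carries
    return matrix


cell = build_matrix(rows)[("user", "delete")]
print(cell["total"], cell["happy-path"], cell["observe-absence"], cell["bad-request"])
# prints: 2 1 1 1
```

The two delete tests carry three labels between them, so the variant columns sum to 3 while `total` stays at 2.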

Findings

Upstream snapshot: camunda/camunda#53387 (head 7cf8bc1).

| | upstream | generator | diff |
|---|---|---|---|
| Unique tests | 1001 | 1617 | +616 |
| bad-request (400) | 195 | 1071 | +876 |
| happy-path (occurrences) | 173 | 211 | +38 |
| pagination-sort (occurrences) | 53 | 85 | +32 |
| filter (occurrences) | 85 | 196 | +111 |
| observe-absence | 2 | 48 | +46 |
| data-driven / oneOf variants | 5 | 302 | +297 |
| unauthorized (401) | 165 | 0 | -165 |
| not-found (404) | 127 | 0 | -127 |
| conflict (409) | 31 | 0 | -31 |
| forbidden (403) | 29 | 0 | -29 |

The generator emits more tests than upstream, dominated by the request-validation emitter (1071 bad-request tests across 17 violation kinds). The variant emitter exercises pagination (page.after cursor) and filter (filter: { ... }) request shapes on many search and batch-operation specs, so those columns are non-zero; but these tests only assert status 200 + response schema, not pagination/filter correctness. Upstream's pagination/filter tests are behaviour assertions; the generator's are request-shape assertions. The buckets where the generator emits zero — 401, 403, 404, 409 — total ~352 missing tests.
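The diff column and the ~352 figure can be re-derived from the reported counts. The snippet below only repeats the arithmetic on the numbers listed in Findings; the dictionary layout is an illustrative choice.

```python
# (upstream, generator) counts copied from the Findings numbers.
counts = {
    "unique tests": (1001, 1617),
    "bad-request (400)": (195, 1071),
    "happy-path": (173, 211),
    "pagination-sort": (53, 85),
    "filter": (85, 196),
    "observe-absence": (2, 48),
    "data-driven / oneOf variants": (5, 302),
    "unauthorized (401)": (165, 0),
    "not-found (404)": (127, 0),
    "conflict (409)": (31, 0),
    "forbidden (403)": (29, 0),
}

# diff = generator - upstream for every row.
diffs = {label: gen - up for label, (up, gen) in counts.items()}

# Upstream counts in the buckets where the generator emits nothing.
missing = sum(up for up, gen in counts.values() if gen == 0)

print(diffs["unique tests"], diffs["bad-request (400)"], missing)
# prints: 616 876 352
```

Since the variant columns count label occurrences, 352 may overstate the number of distinct upstream tests if some of them carry more than one of these labels.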

Note on the not-found count. The matrix not-found: 0 reflects upstream's
semantic taxonomy, not "generator never asserts 404". Upstream splits 404 into
observe-absence (GET after DELETE — entity was created, now gone) and
not-found (GET against a fake/never-existing ID — entity was never created).
The generator's 10 entity-lifecycle tests + 12 edge-lifecycle tests + 26 feature/variant
negative empty tests each end with expect(status).toBe(404) — these are real 404
assertions but bucketed as observe-absence (48 occurrences). The capability gap is
specifically the fake-ID pattern; that's what upstream's 127 not-found tests cover.
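As a sketch of the taxonomy split described in this note, the bucketing rule reduces to one question: was the entity ever created in the test? The function name `bucket_for_404` and its boolean input are assumptions for illustration, not the classifier used by build_coverage.py.

```python
def bucket_for_404(entity_was_created: bool) -> str:
    """Bucket a test that ends in a 404 assertion, following upstream's split."""
    # Created earlier in the test and since removed: the 404 observes absence.
    # Never created (fake ID): the 404 is a plain not-found.
    return "observe-absence" if entity_was_created else "not-found"


# Generator tests ending in expect(status).toBe(404), per the note above.
observe_absence_sources = {
    "entity-lifecycle": 10,
    "edge-lifecycle": 12,
    "feature/variant negative empty": 26,
}

print(bucket_for_404(True), sum(observe_absence_sources.values()))
# prints: observe-absence 48
```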

Follow-up emitter plan tracked in #279; methodology / coverage-analyzer discussion in #277. Verified independently against upstream's source files (not just their published tests.csv); discovered one classification bug in upstream's build_coverage.py and filed it as a comment on camunda/camunda#53387.

Test plan

  • Re-run python3 coverage-analysis/build_coverage.py and confirm it writes the 5 generated artifacts without errors.
  • Spot-check a row in tests.csv against the corresponding spec file.
  • Compare the TOC of category_breakdown.md against the request in #275 (Categorise existing OCA test coverage) for "categorisation + form + variants + counts + which tests".
  • Confirm coverage_matrix.csv total column equals unique rows per (entity, operation) in tests.csv.
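The last check in the test plan could be scripted roughly as follows. This is a sketch under assumptions: it supposes coverage_matrix.csv has one row per (entity, operation) with `entity`, `operation`, and `total` columns, which this PR description does not spell out, and `total_mismatches` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def total_mismatches(tests_file, matrix_file):
    """Return {(entity, operation): (matrix_total, row_count)} for cells that disagree."""
    row_counts = Counter(
        (row["entity"], row["operation"]) for row in csv.DictReader(tests_file)
    )
    mismatches = {}
    for row in csv.DictReader(matrix_file):
        key = (row["entity"], row["operation"])
        total = int(row["total"])
        if total != row_counts[key]:
            mismatches[key] = (total, row_counts[key])
    return mismatches


# Tiny in-memory stand-ins for tests.csv and coverage_matrix.csv.
tests_csv = io.StringIO(
    "entity,operation,variants\n"
    "user,delete,happy-path|observe-absence\n"
    "user,delete,bad-request\n"
)
matrix_csv = io.StringIO("entity,operation,total\nuser,delete,2\n")
print(total_mismatches(tests_csv, matrix_csv))  # prints: {}
```

Against the real artifacts, the two `io.StringIO` objects would be replaced by files opened with `newline=""`, as the `csv` module documentation recommends.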

🤖 Generated with Claude Code

esraagamal6 and others added 6 commits May 19, 2026 12:48
Adds coverage-analysis/ which categorises the tests emitted under
generated/camunda-oca/playwright/ and produces a matrix in the same shape
as upstream's c8-orchestration-cluster-e2e-test-suite/coverage-analysis,
so the two suites can be diffed directly.

Outputs (regenerate with `python3 coverage-analysis/build_coverage.py`):
- tests.csv: per-test labels (file, line, entity, category, operation,
  form_step, prerequisite, variants, test_name) across 518 declarations.
- coverage_matrix.csv / .md: entity x operation grid with variant counts.
- gaps.md: heuristic gap report (missing 401/403/400/404/409 coverage,
  missing observe-after-delete, search ops without pagination/filter).
- category_breakdown.md: per-category (A-O upstream buckets + P for
  agent-instance) with Form, prerequisites, observation channel split,
  form-step counts, variants, and per-test rows with file:line.

Answers the questions in #275 for the generator. The findings: the
~483-test gap vs upstream is concentrated in negative-path tests (575
missing across 400/401/403/404/409) and search refinement (138 missing
across pagination-sort/filter); the generator already exceeds upstream
on input-shape variants (data-driven +290) and observe-absence (+24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makes it explicit in the README that this directory is not part of the
product surface — it exists to assess what the generator emits during
implementation, and can be deleted once the generator is delivered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream regenerated coverage_matrix.csv so the total column now equals
unique-test count (previously double-counted multi-labeled tests). The
overall 483-test gap is unchanged, but the per-bucket numbers tighten:

- bad-request: 232 -> 195
- unauthorized: 163 -> 165
- not-found: 123 -> 127
- forbidden: 28 -> 29
- conflict: 29 -> 31
- pagination-sort + filter: 138 (unchanged)

Also distinguish "label occurrences" (a test with two negative labels
counts twice) from "unique tests with any negative label" (543), which
is the more useful number for emitter planning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hot note

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous build_coverage.py only scanned playwright/*.feature.spec.ts
and *.variant.spec.ts (518 tests). It missed two other generator outputs:

- request-validation/*.spec.ts (1037 bad-request tests across 17 violation
  kinds: additional-prop, constraint-violation, enum-violation, missing-body,
  missing-required, oneof-ambiguous/cross-bleed/none-match, param-missing,
  type-mismatch, union, unique-items-violation, etc.)
- playwright/edges/*.lifecycle.spec.ts (12 edge lifecycle tests, each
  exercising establish -> observe present -> revoke -> observe absent)

Corrected totals: generator emits 1567 unique tests (was misreported as 518)
across 4 sources (feature 227, variant 291, lifecycle 12, request-validation
1037). The generator emits 566 more tests than upstream's 1001, not 483
fewer. The real gap is in 401/403/404/409 + pagination/filter (~490 tests
upstream has that the generator does not).

Adds 'source' column to tests.csv so each test row identifies which emitter
produced it. Adds 'lifecycle' operation kind and 'negative-*' form steps to
the form-step ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-ran the full generator pipeline (npm run pipeline + generate:request-validation)
against latest main, then re-ran build_coverage.py.

Updated totals: 1607 unique tests (1567 -> 1607, delta +40 from running against
current generator code vs. the May 18 snapshot). Source breakdown: feature 231,
variant 293, lifecycle 12, request-validation 1071.

Also mapped /forms/{formKey} (new endpoint) into the user-task entity so the
two getFormByKey tests land in F. User-Task Lifecycle instead of Z.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a coverage-analysis/ scaffold to summarize and compare generated Camunda OCA test coverage against the upstream suite for issue #275.

Changes:

  • Adds a Python coverage builder that scans generated Playwright and request-validation specs.
  • Commits generated CSV/Markdown coverage artifacts, including matrix, gaps, and category breakdown.
  • Documents classification rules and regeneration workflow.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| coverage-analysis/build_coverage.py | Generates coverage rows, matrix, gaps, and category breakdown from generated specs. |
| coverage-analysis/README.md | Explains the coverage analysis artifacts, comparison, and regeneration command. |
| coverage-analysis/tests.csv | Per-test declaration coverage inventory. |
| coverage-analysis/coverage_matrix.csv | Machine-readable entity × operation × variant matrix. |
| coverage-analysis/coverage_matrix.md | Markdown rendering of the coverage matrix. |
| coverage-analysis/gaps.md | Heuristic gap report by coverage category. |
| coverage-analysis/category_breakdown.md | Per-category and per-entity detailed test listing. |

Comment thread coverage-analysis/build_coverage.py
Comment thread coverage-analysis/build_coverage.py
Comment thread coverage-analysis/README.md Outdated
Comment thread coverage-analysis/README.md Outdated
Comment thread coverage-analysis/README.md Outdated
- matrix total now equals unique-test count per (entity, operation), not
  the sum of variant labels (matches upstream camunda/camunda#53387
  semantics where multi-label tests count once toward total)
- category_breakdown variant counts now split multi-label variants
  ('happy-path|observe-absence') so the counts reconcile with matrix
  columns and per-test rows
- README regenerate section now lists the full setup (npm install,
  npm run pipeline, npm run generate:request-validation) since spec/
  and generated/ are gitignored
- README limitations no longer claim 'zero error-path tests' — the
  generator emits 1000+ bad-request tests via request-validation; only
  401/403/404/409 + pagination/filter are zero

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@esraagamal6
Contributor Author

@copilot please re-review — addressed all 5 comments in commit 9d4e305 (see individual replies).

Contributor

Copilot AI commented May 19, 2026

@copilot please re-review — addressed all 5 comments in commit 9d4e305 (see individual replies).

Re-reviewed commit 9d4e305 and confirmed the 5 addressed items are in place: matrix total uses unique-test counts, multi-label variants are split for aggregation, README regenerate steps include spec/generated setup, and the 400-coverage wording and PR summary were updated accordingly. No additional code changes needed from my side.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.

Comment thread coverage-analysis/README.md Outdated
Comment thread coverage-analysis/build_coverage.py Outdated
Comment thread coverage-analysis/build_coverage.py Outdated
esraagamal6 added a commit that referenced this pull request May 19, 2026
- prerequisite_of now converts CamelCase entity names to kebab-case via a
  small helper, so MappingRule -> mapping-rule (was emitting 'mappingrule')
  in membership prerequisite strings — keeps slugs consistent across
  tests.csv and category_breakdown.md
- category_breakdown.md table cells now escape multi-label variants:
  'happy-path|observe-absence' is written as 'happy-path, observe-absence'
  so the literal | doesn't get parsed as a markdown column separator
  (lifecycle rows were rendering with an extra column)
- README comparison table refreshed to match the regenerated artifacts:
  1607 -> 1617 unique tests, +606 -> +616 vs upstream (was stale relative
  to coverage_matrix.csv after the entities/ scanner landed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
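The two formatting fixes in this commit message (kebab-case slugs and pipe-free table cells) can be sketched as below. The helper names `camel_to_kebab` and `variants_cell` are hypothetical; the commit only says "a small helper" and does not show its implementation.

```python
import re


def camel_to_kebab(name: str) -> str:
    """Convert a CamelCase entity name to a kebab-case slug."""
    # Insert "-" at each lowercase/digit -> uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name).lower()


def variants_cell(variants: str) -> str:
    """Render pipe-separated variants without a literal '|' for a Markdown cell."""
    return ", ".join(variants.split("|"))


print(camel_to_kebab("MappingRule"))                # prints: mapping-rule
print(variants_cell("happy-path|observe-absence"))  # prints: happy-path, observe-absence
```

This simple boundary rule does not split runs of capitals (an acronym followed by a word would stay joined), which may or may not matter for the entity names involved.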
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 4 comments.

Comment thread coverage-analysis/build_coverage.py Outdated
Comment thread coverage-analysis/build_coverage.py Outdated
Comment thread coverage-analysis/README.md Outdated
Comment thread coverage-analysis/lifecycle_disjoint.md Outdated
esraagamal6 and others added 2 commits May 19, 2026 13:54
The classifier previously only inspected test names. Variant emitter tests
exercise pagination (page.after cursor) and filter (filter: { ... } body)
on many search and batch-operation specs, but their names are generic
(variant-N - X - path #1) so they were bucketed as data-driven/unlabeled
instead of pagination-sort / filter.

Added body-shape detection: if a test() block contains 'page: {' or
'sort: [' in the request body, add 'pagination-sort' to its variants; if
it contains 'filter: {', add 'filter'. Matches the field-assignment
form so response-access expressions (json?.page?.startCursor) don't
false-positive.
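A minimal sketch of the body-shape detection described above, assuming the test body is available as a string. The regex and function names are hypothetical; only the matched shapes ('page: {', 'sort: [', 'filter: {') and the resulting labels come from this commit message.

```python
import re

# Match the field-assignment form only: a key, a colon, then "{" or "[".
PAGINATION_RE = re.compile(r"\bpage:\s*\{|\bsort:\s*\[")
FILTER_RE = re.compile(r"\bfilter:\s*\{")


def body_shape_labels(test_body: str) -> list[str]:
    """Return the extra variant labels implied by the request body shape."""
    labels = []
    if PAGINATION_RE.search(test_body):
        labels.append("pagination-sort")
    if FILTER_RE.search(test_body):
        labels.append("filter")
    return labels


print(body_shape_labels("data: { page: { after: cursor } }"))
# prints: ['pagination-sort']
print(body_shape_labels("expect(json?.page?.startCursor).toBeDefined()"))
# prints: []
```

The second call shows why the colon is part of the pattern: a response-access expression such as `json?.page?.startCursor` contains `page` but never `page: {`.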

Effect on the matrix:
- pagination-sort: 0 -> 85 (upstream 53)
- filter:           0 -> 196 (upstream 85)
- unlabeled:       12 -> 1

README updated to flag the semantic distinction: these counts are
request-shape coverage, not behaviour coverage. Upstream's hand-written
pagination/filter tests assert results; generator's only assert status
code + response schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- build_coverage.py: move `reset` from delete-regex to update-regex so
  resetClock (POST /clock/reset, admin state reset) classifies as
  update, not delete. Eliminates the false positive "clock — has
  create+delete but no observe-absence" in gaps.md.
- build_coverage.py: docstring "Test sources scanned" list now includes
  generated/camunda-oca/playwright/entities/*.lifecycle.spec.ts (the
  10 EntityLifecycle rows the script already scans).
- README.md: "three locations" → "five locations" to match the
  five-row source table below it.
- lifecycle_disjoint.md: use the entity-lifecycle terminology
  (create → present → update → present → delete → absent) for the
  generator's EntityLifecycle tests, not the edge-lifecycle
  (establish/revoke) terms.
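The `reset` reclassification above can be illustrated with a small verb-prefix classifier. This is a sketch under assumptions: the pattern names, the verb lists, and `operation_kind` are hypothetical and are not taken from build_coverage.py.

```python
import re

# Hypothetical verb patterns; "reset" sits with the update verbs, not delete.
UPDATE_RE = re.compile(r"^(update|reset)", re.IGNORECASE)
DELETE_RE = re.compile(r"^delete", re.IGNORECASE)


def operation_kind(operation_id: str) -> str:
    """Classify an operationId by its leading verb."""
    if DELETE_RE.match(operation_id):
        return "delete"
    if UPDATE_RE.match(operation_id):
        return "update"
    return "other"


print(operation_kind("resetClock"))  # prints: update
```

With `reset` under the update pattern, an entity such as clock no longer appears to have a delete operation, so a gap heuristic of the form "has create+delete but no observe-absence" would not fire for it.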

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

Comment thread coverage-analysis/README.md
Comment thread coverage-analysis/README.md
…able

It is manually maintained (not regenerated by build_coverage.py), so the
row is flagged accordingly so readers don't expect the script to keep it
in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.

Comment thread coverage-analysis/README.md Outdated
`npm run pipeline` already chains `fetch-spec`, `testsuite:generate`, and
`generate:request-validation`. Calling `generate:request-validation` a
second time runs the request-validation emitter twice unnecessarily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

Comment thread coverage-analysis/build_coverage.py
Comment thread coverage-analysis/build_coverage.py Outdated
- coverage_matrix.md no longer claims variants are 'first-match labels
  from test names'. New blurb names the three label sources
  (name suffix, body shape, fixed emitter labels) and notes that
  matrix columns are not mutually exclusive — a multi-label test
  counts in both columns but only once in 'total'.
- gaps.md 'Search ops with no pagination/sort or filter coverage'
  section now emits '- _(none)_' when no entries match, matching the
  'delete-then-observe-absence' section. Previously the empty list
  was ambiguous.
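The explicit empty-section marker described above might be rendered like this. It is a sketch, not the script's code: `render_section` and the `##` heading level are assumptions, while the `- _(none)_` marker is the one named in this commit message.

```python
def render_section(title: str, entries: list[str]) -> str:
    """Render one gaps.md section, marking an empty list explicitly."""
    lines = [f"## {title}", ""]
    if entries:
        lines.extend(f"- {entry}" for entry in entries)
    else:
        lines.append("- _(none)_")  # an empty section is stated, not left blank
    return "\n".join(lines) + "\n"


print(render_section("Search ops with no pagination/sort or filter coverage", []))
```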

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.

Comment thread coverage-analysis/build_coverage.py Outdated
`unlabeled` describes the NAME-classification ('no info derivable from
the test name'). Body-shape detection (pagination/filter) is a separate
axis, so a dynamic 'variant-N - scenario' test with a filter body is
both name-unlabeled AND body-filter — it should carry both labels.

Previously the augment logic dropped 'unlabeled' when extras were added,
which under-reported dynamic scenarios in the inventory.

After: 7 rows carry `unlabeled|filter` (was 1 alone) — these are
searchAuditLogs / searchAuditLogs.variant scenarios that were being
mis-counted only under `filter`.
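The corrected augment behaviour, where name-derived and body-derived labels are kept side by side, can be sketched as follows. The function `merge_labels` is hypothetical; only the resulting `unlabeled|filter` form is taken from this commit message.

```python
def merge_labels(name_labels: list[str], body_labels: list[str]) -> str:
    """Combine name-derived and body-derived labels without dropping either axis."""
    merged = list(name_labels)  # keep 'unlabeled' if the name gave no information
    for label in body_labels:
        if label not in merged:
            merged.append(label)
    return "|".join(merged)


print(merge_labels(["unlabeled"], ["filter"]))  # prints: unlabeled|filter
```

The earlier behaviour would correspond to discarding `unlabeled` whenever `body_labels` is non-empty, which is what hid the dynamic scenarios from the inventory.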

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated no new comments.

…antic split

The matrix shows not-found=0 for the generator, but this is upstream's
taxonomy split, not "generator never asserts 404". Spelled out:

- observe-absence = GET after DELETE (entity was created, now gone).
  Generator has 48 of these.
- not-found = GET against a fake/never-existing ID (never created).
  Generator has 0 of these — this is the actual capability gap.

Every entity-lifecycle test ends with expect(status).toBe(404), so the
generator IS asserting 404 in 10 places; they're bucketed as
observe-absence because they exercise the post-delete path, not the
fake-ID path. Upstream's 127 not-found tests are mostly fake-ID tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 participants