Issue #5: CI test redesign (nested lifecycle, dynamic IDs, log collector, LLM judge) #6

Merged: dogkeeper886 merged 13 commits into testlink_1_9_20_fixed from issue-5-ci-test-redesign on Apr 19, 2026

@dogkeeper886
Owner

Closes #5.

Summary

End-to-end rework of the CI test framework along the lines of the issue and cicd/TESTING_GUIDELINES.md:

  • Lifecycle. Session setup/teardown lives in cicd/scripts/run-tests.sh with trap EXIT, so containers/volumes are torn down on any exit path (pass, fail, Ctrl-C, crash). Per-test/step lifecycle is enforced in YAML.
  • Dynamic IDs. Every entity ID flows through capture: from XML-RPC responses. No numeric ID literal appears in test params (only in response-shape assertions).
  • Unique entity names. Every created entity carries a {{runId}} / {{testId}} suffix, so the suite is idempotent across re-runs.
  • Log collector. Now points at cicd/docker-compose.ci.yml explicitly with cwd=projectRoot; per-test logs land in cicd/results/<run>/<testId>.log.
  • Test flow layers. testcases/ is reorganized into smoke/ → auth/ → crud/ → workflow/. Build suite stays separate (no stack lifecycle).
  • Port collision. CI now binds ${TL_PORT:-8091}:80 (dev still uses 8090). All scripts and YAML steps reference $TL_URL. Single env override changes every consumer.
  • LLM judge redesign. Switched from Ollama /api/generate to /api/chat. Made LLM_JUDGE_URL / LLM_JUDGE_MODEL env-driven via cicd/tests/.env. Replaced rigid heuristics with a role/task/behavior/output prompt structure. Added two YAML fields the test author owns (objective, judgeContext) so per-test situational framing reaches the model. Tuned Ollama options (num_ctx 8192, num_predict 512) so small models (gemma3:4b) neither silently truncate prompts nor run away during generation.

Issue #5 was updated (with consent) to bring LLM-judge work in scope and to mark TL_PORT as a Must.

Acceptance criteria check

  • ✅ Fresh checkout runs the full suite cleanly: 11/11 simple, 11/11 LLM, exit 0 (~47s).
  • ✅ Any single suite runs in isolation (--suite build skips ci-up; others use the wrapper).
  • ✅ Re-runs are idempotent (entity names carry runId; teardown deletes by captured IDs/prefixes).
  • ✅ Per-test logs are non-empty in cicd/results/<run>/.
  • ✅ No numeric ID literals in test params (only inside <int>1</int> step-number assertions in createTestCase payloads).
  • ✅ Every testcase satisfies the §10 checklist in TESTING_GUIDELINES.md.

Test plan

  • bash cicd/scripts/run-tests.sh against a fresh checkout — should land 11/11 in ~50s with exit 0.
  • bash cicd/scripts/run-tests.sh --suite build — should run 3/3 without bringing up the compose stack.
  • Run twice in a row — second run should pass identically (idempotency).
  • Verify the dev compose can run on host port 8090 while CI runs on 8091 simultaneously.
  • cat cicd/results/<run>/TC-CRUD-001.log — should contain docker compose log lines for that test window.
  • Optional: try a different LLM by overriding LLM_JUDGE_MODEL in cicd/tests/.env.

Out of scope (followups)

  • Replacement of the YAML runner with Vitest or another standard framework.
  • Dedicated negative suite (TC-AUTH-002 is the one negative test today; lives in auth/).
  • Run the test runner itself inside a container (would mount /var/run/docker.sock).

🤖 Generated with Claude Code

dogkeeper886 and others added 13 commits April 19, 2026 10:27
…e separation)

Canonical design reference for the CI test suite: four nested lifecycle
scopes (session/suite/test/step) with guaranteed teardown, dynamic ID
capture rules, layered test flow, unique-names convention, and the
rationale for keeping docker-compose.ci.yml separate from the dev
docker-compose.yml.

Refs #5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The log collector was pointed at `<projectRoot>/docker/`, a directory
that contains no compose file. `docker compose logs` in that cwd failed
to start and the collector silently disabled itself, so per-test log
extraction never produced output.

Replace `RunConfig.dockerComposePath` (directory) with `composeFile`
(absolute path to the compose file). Invoke
`docker compose -f <composeFile> logs --follow --timestamps` with cwd
set to the project root. Default target is the CI compose at
`cicd/docker-compose.ci.yml`.
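
A minimal sketch of how the collector's invocation could be assembled after this change. Only `composeFile` and the `docker compose` arguments come from the commit message; the `projectRoot` field, the `LogCommand` shape, and the `buildLogCommand` name are assumptions made for illustration, not the repository's actual code.

```typescript
// Hypothetical config shape: only `composeFile` is named in the commit message.
interface RunConfig {
  projectRoot: string; // assumed field holding the repository root
  composeFile?: string; // absolute path to the compose file
}

interface LogCommand {
  command: string;
  args: string[];
  cwd: string;
}

// Build the `docker compose -f <composeFile> logs ...` invocation.
function buildLogCommand(config: RunConfig): LogCommand {
  // Default target is the CI compose file under the project root.
  const composeFile =
    config.composeFile ?? `${config.projectRoot}/cicd/docker-compose.ci.yml`;
  return {
    command: "docker",
    args: ["compose", "-f", composeFile, "logs", "--follow", "--timestamps"],
    cwd: config.projectRoot, // run from the project root, not from docker/
  };
}
```

Passing the file explicitly with `-f` means the command no longer depends on a compose file being discoverable from the working directory, which is the failure mode described above.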

Refs #5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Session setup (ci-up.sh) and teardown (ci-down.sh) used to live inside
test-case steps — TC-INTEGRATION-001 started the stack, TC-E2E-001's
last step tore it down. That made teardown conditional on the e2e test
reaching its final step: any earlier failure left containers and
volumes behind. Running only one suite was inconsistent too (build
skipped lifecycle entirely, integration started but never stopped).

Introduce cicd/scripts/run-tests.sh as the single entry point. It:
- Runs ci-up.sh before tests (except for --suite build, which only
  exercises the image artifact).
- Traps EXIT/INT/TERM to invoke ci-down.sh regardless of outcome —
  pass, fail, Ctrl-C, or crash all produce a clean teardown.
- Passes remaining args through to the tsx CLI.
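
The wrapper itself is a shell script built around `trap`. As a rough TypeScript analog of the same control flow, the sketch below uses `try`/`finally`; the `SessionHooks` callbacks are hypothetical stand-ins for ci-up.sh, the tsx CLI, and ci-down.sh. It also assumes that teardown is skipped whenever setup is skipped, which the commit message does not state.

```typescript
// Hypothetical stand-ins for ci-up.sh, the tsx CLI, and ci-down.sh.
interface SessionHooks {
  up: () => void;
  run: (args: string[]) => number; // returns the runner's exit code
  down: () => void;
}

function runSession(
  suite: string | undefined,
  args: string[],
  hooks: SessionHooks
): number {
  // Assumption: the build suite needs no stack, so setup and teardown are both skipped.
  const needsStack = suite !== "build";
  try {
    if (needsStack) hooks.up();
    return hooks.run(args); // remaining args pass straight through
  } finally {
    // Runs on a normal return and on a thrown error alike.
    if (needsStack) hooks.down();
  }
}
```

Unlike the shell trap on EXIT/INT/TERM, `finally` does not run when the process is killed by a signal, so this only illustrates the pass, fail, and crash paths.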

Call sites updated:
- TC-INTEGRATION-001 and TC-E2E-001 no longer invoke ci-up/ci-down.
- TC-INTEGRATION-001's spurious dependency on TC-BUILD-002 removed;
  the compose image is built by ci-up.sh, and TC-BUILD-002 builds a
  separate testlink-ci-test tag only used by TC-BUILD-001.
- package.json npm scripts route through the wrapper.
- .github/workflows/test-suite.yml calls the wrapper from repo root.

Refs #5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three pieces of framework scaffolding for the test restructure:

1. Executor auto-populates `{{runId}}` (Date.now()) and `{{testId}}`
   (test case ID) as captured variables at the start of each test.
   Tests can use these to generate unique entity names without
   bespoke bash.

2. Add cicd/scripts/xmlrpc-capture.sh — reads an XML-RPC methodCall
   document from stdin, POSTs to the TestLink API, mirrors the
   response to stderr for expectPatterns, and emits structured JSON
   on stdout for the framework's capture: mechanism. Replaces ~6
   lines of inline bash extraction in every CRUD step.

3. SUITES list updated to match the guidelines:
   build, smoke, auth, crud, workflow, negative, regression.
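
To illustrate item 1, here is a small sketch of seeding the two built-in variables and expanding `{{name}}` placeholders. The function names and the choice to leave unknown placeholders untouched are assumptions; the executor's real behavior for unknown names is not described here.

```typescript
type Variables = Record<string, string>;

// Seed the built-in variables at the start of a test.
function seedVariables(testId: string, now: number = Date.now()): Variables {
  return { runId: String(now), testId };
}

// Replace each {{name}} with its value. Unknown names are left as-is (an assumption).
function substitute(template: string, vars: Variables): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, name: string) => {
    const value = vars[name];
    return value !== undefined ? value : whole;
  });
}
```

With `seedVariables("TC-CRUD-001", 1713522420000)`, the hypothetical template `project-{{testId}}-{{runId}}` expands to `project-TC-CRUD-001-1713522420000`, so each run produces a distinct entity name.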

Refs #5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reorganize cicd/tests/testcases/ per the guidelines: drop the
integration/e2e bucket names, organize by the test-flow pyramid
(smoke → auth → crud → workflow). Rewrite each test to follow the
three core rules:

- Unique names — every created entity embeds {{testId}} and {{runId}}
  (auto-populated by the executor) so rerunning against the same DB
  never collides with residue.
- Dynamic ID capture — the xmlrpc-capture.sh helper extracts the
  created entity's id from the XML response and emits JSON; the
  framework's capture: mechanism pipes it into {{projectId}},
  {{suiteId}}, {{caseId}} for subsequent steps.
- Per-test data ownership — every test creates its parent entities
  in setup steps and deletes them in reverse order as teardown steps.
  No test leaks data that another test depends on.

Layout:
  smoke/     TC-SMOKE-001   login page responds
             TC-SMOKE-002   XML-RPC tl.ping
  auth/      TC-AUTH-001    valid API key accepted
             TC-AUTH-002    invalid API key rejected
  crud/      TC-CRUD-001    project CRUD
             TC-CRUD-002    test suite CRUD
             TC-CRUD-003    test case CRUD
  workflow/  TC-WORKFLOW-001  full entity-graph round-trip

The old e2e/TC-E2E-001 is superseded by TC-WORKFLOW-001, which
exercises the same graph but with captured IDs (the original
hardcoded testprojectid=1, testsuiteid=2, external-id CIT-1).

test-pipeline.yml updated to run the new suites in dependency order:
build → smoke → auth → crud → workflow. Lower-layer failure
short-circuits the higher layers.
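
The short-circuit rule can be sketched as a loop that stops at the first failing layer. The real pipeline is a workflow file, so this is only a model of the ordering; `runSuite` is a hypothetical callback that returns true when a suite passes.

```typescript
const LAYERS: string[] = ["build", "smoke", "auth", "crud", "workflow"];

// Run suites in dependency order and return the ones that were actually executed.
function runLayers(
  runSuite: (suite: string) => boolean,
  layers: string[] = LAYERS
): string[] {
  const executed: string[] = [];
  for (const suite of layers) {
    executed.push(suite);
    if (!runSuite(suite)) break; // a lower-layer failure skips every higher layer
  }
  return executed;
}
```

For example, if `auth` fails, the result is `["build", "smoke", "auth"]` and neither `crud` nor `workflow` runs.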

Refs #5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced when running the CRUD and workflow suites against a
live TestLink:

1. xmlrpc-capture.sh pulled the "first <int> in the response" as the
   created id. But TestLink's XML-RPC emits the `id` field as
   <string>N</string> for several entity types (createTestProject
   among them), so the capture emitted `{"ok": true}` with no id and
   every downstream step referenced an empty {{projectId}}. Match the
   `id` member explicitly and accept both <int> and <string> wrappers;
   do the same for faultCode.

2. The executor substituted {{runId}} / {{testId}} / captured vars into
   step.command but not into step.expectPatterns. "Read back" steps
   used expectPatterns like "case-{{testId}}-{{runId}}" to verify the
   round-trip, which never matched because the regex was the literal
   template string. Run the same substitution over expect/reject
   patterns before checking.
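
Both fixes can be sketched in a few lines. The actual helper is a shell script, so the TypeScript below is only a model of the matching logic: `extractMember` looks up a named struct member and accepts either wrapper, and `matchesPattern` expands placeholders before the regex test. The function names are hypothetical.

```typescript
// Find a named XML-RPC struct member whose value is an integer wrapped in <int> or <string>.
function extractMember(xml: string, name: string): string | undefined {
  const re = new RegExp(
    `<name>${name}</name>\\s*<value>\\s*<(int|string)>(-?\\d+)</\\1>`
  );
  const m = re.exec(xml);
  return m ? m[2] : undefined;
}

// Expand {{name}} placeholders in a pattern, then test it as a regex against the output.
function matchesPattern(
  output: string,
  pattern: string,
  vars: Record<string, string>
): boolean {
  const resolved = pattern.replace(
    /\{\{(\w+)\}\}/g,
    (whole: string, key: string) => {
      const value = vars[key];
      return value !== undefined ? value : whole;
    }
  );
  return new RegExp(resolved).test(output);
}
```

Given the illustrative fragment `<member><name>id</name><value><string>7</string></value></member>`, `extractMember(fragment, "id")` returns `"7"`, and the same call works when the value is wrapped in `<int>`.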

Expect-patterns for creation steps changed from "<int>" (which can never
match a <string>-wrapped id) to "<name>id</name>", which is stable across
both id representations.

Verified against a fresh CI stack:
- smoke:    2/2 pass
- auth:     2/2 pass
- crud:     3/3 pass
- workflow: 1/1 pass
- workflow re-run in same session: 1/1 pass (idempotent)

Refs #5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CI compose file used to publish the app on host port 8090, the
same port as the dev compose, so the two stacks could not coexist.
Move the port behind ${TL_PORT:-8091} and have all CI scripts and
test helpers reference $TL_URL. run-tests.sh sources cicd/tests/.env
(gitignored) so a single override changes every consumer.
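
A sketch of the single-override idea, written against a plain environment map. The 8091 default comes from the compose change above; deriving `TL_URL` as `http://localhost:<port>` when it is unset is an assumption, since the commit does not say how the scripts compute it.

```typescript
type Env = Record<string, string | undefined>;

// Resolve the base URL every consumer should use.
function resolveTestlinkUrl(env: Env): string {
  const explicit = env["TL_URL"];
  if (explicit) return explicit; // an explicit URL wins
  const port = env["TL_PORT"] || "8091"; // same default as ${TL_PORT:-8091}
  return `http://localhost:${port}`; // assumed derivation, not confirmed by the source
}
```

With an empty environment this yields `http://localhost:8091`, which leaves host port 8090 free for the dev stack.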

Why: a developer who left the dev stack running would see CI fail to
bind on every test invocation. Splitting the port lets both run side
by side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

Three changes to the judge:

1. Switch from Ollama /api/generate to /api/chat with a system + user
   message split. Make LLM_JUDGE_URL and LLM_JUDGE_MODEL env-driven via
   cicd/tests/.env so the host running tests can target any Ollama
   instance and model without touching code.

2. Redesign the prompt along role/task/behavior/output lines. Drop the
   hardcoded heuristics ("exit code 0 = pass", etc.) that were fighting
   negative tests. Add two optional YAML fields the test author owns:
     - objective    — what the test proves and why it matters
     - judgeContext — what evidence each step produces and what silent
                      failures look like in this domain
   The judge reads OBJECTIVE → CONTEXT → CRITERIA → OBSERVATIONS in that
   order, so the per-test situational framing reaches the model before
   the raw evidence.

3. Tune Ollama options for small (~4B) models:
     - num_ctx 8192     — prompts can hit ~5-7k chars; 4096 default was
                          silently truncating OBSERVATIONS for
                          multi-step tests, producing empty/garbage JSON
     - num_predict 512  — enough headroom for FAIL evidence quotes
                          while still catching runaway generation

Why: gemma3:4b was returning empty {} or off-schema JSON on roughly
20-30% of tests with the old prompt and default options. The combo of
per-test framing and proper context size brings 11/11 testcases through
the judge cleanly.
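
The request shape implied by these three changes might look like the sketch below. `model`, `messages`, `stream`, and `options` are fields of Ollama's /api/chat request; the `JudgeInput` field names `criteria` and `observations`, the section labels' exact formatting, and the `buildJudgeRequest` name are assumptions for illustration.

```typescript
interface JudgeInput {
  objective: string; // from the YAML `objective` field
  judgeContext: string; // from the YAML `judgeContext` field
  criteria: string; // hypothetical: the test's pass criteria
  observations: string; // hypothetical: collected step output
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  stream: boolean;
  options: { num_ctx: number; num_predict: number };
}

// Build an /api/chat body with the framing ahead of the raw evidence.
function buildJudgeRequest(
  model: string,
  systemPrompt: string,
  input: JudgeInput
): ChatRequest {
  const user = [
    `OBJECTIVE:\n${input.objective}`,
    `CONTEXT:\n${input.judgeContext}`,
    `CRITERIA:\n${input.criteria}`,
    `OBSERVATIONS:\n${input.observations}`,
  ].join("\n\n");
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: user },
    ],
    stream: false, // one complete JSON response instead of a token stream
    options: { num_ctx: 8192, num_predict: 512 },
  };
}
```

Keeping the role and output rules in the system message and the per-test material in the user message is one way to realize the system + user split described in item 1.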

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-BUILD-003

Each YAML now declares:
  - objective    — the plain-English purpose of the test
  - judgeContext — what each step does, what evidence it emits, and what
                   silent failures look like in that test's domain

This is the per-test framing the redesigned LLM judge consumes. Negative
tests (TC-AUTH-002) explicitly call out that they EXPECT a fault
response, which fixes the prior misjudgment where the judge labeled the
expected fault as a failure.

TC-BUILD-003 is also rewritten to validate composer.json from inside
the testlink-ci-test image (depending on TC-BUILD-002 to build it)
instead of shelling out to a host python3 that may not exist.

TC-SMOKE-001 picks up the $TL_URL change from the prior port-config
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enerate

- run-tests.sh header still listed --suite integration and TC-E2E-001 in
  the usage examples, both deleted in the suite restructure. Update to
  --suite crud and --id TC-WORKFLOW-001.
- LLMJudge.unloadModel was POSTing to /api/chat with messages: [], which
  is awkward — the unload doesn't need conversational semantics, and an
  empty messages array isn't well-defined for the chat endpoint. Use
  /api/generate with keep_alive: 0 (Ollama's documented unload pattern).
  /api/chat stays in place for the actual judging.
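
For reference, the unload request reduces to a model name plus `keep_alive: 0` sent to /api/generate. The sketch below only builds the URL and body; it assumes `LLM_JUDGE_URL` holds a base URL without a path, and `buildUnloadRequest` is a hypothetical name rather than the method's real implementation.

```typescript
interface UnloadRequest {
  url: string;
  body: { model: string; keep_alive: number };
}

// Build the request that asks Ollama to unload a model immediately.
function buildUnloadRequest(baseUrl: string, model: string): UnloadRequest {
  const trimmed = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${trimmed}/api/generate`,
    body: { model, keep_alive: 0 }, // no prompt and no messages are needed
  };
}
```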

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace 24 hardcoded literals of the admin API key in YAML test steps
with the framework's {{devKey}} substitution. The key now flows from a
single source:

  cicd/tests/.env (TL_DEV_KEY)
    -> run-tests.sh / ci-up.sh source it
    -> executor.ts injects it into TestExecutor.variables as {{devKey}}
    -> ci-up.sh forwards it via `docker compose exec -e TL_DEV_KEY`
       into init-db.sh, which seeds the matching value into the users
       table

Each consumer falls back to the previous hardcoded default
(a1b2c3d4...) when TL_DEV_KEY is unset, so existing setups keep working
without an .env file.

Why: rotating the seeded admin key used to require editing init-db.sh,
ci-up.sh, and 5 YAML files. Now it's one .env line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docker-publish.yml: switch to workflow_dispatch only with a job-level
  guard restricting it to refs/heads/testlink_1_9_20_fixed, so accidental
  manual dispatches from a feature branch can't publish. Forks without
  GHCR write auth no longer fail on every branch push.
- test-pipeline.yml: drop the pull_request trigger; PRs no longer fire
  CI automatically. Use workflow_dispatch from the Actions tab to run
  against a feature branch when needed. Push to main still runs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the push-to-main trigger. Both workflows (CI Pipeline and Docker)
now only fire via workflow_dispatch. Run from the Actions tab against
whichever ref you want to validate or release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dogkeeper886 dogkeeper886 merged commit d505af6 into testlink_1_9_20_fixed Apr 19, 2026
@dogkeeper886 dogkeeper886 deleted the issue-5-ci-test-redesign branch April 19, 2026 16:31
Development

Successfully merging this pull request may close these issues.

[refactor] CI test suite: nested lifecycle, dynamic ID capture, log-collector fix

1 participant