Issue #5: CI test redesign — nested lifecycle, dynamic IDs, log collector, LLM judge by dogkeeper886 · Pull Request #6 · dogkeeper886/testlink-code

dogkeeper886 · 2026-04-19T16:05:07Z

Closes #5.

Summary

End-to-end rework of the CI test framework along the lines of the issue and cicd/TESTING_GUIDELINES.md:

Lifecycle. Session setup/teardown lives in cicd/scripts/run-tests.sh with trap EXIT, so containers/volumes are torn down on any exit path (pass, fail, Ctrl-C, crash). Per-test/step lifecycle is enforced in YAML.
Dynamic IDs. Every entity ID flows through capture: from XML-RPC responses. No numeric ID literal appears in test params (only in response-shape assertions).
Unique entity names. Every created entity carries a {{runId}} / {{testId}} suffix, so the suite is idempotent across re-runs.
Log collector. Now points at cicd/docker-compose.ci.yml explicitly with cwd=projectRoot; per-test logs land in cicd/results/<run>/<testId>.log.
Test flow layers. testcases/ is reorganized into smoke/ → auth/ → crud/ → workflow/. Build suite stays separate (no stack lifecycle).
Port collision. CI now binds ${TL_PORT:-8091}:80 (dev still uses 8090). All scripts and YAML steps reference $TL_URL, so a single env override changes every consumer.
LLM judge redesign. Switched Ollama /api/generate → /api/chat. Made LLM_JUDGE_URL / LLM_JUDGE_MODEL env-driven via cicd/tests/.env. Replaced rigid heuristics with a role/task/behavior/output prompt structure. Added two YAML fields the test author owns (objective, judgeContext) so per-test situational framing reaches the model. Tuned Ollama options (num_ctx 8192, num_predict 512) so small models (gemma3:4b) neither silently truncate prompts nor generate without bound.

Issue #5 was updated (with consent) to bring LLM-judge work in scope and to mark TL_PORT as a Must.

Acceptance criteria check

✅ Fresh checkout runs the full suite cleanly: 11/11 simple, 11/11 LLM, exit 0 (~47s).
✅ Any single suite runs in isolation (--suite build skips ci-up; others use the wrapper).
✅ Re-runs are idempotent (entity names carry runId; teardown deletes by captured IDs/prefixes).
✅ Per-test logs are non-empty in cicd/results/<run>/.
✅ No numeric ID literals in test params (only inside <int>1</int> step-number assertions in createTestCase payloads).
✅ Every testcase satisfies the §10 checklist in TESTING_GUIDELINES.md.

Test plan

bash cicd/scripts/run-tests.sh against a fresh checkout — should land 11/11 in ~50s with exit 0.
bash cicd/scripts/run-tests.sh --suite build — should run 3/3 without bringing up the compose stack.
Run twice in a row — second run should pass identically (idempotency).
Verify the dev compose can run on host port 8090 while CI runs on 8091 simultaneously.
cat cicd/results/<run>/TC-CRUD-001.log — should contain docker compose log lines for that test window.
Optional: try a different LLM by overriding LLM_JUDGE_MODEL in cicd/tests/.env.

Out of scope (followups)

Replacement of the YAML runner with Vitest or another standard framework.
Dedicated negative suite (TC-AUTH-002 is the only negative test today; it lives in auth/).
Running the test runner itself inside a container (would mount /var/run/docker.sock).

🤖 Generated with Claude Code

…e separation) Canonical design reference for the CI test suite: four nested lifecycle scopes (session/suite/test/step) with guaranteed teardown, dynamic ID capture rules, layered test flow, unique-names convention, and the rationale for keeping docker-compose.ci.yml separate from the dev docker-compose.yml. Refs #5 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The log collector was pointed at `<projectRoot>/docker/`, a directory that contains no compose file. `docker compose logs` in that cwd failed to start and the collector silently disabled itself, so per-test log extraction never produced output. Replace `RunConfig.dockerComposePath` (directory) with `composeFile` (absolute path to the compose file). Invoke `docker compose -f <composeFile> logs --follow --timestamps` with cwd set to the project root. Default target is the CI compose at `cicd/docker-compose.ci.yml`. Refs #5 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
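A minimal sketch of the invocation described above, assuming a hypothetical helper named `composeLogsInvocation` (the PR's actual collector code is not shown here). It only assembles the command, arguments, and working directory; spawning the process is left out.

```typescript
// Hypothetical helper: the name and return shape are illustrative,
// not taken from the PR's source.
interface ComposeLogsInvocation {
  command: string;
  args: string[];
  cwd: string;
}

function composeLogsInvocation(
  projectRoot: string,
  composeFile?: string,
): ComposeLogsInvocation {
  // Default target is the CI compose file, as the commit message states.
  const file = composeFile ?? `${projectRoot}/cicd/docker-compose.ci.yml`;
  return {
    command: "docker",
    // Pass the compose file explicitly so the cwd no longer decides
    // which stack the logs come from.
    args: ["compose", "-f", file, "logs", "--follow", "--timestamps"],
    cwd: projectRoot,
  };
}
```

Passing `-f` with an absolute path avoids the earlier failure mode, where the working directory contained no compose file.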

Session setup (ci-up.sh) and teardown (ci-down.sh) used to live inside test-case steps: TC-INTEGRATION-001 started the stack, and TC-E2E-001's last step tore it down. That made teardown conditional on the e2e test reaching its final step: any earlier failure left containers and volumes behind. Running only one suite was inconsistent too (build skipped lifecycle entirely, integration started but never stopped).

Introduce cicd/scripts/run-tests.sh as the single entry point. It:

- Runs ci-up.sh before tests (except for --suite build, which only exercises the image artifact).
- Traps EXIT/INT/TERM to invoke ci-down.sh regardless of outcome: pass, fail, Ctrl-C, or crash all produce a clean teardown.
- Passes remaining args through to the tsx CLI.

Call sites updated:

- TC-INTEGRATION-001 and TC-E2E-001 no longer invoke ci-up/ci-down.
- TC-INTEGRATION-001's spurious dependency on TC-BUILD-002 removed; the compose image is built by ci-up.sh, and TC-BUILD-002 builds a separate testlink-ci-test tag only used by TC-BUILD-001.
- package.json npm scripts route through the wrapper.
- .github/workflows/test-suite.yml calls the wrapper from repo root.

Refs #5 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
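The wrapper itself is a bash script built on `trap`. As an illustrative analog only, the same invariant (teardown runs no matter how the test body ends) can be sketched in TypeScript with `try`/`finally`. The function name and parameters are hypothetical, and unlike the bash trap this does not cover signals such as Ctrl-C.

```typescript
// Illustrative analog of the wrapper's guarantee; the real implementation
// is bash (`trap ... EXIT INT TERM`), not this function.
function withSession<T>(
  up: () => void,
  down: () => void,
  body: () => T,
  skipLifecycle = false, // e.g. the build suite, which needs no stack
): T {
  if (skipLifecycle) {
    return body();
  }
  try {
    up();
    return body();
  } finally {
    // Runs on normal return and on a thrown error alike, so a failing
    // test can no longer leave containers or volumes behind.
    down();
  }
}
```

Here teardown also runs when `up` itself throws, on the assumption that a half-started stack still needs cleaning up.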

Three pieces of framework scaffolding for the test restructure:

1. Executor auto-populates `{{runId}}` (Date.now()) and `{{testId}}` (test case ID) as captured variables at the start of each test. Tests can use these to generate unique entity names without bespoke bash.
2. Add cicd/scripts/xmlrpc-capture.sh, which reads an XML-RPC methodCall document from stdin, POSTs to the TestLink API, mirrors the response to stderr for expectPatterns, and emits structured JSON on stdout for the framework's capture: mechanism. Replaces ~6 lines of inline bash extraction in every CRUD step.
3. SUITES list updated to match the guidelines: build, smoke, auth, crud, workflow, negative, regression.

Refs #5 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
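A rough sketch of how item 1 could work, assuming hypothetical function names (`seedVariables`, `substitute`) and a `Map` for storage; the executor's real internals are not shown in this PR text. Leaving unknown placeholders untouched is also an assumption.

```typescript
// Seed the two variables the executor is said to auto-populate.
// `now` is injectable so the example is deterministic.
function seedVariables(testId: string, now: number = Date.now()): Map<string, string> {
  return new Map<string, string>([
    ["runId", String(now)],
    ["testId", testId],
  ]);
}

// Replace each {{name}} with its captured value. Unknown names are left
// as-is (an assumption; the real executor may treat this differently).
function substitute(template: string, vars: Map<string, string>): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (match: string, name: string) => vars.get(name) ?? match,
  );
}
```

A step could then name an entity `proj-{{testId}}-{{runId}}` and get a fresh, collision-free name on every run.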

Reorganize cicd/tests/testcases/ per the guidelines: drop the integration/e2e bucket names, organize by the test-flow pyramid (smoke → auth → crud → workflow).

Rewrite each test to follow the three core rules:

- Unique names: every created entity embeds {{testId}} and {{runId}} (auto-populated by the executor) so rerunning against the same DB never collides with residue.
- Dynamic ID capture: the xmlrpc-capture.sh helper extracts the created entity's id from the XML response and emits JSON; the framework's capture: mechanism pipes it into {{projectId}}, {{suiteId}}, {{caseId}} for subsequent steps.
- Per-test data ownership: every test creates its parent entities in setup steps and deletes them in reverse order as teardown steps. No test leaks data that another test depends on.

Layout:

- smoke/
  - TC-SMOKE-001: login page responds
  - TC-SMOKE-002: XML-RPC tl.ping
- auth/
  - TC-AUTH-001: valid API key accepted
  - TC-AUTH-002: invalid API key rejected
- crud/
  - TC-CRUD-001: project CRUD
  - TC-CRUD-002: test suite CRUD
  - TC-CRUD-003: test case CRUD
- workflow/
  - TC-WORKFLOW-001: full entity-graph round-trip

The old e2e/TC-E2E-001 is superseded by TC-WORKFLOW-001, which exercises the same graph but with captured IDs (the original hardcoded testprojectid=1, testsuiteid=2, external-id CIT-1).

test-pipeline.yml updated to run the new suites in dependency order: build → smoke → auth → crud → workflow. Lower-layer failure short-circuits the higher layers.

Refs #5 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
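The short-circuit rule lives in the workflow file, which is not reproduced here. The ordering semantics can be illustrated with a small TypeScript sketch; `runLayers` and the `LayerResult` shape are hypothetical names used only for this example.

```typescript
// Layers in dependency order, as listed in the commit message.
const LAYERS = ["build", "smoke", "auth", "crud", "workflow"] as const;
type Layer = (typeof LAYERS)[number];

interface LayerResult {
  layer: Layer;
  status: "passed" | "failed" | "skipped";
}

// Run each layer in order; once one fails, every higher layer is skipped.
function runLayers(runSuite: (layer: Layer) => boolean): LayerResult[] {
  const results: LayerResult[] = [];
  let blocked = false;
  for (const layer of LAYERS) {
    if (blocked) {
      results.push({ layer, status: "skipped" });
      continue;
    }
    const passed = runSuite(layer);
    results.push({ layer, status: passed ? "passed" : "failed" });
    if (!passed) {
      blocked = true;
    }
  }
  return results;
}
```

With this rule a broken auth layer reports one failure instead of a cascade of misleading crud and workflow failures.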

Two bugs surfaced when running the CRUD and workflow suites against a live TestLink:

1. xmlrpc-capture.sh pulled the "first <int> in the response" as the created id. But TestLink's XML-RPC emits the `id` field as <string>N</string> for several entity types (createTestProject among them), so the capture emitted `{"ok": true}` with no id and every downstream step referenced an empty {{projectId}}. Match the `id` member explicitly and accept both <int> and <string> wrappers; do the same for faultCode.
2. The executor substituted {{runId}} / {{testId}} / captured vars into step.command but not into step.expectPatterns. "Read back" steps used expectPatterns like "case-{{testId}}-{{runId}}" to verify the round-trip, which never matched because the regex was the literal template string. Run the same substitution over expect/reject patterns before checking.

Test expect-patterns for creation steps updated from "<int>" (never going to match <string>) to "<name>id</name>", which is stable across both id representations.

Verified against a fresh CI stack:

- smoke: 2/2 pass
- auth: 2/2 pass
- crud: 3/3 pass
- workflow: 1/1 pass
- workflow re-run in same session: 1/1 pass (idempotent)

Refs #5 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
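The fix for bug 1 is in a bash helper. The matching rule it describes can be sketched in TypeScript as follows. The function names and the fault branch's JSON shape are assumptions; accepting `<i4>` (the XML-RPC alias for `<int>`) is also an addition not mentioned in the commit.

```typescript
interface Capture {
  ok: boolean;
  id?: string;
  faultCode?: string;
}

// Find a struct member by name and return its scalar value, whether the
// server wrapped it in <int>, <i4>, or <string>.
// `name` is interpolated into a regex, so pass only plain identifiers.
function memberValue(xml: string, name: string): string | undefined {
  const pattern = new RegExp(
    `<name>${name}</name>\\s*<value>\\s*<(int|i4|string)>([^<]*)</\\1>`,
  );
  const match = pattern.exec(xml);
  return match ? match[2] : undefined;
}

// Build the capture object: a fault wins over an id.
function toCapture(xml: string): Capture {
  const faultCode = memberValue(xml, "faultCode");
  if (faultCode !== undefined) {
    return { ok: false, faultCode };
  }
  const id = memberValue(xml, "id");
  return id === undefined ? { ok: true } : { ok: true, id };
}
```

Matching on the member name rather than on the first `<int>` keeps the capture stable when the server changes the wrapper type or reorders fields.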

The CI compose file used to publish the app on host port 8090, the same port as the dev compose, so the two stacks could not coexist. Move the port behind ${TL_PORT:-8091} and have all CI scripts and test helpers reference $TL_URL. run-tests.sh sources cicd/tests/.env (gitignored) so a single override changes every consumer. Why: a developer who left the dev stack running would see CI fail to bind on every test invocation. Splitting the port lets both run side by side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
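How `$TL_URL` is derived from `TL_PORT` is not spelled out in the commit. The sketch below assumes a `localhost` URL built from the port, with an explicit `TL_URL` taking precedence. `resolveTlUrl` is a hypothetical name, and the environment is passed in as a plain object to keep the example self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Mirror the shell's ${TL_PORT:-8091}: unset or empty falls back to 8091.
// The http://localhost form is an assumption made for illustration.
function resolveTlUrl(env: Env): string {
  const explicit = env["TL_URL"];
  if (explicit) {
    return explicit;
  }
  const port = env["TL_PORT"] || "8091";
  return `http://localhost:${port}`;
}
```

Every consumer that calls one resolver like this picks up a port change from a single line in cicd/tests/.env.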

…ning

Three changes to the judge:

1. Switch from Ollama /api/generate to /api/chat with a system + user message split. Make LLM_JUDGE_URL and LLM_JUDGE_MODEL env-driven via cicd/tests/.env so the host running tests can target any Ollama instance and model without touching code.
2. Redesign the prompt along role/task/behavior/output lines. Drop the hardcoded heuristics ("exit code 0 = pass", etc.) that were fighting negative tests. Add two optional YAML fields the test author owns:
   - objective: what the test proves and why it matters
   - judgeContext: what evidence each step produces and what silent failures look like in this domain

   The judge reads OBJECTIVE → CONTEXT → CRITERIA → OBSERVATIONS in that order, so the per-test situational framing reaches the model before the raw evidence.
3. Tune Ollama options for small (~4B) models:
   - num_ctx 8192: prompts can hit ~5-7k chars; the 4096 default was silently truncating OBSERVATIONS for multi-step tests, producing empty/garbage JSON
   - num_predict 512: enough headroom for FAIL evidence quotes while still catching runaway generation

Why: gemma3:4b was returning empty {} or off-schema JSON on roughly 20-30% of tests with the old prompt and default options. The combo of per-test framing and proper context size brings 11/11 testcases through the judge cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
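A sketch of what the request body for /api/chat might look like under these changes. The section order and the two option values come from the commit message; the interface names, the `criteria` and `observations` field names, the section formatting, and `stream: false` are assumptions.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  stream: boolean;
  messages: ChatMessage[];
  options: { num_ctx: number; num_predict: number };
}

// Hypothetical input shape; objective and judgeContext are optional YAML fields.
interface JudgeInput {
  objective?: string;
  judgeContext?: string;
  criteria: string;
  observations: string;
}

// Emit sections in the order the judge reads them, skipping absent ones.
function buildUserMessage(input: JudgeInput): string {
  const sections: Array<[string, string | undefined]> = [
    ["OBJECTIVE", input.objective],
    ["CONTEXT", input.judgeContext],
    ["CRITERIA", input.criteria],
    ["OBSERVATIONS", input.observations],
  ];
  return sections
    .filter(([, text]) => text !== undefined && text.trim() !== "")
    .map(([label, text]) => `${label}:\n${text}`)
    .join("\n\n");
}

function buildChatRequest(model: string, systemPrompt: string, input: JudgeInput): ChatRequest {
  return {
    model,
    stream: false, // assumption: one complete JSON reply is easier to parse
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: buildUserMessage(input) },
    ],
    options: { num_ctx: 8192, num_predict: 512 },
  };
}
```

Placing the author-owned framing ahead of the raw observations means a negative test's expected fault is explained before the model sees it.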

…-BUILD-003

Each YAML now declares:

- objective: the plain-English purpose of the test
- judgeContext: what each step does, what evidence it emits, and what silent failures look like in that test's domain

This is the per-test framing the redesigned LLM judge consumes. Negative tests (TC-AUTH-002) explicitly call out that they EXPECT a fault response, which fixes the prior misjudgment where the judge labeled the expected fault as a failure.

TC-BUILD-003 is also rewritten to validate composer.json from inside the testlink-ci-test image (depending on TC-BUILD-002 to build it) instead of shelling out to a host python3 that may not exist.

TC-SMOKE-001 picks up the $TL_URL change from the prior port-config commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…enerate

- run-tests.sh header still listed --suite integration and TC-E2E-001 in the usage examples, both deleted in the suite restructure. Update to --suite crud and --id TC-WORKFLOW-001.
- LLMJudge.unloadModel was POSTing to /api/chat with messages: [], which is awkward: the unload doesn't need conversational semantics, and an empty messages array isn't well-defined for the chat endpoint. Use /api/generate with keep_alive: 0 (Ollama's documented unload pattern). /api/chat stays in place for the actual judging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
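A small sketch of the unload request described above. It only builds the URL and body and sends nothing. `buildUnloadRequest` is a hypothetical name, and treating the configured judge URL as a base URL (without the endpoint path) is an assumption.

```typescript
interface UnloadRequest {
  url: string;
  body: { model: string; keep_alive: number };
}

// Build a /api/generate request with no prompt and keep_alive: 0, which
// asks Ollama to unload the model right away.
function buildUnloadRequest(baseUrl: string, model: string): UnloadRequest {
  // Assumption: baseUrl is the server root; trim any trailing slashes.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/api/generate`,
    body: { model, keep_alive: 0 },
  };
}
```

The body would be sent as JSON in a POST; keeping judging on /api/chat and unloading on /api/generate gives each endpoint a request shape it is defined for.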

Replace 24 hardcoded literals of the admin API key in YAML test steps with the framework's {{devKey}} substitution.

The key now flows from a single source:

- cicd/tests/.env defines TL_DEV_KEY.
- run-tests.sh and ci-up.sh source that file.
- executor.ts injects the value into TestExecutor.variables as {{devKey}}.
- ci-up.sh forwards it via `docker compose exec -e TL_DEV_KEY` into init-db.sh, which seeds the matching value into the users table.

Each consumer falls back to the previous hardcoded default (a1b2c3d4...) when TL_DEV_KEY is unset, so existing setups keep working without an .env file.

Why: rotating the seeded admin key used to require editing init-db.sh, ci-up.sh, and 5 YAML files. Now it's one .env line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- docker-publish.yml: switch to workflow_dispatch only with a job-level guard restricting it to refs/heads/testlink_1_9_20_fixed, so accidental manual dispatches from a feature branch can't publish. Forks without GHCR write auth no longer fail on every branch push.
- test-pipeline.yml: drop the pull_request trigger; PRs no longer fire CI automatically. Use workflow_dispatch from the Actions tab to run against a feature branch when needed. Push to main still runs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the push-to-main trigger. Both workflows (CI Pipeline and Docker) now only fire via workflow_dispatch. Run from the Actions tab against whichever ref you want to validate or release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dogkeeper886 and others added 13 commits April 19, 2026 10:27

dogkeeper886 merged commit d505af6 into testlink_1_9_20_fixed Apr 19, 2026

dogkeeper886 deleted the issue-5-ci-test-redesign branch April 19, 2026 16:31
