[MISC] Add tox-driven test rig foundation (unit + integration + e2e)#1971
[MISC] Add tox-driven test rig foundation (unit + integration + e2e)#1971hari-kuriakose wants to merge 11 commits into
Conversation
Introduces a single tox-driven entry point that brings every existing test suite under one rig, adds an e2e tier on top of the existing docker-compose stack, and surfaces critical-path gaps + regressions in CI reports. - tests/groups.yaml declares 15 groups across unit/integration/e2e with a depends_on graph; tests/critical_paths.yaml registers 10 critical flows. - tests/rig is the dispatcher: python -m tests.rig run [groups|all|--from-file] [--tier] [--changed-only] [--coverage/--no-coverage] [--parallel]. - E2E supports docker compose (CI default) or testcontainers + local subprocesses (dev default), selectable via UNSTRACT_E2E_RUNTIME. - Reports: per-group junit.xml + md, plus reports/summary.md with regressions and critical-path gaps; htmlcov + coverage.xml when --coverage. - New ci-test-e2e.yaml lane runs on main, run-e2e-labeled PRs, and nightly; ci-test.yaml rewritten to call the rig and update the main baseline cache. - Pre-commit hook fires only when developers create .test-selection. - Existing per-service tests stay co-located; legacy tox envs (runner, sdk1, prompt-service) preserved as thin aliases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Python "tests.rig" test rig: group and critical-path manifests and loaders, selection resolution, E2E runtime drivers, a CLI orchestrator to run groups, reporting and coverage combining, CI/workflow integration, pre-commit hook, helper combine script, extensive docs, and self-tests. ChangesTest Rig Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 19
🧹 Nitpick comments (1)
tests/rig/groups.py (1)
91-94: ⚡ Quick winValidate manifest schema version during load.
The manifest is versioned (
version: 1) butload_groups()does not enforce it. Failing fast on unknown versions avoids silent misparsing when schema evolves.Suggested fix
def load_groups(path: Path | None = None) -> GroupManifest: @@ raw = yaml.safe_load(manifest_path.read_text()) if not isinstance(raw, dict) or "groups" not in raw: raise ValueError(f"{manifest_path}: expected top-level `groups:` mapping") + if raw.get("version") != 1: + raise ValueError( + f"{manifest_path}: unsupported manifest version {raw.get('version')!r}; expected 1" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/groups.py` around lines 91 - 94, load_groups() currently accepts any manifest without checking its version; add an explicit validation after the top-level raw check to read raw.get("version") and if it's missing or not equal to 1 raise a ValueError referencing manifest_path (e.g. "{manifest_path}: unsupported or missing manifest version, expected version: 1"). Make the check inside the same function (load_groups) right before reading defaults/raw["groups"] so unknown schema versions fail fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci-test-e2e.yaml:
- Around line 39-45: The cache step "Restore main-branch test baseline" uses a
fixed key ("unstract-test-baseline-main-e2e") and a wrong restore-key
("unstract-test-baseline-main"); change the key to include a dynamic identifier
(e.g., github.ref or github.sha) so updates create new cache entries (for
example interpolate "${{ github.ref }}-unstract-test-baseline-main-e2e" or "${{
github.sha }}-unstract-test-baseline-main-e2e"), and make the restore-keys use
the correct e2e namespace (e.g., "unstract-test-baseline-main-e2e" as the
prefix) to avoid cross-lane contamination; apply the same change to the other
equivalent step referenced (lines 51-53).
- Around line 57-63: The "Capture docker compose logs on failure" step redirects
output to reports/docker-compose-logs.txt but doesn't ensure the reports/
directory exists; update the failing-step's run sequence to create the reports
directory (e.g., run mkdir -p reports) before executing the docker compose ... >
reports/docker-compose-logs.txt command so log redirection never fails and logs
are preserved.
In @.github/workflows/ci-test.yaml:
- Around line 44-50: The cache key "unstract-test-baseline-main" in the "Restore
main-branch test baseline (for regression detection)" step is constant and
prevents new baseline uploads; change the key to include a run-unique or
branch-specific variable (for example use unstract-test-baseline-main-${{
github.run_id }} or unstract-test-baseline-${{ github.ref_name }}-${{
github.run_id }}) so updates (--update-baseline) will write a new cache, and
keep the existing restore-keys prefix "unstract-test-baseline-" so prior
baselines can still be discovered; apply the same change to the equivalent step
referenced around lines 60-62.
In `@tests/e2e/smoke/test_smoke.py`:
- Around line 17-18: The current smoke test uses a lax assertion
(response.status_code < 500) which allows non-success codes like 404; update the
assertion after the requests.get call that fetches
f"{platform.backend_url.rstrip('/')}/health/" to require a successful response
(e.g., assert response.status_code == 200 or assert response.ok) so the health
endpoint must return an explicit success status to pass the test.
In `@tests/README.md`:
- Around line 13-37: The three fenced code blocks showing the test tree and
CLI/help snippets are missing language identifiers; update the three opening
backtick fences that precede the blocks starting with "tests/" (the directory
tree), "positional GROUPS ∪ --from-file ..." (the CLI/usage snippet), and
"reports/" (the reports directory tree) to use a language tag (use "text") so
the fences read ```text instead of just ``` to satisfy markdownlint MD040.
- Around line 174-180: Update the opening sentence that currently reads
"reports/summary.md has three sections:" to match the enumerated list by
changing "three" to "four" (or alternatively remove the numbered count), so the
header and the listed items (Per-group results, ❌ Regressions, ⚠️ Critical paths
not yet covered, ✅ Covered critical paths) are consistent; locate the phrase
"reports/summary.md has three sections:" in the README and adjust it
accordingly.
In `@tests/rig/cli.py`:
- Around line 265-269: When the JUnit parsing returns None the current logic
(using result, group_results, and overall_exit) silently ignores the failure;
update the block handling result so that if result is None you still treat the
group as failed: append a placeholder/failure entry to group_results (or
otherwise record the failed parse) and set overall_exit = overall_exit or 1 (or
another non-zero exit) so the failure is propagated; keep existing behavior for
valid result objects (preserve the check against result.exit_code).
- Around line 207-297: cmd_run is too large and should be split into three clear
phases to reduce cognitive complexity: extract a plan_phase that performs
manifest/registry validation (using load_groups, cp.load_critical_paths,
cp.validate_registry_against_manifest), resolves groups (resolve,
_is_missing_placeholder) and prepares runnable/skipped lists and runtime
selection (pick_runtime) but does not start the runtime; extract an
execute_phase that receives runnable, manifest, endpoints, reports_dir and runs
each group via _execute_group collecting GroupResult and overall_exit
(preserving the dry_run behavior and parallel/timeout/workers args); extract a
finalize_phase that handles runtime teardown (runtime.down) in a single place,
coverage combine_and_report, baseline loading via cp.load_baseline, evaluation
via cp.evaluate, write_summary, reporting regressions/gaps and optionally
cp.emit_baseline, and returns overall_exit; ensure try/finally still guarantees
runtime.down only if runtime was started and not dry_run, thread through needed
values (runtime, endpoints, group_results, reports_dir, args) and keep existing
exit-code logic and logging messages unchanged.
- Line 60: Create a module-level constant (e.g., PREVIOUS_SUMMARY_FILENAME =
"previous-summary.json") in tests/rig/cli.py and replace the repeated literal
occurrences (referenced in the add_argument call where
pr.add_argument("--baseline", type=Path, default=REPO_ROOT / "reports" /
"previous-summary.json") and the other two occurrences at the same file) with
Path(REPO_ROOT / "reports" / PREVIOUS_SUMMARY_FILENAME) or Path(REPO_ROOT,
"reports", PREVIOUS_SUMMARY_FILENAME) so all uses reference the single constant
and avoid duplication/drift.
- Around line 252-253: The dry-run path currently does a per-plan "continue"
(args.dry_run) but still falls through to the post-plan evaluation that computes
gaps/regressions (the block that inspects the runs/gaps and returns non-zero),
causing false failures; change the control flow so that after planning is
complete (i.e., after the code that builds the runs/plan set) if args.dry_run is
true the function immediately short-circuits to success (return 0 or
sys.exit(0)) and does not execute the gaps/regressions evaluation block (the
code that inspects runs and computes regressions), and remove/replace the
per-iteration continue so dry-run does not leave an empty runs set for later
checks.
In `@tests/rig/coverage.py`:
- Around line 30-36: The subprocess calls that run "coverage combine/xml/html"
should be made path-safe and fail-fast: build the combine arguments using
absolute paths from reports_dir (e.g., list(map(str, [p.resolve() for p in
reports_dir.glob(".coverage.*")]))) and only call subprocess.run for combine
when that list is non-empty; remove cwd=reports_dir and instead pass absolute
output paths like str((reports_dir / "coverage.xml").resolve()) and
str((reports_dir / "htmlcov").resolve()); also set check=True on all
subprocess.run calls so failures raise instead of being silently ignored. Ensure
these changes are applied to the subprocess.run invocations shown (the calls to
subprocess.run for "coverage combine", "coverage xml", and "coverage html").
In `@tests/rig/critical_paths.py`:
- Around line 91-93: The baseline handling for regression evaluation is fragile:
update the previously_covered computation to defensively validate baseline and
its "covered_paths" before using it — check that baseline is a dict, read
covered = baseline.get("covered_paths", []) only if isinstance(baseline, dict),
ensure covered is an iterable/list (otherwise treat as empty), then build
previously_covered = set(str(p) for p in covered if isinstance(p, (str, int)))
so non-list shapes or non-string entries won't raise or silently miscompute
regressions; refer to the previously_covered variable and the
baseline["covered_paths"] access in tests/rig/critical_paths.py when making the
change.
- Around line 50-55: The code currently converts p.get("covered_by") directly to
tuple when building the CriticalPath, which turns a scalar string into a tuple
of characters; before converting, check the type of p.get("covered_by") (call it
covered = p.get("covered_by")) and if covered is not None and not an iterable
sequence like list/tuple, raise a clear ValueError (or TypeError) stating that
"covered_by must be a sequence", otherwise set covered_by=tuple(covered) (or
tuple() when covered is falsy); update the CriticalPath construction site to use
this validated/converted value so downstream validation receives a proper
sequence.
In `@tests/rig/groups.py`:
- Around line 159-165: The validation currently only checks existence for wd
(from g.absolute_workdir()) and each p in g.absolute_paths(), but doesn't ensure
they are inside the repository root; update the checks to resolve symlinks and
enforce repository boundaries: resolve wd and each p via
Path.resolve(strict=False) (to handle non-existing candidates if needed),
resolve the repository root once (repo_root.resolve()), then verify that
resolved_wd and each resolved_p are children of repo_root (using
Path.is_relative_to or comparing parents in a loop); if any resolved path is
outside repo_root, raise the same ValueError(f"group {name!r}: test path does
not exist: {p}") or a clearer message; keep the existing existence check
(p.exists()) but perform the containment check after resolving symlinks.
In `@tests/rig/reporting.py`:
- Around line 48-50: The current check returns None when junit_path doesn't
exist, which hides groups that have exit.txt; change the branch so that if
junit_path.exists() is False but exit_path.exists() is True you return a
zero-count GroupResult (preserving the group's name/status) instead of None.
Locate the code that references junit_path and exit_path and instantiate/return
a GroupResult (using the same constructor/signature used elsewhere in this file)
with an empty test list or zero counts so the group remains visible in
summaries.
- Line 18: Replace the stdlib XML parser import "import xml.etree.ElementTree as
ET" with the hardened parser "from defusedxml import ElementTree as ET" and
ensure all subsequent uses of ET (e.g., ET.parse, ET.fromstring) in the module
use this defusedxml.ElementTree API so JUnit artifacts are parsed with
XXE/XML-bomb protections; install defusedxml as a dependency if not present.
In `@tests/rig/runtime.py`:
- Around line 208-216: The helper _responds currently returns True when the
requests package is missing, which falsely marks services healthy; update the
ImportError branch in the _responds function so that if importing requests fails
it returns False (or raises a clear error) instead of True, keeping the existing
behavior for requests.get and requests.RequestException handling intact;
reference the _responds function to locate and change the ImportError handling.
- Around line 214-216: The health check currently treats any response with
resp.status_code < 500 as healthy, which wrongly accepts 4xx errors; update the
check so only successful responses are considered healthy (e.g., use resp.ok or
test 200 <= resp.status_code < 400) after the requests.get call and before
returning; keep the existing exception handling for requests.RequestException
unchanged but return False on exceptions so failures are not reported as
healthy.
In `@tests/rig/selection.py`:
- Around line 68-69: Replace the silent swallow in the except block that
currently returns an empty set when git diff fails (the except
(subprocess.CalledProcessError, FileNotFoundError): return set() in
tests/rig/selection.py) with raising a RuntimeError that includes the command
context and original exception message so callers can distinguish "no changes"
from a git error; ensure the raised error mentions the --changed-only scenario
and the underlying exception (e.g., include subprocess.CalledProcessError or
FileNotFoundError info) so callers like cmd_expand and cmd_run in cli.py will
surface a clear terminal error.
---
Nitpick comments:
In `@tests/rig/groups.py`:
- Around line 91-94: load_groups() currently accepts any manifest without
checking its version; add an explicit validation after the top-level raw check
to read raw.get("version") and if it's missing or not equal to 1 raise a
ValueError referencing manifest_path (e.g. "{manifest_path}: unsupported or
missing manifest version, expected version: 1"). Make the check inside the same
function (load_groups) right before reading defaults/raw["groups"] so unknown
schema versions fail fast.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 562d1a0f-86eb-454d-a64d-6eb6c1276fef
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (28)
.github/scripts/combine-test-reports.sh.github/workflows/ci-test-e2e.yaml.github/workflows/ci-test.yaml.gitignore.pre-commit-config.yamlpyproject.tomltests/README.mdtests/__init__.pytests/compose/docker-compose.test.yamltests/conftest.pytests/critical_paths.yamltests/e2e/__init__.pytests/e2e/conftest.pytests/e2e/smoke/__init__.pytests/e2e/smoke/test_smoke.pytests/fixtures/.gitkeeptests/groups.yamltests/integration/__init__.pytests/rig/__init__.pytests/rig/__main__.pytests/rig/cli.pytests/rig/coverage.pytests/rig/critical_paths.pytests/rig/groups.pytests/rig/reporting.pytests/rig/runtime.pytests/rig/selection.pytox.ini
|
| Filename | Overview |
|---|---|
| tests/rig/cli.py | Core rig dispatcher; scope_groups computation before optional-placeholder filtering can cause false regression classification and permanently block --update-baseline. |
| tests/rig/selection.py | Group resolution logic; union semantics when --tier is combined with named positionals can produce surprising selection set, though the practical effect is masked by optional-group skipping. |
| tests/groups.yaml | Group manifest; e2e-smoke and integration-workflow-execution both carry cross-tier depends_on [unit-sdk1, unit-workers], pulling unit groups into e2e and integration tier runs. |
| tests/rig/runtime.py | Platform runtime drivers; x2text_url is now correctly included in both the subprocess env and the readiness probe; minio uses unpinned latest tag (flagged in prior threads). |
| .github/workflows/ci-test.yaml | CI unit+integration workflow; --update-baseline only fires on main but is blocked whenever --fail-on-critical-gap causes overall_exit=1 (known 6-gap situation), so the baseline cache stays empty. |
| .github/workflows/ci-test-e2e.yaml | E2e CI workflow; no longer passes the all positional (prior thread fixed); correctly relies on --tier e2e to scope the run. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[tox -e unit/integration/e2e] --> B[tests.rig run --tier X]
B --> C[resolve: select groups]
C --> D[manifest.expand: add transitive deps]
D --> E[scope_groups = set of ordered groups]
E --> F[filter runnable vs skipped optional]
F --> G{needs_platform?}
G -- yes --> H[pick_runtime / ComposeRuntime.up]
G -- no --> I[execute groups sequentially]
H --> I
I --> J[_execute_group: uv run pytest / hurl]
J --> K[writes junit.xml, report.md, exit.txt]
K --> L[parse_junit → GroupResult]
L --> M[_green_group_names: pass or empty]
M --> N[evaluate: covered / gap / regression]
N --> O[write_summary: summary.md + json]
O --> P{update_baseline AND overall_exit==0?}
P -- yes --> Q[merge_into_baseline]
P -- no --> R[baseline unchanged]
Q --> S[end]
R --> S
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
tests/rig/cli.py:345-347
`scope_groups` is set from `ordered` **before** optional-placeholder groups are filtered out into `skipped`. If a group was previously run green (and thus appears in the baseline), is now marked `optional: true`, and its paths have since been removed, it will remain in `scope_groups` but never enter `green`. On the next CI run that includes this group via dep-expansion, `evaluate()` sees: `path.id in previously_covered AND in_scope AND not covered` → `state = "regression"`. Because regressions set `overall_exit = 1`, `--update-baseline` is permanently blocked (its guard is `overall_exit == 0`), creating a stuck state that can only be resolved by manually deleting the cache. Filtering `scope_groups` to exclude the groups the rig actually intends to run fixes this.
```suggestion
# scope_groups must exclude optional placeholders that were skipped:
# if a skipped group was previously in the baseline, keeping it in scope
# would classify its covered paths as regressions even though the group
# never ran, blocking --update-baseline and creating a stuck state.
runnable_pre = {n for n in ordered if not _is_missing_placeholder(manifest.get(n))}
scope_groups = runnable_pre
reports_dir: Path = args.reports_dir
```
### Issue 2 of 2
tests/rig/selection.py:52-64
**`--tier` acts as expansion, not filter, when named positionals are provided**
When `--tier` is set and the positionals are specific group names (not `"all"`), the code always calls `selected.update(tier_set)`, which unions all tier groups into the selection rather than intersecting. Concretely, `tox -e e2e -- e2e-smoke` resolves to every e2e-tier group, not just `e2e-smoke`. The PR description lists this exact invocation as "smoke against docker compose", so users following it will silently run more than expected. Today the extra groups are optional/missing and get skipped, so the output happens to be correct — but any future non-optional e2e group would be unexpectedly included.
Reviews (9): Last reviewed commit: "[MISC] Forward + probe x2text_url (Grept..." | Re-trigger Greptile
Addresses the 5-agent review of PR #1971. Highlights: Critical fixes: - Wire --cov per group via new `coverage_source` field; pytest-cov now actually produces htmlcov/ and coverage.xml. - Switch CI to run unit and integration tiers as separate steps and merge (rather than overwrite) baselines so the second tier doesn't erase the first's covered paths. - Include github.run_id in baseline cache key so actions/cache writes a fresh entry every main build (it only saves on miss). - Hurl exit 5 (no tests collected) now synthesises failures=0 instead of 1. - Move runtime.up() inside try/finally so partial container stacks get torn down on failure. - Always fold subprocess exit codes into overall_exit, even when junit.xml is missing (segfault / OOM / missing binary). Important fixes: - Shim script uses python3 (CI runners don't ship python). - Drop `all` from ci-test-e2e.yaml so the env name matches what runs; resolve() now intersects `all` with --tier rather than unioning. - Disable --md-report under xdist (incompatible). - Add [tool.coverage.paths] mapping so combined XML/HTML resolves correctly across per-workdir runs. - README: testcontainers stub status is now stated honestly; section count fixed. - Honest covered_by: [] on adapter-register-llm. - Add `tox -e rig -- validate` step to both workflows. - Smoke test asserts UNSTRACT_BACKEND_URL matches the fixture URL. - Surface git stderr on --changed-only failures. - Guard parse_junit against malformed/truncated XML and missing counters. - Log testcontainers down() failures instead of swallowing. - ruff format + check clean on tests/. Type / design: - Literal types for Tier, Runner, CriticalPathState, ResultStatus. - ClassVar[str] on PlatformRuntime; freeze GroupResult. - Drop PlatformEndpoints.extras (never read) for typed InfraEndpoints. - Merge PlatformAccess into PlatformEndpoints (single source of truth). - by_id cache in CriticalPathRegistry.__post_init__. - Tighten unit-backend.paths from `[.]` to explicit per-app tests/ dirs. Self-tests: - New unit-rig group covering tests/rig/* logic (25 tests, ~0.1s): cycle detection, unknown dep, topo expansion, tier/runner validation, platform-smoke-dep invariant, real-manifest sanity, evaluate state machine, baseline merge (not overwrite), junit edge cases (testsuites wrapper, exit 5, missing counters, malformed XML), selection precedence (all+tier intersect, empty selection, from-file merging). Verified locally: - `python -m tests.rig validate` → OK (16 groups, 10 critical paths). - `python -m tests.rig run --no-parallel unit-rig` → 25 passed; emits htmlcov/, coverage.xml (42.5%), summary.{md,json}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (7)
tests/rig/runtime.py (2)
265-267:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTighten health criteria to avoid false-positive readiness.
Line 266 accepts 4xx responses as healthy; a misrouted or unauthorized
/healthendpoint can incorrectly pass readiness.💡 Suggested fix
def _responds(url: str, requests_mod) -> bool: try: resp = requests_mod.get(url, timeout=2) - return resp.status_code < 500 + return 200 <= resp.status_code < 300 except requests_mod.RequestException: return False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 265 - 267, The health check currently treats any status_code < 500 as healthy which allows 4xx responses; in the try block where you call requests_mod.get(url, timeout=2) and assign resp, tighten the success criterion to only accept 2xx responses (e.g., check resp.status_code >= 200 and resp.status_code < 300) and return False for other statuses; keep the existing except requests_mod.RequestException handler unchanged so network errors still return False.
243-247:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t skip readiness checks when
requestsis unavailable.Line 246 currently turns a broken rig environment into a soft pass for readiness, which can mask startup failures until much later.
💡 Suggested fix
try: import requests - except ImportError: - log.warning("`requests` not installed; skipping platform readiness probe") - return + except ImportError as exc: + raise RuntimeError( + "requests is required for platform readiness probes in ComposeRuntime" + ) from exc🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 243 - 247, The readiness probe swallows a missing `requests` dependency by logging a warning and returning, which masks startup failures; in the except ImportError block around the `import requests` in tests/rig/runtime.py (the try/except that currently calls `log.warning("`requests` not installed; skipping platform readiness probe")`), change behavior to fail fast: replace the soft return with an explicit failure (e.g., log an error and raise ImportError or RuntimeError with a clear message instructing to install `requests`) so the readiness check cannot be skipped silently.tests/rig/coverage.py (1)
61-63:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake
coverage combinepath handling robust for relativereports_dir.When
reports_diris relative, the currentcwd=reports_dir+ prefixed file args can point to the wrong location (e.g.,reports/reports/.coverage.*), so combine may fail and coverage artifacts are dropped.💡 Suggested fix
result = subprocess.run( - [*base, "combine", *[str(p) for p in files]], - cwd=reports_dir, + [*base, "combine", *[str(p.resolve()) for p in files]], + cwd=REPO_ROOT, capture_output=True, text=True, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/coverage.py` around lines 61 - 63, The coverage combine invocation can end up with duplicated/incorrect paths when reports_dir is relative; update the subprocess.run call around result, base, files, and reports_dir so the file arguments are either basenames when using cwd=reports_dir or are absolute paths and cwd is omitted. Concretely: either resolve reports_dir to an absolute Path and pass file args as [p.name for p in files] when calling subprocess.run([*base, "combine", *...], cwd=reports_dir), or keep cwd None and pass absolute paths for each file (e.g., map each file through Path(reports_dir).joinpath(p.name).resolve()) so subprocess.run(..., cwd=...) no longer produces wrong combined paths. Ensure changes reference subprocess.run, result, base, files, and reports_dir.tests/rig/reporting.py (2)
18-18:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a hardened XML parser for JUnit artifact parsing.
xml.etree.ElementTreeis not the right default for untrusted CI artifacts. Switch todefusedxml.ElementTreehere.💡 Proposed fix
-import xml.etree.ElementTree as ET +from defusedxml import ElementTree as ET#!/bin/bash # Verify XML parser usage and whether defusedxml is already declared. set -euo pipefail rg -n "xml\.etree\.ElementTree|defusedxml" tests/rig/reporting.py pyproject.toml requirements*.txt tox.ini 2>/dev/null || trueAlso applies to: 73-74
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/reporting.py` at line 18, Replace the insecure import "import xml.etree.ElementTree as ET" with the hardened parser "import defusedxml.ElementTree as ET" (and update any other occurrences around the other import at lines ~73-74) so JUnit artifact parsing uses defusedxml; also ensure defusedxml is added to project dependencies (pyproject.toml/requirements) so the import resolves at runtime.
67-68:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep groups visible when
exit.txtexists butjunit.xmlis missing.Returning
Nonehere drops the group from summaries, which can hide failed/aborted runs.💡 Proposed fix
junit_path = reports_dir / group_name / "junit.xml" exit_path = reports_dir / group_name / "exit.txt" if not junit_path.exists(): - return None + if not exit_path.exists(): + return None + return GroupResult( + name=group_name, + tier=tier, + exit_code=_read_exit_code(exit_path), + passed=0, + failed=0, + errors=0, + skipped=0, + duration_seconds=0.0, + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/reporting.py` around lines 67 - 68, The current check returns None when junit_path doesn't exist, which drops the group from summaries; instead, update the code that reads JUnit to preserve the group when exit.txt exists by not returning None for a missing junit file—use a placeholder/empty JUnit result or a minimal object indicating “no junit” so downstream summary logic still includes the group; modify the branch around junit_path.exists() (the junit_path check and the function that returns its result) to return an empty/placeholder test-report structure or a sentinel that keeps the group visible rather than None.tests/README.md (1)
13-13:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd language identifiers to fenced code blocks.
These fences should use an explicit language (e.g.,
text) to satisfy markdownlint MD040.💡 Proposed fix
-``` +```text tests/ ... -``` +``` -``` +```text positional GROUPS ∪ --from-file lines ∪ --tier filter ∪ --changed-only diff -``` +``` -``` +```text reports/ ... -``` +```Also applies to: 131-131, 162-162
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/README.md` at line 13, Fenced code blocks in tests/README.md are missing language identifiers (markdownlint MD040); add the explicit language specifier `text` to each triple-backtick fence that wraps the examples (the blocks containing "tests/", the one with "positional GROUPS ∪ --from-file lines ∪ --tier filter ∪ --changed-only diff", and the block with "reports/"), and apply the same change to the other occurrences noted (around the blocks at the other two locations referenced in the comment).tests/rig/cli.py (1)
319-320:⚠️ Potential issue | 🟠 Major | ⚡ Quick winShort-circuit
--dry-runbefore critical-path evaluation.Plan-only mode currently can still return failure from regression/gap evaluation despite executing nothing.
💡 Proposed fix
finally: if runtime is not None and not args.dry_run: print(f"[rig] tearing platform down (runtime={runtime.name})") runtime.down() + if args.dry_run: + return 0 + if args.coverage and not args.dry_run: combine_and_report(reports_dir)Also applies to: 347-379
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/cli.py` around lines 319 - 320, The loop currently checks args.dry_run only after starting critical-path checks, which allows regression/gap evaluation to set failure state; move the args.dry_run short-circuit to the top of the loop (or before any calls that perform regression/gap evaluation) so the iteration immediately continues/returns without invoking those evaluators, and ensure any status-aggregating variables (exit codes, failure flags) are not modified when args.dry_run is true; apply the same change for the other similar block covering the regression/gap evaluation region (the section corresponding to lines 347-379).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/README.md`:
- Line 13: Fenced code blocks in tests/README.md are missing language
identifiers (markdownlint MD040); add the explicit language specifier `text` to
each triple-backtick fence that wraps the examples (the blocks containing
"tests/", the one with "positional GROUPS ∪ --from-file lines ∪ --tier
filter ∪ --changed-only diff", and the block with "reports/"), and apply the
same change to the other occurrences noted (around the blocks at the other two
locations referenced in the comment).
In `@tests/rig/cli.py`:
- Around line 319-320: The loop currently checks args.dry_run only after
starting critical-path checks, which allows regression/gap evaluation to set
failure state; move the args.dry_run short-circuit to the top of the loop (or
before any calls that perform regression/gap evaluation) so the iteration
immediately continues/returns without invoking those evaluators, and ensure any
status-aggregating variables (exit codes, failure flags) are not modified when
args.dry_run is true; apply the same change for the other similar block covering
the regression/gap evaluation region (the section corresponding to lines
347-379).
In `@tests/rig/coverage.py`:
- Around line 61-63: The coverage combine invocation can end up with
duplicated/incorrect paths when reports_dir is relative; update the
subprocess.run call around result, base, files, and reports_dir so the file
arguments are either basenames when using cwd=reports_dir or are absolute paths
and cwd is omitted. Concretely: either resolve reports_dir to an absolute Path
and pass file args as [p.name for p in files] when calling
subprocess.run([*base, "combine", *...], cwd=reports_dir), or keep cwd None and
pass absolute paths for each file (e.g., map each file through
Path(reports_dir).joinpath(p.name).resolve()) so subprocess.run(..., cwd=...) no
longer produces wrong combined paths. Ensure changes reference subprocess.run,
result, base, files, and reports_dir.
In `@tests/rig/reporting.py`:
- Line 18: Replace the insecure import "import xml.etree.ElementTree as ET" with
the hardened parser "import defusedxml.ElementTree as ET" (and update any other
occurrences around the other import at lines ~73-74) so JUnit artifact parsing
uses defusedxml; also ensure defusedxml is added to project dependencies
(pyproject.toml/requirements) so the import resolves at runtime.
- Around line 67-68: The current check returns None when junit_path doesn't
exist, which drops the group from summaries; instead, update the code that reads
JUnit to preserve the group when exit.txt exists by not returning None for a
missing junit file—use a placeholder/empty JUnit result or a minimal object
indicating “no junit” so downstream summary logic still includes the group;
modify the branch around junit_path.exists() (the junit_path check and the
function that returns its result) to return an empty/placeholder test-report
structure or a sentinel that keeps the group visible rather than None.
In `@tests/rig/runtime.py`:
- Around line 265-267: The health check currently treats any status_code < 500
as healthy which allows 4xx responses; in the try block where you call
requests_mod.get(url, timeout=2) and assign resp, tighten the success criterion
to only accept 2xx responses (e.g., check resp.status_code >= 200 and
resp.status_code < 300) and return False for other statuses; keep the existing
except requests_mod.RequestException handler unchanged so network errors still
return False.
- Around line 243-247: The readiness probe swallows a missing `requests`
dependency by logging a warning and returning, which masks startup failures; in
the except ImportError block around the `import requests` in
tests/rig/runtime.py (the try/except that currently calls
`log.warning("`requests` not installed; skipping platform readiness probe")`),
change behavior to fail fast: replace the soft return with an explicit failure
(e.g., log an error and raise ImportError or RuntimeError with a clear message
instructing to install `requests`) so the readiness check cannot be skipped
silently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 74802a43-b372-4ed8-a7bd-f3c2cd741b61
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (26)
.github/scripts/combine-test-reports.sh.github/workflows/ci-test-e2e.yaml.github/workflows/ci-test.yamlpyproject.tomltests/README.mdtests/compose/docker-compose.test.yamltests/conftest.pytests/critical_paths.yamltests/e2e/conftest.pytests/e2e/smoke/test_smoke.pytests/groups.yamltests/rig/__init__.pytests/rig/__main__.pytests/rig/cli.pytests/rig/coverage.pytests/rig/critical_paths.pytests/rig/groups.pytests/rig/reporting.pytests/rig/runtime.pytests/rig/selection.pytests/rig/tests/__init__.pytests/rig/tests/conftest.pytests/rig/tests/test_critical_paths.pytests/rig/tests/test_groups.pytests/rig/tests/test_reporting.pytests/rig/tests/test_selection.py
✅ Files skipped from review due to trivial changes (2)
- tests/rig/tests/conftest.py
- tests/conftest.py
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/rig/main.py
- tests/e2e/smoke/test_smoke.py
- tests/compose/docker-compose.test.yaml
- .github/workflows/ci-test.yaml
- .github/workflows/ci-test-e2e.yaml
- tests/critical_paths.yaml
- pyproject.toml
- tests/rig/groups.py
- tests/rig/critical_paths.py
…egrity Addresses the 5 new findings from the second review pass: N1+N2 (critical) — cross-tier regression false positives: - Namespace baseline caches per workflow (`unstract-test-baseline-ut-*` for ci-test.yaml; `unstract-test-baseline-e2e-*` for ci-test-e2e.yaml). Previously, ci-test.yaml's restore-keys prefix matched the e2e cache, pulling e2e-tier coverage into unit/integration where those groups never ran — every e2e-covered path lit up as a regression. - Add `scope_groups` to evaluate(); paths whose covered_by is fully outside the current invocation's scope are classified as `gap` (out of scope), not `regression`. cmd_run passes ordered groups as the scope. N3+N4 (critical) — silent baseline corruption: - merge_into_baseline and load_baseline now raise BaselineCorruptError on parse failure instead of treating the file as empty. The old behavior would erase the other tier's coverage on the next merge and demote real regressions to gaps on the build that most needed detection. N5 — smoke test tautology: - Old assertion compared platform.backend_url against the env var it was built from — guaranteed to pass. Replaced with a session-sentinel check: the rig stamps UNSTRACT_RIG_SESSION_ID (uuid4, set unconditionally so setdefault can't let a stale value win) into every group's env, and the smoke test asserts its presence. N6 — unit-rig measuring its own tests as covered: - Add [tool.coverage.run].omit for `**/test_*.py`, `**/*_tests.py`, `**/tests.py`. Keeps the source modules under tests/rig/ measurable while excluding the test files themselves. N7 — gate validator silently disabled if smoke renamed: - `_validate_platform_groups_depend_on_gate` now raises when platform groups exist but the gate is undefined. Gate name overridable via `defaults.platform_gate_group` for forks. N8 — private cross-package import: - Promote `_endpoints_from_env` to `PlatformEndpoints.from_env()` classmethod; conftest.py imports the public name. N9 — teardown exceptions masking the original: - Wrap `runtime.down()` in cmd_run's finally with its own try/except so a teardown failure no longer hides the upstream platform-startup error. N10 — InfraEndpoints invariant: - __post_init__ asserts redis_host/redis_port (and rabbitmq_*) are set together; partial specs raise. N11 — comment polish: - Lift "smoke gates every platform group" rationale to a section banner above the e2e tier in groups.yaml. - Remove dead `emit_baseline` alias. - Rewrite unit-backend paths comment as a positive rule. Self-tests: 25 → 31 passing in 0.05s. New tests cover baseline corruption detection (both load and merge), scope-aware regression demotion, gate validation error paths, and custom gate names via defaults. Verified locally: - `python -m tests.rig validate` → OK. - `python -m tests.rig run --no-parallel unit-rig` → 31 pass; coverage XML + HTML emitted at ~32.5% on the rig itself (down from 42% because broader source files are now in scope). - `uvx ruff check --line-length 90 tests/` → clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (6)
tests/e2e/smoke/test_smoke.py (1)
33-35:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire a success response for the smoke health check.
This currently passes on 4xx responses, which can mask a broken deployment path in e2e smoke.
Suggested fix
def test_backend_health(platform) -> None: response = requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) - assert response.status_code < 500, f"backend /health returned {response.status_code}" + assert response.status_code == 200, f"backend /health returned {response.status_code}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/smoke/test_smoke.py` around lines 33 - 35, The health-check assertion in test_backend_health currently allows 4xx responses; change it to require a successful 2xx response by asserting response.status_code is in the 200–299 range (e.g., response.status_code >= 200 and response.status_code < 300) when calling requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) inside the test_backend_health function so failures surface correctly.tests/rig/groups.py (1)
185-190:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce repository-boundary checks for workdir and test paths.
Existence-only validation still allows resolved paths outside
REPO_ROOT.Suggested fix
def _validate_paths(groups: dict[str, GroupDefinition]) -> None: + repo_root = REPO_ROOT.resolve() for name, g in groups.items(): @@ - wd = g.absolute_workdir() + wd = g.absolute_workdir().resolve() + if not wd.is_relative_to(repo_root): + raise ValueError(f"group {name!r}: workdir escapes repo root: {wd}") if not wd.exists(): raise ValueError(f"group {name!r}: workdir does not exist: {wd}") - for p in g.absolute_paths(): + for p in g.absolute_paths(): + p = p.resolve() + if not p.is_relative_to(repo_root): + raise ValueError(f"group {name!r}: test path escapes repo root: {p}") if not p.exists(): raise ValueError(f"group {name!r}: test path does not exist: {p}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/groups.py` around lines 185 - 190, The current validation in the block using g.absolute_workdir() and g.absolute_paths() only checks existence and allows paths outside REPO_ROOT; update the checks to also ensure the resolved workdir and each resolved test path is inside the repository root (e.g., use Path.resolve() and Path.is_relative_to(REPO_ROOT) or compare pathlib.Path(REPO_ROOT).resolve() with p.resolve().absolute() via commonpath), and if a path is outside REPO_ROOT raise a ValueError with a clear message referencing the offending path and group name; apply this logic for wd (from g.absolute_workdir()) and every p from g.absolute_paths() to enforce repository-boundary constraints.tests/rig/cli.py (1)
345-346:⚠️ Potential issue | 🟠 Major | ⚡ Quick winShort-circuit
--dry-runbefore baseline evaluation and gating.Plan-only runs currently fall through into regression/gap checks and can return non-zero without executing any group.
Suggested fix
finally: if runtime is not None and not args.dry_run: @@ except Exception as exc: print( f"[rig] teardown failed (runtime={runtime.name}): {exc}", file=sys.stderr, ) + + if args.dry_run: + return 0 if args.coverage and not args.dry_run: combine_and_report(reports_dir)Also applies to: 382-421
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/cli.py` around lines 345 - 346, Short-circuit handling of the --dry-run flag: in the main CLI loop (where args.dry_run is checked) move/extend the check so that if args.dry_run is true you skip baseline evaluation and gating logic (the calls to evaluate_baseline and gate_groups) and exit successfully (return/exit 0) instead of falling through; ensure you reference and guard both places mentioned (around the current args.dry_run check and the block that calls evaluate_baseline and gate_groups) so plan-only runs never perform regression/gap checks or return non-zero.tests/rig/critical_paths.py (2)
69-75:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
covered_bytype before tuple conversion.If YAML provides a scalar (e.g.,
"unit-api"), this becomes a tuple of characters and corrupts coverage semantics.Suggested fix
from typing import Any, Literal @@ def load_critical_paths(path: Path | None = None) -> CriticalPathRegistry: @@ CriticalPath( id=str(p["id"]), description=str(p.get("description", "")), entry=str(p.get("entry", "")), - covered_by=tuple(p.get("covered_by") or ()), + covered_by=_parse_covered_by(p.get("covered_by"), path_id=str(p.get("id", "<unknown>"))), ) for p in raw["paths"] ) ) + + +def _parse_covered_by(value: Any, *, path_id: str) -> tuple[str, ...]: + if value is None: + return () + if not isinstance(value, (list, tuple)): + raise ValueError(f"critical path {path_id!r}: `covered_by` must be a list") + return tuple(str(v) for v in value)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/critical_paths.py` around lines 69 - 75, The generator building CriticalPath instances incorrectly converts a scalar covered_by (e.g., "unit-api") into a tuple of characters; update the covered_by expression in the CriticalPath construction to validate the type: if p.get("covered_by") is already a list/tuple use tuple(value), if it's a non-empty scalar (like str) wrap it as (str(value),), and if None/empty produce an empty tuple. Locate the generator over raw["paths"] and change the covered_by handling accordingly so only iterable lists/tuples are expanded and scalars become single-element tuples.
121-123:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDefensively validate baseline shape before consuming
covered_paths.Current logic assumes a dict with list-like
covered_paths; malformed but parseable JSON can miscompute regressions or raise unexpected exceptions during merge.Suggested fix
def evaluate( @@ - previously_covered: set[str] = set( - (baseline or {}).get("covered_paths", []) if baseline else [] - ) + covered_raw: list[Any] = [] + if isinstance(baseline, dict): + candidate = baseline.get("covered_paths", []) + if isinstance(candidate, list): + covered_raw = candidate + previously_covered: set[str] = {p for p in covered_raw if isinstance(p, str)} @@ def merge_into_baseline(statuses: list[CriticalPathStatus], destination: Path) -> None: @@ - parsed = json.loads(destination.read_text()) - existing = set(parsed.get("covered_paths") or []) + parsed = json.loads(destination.read_text()) + if not isinstance(parsed, dict): + raise BaselineCorruptError( + f"refusing to merge into corrupt baseline {destination}: " + "expected JSON object with `covered_paths` list. Delete cache and re-run." + ) + raw = parsed.get("covered_paths") or [] + if not isinstance(raw, list): + raise BaselineCorruptError( + f"refusing to merge into corrupt baseline {destination}: " + "`covered_paths` must be a list. Delete cache and re-run." + ) + existing = {p for p in raw if isinstance(p, str)}Also applies to: 169-171
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/critical_paths.py` around lines 121 - 123, The code assumes `baseline` is a dict with a list-like `covered_paths` which can raise or corrupt results; update the logic around the `previously_covered` assignment to defensively validate the shape: check `isinstance(baseline, dict)` and `isinstance(baseline.get("covered_paths"), list)` before using it, coerce into a set filtering only string entries (e.g., {p for p in baseline["covered_paths"] if isinstance(p, str)}) and otherwise use an empty set; apply the same defensive pattern where `covered_paths` is consumed again (the same block referenced at lines 169-171) to avoid exceptions or bad data propagation.tests/rig/runtime.py (1)
264-268:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake readiness probing fail-closed instead of pass-when-broken.
Skipping probes when
requestsis unavailable and accepting 4xx as healthy both create false-positive “ready” states.Suggested fix
def _wait_ready(endpoints: PlatformEndpoints, *, timeout_seconds: int = 300) -> None: @@ try: import requests except ImportError: - log.warning("`requests` not installed; skipping platform readiness probe") - return + raise RuntimeError( + "`requests` is required for platform readiness checks in e2e runtime" + ) @@ def _responds(url: str, requests_mod) -> bool: try: resp = requests_mod.get(url, timeout=2) - return resp.status_code < 500 + return 200 <= resp.status_code < 300 except requests_mod.RequestException: return FalseAlso applies to: 287-287
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 264 - 268, The readiness probe currently skips when the requests package is unavailable and treats 4xx responses as healthy; change it to fail-closed by making the ImportError path error out (log.error and return/raise a non-ready result) instead of returning early, and update the HTTP-status handling so only 2xx responses are considered healthy (treat 3xx/4xx/5xx as unhealthy). Specifically, modify the try/except that imports requests to report failure and propagate non-ready status, and adjust the probe logic that currently accepts 4xx responses so it only accepts status codes in the 200-299 range.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/e2e/smoke/test_smoke.py`:
- Around line 33-35: The health-check assertion in test_backend_health currently
allows 4xx responses; change it to require a successful 2xx response by
asserting response.status_code is in the 200–299 range (e.g.,
response.status_code >= 200 and response.status_code < 300) when calling
requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) inside
the test_backend_health function so failures surface correctly.
In `@tests/rig/cli.py`:
- Around line 345-346: Short-circuit handling of the --dry-run flag: in the main
CLI loop (where args.dry_run is checked) move/extend the check so that if
args.dry_run is true you skip baseline evaluation and gating logic (the calls to
evaluate_baseline and gate_groups) and exit successfully (return/exit 0) instead
of falling through; ensure you reference and guard both places mentioned (around
the current args.dry_run check and the block that calls evaluate_baseline and
gate_groups) so plan-only runs never perform regression/gap checks or return
non-zero.
In `@tests/rig/critical_paths.py`:
- Around line 69-75: The generator building CriticalPath instances incorrectly
converts a scalar covered_by (e.g., "unit-api") into a tuple of characters;
update the covered_by expression in the CriticalPath construction to validate
the type: if p.get("covered_by") is already a list/tuple use tuple(value), if
it's a non-empty scalar (like str) wrap it as (str(value),), and if None/empty
produce an empty tuple. Locate the generator over raw["paths"] and change the
covered_by handling accordingly so only iterable lists/tuples are expanded and
scalars become single-element tuples.
- Around line 121-123: The code assumes `baseline` is a dict with a list-like
`covered_paths` which can raise or corrupt results; update the logic around the
`previously_covered` assignment to defensively validate the shape: check
`isinstance(baseline, dict)` and `isinstance(baseline.get("covered_paths"),
list)` before using it, coerce into a set filtering only string entries (e.g.,
{p for p in baseline["covered_paths"] if isinstance(p, str)}) and otherwise use
an empty set; apply the same defensive pattern where `covered_paths` is consumed
again (the same block referenced at lines 169-171) to avoid exceptions or bad
data propagation.
In `@tests/rig/groups.py`:
- Around line 185-190: The current validation in the block using
g.absolute_workdir() and g.absolute_paths() only checks existence and allows
paths outside REPO_ROOT; update the checks to also ensure the resolved workdir
and each resolved test path is inside the repository root (e.g., use
Path.resolve() and Path.is_relative_to(REPO_ROOT) or compare
pathlib.Path(REPO_ROOT).resolve() with p.resolve().absolute() via commonpath),
and if a path is outside REPO_ROOT raise a ValueError with a clear message
referencing the offending path and group name; apply this logic for wd (from
g.absolute_workdir()) and every p from g.absolute_paths() to enforce
repository-boundary constraints.
In `@tests/rig/runtime.py`:
- Around line 264-268: The readiness probe currently skips when the requests
package is unavailable and treats 4xx responses as healthy; change it to
fail-closed by making the ImportError path error out (log.error and return/raise
a non-ready result) instead of returning early, and update the HTTP-status
handling so only 2xx responses are considered healthy (treat 3xx/4xx/5xx as
unhealthy). Specifically, modify the try/except that imports requests to report
failure and propagate non-ready status, and adjust the probe logic that
currently accepts 4xx responses so it only accepts status codes in the 200-299
range.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cacbb976-fbc3-43a1-8c48-556550c3014f
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
.github/workflows/ci-test-e2e.yaml.github/workflows/ci-test.yamlpyproject.tomltests/e2e/conftest.pytests/e2e/smoke/test_smoke.pytests/groups.yamltests/rig/cli.pytests/rig/critical_paths.pytests/rig/groups.pytests/rig/runtime.pytests/rig/tests/test_critical_paths.pytests/rig/tests/test_groups.py
🚧 Files skipped from review as they are similar to previous changes (4)
- .github/workflows/ci-test-e2e.yaml
- pyproject.toml
- tests/groups.yaml
- .github/workflows/ci-test.yaml
R1 (critical) — cmd_run returned early on BaselineCorruptError BEFORE write_summary, so a green build with one corrupt baseline lost all per-group visibility. Restructured so write_summary always runs; baseline-corrupt now just flips overall_exit and falls through. R2 (important) — TestcontainersRuntime.up() built InfraEndpoints AFTER the cleanup try/except, so a __post_init__ failure leaked the four started containers. Moved construction inside the try (and the return with it) so the class invariant is self-cleaning at its construction site. R3 (important) — added tests/rig/tests/test_cli.py covering three previously- untested behaviors of cmd_run: - scope_groups is the dep-expanded selection (guards N2 wiring) - down() failure doesn't mask up() exception (guards N9 try/except) - write_summary runs even when load_baseline raises (guards R1) R4 (important) — evaluate's groups_run_green and scope_groups annotations loosened from set[str] to Collection[str]. Function only reads them (`g in ...`), so accepting frozenset/list/tuple from callers is fine. Hot path still pays the O(1) tax via internal set() conversion. R5 (suggestion, deferred) — InfraEndpoints shape asymmetry (redis/rabbitmq host+port pairs vs. postgres/minio collapsed URL/endpoint strings). Picking a single shape would be cleaner but neither option has a clear win without a real second consumer of the infra block; revisit when one appears. R7 (suggestion) — fixed misleading comment: scope_groups includes dep-expanded groups, not just "what the user asked for". R8 (suggestion) — InfraEndpoints docstring now honestly describes which fields the __post_init__ invariant applies to. R9 (suggestion) — strengthened scope test with a "straddling" path (covered_by=[unit, e2e]) so a weaker implementation checking against groups_run_green instead of covered_by would be caught. R10 (suggestion) — InfraEndpoints redis_port/rabbitmq_port: str → int. Cast at the testcontainers boundary instead of pre-stringifying. Verified: - 34/34 self-tests pass (up from 31). - `python -m tests.rig validate` → OK. - `python -m tests.rig run --no-parallel unit-rig` → 34 pass; coverage emitted. - `uvx ruff check --line-length 90 tests/` → clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/rig/runtime.py (1)
288-293: 💤 Low valueHealth check accepts 4xx responses as healthy.
resp.status_code < 500treats 4xx (e.g., 404, 401) as ready. Consider tightening toresp.okor200 <= resp.status_code < 300to catch endpoint misconfiguration early.def _responds(url: str, requests_mod) -> bool: try: resp = requests_mod.get(url, timeout=2) - return resp.status_code < 500 + return resp.ok except requests_mod.RequestException: return False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/rig/runtime.py` around lines 288 - 293, The health check in _responds currently treats any status_code < 500 as healthy, which accepts 4xx responses; update the function (inside _responds where requests_mod.get is called and exceptions are caught) to return only true for successful 2xx responses—use resp.ok or explicitly check 200 <= resp.status_code < 300—instead of resp.status_code < 500, preserving the timeout and existing RequestException handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/rig/runtime.py`:
- Around line 288-293: The health check in _responds currently treats any
status_code < 500 as healthy, which accepts 4xx responses; update the function
(inside _responds where requests_mod.get is called and exceptions are caught) to
return only true for successful 2xx responses—use resp.ok or explicitly check
200 <= resp.status_code < 300—instead of resp.status_code < 500, preserving the
timeout and existing RequestException handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9868dd1c-57b8-4c78-8ef6-d886af0692e6
📒 Files selected for processing (5)
tests/rig/cli.pytests/rig/critical_paths.pytests/rig/runtime.pytests/rig/tests/test_cli.pytests/rig/tests/test_critical_paths.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/rig/critical_paths.py
- tests/rig/tests/test_critical_paths.py
…rage
R4-A (important) — summary.md misrepresented regressions on corrupt baseline.
write_summary now accepts a baseline_corrupt flag; when set, summary.md (and
the sticky PR comment) emit a "Baseline cache was corrupt; regression
detection disabled this run" banner so reviewers don't read "gap" entries as
first-time gaps when they may actually be regressions hidden by the cache
failure. summary.json also exposes the flag.
R4-B (important) — added test_cmd_run_does_not_update_baseline_on_red_build:
proves the `--update-baseline` guard refuses to bake red-build state into
the cache. Without the guard, the next build wouldn't surface regressions.
R4-C (suggestion) — baseline corruption now flips overall_exit regardless of
whether the build was otherwise green. Previously red→corrupt was swallowed.
R4-D (suggestion) — added tests/rig/tests/test_runtime.py covering
InfraEndpoints partial-pair rejection (4 cases) and PlatformEndpoints.from_env
defaults. Closes the zero-coverage gap left by R5 deferral.
R4-E (suggestion) — monkeypatch rationale documented at top of test_cli.py
so the pattern's safety assumptions are explicit for future readers.
R4-F (suggestion) — simplified dead-code ternary in evaluate's
previously_covered initialization.
R4-G (suggestion) — scope_groups docstring "set" → "collection" to match
the R4 signature loosening.
R4-H (suggestion) — replaced internal review-tag references in test
docstrings ("guards N2/N9/R1") with invariant descriptions that stand on
their own post-merge.
R4-I (suggestion) — strengthened the corrupt-baseline test to assert the
summary.md content includes the banner, not just that the file exists; a
zero-byte file would no longer pass.
Verified:
- 40/40 self-tests pass (up from 34).
- `python -m tests.rig validate` → OK.
- End-to-end run with corrupt baseline emits the banner correctly.
- `uvx ruff check/format --line-length 90 tests/` → clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/rig/cli.py`:
- Around line 236-242: The current except block can hide the original exception
from runtime.up() if runtime.down() also raises; modify the handler around
runtime.up()/runtime.down() so you capture the original exception (e.g., store
it as exc_up), then call runtime.down() inside its own try/except, and always
re-raise the original exc_up regardless of whether runtime.down() fails
(optionally log the down() error). Specifically update the try/except that
invokes runtime.up() and runtime.down() to preserve and re-raise the original
up() exception while isolating any errors from runtime.down().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 31752086-90af-4796-aeab-f10b9e9a6e59
📒 Files selected for processing (5)
tests/rig/cli.pytests/rig/critical_paths.pytests/rig/reporting.pytests/rig/tests/test_cli.pytests/rig/tests/test_runtime.py
jaseemjaskp
left a comment
There was a problem hiding this comment.
Multi-agent review summary
Ran six review passes (Code Reviewer, Code Simplifier, Comment Analyzer, PR Test Analyzer, Silent Failure Hunter, Type Design Analyzer) against the rig. Existing coderabbitai/greptile threads cover most of the headline bugs; the inline comments below are net-new findings.
Top themes uncovered:
- Silent prep failures —
_prepare_group_envand synthetic-junit writes both fail-soft in ways that letpytestproduce a misleading red/green that does not reflect the actual state of the build. - Subprocess hygiene —
_spawntimeout does not kill the process group, andcoverageinvocations have no timeout. Both can hang or leak processes between groups. - Type invariants —
GroupDefinition.envis mutable inside a frozen dataclass;CriticalPathRegistrysilently last-wins on duplicate ids;CriticalPathStatusallows contradictorystate="covered"with emptycovering_groups_run. - Pre-commit hook regression — uses bare
pythondespitecombine-test-reports.shbeing explicitly fixed forpython3-only environments. - Docstring drift — three docstrings claim guarantees the surrounding code does not actually provide (URL sentinel, x2text wiring, scope-aware re-aggregation in
cmd_report).
Findings overlapping with existing bot threads (e.g. _responds() laxity, smoke < 500, previous-summary.json duplication, cmd_run complexity, JUnit defusedxml hardening) were deliberately dropped to avoid noise.
Low-impact items not posted inline:
runtime.py:67-73x2text_url is declared onPlatformEndpointsbut never exported incli.py:486-490. Either delete the field or wireenv.setdefault("UNSTRACT_X2TEXT_URL", ...).tests/README.md:144-149testcontainers row should warn thatdown()is best-effort and leaked containers can block the next run on port conflict.tests/critical_paths.yaml:22-24perpetual gap TODO has no tracking ticket; consider linking an issue.tox.inipassenv = UNSTRACT_*glob is wide enough to leak adapter API keys into CI logs via compose_run(nocapture_output=True).
No files were modified.
When the rig runs under tox, tox exports VIRTUAL_ENV=.tox/<env>. The rig's per-group `uv run` then warns "VIRTUAL_ENV=.tox/unit does not match the project environment path .venv and will be ignored" before ignoring it anyway — noise in CI logs, and a latent ambiguity since each group is meant to resolve its own workdir's project venv, never tox's. Strip VIRTUAL_ENV (and UV_PROJECT_ENVIRONMENT) from the environment passed to every uv/pytest/coverage subprocess via a shared `_subprocess_env()` in cli.py and `_clean_env()` in coverage.py. uv then does clean per-group project discovery with no warning. Verified: with VIRTUAL_ENV=.tox/unit set, a raw `uv run` emits the warning while `python -m tests.rig run unit-rig` emits zero. 40/40 self-tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical: - B1: XML-escape group name in synthetic JUnit. A group key with "/&/< would produce malformed XML that parse_junit reads as a phantom error on a green hurl run. - B2: pre-commit hook uses `$(command -v python3 || command -v python)` so the opt-in test hook works on stock Ubuntu/containers without a `python` symlink (matches the combine-script fix). Major: - B3: fold junit-reported errors/failures into overall_exit. A truncated junit (errors=1) with pytest exit 0 rendered ❌ but exited green. - B4: coverage subprocesses (combine/xml/html) now run with timeout=300. They run after the per-group loop so per-group timeouts don't apply; a slow `uv run --with` resolve could otherwise hang to the CI ceiling. - B5: --changed-only detects HEAD == base (push-to-main) and falls back to HEAD~1...HEAD with a clear message instead of selecting nothing and exiting 2. - dry-run no longer fails on critical-path gaps/regressions: no groups ran, so every covered path looks uncovered; a dry run is plan-only. - Q1: duplicate critical-path ids now raise at load instead of silent last-wins. - Q2: CriticalPathStatus enforces covered⇒non-empty / gap|regression⇒empty covering_groups_run, making the contradictory state unrepresentable. - Q3: GroupDefinition.env coerced to MappingProxyType in __post_init__ so the frozen record can't be mutated through its dict. - Q5: synthetic-junit write wrapped in try/except OSError, matching exit.txt. Minor: - Q4: reworded _rig_session_id docstring — the sentinel proves the rig ran; URL ownership is intentionally cooperative via setdefault. Tests: 40 → 46. New coverage: synthetic-junit escaping + exit-5, duplicate-id rejection, status-invariant enforcement, frozen env, and cmd_report re-aggregation (previously zero coverage). Deferred to follow-up (replied on threads): _spawn process-group kill on timeout (heavy Popen refactor), cmd_run cognitive-complexity split. Verified: 46/46 self-tests pass; validate OK; end-to-end emits coverage.xml + htmlcov; ruff check/format clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PlatformEndpoints collects x2text_url from UNSTRACT_X2TEXT_URL but the rig never forwarded it to group subprocesses and never probed it for readiness: - _execute_group now setdefaults UNSTRACT_X2TEXT_URL into the group env, so e2e tests hitting x2text use the rig's URL instead of a stale/default value. - _wait_ready now includes x2text /health, so ComposeRuntime.up() waits for it rather than returning early and letting the first x2text call hit ConnectionRefusedError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Unstract test resultsPer-group results
Critical paths
|
| scope_groups = set(ordered) | ||
|
|
||
| reports_dir: Path = args.reports_dir |
There was a problem hiding this comment.
scope_groups is set from ordered before optional-placeholder groups are filtered out into skipped. If a group was previously run green (and thus appears in the baseline), is now marked optional: true, and its paths have since been removed, it will remain in scope_groups but never enter green. On the next CI run that includes this group via dep-expansion, evaluate() sees: path.id in previously_covered AND in_scope AND not covered → state = "regression". Because regressions set overall_exit = 1, --update-baseline is permanently blocked (its guard is overall_exit == 0), creating a stuck state that can only be resolved by manually deleting the cache. Filtering scope_groups to exclude the groups the rig actually intends to run fixes this.
| scope_groups = set(ordered) | |
| reports_dir: Path = args.reports_dir | |
| # scope_groups must exclude optional placeholders that were skipped: | |
| # if a skipped group was previously in the baseline, keeping it in scope | |
| # would classify its covered paths as regressions even though the group | |
| # never ran, blocking --update-baseline and creating a stuck state. | |
| runnable_pre = {n for n in ordered if not _is_missing_placeholder(manifest.get(n))} | |
| scope_groups = runnable_pre | |
| reports_dir: Path = args.reports_dir |
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/rig/cli.py
Line: 345-347
Comment:
`scope_groups` is set from `ordered` **before** optional-placeholder groups are filtered out into `skipped`. If a group was previously run green (and thus appears in the baseline), is now marked `optional: true`, and its paths have since been removed, it will remain in `scope_groups` but never enter `green`. On the next CI run that includes this group via dep-expansion, `evaluate()` sees: `path.id in previously_covered AND in_scope AND not covered` → `state = "regression"`. Because regressions set `overall_exit = 1`, `--update-baseline` is permanently blocked (its guard is `overall_exit == 0`), creating a stuck state that can only be resolved by manually deleting the cache. Filtering `scope_groups` to exclude the groups the rig actually intends to run fixes this.
```suggestion
# scope_groups must exclude optional placeholders that were skipped:
# if a skipped group was previously in the baseline, keeping it in scope
# would classify its covered paths as regressions even though the group
# never ran, blocking --update-baseline and creating a stuck state.
runnable_pre = {n for n in ordered if not _is_missing_placeholder(manifest.get(n))}
scope_groups = runnable_pre
reports_dir: Path = args.reports_dir
```
How can I resolve this? If you propose a fix, please make it concise.

Summary
This is the foundation only — no new tests beyond a single e2e smoke. Existing co-located unit tests stay where they are; the rig just discovers and runs them.
What's new
tests/groups.yaml— single source of truth for 15 groups (9 unit, 1 integration placeholder, 5 e2e) with adepends_ongraph.tests/critical_paths.yaml— 10 critical flows registered, each pointing at the group(s) that cover it.tests/rig/— the dispatcher (python -m tests.rig run|list-groups|expand|validate|platform|report).tests/e2e/—platformsession fixture + one smoke test exercising the e2e plumbing end-to-end.tox.ini— rewritten with tier-shaped envs (unit,integration,e2e,groups,rig); legacyrunner|sdk1|prompt-serviceenvs preserved as thin aliases.ci-test.yamlrewritten to call the rig; newci-test-e2e.yamlslow lane (main +run-e2e-labeled PRs + nightly).unstract-testsfires only when developers opt in by creating.test-selection.Usage
Full docs:
tests/README.md.Reports
reports/<group>/{junit.xml, report.md, exit.txt}reports/{summary.md, summary.json, combined-test-report.md}— samecombined-test-report.mdpath keeps the existing sticky-PR-comment workflow working.--coverage):reports/{coverage.xml, htmlcov/}—coverage.xmlis consumable by the existing SonarCloud integration.Critical paths
CI on
mainruns with--fail-on-critical-gap --update-baseline, so:Test plan
python -m tests.rig validatesucceeds (OK — 15 groups, 10 critical paths).python -m tests.rig list-groupsandexpand <group>print correct dep-expanded ordering.validatecorrectly fails on an injected unknown-group reference incritical_paths.yaml.python -m tests.rig run --no-coverage --no-parallel unit-coreend-to-end: creates per-group venv, installs rig pytest plugins, runs pytest, writesjunit.xml+report.md+exit.txt, and aggregates intosummary.md/summary.jsonwith critical-path gaps section.bash .github/scripts/combine-test-reports.shshim works (delegates topython -m tests.rig report combineand preserves the legacycombined-test-report.mdpath).ci-test.yamlon this PR and confirm the sticky comment renders.run-e2eto smoke-testci-test-e2e.yamlagainst docker compose.Notes for reviewers
unstract/core/tests/(LogHelpernot exported;account_servicessubmodule missing). Those are not rig issues — the rig correctly captured them as errors. Fixing them is out of scope here.unstract/connectors/uv.lockhad unrelated churn (litellm source switch); intentionally left out of this PR.tests/.🤖 Generated with Claude Code