DEMO ONLY: wire faultcat-explore into multi-node-tests CI #731
Draft
UtkarshBhatthere wants to merge 9 commits into
⚠️ DO NOT MERGE. This branch deliberately breaks the Tests workflow to
exercise faultcat's new CI-side explore mode end-to-end.
Two demo-only changes, both clearly tagged FAULTCAT-DEMO:
1. Disable the "Add 2 OSDs" step with `if: false`. Downstream steps that
   expect 3 OSDs ("Test 3 osds present", "Test crush rules") then fail
   naturally. This is a real diagnostics_gap shape (state expected by a
   later step was never set up by an earlier step).
2. Append an `if: failure()` step right after the existing "Print logs
   for failure" step that calls the new
   canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode
   composite action, passing hints=microceph,lxd and the
   OPENROUTER_API_KEY repo secret. The action installs pi-coding-agent
   and faultcat, runs `faultcat explore`, and uploads a scrubbed
   evidence bundle as the `faultcat-probe-multi-node` artifact.
Verifying that:
- The composite action installs cleanly on ubuntu-22.04 runners.
- Pi (DeepSeek V3 via OpenRouter) inspects the live failed multi-node
  test environment and emits schema-valid evidence + findings +
  suggestions.
- The scrubber leaves the artifact safe to publish on a public repo.
- The action does not fail the job (fail-on-explore-error=false), so the
  original test failure remains the surfaced one.
Before merging this PR (do not merge for the demo cycle):
- Remove the `if: false` from "Add 2 OSDs" so it runs normally.
- Remove the entire "FAULTCAT-DEMO faultcat explore" step.
- Once the action is stable, a follow-up PR can add the explore step
  back in as a permanent `if: failure()` hook across all integration
  jobs.
Requires `OPENROUTER_API_KEY` to be set as a repo secret in
canonical/microceph.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UtkarshBhatthere
added a commit
to canonical/faultcat
that referenced
this pull request
May 13, 2026
GitHub Actions evaluates ${{ ... }} expressions even when they appear
inside input description fields, and inside a composite action the
`secrets` context is not available — only the caller workflow can read
secrets. The OpenRouter API key input description used
`${{ secrets.OPENROUTER_API_KEY }}` purely as documentation, but GHA
rejected the manifest at workflow setup with:
Unrecognized named-value: 'secrets'. Located at position 1 within
expression: secrets.OPENROUTER_API_KEY
Surfaced by canonical/microceph#731 — the demo PR couldn't load the
action manifest at all, so no faultcat-explore step ever ran.
Fix: reword the description to not include expression syntax. The
runtime use of inputs.openrouter-api-key under env: is unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
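As an illustration of the fix described in this commit message, a composite action manifest might declare the input roughly as follows. This is a minimal sketch, not the actual `action.yml`: only the `openrouter-api-key` input name and the `faultcat explore` command come from the messages in this thread, while the action name, description wording, step name, and the `OPENROUTER_API_KEY` environment variable name inside the action are assumptions.

```yaml
# action.yml (composite action): illustrative sketch, not the real manifest
name: faultcat-explore
description: Run faultcat explore against a failed CI environment
inputs:
  openrouter-api-key:
    # Plain prose only. No expression syntax here, because the secrets
    # context is not available inside a composite action.
    description: OpenRouter API key, passed in by the caller from a repository secret
    required: true
runs:
  using: composite
  steps:
    - name: Run faultcat explore
      shell: bash
      env:
        # Reading the input at runtime is fine; only `secrets` is off-limits.
        OPENROUTER_API_KEY: ${{ inputs.openrouter-api-key }}
      run: faultcat explore
```

The caller workflow is then the only place that touches the secret, by passing `openrouter-api-key: ${{ secrets.OPENROUTER_API_KEY }}` under `with:`.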
UtkarshBhatthere
added a commit
to canonical/faultcat
that referenced
this pull request
May 13, 2026
Discovered in canonical/microceph#731 CI: the composite action ran on
ubuntu-22.04 where the system Python is 3.10.12. faultcat's
pyproject.toml has `requires-python = ">=3.11"`, so the pipx install
step failed with:
Package 'faultcat' requires a different Python: 3.10.12 not in '>=3.11'
Two fixes:
- Add `actions/setup-python@v5` step pinning Python to 3.12 (matches the
  faultcat development environment).
- Pass `--python "$(command -v python)"` to `pipx install` so pipx uses
  the setup-python interpreter rather than the system /usr/bin/python3.
  Without this, pipx defaults to the system Python on ubuntu-22.04 even
  after setup-python has put 3.12 first on PATH.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
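The two fixes could look roughly like the following steps inside the composite action. This is a sketch under stated assumptions: the step names and the `FAULTCAT_SPEC` variable are hypothetical, and installing from a `git+https` URL built from the `faultcat-ref` input is a guess at how the real action resolves the package.

```yaml
# Steps inside the composite action: illustrative sketch only
- name: Set up Python 3.12
  uses: actions/setup-python@v5
  with:
    python-version: "3.12"
- name: Install faultcat with pipx
  shell: bash
  env:
    # Hypothetical install spec; the real action may build it differently.
    FAULTCAT_SPEC: git+https://github.com/canonical/faultcat@${{ inputs.faultcat-ref }}
  run: |
    # Point pipx at the interpreter that setup-python put first on PATH,
    # instead of the system /usr/bin/python3 (3.10.12 on ubuntu-22.04).
    pipx install --python "$(command -v python)" "$FAULTCAT_SPEC"
```

Passing the spec through `env:` rather than interpolating the expression directly into the script keeps the input out of the shell source text.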
The first cycle of canonical/faultcat#8 demo CI surfaced two issues:
- pipx install of faultcat failed because ubuntu-22.04 ships Python
  3.10.12 and faultcat requires >=3.11. Fixed in faultcat 693ae4c
  (setup-python@v5 + pipx --python).
- The composite action's `faultcat-ref` input defaulted to `main`, so
  the runner pip-installed the pre-merge main, not the explore-mode
  branch. Now passing `faultcat-ref: feat/explore-mode` explicitly.
Also tighten the demo feedback loop by disabling every Tests job except
build-microceph (needed for the snap artifact) and multi-node-tests (the
only job exercising the explore action) with
`if: false # FAULTCAT-DEMO ...`. This reduces a >20-job matrix to two so
each retry cycle takes minutes, not hours.
Both demo changes (the disabled jobs and the explore step) are tagged
FAULTCAT-DEMO and must be removed before merging this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
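A rough sketch of what these two changes might look like in the Tests workflow follows. The job name `some-other-tests` and the placeholder step bodies are hypothetical, and the file path is assumed; the `if: false` tag, the `needs: build-microceph` dependency, and the `faultcat-ref` value are taken from the commit messages in this thread.

```yaml
# Tests workflow (excerpt): illustrative sketch, not the actual diff
jobs:
  build-microceph:
    runs-on: ubuntu-22.04
    steps:
      - run: echo "builds the snap artifact"   # placeholder body
  some-other-tests:                            # hypothetical job name
    if: false # FAULTCAT-DEMO: disabled to shorten the demo loop
    runs-on: ubuntu-22.04
    steps:
      - run: echo "never runs during the demo"
  multi-node-tests:
    needs: build-microceph
    runs-on: ubuntu-22.04
    steps:
      # ... existing test steps ...
      - name: FAULTCAT-DEMO faultcat explore
        if: failure()
        uses: canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode
        with:
          hints: microceph,lxd
          openrouter-api-key: ${{ secrets.OPENROUTER_API_KEY }}
          # Install the branch under test instead of the default `main`.
          faultcat-ref: feat/explore-mode
```

A job-level `if: false` marks the job as skipped without touching its steps, which keeps the later revert to a one-line change per job.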
UtkarshBhatthere
added a commit
to canonical/faultcat
that referenced
this pull request
May 13, 2026
… clusters
First live cycle on canonical/microceph#731 produced a useless 199-byte
evidence bundle: Pi made a single round-trip from the explore prompt to
its final YAML answer, with zero tool calls. The "evidence" was nothing
but a paraphrase of the GitHub env vars already given in the prompt.
Three root causes, all addressed here:
1. The microceph and lxd skills did not tell Pi how to reach the actual
   cluster state. In CI, microceph daemons run INSIDE LXD containers
   (`node-wrk0..N`). The lxd skill now leads with that fact and gives
   `lxc exec <name> -- <cmd>` examples for running commands inside
   containers. The microceph skill says the same in its preamble: assume
   containerized, fall back to direct local commands only if `lxc list`
   is empty.
2. The microceph skill picked up cluster-state commands beyond `ceph -s`
   / `health detail`: `ceph osd tree` and `ceph osd pool ls detail` are
   the ones that surface missing OSDs and wrong crush rules, which is
   the exact shape of the microceph#731 demo failure (OSDs not added →
   "Test failure domain scale up" fails because the cluster is still on
   the pre-scaleup default crush rule).
3. The explore contract treated skills as advisory and never required
   any tool calls. Updated to require at least 3 bash commands before
   the final YAML, with explicit instructions to start with a
   situational sweep (`lxc list`, smallest cluster-state command per
   loaded product skill) and then targeted follow-ups. Observations
   without a backing command are now explicitly forbidden, and a
   genuinely-can't-run-tools scenario is required to be classified as
   diagnostics_gap rather than papered over with made-up observations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
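For illustration, the sweep the skills now describe can be written out as a plain workflow step. Pi issues these as tool calls rather than through a step like this, so the step itself is hypothetical. The container name `node-wrk0` and the commands come from the message above; invoking Ceph as `microceph.ceph` inside the container is an assumption about how the snap exposes the CLI.

```yaml
# Hypothetical debugging step mirroring the situational sweep
- name: Situational sweep of the LXD-hosted cluster
  if: failure()
  shell: bash
  run: |
    # Keep going if one probe fails; each result is evidence on its own.
    set +e
    # 1. Which containers exist? (node-wrk0..N in this CI job)
    lxc list
    # 2. Smallest cluster-state command for the microceph skill.
    lxc exec node-wrk0 -- microceph status
    # 3. Targeted follow-ups: missing OSDs and crush rule assignment.
    lxc exec node-wrk0 -- microceph.ceph osd tree
    lxc exec node-wrk0 -- microceph.ceph osd pool ls detail
```

The ordering matters more than the exact commands: confirm where the cluster lives first, then ask the narrowest question that could explain the failing step.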
…dd 2 OSDs'
Previous skip target ("Add 2 OSDs") put the demo failure at minute ~9
of the multi-node-tests job ("Test failure domain scale up"). Moving
the skip to "Setup cluster" fails the job at minute ~5-6 ("Verify
config" or "Add 2 OSDs" depending on which is reached first), which
shortens the demo feedback loop on each iteration.
The failure shape is also more interesting for explore mode:
- Bootstrap succeeds → node-wrk0 has microceph
- Setup cluster is skipped → node-wrk1..3 never join
- `lxc exec node-wrk0 -- microceph status` shows a 1-node cluster
  where 4 were expected
- That is a real diagnostics_gap shape that the lxc/microceph skills
  in canonical/faultcat#8 0a713e1 now explicitly know how to reach.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
multi-node-tests no longer depends on the build-microceph job. The
~3 min snap build is skipped; instead the multi-node-tests job uses
`install_store reef/stable` (the same helper several other jobs use)
to install the published microceph snap into each LXD container.
This trims another ~3 min off each demo iteration. The failure shape
under test ("Setup cluster" skipped → cluster never forms past
node-wrk0) is independent of which microceph version is installed,
so a stored release is equivalent to a freshly-built one for this
demo.
Like the other FAULTCAT-DEMO changes, this must be reverted before
merging:
- Restore `needs: build-microceph` on multi-node-tests.
- Restore the "Download snap" step.
- Restore "Install local microceph snap" → install_multinode.
- Remove the `if: false` on build-microceph.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UtkarshBhatthere
added a commit
to canonical/faultcat
that referenced
this pull request
May 13, 2026
Pi wandered for 17+ min during the live canonical/microceph#731 demo
cycle because the default `_wait_for_yaml_block` budget is 1800s and the
composite action had no `timeout-minutes` on the explore step.
Two-layer ceiling now:
1. `src/faultcat/explore/runner.py` calls
   `structured_call(contract, wait_seconds=270)`: 4.5 min of wall-clock
   for Pi to deliver a fenced YAML block, after which faultcat raises
   FaultcatError.
2. `.github/actions/faultcat-explore/action.yml` adds a new
   `explore-timeout-minutes` input (default `5`) and applies it as
   `timeout-minutes: ${{ fromJSON(inputs.explore-timeout-minutes) }}` on
   the Run-faultcat-explore step. GitHub kills the step at 5 min
   regardless of what picable/Pi do, so a runaway can never burn more
   than that even if the in-process wait somehow fails to fire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UtkarshBhatthere
added a commit
to canonical/faultcat
that referenced
this pull request
May 13, 2026
GitHub Actions does not allow `timeout-minutes` on steps inside a
composite action; the manifest parser rejects it with
"Unexpected value 'timeout-minutes'". Surfaced live on
canonical/microceph#731.
Remove both the step-level setting and the explore-timeout-minutes
input. Callers needing a hard wall-clock ceiling should set
`timeout-minutes:` on the `uses:` step in their own workflow; that
location IS supported.
faultcat itself still enforces a 4.5-minute internal cap via
`structured_call(contract, wait_seconds=270)` in
`src/faultcat/explore/runner.py`, so the primary timeout guardrail is
intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`timeout-minutes` is not allowed inside the composite action manifest (rejected by GHA). The wall-clock cap belongs on the caller's `uses:` step instead, which is supported. 6 minutes = the action's internal 4.5-min ceiling plus install/cleanup overhead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
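The caller-side cap described here might look like the step below. It is a sketch of the shape rather than the actual diff: the `with:` values are the ones named elsewhere in this thread, and the step name is taken from the PR description.

```yaml
# multi-node-tests job (excerpt): illustrative sketch
- name: FAULTCAT-DEMO faultcat explore
  if: failure()
  # Hard wall-clock cap set by the caller, since the composite action
  # cannot set it on its own inner steps.
  # 6 min = the 4.5-min internal ceiling plus install/cleanup overhead.
  timeout-minutes: 6
  uses: canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode
  with:
    hints: microceph,lxd
    openrouter-api-key: ${{ secrets.OPENROUTER_API_KEY }}
    faultcat-ref: feat/explore-mode
```

Keeping the outer cap slightly above the inner one lets faultcat's own timeout fire first and report a FaultcatError, with the GitHub-level kill as a backstop.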
…plore-mode)
Tests new explore-mode prompt: minimum 6 tool calls, fallback ladder
when both `lxc list` and host PATH are empty, CI-context probes, no
early exit on negative observations only.
…e-mode)
Tests goal-driven stopping criterion + validator-enforced curiosity: no
tool-call quota, info-gain heuristic, reject diagnostics_gap without
CI-context probe, require alternatives_ruled_out for moderate/strong.
What this PR does
- Disables the "Add 2 OSDs" step in the `multi-node-tests` job with `if: false`. The downstream "Test 3 osds present" / "Test crush rules" steps then fail naturally on the missing setup. This is a real `diagnostics_gap` shape: state that a later step expects was never set up by an earlier step.
- Adds a new step right after the existing "Print logs for failure" step.
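A minimal sketch of the two changes, assuming the step names quoted in this description. The body of the "Add 2 OSDs" step is a placeholder, and the input names (`hints`, `openrouter-api-key`, `fail-on-explore-error`) are the ones mentioned in this thread rather than a verified copy of the action's interface.

```yaml
# multi-node-tests job steps (excerpt): a sketch, not the actual diff
- name: Add 2 OSDs
  if: false # FAULTCAT-DEMO: skip so the later OSD checks fail
  run: echo "original step body unchanged"   # placeholder
# ... "Test 3 osds present", "Test crush rules", "Print logs for failure" ...
- name: FAULTCAT-DEMO faultcat explore
  if: failure()
  uses: canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode
  with:
    hints: microceph,lxd
    openrouter-api-key: ${{ secrets.OPENROUTER_API_KEY }}
    # Keep the original test failure as the surfaced one.
    fail-on-explore-error: false
```

Because the explore step is gated on `failure()`, it only runs once an earlier step in the job has failed, so a healthy run pays no extra cost.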
The composite action installs `@earendil-works/pi-coding-agent` and faultcat, drives Pi (fast tier, DeepSeek V3 via OpenRouter) to inspect the live failed multi-node test environment read-only, validates the output against the same `evidence_bundle.v1` / `findings.v1` / `suggestions.v1` schemas as faultcat M1, scrubs secrets from the entire output tree, and uploads it as a `faultcat-probe-multi-node` artifact.
What I want to learn from this CI run
- … `ubuntu-22.04` runner used by `multi-node-tests`?
- … `microceph status` / `ceph -s` / `snap.microceph.osd` logs, or does it get distracted?
- … (`microceph`, `lxd`, `_default`) tight enough that Pi stays focused, or does it bloat?
Prerequisite
`OPENROUTER_API_KEY` needs to be set as a repo secret in `canonical/microceph`. The action reads `${{ secrets.OPENROUTER_API_KEY }}`; if the secret is missing, the action will run but Pi will return empty content (no fallback model in this first cut). I have not added that secret; would appreciate you doing so before kicking the workflow, or letting me know and I'll keep the PR in draft until it's in place.
When this PR is no longer needed
After the explore action has been observed working over a few real CI failures, a follow-up PR can add the `if: failure()` step into more (or all) integration test jobs as a permanent hook. This PR's role ends as soon as we have evidence the action behaves correctly.
Test plan
- `FAULTCAT-DEMO faultcat explore` step runs to completion (does not fail the job either way thanks to `fail-on-explore-error: false`)
- `faultcat-probe-multi-node` artifact appears on the workflow run
- … `evidence.yaml`, `findings.yaml`, `suggestions.yaml` whose content reflects the missing-OSDs failure shape
🤖 Generated with Claude Code