Skip to content

fix(snapshot): close corner-case snapshot edges G1-G5 (rollback/restore/autosnapshot/thick-pool)#100

Merged
Andrei Kvapil (kvaps) merged 8 commits into
mainfrom
corner/g-snapshot-edges
Jun 4, 2026
Merged

fix(snapshot): close corner-case snapshot edges G1-G5 (rollback/restore/autosnapshot/thick-pool)#100
Andrei Kvapil (kvaps) merged 8 commits into
mainfrom
corner/g-snapshot-edges

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps Andrei Kvapil (kvaps) commented Jun 4, 2026

Corner-case campaign — PKG-G (snapshot edges)

Closes plan items G1–G5 of the LINSTOR corner-case campaign (snapshot rollback / restore / autosnapshot / thick-pool semantics). Doc references into linstor-administration.adoc s-linstor-snapshots.

Rebased onto current main (#94#98). Delta row renumbered to #67 (docs/cli-parity-known-deltas.md).

Verdict table (with oracle + stand evidence)

Oracle = upstream LINSTOR 1.33.2 (piraeus-server, FILE_THIN pool) on the dev stand, captured on a healthy kernel. BS stand = deployed main image (does NOT yet carry this PR's fixes).

Item Surface Verdict Oracle 1.33.2 evidence BS / test evidence
G1 snapshot rollback while InUse/Primary MATCHES (pinned) Rollback of an Unused snapshot silently auto-creates a safety-snap-* and merges in place ("Resource already deployed as requested" x3 nodes) — destructive footgun confirmed. BS refuses in-place rollback entirely (501); InUse/Primary → 409 first, unknown snap → 404 first. Pinned TestSnapshotRollback*.
G2 snapshot rollback onto snapshot-less node DELIBERATE DELTA (#67) Resurrection footgun confirmed: rollback after a diskful replica was added on a snapshot-less node RE-ADDS the disk mid-rollback — (dev-worker-3) Volume number 0 ... [FILE THIN] created, Added disk on dev-worker-3, Resource adjusted. BS blanket 501 is the safe superset. Whitelisted (delta #67) + pinned TestSnapshotRollbackSnapshotlessNodeReaches501_G2.
G3a restore after r d of source resources MATCHES (pinned) Snapshot survives r d of every source replica (RD shell + snapshot persist). Restore from a surviving snapshot into a new RD works. Pinned L1.
G3b restore INTO an RD that already has VDs DIVERGED → FIXED (wire-shape) Reject envelope ret_code -4611686018409823753 = FAIL_EXISTS_VLM_DFN, A volume definition with the number 0 already exists in resource definition orc-g-r3. Control (empty RD) → ret_code 17563650 success. BS now mirrors the typed FAIL_EXISTS_VLM_DFN (502) envelope naming the offending volume number, pre-checked before any Store mutation. L6 cell PASS on deployed main (reject + target VD untouched), L7 replay PASS.
G4 AutoSnapshot Keep <= 0 DIVERGED → FIXED (controller-cron; no oracle CLI surface) Keep omitted or <= 0 → keep last 10 (was: keep everything). RunEvery + manual-snapshot exclusion already correct. Pinned L1 (stub-clock).
G5 snapshot create on a thick pool DIVERGED → FIXED Reject ret_code -4611686018409823262 = FAIL_SNAPSHOTS_NOT_SUPPORTED, Storage driver LVM does not support snapshots. (captured on a self-provisioned loop-backed thick LVM VG, torn down clean). BS handleSnapshotCreate now refuses when any diskful replica sits in a non-snapshot-capable pool (thick LVM / plain FILE / DISKLESS) with FAIL_SNAPSHOTS_NOT_SUPPORTED (994). Pinned L1; oracle-validated.

Validation classification

Artifact Tier Result on deployed main Class
snap-vd-restore-volume-conflict-rejected.sh L6 PASS (exit 0) — conflict rejected, target VD stays 128 MiB deployable-now
snap-vd-restore-volume-conflict-rejected.yaml L7 replay PASS (exit 0) — reject (exit 10) + control restore succeed deployable-now
snap-create-thick-pool-rejected.sh L6 SKIP (exit 0) — BS satellite only registers auto-discovered VGs, so a loop-backed thick pool cannot be registered via the CLI; cell self-skips by design infra-limited (gate validated at L1 + oracle)
G4 Keep<=0 fallback L1 green L1-only (controller-cron, not CLI-observable)
G5 create-gate L1 green L1 + oracle-validated

Note on G3b classification: the operator-observable contract (reject the clashing restore, leave the target untouched, allow restore into an empty RD) already holds on the deployed main image — so the L6/L7 artifacts PASS rather than reproduce a RED. This PR's net change for G3b is the typed FAIL_EXISTS_VLM_DFN envelope (vs a bare 409) and the pre-Store conflict pre-check, both pinned at L1.

Tests

  • L1: G5 create-gate (thick refused / thin allowed / unresolvable passes), G3b VD-restore conflict + empty-RD control, G3a restore-after-resource-delete, G2 snapshot-less-node 501 pin, G4 Keep=0 and Keep<0 fall back to default 10.
  • L6 cli-matrix: snap-vd-restore-volume-conflict-rejected.sh (G3b, stand PASS), snap-create-thick-pool-rejected.sh (G5, self-provisions a loop-backed thick LVM pool; SKIPs when the satellite won't register it).
  • L7 replay: snap-vd-restore-volume-conflict-rejected.yaml (G3b) — runs against thin pools, stand PASS.

go test ./... green; golangci-lint run 0 new issues (both re-verified post-rebase).

Stand validation outputs (deployed main image)

# L6 G3b
>> [G3b] target cli-matrix-g3b-tgt-conflict VD after rejected restore: ... 128 MiB ... ok
>> [G3b] OK — conflict rejected, target VD untouched
>> [G3b control] OK — VD-restore into empty RD succeeded
PASS: snap-vd-restore-volume-conflict-rejected      (exit 0)

# L7 G3b replay
  -> step: vd-restore-conflict-rejected :: ... (expect_exit 10) OK
  -> step: vd-restore-into-empty-ok :: ... OK
PASS: snap-vd-restore-volume-conflict-rejected       (exit 0)

# L6 G5
SKIP (G5): blockstor refused to register a thick LVM pool:
  no VG named "clim_g5_thickvg" — satellite advertised VGs: "blockstor-lvm"  (exit 0, self-skip)

Notes / leftovers

  • G4 is controller-cron behavior (1-min cadence) — L1-only deterministic stub-clock tests; no replay.
  • G5 L6 cannot drive a thick pool through the BS CLI on this stand: the BS satellite advertises only its auto-discovered blockstor-lvm VG, so storage-pool create lvm <loop-vg> is refused at the discovery check. The G5 gate is therefore validated by the L1 create-gate tests and the upstream-oracle capture (which produced the exact FAIL_SNAPSHOTS_NOT_SUPPORTED string BS mirrors), not by a stand L6 PASS. The cell self-skips rather than false-FAIL.
  • Oracle experiments (prefix orc-g-) and BS test resources (prefix cli-matrix-g*) were all torn down; no leftover RDs, storage pools, VGs, or loop devices on either controller (verified post-run).

Do NOT merge — draft for launcher sequencing.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Warning

Review limit reached

@kvaps, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 17 minutes and 32 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 84cf610a-80e4-4f83-861e-cd50d4c3a208

📥 Commits

Reviewing files that changed from the base of the PR and between a315c12 and b187e2e.

📒 Files selected for processing (11)
  • docs/cli-parity-known-deltas.md
  • internal/controller/autosnapshot_controller.go
  • internal/controller/autosnapshot_controller_test.go
  • pkg/rest/api_call_rc.go
  • pkg/rest/snapshot_restore.go
  • pkg/rest/snapshot_restore_test.go
  • pkg/rest/snapshots.go
  • pkg/rest/snapshots_test.go
  • tests/e2e/cli-matrix/snap-create-thick-pool-rejected.sh
  • tests/e2e/cli-matrix/snap-vd-restore-volume-conflict-rejected.sh
  • tests/operator-harness/replay/snap-vd-restore-volume-conflict-rejected.yaml
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch corner/g-snapshot-edges

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several fixes and enhancements to snapshot and restore behaviors, aligning them with upstream LINSTOR semantics. Specifically, it ensures that an auto-snapshot Keep value of <= 0 falls back to the default of 10, blocks snapshot restores when target volume definitions conflict, and rejects snapshot creation on thick storage pools that do not support copy-on-write. The review feedback suggests adding a defensive nil check for the resolved storage pool in nonSnapshotPoolLocations to prevent a potential nil pointer dereference.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/rest/snapshots.go
continue
}

if !pool.SupportsSnapshot {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Defensive programming check: If pool is a pointer type, it is possible for it to be nil (for example, in mock environments or custom store implementations where Get might return nil, nil). To prevent a potential nil pointer dereference panic, please add a check to ensure pool is not nil before accessing pool.SupportsSnapshot.

		if pool == nil {
			continue
		}

		if !pool.SupportsSnapshot {

@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 4, 2026 13:59
Andrei Kvapil (kvaps) and others added 8 commits June 4, 2026 15:59
…ack deltas (corner G5/G1/G2)

Corner-case campaign items G5, G1, G2 (snapshot edges).

G5: handleSnapshotCreate now refuses a snapshot when any targeted
diskful replica lives in a storage pool whose provider cannot take
copy-on-write snapshots (thick LVM, plain FILE, DISKLESS), mirroring
upstream LINSTOR's controller-side FAIL_SNAPSHOTS_NOT_SUPPORTED (994)
refusal. Pre-fix the REST surface was ungated and blockstor's thick-LVM
provider would silently stage a 25%ORIGIN COW overlay (Bug 245 footgun).
The gate is best-effort: an unresolvable pool passes through.

G1/G2: pin the existing deliberate 501 rollback contract for the
snapshot-less-node edge (upstream >=1.31.2 resurrects deleted replicas
on rollback; blockstor refuses in-place rollback entirely and points at
snapshot-restore-resource). InUse/Primary still 409s first.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… restore-after-rd (corner G3)

Corner-case campaign item G3 (two-phase restore edge cases).

G3a: pin that snapshot restore works after the source RD's resources
were all deleted (r d) — the surviving RD shell + snapshot restore
into a new RD with replicas on the snapshot's recorded nodes. Matches
the upstream guide: restore works 'even when the original resource has
been removed from the nodes where the snapshots were taken'.

G3b: snapshot-restore-volume-definition now refuses up front with a
typed FAIL_EXISTS_VLM_DFN (502) envelope when the target RD already
carries a volume-definition whose number collides with the snapshot's,
naming the offending number — rather than letting the per-VD hydrate
Create surface a bare 409 only after a partial mutation.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ner G4)

Corner-case campaign item G4 (AutoSnapshot semantics).

The upstream administration guide is explicit: 'If AutoSnapshot/Keep is
omitted (or <= 0), LINSTOR will keep the last 10 snapshots by default'
(linstor-administration.adoc ~2467). Pre-fix blockstor treated a Keep
value <= 0 as 'disable cleanup, keep everything', letting auto-snapshots
accumulate unbounded — a silent divergence from upstream.

pruneOldAutoSnapshots now falls back to DefaultAutoSnapshotKeep (10) on
any non-positive Keep instead of returning early. RunEvery cadence,
manual-snapshot exclusion, and the absent-Keep default were already
correct and remain pinned.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… G1/G2)

Add row 56 to the parity accept-list: BS deliberately does not expose
in-place snapshot rollback (501 + actionable text pointing at the safe
snapshot-restore-resource path). zfs rollback / lvconvert --merge are
destructive and upstream >=1.31.2 additionally resurrects deleted
replicas on rollback — both footguns we refuse to make reachable.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…e conflict

L6/L7 for corner-case G3b. The cli-matrix cell drives the operator CLI
end-to-end: source RD + snapshot, target RD with a clashing volume 0,
asserts `snapshot volume-definition restore` is rejected (non-zero) and
the target's own VD is left untouched, then a control restore into a
fresh empty RD succeeds. The L7 replay codifies the same operator
sequence with the reject (exit 10) / accept contract.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…jection

L6 for corner-case G5. The cell self-provisions a throwaway loop-backed
thick LVM pool on worker-1 (the dev stand ships only thin pools),
registers it as a blockstor `LVM` pool, spawns a tiny RD on it, and
asserts `snapshot create` is rejected with a snapshots-not-supported
error and leaves no orphan Snapshot CRD. Skips (exit 0) when a thick
pool cannot be provisioned. No L7 replay: provisioning a thick pool
requires node-level loop+VG setup the CLI-only replay-runner cannot
drive — the L1 gate tests + this L6 cell cover the surface.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Scope refuseSnapshotOnNonSnapshotPool to the resolved snapshot target
set (snap.Nodes) — the same set the offline pre-check uses — so a caller
that snapshots only a snapshot-capable subset of an otherwise
mixed-provider RD is not falsely refused. Empty target set means 'all
diskful replicas' (the default-fan-out case). Adds a subset-scoping
regression test.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Both snapshot L6 cells provision a single-replica diskful source, but
called wait_uptodate (a 2-peer convergence helper that requires a third
positional $peer and trips `set -u` with only two args). Switch to the
single-node wait_disk_state poll. Found running on the dev stand.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) merged commit ab97163 into main Jun 4, 2026
14 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant