Skip to content

fix(store): surface DELETING during two-phase RD/resource deletion (corner-cases E1/E3/E4)#94

Merged
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
corner/e-deletion-semantics
Jun 4, 2026
Merged

fix(store): surface DELETING during two-phase RD/resource deletion (corner-cases E1/E3/E4)#94
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
corner/e-deletion-semantics

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps Andrei Kvapil (kvaps) commented Jun 3, 2026

Summary

Closes corner-case plan items E1, E3, E4 (deletion semantics) from the LINSTOR corner-case campaign. Verifies the documented LINSTOR deletion rules against blockstor, pins them with L1 + L6 + L7 tests, and fixes the one divergence (E4).

Item Verdict Evidence
E1rd d blocked by snapshots; r d of a single replica is NOT (snapshots survive) MATCHES (already correct, now pinned) rd d guard already present in handleRDDelete with upstream code 514|MASK_ERROR (HTTP 409). r d only drops one Resource and never touches Snapshots. Stand: rejection text Cannot delete resource definition '<rd>' because it has snapshots.; single-replica r d succeeds, snapshot survives.
E3rg delete with live RDs rejected; rd modify --resource-group escape; new group's props inherited retroactively MATCHES (already correct, now pinned) Rejection text Cannot delete resource group '<rg>' because it has existing resource definitions. (code 501|MASK_ERROR). Reassign works; rd lp flips from the old group's marker to the new group's after the reassign — inheritance is a live reference, not a create-time copy (answers the E3 ⚖️ question on the BS side).
E4 — two-phase RD deletion: rd l shows DELETING in the interim; a downed satellite blocks final removal DIVERGED → fixed The k8s store projections dropped metadata.DeletionTimestamp, so a finalizer-blocked RD/Resource showed its normal flags with no CLI-visible signal that a delete was pending. Now withDeletingFlag surfaces the upstream-canonical DELETE flag when DeletionTimestamp is set, so linstor rd l / r l render the State column as DELETING while a held finalizer (e.g. a downed satellite) blocks finalisation.

Changes

  • pkg/api/v1: add ResourceFlagDelete = "DELETE" (upstream-canonical token; distinct from the internal DELETING peer-forget constant).
  • pkg/store/k8s: crdToWireRD / crdToWireResource now stamp DELETE onto the wire Flags when the CRD carries a DeletionTimestamp (E4).
  • L1: withDeletingFlag + projection unit tests; a new REST test pinning that a single-replica r d leaves snapshots intact (E1 second half).
  • L6 cli-matrix: rd-d-blocked-by-snapshot.sh (E1), rg-delete-with-rds-rejected.sh (E3), rd-d-deleting-surface.sh (E4).
  • L7 replay: rd-d-blocked-by-snapshot.yaml (E1), rg-delete-with-rds-rejected.yaml (E3).

Test evidence

go test ./... green; golangci-lint run 0 issues on touched packages.

Stand validation (dev, blockstor v0.1.9 + this branch's YAMLs/cells):

PASS: rd-d-blocked-by-snapshot            (L7 replay, 13 steps)
PASS: rg-delete-with-rds-rejected         (L7 replay, 9 steps)
PASS: rd-d-blocked-by-snapshot.sh         (L6 cell)   — rd d refused, r d allowed + snapshot survives
PASS: rg-delete-with-rds-rejected.sh      (L6 cell)   — rejection + reassign + rd lp flips src→dst

The E4 cli-matrix cell (rd-d-deleting-surface.sh) asserts the new DELETE-flag wire surface and therefore requires this branch's controller build deployed; it is pinned at L1 against the store projection. It safely simulates the "stuck on-node teardown" with a test finalizer on a cce- Resource only — it never stops a satellite.

Notes / leftovers

  • E1/E3 match upstream-documented behavior, so no cli-parity-known-deltas.md row is added; E4 moves toward parity (a fix), not a deliberate delta.
  • Oracle (upstream LINSTOR) was not yet ready during this run; the exact upstream error wording was matched from LINSTOR source and confirmed against blockstor on the stand. The E3 retroactive-inheritance question was answered directly on BS (the rd lp flip).

Oracle (upstream LINSTOR 1.33.2) cross-check

Ran the same E1/E3 sequences against the upstream Java controller on the shared stand (FILE_THIN pool):

  • E1 envelope — byte-identical. Upstream rd d with a snapshot present: exit 10, Cannot delete resource definition 'orc-e-snap-rd' because it has snapshots. Snapshots ARE supported on FILE_THIN upstream; a single-replica r d succeeded and the snapshot survived (Successful in snapshot list) — matching blockstor exactly.
  • E3 envelope — byte-identical. Upstream rg d with a live RD: exit 10, Cannot delete resource group 'orc-e-src' because it has existing resource definitions. Reassign via rd modify --resource-group succeeded and re-stamped the new group's StorPoolName onto the RD (retroactive inheritance confirmed at the effective level).
  • E3 display nuance (pre-existing, not introduced here). Upstream rd lp shows No property map found for this entry. — it does NOT inline inherited RG Aux/* markers at the RD scope. blockstor DOES inline them (Bug 105, deliberate python-CLI parity), which is what lets the L6 cell observe the src→dst marker flip directly. The inheritance itself is live in both; only the rd lp rendering differs, and that divergence is independent of these deletion-semantics items.

Summary by CodeRabbit

  • New Features

    • Two-phase resource deletion now properly surfaced with a DELETE flag to track deletion state.
    • Single replica deletion now succeeds independently of snapshot existence.
    • Snapshots now block resource definition deletion while single replica deletion remains allowed.
    • Resource definitions can now be reassigned to other resource groups to enable deletion.
  • Tests

    • Added comprehensive e2e and operator-harness tests for deletion scenarios and snapshot management.

Andrei Kvapil (kvaps) and others added 3 commits June 4, 2026 01:17
Two-phase RD/Resource deletion (corner-case E4): when a CRD carries a
DeletionTimestamp but its satellite-resource finalizer is still held
(e.g. a downed satellite cannot drain DRBD), the object lingers in the
store. Upstream LINSTOR surfaces this interim as the DELETE flag, which
the CLI renders as DELETING in the State column of `rd l` / `r l`.

The k8s store projections (crdToWireRD / crdToWireResource) previously
dropped DeletionTimestamp on the floor, so a finalizer-blocked object
showed its normal flags and the operator had no CLI-visible signal that
a delete was pending and stuck. Add withDeletingFlag to stamp the
upstream-canonical DELETE token onto the wire Flags slice when
DeletionTimestamp is set.

Also pin the second half of corner-case E1: a single-replica `r d` must
NOT trip the rd-d snapshot guard — it drops one per-node Resource and
leaves the parent RD, surviving replicas, and every Snapshot intact.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
L6 cli-matrix cells:
  - rd-d-blocked-by-snapshot.sh (E1): rd d refused with a snapshot,
    r d of a single replica allowed and snapshots survive, recovery
    flow after the snapshot is dropped.
  - rg-delete-with-rds-rejected.sh (E3): rg d refused with live RDs,
    rd modify --resource-group escape, retroactive inheritance of the
    new group's props (rd lp flips), then the empty group deletes.
  - rd-d-deleting-surface.sh (E4): a finalizer-blocked replica surfaces
    DELETING in r l during the two-phase delete; the held finalizer
    blocks final removal. Simulates the stuck on-node teardown with a
    test finalizer on a cce- Resource only (never stops a satellite).

L7 replay YAMLs (exit-code + convergence contracts):
  - rd-d-blocked-by-snapshot.yaml (E1)
  - rg-delete-with-rds-rejected.yaml (E3)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
golinstor's `resource list -o json` is a doubly-nested array with the
per-replica resource flags under `rsc_flags` (not a flat `flags` key).
Align the DELETING-surface probe with the shape the other cli-matrix
cells already use.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34cda5fa-9680-412c-92fe-8364290fb38c

📥 Commits

Reviewing files that changed from the base of the PR and between 2ddd4de and f0c95a1.

📒 Files selected for processing (10)
  • pkg/api/v1/node.go
  • pkg/rest/resource_definitions_test.go
  • pkg/store/k8s/deleting_flag_internal_test.go
  • pkg/store/k8s/resource_definitions.go
  • pkg/store/k8s/resources.go
  • tests/e2e/cli-matrix/rd-d-blocked-by-snapshot.sh
  • tests/e2e/cli-matrix/rd-d-deleting-surface.sh
  • tests/e2e/cli-matrix/rg-delete-with-rds-rejected.sh
  • tests/operator-harness/replay/rd-d-blocked-by-snapshot.yaml
  • tests/operator-harness/replay/rg-delete-with-rds-rejected.yaml

📝 Walkthrough

Walkthrough

This PR implements a two-phase resource deletion mechanism for Blockstor by introducing a DELETE wire-level flag. The flag is projected from Kubernetes deletion timestamps into both ResourceDefinitions and Resources via a new withDeletingFlag helper. The implementation is validated through unit tests, REST integration tests, and comprehensive e2e CLI and operator-harness test scenarios.

Changes

Two-phase delete implementation and verification

Layer / File(s) Summary
DELETE flag wire-level contract
pkg/api/v1/node.go
Introduces ResourceFlagDelete = "DELETE" constant documenting the wire-level two-phase delete state distinct from the internal deletion handshake token.
K8s deletion timestamp to DELETE flag projection
pkg/store/k8s/resources.go, pkg/store/k8s/resource_definitions.go
Implements withDeletingFlag helper that appends DELETE when DeletionTimestamp is non-nil, integrates into crdToWireRD and crdToWireResource to surface deletion state in wire flags without mutating input slices.
Projection helper and conversion boundary tests
pkg/store/k8s/deleting_flag_internal_test.go
Unit tests verify withDeletingFlag pass-through, append, idempotency, and no-mutation contracts; validates crdToWireRD and crdToWireResource surface DELETE for CRDs with DeletionTimestamp but not for live CRDs.
REST API single-replica deletion test
pkg/rest/resource_definitions_test.go
Integration test confirms per-replica deletion succeeds when resource definition has snapshots, leaving the parent RD, remaining replicas, and RD-scoped snapshots intact.
E2E CLI test: snapshot-based deletion blocking
tests/e2e/cli-matrix/rd-d-blocked-by-snapshot.sh
Validates RD deletion is blocked while snapshot exists, single-replica deletion succeeds despite snapshot, and RD deletion succeeds after snapshot removal; includes polling and cleanup.
E2E CLI test: DELETE flag visibility during two-phase delete
tests/e2e/cli-matrix/rd-d-deleting-surface.sh
Verifies child Resource replicas surface DELETE/DELETING state via flags when RD deletion is triggered even if held by finalizer; confirms state clears once finalizer is removed; uses finalizer-based simulation.
E2E CLI test: resource group deletion with RD reference constraints
tests/e2e/cli-matrix/rg-delete-with-rds-rejected.sh
Validates resource group deletion is blocked while RD references it, reassigning RD unblocks deletion, and property inheritance is retroactively updated after reassignment.
Operator harness replay scenarios
tests/operator-harness/replay/rd-d-blocked-by-snapshot.yaml, tests/operator-harness/replay/rg-delete-with-rds-rejected.yaml
YAML replay test scenarios codifying two-phase delete contracts for snapshots and resource group references at the orchestration level with orphan invariants.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hopped through deletion's dance,
Where flags now surface two-phase chance.
Timestamps projected, finalizers held,
Two-step delete—the story well-told! 🐰✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch corner/e-deletion-semantics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements and tests several deletion semantics and corner cases in blockstor, specifically focusing on two-phase deletion visibility (surfacing the DELETE flag on resources and resource definitions with a DeletionTimestamp), snapshot-blocked resource definition deletion, and resource group deletion constraints. The changes include new helper functions, unit tests, and comprehensive E2E CLI-matrix and replay tests. The review feedback suggests optimizing the shell scripts by replacing subshell-spawning $(date +%s) calls with the Bash built-in $SECONDS variable inside wait loops to improve test execution efficiency in CI environments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +79 to +81
deadline=$(( $(date +%s) + 60 ))
snap_seen=false
while (( $(date +%s) < deadline )); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $(date +%s) in a loop spawns a subshell and forks an external date process on every single iteration. This is highly inefficient and can slow down test execution under resource-constrained CI environments.

Using the Bash built-in $SECONDS variable is much more efficient as it tracks the elapsed time since shell startup entirely in-memory without spawning any external processes.

Suggested change
deadline=$(( $(date +%s) + 60 ))
snap_seen=false
while (( $(date +%s) < deadline )); do
deadline=$(( SECONDS + 60 ))
snap_seen=false
while (( SECONDS < deadline )); do

Comment on lines +160 to +161
deadline=$(( $(date +%s) + 60 ))
while (( $(date +%s) < deadline )); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $(date +%s) in a loop spawns a subshell and forks an external date process on every single iteration. This is highly inefficient and can slow down test execution under resource-constrained CI environments.

Using the Bash built-in $SECONDS variable is much more efficient as it tracks the elapsed time since shell startup entirely in-memory without spawning any external processes.

Suggested change
deadline=$(( $(date +%s) + 60 ))
while (( $(date +%s) < deadline )); do
deadline=$(( SECONDS + 60 ))
while (( SECONDS < deadline )); do

Comment on lines +181 to +182
deadline=$(( $(date +%s) + 120 ))
while (( $(date +%s) < deadline )); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $(date +%s) in a loop spawns a subshell and forks an external date process on every single iteration. This is highly inefficient and can slow down test execution under resource-constrained CI environments.

Using the Bash built-in $SECONDS variable is much more efficient as it tracks the elapsed time since shell startup entirely in-memory without spawning any external processes.

Suggested change
deadline=$(( $(date +%s) + 120 ))
while (( $(date +%s) < deadline )); do
deadline=$(( SECONDS + 120 ))
while (( SECONDS < deadline )); do

Comment on lines +132 to +134
deadline=$(( $(date +%s) + 60 ))
deleting_seen=false
while (( $(date +%s) < deadline )); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $(date +%s) in a loop spawns a subshell and forks an external date process on every single iteration. This is highly inefficient and can slow down test execution under resource-constrained CI environments.

Using the Bash built-in $SECONDS variable is much more efficient as it tracks the elapsed time since shell startup entirely in-memory without spawning any external processes.

Suggested change
deadline=$(( $(date +%s) + 60 ))
deleting_seen=false
while (( $(date +%s) < deadline )); do
deadline=$(( SECONDS + 60 ))
deleting_seen=false
while (( SECONDS < deadline )); do

Comment on lines +180 to +182
deadline=$(( $(date +%s) + 120 ))
cleared=false
while (( $(date +%s) < deadline )); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $(date +%s) in a loop spawns a subshell and forks an external date process on every single iteration. This is highly inefficient and can slow down test execution under resource-constrained CI environments.

Using the Bash built-in $SECONDS variable is much more efficient as it tracks the elapsed time since shell startup entirely in-memory without spawning any external processes.

Suggested change
deadline=$(( $(date +%s) + 120 ))
cleared=false
while (( $(date +%s) < deadline )); do
deadline=$(( SECONDS + 120 ))
cleared=false
while (( SECONDS < deadline )); do

Comment on lines +135 to +137
deadline=$(( $(date +%s) + 30 ))
flipped=false
while (( $(date +%s) < deadline )); do
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using $(date +%s) in a loop spawns a subshell and forks an external date process on every single iteration. This is highly inefficient and can slow down test execution under resource-constrained CI environments.

Using the Bash built-in $SECONDS variable is much more efficient as it tracks the elapsed time since shell startup entirely in-memory without spawning any external processes.

Suggested change
deadline=$(( $(date +%s) + 30 ))
flipped=false
while (( $(date +%s) < deadline )); do
deadline=$(( SECONDS + 30 ))
flipped=false
while (( SECONDS < deadline )); do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant