fix(rest): walk RG StoragePoolList in fresh-create pool resolution (Bug 364) #67
…ug 364)

`linstor r c <node> <rd>` without `--storage-pool` against an RG that pins its default via `select_filter.storage_pool_list` (not `select_filter.storage_pool`) created a Resource with empty `Props["StorPoolName"]`. The satellite reconciler then had no pool to bind to and the replica wedged at "Provisioning", visible to the operator only as a phantom replica that never reached UpToDate.

linstor-csi is the canonical caller for this path: it posts no body to the per-node resource-create endpoint and relies on RG-side propagation for the pool name. When the StorageClass sets `linstor.csi.linbit.com/storagePool: <p>`, linstor-csi's RGCreate path lands the value under `SelectFilter.StoragePoolList[0]` (not `.StoragePool`), so every Cozystack volume hits this code path.

Pre-fix `resolveTakeoverStorPool` (the fallback chain `resolveStorPoolForFreshCreate` walks before the satellite-side provisioning starts) only checked `rg.SelectFilter.StoragePool`, ignoring `rg.SelectFilter.StoragePoolList`. The matching `resolveGatePoolName` helper that gates per-pool capacity already tolerated the list tier; this fix brings the takeover resolver in line with the gate's existing semantics.

Extends the helper with the `StoragePoolList[0]` fallback after the single-StoragePool check, mirroring upstream LINSTOR's `CtrlRscCrtApiHelper.resolveStorPoolName` tier ordering and the existing `resolveGatePoolName` walk. ~5 lines of intent.

Unit tests cover the canonical reproducer (REST round-trip through the live handler), direct probes against `resolveTakeoverStorPool` for the StoragePoolList branch, the single-StoragePool precedence over a list, and the "no pool anywhere" empty-string fallthrough.

The new e2e catcher (`tests/e2e/resource-create-pool-resolve.sh`) pins the fix on a live cluster: an RG with only `storage_pool_list` must drive `Props["StorPoolName"]` to the list's first entry and the satellite must converge to UpToDate.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
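
The tier ordering described in the commit message can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `SelectFilter` and `ResourceGroup` are trimmed-down stand-ins for the real types, and `resolvePoolFromRG` is a hypothetical name covering only the RG tiers of the fallback chain (single `StoragePool` first, then `StoragePoolList[0]`, then the empty string).

```go
package main

import "fmt"

// SelectFilter is a hypothetical, reduced stand-in for the RG select filter;
// only the two fields discussed in the PR are modeled.
type SelectFilter struct {
	StoragePool     string
	StoragePoolList []string
}

// ResourceGroup is a hypothetical stand-in carrying only the select filter.
type ResourceGroup struct {
	SelectFilter SelectFilter
}

// resolvePoolFromRG (hypothetical name) walks the RG tiers in order:
// the single StoragePool wins, then the first entry of StoragePoolList,
// and an empty string signals that the RG pins no pool at all.
func resolvePoolFromRG(rg *ResourceGroup) string {
	if rg == nil {
		return ""
	}
	if pool := rg.SelectFilter.StoragePool; pool != "" {
		return pool
	}
	if list := rg.SelectFilter.StoragePoolList; len(list) > 0 {
		return list[0]
	}
	return ""
}

func main() {
	listOnly := &ResourceGroup{SelectFilter: SelectFilter{
		StoragePoolList: []string{"pool-a", "pool-b"},
	}}
	both := &ResourceGroup{SelectFilter: SelectFilter{
		StoragePool:     "single",
		StoragePoolList: []string{"pool-a"},
	}}
	none := &ResourceGroup{}

	fmt.Printf("%q\n", resolvePoolFromRG(listOnly)) // "pool-a"
	fmt.Printf("%q\n", resolvePoolFromRG(both))     // "single"
	fmt.Printf("%q\n", resolvePoolFromRG(none))     // ""
}
```

The three calls in `main` correspond to the unit-test cases named above: the list-only branch, single-pool precedence over a list, and the empty-string fallthrough.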
Code Review
This pull request addresses Bug 364 by updating resolveTakeoverStorPool to fall back to StoragePoolList[0] when StoragePool is empty, ensuring that resources created without an explicit storage pool can correctly resolve their pool name from the Resource Group's defaults. Unit and end-to-end tests have been added to verify this behavior. The review feedback highlights two issues in the new e2e test script: first, the cleanup function terminates the port-forwarding process before sending the deletion requests, preventing proper cleanup; second, the script should fail early with a clear error message if port-forwarding fails to start.
    cleanup() {
      kill "$PF_PID" 2>/dev/null || true
      wait "$PF_PID" 2>/dev/null || true
      curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-definitions/bug364-rd" >/dev/null 2>&1 || true
      curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-groups/bug364-rg" >/dev/null 2>&1 || true
    }
In the cleanup function, kill "$PF_PID" is called before the curl -s -X DELETE commands. This terminates the port-forwarding process, causing the subsequent curl deletion requests to fail with "Connection refused". As a result, the test resources (bug364-rd and bug364-rg) are not cleaned up from the cluster.
To fix this, perform the curl deletion requests before killing the port-forward process.
    cleanup() {
      if [[ -n "${PF_PORT:-}" ]]; then
        curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-definitions/bug364-rd" >/dev/null 2>&1 || true
        curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-groups/bug364-rg" >/dev/null 2>&1 || true
      fi
      if [[ -n "${PF_PID:-}" ]]; then
        kill "$PF_PID" 2>/dev/null || true
        wait "$PF_PID" 2>/dev/null || true
      fi
    }
    for _ in $(seq 1 30); do
      if curl -sf -m1 "http://localhost:$PF_PORT/v1/nodes" >/dev/null 2>&1; then
        break
      fi
      sleep 0.5
    done
If the port-forwarding fails to start or bind, the wait loop will silently exhaust all 30 attempts and the script will continue, only to fail later with a less clear connection error. It is better to explicitly check if the port-forwarding succeeded and fail with a clear message and the port-forward logs if it didn't.
    for i in {1..30}; do
      if curl -sf -m1 "http://localhost:$PF_PORT/v1/nodes" >/dev/null 2>&1; then
        break
      fi
      if [[ $i -eq 30 ]]; then
        echo "FAIL: port-forward failed to start"
        cat /tmp/bug364-pf.log
        exit 1
      fi
      sleep 0.5
    done
Summary
`linstor r c <node> <rd>` without `--storage-pool` against an RG that pins its default via `select_filter.storage_pool_list` (not `select_filter.storage_pool`) created a Resource with empty `Props["StorPoolName"]`. The satellite reconciler then had no pool to bind to and the replica wedged at "Provisioning" forever.

linstor-csi is the canonical caller for this path: it posts no body to the per-node resource-create endpoint and relies on RG-side propagation for the pool name. When the StorageClass sets `linstor.csi.linbit.com/storagePool: <p>`, linstor-csi's RGCreate path lands the value under `SelectFilter.StoragePoolList[0]` (not `.StoragePool`), so every Cozystack volume hits this code path.

Pre-fix `resolveTakeoverStorPool` only checked `rg.SelectFilter.StoragePool`. The matching `resolveGatePoolName` helper already tolerated the list tier; this fix brings the takeover resolver in line. ~5 lines of intent, mirroring upstream LINSTOR's `CtrlRscCrtApiHelper.resolveStorPoolName` tier ordering.

Reproducer
Live dev stand, Round 7 bug-hunt 2026-06-02:
Test plan
- `go test ./pkg/rest/ -timeout 240s` passes
- `tests/e2e/resource-create-pool-resolve.sh` pins the fix on a live cluster (Props stamped + replica converges to UpToDate)