Skip to content

fix(rest): walk RG StoragePoolList in fresh-create pool resolution (Bug 364)#67

Merged
Andrei Kvapil (kvaps) merged 1 commit into
mainfrom
fix/bug-364-resource-create-pool-resolve
Jun 2, 2026
Merged

fix(rest): walk RG StoragePoolList in fresh-create pool resolution (Bug 364)#67
Andrei Kvapil (kvaps) merged 1 commit into
mainfrom
fix/bug-364-resource-create-pool-resolve

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps Andrei Kvapil (kvaps) commented Jun 2, 2026

Summary

linstor r c <node> <rd> without --storage-pool against an RG that pins its default via select_filter.storage_pool_list (not select_filter.storage_pool) created a Resource with empty Props["StorPoolName"]. The satellite reconciler then had no pool to bind to and the replica wedged at "Provisioning" forever.

linstor-csi is the canonical caller for this path: it posts no body to the per-node resource-create endpoint and relies on RG-side propagation for the pool name. When the StorageClass sets linstor.csi.linbit.com/storagePool: <p>, linstor-csi's RGCreate path lands the value under SelectFilter.StoragePoolList[0] (not .StoragePool), so every Cozystack volume hits this code path.

Pre-fix resolveTakeoverStorPool only checked rg.SelectFilter.StoragePool. The matching resolveGatePoolName helper already tolerated the list tier; this fix brings the takeover resolver in line. ~5 lines of intent, mirroring upstream LINSTOR's CtrlRscCrtApiHelper.resolveStorPoolName tier ordering.

Reproducer

Live dev stand, Round 7 bug-hunt 2026-06-02:

$ curl -sS -X PUT http://.../v1/resource-groups/testrg \
    -d '{"select_filter":{"storage_pool":"","storage_pool_list":["lvm-thin"]}}'
200

$ linstor r c dev-worker-1 testlist
SUCCESS: resource(s) created on resource-definition: testlist

$ curl -sS http://.../v1/resource-definitions/testlist/resources | jq '.[].props'
null

Test plan

  • Unit tests cover the REST round-trip reproducer, the helper directly, single-StoragePool precedence over a list, and the empty-fallthrough
  • go test ./pkg/rest/ -timeout 240s passes
  • New e2e catcher tests/e2e/resource-create-pool-resolve.sh pins the fix on a live cluster (Props stamped + replica converges to UpToDate)
  • CI green

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed resource provisioning failures that occurred when storage pools were specified using a list instead of a direct selection, which prevented resources from being properly assigned to the appropriate storage pool from the list.
  • Tests

    • Added comprehensive testing including unit tests for different storage pool configuration scenarios and priority rules, plus end-to-end tests validating the fix works correctly on live clusters.

…ug 364)

`linstor r c <node> <rd>` without `--storage-pool` against an RG that
pins its default via `select_filter.storage_pool_list` (not
`select_filter.storage_pool`) created a Resource with empty
`Props["StorPoolName"]`. The satellite reconciler then had no pool to
bind to and the replica wedged at "Provisioning" — visible to the
operator only as a phantom replica that never reached UpToDate.

linstor-csi is the canonical caller for this path: it posts no body
to the per-node resource-create endpoint and relies on RG-side
propagation for the pool name. When the StorageClass sets
`linstor.csi.linbit.com/storagePool: <p>`, linstor-csi's RGCreate
path lands the value under SelectFilter.StoragePoolList[0] (not
.StoragePool), so every Cozystack volume hits this code path.

Pre-fix `resolveTakeoverStorPool` (the fallback chain
`resolveStorPoolForFreshCreate` walks before the satellite-side
provisioning starts) only checked `rg.SelectFilter.StoragePool`,
ignoring `rg.SelectFilter.StoragePoolList`. The matching
`resolveGatePoolName` helper that gates per-pool capacity already
tolerated the list tier; this fix brings the takeover resolver in
line with the gate's existing semantics.

Extends the helper with the `StoragePoolList[0]` fallback after the
single-StoragePool check, mirroring upstream LINSTOR's
CtrlRscCrtApiHelper.resolveStorPoolName tier ordering and the
existing `resolveGatePoolName` walk. ~5 lines of intent.

Unit tests cover the canonical reproducer (REST round-trip through
the live handler), direct probes against `resolveTakeoverStorPool`
for the StoragePoolList branch, the single-StoragePool precedence
over a list, and the "no pool anywhere" empty-string fallthrough.
The new e2e catcher (tests/e2e/resource-create-pool-resolve.sh)
pins the fix on a live cluster: an RG with only storage_pool_list
must drive Props["StorPoolName"] to the list's first entry and the
satellite must converge to UpToDate.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab07396b-2c4e-4c2b-867e-069b8c500bcc

📥 Commits

Reviewing files that changed from the base of the PR and between 6abc0a0 and 45ac0e2.

📒 Files selected for processing (3)
  • pkg/rest/autoplace.go
  • pkg/rest/bug_364_resource_create_pool_resolve_test.go
  • tests/e2e/resource-create-pool-resolve.sh

📝 Walkthrough

Walkthrough

This PR fixes Bug 364 by updating the storage pool fallback chain in resolveTakeoverStorPool. When a ResourceGroup has no SelectFilter.StoragePool but does have SelectFilter.StoragePoolList, the function now returns the first list entry instead of leaving the pool unresolved. Unit tests and an end-to-end shell script validate the behavior across multiple scenarios.

Changes

Bug 364: Storage Pool Resolution from SelectFilter List

Layer / File(s) Summary
Storage pool resolution fallback logic
pkg/rest/autoplace.go
resolveTakeoverStorPool now consults StoragePoolList[0] when SelectFilter.StoragePool is empty, preventing provisioning stalls when pools are configured only via the list.
Unit tests for pool resolution behavior
pkg/rest/bug_364_resource_create_pool_resolve_test.go
Four test cases validate the fix: HTTP resource-create path stamps StorPoolName from StoragePoolList[0], direct helper call returns correct pool, precedence is correct when both fields are set, and empty-string behavior is preserved when neither is configured.
End-to-end cluster test for Bug 364
tests/e2e/resource-create-pool-resolve.sh
Shell test reproduces Bug 364 on a live cluster by seeding a resource group with StoragePoolList only, issuing resource-create with empty body, verifying StorPoolName is stamped to StoragePoolList[0], and polling until replica state reaches UpToDate.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cozystack/blockstor#51: Applies similar fallback-to-StoragePoolList[0] logic in the capacity-gate resolveGatePoolName path for pool-name resolution.

Poem

🐰 A pool by any other list shall flow,
When StoragePool's empty, we now know—
Pluck the first from the list below,
And watch those resources go! 🌊

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bug-364-resource-create-pool-resolve

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses Bug 364 by updating resolveTakeoverStorPool to fall back to StoragePoolList[0] when StoragePool is empty, ensuring that resources created without an explicit storage pool can correctly resolve their pool name from the Resource Group's defaults. Unit and end-to-end tests have been added to verify this behavior. The review feedback highlights two issues in the new e2e test script: first, the cleanup function terminates the port-forwarding process before sending the deletion requests, preventing proper cleanup; second, the script should fail early with a clear error message if port-forwarding fails to start.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +41 to +46
cleanup() {
kill "$PF_PID" 2>/dev/null || true
wait "$PF_PID" 2>/dev/null || true
curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-definitions/bug364-rd" >/dev/null 2>&1 || true
curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-groups/bug364-rg" >/dev/null 2>&1 || true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the cleanup function, kill "$PF_PID" is called before the curl -s -X DELETE commands. This terminates the port-forwarding process, causing the subsequent curl deletion requests to fail with "Connection refused". As a result, the test resources (bug364-rd and bug364-rg) are not cleaned up from the cluster.

To fix this, perform the curl deletion requests before killing the port-forward process.

Suggested change
cleanup() {
kill "$PF_PID" 2>/dev/null || true
wait "$PF_PID" 2>/dev/null || true
curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-definitions/bug364-rd" >/dev/null 2>&1 || true
curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-groups/bug364-rg" >/dev/null 2>&1 || true
}
cleanup() {
if [[ -n "${PF_PORT:-}" ]]; then
curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-definitions/bug364-rd" >/dev/null 2>&1 || true
curl -s -X DELETE "http://localhost:$PF_PORT/v1/resource-groups/bug364-rg" >/dev/null 2>&1 || true
fi
if [[ -n "${PF_PID:-}" ]]; then
kill "$PF_PID" 2>/dev/null || true
wait "$PF_PID" 2>/dev/null || true
fi
}

Comment on lines +49 to +54
for _ in $(seq 1 30); do
if curl -sf -m1 "http://localhost:$PF_PORT/v1/nodes" >/dev/null 2>&1; then
break
fi
sleep 0.5
done
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the port-forwarding fails to start or bind, the wait loop will silently exhaust all 30 attempts and the script will continue, only to fail later with a less clear connection error. It is better to explicitly check if the port-forwarding succeeded and fail with a clear message and the port-forward logs if it didn't.

Suggested change
for _ in $(seq 1 30); do
if curl -sf -m1 "http://localhost:$PF_PORT/v1/nodes" >/dev/null 2>&1; then
break
fi
sleep 0.5
done
for i in {1..30}; do
if curl -sf -m1 "http://localhost:$PF_PORT/v1/nodes" >/dev/null 2>&1; then
break
fi
if [[ $i -eq 30 ]]; then
echo "FAIL: port-forward failed to start"
cat /tmp/bug364-pf.log
exit 1
fi
sleep 0.5
done

@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 2, 2026 11:57
@kvaps Andrei Kvapil (kvaps) merged commit dc8799c into main Jun 2, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant