
fix(linstor): verify resource deletion completes; warn if stuck in DELETING #13076

Open

jmsperu wants to merge 1 commit into apache:main from jmsperu:feature/linstor-delete-verify

Conversation

jmsperu (Contributor) commented Apr 28, 2026

Summary

The LINSTOR plugin treats a successful HTTP response from resourceDefinitionDelete as proof the resource is gone and immediately drops the volume from CloudStack's accounting. In practice LINSTOR can return success while the resource lingers in DELETING state — for example when a DRBD peer is unreachable, quorum was lost, or a satellite is down. The plugin has no retry, no verification, and no sweeper. We've found hundreds of stuck DELETING resources accumulated over weeks because nothing surfaces the divergence between the CS view and the LINSTOR view.

What this PR does

Adds two helpers to LinstorUtil:

  • isResourceDefinitionGone(api, rscName) — existence check via resourceDefinitionList
  • waitForResourceDefinitionDeleted(api, rscName, timeoutMillis) — polls every 1s until the resource is gone OR timeout elapses; returns true on confirmed-gone, false on timeout

Calls waitForResourceDefinitionDeleted from both delete paths:

  • LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition (driver / management-server side)
  • LinstorStorageAdaptor.deRefOrDeleteResource (host / KVM agent side)

Default timeout 30 seconds (DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS).

Behaviour

  • Resource deletes within 30s (the normal case) — no log change, no behaviour change.
  • Resource still in DELETING after 30s — emits a WARN naming the resource and pointing the operator at linstor resource list. Returns success to the caller (the CS-side accounting has already moved on; throwing here would create a different inconsistency).
  • Controller transiently unreachable during poll — debug-logs each failed poll, keeps trying until deadline.
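The delete-site wiring described above can be sketched as follows. This is a minimal, self-contained illustration, not the plugin's actual code: "ResourceApi" is a hypothetical stand-in for LINSTOR's DevelopersApi, and the poll interval is made a parameter here purely so the sketch runs quickly; only the helper name and the warn-don't-throw behaviour come from the PR.

```java
// Sketch of the delete-then-verify pattern: poll for confirmed deletion,
// WARN (rather than throw) if the resource is still present at the deadline.
public class DeleteVerifySketch {
    // Hypothetical stand-in for LINSTOR's DevelopersApi.
    interface ResourceApi {
        boolean resourceDefinitionExists(String rscName) throws Exception;
    }

    static final long DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS = 30_000L;

    // Polls until the resource definition is gone or the timeout elapses.
    static boolean waitForResourceDefinitionDeleted(ResourceApi api, String rscName,
                                                    long timeoutMillis, long pollMillis) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            try {
                if (!api.resourceDefinitionExists(rscName)) {
                    return true; // confirmed gone
                }
            } catch (Exception e) {
                // Controller transiently unreachable: keep polling until the deadline.
            }
            if (System.currentTimeMillis() >= deadline) {
                return false; // timed out, still present
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Call-site shape: after a successful delete call, verify; on timeout,
    // surface a WARN but still report success to the caller.
    static String deleteAndVerify(ResourceApi api, String rscName, long timeoutMillis) {
        // ... resourceDefinitionDelete(rscName) assumed to have returned success here ...
        if (!waitForResourceDefinitionDeleted(api, rscName, timeoutMillis, 10L)) {
            return "WARN: " + rscName + " still exists after delete; check 'linstor resource list'";
        }
        return "OK";
    }
}
```

Both outcomes in the behaviour list map onto the two return paths of deleteAndVerify: the normal case returns after the first poll that confirms the resource is gone, and the stuck case logs and moves on.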

What this does NOT do (deferred)

  • Tier-2 sweeper that periodically re-attempts delete on long-stuck resources is not in this PR. The Tier-1 surface here is the minimum needed to make the problem visible in operator logs. A sweeper can land separately if maintainers want it.

Test plan

  • CI build + unit tests pass (no public API changes; only behaviour additions in private methods)
  • Manual: delete a volume on a healthy LINSTOR cluster — no new log lines; the verify step returns as soon as the first poll confirms the resource is gone, so it adds no noticeable wait
  • Manual: delete a volume on a cluster with a downed satellite — observe the new WARN line in the agent log and confirm the resource is still in linstor resource list until manually cleared

fix(linstor): verify resource deletion completes; warn if stuck in DELETING

The LINSTOR plugin treats a successful HTTP response from
resourceDefinitionDelete as proof the resource is gone, then drops
the volume from CloudStack's accounting. In practice LINSTOR can
return success while the resource lingers in DELETING state — for
example when a DRBD peer is unreachable, quorum was lost, or a
satellite is down. The plugin had no retry, no verification, and no
sweeper. Operators have been finding hundreds of stuck DELETING
resources accumulated over weeks because nothing surfaced the
divergence between the CS view and the LINSTOR view.

This change adds two helpers to LinstorUtil:

  isResourceDefinitionGone(api, rscName)
    - quick existence check via resourceDefinitionList

  waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)
    - polls every second until the resource is gone OR timeout elapses
    - returns true on confirmed-gone, false on timeout

and calls waitForResourceDefinitionDeleted from both delete sites
(driver: LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition;
adaptor: LinstorStorageAdaptor.deRefOrDeleteResource) with a 30s
default timeout. On timeout the plugin logs a WARN with the resource
name and a hint pointing at `linstor resource list`. We deliberately
do NOT throw on timeout: the CS-side accounting has already moved
on, and throwing would create a different inconsistency.

This is the minimal Tier-1 fix that surfaces the problem in the
operator's view. A follow-up could add a periodic sweeper that
attempts force-delete on long-stuck DELETING resources.
Copilot AI left a comment
Pull request overview

This PR adds post-delete verification for LINSTOR resource definitions to surface cases where LINSTOR returns success but the resource definition remains stuck in DELETING, improving operational visibility while keeping CloudStack-side deletion semantics unchanged.

Changes:

  • Add isResourceDefinitionGone and waitForResourceDefinitionDeleted helpers (with a 30s default timeout) to poll for confirmed deletion.
  • Invoke the delete-verification helper from both management-server and KVM agent delete paths.
  • Emit WARN logs when the resource definition still exists after the verification timeout.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
plugins/storage/volume/linstor/src/main/java/org/apache/cloudstack/storage/datastore/util/LinstorUtil.java | Adds delete-verification helpers and a default timeout constant.
plugins/storage/volume/linstor/src/main/java/org/apache/cloudstack/storage/datastore/driver/LinstorPrimaryDataStoreDriverImpl.java | Verifies RD deletion after successful API delete; warns if still present.
plugins/storage/volume/linstor/src/main/java/com/cloud/hypervisor/kvm/storage/LinstorStorageAdaptor.java | Verifies RD deletion after successful API delete in agent path; warns if still present.


Comment on lines +432 to +452
public static boolean waitForResourceDefinitionDeleted(DevelopersApi api, String rscName, long timeoutMillis) {
    final long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
        try {
            if (isResourceDefinitionGone(api, rscName)) {
                return true;
            }
        } catch (ApiException e) {
            LOGGER.debug("LINSTOR delete-verify poll failed for {}: {}", rscName, e.getMessage());
            // Keep polling — controller may be transiently unavailable.
        }
        if (System.currentTimeMillis() >= deadline) {
            return false;
        }
        try {
            Thread.sleep(1_000L);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

Copilot AI Apr 28, 2026


New delete-verification behavior in waitForResourceDefinitionDeleted / isResourceDefinitionGone isn’t covered by the existing LinstorUtilTest suite. Please add unit tests for (a) confirmed-gone case, and (b) timeout/exception paths (ideally without introducing real sleeps) so regressions in the polling logic are caught in CI.

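One way to satisfy the reviewer's "no real sleeps" request would be to inject the clock and the sleeper into the polling helper so tests can drive time deterministically. This is a hedged sketch of that refactor, not the PR's actual code; the helper in the PR takes neither a clock nor a sleeper, and all names here are hypothetical.

```java
import java.util.function.LongSupplier;

// Sketch of a clock-injectable variant of the polling helper: unit tests pass a
// fake clock and a fake sleeper, so the timeout path runs in microseconds.
public class PollDeterministicSketch {
    interface ExistenceCheck {
        boolean exists() throws Exception;
    }
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    static boolean waitUntilGone(ExistenceCheck check, long timeoutMillis,
                                 LongSupplier clock, Sleeper sleeper) {
        final long deadline = clock.getAsLong() + timeoutMillis;
        while (true) {
            try {
                if (!check.exists()) {
                    return true; // confirmed gone
                }
            } catch (Exception e) {
                // Transient failure: keep polling until the deadline.
            }
            if (clock.getAsLong() >= deadline) {
                return false; // timed out
            }
            try {
                sleeper.sleep(1_000L);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Production code would pass System::currentTimeMillis and Thread::sleep; a test passes a counter it advances itself, covering both the confirmed-gone and the timeout path without blocking the test runner.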
Comment on lines +418 to +422
List<ResourceDefinition> all = api.resourceDefinitionList(null, false, null, null, null);
if (all == null) {
    return true;
}
return all.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));

Copilot AI Apr 28, 2026


isResourceDefinitionGone currently calls resourceDefinitionList(null, ...), which fetches all resource definitions and then streams to find a match. This can be expensive (and will be repeated every second in waitForResourceDefinitionDeleted). Use the API's name filter instead (e.g., resourceDefinitionList(Collections.singletonList(rscName), ...)) and treat an empty result as "gone" to reduce controller load and payload size.

Suggested change
- List<ResourceDefinition> all = api.resourceDefinitionList(null, false, null, null, null);
- if (all == null) {
-     return true;
- }
- return all.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
+ List<ResourceDefinition> matches = api.resourceDefinitionList(Collections.singletonList(rscName), false, null, null, null);
+ if (matches == null || matches.isEmpty()) {
+     return true;
+ }
+ return matches.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));

/**
* Default per-call timeout for {@link #waitForResourceDefinitionDeleted}. Long enough for a
* healthy LINSTOR controller to finish a normal delete; short enough not to block the calling
* agent thread for too long if the delete is genuinely stuck.

Copilot AI Apr 28, 2026


DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS Javadoc mentions "the calling agent thread", but this constant is also used on the management-server side (e.g., LinstorPrimaryDataStoreDriverImpl). Consider rewording to "calling thread" (or mention both contexts) to keep the comment accurate.

Suggested change
- * agent thread for too long if the delete is genuinely stuck.
+ * thread for too long if the delete is genuinely stuck.

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.02%. Comparing base (6f4445c) to head (21c6110).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
...cloudstack/storage/datastore/util/LinstorUtil.java | 0.00% | 24 Missing ⚠️
.../hypervisor/kvm/storage/LinstorStorageAdaptor.java | 0.00% | 4 Missing ⚠️
...tore/driver/LinstorPrimaryDataStoreDriverImpl.java | 0.00% | 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13076      +/-   ##
============================================
- Coverage     18.02%   18.02%   -0.01%     
+ Complexity    16621    16620       -1     
============================================
  Files          6029     6029              
  Lines        542184   542213      +29     
  Branches      66451    66457       +6     
============================================
- Hits          97740    97737       -3     
- Misses       433428   433460      +32     
  Partials      11016    11016              
Flag Coverage Δ
uitests 3.52% <ø> (ø)
unittests 19.18% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


