Skip to content

kvm: fix direct-download live storage migration#13084

Open
Kunalbehbud wants to merge 2 commits intoapache:4.22from
Kunalbehbud:fix-kvm-direct-download-live-migration
Open

kvm: fix direct-download live storage migration#13084
Kunalbehbud wants to merge 2 commits intoapache:4.22from
Kunalbehbud:fix-kvm-direct-download-live-migration

Conversation

@Kunalbehbud
Copy link
Copy Markdown

@Kunalbehbud Kunalbehbud commented Apr 29, 2026

Description

This PR fixes live storage migration for KVM instances whose migrated volumes are backed by direct-download templates.

The problematic path is linked-clone live storage migration. For that mode, libvirt expects the rest of the backing chain to already exist on the destination and to match the source. That assumption is not safe for direct-download backed volumes, where the template may have been bypassed or staged directly on primary storage. When such a volume is part of the actual migration set, this PR forces the KVM migration request to use full-clone storage migration instead.

A few related edge cases are handled in the same path:

  • the full-clone decision ignores volumes that are skipped from the migration request, such as same-pool PowerFlex volumes or volumes rejected by shouldMigrateVolume
  • VolumeDataFactoryImpl now carries the template directDownload flag consistently across its getVolume(...) variants, including volumes whose template was removed later
  • copied template references are updated instead of blindly persisted, and missing source references now fail with a useful error instead of a later NPE
  • ModifyTargetsCommand failures preserve the agent error details and empty connected-path answers fail with a clearer exception

This keeps the change scoped to the KVM storage migration bug. It does not try to mix linked-clone and full-clone per disk, since the current KVM/libvirt migration command chooses the storage migration mode for the VM migration request as a whole.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

N/A. This is a backend KVM storage migration fix.

How Has This Been Tested?

Targeted unit tests:

mvn -pl engine/storage/datamotion,engine/storage/volume -am \
  -Dtest=AncientDataMotionStrategyTest,KvmNonManagedStorageSystemDataMotionTest,StorageSystemDataMotionStrategyTest,VolumeDataFactoryImplTest \
  -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false test

Result: 74 tests run, 0 failures, build success.

Manual verification was also done in a two-host KVM 4.22 test environment:

  • registered a direct-download Rocky Linux template
  • deployed an instance from that template
  • live migrated the instance with storage from host A to host B
  • live migrated it back from host B to host A
  • confirmed the VM stayed running after each migration
  • confirmed management logs showed the expected full-clone fallback for the direct-download backed volume

How did you try to break this feature and the system with this change?

Covered the boundary cases that are most likely to regress this path:

  • mixed migration requests with a direct-download volume and a normal volume
  • direct-download volumes that are skipped and should not force the rest of the request to full clone
  • existing and missing copied template references
  • destination template references whose backing path differs from the source backing path
  • ModifyTargetsCommand returning a generic failed Answer
  • ModifyTargetsAnswer returning no connected paths
  • volumes with no template ID, to avoid unnecessary template lookups

Direct-download backed volumes should not use the linked-clone live storage migration path because the backing chain is not guaranteed on the destination host. Force the KVM migration request to full clone when a migrated volume is direct-download backed, while ignoring volumes that are skipped from the migration request.

Keep direct-download volume metadata consistent across VolumeDataFactory paths and make target-connection/template-reference failures explicit. Add unit coverage for forced full-clone, skipped-volume boundaries, copied template references, ModifyTargets answers, and VolumeDataFactory propagation.
@Kunalbehbud Kunalbehbud force-pushed the fix-kvm-direct-download-live-migration branch from 7f25ad0 to 4b6de36 Compare April 29, 2026 13:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes KVM live storage migration failures when migrating volumes backed by direct-download templates by forcing full-clone storage migration in cases where linked-clone backing chains cannot be assumed to exist on the destination.

Changes:

  • Force full-clone live storage migration when any migrated volume in the request is backed by a direct-download template (while ignoring volumes skipped from the request).
  • Ensure VolumeDataFactoryImpl consistently sets the directDownload flag on VolumeObject via findByIdIncludingRemoved(...).
  • Improve robustness around template reference updates and ModifyTargets/connect-path error handling, with expanded unit test coverage.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
engine/storage/datamotion/src/main/java/org/apache/cloudstack/storage/motion/StorageSystemDataMotionStrategy.java Implements the full-clone fallback decision for direct-download-backed volumes; improves ModifyTargets handling and copied-template reference updates.
engine/storage/volume/src/main/java/org/apache/cloudstack/storage/volume/VolumeDataFactoryImpl.java Centralizes/standardizes setting directDownload on returned VolumeObject instances (including templates removed later).
engine/storage/datamotion/src/test/java/org/apache/cloudstack/storage/motion/StorageSystemDataMotionStrategyTest.java Adds unit tests for the new migration decision logic and updated ModifyTargets/template-reference behaviors.
engine/storage/datamotion/src/test/java/org/apache/cloudstack/storage/motion/KvmNonManagedStorageSystemDataMotionTest.java Adds a unit test covering the “direct-download template already staged” behavior.
engine/storage/volume/src/test/java/org/apache/cloudstack/storage/volume/VolumeDataFactoryImplTest.java Adds unit tests validating directDownload flag propagation across getVolume(...) variants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Map<VolumeInfo, DataStore> volumeMap = new HashMap<>();

configurePoolLookup(skippedDirectDownloadVolume, samePowerFlexStore, 1L, 1L, StoragePoolType.PowerFlex, StoragePoolType.PowerFlex);
configurePoolLookup(regularVolume, destPrimary, 3L, 4L, StoragePoolType.NetworkFilesystem, StoragePoolType.NetworkFilesystem);
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In shouldForceFullCloneMigrationIgnoresSkippedDirectDownloadVolumes, skippedDirectDownloadVolume is never stubbed to return true from isDirectDownload(), so the assertion would still pass even if the skip logic were removed. To actually verify the “ignored direct-download volume” behavior, set Mockito.when(skippedDirectDownloadVolume.isDirectDownload()).thenReturn(true) and keep the expectation false due to the PowerFlex same-pool skip.

Suggested change
configurePoolLookup(regularVolume, destPrimary, 3L, 4L, StoragePoolType.NetworkFilesystem, StoragePoolType.NetworkFilesystem);
configurePoolLookup(regularVolume, destPrimary, 3L, 4L, StoragePoolType.NetworkFilesystem, StoragePoolType.NetworkFilesystem);
Mockito.when(skippedDirectDownloadVolume.isDirectDownload()).thenReturn(true);

Copilot uses AI. Check for mistakes.
Comment on lines +2017 to +2023
protected boolean shouldForceFullCloneMigration(Map<VolumeInfo, DataStore> volumeDataStoreMap, Host destHost) {
for (Map.Entry<VolumeInfo, DataStore> entry : volumeDataStoreMap.entrySet()) {
VolumeInfo srcVolumeInfo = entry.getKey();
DataStore destDataStore = entry.getValue();
StoragePoolVO sourceStoragePool = _storagePoolDao.findById(srcVolumeInfo.getPoolId());
StoragePoolVO destStoragePool = _storagePoolDao.findById(destDataStore.getId());

Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldForceFullCloneMigration performs two _storagePoolDao.findById(...) lookups per volume, and copyAsync immediately repeats those same lookups in the main loop. This doubles DAO calls for every live migration request. Consider computing the “force full clone” decision inside the existing copyAsync loop (before decideMigrationType...) or refactoring to reuse the already-fetched sourceStoragePool/destStoragePool objects to avoid redundant database access.

Copilot uses AI. Check for mistakes.
Avoid repeating storage pool lookups while deciding and preparing KVM live storage migration requests.

Also make the skipped-volume test assert that skipped direct-download volumes do not affect the full-clone decision.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 73.75000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.71%. Comparing base (ffebe8e) to head (20ee0aa).

Files with missing lines Patch % Lines
...torage/motion/StorageSystemDataMotionStrategy.java 74.28% 12 Missing and 6 partials ⚠️
...oudstack/storage/volume/VolumeDataFactoryImpl.java 70.00% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #13084      +/-   ##
============================================
+ Coverage     17.67%   17.71%   +0.03%     
- Complexity    15789    15820      +31     
============================================
  Files          5922     5922              
  Lines        533094   533150      +56     
  Branches      65210    65220      +10     
============================================
+ Hits          94246    94440     +194     
+ Misses       428208   428054     -154     
- Partials      10640    10656      +16     
Flag Coverage Δ
uitests 3.69% <ø> (ø)
unittests 18.79% <73.75%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants